this came up elsewhere, and I wanted to start a separate discussion about it!
the basic question is: what should happen if I call Str.split on the string "abc" passing "" as the delimiter?
we can't make it a compile error, because the second argument could be a variable
we could make it a compiler warning if you pass the actual string literal "" (which I do think is a reasonable design)
we could make it crash, but I don't like that idea
two reasonable options for what it could do are:
["abc"] (behave like List.single)
["a", "b", "c"] (behave like Str.graphemes - which I just realized needs some docs!)
in either case, if you're calling Str.split passing the literal "" as the delimiter, there's another function you could call to get the exact same answer (List.single or Str.graphemes)
so one way to answer what it should do is to think about the scenario where you're not passing a string literal, but rather a variable
e.g. Str.split str delimiter
and then asking what should happen when the variable is "" as opposed to other things
when I think about this scenario, I think about a situation where I have a user inputting what delimiter they want to split on (e.g. they enter "," if they want to split up something comma-delimited)
in that case, I think the behavior that would be most convenient to me is if Str.split returns the original string - because that way by default, when the delimiter input is "", I'm displaying the entire string without any splitting happening
whereas splitting on grapheme cluster boundaries is going to result in a huge number of entries in my output, such that I'd probably write if delimiter != "" to special-case that and have it do something else
(which I might end up wanting to do regardless, to be fair!)
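to make the two options concrete, here's a rough sketch in Python (not Roc). The names split_keep_whole and split_per_char are made up for illustration, and list(text) splits on Python code points rather than grapheme clusters, so it only approximates Str.graphemes:

```python
def split_keep_whole(text: str, delimiter: str) -> list:
    """Option 1: an empty delimiter returns the whole string as a single element."""
    if delimiter == "":
        # Python's own str.split raises ValueError here, so handle it explicitly.
        return [text]
    return text.split(delimiter)


def split_per_char(text: str, delimiter: str) -> list:
    """Option 2: an empty delimiter splits into code points (an approximation of graphemes)."""
    if delimiter == "":
        return list(text)
    return text.split(delimiter)


user_delimiter = ""  # e.g. the user hasn't typed a delimiter yet
print(split_keep_whole("a,b,c", user_delimiter))  # ['a,b,c']
print(split_per_char("a,b,c", user_delimiter))    # ['a', ',', 'b', ',', 'c']
```

with the first policy the UI shows the untouched string until a delimiter is entered; with the second it shows one entry per character.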
another reason for having it not do Str.graphemes is that splitting a string around grapheme cluster boundaries is a pretty involved process which involves Unicode libraries
whereas finding a delimiter is just byte matching; there's no need to get Unicode handling logic involved
this is potentially relevant for code size, e.g. when building for wasm - in one design, using Str.split at all means we have to compile in a bunch of Unicode handling logic just in case you pass "" as your delimiter, and in the other design we don't have to do that. To be fair, I don't weight this particularly highly as a design consideration here, but I did want to at least note it!
a reasonable argument for doing Str.graphemes is that it's what other languages do, so it's more familiar
well, kinda - most languages don't have grapheme cluster-based APIs, so things get weird...e.g. here's what Ruby does for a compound emoji:
irb(main):007:0> "ab👩👩👦👦cd".split ""
=> ["a", "b", "👩", "", "👩", "", "👦", "", "👦", "c", "d"]
Python actually crashes on "" for a delimiter:
ValueError: empty separator
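for reference, a minimal way to reproduce that in Python (the variable name message is just for illustration):

```python
try:
    "abc".split("")
except ValueError as err:
    # str.split rejects an empty separator instead of picking a behavior.
    message = str(err)
    print(message)  # empty separator
```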
JavaScript:
"ab👩👩👦👦cd".split("")
['a', 'b', '�', '�', '', '�', '�', '', '�', '�', '', '�', '�', 'c', 'd']
so I'm curious what others think Roc should do in this scenario!
personally I like the idea of having it work the way it does today, but if you pass it an empty string literal, we give a warning saying that you should explicitly use either List.single or Str.graphemes instead
Go splits on codepoints (which is a pretty cheap UTF-8 scan: finding a codepoint boundary is just a bit check on each byte, and afaik it can be done with SIMD).
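A rough sketch of that bit check, written in Python rather than Go and assuming the input is already valid UTF-8 (split_code_points is a hypothetical helper, not how Go implements it):

```python
def split_code_points(data: bytes) -> list:
    """Split valid UTF-8 bytes into one chunk per code point."""
    chunks = []
    start = 0
    for i in range(1, len(data)):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a new code point.
        if (data[i] & 0xC0) != 0x80:
            chunks.append(data[start:i])
            start = i
    if data:
        chunks.append(data[start:])
    return chunks


parts = split_code_points("a\u012c,".encode("utf-8"))
print([p.decode("utf-8") for p in parts])  # ['a', 'Ĭ', ',']
```

No Unicode tables are needed for this; the boundary test only looks at the top two bits of each byte.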
I do think it's more... mathematically consistent to split into multiple elements, but whether that's intuitive depends on whether the reader believes every string "contains" the empty string as a substring.
That said, I haven't ever needed to use this capability in an imperative language, and while I see the likelihood of needing it in a functional language as higher, I'd probably prefer to use a function which returns a tuple consisting of the first codepoint/grapheme and the rest of the string, and call that recursively to walk the whole string, or use a List.map equivalent for strings.
Only in a lazily evaluated language like Haskell might I think to actually allocate a throwaway array, though I admittedly don't know how well Roc could optimize away such extra work/allocation
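A minimal sketch of that "first element plus rest" idea in Python. The names uncons and count_commas are hypothetical, it works on code points rather than graphemes, and it uses a loop where a functional language would recurse:

```python
from typing import Optional, Tuple


def uncons(text: str) -> Optional[Tuple[str, str]]:
    """Return (first code point, rest of the string), or None for an empty string."""
    if text == "":
        return None
    return text[0], text[1:]


def count_commas(text: str) -> int:
    """Walk the string one code point at a time without building a list of pieces."""
    total = 0
    step = uncons(text)
    while step is not None:
        head, rest = step
        if head == ",":
            total += 1
        step = uncons(rest)
    return total


print(uncons("abc"))           # ('a', 'bc')
print(count_commas("a,b,,c"))  # 3
```

In Python each text[1:] still copies the remainder, so this only illustrates the shape of the API, not the allocation behavior a Roc version might have.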
> whereas finding a delimiter is just byte matching; there's no need to get Unicode handling logic involved
Is this true? Couldn't the second byte of a Unicode grapheme be the same as the byte from ","? So you need to ensure you aren't in a larger Unicode grapheme as opposed to just splitting based on an individual byte value?
I don't think so
oh
hmm, yeah like if there's a modifier applied you probably don't want that to match :thinking:
this is a good point and makes me think we probably should look into how cases like that are handled in the current implementation :big_smile:
We seem to do the correct thing in my super quick test: Str.split "ĬĬĬĬ" "," returns ["ĬĬĬĬ"] : List Str. Ĭ should be code point U+012C while , is U+002C.
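One way to look at the bytes involved, sketched in Python and assuming the strings are stored as UTF-8 (utf8_bytes is a hypothetical helper). In UTF-8, Ĭ encodes as 0xC4 0xAC, so the byte 0x2C never appears in it; more generally, every byte of a multi-byte sequence is 0x80 or higher, so an ASCII delimiter byte can't match in the middle of another code point:

```python
def utf8_bytes(ch: str) -> list:
    """Return the UTF-8 encoding of a string as a list of integer byte values."""
    return list(ch.encode("utf-8"))


print([hex(b) for b in utf8_bytes("\u012c")])  # ['0xc4', '0xac']
print([hex(b) for b in utf8_bytes(",")])       # ['0x2c']

# Every byte of a non-ASCII character's encoding has the high bit set.
samples = "\u012c\u00e9\u20ac\U0001F469"
high_bit_everywhere = all(b >= 0x80 for ch in samples for b in utf8_bytes(ch))
print(high_bit_everywhere)  # True

# A byte-level split on "," therefore leaves the string in one piece.
pieces = ("\u012c" * 4).encode("utf-8").split(b",")
print(len(pieces))  # 1
```

This only covers the shared-byte worry. A "," followed by a combining mark would still match at the byte level, so the modifier case mentioned above is a separate question.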
Last updated: Jun 16 2026 at 16:19 UTC