Stream: ideas

Topic: Str.split passing ""


view this post on Zulip Richard Feldman (Dec 04 2022 at 07:13):

this came up elsewhere, and I wanted to start a separate discussion about it!

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:13):

the basic question is: what should happen if I call Str.split on the string "abc" passing "" as the delimiter?

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:14):

we can't make it a compile error, because the second argument could be a variable

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:14):

we could make it a compiler warning if you pass the actual string literal "" (which I do think is a reasonable design)

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:15):

we could make it crash, but I don't like that idea

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:15):

two reasonable options for what it could do are: return a list containing only the original string, or split the string into its grapheme clusters

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:28):

in either case, if you're calling Str.split passing the literal "" as the delimiter, there's another function you could call to get the exact same answer (List.single or Str.graphemes)

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:29):

so one way to answer what it should do is to think about the scenario where you're not passing a string literal, but rather a variable

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:29):

e.g. Str.split str delimiter

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:29):

and then asking what should happen when the variable is "" as opposed to other things

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:30):

when I think about this scenario, I think about a situation where I have a user inputting what delimiter they want to split on (e.g. they enter "," if they want to split up something comma-delimited)

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:31):

in that case, I think the behavior that would be most convenient to me is if Str.split returns the original string - because that way by default, when the delimiter input is "", I'm displaying the entire string without any splitting happening

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:32):

whereas splitting on grapheme cluster boundaries is going to result in a huge number of entries in my output, such that I'd probably write if delimiter != "" to special-case that and have it do something else

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:32):

(which I might end up wanting to do regardless, to be fair!)
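A minimal Python sketch of the "return the original string" option for the user-supplied delimiter scenario described above. The helper name `split_keep_whole` is hypothetical and this is not Roc's implementation; it only illustrates the behavior under discussion.

```python
def split_keep_whole(text: str, delimiter: str) -> list[str]:
    """Split text on delimiter; an empty delimiter yields the whole string."""
    if delimiter == "":
        # Python's own str.split raises ValueError for an empty separator,
        # so the empty case is handled before delegating to it.
        return [text]
    return text.split(delimiter)


print(split_keep_whole("a,b,c", ","))  # ['a', 'b', 'c']
print(split_keep_whole("a,b,c", ""))   # ['a,b,c']
```

With this behavior, a UI that starts with an empty delimiter field shows the unsplit string by default, with no special case at the call site.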

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:33):

another reason for having it not do Str.graphemes is that splitting a string around grapheme cluster boundaries is a pretty involved process which involves Unicode libraries

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:33):

whereas finding a delimiter is just byte matching; there's no need to get Unicode handling logic involved

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:35):

this is potentially relevant for code size, e.g. when building for wasm - in one design, using Str.split at all means we have to compile in a bunch of Unicode handling logic just in case you pass "" as your delimiter, and in the other design we don't have to do that. To be fair, I don't weigh this particularly heavily as a design consideration here, but I did want to at least note it!

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:36):

a reasonable argument for doing Str.graphemes is that it's what other languages do, so it's more familiar

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:37):

well, kinda - most languages don't have grapheme cluster-based APIs, so things get weird...e.g. here's what Ruby does for a compound emoji:

irb(main):007:0> "ab👩‍👩‍👦‍👦cd".split ""
=> ["a", "b", "👩", "‍", "👩", "‍", "👦", "‍", "👦", "c", "d"]

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:44):

Python actually crashes on "" for a delimiter:

ValueError: empty separator
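For reference, a small Python sketch that reproduces the error above and shows the usual alternative, `list(text)`, which splits into code points (not grapheme clusters). The variable names are illustrative only.

```python
text = "abc"

try:
    parts = text.split("")
except ValueError as err:
    # str.split rejects an empty separator instead of picking a behavior.
    parts = None
    message = str(err)
    print(message)  # empty separator

# Iterating a str yields one element per Unicode code point.
code_points = list(text)
print(code_points)  # ['a', 'b', 'c']
```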

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:46):

JavaScript:

"ab👩‍👩‍👦‍👦cd".split("")
['a', 'b', '�', '�', '‍', '�', '�', '‍',  '�', '�', '‍',  '�', '�', 'c', 'd']
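The Ruby and JavaScript outputs differ because they split on different units: Ruby's result has one element per code point, while JavaScript's has one per UTF-16 code unit, so each emoji becomes two lone surrogates. A short Python sketch that counts both for the same string (the emoji is built from escapes so the composition is explicit):

```python
ZWJ = "\u200d"  # zero-width joiner
# woman, woman, boy, boy joined by ZWJ: one grapheme cluster, seven code points
family = ZWJ.join(["\U0001F469", "\U0001F469", "\U0001F466", "\U0001F466"])
text = "ab" + family + "cd"

code_points = list(text)
utf16_units = len(text.encode("utf-16-le")) // 2
utf8_bytes = len(text.encode("utf-8"))

print(len(code_points))  # 11, matching the Ruby output
print(utf16_units)       # 15, matching the JavaScript output
print(utf8_bytes)        # 29
```

A grapheme-aware split would instead give 5 elements (a, b, the family emoji, c, d), which requires Unicode segmentation data that Python's standard library does not provide.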

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:50):

so I'm curious what others think Roc should do in this scenario!

view this post on Zulip Richard Feldman (Dec 04 2022 at 07:58):

personally I like the idea of having it work the way it does today, but if you pass it an empty string literal, we give a warning saying that you should explicitly use either List.single or Str.graphemes instead

view this post on Zulip Kevin Gillette (Dec 04 2022 at 15:40):

Go splits on codepoints (so it's a pretty cheap UTF-8 scan, which of course is just a bit check and, afaik, can be done with SIMD).
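A sketch of the bit check mentioned here, written in Python rather than Go and assuming the input is already valid UTF-8: a byte starts a new code point exactly when its top two bits are not `10`. The function name is hypothetical, and this is not Go's actual implementation.

```python
def split_code_points_utf8(data: bytes) -> list[bytes]:
    """Split valid UTF-8 bytes into one chunk per code point."""
    chunks = []
    start = 0
    for i in range(1, len(data)):
        # Continuation bytes look like 0b10xxxxxx; anything else begins
        # a new code point, so the previous chunk ends here.
        if (data[i] & 0xC0) != 0x80:
            chunks.append(data[start:i])
            start = i
    if data:
        chunks.append(data[start:])
    return chunks


pieces = split_code_points_utf8("a\u012c,".encode("utf-8"))
print([p.decode("utf-8") for p in pieces])  # ['a', 'Ĭ', ',']
```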

I do think it's more... mathematically consistent to split into multiple elements, but whether that's intuitive depends on whether the reader believes every string "contains" the empty string as a substring.

That said, I haven't ever needed to use this capability in an imperative language, and while I see the likelihood of needing it in a functional language as higher, I'd probably prefer to use a function which returns a tuple consisting of the first codepoint/grapheme and the rest of the string, and call that recursively to walk the whole string, or use a List.map equivalent for strings.

Only in a lazily evaluated language like Haskell might I think to actually allocate a throwaway array, though I admittedly don't know how well Roc could optimize away such extra work/allocation
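The "first element plus rest" approach described above could look roughly like this in Python. The names `uncons` and `count_code_points` are hypothetical, it works on code points rather than graphemes, and the walk uses a loop instead of recursion to stay within Python's recursion limit.

```python
from typing import Optional


def uncons(text: str) -> Optional[tuple[str, str]]:
    """Return (first code point, rest of string), or None if text is empty."""
    if not text:
        return None
    return text[0], text[1:]


def count_code_points(text: str) -> int:
    """Walk the string one code point at a time without building a list."""
    count = 0
    step = uncons(text)
    while step is not None:
        _, rest = step
        count += 1
        step = uncons(rest)
    return count


print(uncons("abc"))             # ('a', 'bc')
print(count_code_points("abc"))  # 3
```

In Python each `text[1:]` copies the remainder, so this sketch is quadratic; whether a given language can make the "rest" a cheap slice is exactly the kind of optimization question raised above.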

view this post on Zulip Brendan Hansknecht (Dec 04 2022 at 17:03):

"whereas finding a delimiter is just byte matching; there's no need to get Unicode handling logic involved"

Is this true? Couldn't the second byte of a Unicode grapheme be the same as the byte from ","? So you need to ensure you aren't in a larger Unicode grapheme as opposed to just splitting based on an individual byte value?

view this post on Zulip Richard Feldman (Dec 04 2022 at 17:39):

I don't think so

view this post on Zulip Richard Feldman (Dec 04 2022 at 17:40):

oh

view this post on Zulip Richard Feldman (Dec 04 2022 at 17:41):

hmm, yeah like if there's a modifier applied you probably don't want that to match :thinking:

view this post on Zulip Richard Feldman (Dec 04 2022 at 17:43):

this is a good point and makes me think we probably should look into how cases like that are handled in the current implementation :big_smile:

view this post on Zulip Brendan Hansknecht (Dec 04 2022 at 18:44):

We seem to do the correct thing in my super quick test: Str.split "ĬĬĬĬ" "," returns ["ĬĬĬĬ"] : List Str. Ĭ should be 0x012C while , is 0x2C.
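One way to see why this test passes, assuming the string is stored as UTF-8: U+012C is encoded as the two bytes `c4 ac`, and neither equals `2c`. In UTF-8 every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte cannot appear inside one. The collision would only arise in an encoding such as UTF-16. A Python sketch checking both:

```python
comma = ",".encode("utf-8")           # b'\x2c'
breve_i = "\u012c".encode("utf-8")    # Ĭ, code point U+012C

print(breve_i.hex())                  # c4ac
# Every byte of the multi-byte sequence is >= 0x80, so no ASCII byte matches.
all_high = all(b >= 0x80 for b in breve_i)
comma_inside_utf8 = comma in breve_i

# In UTF-16 (little endian) the same code point is the bytes 2c 01,
# so naive byte matching on that encoding would find a false comma.
utf16 = "\u012c".encode("utf-16-le")
comma_inside_utf16 = b"\x2c" in utf16

# Byte-level splitting of the UTF-8 form leaves the text intact.
whole = ("\u012c" * 4).encode("utf-8")
split_result = whole.split(comma)
print(all_high, comma_inside_utf8, comma_inside_utf16, len(split_result))
```

This only covers the byte-collision concern. A delimiter followed by a combining mark is a separate case that byte matching alone does not address.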


Last updated: Jun 16 2026 at 16:19 UTC