Str.split passing "" · ideas · Zulip Chat Archive

the basic question is: what should happen if I call Str.split on the string "abc" passing "" as the delimiter?

Richard Feldman (Dec 04 2022 at 07:14):

we can't make it a compile error, because the second argument could be a variable

Richard Feldman (Dec 04 2022 at 07:14):

we could make it a compiler warning if you pass the actual string literal "" (which I do think is a reasonable design)

Richard Feldman (Dec 04 2022 at 07:28):

in either case, if you're calling Str.split passing the literal "" as the delimiter, there's another function you could call to get the exact same answer (List.single or Str.graphemes)

Richard Feldman (Dec 04 2022 at 07:29):

so one way to answer what it should do is to think about the scenario where you're not passing a string literal, but rather a variable

Richard Feldman (Dec 04 2022 at 07:29):

and then asking what should happen when the variable is "" as opposed to other things

Richard Feldman (Dec 04 2022 at 07:30):

when I think about this scenario, I think about a situation where I have a user inputting what delimiter they want to split on (e.g. they enter "," if they want to split up something comma-delimited)

Richard Feldman (Dec 04 2022 at 07:31):

in that case, I think the behavior that would be most convenient to me is if Str.split returns the original string - because that way by default, when the delimiter input is "", I'm displaying the entire string without any splitting happening

Richard Feldman (Dec 04 2022 at 07:32):

whereas splitting on grapheme cluster boundaries is going to result in a huge number of entries in my output, such that I'd probably write if delimiter != "" to special-case that and have it do something else
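A minimal sketch of the "return the original string" option, written in Python rather than Roc. The helper name `split_or_whole` is hypothetical. Python's own `str.split` raises `ValueError` for an empty separator, so the guard is needed there in any case.

```python
def split_or_whole(text: str, delimiter: str) -> list[str]:
    """Split text on delimiter; an empty delimiter means "no splitting"."""
    if delimiter == "":
        # Assumed design: hand back the whole string as a single entry,
        # so user-supplied empty input needs no special-casing by the caller.
        return [text]
    return text.split(delimiter)


comma_separated = split_or_whole("a,b,c", ",")
no_delimiter = split_or_whole("a,b,c", "")
```

With this shape, the caller can display `no_delimiter` directly without writing an `if delimiter != ""` check.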

Richard Feldman (Dec 04 2022 at 07:33):

another reason for having it not do Str.graphemes is that splitting a string around grapheme cluster boundaries is a pretty involved process which involves Unicode libraries

Richard Feldman (Dec 04 2022 at 07:33):

whereas finding a delimiter is just byte matching; there's no need to get Unicode handling logic involved
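As a rough illustration of the byte-matching idea, here is a Python sketch that splits raw bytes using only substring search, with no Unicode tables involved. The function name `split_bytes` is hypothetical, and treating an empty delimiter as "no split" is an assumption for the sketch, not a description of Roc's implementation.

```python
def split_bytes(haystack: bytes, delimiter: bytes) -> list[bytes]:
    """Split haystack on every occurrence of delimiter using plain byte search."""
    if not delimiter:
        # Assumption: an empty delimiter never matches, so nothing is split.
        return [haystack]
    parts = []
    start = 0
    while True:
        index = haystack.find(delimiter, start)
        if index == -1:
            # No more matches: the remainder is the last piece.
            parts.append(haystack[start:])
            return parts
        parts.append(haystack[start:index])
        start = index + len(delimiter)


pieces = split_bytes("a,b,,c".encode("utf-8"), b",")
```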

Richard Feldman (Dec 04 2022 at 07:35):

this is potentially relevant for code size, e.g. when building for wasm - in one design, using Str.split at all means we have to compile in a bunch of Unicode handling logic just in case you pass "" as your delimiter, and in the other design we don't have to do that. To be fair, I don't weigh this particularly heavily as a design consideration here, but I did want to at least note it!

Richard Feldman (Dec 04 2022 at 07:36):

a reasonable argument for doing Str.graphemes is that it's what other languages do, so it's more familiar

Richard Feldman (Dec 04 2022 at 07:37):

well, kinda - most languages don't have grapheme cluster-based APIs, so things get weird...e.g. here's what Ruby does for a compound emoji:

irb(main):007:0> "ab👩‍👩‍👦‍👦cd".split ""
=> ["a", "b", "👩", "‍", "👩", "‍", "👦", "‍", "👦", "c", "d"]

Richard Feldman (Dec 04 2022 at 07:46):

"ab👩‍👩‍👦‍👦cd".split("")
['a', 'b', '�', '�', '‍', '�', '�', '‍',  '�', '�', '‍',  '�', '�', 'c', 'd']
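The two outputs above differ because one splits on code points (11 pieces) and the other on UTF-16 code units (15 pieces, with each emoji broken into two lone surrogates). A small Python sketch that counts both for the same string. The emoji is written with escapes so the code points are explicit. Python's standard library has no grapheme segmentation, so the grapheme-level answer is not computed here.

```python
# "ab" + woman, ZWJ, woman, ZWJ, boy, ZWJ, boy + "cd"
family = "ab\U0001F469\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466cd"

# Python strings iterate by code point, like the Ruby result above.
code_points = list(family)

# Each UTF-16 code unit is two bytes; emoji outside the BMP need two units.
utf16_units = len(family.encode("utf-16-le")) // 2

# UTF-8 byte length, for comparison.
utf8_bytes = len(family.encode("utf-8"))

print(len(code_points), utf16_units, utf8_bytes)
```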

Richard Feldman (Dec 04 2022 at 07:58):

personally I like the idea of having it work the way it does today, but if you pass it an empty string literal, we give a warning saying that you should explicitly use either List.single or Str.graphemes instead

Kevin Gillette (Dec 04 2022 at 15:40):

Go splits on code points, which is a pretty cheap UTF-8 scan: finding code point boundaries is just a bit check per byte, and afaik it can be done with SIMD.

I do think it's more... mathematically consistent to split into multiple elements, but whether that's intuitive depends on whether the reader believes every string "contains" the empty string as a substring.

That said, I haven't ever needed to use this capability in an imperative language, and while I see the likelihood of needing it in a functional language as higher, I'd probably prefer to use a function which returns a tuple consisting of the first codepoint/grapheme and the rest of the string, and call that recursively to walk the whole string, or use a List.map equivalent for strings.

Only in a lazily evaluated language like Haskell might I think to actually allocate a throwaway array, though I admittedly don't know how well Roc could optimize away such extra work/allocation
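A sketch of the "first code point plus the rest" idea, using the UTF-8 bit check mentioned above: a byte is a continuation byte exactly when its top two bits are `10`. This is Python, not Roc. The names `split_first` and `walk` are hypothetical, the input is assumed to be valid UTF-8, and a loop stands in for the recursion.

```python
def split_first(data: bytes) -> tuple[bytes, bytes]:
    """Return (first code point, rest) for non-empty, valid UTF-8 bytes."""
    end = 1
    # Continuation bytes have the form 0b10xxxxxx; they belong to the
    # code point that started at index 0, so skip past them.
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[:end], data[end:]


def walk(text: str) -> list[str]:
    """Visit every code point by repeatedly peeling off the first one."""
    rest = text.encode("utf-8")
    seen = []
    while rest:
        first, rest = split_first(rest)
        seen.append(first.decode("utf-8"))
    return seen


visited = walk("a\u012c,")
```

Here `walk` collects the pieces into a list only to make the result visible. A caller that processes each piece as it is peeled off would not need the intermediate list.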

Brendan Hansknecht (Dec 04 2022 at 17:03):

Is it true that finding a delimiter is just byte matching? Couldn't the second byte of a Unicode grapheme be the same as the byte from ","? So you need to ensure you aren't in a larger Unicode grapheme as opposed to just splitting based on an individual byte value?

Richard Feldman (Dec 04 2022 at 17:41):

hmm, yeah like if there's a modifier applied you probably don't want that to match :thinking:
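To make the modifier case concrete, here is a small Python sketch. Python's `str.split` matches at the code point level, so it behaves like byte matching for this purpose. The input is a comma followed by U+0301 COMBINING ACUTE ACCENT, which under the usual grapheme cluster rules attaches to the comma.

```python
import unicodedata

# "a", then a comma carrying a combining acute accent, then "b"
text = "a,\u0301b"

# Code-point-level matching finds the bare comma and splits there,
# leaving the combining mark stranded at the start of the second piece.
parts = text.split(",")

stranded = parts[1][0]
print(parts, unicodedata.name(stranded))
```

Whether that match is desirable is exactly the open question in the thread. The sketch only shows that plain matching does split here.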

Richard Feldman (Dec 04 2022 at 17:43):

this is a good point and makes me think we probably should look into how cases like that are handled in the current implementation :big_smile:

Brendan Hansknecht (Dec 04 2022 at 18:44):

We seem to do the correct thing in my super quick test: Str.split "ĬĬĬĬ" "," returns ["ĬĬĬĬ"] : List Str. Ĭ should be 0x012C while , is 0x2C.
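A quick way to see why this test passes, assuming Roc strings are stored as UTF-8: U+012C is encoded as the two bytes `C4 AC`, so the byte `2C` never appears in it. In UTF-8, every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter cannot match inside one. The overlap would only show up in an encoding such as UTF-16. A Python sketch:

```python
text = "\u012c\u012c\u012c\u012c"  # "ĬĬĬĬ"
encoded = text.encode("utf-8")

# Each Ĭ becomes the two bytes C4 AC; neither is the comma byte 0x2C.
comma_inside = b"," in encoded

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, outside the ASCII range.
all_high = all(byte >= 0x80 for byte in encoded)

# In big-endian UTF-16 the same character is 01 2C, which does contain 0x2C.
utf16_form = "\u012c".encode("utf-16-be")

print(encoded.hex(" "), comma_inside, all_high, utf16_form.hex(" "))
```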

Stream: ideas

Topic: Str.split passing ""
