Str.split - How to "split" each character? · beginners

Tobias Steckenborn (Apr 29 2024 at 09:56):

Given Str.split "1234" "" == ["1234"], and not ["1","2","3","4"] as in e.g. JavaScript with "1234".split(""), what's the recommended way to split a string into its individual characters when there is no distinct splitting character?

Luke Boswell (Apr 29 2024 at 10:06):

Luke Boswell (Apr 29 2024 at 10:08):

The short answer is that it really depends on what you mean by "character". If you are just looking to split into raw UTF-8 bytes there is Str.toUtf8 ... but that won't be very helpful if you are dealing with anything other than ASCII.

Tobias Steckenborn (Apr 29 2024 at 10:19):

This came up just now while solving some Advent of Code tasks, so there's not much intention behind it. For this specific task, in JavaScript I would have split the string '()()' into e.g. '["(",")","(",")"]' and then used reduce on it. With Roc requiring a specific split character, I was looking for other ways to do it. Going via UTF-8 bytes doesn't really seem like the most straightforward solution to me, but I will take a look.

Luke Boswell (Apr 29 2024 at 10:20):

For Advent of Code, I always use that and it works great: Str.toUtf8 "1234" == ['1','2','3','4'] :smiley:

Luke Boswell (Apr 29 2024 at 10:21):

» Str.toUtf8 "1234"

[49, 50, 51, 52] : List U8

Luke Boswell (Apr 29 2024 at 10:21):

» Str.toUtf8 "1234" |> List.map \b -> b - '0'

[1, 2, 3, 4] : List U8
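For readers coming from JavaScript, the byte-oriented approach above can be sketched roughly as follows. This is a JavaScript analogue and not Roc code; it assumes ASCII-only input, and the variable names and the parenthesis-counting puzzle input are made up for illustration.

```javascript
// Encode the string to UTF-8 bytes, similar in spirit to Str.toUtf8 in Roc.
const bytes = Array.from(new TextEncoder().encode("1234"));

// Subtract the byte value of "0" (48) to turn ASCII digits into numbers.
const zero = "0".charCodeAt(0);
const digits = bytes.map((b) => b - zero);

// Hypothetical Advent of Code style fold over bytes: "(" is 40, ")" is 41.
const floor = Array.from(new TextEncoder().encode("()()(")).reduce(
  (acc, b) => (b === 40 ? acc + 1 : b === 41 ? acc - 1 : acc),
  0
);

console.log(bytes, digits, floor);
```

Working on bytes avoids any question of what a "character" is, which is fine as long as the input really is ASCII.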

Tobias Steckenborn (Apr 29 2024 at 10:24):

:sweat_smile: Okay. Do you happen to know why Str.split with "" wasn't implemented to behave the way e.g. JavaScript does (meaning split on every character)? Does that relate to the Str.length discussion I've seen above?

Hristo (Apr 29 2024 at 10:25):

Just as Luke has indicated - within the context of that particular use-case, the input seems to be indeed ASCII-only, so Str.toUtf8 would be the idiomatic way to go about this.

The name might not be really the first association that comes to mind, when you think about this particular use-case, but what I usually do in such examples is - I read the signatures of the standard-library functions (just as Luke has suggested) and that's been working quite well for me so far.

Hristo (Apr 29 2024 at 10:31):

@Tobias Steckenborn to me this portion from the comment section of the Str.roc standard-library file does answer those questions.

## * Most often, using `Str` values along with helper functions like [`split`](https://www.roc-lang.org/builtins/Str#split), [`joinWith`](https://www.roc-lang.org/builtins/Str#joinWith), and so on, is the best option.
## * If you are specifically implementing a parser, working in UTF-8 bytes is usually the best option. So functions like [`walkUtf8`](https://www.roc-lang.org/builtins/Str#walkUtf8), [toUtf8](https://www.roc-lang.org/builtins/Str#toUtf8), and so on. (Note that single-quote literals produce number literals, so ASCII-range literals like `'a'` gives an integer literal that works with a UTF-8 `U8`.)
## * If you are implementing a Unicode library like [roc-lang/unicode](https://github.com/roc-lang/unicode), working in terms of code points will be unavoidable. Aside from basic readability considerations like `\u(...)` in string literals, if you have the option to avoid working in terms of code points, it is almost always correct to avoid them.
## * If it seems like a good idea to split a string into "characters" (graphemes), you should definitely stop and reconsider whether this is really the best design. Almost always, doing this is some combination of more error-prone or slower (usually both) than doing something else that does not require taking graphemes into consideration.

Tobias Steckenborn (Apr 29 2024 at 10:34):

Imagining explaining to somebody "Hey, if you've got a specific character on which you want to split, use Str.split; if you don't, you need to identify the right byte representation and use e.g. toUtf8 for ASCII characters" still just feels weird. But that might be due to me being used to how split works in another language. And I agree that it's most likely not a real-world use case.

Hristo (Apr 29 2024 at 10:41):

I recently wrote a similar comment in another thread (please note I'm a Roc beginner, and this could be partly or fully incorrect): my understanding is that the intention behind this kind of language design choice is to empower the user as much as possible to understand the true nature of their use case and whether it is sufficiently disambiguated. Splitting by an empty string is considered an ambiguous use case, because it can be interpreted in multiple ways when the underlying strings are allowed to be more than just ASCII.
In such cases, Roc opts to invite the user to consider the more granular details about the corresponding use case a bit further, as not doing so potentially exposes the user to a risk of "shooting oneself in the foot".

Hristo (Apr 29 2024 at 10:42):

The fact that JavaScript has opted to gloss over some potentially important details, which could then manifest down the line (in hours, days, weeks or even years) as nasty bugs, shouldn't generally be the measuring stick.

Luke Boswell (Apr 29 2024 at 10:47):

» Str.split "1,2,3" ","

["1", "2", "3"] : List Str

Hristo (Apr 29 2024 at 11:42):

Would the API be more intuitive if toUtf8 is renamed to splitToUtf8 (or something similar)?

Brendan Hansknecht (Apr 29 2024 at 15:24):

Just to clarify here, Str.split "1234" "" returning ["1234"] is 100% a limitation of the standard library not supporting Unicode string operations. Only a Unicode-aware package could give a proper answer to splitting all characters/graphemes/codepoints/whatever.

If we were to consider everything a match for the empty string, it would work for ASCII, but return broken strings for more complex characters. For example, Str.split "😧" "" would return strings that are invalid UTF-8.

In JS, split("") splits into UTF-16 code units, which is really unexpected in some cases:
"👩🏻‍❤️‍👨🏽".split("")
-> ['\uD83D', '\uDC69', '\uD83C', '\uDFFB', '‍', '❤', '️', '‍', '\uD83D', '\uDC68', '\uD83C', '\uDFFD']
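The "broken strings" problem can be reproduced with a single emoji. The sketch below is JavaScript, not Roc; it only illustrates why cutting between code units or bytes yields pieces that are not valid text on their own, using standard globals (TextEncoder, TextDecoder) and variable names chosen for illustration.

```javascript
const face = "\u{1F627}"; // 😧, a single code point outside the Basic Multilingual Plane

// split("") cuts between UTF-16 code units, leaving two lone surrogates.
const units = face.split("");

// The UTF-8 encoding is four bytes; no single byte is a valid string by itself.
const bytes = Array.from(new TextEncoder().encode(face));

// Decoding only the first byte yields the replacement character U+FFFD.
const decodedFirstByte = new TextDecoder().decode(Uint8Array.of(bytes[0]));

console.log(units.length, bytes, JSON.stringify(decodedFirstByte));
```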

Brendan Hansknecht (Apr 29 2024 at 15:27):

Richard Feldman (Apr 29 2024 at 15:28):

building on this, if we did want to do that, I'd want to have multiple "split" functions whose names specify what they are splitting into - e.g. splitGraphemes : Str -> List Str, etc.
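To make the "one function per unit" idea concrete, here is a rough JavaScript sketch. The helper names (splitUtf8Bytes, splitCodePoints, splitGraphemes) are hypothetical and are not Roc builtins; the grapheme version assumes a runtime that provides Intl.Segmenter.

```javascript
// Each helper names the unit it splits into, so the caller has to choose one.
const splitUtf8Bytes = (s) => Array.from(new TextEncoder().encode(s));

// The string iterator walks code points, not UTF-16 code units.
const splitCodePoints = (s) => Array.from(s);

// Intl.Segmenter groups code points into user-perceived characters.
const splitGraphemes = (s) =>
  Array.from(
    new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(s),
    (part) => part.segment
  );

const text = "e\u0301"; // "é" written as "e" plus a combining acute accent

console.log(splitUtf8Bytes(text).length); // 3 bytes
console.log(splitCodePoints(text).length); // 2 code points
console.log(splitGraphemes(text).length); // 1 grapheme
```

The same two-code-point input gives three different lengths, which is why a single unqualified "split into characters" has no one right answer.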

Brendan Hansknecht (Apr 29 2024 at 15:30):

One other specific comment. In JavaScript, there is a distinction between passing an empty string, "👩🏻‍❤️‍👨🏽".split(""), and passing nothing, "👩🏻‍❤️‍👨🏽".split(). The API in Roc only has the empty string as an option, so we can't map to both of the behaviors that exist in JS. Roc's result matches the split with no argument rather than the split on an empty string.
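The two JavaScript behaviors being contrasted can be checked with a plain ASCII string; this is a small sketch with illustrative variable names.

```javascript
const input = "1234";

// No separator: the whole string comes back as a single element,
// which is the behavior Str.split "1234" "" matches in Roc.
const noSeparator = input.split();

// Empty-string separator: one element per UTF-16 code unit.
const emptySeparator = input.split("");

// A real separator behaves the same way in both languages.
const onComma = "1,2,3".split(",");

console.log(noSeparator, emptySeparator, onComma);
```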

Richard Feldman (Apr 29 2024 at 15:30):

yeah, I think the real problem is that culturally we think of "splitting on characters" as a fast, straightforward, and unambiguous thing to do, and unfortunately the combination of Unicode and character encodings mean that in reality it's actually none of those things :sweat_smile:

Richard Feldman (Apr 29 2024 at 15:31):

or to put it another way, the normal state of affairs in programming language standard libraries is that everything is super convenient until you start using them with emojis or non-Latin characters, at which point everything catches fire and explodes

Richard Feldman (Apr 29 2024 at 15:32):

and I'd really like to try to change that default with Roc's standard library :big_smile:
