Should we have `Str.walkUtf8*`? · ideas

Brendan Hansknecht (Dec 28 2023 at 15:28):

A youtube comment made a good point about the string byte walking functions:

Strings purport to reduce emoji bugs by operating on grapheme-clusters instead of codepoints, yet the docs aptly demonstrate that this helps not with emoji at all - and there's a fold over raw UTF-8 bytes too, for some reason (am I expected to re-parse the UTF-8? Is this for sending over the wire? Who knows?)

I assume we added this functionality in order to be able to walk a small string's bytes for performance reasons in a specific case. I don't recall exactly what for (probably hashing?). If this is just for hashing (or some other builtin-specific use), I definitely think we should remove it and just add a custom call in the builtins for it.

Given our focus on correctness around strings, is this the right tradeoff? If a user needs this functionality, they can always do it the explicit way (which sadly costs an extra allocation for small strings): `myStr |> Str.toUtf8 |> List.walk ...`.
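
The two routes being compared can be sketched roughly as follows. This is a Rust stand-in rather than Roc, and `walk_utf8`, `to_utf8_then_walk`, and `hash_step` are hypothetical names chosen for illustration. The analogy is loose: the copy into a `Vec<u8>` plays the part of the extra allocation that `Str.toUtf8` would cost for a small string.

```rust
/// Folds over the UTF-8 bytes of `s` in place, without copying them.
/// Loosely the role `Str.walkUtf8` plays in this discussion.
fn walk_utf8<S>(s: &str, init: S, step: impl Fn(S, u8) -> S) -> S {
    s.bytes().fold(init, step)
}

/// Copies the bytes into a fresh vector first, then folds over that.
/// Loosely `myStr |> Str.toUtf8 |> List.walk`; the copy is the extra allocation.
fn to_utf8_then_walk<S>(s: &str, init: S, step: impl Fn(S, u8) -> S) -> S {
    let bytes: Vec<u8> = s.as_bytes().to_vec();
    bytes.into_iter().fold(init, step)
}

/// Example step function: a toy multiplicative hash over the bytes
/// (an assumption for illustration, not the hash any builtin uses).
fn hash_step(acc: u32, byte: u8) -> u32 {
    acc.wrapping_mul(31).wrapping_add(u32::from(byte))
}
```

Both functions produce the same result for the same step function; the only difference is whether an intermediate byte list is materialized.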

Curious what others think and if I am missing something.

Maybe, given that strings are already UTF-8, this is really just a naming issue for a function that exposes low-level primitives. Maybe it really should be called Str.walkCodeUnits and Str.toCodeUnits. At a minimum, that is the official Unicode name, and it also reads in a way that more clearly signals a potential code smell.

Richard Feldman (Dec 28 2023 at 15:37):

so certain use cases require access to the UTF-8 bytes

Richard Feldman (Dec 28 2023 at 15:37):

e.g. encoding and decoding

Richard Feldman (Dec 28 2023 at 15:38):

so there has to be some way to access those

Brendan Hansknecht (Dec 28 2023 at 15:38):

Access for sure, but walking them as a primitive?

Richard Feldman (Dec 28 2023 at 15:39):

removing that means that parsing small strings always requires a heap allocation

Richard Feldman (Dec 28 2023 at 15:39):

so that's a pretty serious performance downside; I don't think the upside outweighs it

Brendan Hansknecht (Dec 28 2023 at 15:42):

Are parsers written that just parse by a single linear walk of UTF-8 bytes of a string? I am used to recursively pattern matching like we do with a list (not possible on a string).
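
For reference, the recursive style on a byte list looks roughly like this. It is a Rust sketch using slice patterns as a stand-in for Roc list patterns, and `parse_digits` is a hypothetical helper. Unlike a plain fold, the recursion can stop early and hand back the unconsumed rest.

```rust
/// Parses a run of leading ASCII digits by recursive pattern matching on the
/// byte slice, returning the accumulated value and the unconsumed rest.
/// Intended for short inputs: very long digit runs would overflow the u64.
fn parse_digits(bytes: &[u8], acc: u64) -> (u64, &[u8]) {
    match bytes {
        // First byte is a digit: fold it into the accumulator and recurse.
        [digit @ b'0'..=b'9', rest @ ..] => {
            parse_digits(rest, acc * 10 + u64::from(digit - b'0'))
        }
        // Empty input or a non-digit: stop and return what is left.
        _ => (acc, bytes),
    }
}
```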

Richard Feldman (Dec 28 2023 at 15:43):

also, I'm almost done with a language reference on strings - but the short version is that the more time I spend with strings, the more convinced I am that:

- Writing production code that uses the concept of graphemes in any way is almost always a mistake, except for extremely rare niche use cases, and so graphemes should not be in Str
- Writing production code that uses the concept of code points/code units/scalars is almost guaranteed to be a mistake unless you are very specifically doing low-level Unicode encoding/decoding
- Writing a production parser should be done in terms of UTF-8 bytes

Brendan Hansknecht (Dec 28 2023 at 15:44):

Yeah, I definitely tend to agree with that. Though, a utf8 byte is a utf8 code unit.

Richard Feldman (Dec 28 2023 at 15:45):

sure haha

Richard Feldman (Dec 28 2023 at 15:45):

Brendan Hansknecht said:

Are parsers written that just parse by a single linear walk of UTF-8 bytes of a string? I am used to recursively pattern matching like we do with a list (not possible on a string).

hm, that's possible

Richard Feldman (Dec 28 2023 at 15:46):

I guess there's an argument for removing it and then seeing if there's specifically demand for re-adding it

Richard Feldman (Dec 28 2023 at 15:46):

because adding is a nonbreaking change

Richard Feldman (Dec 28 2023 at 15:48):

I could also see an argument for having like Str.utf8 : codec where codec implements EncoderFormatting, DecoderFormatting

Richard Feldman (Dec 28 2023 at 15:48):

and that's the only way to convert a Str to List U8

Richard Feldman (Dec 28 2023 at 15:50):

but I don't like that because:

- encoding can fail, and having to handle the possibility of a string failing to become UTF-8 doesn't make sense and is wasteful
- UTF-8 as a concept is still in Str, so I'm not sure that's actually meaningfully better than Str.toUtf8
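
The asymmetry behind the first point can be sketched in Rust, used here as a stand-in with hypothetical wrapper names `to_utf8` and `from_utf8`: going from a string to bytes has no failure case to handle, while going from bytes to a string does.

```rust
/// String -> bytes: always succeeds, because a `&str` is already valid UTF-8.
fn to_utf8(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

/// Bytes -> string: can fail, because arbitrary bytes need not be valid UTF-8.
fn from_utf8(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes)
}
```

Routing the first direction through a general encoder interface would force callers to handle an error that cannot occur.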

Brendan Hansknecht (Dec 28 2023 at 16:15):

In my mind, Str.toUtf8 (or Str.toCodeUnit, or whatever name) is more OK than Str.walkUtf8.

Str.toUtf8 is admitting the reality of needing to get at the bytes of a string. Then you can use List.walk and friends.

Str.walkUtf8 seems to promote potentially wrong use of a string, where you are probably gonna assume that the bytes are actually ASCII. I would guess that most people who use this function actually want a Unicode.walkCharacters where 12😀👩‍❤️‍👨 is handled correctly as 1, 2, 😀, 👩‍❤️‍👨. Note that 👩‍❤️‍👨 is extra complex: it is woman, ZWJ, heart, ZWJ, man.
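
To put numbers on that example, here is a small Rust sketch (a stand-in for Roc; `EXAMPLE`, `utf8_byte_count`, and `code_point_count` are hypothetical names). The string is spelled with escapes so the exact code points are visible, and it assumes the fully qualified emoji sequence, which also carries a variation selector (U+FE0F) after the heart.

```rust
/// "12", grinning face, then the couple-with-heart sequence:
/// woman, ZWJ, heart, variation selector-16, ZWJ, man.
const EXAMPLE: &str =
    "12\u{1F600}\u{1F469}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F468}";

/// Number of UTF-8 bytes (code units) a byte walk would visit.
fn utf8_byte_count(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values a code point walk would visit.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}
```

A byte walk visits 26 items and a code point walk visits 9, while the user-perceived answer is 4. Getting that last number needs grapheme cluster segmentation, which Rust's standard library does not provide; a separate segmentation library would be required.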

Richard Feldman (Dec 28 2023 at 18:15):

so what I struggle with there is:

1. if production code is walking utf8 for parsing, that makes sense
2. if production code is walking "characters" even in the context of parsing it's almost certainly a mistake
3. if it's just a toy example or an Advent of Code puzzle or something, then does it really matter if they're only handling ASCII?

Brendan Hansknecht (Dec 28 2023 at 18:21):

Can you expand on point 2?

As a simple example, what if you want to write a compiler and need to parse a language that allows emoji variables or similar.

Richard Feldman (Dec 28 2023 at 19:23):

so in that case, if I'm lexing 1 byte at a time, what I want to be doing is:

1. branch on whether the current byte is a special hardcoded character we care about, e.g. `.`
2. if it's not one of those, branch on whether it's over 127 and therefore part of a multibyte sequence
3. parse the multibyte sequence, validating the code point, verifying that it's valid UTF-8, etc.
4. at the end I need to know how many UTF-8 bytes were consumed, so I know how far to advance the parser
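
Those steps can be sketched as a single lexing function. This is a Rust stand-in, not Roc; `Token`, `LexError`, and `lex_one` are hypothetical names, and validation is delegated to the standard library's `std::str::from_utf8` rather than written by hand.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Dot,             // a hardcoded special character
    Ascii(u8),       // any other single-byte character
    CodePoint(char), // a decoded multibyte sequence
}

#[derive(Debug, PartialEq)]
enum LexError {
    Empty,
    InvalidUtf8,
}

/// Lexes one token from the front of `bytes`, returning it together with the
/// number of bytes consumed so the caller knows how far to advance.
fn lex_one(bytes: &[u8]) -> Result<(Token, usize), LexError> {
    let first = *bytes.first().ok_or(LexError::Empty)?;
    // Step 1: special hardcoded characters.
    if first == b'.' {
        return Ok((Token::Dot, 1));
    }
    // Step 2: anything up to 127 is a single-byte (ASCII) character.
    if first <= 127 {
        return Ok((Token::Ascii(first), 1));
    }
    // Step 3: the leading byte announces the length of the sequence.
    let len = match first {
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return Err(LexError::InvalidUtf8), // e.g. a stray continuation byte
    };
    // A truncated sequence at the end of input is also an error.
    let seq = bytes.get(..len).ok_or(LexError::InvalidUtf8)?;
    // The standard library rejects overlong forms, surrogates, and bad lead bytes.
    let text = std::str::from_utf8(seq).map_err(|_| LexError::InvalidUtf8)?;
    let ch = text.chars().next().ok_or(LexError::InvalidUtf8)?;
    // Step 4: report how many bytes were consumed.
    Ok((Token::CodePoint(ch), len))
}
```

Because the function reports bytes consumed, the caller's position stays a byte offset into the original input throughout.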

Richard Feldman (Dec 28 2023 at 19:27):

so there's a theoretical argument for having like a Str.parseFirstCodePoint : List U8 -> Result (List U8, U32) (List U8, [InvalidCodePoint, StrWasEmpty]) which would do all that, but that feels too niche for Str :sweat_smile:

Brendan Hansknecht (Dec 28 2023 at 21:32):

What is the advantage of that over walking full characters?

Richard Feldman (Dec 28 2023 at 21:38):

if there's that very specialized function I mentioned, then either is probably fine

Richard Feldman (Dec 28 2023 at 21:40):

but if you're iterating over only code points, then all your error locations are in terms of indices that don't map directly back to the source bytes

Richard Feldman (Dec 28 2023 at 21:40):

so you can't do things like just slice back into the original bytes to print error context anymore
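
A small Rust sketch of the mismatch (a stand-in for Roc; `locate_question_mark` and `context_before` are hypothetical helpers): once any multibyte character precedes the error, the code point index and the byte offset diverge, and only the byte offset can be used to slice the original input.

```rust
/// Finds the first `?` and reports its position both as a code point index
/// and as a byte offset into the original input.
fn locate_question_mark(src: &str) -> Option<(usize, usize)> {
    src.char_indices()
        .enumerate()
        .find(|(_, (_, ch))| *ch == '?')
        .map(|(char_index, (byte_offset, _))| (char_index, byte_offset))
}

/// Slices the original bytes up to a byte offset to recover error context.
/// A code point index cannot be used here without re-walking the input.
fn context_before(src: &[u8], byte_offset: usize) -> &[u8] {
    &src[..byte_offset]
}
```

For the input `née = ?`, the question mark is code point 6 but byte 7, because `é` occupies two bytes.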

Brendan Hansknecht (Dec 28 2023 at 21:55):

You can't slice a string by indices anyway. So I think fundamentally, you probably should just convert to a List U8 anyway

Richard Feldman (Dec 28 2023 at 21:58):

yeah exactly :big_smile:

Richard Feldman (Dec 28 2023 at 21:59):

so I think the best way to go is to work on the List U8 directly

Brendan Hansknecht (Dec 28 2023 at 22:01):

Exactly. And generally you can load directly into a List U8 from file/network/etc

Brendan Hansknecht (Dec 28 2023 at 22:02):

So my gut feeling is we should remove Str.walkUtf8* and tell people to use a List U8 directly instead

Last updated: Jul 23 2026 at 13:15 UTC