Stream: ideas

Topic: Should we have `Str.walkUtf8*`?


view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 15:28):

A youtube comment made a good point about the string byte walking functions:

Strings purport to reduce emoji bugs by operating on grapheme-clusters instead of codepoints, yet the docs aptly demonstrate that this helps not with emoji at all - and there's a fold over raw UTF-8 bytes too, for some reason (am I expected to re-parse the UTF-8? Is this for sending over the wire? Who knows?)

I assume we added this functionality in order to be able to walk the bytes of small strings for performance reasons in a specific case. I don't recall exactly what for (probably hashing?). If this is just for hashing (or some other builtin-specific use), I definitely think we should remove it and just add a custom call in the builtins for it.

Given our focus on correctness around strings, is this the right tradeoff? If a user needs this functionality, they can always do it the explicit way (which sadly costs an extra allocation for small strings): myStr |> Str.toUtf8 |> List.walk ....
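
As a rough illustration of that explicit route, here is a minimal Python sketch (not Roc) of "convert to UTF-8 bytes, then fold over them", using hashing as the guessed use case. The choice of 32-bit FNV-1a and the names `fnv1a_step` and `hash_str` are assumptions made for the example; nothing in this thread says which hash the builtins actually use.

```python
from functools import reduce

FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_step(state: int, byte: int) -> int:
    """One fold step: mix a single UTF-8 code unit into the hash state."""
    return ((state ^ byte) * FNV_PRIME) & 0xFFFFFFFF


def hash_str(text: str) -> int:
    """Hypothetical helper: hash a string by folding over its UTF-8 bytes."""
    # Rough analogue of Str.toUtf8: materialize the UTF-8 bytes.
    # This copy stands in for the extra allocation mentioned above.
    utf8_bytes = text.encode("utf-8")
    # Rough analogue of List.walk: a left fold over the bytes.
    return reduce(fnv1a_step, utf8_bytes, FNV_OFFSET_BASIS)
```

The fold never interprets the bytes as characters, which is fine for hashing but is exactly the kind of use that tends to go wrong when the bytes are assumed to be ASCII.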

Curious what others think and if I am missing something.


Maybe, given that strings are already UTF-8, this is really just a naming issue for functions that expose low-level primitives. Maybe they really should be called Str.walkCodeUnits and Str.toCodeUnits. That is at least the official Unicode name, and it also reads in a way that more clearly signals a potential code smell.

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:37):

so certain use cases require access to the UTF-8 bytes

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:37):

e.g. encoding and decoding

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:38):

so there has to be some way to access those

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 15:38):

Access for sure, but walking them as a primitive?

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:39):

removing that means that parsing small strings always requires a heap allocation

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:39):

so that's a pretty serious performance downside; I don't think the upside outweighs it

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 15:42):

Are parsers written that just parse by a single linear walk of UTF-8 bytes of a string? I am used to recursively pattern matching like we do with a list (not possible on a string).

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:43):

also, I'm almost done with a language reference on strings - but the short version is that the more time I spend with strings, the more convinced I am that:

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 15:44):

Yeah, I definitely tend to agree with that. Though, a utf8 byte is a utf8 code unit.

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:45):

sure haha

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:45):

Brendan Hansknecht said:

Are parsers written that just parse by a single linear walk of UTF-8 bytes of a string? I am used to recursively pattern matching like we do with a list (not possible on a string).

hm, that's possible

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:46):

I guess there's an argument for removing it and then seeing if there's specifically demand for re-adding it

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:46):

because adding is a nonbreaking change

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:48):

I could also see an argument for having like Str.utf8 : codec where codec implements EncoderFormatting, DecoderFormatting

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:48):

and that's the only way to convert a Str to List U8

view this post on Zulip Richard Feldman (Dec 28 2023 at 15:50):

but I don't like that because:

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 16:15):

In my mind, I see Str.toUtf8 (or Str.toCodeUnit, whatever the name) as more OK than Str.walkUtf8.

Str.toUtf8 is admitting the reality of needing to get at the bytes of a string. Then you can use List.walk and friends.

Str.walkUtf8 seems to promote potentially wrong use of a string where you probably are gonna assume that the bytes are actually ascii. I would guess that most people that use this function actually want a Unicode.walkCharacters where 12😀👩‍❤️‍👨 is handled correctly as 1, 2, 😀, 👩‍❤️‍👨. Note 👩‍❤️‍👨 is extra complex, it is woman, zwj, heart, zwj, man
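
To make the gap between bytes, code points, and user-perceived characters concrete, here is a small Python sketch (not Roc) that counts the first two for the example string. It assumes the couple emoji is encoded with the variation selector U+FE0F after the heart, which is the usual encoding; the helper name `count_units` is made up for the example. Python's standard library has no grapheme cluster segmentation, so the count of four perceived characters is not computed here.

```python
# Built from escapes so the exact code points are unambiguous.
WOMAN = "\U0001F469"
ZWJ = "\u200D"        # zero width joiner
HEART = "\u2764"
VS16 = "\uFE0F"       # variation selector-16 (assumed present, see above)
MAN = "\U0001F468"
GRINNING = "\U0001F600"

COUPLE = WOMAN + ZWJ + HEART + VS16 + ZWJ + MAN
TEXT = "12" + GRINNING + COUPLE


def count_units(text: str) -> tuple[int, int]:
    """Return (number of UTF-8 bytes, number of code points)."""
    return len(text.encode("utf-8")), len(text)


utf8_bytes, code_points = count_units(TEXT)
print(utf8_bytes, code_points)  # prints "26 9"
```

A reader sees four characters, a code point walk sees nine items, and a byte walk sees twenty-six, so a walk at either lower level has to reassemble sequences before it can treat them as characters.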

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:15):

so what I struggle with there is:

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 18:21):

Can you expand on point 2?

As a simple example, what if you want to write a compiler and need to parse a language that allows emoji variables or similar.

view this post on Zulip Richard Feldman (Dec 28 2023 at 19:23):

so in that case, if I'm lexing 1 byte at a time, what I want to be doing is:

view this post on Zulip Richard Feldman (Dec 28 2023 at 19:27):

so there's a theoretical argument for having like a Str.parseFirstCodePoint : List U8 -> Result (List U8, U32) (List U8, [InvalidCodePoint, StrWasEmpty]) which would do all that, but that feels too niche for Str :sweat_smile:
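
For reference, a function of that shape could look roughly like the following Python sketch (not Roc). The tuple encoding of the result and the reuse of the `InvalidCodePoint` and `StrWasEmpty` tags as plain strings are assumptions made for the example; the thread only gives the proposed type signature, not an implementation.

```python
def parse_first_code_point(data: bytes):
    """Decode one UTF-8 code point from the front of `data`.

    Returns ("ok", remaining_bytes, code_point) on success, or
    ("err", original_bytes, reason) on failure.
    """
    if not data:
        return ("err", data, "StrWasEmpty")

    invalid = ("err", data, "InvalidCodePoint")
    first = data[0]

    if first < 0x80:
        # Single-byte (ASCII) sequence.
        return ("ok", data[1:], first)
    if 0xC2 <= first <= 0xDF:
        length, code_point, minimum = 2, first & 0x1F, 0x80
    elif 0xE0 <= first <= 0xEF:
        length, code_point, minimum = 3, first & 0x0F, 0x800
    elif 0xF0 <= first <= 0xF4:
        length, code_point, minimum = 4, first & 0x07, 0x10000
    else:
        # Stray continuation byte, or a lead byte that is never valid.
        return invalid

    if len(data) < length:
        return invalid  # truncated sequence

    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            return invalid  # not a continuation byte
        code_point = (code_point << 6) | (byte & 0x3F)

    # Reject overlong encodings, surrogates, and out-of-range values.
    if code_point < minimum or code_point > 0x10FFFF:
        return invalid
    if 0xD800 <= code_point <= 0xDFFF:
        return invalid

    return ("ok", data[length:], code_point)
```

Because the remaining bytes come back alongside the code point, a lexer can keep working in byte offsets while still handling non-ASCII input one code point at a time.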

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 21:32):

What is the advantage of that over walking full characters?

view this post on Zulip Richard Feldman (Dec 28 2023 at 21:38):

if there's that very specialized function I mentioned, then either is probably fine

view this post on Zulip Richard Feldman (Dec 28 2023 at 21:40):

but if you're iterating over only code points, then all your error locations are in terms of indices that don't map directly back to the source bytes

view this post on Zulip Richard Feldman (Dec 28 2023 at 21:40):

so you can't do things like just slice back into the original bytes to print error context anymore
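
A small Python sketch (not Roc) of the mismatch being described: the same error position has a different index depending on whether it is counted in code points or in bytes, and only the byte offset can be used to slice the original input. The helper name `line_at` and the sample source text are made up for the example.

```python
def line_at(source: bytes, byte_offset: int) -> bytes:
    """Return the line of `source` that contains `byte_offset`.

    Splitting on the newline byte is safe in UTF-8 because 0x0A never
    occurs inside a multi-byte sequence.
    """
    start = source.rfind(b"\n", 0, byte_offset) + 1  # 0 when no newline precedes
    end = source.find(b"\n", byte_offset)
    if end == -1:
        end = len(source)
    return source[start:end]


# Sample input with one non-ASCII character before the offending "$".
TEXT = "x = 1\nna\u00efve = $\n"
SOURCE = TEXT.encode("utf-8")

code_point_index = TEXT.index("$")   # position counted in code points
byte_offset = SOURCE.index(b"$")     # position counted in UTF-8 bytes

print(code_point_index, byte_offset)  # prints "14 15"
print(line_at(SOURCE, byte_offset).decode("utf-8"))
```

Using the code point index 14 as if it were a byte offset would point at the space before the `$`, and the drift grows with every multi-byte character earlier in the input.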

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 21:55):

You can't slice a string by indices anyway. So I think fundamentally, you probably should just convert to a List U8 anyway

view this post on Zulip Richard Feldman (Dec 28 2023 at 21:58):

yeah exactly :big_smile:

view this post on Zulip Richard Feldman (Dec 28 2023 at 21:59):

so I think the best way to go is to work on the List U8 directly

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 22:01):

Exactly. And generally you can load directly into a List U8 from file/network/etc

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 22:02):

So my gut feeling is we should remove Str.walkUtf8* and tell people to use a List U8 directly instead


Last updated: Jun 16 2026 at 16:19 UTC