A YouTube comment made a good point about the string byte-walking functions:
Strings purport to reduce emoji bugs by operating on grapheme-clusters instead of codepoints, yet the docs aptly demonstrate that this helps not with emoji at all - and there's a fold over raw UTF-8 bytes too, for some reason (am I expected to re-parse the UTF-8? Is this for sending over the wire? Who knows?)
I assume we added this functionality so that we could walk the bytes of small strings without allocating, for performance reasons in a specific case. I don't recall exactly what for (probably hashing?). If this is just for hashing (or some other builtin-specific use), I definitely think we should remove it and just add a custom call in the builtins for it.
Given our focus on correctness around strings, is this the right tradeoff? If a user needs this functionality, they can always do it the explicit way (which sadly costs an extra allocation for small strings): myStr |> Str.toUtf8 |> List.walk ....
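The explicit route described above (convert the string to its UTF-8 bytes, then fold over the resulting list) can be sketched roughly as follows. This is Python rather than Roc, purely as an illustration; `hash_str` and `fnv1a_step` are hypothetical names, and FNV-1a is only a stand-in for whatever hashing the builtins actually do.

```python
from functools import reduce

# 64-bit FNV-1a constants (used here only as an example fold).
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_step(state: int, byte: int) -> int:
    """One fold step: mix a single UTF-8 byte into the running hash."""
    return ((state ^ byte) * FNV_PRIME) & MASK_64


def hash_str(s: str) -> int:
    """Explicit two-step version: materialize the bytes, then fold over them."""
    utf8 = s.encode("utf-8")  # analogous to Str.toUtf8 (this is the allocation)
    return reduce(fnv1a_step, utf8, FNV_OFFSET)  # analogous to List.walk
```

The fold itself never interprets the bytes as characters, which is why hashing is one of the few cases where a raw byte walk is unproblematic.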
Curious what others think and if I am missing something.
Maybe, given that strings are already UTF-8, this is really just a naming issue for a function that exposes low-level primitives. Maybe it really should be called Str.walkCodeUnits and Str.toCodeUnits. That is the official Unicode name, at a minimum, and it also reads in a way that makes it more clearly a potential code smell.
so certain use cases require access to the UTF-8 bytes
e.g. encoding and decoding
so there has to be some way to access those
Access for sure, but walking them as a primitive?
removing that means that parsing small strings always requires a heap allocation
so that's a pretty serious performance downside; I don't think the upside outweighs it
Are parsers written that just parse by a single linear walk of UTF-8 bytes of a string? I am used to recursively pattern matching like we do with a list (not possible on a string).
also, I'm almost done with a language reference on strings - but the short version is that the more time I spend with strings, the more convinced I am that:
Str
Yeah, I definitely tend to agree with that. Though, a UTF-8 byte is a UTF-8 code unit.
sure haha
Brendan Hansknecht said:
Are parsers written that just parse by a single linear walk of UTF-8 bytes of a string? I am used to recursively pattern matching like we do with a list (not possible on a string).
hm, that's possible
I guess there's an argument for removing it and then seeing if there's specifically demand for re-adding it
because adding is a nonbreaking change
I could also see an argument for having like Str.utf8 : codec where codec implements EncoderFormatting, DecoderFormatting
and that's the only way to convert a Str to List U8
but I don't like that because:
Str, so I'm not sure that's actually meaningfully better than Str.toUtf8
In my mind, I see Str.toUtf8 or Str.toCodeUnits (whatever the name) as more OK than Str.walkUtf8.
Str.toUtf8 is admitting the reality of needing to get at the bytes of a string. Then you can use List.walk and friends.
Str.walkUtf8 seems to promote potentially wrong use of a string, where you are probably going to assume that the bytes are actually ASCII. I would guess that most people who use this function actually want a Unicode.walkCharacters where "12" followed by an emoji and then 👩‍❤️‍👨 is handled correctly as 1, 2, the emoji, 👩‍❤️‍👨. Note 👩‍❤️‍👨 is extra complex: it is woman, ZWJ, heart, ZWJ, man.
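To make the gap between bytes and characters concrete, here is a small Python sketch (not Roc) that counts UTF-8 code units and code points for "12" followed by the couple-with-heart emoji. The emoji is written as escapes, and the sketch assumes the usual sequence that includes variation selector-16 after the heart. Python's standard library does not segment grapheme clusters, so the user-visible count of three characters is only noted in a comment.

```python
# Woman, ZWJ, heart, variation selector-16, ZWJ, man (one user-visible character).
COUPLE = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F468"
text = "12" + COUPLE  # a reader sees three characters: 1, 2, and the emoji

utf8 = text.encode("utf-8")

# What a byte walk sees: one step per UTF-8 code unit.
code_unit_count = len(utf8)

# What a code point walk sees: still more steps than visible characters.
code_point_count = len(text)

# Bytes per code point, showing that only the first two are ASCII.
bytes_per_code_point = [len(ch.encode("utf-8")) for ch in text]

print(code_unit_count, code_point_count, bytes_per_code_point)
```

A byte walk that treats each step as a character would report 22 characters here, and a code point walk would report 8, so neither matches what a reader sees.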
so what I struggle with there is:
Can you expand on point 2?
As a simple example, what if you want to write a compiler and need to parse a language that allows emoji variables or similar.
so in that case, if I'm lexing 1 byte at a time, what I want to be doing is:
so there's a theoretical argument for having like a Str.parseFirstCodePoint : List U8 -> Result (List U8, U32) (List U8, [InvalidCodePoint, StrWasEmpty]) which would do all that, but that feels too niche for Str :sweat_smile:
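A rough Python sketch of what a function with that shape might do is below. It is not a Roc implementation: the tagged-tuple return value is a hypothetical stand-in for `Result`, and returning the input bytes unchanged on error is an assumption, since the signature does not say what the error-side list contains.

```python
def parse_first_code_point(data: bytes):
    """Decode one code point from the front of `data`.

    Returns ("Ok", rest, code_point) or ("Err", data, reason), where reason
    is "StrWasEmpty" or "InvalidCodePoint".
    """
    if not data:
        return ("Err", data, "StrWasEmpty")

    first = data[0]
    # The lead byte determines how many bytes the sequence should occupy.
    if first < 0x80:
        length = 1
    elif 0xC2 <= first <= 0xDF:
        length = 2
    elif 0xE0 <= first <= 0xEF:
        length = 3
    elif 0xF0 <= first <= 0xF4:
        length = 4
    else:
        # Stray continuation byte or a byte that never appears in UTF-8.
        return ("Err", data, "InvalidCodePoint")

    try:
        # Strict decoding rejects truncated, overlong, and surrogate sequences.
        ch = data[:length].decode("utf-8")
    except UnicodeDecodeError:
        return ("Err", data, "InvalidCodePoint")

    return ("Ok", data[length:], ord(ch))
```

Because the remaining bytes are returned alongside the code point, a lexer built on this can keep tracking its position as a byte offset into the original input.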
What is the advantage of that over walking full characters?
if there's that very specialized function I mentioned, then either is probably fine
but if you're iterating over only code points, then all your error locations are in terms of indices that don't map directly back to the source bytes
so you can't do things like just slice back into the original bytes to print error context anymore
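The point about error locations can be illustrated with a small Python sketch (again not Roc). `first_illegal_offset` is a hypothetical toy lexer check and the allowed character set is invented for the example. What matters is that the byte offset it reports can be used directly to slice the original bytes, while the code point index of the same character is a different number.

```python
from typing import Optional

# Toy rule for the example: lowercase letters, digits, space, and '=' are legal.
ALLOWED_ASCII = set(b"abcdefghijklmnopqrstuvwxyz0123456789 =")


def first_illegal_offset(data: bytes) -> Optional[int]:
    """Walk raw bytes and return the byte offset of the first illegal one."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            continue  # part of a multi-byte sequence, e.g. an emoji identifier
        if byte not in ALLOWED_ASCII:
            return offset
    return None


source_text = "let \U0001F44D = 1 $ 2"  # emoji variable name, then a stray '$'
source = source_text.encode("utf-8")

byte_offset = first_illegal_offset(source)
code_point_index = source_text.index("$")

# The byte offset slices straight back into the original bytes for context.
context = source[byte_offset - 2 : byte_offset + 3].decode("utf-8")
print(byte_offset, code_point_index, context)
```

Here the byte offset is 13 but the code point index is 10, because the emoji occupies four bytes and only one code point.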
You can't slice a string by indices anyway. So I think, fundamentally, you probably should just convert to a List U8 anyway
yeah exactly :big_smile:
so I think the best way to go is to work on the List U8 directly
Exactly. And generally you can load directly into a List U8 from file/network/etc
So my gut feeling is we should remove Str.walkUtf8* and tell people to use a List U8 directly instead
Last updated: Jun 16 2026 at 16:19 UTC