what if the scope of the Str module no longer included Unicode concepts like grapheme clusters, scalars, etc?
the diff would look like this:
https://github.com/roc-lang/roc/compare/main...simplify-str
and then those functions could be moved into a dedicated (currently extremely WIP) unicode package
also the module docs for Str could be dramatically simplified, because currently most of them consist of explaining Unicode concepts like extended grapheme clusters, which aren't commonly known but which are currently a prerequisite for understanding these Str APIs :sweat_smile:
Not being able to get the "chars" or the length of a Str without a package seems like a big disadvantage. I find it hard to imagine that users would prefer it that way.
hm, so what are some use cases where people want those things? :thinking:
String.length is used 4.5K times in elm code on github. I'd have to do a deeper relative analysis to see if that's a lot, but it's a data point.
String.toList is used 2k times in elm code on github
Due to my background in data science I definitely use these functions more than most but validation of string inputs is a generally common use case.
Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.
interesting! So for some more context: part of the reason I want to explore this is that the status quo (in programming in general) is that it's very common for people to reach for the wrong string operations, and then have code that works in a lot of cases but then totally breaks on edge cases like non-English languages or (especially) emoji
and while I think this is probably true:
Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.
I'm wondering whether this is one of those cases where that's kind of the only way to get to a world where people reach for less error-prone tools
for example:
validation, for example in web forms
so let's say we have Str.countGraphemes as the easiest thing to reach for, and people have a mental model of "ok, if I want to count the number of 'characters', that's what I reach for, so I'll make sure the 'first name' field is no more than 16 'characters' by using Str.countGraphemes"
but actually the reason they're enforcing that is that the database column has a capacity of 16 bytes, and so when they get a first name that's 16 grapheme clusters but actually more than 16 bytes, the validation passes but then the database insert fails and the end user gets a cryptic server 500 error instead of a nice message
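To make that mismatch concrete, here is a minimal sketch in Python rather than Roc. It assumes Python's `len`, which counts code points, as a stand-in for a "character" count like Str.countGraphemes; the helper names and the 16-byte column limit are only illustrative.

```python
def fits_by_characters(name: str, limit: int = 16) -> bool:
    # Python's len counts Unicode code points; used here as a stand-in
    # for a "character" count (it is not a grapheme cluster count).
    return len(name) <= limit


def fits_by_bytes(name: str, limit: int = 16) -> bool:
    # Counts UTF-8 bytes, which is what a 16-byte column actually limits.
    return len(name.encode("utf-8")) <= limit


name = "Александр"  # 9 Cyrillic letters, 2 bytes each in UTF-8

print(fits_by_characters(name))  # passes the "16 characters" check
print(fits_by_bytes(name))       # would not fit in a 16-byte column
```

The first check passes while the second fails, which is the case where the form validation succeeds and the database insert does not.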
also, I think in the case of parsers - I could be wrong! - but I think there's a good chance that the nicest parser API is one that doesn't include a concept of "character"
and when it comes to programming puzzles, I guess it depends on the puzzle, but as Ayaz noted, for something like Advent of Code, having a parser package would be a big win, and if there's a nice parser package, I wonder what use cases remain for puzzles and things like "split this string into extended grapheme clusters"
I don't have experience with stats/data science use cases though - maybe some specific examples there would help!
maybe a better way to convey my thoughts here is that while I agree with this:
Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.
...I wonder if this is a situation similar to RegEx: it's a tool that is useful, and should exist, but it's also a tool that is often used out of familiarity when another tool would be better overall - so maybe (like RegEx in Elm and in Rust) it's better if we don't have it in the stdlib, even though that's what people expect
worth noting btw that Rust doesn't have grapheme clusters in the stdlib - for that, you have to use the unicode_segmentation crate
I'm going to think about this some more tomorrow. It's a national holiday today so I wanted to finish up early :sweat_smile: :wave:
that said, Rust does have stdlib support for splitting strings into Unicode scalar values, which I definitely think is a footgun that's asking for emoji bugs :sweat_smile:
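A small sketch of that footgun, in Python rather than Rust or Roc. Python strings index by code point, which for this input matches Unicode scalar values. The emoji is written with escapes so the sequence is unambiguous.

```python
# "Family" emoji: man + ZWJ + woman + ZWJ + girl, usually rendered as one glyph.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

scalars = list(family)
print(len(scalars))  # 5 scalar values for what a user sees as one "character"

# Truncating to the first scalar silently changes the emoji
# instead of keeping or dropping it as a unit.
first = family[:1]
print(first == "\U0001F468")  # True: only the "man" emoji is left
```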
sounds good, thanks for bringing up the use cases, and enjoy the holiday! :smiley:
My initial reaction was that I would probably import the Unicode library in almost every program I wrote, but after reading what @Richard Feldman wrote about having a good parser package I've realised that would probably cover 90% of my use cases.
nice! Can you think of what the other 10% might be?
thinking through specific use cases is very helpful here, I think!
That 10% number might be inflated by the fact that I started a new Unicode-heavy project. It's a pretty printer for CSV files which aligns the columns, so I need to know if a character is a double-width CJK character when printing it
oh cool! :smiley:
how would you feel if you had to import a unicode package for that project?
I knew going in that there would be edge cases with double width characters because that's a problem I was trying to solve with existing tools, so I'd be fine with it
It's a complex enough problem that I wouldn't expect it to be in the stdlib and I was surprised there was a specific function in the Python stdlib to get the visual width of a character
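For reference, a rough sketch of that kind of width calculation in Python. It assumes the stdlib function meant here is `unicodedata.east_asian_width` (the message doesn't name it); `display_width` and `pad_cell` are hypothetical helpers. It ignores emoji sequences, zero-width characters other than combining marks, and ambiguous-width characters, so treat it as an approximation.

```python
import unicodedata


def display_width(text: str) -> int:
    # Approximate terminal columns: wide/fullwidth characters take 2,
    # combining marks take 0, everything else is assumed to take 1.
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_cell(text: str, width: int) -> str:
    # Right-pad with spaces so the cell occupies `width` columns.
    return text + " " * max(0, width - display_width(text))


print(display_width("abc"))     # 3
print(display_width("日本語"))  # 6
print(repr(pad_cell("日本", 6)))
```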
If unicode functions aren't in the stdlib I'd expect lots of beginner questions like "how do I get the length of a string?", which is probably a better outcome because then someone can ask for more specific details about what they're trying to achieve
yeah that's a great point!
The big question for me: will this just lead to many, many users converting strings to a list of bytes and then doing things manually? If so, they probably will only think about ASCII and not Unicode.
You could have the Unicode package be builtin to make more people use it. People would still be forced to think about how the operation they're doing is complicated by unicode if they use it as Unicode.countGraphemes instead of Str.countGraphemes.
Is Str supposed to represent arbitrary binary data, any human text, or only unicode? If it's only unicode, then that would be a reason to keep unicode functions within Str. (But then what would be the right way to handle arbitrary binary data in Roc?)
As it currently stands Str is a List U8 with Utf8 encoding.
So only unicode, which is often only ascii, which is why so many people mess up/write bugged code.
That is a really good point. Should we be trying to remove unicode support from std, or should we be trying to write a good unicode interface in std with names that help users just make the right decision?
I think that's a plausible path, but my concern is that teaching Unicode is a big deal, and I'm not sure how to organize that if there are several unicode modules
like I don't think "names that help users just make the right decision" is achievable without someone having a certain baseline amount of Unicode knowledge, and it seems that this baseline level of knowledge is far from universal
Brendan Hansknecht said:
The big question for me: will this just lead to many, many users converting strings to a list of bytes and then doing things manually? If so, they probably will only think about ASCII and not Unicode.
I also have this concern, although my hope is that doing stuff like "convert it to a List U8 and then work with the U8s directly" will feel janky enough that people will (correctly) feel they're doing something sketchy, and will at least have the intuition that they should look for another way to do it
Yeah, probably true for most normal string operations. That said, for a class of problems, working with the bytes and assuming they are ascii is pretty common.
hm, which problems?
anything with parsing
I think for a huge class of parsing problems, people don't think about unicode and just operate over ascii bytes.
Brendan Hansknecht said:
anything with parsing
but if we have a parser package, wouldn't people use that anyway?
:shrug: I probably am the wrong person to ask. I still don't know if I like parser combinators. So I generally would avoid a parser package.
Often times it is simple enough to just use something recursive over the bytes.
Brendan Hansknecht said:
I think for a huge class of parsing problems, people don't think about unicode and just operate over ascii bytes.
would this change if we offered a Str.walkGraphemes or something?
like if Str.walkGraphemes is available, and parser is available, and Str.walkUtf8 is available, how many people pick the first option over the other two? :thinking:
Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.
I agree, but also:
Having string bugs enabled by the standard library will likely be surprising to users and I would expect this to evoke negative emotions.
well I don't know how surprising it would be, since most languages do it :sweat_smile:
Well that's true. The bugs might be surprising though!
the norm is that end users suffer the consequences, unfortunately, and then programmers write blog posts about things they learned the hard way
I know nothing about the intricacies of Unicode. But knowing that they exist, I think I'd vote for an approach that favors reliability and predictability. I don't want users of my software to suffer because I made a mistake.
This is reminding me of Json decoders in Elm. Surprising and unfamiliar to newcomers, but they force you to deal with the reality that servers cannot be trusted :melting_face:. After experiencing one production bug caused by failure to validate a server response in a JS codebase, you'd be hard pressed to convince me to skip that step for production code, regardless of language.
And there's nothing stopping someone who wants something more familiar from either:
a) Using a helper package that provides all the error-prone but familiar functions like Str.len
b) using a language that is better suited to applications where reliability and correctness aren't as important
Idk now you have me paranoid about Unicode-related bugs :sweat_smile:
yeah I remember a fun outage referred to as "the emoji fire" because users submitting emoji to a certain input box caused a serious production fire :sweat_smile:
Wheeeee
After reading this, I think my main thought is that having it out of the standard library mostly feels like a way to let it get iterated on more and have a different perception of quality. Also, it helps keep the standard library simple and focused. Unicode is simply a complex problem that is easy to mess up. If we already had a high-quality Unicode package we felt was safe to set in stone, maybe we should put it in the standard library, but not with the flux around how it should be handled.
I can get behind taking the unicode stuff out of std now, put it in its own package, improve it over the years, and if we want to we can put it back in std once we're satisfied with it.
interesting, I'm open to that!
random idea: Str.len : [Graphemes, Utf8Bytes] -> Nat
like just make it really up-front that you have to choose one or the other every time
and it would optimize away after inlining
I feel like I would prefer |> Str.toUtf8 |> List.len as it is more explicit, and then keep unicode in a separate package that can change over time with the standards. To do this with graphemes we would need to support most of unicode in the stdlib, right? Would the std lib need to understand different character code charts to understand grapheme boundaries? Would keeping unicode in a separate package reduce pressure for std to change over time?
My biggest concern is that putting it in one function makes it feel like the perf cost is the same. Which it very much is not.
Luke Boswell said:
I feel like I would prefer |> Str.toUtf8 |> List.len as it is more explicit
I like this idea in theory, but unfortunately in practice it means a small string would have to incur a heap allocation just to get the byte length :sweat_smile:
I guess if we add bytes as a type, that would be fixed? Str.toUtf8 would return a Bytes type that can be small.
if Bytes can be small, then yes
I don't see why we would make it otherwise.
I think the main argument against it is that it would be a cost paid in a lot of places for a pretty narrow use case (potentially just small strings you want to do byte operations on)
a concern I have after thinking about this more is that if all the grapheme stuff is in a separate library, but byte-wise operations are in the stdlib (which they have to be; it has to be possible to convert a Str to bytes so they can be serialized) then people might just reach for things that way, which is the worst thing for correctness :sweat_smile:
I'm not sure I follow in terms of perf. If it is a perf gain for strings, why wouldn't it be a perf gain for bytes?
oh, I'm assuming an opaque Bytes type would be used (approximately) everywhere List U8 is today
e.g. http requests, file I/O...
Yep
and although a lot of UTF-8 strings are under 24B, and therefore can avoid a heap allocation, that's probably not going to be true of basically any of the I/O use cases of Bytes - which I think would be almost all of them in practice
I guess that is fair, but if you are accessing memory on the heap, you are screwed anyway, so I don't think it would make much of a difference for perf.
Also, people tend to chop things up, but I guess with seamless slices, we can keep those on the heap, so maybe we have a better case than most languages here.
Cause in c++ with llvm::SmallVector for example, when you chop things up, small list optimization is a huge gain. In roc, if you aren't modifying the chunks, I guess it isn't a gain cause seamless slices will rescue you from the extra allocations.
I guess that specifically may make a difference in terms of Roc's performance here
So yeah, maybe not as clear cut due to that.
But yeah, before we added seamless slices, splitting a giant string by spaces was definitely slower due to all of the data copying.
I guess the question is: will strings need to be converted to bytes often to be used? If so, that would lead to a lot of pain points due to losing the small str optimization and paying the cost of allocating, but at the same time, you are reducing some costs for the potentially more common case of byte based data from files, http, etc.
yeah the other option is to try to implement things without needing Str to be converted into bytes
"things" being Unicode things, mainly :big_smile:
As long as the unicode library is smart and able to walk (potentially in chunks), I bet a lot can be done without conversions. But, I'm not sure. Could always expose a function to get a tuple of the bytes or something, but that sounds very unpleasant
Maybe a silly question, but is it possible to have a small list optimisation? where we keep the bytes on the stack in the same way as small Str?
Tangentially, also probably a silly question, is getting the byte length of a small Str a common operation? Like would you be sorting a list of small Str by length or similar?
is it possible to have a small list optimisation? where we keep the bytes on the stack in the same way as small Str?
100%. llvm::SmallVector does this, for example. It's just a less clear tradeoff, and it really works best with dynamically sized list types (as in a different number of bytes on the stack depending on the element size). With Str and Bytes, the primitive is a U8 and the small optimization is pretty clear and simple. With a U64 and other bigger/more complex types, a small list can only hold 2 U64 if we keep the limit of 24 bytes (we also need to store the length).
In the case of llvm::SmallVector, I believe that it targets 64 bytes worth of elements (unless otherwise changed by the end user), however many fit in that. So 8 U64s. This just adds a huge level of tradeoffs and perf details to consider. If we want to keep lists simple and not require essentially ridiculous C++ templates in any host language that accesses a Roc list, this is even more complex. I honestly don't think it would be worth it to do this generically, especially since we have seamless slices.
As such, that means, we would maybe only apply it to very small types. So maybe for single byte types or something of that nature.
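The arithmetic behind those numbers, as a quick Python sketch. It assumes one byte of the 24-byte inline buffer is reserved for the length, and that the 64-byte SmallVector figure counts element storage only; `inline_capacity` is a hypothetical helper, not anything in Roc or LLVM.

```python
def inline_capacity(buffer_bytes: int, elem_bytes: int, length_bytes: int = 1) -> int:
    # Number of whole elements that fit inline once the length is stored.
    return (buffer_bytes - length_bytes) // elem_bytes


# 24-byte inline buffer, one byte assumed to hold the length.
print(inline_capacity(24, 1))  # U8 elements
print(inline_capacity(24, 8))  # U64 elements

# 64 bytes of element storage, length assumed to be kept elsewhere.
print(inline_capacity(64, 8, length_bytes=0))  # U64 elements
```

Under these assumptions a U8 buffer holds 23 elements inline while a U64 buffer holds only 2, which is why the optimization is much less compelling for larger element types.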
is getting the byte length of a small Str a common operation?
Yes, but only because most people have strings they assume are 100% ascii
So most code I have seen assumes that byte length == str length. Thus pretty common to get the length of a str that way whether small or large.
Maybe also used for prefix trees or similar, but that is pretty niche
Last updated: Jun 16 2026 at 16:19 UTC