this is the best analysis I've ever read about the design of strings and "characters" - it's long, but great!
https://hsivonen.fi/string-length/
I did not realize that extended grapheme cluster lengths can change when new versions of unicode are released
that is a strong argument that we should move that concept out of builtins and into a separate package
because otherwise we can't have Str update to new Unicode versions without potentially causing regressions in existing userspace code
which in turn means if you actually use any of those grapheme features, and you want to make sure things don't regress, you'd have to just not upgrade your Roc version (as opposed to declining to upgrade your unicode package version, for example, which is a much tamer choice to make)
here's a more specific design: https://docs.google.com/document/d/1TTYGVKhq0Jy43-j9AIt7B0PiAravloYmOVw9Dd_cAts/edit?usp=sharing
Looks solid!
@drathier any thoughts on this?
Is Str.contains guaranteed to be correct? Could you find an "e" byte in the second byte of a unicode codepoint? If so, does that mean we have to deal with unicode to do replace or contains correctly?
not in utf8
in utf8 that won't be a problem
Ah yeah, because continuation bytes never overlap with ASCII, and every multi-byte sequence has its own distinct lead-byte prefix
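To make the "not in utf8" point concrete: UTF-8 is self-synchronizing, so a byte-wise search can never start a match in the middle of a code point. A minimal sketch, in Rust purely as a neutral illustration language (all later snippets here use Rust too; none of this is Roc API):

```rust
fn main() {
    let haystack = "héllo"; // 'é' is the two bytes 0xC3 0xA9 in UTF-8
    // The ASCII byte 'e' (0x65) can't hide inside 'é': continuation bytes
    // always look like 10xxxxxx, so they never collide with ASCII bytes.
    assert!(!haystack.as_bytes().contains(&b'e'));
    // So a byte-wise substring search for a well-formed UTF-8 needle agrees
    // with a codepoint-aware search (normalization questions aside).
    assert!(haystack.contains("é"));
    assert!(!haystack.contains("e"));
}
```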
thanks for tagging me @Richard Feldman , here's a wall of text :)
normalization is tricky
personally I always prefer the correctness argument over performance, so I want everything normalized as early as possible, in the same normalization form every time (pick whatever form is the most common, probably one of the composed forms that encode ä as a single code point)
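A minimal sketch of what normalizing early buys you, assuming the third-party unicode-normalization crate (the crate and the NFC choice are illustrative assumptions, not part of the proposal):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let precomposed = "\u{00E4}";   // "ä" as one code point (NFC)
    let decomposed = "a\u{0308}";   // "a" + COMBINING DIAERESIS (NFD)
    // Without normalization, equality, hashing, contains, etc. all disagree
    // with what the user sees on screen:
    assert_ne!(precomposed, decomposed);
    // Normalizing both sides to the same form as early as possible fixes that:
    let a: String = precomposed.nfc().collect();
    let b: String = decomposed.nfc().collect();
    assert_eq!(a, b);
    // Note that NFC "ä" is one code point but still two UTF-8 bytes.
    assert_eq!(a.len(), 2);
}
```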
re grapheme string length
I'm fairly sure the length only ever decreases, and only for previously unassigned code points? I.e., by default all graphemes are one code point, and then you declare that some are actually larger than one code point. Maybe some characters merge and "bugfixes" are made, idk. I'd be happy to make that "breaking change" though, as I'm very confident that it wouldn't actually break anything in practice. How often do you use and depend on the length of unicode extended grapheme clusters? Overly strict unit tests maybe, idk. You'll probably only see this in practice whenever a new iOS release comes out with new fancy emojis, and people start using them against your not-yet-upgraded Roc code. I'm assuming you're shipping unicode data with the binary here for portability and developer sanity.
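To illustrate the versioning point: grapheme counts depend on which Unicode data the segmenter ships with. A sketch assuming the third-party unicode-segmentation crate, which bundles the rules of one specific Unicode version:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "heart on fire": U+2764 U+FE0F U+200D U+1F525, an emoji ZWJ sequence.
    let s = "\u{2764}\u{FE0F}\u{200D}\u{1F525}";
    assert_eq!(s.chars().count(), 4);
    // With current Unicode data this is a single extended grapheme cluster.
    assert_eq!(s.graphemes(true).count(), 1);
    // A segmenter built against data that predates a newly assigned emoji
    // wouldn't know it is Extended_Pictographic, so a similar ZWJ sequence
    // could come back as several clusters instead of one.
}
```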
re what's a good default on the bytes, code units, code points, or grapheme clusters scale
I'd try to go as far right on that scale as reasonable. Code point minimum, but seriously consider graphemes as the default, as that's what people expect. Nobody thinks of :family_woman_woman_girl_boy: as anything other than one emoji, assuming your font renders them as a single emoji, and it doesn't feel like a stretch to see fonts adding support for e.g. skin color on the family members. Right now, that single emoji is built from 4 emojis, 7 code points, or 11-25 code units depending on encoding.
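Those counts check out. A quick sketch for the woman-woman-girl-boy family emoji (the grapheme count again relies on the assumed unicode-segmentation crate):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // WOMAN ZWJ WOMAN ZWJ GIRL ZWJ BOY
    let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
    assert_eq!(family.graphemes(true).count(), 1); // 1 grapheme cluster
    assert_eq!(family.chars().count(), 7);         // 7 code points (4 emoji + 3 ZWJ)
    assert_eq!(family.encode_utf16().count(), 11); // 11 UTF-16 code units
    assert_eq!(family.len(), 25);                  // 25 UTF-8 bytes
}
```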
I've considered having multiple functions like lengthGraphemes, lengthCodePoints, lengthUtf8Bytes where they differ (names are hard), to spark interest and to get devs to realize that there's a meaningful difference outside of ascii, which includes emojis. Having a length = lengthGraphemes to indicate a good default is also useful. As for the actual default encoding, utf-8 seems like the obvious choice and what everything's standardizing on. It's absolutely reasonable to expose functions that work on utf8 bytes, e.g. for parsers, but I'd hide them away a bit. Purescript for example has exactly three use-cases of code units: two different string parsing libraries, and indices into strings from js "stdlib/ffi" functions. We looked through all usages when suggesting a change to the Char type.
We tried to make grapheme/codepoint/byte indices type-safe in the purescript stdlib when having functions for all of them, but reached the conclusion that it's not possible to be fully type safe here.
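A rough idea of what that attempt looks like, as a hypothetical Rust sketch (not PureScript's or Roc's actual API): distinct index newtypes catch unit mix-ups at compile time, but nothing ties an index to the particular string it came from, which is roughly where full type safety breaks down.

```rust
#[derive(Clone, Copy, Debug)]
struct ByteIndex(usize);
#[derive(Clone, Copy, Debug)]
struct CodePointIndex(usize);

fn byte_at(s: &str, i: ByteIndex) -> Option<u8> {
    s.as_bytes().get(i.0).copied()
}

fn code_point_at(s: &str, i: CodePointIndex) -> Option<char> {
    s.chars().nth(i.0)
}

fn main() {
    let s = "héllo";
    println!("{:?}", byte_at(s, ByteIndex(1)));            // Some(0xC3), mid-codepoint
    println!("{:?}", code_point_at(s, CodePointIndex(1))); // Some('é')
    // byte_at(s, CodePointIndex(1)) would not compile: the mix-up is caught,
    // but an index computed from some other string still slips through.
}
```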
awesome, thanks for the detailed write-up! :smiley:
I hadn't thought about names and loan words...is there any uppercasing function in existence that gets those right? :sweat_smile:
Richard Feldman said:
I hadn't thought about names and loan words...is there any uppercasing function in existence that gets those right? :sweat_smile:
No, and that's a big problem imo :) To get it right, you have to annotate which parts of the text follow which locale, and absolutely nobody is going to do that work. So much of unicode assumes text is in one language at a time, but afaict most bilingual people mix languages all the time.
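One concrete case of that locale problem, using only Rust's standard library as an illustration: the locale-independent default case mapping uppercases "i" to "I", but Turkish orthography expects dotted "İ" (U+0130), and without a locale annotation no function can know which one the text needs.

```rust
fn main() {
    // Rust's std applies the locale-independent Unicode default mapping.
    assert_eq!("i".to_uppercase(), "I");
    // Turkish text would need U+0130 (İ) here instead, so this result is
    // wrong for Turkish unless the caller somehow supplies the locale.
    assert_ne!("i".to_uppercase(), "\u{0130}");
}
```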
There's tons of loan words too, and grammar for loan words is hard to get right. Everyone in Sweden will agree that it's "en pokémon/pokémonen" and "ett e-mail/(e-)mailet". Somehow all Swedes agree on the grammatical gender of those non-Swedish words, and that we should apply Swedish grammar rules to them when used in a Swedish sentence.
Basically, you can't really assume a single locale ever.
Here's an example sentence from a chat of mine, about an hour ago: "konstig notis att få out of context" ("weird notification to get out of context"), where the first third of the sentence is implied ("det där var en", "that was a", is missing), the second third is Swedish and the last third is English. What locale should that chat use, when languages are mixed within single sentences? When chatting with bilingual and multilingual friends, about 25% of the words I write are in English, about 70% in Swedish and about 5% other.
I feel like the only sensible way to treat text is to really lock it down to operations that for sure will work.
yeah that's my conclusion too :+1:
and then treat everything else as an advanced use case that requires going to a package outside the stdlib
But who's going to use a programming language where super basic stuff like String.length and String.startsWith is missing? :)
I feel like Erlang waited 14 years before adding strings to the language for a reason...
oh I think startsWith is fine
er, well that's because I'm ok with not normalizing
(although I can totally see the argument for normalizing too - I just think the perf cost crosses a line, especially for equals and hashing)
wait, startsWith is only fine if you are normalizing, isn't it? You'd normalize when constructing strings, and keep the contents normalized as you go
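A small sketch of that interaction, again assuming the third-party unicode-normalization crate: a byte-wise startsWith only matches what the user sees if both strings were normalized to the same form when they were constructed.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let haystack = "a\u{0308}pple"; // "äpple" with a decomposed (NFD) ä
    let needle = "\u{00E4}";        // "ä" as one precomposed (NFC) code point
    // Byte-wise startsWith misses, even though both render identically:
    assert!(!haystack.starts_with(needle));
    // If strings are normalized on construction, the byte-wise check agrees
    // with what the user sees:
    let normalized: String = haystack.nfc().collect();
    assert!(normalized.starts_with(needle));
}
```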