this is the best analysis I've ever read about the design of strings and "characters" - it's long, but great!
https://hsivonen.fi/string-length/
I did not realize that extended grapheme cluster lengths can change when new versions of unicode are released
that is a strong argument that we should move that concept out of builtins and into a separate package
because otherwise we can't have Str update to new Unicode versions without potentially causing regressions in existing userspace code
which in turn means if you actually use any of those grapheme features, and you want to make sure things don't regress, you'd have to just not upgrade your Roc version (as opposed to declining to upgrade your unicode package version, for example, which is a much tamer choice to make)
here's a more specific design: https://docs.google.com/document/d/1TTYGVKhq0Jy43-j9AIt7B0PiAravloYmOVw9Dd_cAts/edit?usp=sharing
Looks solid!
@drathier any thoughts on this?
Is Str.contains guaranteed to be correct? Could you find an "e" byte in the second byte of a unicode codepoint? If so, does that mean we have to deal with unicode to do replace or contains correctly?
not in utf8
in utf8 that won't be a problem
Ah yeah, because continuation bytes never overlap with ASCII, and every multi-byte sequence has its own distinct lead-byte prefix
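To make the "not in utf8" point concrete: UTF-8 is self-synchronizing, so a byte-wise search can never start a match in the middle of a code point. A minimal sketch, in Rust purely as a neutral illustration language (all later snippets here use Rust too; none of this is Roc API):

```rust
fn main() {
    let haystack = "héllo"; // 'é' is the two bytes 0xC3 0xA9 in UTF-8
    // The ASCII byte 'e' (0x65) can't hide inside 'é': continuation bytes
    // always look like 10xxxxxx, so they never collide with ASCII bytes.
    assert!(!haystack.as_bytes().contains(&b'e'));
    // So a byte-wise substring search for a well-formed UTF-8 needle agrees
    // with a codepoint-aware search (normalization questions aside).
    assert!(haystack.contains("é"));
    assert!(!haystack.contains("e"));
}
```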
thanks for tagging me @Richard Feldman , here's a wall of text :)
normalization is tricky
personally I always prefer the correctness argument over performance, so I want everything normalized as early as possible, in the same normalization form every time (pick whatever form is the most common, probably one of the composed forms that encode ä as a single code point)
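A minimal sketch of what normalizing early buys you, assuming the third-party unicode-normalization crate (the crate and the NFC choice are illustrative assumptions, not part of the proposal):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let precomposed = "\u{00E4}";   // "ä" as one code point (NFC)
    let decomposed = "a\u{0308}";   // "a" + COMBINING DIAERESIS (NFD)
    // Without normalization, equality, hashing, contains, etc. all disagree
    // with what the user sees on screen:
    assert_ne!(precomposed, decomposed);
    // Normalizing both sides to the same form as early as possible fixes that:
    let a: String = precomposed.nfc().collect();
    let b: String = decomposed.nfc().collect();
    assert_eq!(a, b);
    // Note that NFC "ä" is one code point but still two UTF-8 bytes.
    assert_eq!(a.len(), 2);
}
```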
re grapheme string length
I'm fairly sure the length only ever decreases, and only for previously unassigned code points? I.e., by default all graphemes are one code point, and then you declare that some are actually larger than one code point. Maybe some characters merge and "bugfixes" are made, idk. I'd be happy to make that "breaking change" though, as I'm very confident that it wouldn't actually break anything in practice. How often do you use and depend on the length of unicode extended grapheme clusters? Overly strict unit tests maybe, idk. You'll probably only see this in practice whenever a new iOS release comes out with new fancy emojis, and people start using them against your not-yet-upgraded Roc code. I'm assuming you're shipping unicode data with the binary here for portability and developer sanity.
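To illustrate the versioning point: grapheme counts depend on which Unicode data the segmenter ships with. A sketch assuming the third-party unicode-segmentation crate, which bundles the rules of one specific Unicode version:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "heart on fire": U+2764 U+FE0F U+200D U+1F525, an emoji ZWJ sequence.
    let s = "\u{2764}\u{FE0F}\u{200D}\u{1F525}";
    assert_eq!(s.chars().count(), 4);
    // With current Unicode data this is a single extended grapheme cluster.
    assert_eq!(s.graphemes(true).count(), 1);
    // A segmenter built against data that predates a newly assigned emoji
    // wouldn't know it is Extended_Pictographic, so a similar ZWJ sequence
    // could come back as several clusters instead of one.
}
```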
re what's a good default on the bytes, code units, code points, or grapheme clusters scale
I'd try to go as far right on that scale as reasonable. Code point minimum, but seriously consider graphemes as the default, as that's what people expect. Nobody thinks of :family_woman_woman_girl_boy: as anything other than one emoji, assuming your font renders them as a single emoji, and it doesn't feel like a stretch to see fonts adding support for e.g. skin color on the family members. Right now, that single emoji is built from 4 emojis, 7 code points, or 11-25 code units depending on encoding.
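Those counts check out. A quick sketch for the woman-woman-girl-boy family emoji (the grapheme count again relies on the assumed unicode-segmentation crate):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // WOMAN ZWJ WOMAN ZWJ GIRL ZWJ BOY
    let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
    assert_eq!(family.graphemes(true).count(), 1); // 1 grapheme cluster
    assert_eq!(family.chars().count(), 7);         // 7 code points (4 emoji + 3 ZWJ)
    assert_eq!(family.encode_utf16().count(), 11); // 11 UTF-16 code units
    assert_eq!(family.len(), 25);                  // 25 UTF-8 bytes
}
```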
I've considered having multiple functions like lengthGraphemes, lengthCodePoints, lengthUtf8Bytes where they differ (names are hard), to spark interest and to get devs to realize that there's a meaningful difference outside of ascii, which includes emojis. Having a length = lengthGraphemes to indicate a good default is also useful. As for the actual default encoding, utf-8 seems like the obvious choice and what everything's standardizing on. It's absolutely reasonable to expose functions that work on utf8 bytes, e.g. for parsers, but I'd hide them away a bit. Purescript for example has exactly three use-cases of code units: two different string parsing libraries, and indices into strings from js "stdlib/ffi" functions. We looked through all usages when suggesting a change to the Char type.
We tried to make grapheme/codepoint/byte indices type-safe in the purescript stdlib when having functions for all of them, but reached the conclusion that it's not possible to be fully type safe here.
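A rough idea of what that attempt looks like, as a hypothetical Rust sketch (not PureScript's or Roc's actual API): distinct index newtypes catch unit mix-ups at compile time, but nothing ties an index to the particular string it came from, which is roughly where full type safety breaks down.

```rust
#[derive(Clone, Copy, Debug)]
struct ByteIndex(usize);
#[derive(Clone, Copy, Debug)]
struct CodePointIndex(usize);

fn byte_at(s: &str, i: ByteIndex) -> Option<u8> {
    s.as_bytes().get(i.0).copied()
}

fn code_point_at(s: &str, i: CodePointIndex) -> Option<char> {
    s.chars().nth(i.0)
}

fn main() {
    let s = "héllo";
    println!("{:?}", byte_at(s, ByteIndex(1)));            // Some(0xC3), mid-codepoint
    println!("{:?}", code_point_at(s, CodePointIndex(1))); // Some('é')
    // byte_at(s, CodePointIndex(1)) would not compile: the mix-up is caught,
    // but an index computed from some other string still slips through.
}
```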
awesome, thanks for the detailed write-up! :smiley:
I hadn't thought about names and loan words...is there any uppercasing function in existence that gets those right? :sweat_smile:
Richard Feldman said:
I hadn't thought about names and loan words...is there any uppercasing function in existence that gets those right? :sweat_smile:
No, and that's a big problem imo :) To get it right, you have to annotate which parts of the text follow which locale, and absolutely nobody is going to do that work. So much of unicode assumes text is in one language at a time, but afaict most bilingual people mix languages all the time.
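One concrete case of that locale problem, using only Rust's standard library as an illustration: the locale-independent default case mapping uppercases "i" to "I", but Turkish orthography expects dotted "İ" (U+0130), and without a locale annotation no function can know which one the text needs.

```rust
fn main() {
    // Rust's std applies the locale-independent Unicode default mapping.
    assert_eq!("i".to_uppercase(), "I");
    // Turkish text would need U+0130 (İ) here instead, so this result is
    // wrong for Turkish unless the caller somehow supplies the locale.
    assert_ne!("i".to_uppercase(), "\u{0130}");
}
```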
There's tons of loan words too, and grammar for loan words is hard to get right. Everyone in Sweden will agree that it's "en pokémon/pokémonen" and "ett e-mail/(e-)mailet". Somehow all Swedes agree on the grammatical gender of those non-Swedish words, and that we should apply Swedish grammar rules to them when used in a Swedish sentence.
Basically, you can't really assume a single locale ever.
Here's an example sentence from a chat of mine, about an hour ago: "konstig notis att få out of context" ("weird notification to get out of context"), where the first third of the sentence is implied ("det där var en", "that was a", is missing), the second third is Swedish and the last third is English. What locale should that chat use, when languages are mixed within single sentences? When chatting with bilingual and multilingual friends, about 25% of the words I write are in English, about 70% in Swedish and about 5% other.
I feel like the only sensible way to treat text is to really lock it down to operations that for sure will work.
yeah that's my conclusion too :+1:
and then treat everything else as an advanced use case that requires going to a package outside the stdlib
But who's going to use a programming language where super basic stuff like String.length and String.startsWith is missing? :)
I feel like Erlang waited 14 years before adding strings to the language for a reason...
oh I think startsWith is fine
er, well that's because I'm ok with not normalizing
(although I can totally see the argument for normalizing too - I just think the perf cost crosses a line, especially for equals and hashing)
wait, startsWith is only fine if you are normalizing, isn't it? You'd normalize when constructing strings, and keep the contents normalized as you go
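A small sketch of that interaction, again assuming the third-party unicode-normalization crate: a byte-wise startsWith only matches what the user sees if both strings were normalized to the same form when they were constructed.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let haystack = "a\u{0308}pple"; // "äpple" with a decomposed (NFD) ä
    let needle = "\u{00E4}";        // "ä" as one precomposed (NFC) code point
    // Byte-wise startsWith misses, even though both render identically:
    assert!(!haystack.starts_with(needle));
    // If strings are normalized on construction, the byte-wise check agrees
    // with what the user sees:
    let normalized: String = haystack.nfc().collect();
    assert!(normalized.starts_with(needle));
}
```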