Stream: API design

Topic: Str and "characters"


Richard Feldman (Oct 07 2023 at 14:46):

this is the best analysis I've ever read about the design of strings and "characters" - it's long, but great!

https://hsivonen.fi/string-length/

Richard Feldman (Oct 07 2023 at 16:30):

I did not realize that extended grapheme cluster lengths can change when new versions of unicode are released

Richard Feldman (Oct 07 2023 at 16:31):

that is a strong argument that we should move that concept out of builtins and into a separate package

Richard Feldman (Oct 07 2023 at 16:31):

because otherwise we can't have Str update to new Unicode versions without potentially causing regressions in existing userspace code

Richard Feldman (Oct 07 2023 at 16:33):

which in turn means if you actually use any of those grapheme features, and you want to make sure things don't regress, you'd have to just not upgrade your Roc version (as opposed to declining to upgrade your unicode package version, for example, which is a much tamer choice to make)

Richard Feldman (Oct 13 2023 at 10:52):

here's a more specific design: https://docs.google.com/document/d/1TTYGVKhq0Jy43-j9AIt7B0PiAravloYmOVw9Dd_cAts/edit?usp=sharing

Anton (Oct 13 2023 at 11:07):

Looks solid!

Richard Feldman (Oct 14 2023 at 18:50):

@drathier any thoughts on this?

Brendan Hansknecht (Oct 14 2023 at 19:07):

Is Str.contains guaranteed to be correct? Could you find an e byte in the second byte of a unicode codepoint? If so, does that mean we have to deal with unicode to do replace or contains correctly?

Richard Feldman (Oct 14 2023 at 19:22):

not in utf8

Richard Feldman (Oct 14 2023 at 19:22):

in utf8 that won't be a problem

Brendan Hansknecht (Oct 14 2023 at 19:48):

Ah yeah, 'cause UTF-8 doesn't support extended ASCII, and every byte of a multi-byte sequence has a prefix
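
A minimal sketch of why byte-wise search is safe in UTF-8 (Python, illustrative only; `utf8_contains` and `all_bytes_non_ascii` are hypothetical helpers, not Roc's `Str.contains`). Every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as `e` cannot appear inside one. UTF-16 is shown for contrast, since there the same byte value can occur inside a code unit.

```python
def utf8_contains(haystack: str, needle: str) -> bool:
    """Byte-wise substring search over the UTF-8 encodings (hypothetical helper)."""
    return needle.encode("utf-8") in haystack.encode("utf-8")


def all_bytes_non_ascii(ch: str) -> bool:
    """True if every UTF-8 byte of `ch` is >= 0x80, so none can be mistaken for ASCII."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))


# U+00E9 (e with acute) encodes as 0xC3 0xA9: a lead byte plus a continuation byte.
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
assert all_bytes_non_ascii("\u00e9")

# So a byte-wise search for "e" finds no false match inside "caf\u00e9".
assert not utf8_contains("caf\u00e9", "e")
assert utf8_contains("caf\u00e9", "\u00e9")

# Contrast: in UTF-16LE, U+0165 is the bytes 0x65 0x01, and 0x65 is ASCII "e",
# so a naive byte-wise search would report a false match there.
assert b"e" in "\u0165".encode("utf-16-le")
```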

drathier (Oct 17 2023 at 16:18):

thanks for tagging me @Richard Feldman , here's a wall of text :)

I've considered having multiple functions like lengthGraphemes, lengthCodePoints, lengthUtf8Bytes where they differ (names are hard), to spark interest and to get devs to realize that there's a meaningful difference outside of ASCII, which includes emojis. Having a length = lengthGraphemes to indicate a good default is also useful. As for the actual default encoding, UTF-8 seems like the obvious choice and is what everything's standardizing on. It's absolutely reasonable to expose functions that work on UTF-8 bytes, e.g. for parsers, but I'd hide them away a bit. PureScript, for example, has exactly three use cases for code units: two different string parsing libraries, and indices into strings from JS "stdlib/ffi" functions. We looked through all usages when suggesting a change to the Char type.

We tried to make grapheme/codepoint/byte indices type-safe in the PureScript stdlib when we had functions for all of them, but reached the conclusion that it's not possible to be fully type-safe here.
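
To make the difference between those length notions concrete, here is a rough Python sketch (not from the thread; the function names are snake_case stand-ins for the lengthUtf8Bytes / lengthCodePoints idea above, plus a UTF-16 variant for comparison). A grapheme count is left out on purpose: Python's standard library has no grapheme segmentation, and the result would depend on the Unicode version of the segmentation data.

```python
def length_utf8_bytes(s: str) -> int:
    """Number of bytes in the UTF-8 encoding."""
    return len(s.encode("utf-8"))


def length_code_points(s: str) -> int:
    """Number of Unicode code points (what Python's len() counts)."""
    return len(s)


def length_utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding."""
    return len(s.encode("utf-16-le")) // 2


# The facepalm emoji sequence used in the linked article, written with escapes:
# FACE PALM + skin tone modifier + ZERO WIDTH JOINER + MALE SIGN + VARIATION SELECTOR-16.
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

assert length_utf8_bytes(FACEPALM) == 17
assert length_code_points(FACEPALM) == 5
assert length_utf16_code_units(FACEPALM) == 7

# For pure ASCII, all three notions agree.
assert length_utf8_bytes("abc") == length_code_points("abc") == length_utf16_code_units("abc") == 3
```

The three numbers differ even though the sequence renders as a single emoji, which is the gap a grapheme-based length is meant to close.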

Richard Feldman (Oct 17 2023 at 16:21):

awesome, thanks for the detailed write-up! :smiley:

Richard Feldman (Oct 17 2023 at 16:22):

I hadn't thought about names and loan words...is there any uppercasing function in existence that gets those right? :sweat_smile:

drathier (Oct 17 2023 at 21:22):

Richard Feldman said:

I hadn't thought about names and loan words...is there any uppercasing function in existence that gets those right? :sweat_smile:

No, and that's a big problem imo :) To get it right, you have to annotate which parts of the text follow which locale, and absolutely nobody is going to do that work. So much of Unicode assumes text is one language at a time, but afaict, most bilingual people mix languages all the time.

There are tons of loan words too, and grammar for loan words is hard to get right. Everyone in Sweden will agree that it's "en pokémon/pokémonen" and "ett e-mail/(e-)mailet". Somehow all Swedes agree on what the grammatical gender of those non-Swedish words is, and that we should apply Swedish grammar rules to them when used in a Swedish sentence.

Basically, you can't really assume a single locale ever.

Here's an example sentence from a chat of mine, about an hour ago: "konstig notis att få out of context", where the first third of the sentence is implied ("det där var en" is missing), the second third is Swedish and the last third is English. What locale should that chat use, when languages are mixed within single sentences? When chatting with bilingual and multilingual friends, about 25% of the words I write are in English, about 70% in Swedish and about 5% other.
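
A small Python sketch of the single-locale problem for uppercasing (an illustration, not something from the thread; `upper_for_locale` is a hypothetical helper that applies only the Turkic dotted-i rule and otherwise relies on Python's locale-independent `str.upper`). Whichever single locale is passed in, one half of a mixed-language string comes out wrong.

```python
def upper_for_locale(text: str, locale: str) -> str:
    """Hypothetical helper: uppercase with one Turkic tailoring, otherwise default rules."""
    if locale in ("tr", "az"):
        # Turkic languages uppercase dotted "i" to U+0130 (capital I with dot above).
        text = text.replace("i", "\u0130")
    return text.upper()


# Default rules are wrong for the Turkish word, Turkish rules are right.
assert upper_for_locale("istanbul", "en") == "ISTANBUL"
assert upper_for_locale("istanbul", "tr") == "\u0130STANBUL"

# But applying the Turkish locale to a whole mixed-language message
# also rewrites the English loan word.
assert upper_for_locale("wifi", "tr") == "W\u0130F\u0130"
```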

drathier (Oct 17 2023 at 21:24):

I feel like the only sensible way to treat text is to really lock it down to operations that for sure will work.

Richard Feldman (Oct 17 2023 at 21:24):

yeah that's my conclusion too :+1:

Richard Feldman (Oct 17 2023 at 21:24):

and then treat everything else as an advanced use case that requires going to a package outside the stdlib

drathier (Oct 17 2023 at 21:25):

But who's going to use a programming language where super basic stuff like String.length and String.startsWith are missing? :)

drathier (Oct 17 2023 at 21:27):

I feel like Erlang waited 14 years before adding strings to the language for a reason...

Richard Feldman (Oct 17 2023 at 21:52):

oh I think starts with is fine

Richard Feldman (Oct 17 2023 at 21:53):

er, well that's because I'm ok with not normalizing

Richard Feldman (Oct 17 2023 at 21:54):

(although I can totally see the argument for normalizing too - I just think the perf cost crosses a line, especially for equals and hashing)

drathier (Oct 18 2023 at 11:36):

wait, startsWith is only fine if you are normalizing, isn't it? You'd normalize when constructing strings, and keep the contents normalized as you go
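
The normalization question can be sketched in Python like this (an illustration using the standard `unicodedata` module; `nfc` and `starts_with_normalized` are hypothetical helpers and say nothing about how Roc's `Str` would behave). The same visible text can be stored as a precomposed code point or as a base letter plus a combining mark, and a code-point-level prefix check gives different answers for the two.

```python
import unicodedata

composed = "\u00e9cole"     # starts with precomposed U+00E9 (e with acute)
decomposed = "e\u0301cole"  # starts with "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization, the two spellings are different strings.
assert composed != decomposed
assert not decomposed.startswith("\u00e9")
assert decomposed.startswith("e")  # matches only the base letter of the accented character


def nfc(s: str) -> str:
    """Normalize to Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", s)


def starts_with_normalized(s: str, prefix: str) -> bool:
    """Hypothetical helper: prefix check after normalizing both sides to NFC."""
    return nfc(s).startswith(nfc(prefix))


# After normalization, both spellings agree.
assert nfc(decomposed) == composed
assert starts_with_normalized(decomposed, "\u00e9")
assert not starts_with_normalized(decomposed, "e")
```

Normalizing on every call is what incurs the cost mentioned above; normalizing once at construction, as suggested in the last message, would move that cost to string creation.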


Last updated: Jul 06 2025 at 12:14 UTC