Given that Roc mostly seems to prioritize Unicode correctness when it comes to strings, I was surprised to learn that string equality doesn't respect canonical equivalence. I was trying to find the reasoning for that from previous discussions, and these are the justifications I found:
Str docs say that for some programs, it might be important to tell canonically equivalent strings apart.
What if instead of normalizing before every comparison operation, normalization happened as soon as a Str was constructed, sort of like how UTF-8 is validated before converting to a Str? This document seems to have some information about efficiently normalizing and checking canonical equivalence. Apparently Swift uses the "FCC" algorithm described there.
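To make the "normalize at construction" idea concrete, here is a minimal sketch in Python rather than Roc. `NormalizedStr` is a hypothetical wrapper, not anything in Roc's standard library, and it uses NFC from Python's `unicodedata` module instead of the FCC form mentioned above, purely because NFC is what Python ships with.

```python
import unicodedata


class NormalizedStr:
    """Hypothetical wrapper: validates UTF-8 and normalizes once, at construction."""

    def __init__(self, raw: bytes):
        # bytes.decode raises UnicodeDecodeError on invalid UTF-8 (the validation step);
        # normalization then happens exactly once, here, instead of on every comparison.
        self.value = unicodedata.normalize("NFC", raw.decode("utf-8"))

    def __eq__(self, other):
        # Plain code point comparison is enough, because both sides are already NFC.
        return isinstance(other, NormalizedStr) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


precomposed = "\u00e9".encode("utf-8")  # é as a single code point
decomposed = "e\u0301".encode("utf-8")  # e followed by a combining acute accent

bytes_equal = precomposed == decomposed
wrapped_equal = NormalizedStr(precomposed) == NormalizedStr(decomposed)
```

The raw byte sequences differ, but the two wrapped values compare equal, which is the behavior canonical equivalence asks for.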
For reason 2, how often does it come up that it's important to distinguish canonically equivalent strings? I don't know the answer to that. If it's uncommon, maybe people could use List U8s in those circumstances?
To be clear, I'm not an expert on any of this, and maybe automatic normalization isn't worth it, but I just thought maybe it deserves a little more consideration. If nothing else, maybe the docs for startsWith, contains, etc. could include a warning that for maximum Unicode correctness, the strings should be normalized first.
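As a rough illustration of what such a warning would be guarding against, here is a small Python sketch (Python's `str.startswith` and `in` standing in for Roc's `startsWith` and `contains`; the `nfc` helper is just a name chosen for this example). The same visible text fails prefix and substring checks until both sides are normalized.

```python
import unicodedata

haystack = "caf\u00e9 au lait"  # "café" with a precomposed é (U+00E9)
needle = "cafe\u0301"           # "café" with e + combining acute (U+0301)

# Code point comparison: both checks fail even though the text looks identical.
raw_prefix = haystack.startswith(needle)
raw_contains = needle in haystack


def nfc(s: str) -> str:
    # Helper for this example: normalize to NFC before comparing.
    return unicodedata.normalize("NFC", s)


# After normalizing both sides, the checks succeed.
norm_prefix = nfc(haystack).startswith(nfc(needle))
norm_contains = nfc(needle) in nfc(haystack)
```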
I think it would have to be on every creation and modification, but yeah, should be doable
yeah it's a good point and it's a very tricky thing to balance
a thing that could definitely be surprising is that if you read some raw UTF-8 bytes in from somewhere, parse them into a Str, and then write them back out again without modifying them in any way, you might get different bytes back - which definitely seems like the type of thing that could cause subtle and extremely frustrating bugs :sweat_smile:
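A minimal Python sketch of that round-trip problem, assuming normalization to NFC happens at parse time (the variable names are only for illustration):

```python
import unicodedata

# Raw UTF-8 bytes read from somewhere: e + combining acute accent.
original = "e\u0301".encode("utf-8")  # b'e\xcc\x81'

# "Parse into a string" with normalization applied at construction.
text = unicodedata.normalize("NFC", original.decode("utf-8"))

# Write it back out without modifying it in any way.
written = text.encode("utf-8")  # b'\xc3\xa9'

round_trips = written == original
```

Here the input is 3 bytes and the output is 2, so anything keyed on the exact bytes (hashes, signatures, file names) would no longer match.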
True, that does seem like it could be super confusing
Last updated: Jul 06 2025 at 12:14 UTC