Stream: API design

Topic: String normalization


Ajai Nelson (Jun 18 2024 at 08:54):

Given that Roc mostly seems to prioritize Unicode correctness when it comes to strings, I was surprised to learn that string equality doesn't respect canonical equivalence. I was trying to find the reasoning for that from previous discussions, and these are the justifications I found:

  1. This doc says that normalizing 'on every single "starts with" or "equals" operation' (and also every "hash code" operation) would be expensive and that it's probably better to do normalization in one pass if desired.
  2. The Str docs say that for some programs, it might be important to tell canonically equivalent strings apart.

What if instead of normalizing before every comparison operation, normalization happened as soon as a Str was constructed, sort of like how UTF-8 is validated before converting to a Str? This document seems to have some information about efficiently normalizing and checking canonical equivalence. Apparently Swift uses the "FCC" algorithm described there.

For reason 2, how often does it come up that it's important to distinguish canonically equivalent strings? I don't know the answer to that. If it's uncommon, maybe people could use List U8s in those circumstances?

To be clear, I'm not an expert on any of this, and maybe automatic normalization isn't worth it, but I just thought maybe it deserves a little more consideration. If nothing else, maybe the docs for startsWith, contains, etc. could include a warning that for maximum Unicode correctness, the strings should be normalized first.
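A minimal sketch of the "normalize once at construction" idea, written in Python rather than Roc because Python's standard `unicodedata` module has a normalization call that can be relied on here. The `make_str` function is hypothetical and is not part of any Roc API, and it uses NFC instead of the FCC form mentioned above, since Python's standard library does not provide FCC.

```python
import unicodedata

def make_str(raw: bytes) -> str:
    """Hypothetical constructor: validate UTF-8, then normalize once to NFC."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return unicodedata.normalize("NFC", text)

# Two canonically equivalent encodings of "é"
composed = "\u00e9".encode("utf-8")     # single code point U+00E9
decomposed = "e\u0301".encode("utf-8")  # "e" followed by combining acute accent

# Without normalization, equality compares code points and the two differ
plain_equal = composed.decode("utf-8") == decomposed.decode("utf-8")

# With normalization at construction, later operations need no extra work
normalized_equal = make_str(composed) == make_str(decomposed)
same_hash = hash(make_str(composed)) == hash(make_str(decomposed))
starts = make_str("e\u0301cole".encode("utf-8")).startswith(make_str(composed))

print(plain_equal, normalized_equal, same_hash, starts)
```

Here the normalization cost is paid once per string, so equality, hashing, and prefix checks stay plain comparisons over the stored form.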

Brendan Hansknecht (Jun 18 2024 at 15:03):

I think it would have to be on every creation and modification, but yeah, should be doable
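A small Python illustration of why modification matters too, assuming NFC as the stored form: two strings that are each normalized can concatenate into one that is not. The `is_nfc` helper is a name made up for this sketch.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """Helper for this sketch: True if s is already in NFC."""
    return unicodedata.normalize("NFC", s) == s

left = "e"        # already NFC
right = "\u0301"  # a lone combining acute accent, also already NFC

joined = left + right  # "e" + combining accent is no longer NFC

left_ok = is_nfc(left)
right_ok = is_nfc(right)
joined_ok = is_nfc(joined)

# Renormalizing after the concatenation restores the invariant
fixed = unicodedata.normalize("NFC", joined)

print(left_ok, right_ok, joined_ok, is_nfc(fixed))
```

So an always-normalized string type would need to renormalize (at least around the join point) after operations like concatenation, not only when a string is first created.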

Richard Feldman (Jun 18 2024 at 21:55):

yeah it's a good point and it's a very tricky thing to balance

Richard Feldman (Jun 18 2024 at 21:55):

a thing that could definitely be surprising is that if you read some raw UTF-8 bytes in from somewhere, parse them into a Str, and then write them back out again without modifying them in any way, you might get different bytes back - which definitely seems like the type of thing that could cause subtle and extremely frustrating bugs :sweat_smile:
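The round-trip surprise can be sketched in Python, assuming a string type that normalizes to NFC when it is built from bytes. The `round_trip` function is hypothetical and only stands in for "parse into a Str, then write it back out".

```python
import unicodedata

def round_trip(raw: bytes) -> bytes:
    """Hypothetical: parse bytes into a normalizing string, then write it back."""
    text = unicodedata.normalize("NFC", raw.decode("utf-8"))
    return text.encode("utf-8")

original = b"e\xcc\x81"  # UTF-8 for "e" + combining acute accent (3 bytes)
written = round_trip(original)

# The text was never modified by the program, yet the bytes differ
changed = written != original

print(original, written, changed)
```

Anything that depends on the exact bytes, such as checksums, signatures, or file names on some file systems, could then break even though the program did not touch the text.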

Ajai Nelson (Jun 18 2024 at 23:13):

True, that does seem like it could be super confusing


Last updated: Jul 06 2025 at 12:14 UTC