How does the team feel about having these functions in std:
Str.from_utf8 : List U8 -> Result Str { err : InvalidUnicodeErr, index : U64 }
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str { err : InvalidUnicodeErr, index : U64 }
Str.from_utf16_lossy : List U16 -> Str
The *_lossy functions should replace invalid chars with the replacement char
Note -- are you tracking https://github.com/roc-lang/roc/pull/7321
Str.fromUtf8 : List U8 -> Result Str [BadUtf8 { problem : Utf8ByteProblem, index : U64 }]
The tags unify nicer with other tag-based errors -- for when you're doing the "just pass it up the chain" thing
Oh yeah, the tag is good
So I'd add...
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { problem : Utf16ByteProblem, index : U64 }]
Yes
I'll be quite honest: I didn't try to make the error types useful
Thanks for thinking for me
I think if Anton was here, he'd ask for a PR to merge into that PR. He's planning on making a testing release I think
Instead of lossy, do we want with replacement? Then just expose the default replacement character?
Maybe that isn't valuable or worth it, just curious. Have seen that API before
If you're opting into quick and dirty... you want minimal friction
I'm just not sure what the replacement API would look like
I guess you can always use lossy and then call replace separately to change the replacement char
We definitely should expose the replacement char though
We don't have a char type, so what if they pass an invalid UTF-8 char?
I have to google it every time
"\u(FFFD)"
Also, might as well add in utf32 while we're here?
Str.from_utf8_with_replacement : List U8, { replacement_char ? Str } -> Str
Luke Boswell said:
If you're opt-ing into quick and dirty... you want minimal friction
Very true. I'm happy with just lossy.
What if they call Str.from_utf8_with_replacement(bytes, "not-a-char")
Yeah, let's just leave it as lossy and let them separately call Str.replace to change the character
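That two-step pattern can be sketched in Python (the helper name is hypothetical; note the usual caveat that a U+FFFD already present in the input gets rewritten too):

```python
# Hypothetical sketch: decode lossily, then swap the default
# replacement character U+FFFD for a caller-chosen string.
def from_utf8_with_replacement(data: bytes, replacement: str = "\ufffd") -> str:
    # Caveat: a U+FFFD that was already in the input is replaced as well.
    return data.decode("utf-8", errors="replace").replace("\ufffd", replacement)

print(from_utf8_with_replacement(b"a\xffb", "?"))  # a?b
```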
One thought I have (unrelated to design stuff, more scheduling): is any of this going to block our current upgrade? I'd like to land the new PI basic-cli this week. I think we can just make Arg : [Unix (List U8), Windows (List U16)] and land the Weaver and these builtin upgrades later.
If we just make an issue for these Lossy strings additions, we can track it.
So:
Str.from_utf8 : List U8 -> Result Str [BadUtf8 { err : Utf8ByteProblem, index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { err : Utf16ByteProblem, index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [BadUtf32 { err : Utf32ByteProblem, index : U64 }]
Str.from_utf32_lossy : List U32 -> Str
Potentially merging the error tags if it makes sense
Let's do Arg, and I'll try to get a PR for these functions by tomorrow. If I can't do it in time, we can do it later
Adding new functions isn't breaking
I feel like we should drop “Byte” from the utf16/32 error tags. It’s not just one byte for those :smiley:
Works for me!
Agus Zubiaga said:
I feel like we should drop “Byte” from the utf16/32 error tags. It’s not just one byte for those :smiley:
Can we drop it off all tags?
You have a list of u8s and an index
I think anyone can figure out that it's a problem with a specific byte
Though I'm still in the camp that if possible, we should just have a single UnicodeError
Or UtfError
(deleted)
For naming, I'd prefer Err [InvalidUtf*] over Err [Utf*Error] (whether it's 8/16/generic) - Bad is fine too.
Good call
https://github.com/roc-lang/roc/issues/7390
If someone could validate that issue, that'd be great
Luke and I will be getting the APIs in place for basic-cli and Weaver, respectively
Which means creating an Arg := [Unix (List U8), Windows (List U16)] type and just crashing on Windows for now.
Once these are implemented, it should be a simple change to properly support 16-bit encoded strings in basic-cli and Weaver
Why the redundancy of Bad+Invalid in [BadUtf8 { err : InvalidUtf8, index : U64 }]? Is [InvalidUtf8 { index : U64 }] sufficient? (Happy to Q&A in GH thread, if you'd prefer.)
I also prefer Zulip, it's more back and forth
The error holds info about why the UTF was encoded incorrectly
It's a tag union
Ohhh I forget those sneaky invisible payloads exist, thanks.
When I mentioned naming above, I was ignorant to these tag union(s) already existing, and they seem fine as-is. Does this issue intend to refactor this existing pattern from Problem+ByteProblem to Bad+Invalid? (Sorry if I'm bikeshedding this away from implementation concerns.)
Utf8ByteProblem : [
InvalidStartByte,
UnexpectedEndOfSequence,
ExpectedContinuation,
OverlongEncoding,
CodepointTooLarge,
EncodesSurrogateHalf,
]
Utf8Problem : { byteIndex : U64, problem : Utf8ByteProblem }
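For reference, here are byte sequences that trigger most of those variants, checked with Python's strict UTF-8 decoder (Python's error reason strings don't map one-to-one onto the Roc variant names, but each input is rejected):

```python
# Ill-formed inputs corresponding to several Utf8ByteProblem variants.
cases = [
    ("InvalidStartByte", b"\xff"),              # 0xFF is never a start byte
    ("UnexpectedEndOfSequence", b"\xe2\x82"),   # 3-byte sequence cut short
    ("ExpectedContinuation", b"\xe2\x41\x41"),  # 0x41 is not a continuation
    ("OverlongEncoding", b"\xc0\xaf"),          # '/' padded to two bytes
    ("EncodesSurrogateHalf", b"\xed\xa0\x80"),  # lone surrogate U+D800
]
for name, data in cases:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        print(f"{name}: {e.reason}")
```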
That PR you linked seems to want [BadUtf* { problem : Utf*ByteProblem, index : U64 }]
If #7390 intends to be ambivalent on Err structure, then please ignore everything I've said above.
Oh yeah let's remove one level of tag nesting
It isn't needed
Directly return a Result Str { byteIndex : U64, problem : Utf8ByteProblem }
@shua -- it's not urgent or anything... but would you be interested in tackling this?
Here's the tracking issue from Sam https://github.com/roc-lang/roc/issues/7390
I can pick this up. Should I merge it into https://github.com/roc-lang/roc/pull/7321 or a separate PR?
Also, would the preference be for 3 separate tag sets (Utf8ByteProblem, Utf16ByteProblem, and Utf32ByteProblem) or should they all just be UtfDecodingProblem even if some variants aren't possible for utf16/utf32?
I think it's ok to merge it into your PR
shua said:
Also, would the preference be for 3 separate tag sets (Utf8ByteProblem, Utf16ByteProblem, and Utf32ByteProblem) or should they all just be UtfDecodingProblem even if some variants aren't possible for utf16/utf32?
I'm not quite following this...
Is this in the Issue?
Str.from_utf8 : List U8 -> Result Str [BadUtf8 { err : InvalidUtf8, index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { err : InvalidUtf16, index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [BadUtf32 { err : InvalidUtf32, index : U64 }]
Str.from_utf32_lossy : List U32 -> Str
Ohk, maybe we should update the issue -- nvm
I think I see, the tag union is inside the record right?
I'm not sure we need all the different error tags. Just InvalidUtf8 would be ok, wouldn't it?
If possible, should only be a single tag
InvalidUnicode probably
So like this?
Str.from_utf8 : List U8 -> Result Str [InvalidUnicode { err : [BadUtf8], index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [InvalidUnicode { err : [BadUtf16], index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [InvalidUnicode { err : [BadUtf32], index : U64 }]
Str.from_utf32_lossy : List U32 -> Str
Or maybe
Str.from_utf8 : List U8 -> Result Str [InvalidUtf8 { index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [InvalidUtf16 { index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [InvalidUtf32 { index : U64 }]
Str.from_utf32_lossy : List U32 -> Str
Or
UnicodeErr : [
InvalidUtf8 U64,
InvalidUtf16 U64,
InvalidUtf32 U64,
]
Str.from_utf8 : List U8 -> Result Str UnicodeErr
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str UnicodeErr
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str UnicodeErr
Str.from_utf32_lossy : List U32 -> Str
I think like this:
UnicodeProblem : [
InvalidStartByte,
UnexpectedEndOfSequence,
ExpectedContinuation,
OverlongEncoding,
CodepointTooLarge,
EncodesSurrogateHalf,
]
Str.from_utf8 : List U8 -> Result Str [BadUtf8 { index : U64, problem : UnicodeProblem }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { index : U64, problem : UnicodeProblem }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [BadUtf32 { index : U64, problem : UnicodeProblem }]
Str.from_utf32_lossy : List U32 -> Str
This all looks good to me, assuming we're okay with providing a superset of the errors we see for all Unicode variants.
I'm not sure what types of errors we see for UTF-8 vs 16 vs 32
It's probably better to provide the actual set of errors per encoding instead of just a single union
If we go with a single union, we should maybe aim for naming with respect to codepoints instead of just bytes? Since UTF-16 and 32 don't process bytes
My thought is: if the error set mostly overlaps, then just merge it, if not, then add separate tag unions.
So that is my default to try
If it doesn't work in practice due to disjoint errors, then make Utf8Problem, Utf16Problem, and Utf32Problem
I don't feel strongly in opposition, though I do think it's better for API users to get UTFXXProblem. Either works for me
Looking at the zig standard library, errors look to be disjoint
So I think we will have a separate Utf8Problem, Utf16Problem and Utf32Problem
Yep
Also, it sounds like we actually need to support wtf-8 and wtf-16 for windows paths.
zig does support this, but you have to explicitly tell it how to handle the utf-16 it is given
https://ziglang.org/documentation/0.13.0/std/#std.unicode.Surrogates
WTF is very appropriate
An infinite hole, Windows is
Basically, wtf is for old utf-16 that is not technically valid modern utf-16 (like windows paths and js strings, apparently)
probably Str.from_utf16 should just take an extra arg and forward that to zig
I'm sure having UTF-16 is good, but it seems like we only need WTF-16 for now
Yeah, I have no idea where you would run into valid modern utf-16.
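Concretely, an unpaired surrogate is what separates the two. A Python sketch (its `surrogatepass` handler roughly models a WTF-16 decoder here):

```python
# 0xD800 alone is ill-formed UTF-16 but legal WTF-16.
lone = b"\x00\xd8"  # U+D800 as little-endian UTF-16
try:
    lone.decode("utf-16-le")  # strict UTF-16 rejects it
except UnicodeDecodeError:
    print("strict UTF-16: rejected")
s = lone.decode("utf-16-le", errors="surrogatepass")  # WTF-16-ish: accepted
print(len(s))  # 1
```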
Well, do you know how this will affect OsArg := [Unix (List U8), Windows (List U16)]? Seems like we'll only have UTF-8 and WTF-16
Also, we can still implement UTF-16/32 in the stdlib, they won't hurt anything
correct. We will only have UTF-8 and WTF-16
Okay, great
Reading up on this more, it sounds like using WTF-16 for all UTF-16 parsing is valid (and likely required in many cases due to legacy). It just loses some performance due to adding extra checks for unpaired surrogates. So I think we should make Str.from_utf16, but under the hood, it will just parse WTF-16. According to wikipedia, most utf-16 decoders do this.
I'm not sure the perf cost, but it sounds like many systems require it in general.
We could also add Str.from_wtf16 and just alias Str.from_utf16 in that case
If someone wants to parse WTF-16, it'd be good to not need to ask in Zulip or read what could be the 3rd docs paragraph of Str.parse_utf16
Yeah, sounds good
eh I think just doing utf16 is fine, and then document that it's actually wtf-16
I think the likelihood that perf is a problem here is low, and I wouldn't be surprised if people chose the wrong one
leading to things almost always working, but then in super rare scenarios not working right :sweat_smile:
as in, they choose utf-16 not realizing they need wtf-16
(and perhaps not knowing wtf-16 exists!)
edit: below is incorrect, we want the wrapping BadUtf8 tags
Merging suggestions from above on error api:
Brendan Hansknecht said:
Looking at the zig standard library, errors look to be disjoint
means we want distinct Utf8Problem, Utf16Problem and Utf32Problem tagsets, and
Brendan Hansknecht said:
Oh yeah let's remove one level of tag nesting
indicates we can remove the wrapping BadUtf8 etc tag, leading to the following api:
Utf8Problem : [ ... ]
Utf16Problem : [ ... ]
Utf32Problem : [ ... ]
Str.from_utf8 : List U8 -> Result Str { index : U64, problem : Utf8Problem }
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str { index : U64, problem : Utf16Problem }
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str { index : U64, problem : Utf32Problem }
Str.from_utf32_lossy : List U32 -> Str
I think we still want a tag union, because it merges nicely with other errors when you pass them up from a callsite within a function returning Result _ []err
If it's a record, then it's an extra step to wrap it in a tag
And to be clear: we want from_utf16 to actually accept wtf-16, but from_utf8 should still only accept utf-8, not wtf-8?
yes
I think that is correct.
wtf-8 seems to be exceptionally rare from what I can tell. So it is reasonable to just consider it malformed
wtf-16 seems to be the default for many apis on the other hand.
yeah, that matches my understanding as well
coming back to wtf-8, what is expected for

when Str.fromUtf16 [0xd800] is
    Err _ -> "a"
    Ok s -> when Str.fromUtf8 (Str.toUtf8 s) is
        Err _ -> "b"
        Ok _ -> "c"
a. Str.fromUtf16 will implement strict utf-16 (ie not wtf-16) and fail with an error about an unpaired surrogate
b. Str.fromUtf16 will implement wtf-16, but Str.fromUtf8 will implement strict utf-8 (ie not wtf-8), and return an error about encoding a surrogate pair
c. Str.fromUtf16 implements wtf-16, and Str.fromUtf8 implements (possibly a subset of) wtf-8, both accepting unpaired surrogate pair codepoints
the agreement before (to make from_utf16 actually be from_wtf16 but leave from_utf8 as-is) would imply "b", which leads to Str.fromUtf8 and Str.toUtf8 not being able to roundtrip
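Option "b" can be reproduced in Python terms (`surrogatepass` standing in for the WTF-16 decode):

```python
# Decode a lone surrogate the WTF-16 way, then try to re-encode the
# result as strict UTF-8: the roundtrip fails at the encode step.
s = b"\x00\xd8".decode("utf-16-le", errors="surrogatepass")  # wtf-16-style decode
try:
    s.encode("utf-8")  # strict UTF-8 encode, like a strict toUtf8
    print("roundtrip ok")
except UnicodeEncodeError:
    print("lone surrogate cannot be encoded as strict UTF-8")
```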
we could make Str.toUtf8 return a Result (List U8) _. It seems unfortunate, but if you allow Str to contain surrogate pair codepoints, then it cannot be encoded to standards-compliant utf-8/16/32 afaict.
I'm leaning towards "c". I have never cared about whether my unicode encoding/decoding libraries checked for unpaired surrogate codepoints. I would mentally file it in the same category as unpaired combining characters in the input string.
but maybe someone else has stronger opinions?
I think the bar should be super high for Str to be encoded as anything other than unmodified standard utf-8, because a ton of hosts will naively convert utf8 strings to roc Strs
and if we have a slightly different representation, it sounds like a major UB footgun, not to mention a potential performance problem where we have to check every utf8 string the host wants to send in, to make sure it doesn't contain any edge cases
I guess going from valid utf-8 to roc Strs should remain the same. wtf-8 is a superset of utf-8 which allows encoding more codepoints than utf-8.
I think we should consider our strings to be strictly utf-8 and not wtf-8
When converting wtf-16 to utf-8, I assume we need to cleanup the surrogate pairs to make it strict utf-8
So we parse wtf-16, but canonicalize to cleanly convert into uft-8. If we can't canonicalize, we fail.
Yeah, specifically _unpaired_ surrogate codepoints are an issue. We could replace them with unicode replacement character '�' or we could remove them.
I think the motivating reason to accept wtf-16 is windows paths which can contain unpaired surrogate codepoints. If we replace or remove things in a path string, will windows still recognize it?
I thought the point of wtf-16 is that it correctly knows how to convert unpaired surrogates into the correct Unicode code point.
Because an unpaired surrogate means that it is actually the old ucs-2 format
ah, no, not as far as I understand. wtf-16 is "potentially ill-formed utf-16".
I think at least in the document, "ucs-2" refers to a unicode encoding which was defined _before_ unicode codepoints exceeded 0xFFFF so everything fit in 16bits and surrogate pairs were not defined nor necessary. ucs-2 and utf-16 differ from codepoints 0xD800 to 0x110000, as ucs-2 can't encode anything higher than 0xFFFF and utf-16 has those ill-formed constraints around 0xD800 to 0xDFFF.
WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.
Brendan Hansknecht said:
I thought the point of wtf-16 is that it correctly knows how to convert unpaired surrogates into the correct Unicode code point.
yes, but encoding those codepoints as utf-8 is disallowed by the official standard
The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
Thus, the solution is to use a not-as-strict encoding which is the same as utf-8, except it allows encoding those codepoints. This not-as-strict encoding is wtf-8
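Python can show the difference: its `surrogatepass` handler emits exactly the WTF-8 byte sequence for a surrogate, which the strict UTF-8 decoder then refuses.

```python
# WTF-8: same as UTF-8 except surrogate code points may be encoded.
wtf8 = "\ud800".encode("utf-8", errors="surrogatepass")
print(wtf8)  # b'\xed\xa0\x80'
try:
    wtf8.decode("utf-8")  # strict UTF-8 rejects those three bytes
except UnicodeDecodeError:
    print("strict UTF-8: rejected")
```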
Hmm. I guess I don't fully understand wtf-16. Anyway, I would guess we want to keep strings as fully valid utf-8. So when converting from wtf-16, we would have one form of the function that uses replacement characters as necessary and another that just fails.
Okay, I think that's what I'm doing currently. fromUtf16 and fromUtf8 will fail if the input is not utf-16 or utf-8 respectively, while fromUtf16Lossy and fromUtf8Lossy accept a superset of utf-16 and utf-8, replacing ill-formed sequences, or codepoints which cannot be encoded as utf-8 (ie surrogates), with the unicode replacement character.
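A rough Python model of that fromUtf16Lossy behavior: decode as WTF-16, then replace any unpaired surrogate with U+FFFD so the result is strictly valid (sketch only; assumes little-endian input of even length).

```python
def from_utf16_lossy(data: bytes) -> str:
    # Accept WTF-16 (lone surrogates allowed). Paired surrogates are
    # combined into astral characters by the codec, so any surrogate
    # code point left in the result was unpaired: replace with U+FFFD.
    s = data.decode("utf-16-le", errors="surrogatepass")
    return "".join("\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s)

print(from_utf16_lossy(b"\x00\xd8a\x00"))  # �a
```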
This does mean that roc will not be able to work with host paths as Str but would rather have to work with List U8 and fallibly convert them to Str.
Yeah, I think a host path has to be a List U8 or a List U16, I guess.
oh the joys of standards compliance
host paths aren't valid anything :laughing:
UNIX allows anything in paths other than 0 bytes, and ASCII forward slashes mean directory separators
the spec doesn't even have anything to do with Unicode :sweat_smile:
Windows is similar except it's like they also ban bytes under 32 or something like that
they're a valid nuisance is what they are :wink:
I think the WASI filesystem spec uses string as path values, which must be valid unicode scalar values, and they just accept that some (pathologically-named) files will be unreachable via WASI. So if WASI is considered as a host, that's at least one.
edit: adding link to wasi:filesystem spec
interesting - we might want to add WASI to roc-lang/path
just in case someone's following this thread: https://github.com/roc-lang/roc/pull/7514 has been posted and approved, but is currently blocked on some interesting CI failures #bugs > mono mismatch between mac aarch64 and rest of targets
Last updated: Jul 06 2025 at 12:14 UTC