Stream: API design

Topic: Lossy unicode conversion builtins


view this post on Zulip Sam Mohr (Dec 17 2024 at 22:24):

How does the team feel about having these functions in std:

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:25):

The *_lossy functions should replace invalid chars with the replacement char

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:31):

Note -- are you tracking https://github.com/roc-lang/roc/pull/7321?

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:31):

Str.fromUtf8 : List U8 -> Result Str [BadUtf8 { problem : Utf8ByteProblem, index : U64 }]

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:32):

The tags unify more nicely with other tag-based errors -- for when you're doing the "just pass it up the chain" thing

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:33):

Oh yeah, the tag is good

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:34):

So I'd add...

Str.from_utf16 : List U16 -> Result Str [BadUtf16 { problem : Utf16ByteProblem, index : U64 }]

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:34):

Yes

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:34):

I'll be quite honest: I didn't try to make the error types useful

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:35):

Thanks for thinking for me

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:35):

I think if Anton was here, he'd ask for a PR to merge into that PR. He's planning on making a testing release I think

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:35):

Instead of lossy, do we want with replacement? Then just expose the default replacement character?

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:36):

Maybe that isn't valuable or worth it, just curious. Have seen that API before

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:36):

If you're opting into quick and dirty... you want minimal friction

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:36):

I'm just not sure what the replacement API would look like

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:36):

I guess you can always use lossy and then call replace separately to change the replacement char

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:36):

We definitely should expose the replacement char though

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:36):

We don't have a char type, so what if they pass an invalid UTF-8 char?

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:36):

I have to google it every time

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:37):

"\u(FFFD)"

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:37):

Also, might as well add in utf32 while we're here?

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:38):

Str.from_utf8_with_replacement : List U8, { replacement_char ? Str } -> Str

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:38):

Luke Boswell said:

If you're opting into quick and dirty... you want minimal friction

Very true. I'm happy with just lossy.

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:38):

What if they call Str.from_utf8_with_replacement(bytes, "not-a-char")

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:39):

Yeah, let's just leave it as lossy and let them call Str.replace separately to change the character
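
The "lossy, then replace" approach settled on here can be sketched in Python as a two-step composition. Both function names are hypothetical; Python's `errors="replace"` and `str.replace` stand in for the proposed lossy builtin and `Str.replace`.

```python
def from_utf8_lossy(data: bytes) -> str:
    """Hypothetical stand-in for the proposed lossy builtin."""
    return data.decode("utf-8", errors="replace")


def decode_with_replacement(data: bytes, replacement: str) -> str:
    # Step 1: lossy decode, which inserts U+FFFD for each bad sequence.
    # Step 2: swap every U+FFFD for whatever text the caller prefers.
    return from_utf8_lossy(data).replace("\ufffd", replacement)


print(decode_with_replacement(b"a\xffb", "?"))
```

One trade-off of composing the two steps: a U+FFFD that was legitimately present in the input is also swapped, because after decoding it is indistinguishable from one produced by a bad byte.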

view this post on Zulip Luke Boswell (Dec 17 2024 at 22:41):

One thought I have (unrelated to design stuff, more scheduling): is any of this going to block our current upgrade? I'd like to land the new PI basic-cli this week. I think we can just make Arg : [Unix (List U8), Windows (List U16)] and land the Weaver and these builtin upgrades later.

If we just make an issue for these Lossy strings additions, we can track it.

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:41):

So:

Str.from_utf8 : List U8 -> Result Str [BadUtf8 { err : Utf8ByteProblem, index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { err : Utf16ByteProblem, index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [BadUtf32 { err : Utf32ByteProblem, index : U64 }]
Str.from_utf32_lossy : List U32 -> Str
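
For comparison with the lossy variants, here is a Python sketch of what a strict decoder returning a tagged error with an index could look like. The tuple-based `("Ok", ...)` / `("Err", ...)` shape and the `BadUtf8` payload are only an imitation of the signatures above, and the `problem` value is Python's own error message rather than one of Roc's problem tags.

```python
def from_utf8(data: bytes):
    """Hypothetical strict decoder: either ("Ok", text) or ("Err", tagged payload)."""
    try:
        return ("Ok", data.decode("utf-8"))
    except UnicodeDecodeError as e:
        # e.start is the byte offset of the first undecodable byte,
        # e.reason is a short human-readable description of the failure.
        return ("Err", ("BadUtf8", {"problem": e.reason, "index": e.start}))


print(from_utf8(b"hi"))
print(from_utf8(b"ab\xffcd"))
```

Keeping the index in the error lets a caller point at the exact offending position in the input, which the lossy variants deliberately throw away.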

view this post on Zulip Brendan Hansknecht (Dec 17 2024 at 22:43):

Potentially merging the error tags, if it makes sense

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:43):

Let's do Arg, and I'll try to get a PR for these functions by tomorrow. If I can't do it in time, we can do it later

view this post on Zulip Sam Mohr (Dec 17 2024 at 22:43):

Adding new functions isn't breaking

view this post on Zulip Agus Zubiaga (Dec 18 2024 at 03:59):

I feel like we should drop “Byte” from the utf16/32 error tags. It’s not just one byte for those :smiley:

view this post on Zulip Sam Mohr (Dec 18 2024 at 05:05):

Works for me!

view this post on Zulip Brendan Hansknecht (Dec 18 2024 at 17:20):

Agus Zubiaga said:

I feel like we should drop “Byte” from the utf16/32 error tags. It’s not just one byte for those :smiley:

Can we drop it off all tags?

view this post on Zulip Brendan Hansknecht (Dec 18 2024 at 17:21):

You have a list of u8s and an index

view this post on Zulip Brendan Hansknecht (Dec 18 2024 at 17:21):

I think anyone can figure out that is a problem with a specific byte

view this post on Zulip Brendan Hansknecht (Dec 18 2024 at 17:21):

Though I'm still in the camp that, if possible, we should just have a single UnicodeError

view this post on Zulip Brendan Hansknecht (Dec 18 2024 at 17:22):

Or UtfError

view this post on Zulip jan kili (Dec 18 2024 at 17:31):

For naming, I'd prefer Err [InvalidUtf*] over Err [Utf*Error] (whether it's 8/16/generic) - Bad is fine too.

view this post on Zulip Brendan Hansknecht (Dec 18 2024 at 18:12):

Good call

view this post on Zulip Sam Mohr (Dec 19 2024 at 06:48):

https://github.com/roc-lang/roc/issues/7390

view this post on Zulip Sam Mohr (Dec 19 2024 at 06:48):

If someone could validate that issue, that'd be great

view this post on Zulip Sam Mohr (Dec 19 2024 at 06:48):

Luke and I will be getting the APIs in place for basic-cli and Weaver, respectively

view this post on Zulip Sam Mohr (Dec 19 2024 at 06:49):

Which means creating an Arg := [Unix (List U8), Windows (List U16)] type and just crashing on Windows for now. Once these are implemented, it should be a simple change to properly support 16-bit encoded strings in basic-cli and Weaver

view this post on Zulip jan kili (Dec 19 2024 at 07:20):

Why the redundancy of Bad+Invalid in [BadUtf8 { err : InvalidUtf8, index : U64 }]?

view this post on Zulip jan kili (Dec 19 2024 at 07:25):

Is [InvalidUtf8 { index : U64 }] sufficient? (Happy to Q&A in GH thread, if you'd prefer.)

view this post on Zulip Sam Mohr (Dec 19 2024 at 07:27):

I also prefer Zulip, it's more back and forth

view this post on Zulip Sam Mohr (Dec 19 2024 at 07:27):

The error holds info about why the UTF was encoded incorrectly

view this post on Zulip Sam Mohr (Dec 19 2024 at 07:27):

It's a tag union

view this post on Zulip jan kili (Dec 19 2024 at 07:28):

Ohhh I forget those sneaky invisible payloads exist, thanks.

view this post on Zulip jan kili (Dec 19 2024 at 07:31):

When I mentioned naming above, I was ignorant of these tag union(s) already existing, and they seem fine as-is. Does this issue intend to refactor this existing pattern from Problem+ByteProblem to Bad+Invalid? (Sorry if I'm bikeshedding this away from implementation concerns.)

Utf8ByteProblem : [
    InvalidStartByte,
    UnexpectedEndOfSequence,
    ExpectedContinuation,
    OverlongEncoding,
    CodepointTooLarge,
    EncodesSurrogateHalf,
]

Utf8Problem : { byteIndex : U64, problem : Utf8ByteProblem }
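
To make the six problem tags more concrete, the Python sketch below pairs each tag with one byte sequence that plausibly triggers it. The pairing is an assumption based on the tag names, not taken from the Roc source. The check only confirms that a strict UTF-8 decoder (Python's, as a stand-in) rejects every sample.

```python
# One ill-formed sample per tag; the tag-to-sample mapping is an assumption.
SAMPLES = {
    "InvalidStartByte": b"\x80",               # continuation byte with no lead byte
    "UnexpectedEndOfSequence": b"\xe2\x82",    # 3-byte sequence cut off after 2 bytes
    "ExpectedContinuation": b"\xe2\x28\xa1",   # second byte is not 10xxxxxx
    "OverlongEncoding": b"\xc0\x80",           # 2-byte form of U+0000
    "CodepointTooLarge": b"\xf4\x90\x80\x80",  # would decode to U+110000
    "EncodesSurrogateHalf": b"\xed\xa0\x80",   # would decode to U+D800
}


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for tag, sample in SAMPLES.items():
    print(tag, sample.hex(), is_valid_utf8(sample))
```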

view this post on Zulip jan kili (Dec 19 2024 at 07:42):

That PR you linked seems to want [BadUtf* { problem : Utf*ByteProblem, index : U64 }]

view this post on Zulip jan kili (Dec 19 2024 at 07:44):

If #7390 intends to be ambivalent on Err structure, then please ignore everything I've said above.

view this post on Zulip Brendan Hansknecht (Dec 19 2024 at 15:16):

Oh yeah let's remove one level of tag nesting

view this post on Zulip Brendan Hansknecht (Dec 19 2024 at 15:17):

It isn't needed

view this post on Zulip Brendan Hansknecht (Dec 19 2024 at 15:18):

Directly return a Result Str { byteIndex : U64, problem : Utf8ByteProblem }

view this post on Zulip Luke Boswell (Dec 19 2024 at 22:34):

@shua -- it's not urgent or anything... but would you be interested in tackling this?

Here's the tracking issue from Sam https://github.com/roc-lang/roc/issues/7390

view this post on Zulip shua (Dec 29 2024 at 00:24):

I can pick this up. Should I merge it into https://github.com/roc-lang/roc/pull/7321 or a separate PR?

view this post on Zulip shua (Dec 29 2024 at 00:24):

Also, would the preference be for 3 separate tag sets (Utf8ByteProblem, Utf16ByteProblem, and Utf32ByteProblem) or should they all just be UtfDecodingProblem even if some variants aren't possible for utf16/utf32?

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:31):

I think it's ok to merge it into your PR

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:32):

shua said:

Also, would the preference be for 3 separate tag sets (Utf8ByteProblem, Utf16ByteProblem, and Utf32ByteProblem) or should they all just be UtfDecodingProblem even if some variants aren't possible for utf16/utf32?

I'm not quite following this...

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:32):

Is this in the Issue?

Str.from_utf8 : List U8 -> Result Str [BadUtf8 { err : InvalidUtf8, index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { err : InvalidUtf16, index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [BadUtf32 { err : InvalidUtf32, index : U64 }]
Str.from_utf32_lossy : List U32 -> Str

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:32):

Ohk, maybe we should update the issue -- nvm

I think I see, the tag union is inside the record right?

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:35):

I'm not sure we need all the different error tags. Just InvalidUtf8 would be ok wouldn't it?

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 00:37):

If possible, should only be a single tag

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 00:37):

InvalidUnicode probably

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:39):

So like this?

Str.from_utf8 : List U8 -> Result Str [InvalidUnicode { err : [BadUtf8], index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [InvalidUnicode { err : [BadUtf16], index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [InvalidUnicode { err : [BadUtf32], index : U64 }]
Str.from_utf32_lossy : List U32 -> Str

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:41):

Or maybe

Str.from_utf8 : List U8 -> Result Str [InvalidUtf8 { index : U64 }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [InvalidUtf16 { index : U64 }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [InvalidUtf32 { index : U64 }]
Str.from_utf32_lossy : List U32 -> Str

view this post on Zulip Luke Boswell (Dec 29 2024 at 00:42):

Or

UnicodeErr : [
    InvalidUtf8 U64,
    InvalidUtf16 U64,
    InvalidUtf32 U64,
]

Str.from_utf8 : List U8 -> Result Str UnicodeErr
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str UnicodeErr
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str UnicodeErr
Str.from_utf32_lossy : List U32 -> Str

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 00:46):

I think like this:

UnicodeProblem : [
    InvalidStartByte,
    UnexpectedEndOfSequence,
    ExpectedContinuation,
    OverlongEncoding,
    CodepointTooLarge,
    EncodesSurrogateHalf,
]

Str.from_utf8 : List U8 -> Result Str [BadUtf8 { index : U64, problem : UnicodeProblem }]
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str [BadUtf16 { index : U64, problem : UnicodeProblem }]
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str [BadUtf32 { index : U64, problem : UnicodeProblem }]
Str.from_utf32_lossy : List U32 -> Str

view this post on Zulip Sam Mohr (Dec 29 2024 at 00:50):

This all looks good to me, assuming we're okay with providing a superset of the errors we see for all Unicode variants.

view this post on Zulip Sam Mohr (Dec 29 2024 at 00:50):

I'm not sure what types of errors we see for UTF-8 vs 16 vs 32

view this post on Zulip Sam Mohr (Dec 29 2024 at 00:51):

It's probably better to provide the actual set of errors per encoding instead of just a single union

view this post on Zulip Sam Mohr (Dec 29 2024 at 00:52):

If we go with a single union, we should maybe aim for naming with respect to codepoints instead of just bytes? Since UTF-16 and 32 don't process bytes

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 00:52):

My thought is: if the error set mostly overlaps, then just merge it, if not, then add separate tag unions.

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 00:52):

So that is my default to try

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 00:53):

If it doesnt work in practice due to disjoint errors, then make Utf8Problem, Utf16Problem, and Utf32Problem

view this post on Zulip Sam Mohr (Dec 29 2024 at 00:54):

I don't feel strongly in opposition, though I do think it's better for API users to get UTFXXProblem. Either works for me

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:01):

Looking at the zig standard library, errors look to be disjoint

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:02):

So I think we will have a separate Utf8Problem, Utf16Problem and Utf32Problem

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:02):

Yep

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:10):

Also, it sounds like we actually need to support wtf-8 and wtf-16 for windows paths.

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:11):

zig does support this, but you have to explicitly tell it how to handle the utf-16 it is given

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:11):

https://ziglang.org/documentation/0.13.0/std/#std.unicode.Surrogates

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:11):

WTF is very appropriate

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:11):

An infinite hole, Windows is

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:11):

Basically, wtf is for old utf-16 that is not technically valid modern utf-16 (like windows paths and js strings, apparently)

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:12):

probably Str.from_utf16 should just take an extra arg and forward that to zig

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:13):

I'm sure having UTF-16 is good, but it seems like we only need WTF-16 for now

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:13):

Yeah, I have no idea where you would run into valid modern utf-16.

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:14):

Well, do you know how this will affect OsArg := [Unix (List U8), Windows (List U16)]? Seems like we'll only have UTF-8 and WTF-16

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:15):

Also, we can still implement UTF-16/32 in the stdlib, they won't hurt anything

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:15):

correct. We will only have UTF-8 and WTF-16

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:15):

Okay, great

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:22):

Reading up on this more, it sounds like using WTF-16 for all UTF-16 parsing is valid (and likely required in many cases due to legacy). It just loses some performance due to adding extra checks for unpaired surrogates. So I think we should make Str.from_utf16, but under the hood, it will just parse WTF-16. According to wikipedia, most utf-16 decoders do this.

I'm not sure of the perf cost, but it sounds like many systems require it in general.
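
The difference between strict UTF-16 and WTF-16 style parsing can be seen with Python's codecs, used here purely as a stand-in. The helper names are hypothetical; `errors="surrogatepass"` is Python's way of letting unpaired surrogates through, which approximates accepting potentially ill-formed UTF-16.

```python
import struct


def units_to_bytes(units):
    """Pack a list of 16-bit code units as little-endian bytes."""
    return struct.pack(f"<{len(units)}H", *units)


def decode_strict(units):
    # Strict UTF-16: an unpaired surrogate raises UnicodeDecodeError.
    return units_to_bytes(units).decode("utf-16-le")


def decode_wtf16(units):
    # WTF-16 style: unpaired surrogates are passed through as-is.
    return units_to_bytes(units).decode("utf-16-le", errors="surrogatepass")


pair = [0xD83D, 0xDE00]  # well-formed surrogate pair for U+1F600
lone = [0x0061, 0xD800]  # 'a' followed by an unpaired high surrogate

print(decode_strict(pair) == decode_wtf16(pair))  # both agree on well-formed input
print(repr(decode_wtf16(lone)))
```

On well-formed input the two modes agree; they only differ once an unpaired surrogate shows up.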

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:24):

We could also add Str.from_wtf16 and just alias Str.from_utf16 in that case

view this post on Zulip Sam Mohr (Dec 29 2024 at 01:25):

If someone wants to parse WTF-16, it'd be good to not need to ask in Zulip or read what could be the 3rd docs paragraph of Str.parse_utf16

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 01:27):

Yeah, sounds good

view this post on Zulip Richard Feldman (Dec 29 2024 at 01:48):

eh I think just doing utf16 is fine, and then document that it's actually wtf-16

view this post on Zulip Richard Feldman (Dec 29 2024 at 01:48):

I think the likelihood that perf is a problem here is low, and I wouldn't be surprised if people chose the wrong one

view this post on Zulip Richard Feldman (Dec 29 2024 at 01:48):

leading to things almost always working, but then in super rare scenarios not working right :sweat_smile:

view this post on Zulip Richard Feldman (Dec 29 2024 at 01:49):

as in, they choose utf-16 not realizing they need wtf-16

view this post on Zulip Richard Feldman (Dec 29 2024 at 01:49):

(and perhaps not knowing wtf-16 exists!)

view this post on Zulip shua (Dec 29 2024 at 03:46):

edit: below is incorrect, we want the wrapping BadUtf8 tags


Merging suggestions from above on error api:

Brendan Hansknecht said:

Looking at the zig standard library, errors look to be disjoint

means we want distinct Utf8Problem, Utf16Problem and Utf32Problem tagsets, and

Brendan Hansknecht said:

Oh yeah let's remove one level of tag nesting

indicates we can remove the wrapping BadUtf8 etc tag, leading to the following api:

Utf8Problem : [ ... ]
Utf16Problem : [ ... ]
Utf32Problem : [ ... ]

Str.from_utf8 : List U8 -> Result Str { index : U64, problem : Utf8Problem }
Str.from_utf8_lossy : List U8 -> Str
Str.from_utf16 : List U16 -> Result Str { index : U64, problem : Utf16Problem }
Str.from_utf16_lossy : List U16 -> Str
Str.from_utf32 : List U32 -> Result Str { index : U64, problem : Utf32Problem }
Str.from_utf32_lossy : List U32 -> Str

view this post on Zulip Luke Boswell (Dec 29 2024 at 03:47):

I think we still want a tag union, because it merges nicely with other errors when you pass them up from a callsite within a function Result _ []err

view this post on Zulip Luke Boswell (Dec 29 2024 at 03:48):

If it's a record, then it's an extra step to wrap it in a tag

view this post on Zulip shua (Dec 29 2024 at 03:50):

And to be clear: we want from_utf16 to actually accept wtf-16, but from_utf8 should still only accept utf-8 not wtf-8?

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 03:52):

yes

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 03:52):

I think that is correct.

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 03:53):

wtf-8 seems to be exceptionally rare from what I can tell. So it is reasonable to just consider it malformed

view this post on Zulip Brendan Hansknecht (Dec 29 2024 at 03:53):

wtf-16 seems to be the default for many apis on the other hand.

view this post on Zulip shua (Dec 29 2024 at 03:53):

yeah, that matches my understanding as well

view this post on Zulip shua (Jan 07 2025 at 21:37):

coming back to wtf-8, what is expected for

when Str.fromUtf16 [0xd800] is
  Err _ -> "a"
  Ok s -> when Str.fromUtf8 (Str.toUtf8 s) is
    Err _ -> "b"
    Ok _ -> "c"

a. Str.fromUtf16 will implement strict utf-16 (ie not wtf-16) and fail with an error about an unpaired surrogate
b. Str.fromUtf16 will implement wtf-16, but Str.fromUtf8 will implement strict utf-8 (ie not wtf-8), and return an error about encoding a surrogate
c. Str.fromUtf16 implements wtf-16, and Str.fromUtf8 implements (possibly a subset of) wtf-8, both accepting unpaired surrogate codepoints

view this post on Zulip shua (Jan 07 2025 at 21:38):

the agreement before (to make from_utf16 actually be from_wtf16 but leave from_utf8 as-is) would imply "b" which leads to Str.fromUtf8 and Str.toUtf8 not being able to roundtrip
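
The three outcomes can be mimicked in Python to see where each design choice bites. This is only a model: `outcome` is a hypothetical helper, Python's `"strict"` and `"surrogatepass"` error modes stand in for strict versus WTF-style decoding, and the UTF-8 bytes are produced with `surrogatepass` to model a `Str.toUtf8` that simply exposes the raw bytes.

```python
LONE_SURROGATE = b"\x00\xd8"  # the single UTF-16 code unit 0xD800, little-endian


def outcome(utf16_mode: str, utf8_mode: str) -> str:
    """Return "a", "b" or "c" to match the branches of the when-expression."""
    try:
        s = LONE_SURROGATE.decode("utf-16-le", errors=utf16_mode)
    except UnicodeDecodeError:
        return "a"  # the UTF-16 step rejected the unpaired surrogate
    # Model toUtf8 as handing back the raw (WTF-8 style) bytes, ED A0 80.
    raw = s.encode("utf-8", errors="surrogatepass")
    try:
        raw.decode("utf-8", errors=utf8_mode)
    except UnicodeDecodeError:
        return "b"  # the UTF-8 step rejected the encoded surrogate
    return "c"


print(outcome("strict", "strict"))
print(outcome("surrogatepass", "strict"))
print(outcome("surrogatepass", "surrogatepass"))
```

The middle case is the round-trip problem described here: a string accepted by the lenient UTF-16 decoder produces bytes that the strict UTF-8 decoder refuses.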

view this post on Zulip shua (Jan 07 2025 at 21:55):

we could make Str.toUtf8 return a Result (List U8) _. It seems unfortunate but if you allow Str to contain unpaired surrogate codepoints, then it cannot be encoded to standards-compliant utf-8/16/32 afaict.

view this post on Zulip shua (Jan 07 2025 at 21:57):

I'm leaning towards "c". I have never cared about whether my unicode encoding/decoding libraries checked for unpaired surrogate codepoints. I would mentally file it in the same category as unpaired combining characters in the input string.

view this post on Zulip shua (Jan 07 2025 at 21:57):

but maybe someone else has stronger opinions?

view this post on Zulip Richard Feldman (Jan 07 2025 at 22:06):

I think the bar should be super high for Str to be encoded as anything other than unmodified standard utf-8, because a ton of hosts will naively convert utf8 strings to roc Strs

view this post on Zulip Richard Feldman (Jan 07 2025 at 22:07):

and if we have a slightly different representation, it sounds like a major UB footgun, not to mention a potential performance problem where we have to check every utf8 string the host wants to send in, to make sure it doesn't contain any edge cases

view this post on Zulip shua (Jan 07 2025 at 22:07):

I guess going from valid utf-8 to roc Strs should remain the same. wtf-8 is a superset of utf-8 which allows encoding more codepoints than utf-8.

view this post on Zulip Brendan Hansknecht (Jan 07 2025 at 22:09):

I think we should consider our strings to be strictly utf-8 and not wtf-8

view this post on Zulip Brendan Hansknecht (Jan 07 2025 at 22:10):

When converting wtf-16 to utf-8, I assume we need to cleanup the surrogate pairs to make it strict utf-8

view this post on Zulip Brendan Hansknecht (Jan 07 2025 at 22:11):

So we parse wtf-16, but canonicalize to cleanly convert into utf-8. If we can't canonicalize, we fail.

view this post on Zulip shua (Jan 07 2025 at 22:12):

Yeah, specifically _unpaired_ surrogate codepoints are an issue. We could replace them with unicode replacement character '�' or we could remove them.

view this post on Zulip shua (Jan 07 2025 at 22:14):

I think the motivating reason to accept wtf-16 is windows paths which can contain unpaired surrogate codepoints. If we replace or remove things in a path string, will windows still recognize it?

view this post on Zulip Brendan Hansknecht (Jan 07 2025 at 22:15):

I thought the point of wtf-16 is that it correctly knows how to convert unpaired surrogates into the correct Unicode code point.

view this post on Zulip Brendan Hansknecht (Jan 07 2025 at 22:16):

Because an unpaired surrogate means that it is actually the old ucs-2 format

view this post on Zulip shua (Jan 07 2025 at 22:26):

ah, no, not as far as I understand. wtf-16 is "potentially ill-formed utf-16".

I think at least in the document, "ucs-2" refers to a unicode encoding which was defined _before_ unicode codepoints exceeded 0xFFFF so everything fit in 16bits and surrogate pairs were not defined nor necessary. ucs-2 and utf-16 differ from codepoints 0xD800 to 0x110000, as ucs-2 can't encode anything higher than 0xFFFF and utf-16 has those ill-formed constraints around 0xD800 to 0xDFFF.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.
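
The surrogate-pair arithmetic that separates UTF-16 from UCS-2 can be worked through in a few lines of Python. The function names are hypothetical; the constants follow the standard UTF-16 encoding of codepoints above U+FFFF.

```python
def to_surrogate_pair(cp: int):
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint fits in one 16-bit unit or is out of range")
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits land in 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits land in 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```

A UCS-2 system would see those two units as two separate 16-bit values, which is why data from such systems can contain one half without the other.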

view this post on Zulip shua (Jan 07 2025 at 22:43):

Brendan Hansknecht said:

I thought the point of wtf-16 is that it correctly knows how to convert unpaired surrogates into the correct Unicode code point.

yes, but encoding those codepoints as utf-8 is disallowed by the official standard

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

Thus, the solution is to use a not-as-strict encoding which is the same as utf-8, except it allows encoding those codepoints. This not-as-strict encoding is wtf-8

view this post on Zulip Brendan Hansknecht (Jan 07 2025 at 22:58):

Hmm. I guess I don't fully understand wtf-16. Anyway, I would guess we want to keep strings as fully valid utf-8. So when converting from wtf-16, we would have one form of the function that uses replacement characters as necessary and another that just fails.

view this post on Zulip shua (Jan 08 2025 at 22:57):

Okay, I think that's what I'm doing currently. fromUtf16 and fromUtf8 will fail if the input is not utf-16 or utf-8 respectively, while fromUtf16Lossy and fromUtf8Lossy accept a superset of utf-16 and utf-8, replacing ill-formed sequences and codepoints that cannot be encoded as utf-8 (ie surrogates) with the unicode replacement character.

This does mean that roc will not be able to work with host paths as Str but would rather have to work with List U8 and fallibly convert them to Str.
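
The strict/lossy split described here can be sketched for UTF-16 with a hand-written decoder over a list of 16-bit code units. This is a Python illustration under stated assumptions, not the code from the PR: the names `from_utf16` and `from_utf16_lossy`, the helper `_scan`, and the shape of the error value are all hypothetical.

```python
REPLACEMENT = 0xFFFD


def _scan(units):
    """Yield (index, codepoint); codepoint is None for an unpaired surrogate."""
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Well-formed pair: combine both units into one codepoint.
            yield i, 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            yield i, None  # surrogate without its partner
            i += 1
        else:
            yield i, u
            i += 1


def from_utf16(units):
    """Strict: fail at the first unpaired surrogate, reporting its index."""
    out = []
    for index, cp in _scan(units):
        if cp is None:
            return ("Err", {"index": index})
        out.append(chr(cp))
    return ("Ok", "".join(out))


def from_utf16_lossy(units):
    """Lossy: substitute U+FFFD for each unpaired surrogate."""
    return "".join(chr(REPLACEMENT if cp is None else cp) for _, cp in _scan(units))


print(from_utf16([0x41, 0xD800, 0x42]))
print(from_utf16_lossy([0x41, 0xD800, 0x42]))
```

Either way the resulting string contains no surrogates, so it can always be stored as strict UTF-8, which is the invariant the thread wants `Str` to keep.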

view this post on Zulip Brendan Hansknecht (Jan 08 2025 at 22:59):

Yeah, I think a host path has to be a List U8 or a List U16, I guess.

view this post on Zulip shua (Jan 08 2025 at 23:00):

oh the joys of standards compliance

view this post on Zulip Richard Feldman (Jan 08 2025 at 23:07):

host paths aren't valid anything :laughing:

view this post on Zulip Richard Feldman (Jan 08 2025 at 23:08):

UNIX allows anything in paths other than 0 bytes, and ASCII forward slashes mean directory separators

view this post on Zulip Richard Feldman (Jan 08 2025 at 23:08):

the spec doesn't even have anything to do with Unicode :sweat_smile:

view this post on Zulip Richard Feldman (Jan 08 2025 at 23:09):

Windows is similar except it's like they also ban bytes under 32 or something like that

view this post on Zulip shua (Jan 08 2025 at 23:13):

they're a valid nuisance is what they are :wink:

view this post on Zulip shua (Jan 08 2025 at 23:13):

I think WASI filesystem spec uses string as path values, which must be valid unicode scalar values, and they just accept that some (pathologically-named) files will be unreachable via WASI. So if WASI is considered as a host, that's at least one.

edit: adding link to wasi:filesystem spec

view this post on Zulip Richard Feldman (Jan 08 2025 at 23:15):

interesting - we might want to add WASI to roc-lang/path

view this post on Zulip shua (Jan 16 2025 at 21:35):

just in case someone's following this thread: https://github.com/roc-lang/roc/pull/7514 has been posted and approved, but is currently blocked on some interesting CI failures #bugs > mono mismatch between mac aarch64 and rest of targets


Last updated: Jul 06 2025 at 12:14 UTC