Dedicated Char type · ideas · Zulip Chat Archive

Stream: ideas

Topic: Dedicated Char type

Qqwy / Marten (Jul 07 2022 at 21:32):

So character constants like 'a' or '☕' are currently turned into U32s.
However, putting these later into a string requires calling Str.appendScalar : Str, U32 -> (Result Str [ InvalidScalar ]*).
This introduces a pesky Result type, even though we know that these are valid!

If we were to introduce a new numeric type wrapper around U32 that is guaranteed to only contain the values for valid scalars (c.f. https://unicode.org/glossary/#unicode_scalar_value ), we would make dealing with strings and individual chars nicer. :grinning:

Whether this should be called Char, UnicodeScalar or something else is a secondary matter :bike: 🛖 :stuck_out_tongue_wink:.

Brian Carroll (Jul 07 2022 at 21:34):

Cool. A private tag is probably the best way to do that.

Brian Carroll (Jul 07 2022 at 21:39):

If we define such a Char type inside the Str package such that it can only be constructed using functions from Str then it should all work out with no language level changes, just using stuff we already have.

Brendan Hansknecht (Jul 07 2022 at 21:40):

I think we plan to add (maybe already added) byte characters that become U8. So that would theoretically also want to be wrapped as well. Just a note that we may have to add 2 different names

Richard Feldman (Jul 08 2022 at 00:51):

so there's a pretty deep rabbit hole here :big_smile:

Richard Feldman (Jul 08 2022 at 00:51):

one factor is that encouraging using scalars over plain bytes has a performance cost

Richard Feldman (Jul 08 2022 at 00:52):

like let's say I'm parsing some bytes I just read from disk as JSON

Richard Feldman (Jul 08 2022 at 00:53):

and I want to see if the next byte is { or " or [

Richard Feldman (Jul 08 2022 at 00:54):

today, I do a when on those (maybe I convert the byte to U32 first so I can use the single-quote literals, but that could easily end up compiling to 0 machine instructions depending on what LLVM does in terms of eliding unnecessary casts and also how the registers work out)

Richard Feldman (Jul 08 2022 at 00:54):

and I have branches like '{' ->, '[' ->, etc.

Richard Feldman (Jul 08 2022 at 00:55):

in comparison, let's say the single quote literals give you a Char or UnicodeScalar, which lets me skip a conditional later when adding it back into the string

Richard Feldman (Jul 08 2022 at 00:55):

now I have an incentive to take this raw byte and convert it to an opaque Scalar so that I can match on that

Richard Feldman (Jul 08 2022 at 00:56):

but of course, an individual byte might not necessarily be a valid Scalar - if it's between 0 and 127 it is, but otherwise it's not

Richard Feldman (Jul 08 2022 at 00:56):

so the function that goes from U8 to Scalar would have to return a Result

Richard Feldman (Jul 08 2022 at 00:57):

but that's okay ergonomically because I can do when Scalar.fromU8 byte is and then match on like Ok '{' ->, Ok '[' ->, etc

Richard Feldman (Jul 08 2022 at 00:58):

this has just moved the conditional from one place to another, which is not a huge deal in terms of performance

Richard Feldman (Jul 08 2022 at 00:58):

should be roughly a wash

Richard Feldman (Jul 08 2022 at 00:58):

however, if I do like walkScalars instead of walkUtf8Bytes, that is a big difference

Richard Feldman (Jul 08 2022 at 00:59):

because going from arbitrary sequences of raw bytes to arbitrary sequences of scalars is actually quite a few conditionals per step

Richard Feldman (Jul 08 2022 at 00:59):

when the scalars are over 127

Richard Feldman (Jul 08 2022 at 01:01):

so today, there's not much incentive to use U32 over U8 (except maybe in terms of using the single quote literals in pattern matches, but as previously noted, that runtime cost seems likely to be zero) because you need the same number of conditionals in the same places

Richard Feldman (Jul 08 2022 at 01:02):

but if a UnicodeScalar type let you avoid |> Result.withDefault "" # this can never happen (or explicit panics, for that matter) then it would incentivize using that over raw bytes, even though the latter is more efficient by default

Richard Feldman (Jul 08 2022 at 01:02):

so one possible solution to that problem is to have separate Scalar and AsciiByte types, or something like that

Richard Feldman (Jul 08 2022 at 01:03):

but you can't really do Str.walkAsciiBytes and have it give you a list of raw ASCII bytes, because often you'll have non-ASCII characters in there

Richard Feldman (Jul 08 2022 at 01:03):

so at each step it would have to give you something like [Ascii AsciiByte, NonAscii U8]

Richard Feldman (Jul 08 2022 at 01:04):

or else go ahead and parse the entire scalar, at which point we have something even less efficient than walkScalars

Richard Feldman (Jul 08 2022 at 01:07):

because really, at the end of the day, the fastest way to parse JSON (aside from SIMDjson which is its own entire rabbit hole) is by looking at either individual bytes (for byte-sized delimiters like { or ") or string slices (e.g. validating that everything between two "s is valid UTF-8, so we can convert it to a Str), and never actually converting anything to a scalar

Richard Feldman (Jul 08 2022 at 01:07):

but having a first-class scalar type encourages working with scalars instead, even though they're less efficient!

Richard Feldman (Jul 08 2022 at 01:09):

the only idea I had for a type that seemed like it wouldn't have this downside is to have just the AsciiByte type (give or take a different name), which has a literal syntax, and then have a compiler optimization where Result AsciiByte [NonAscii] monomorphizes specially

Richard Feldman (Jul 08 2022 at 01:09):

and in particular it monomorphizes to the result discriminant being the high bit

Richard Feldman (Jul 08 2022 at 01:10):

which would mean that AsciiByte.fromU8 : U8 -> Result AsciiByte [NonAscii]* would actually compile to the identity function (as long as the error union didn't end up picking up any other tags, which it typically wouldn't) and get optimized out

Richard Feldman (Jul 08 2022 at 01:11):

and also if you did when AsciiByte.fromU8 byte is and then matched individual branches, the matching on like Ok '{' -> could be the same as just matching on the number directly

Richard Feldman (Jul 08 2022 at 01:12):

so that design could encourage using bytes over scalars (which seems like the right thing to incentivize!) and could eliminate the |> Result.withDefault "" # this can never happen in those situations

Richard Feldman (Jul 08 2022 at 01:13):

however, it's kinda predicated on a pretty special-case optimization, and it also seems kinda niche - like I'm not sure what it would be useful for other than parsing textual formats with ASCII delimiters (which, granted, is pretty much all of them)

Richard Feldman (Jul 08 2022 at 01:14):

so considering all that, I figured "let's just leave it at plain numbers and revisit after we've gathered a substantial body of use cases and revisit then" :big_smile:

Qqwy / Marten (Jul 08 2022 at 07:52):

"let's just leave it at plain numbers and revisit after we've gathered a substantial body of use cases and revisit then" :big_smile:

This I definitely agree with! :happy:

Qqwy / Marten (Jul 08 2022 at 07:56):

I saw some article a while back talking about switching from one text encoding to the other, with the gist being that in practice characters in the ASCII set (e.g. < 256) are much more common by virtue of most non-binary serialization formats heavily using only those, and thus it makes sense to optimize code for that in many situations.

Let me see if I can find it again.

EDIT: Found it! https://www.channable.com/tech/so-long-surrogatesa

Very much worth reading.

Qqwy / Marten (Jul 08 2022 at 11:17):

Richard Feldman said:

so the function that goes from U8 to Scalar would have to return a Result

I don't think this is correct. Valid unicode Unicode scalars are in "the ranges of integers 0x0 to 0xD7FF and 0xE000 to 10FFFF inclusive". So any U8 is a valid scalar.

Qqwy / Marten (Jul 08 2022 at 11:29):

and never actually converting anything to a scalar

Currently, I'm going the other way by the way: Converting scalars to mini-bytestrings (of 1-4 bytes in length) and then comparing the current start of the string against that mini-bytestring. (bytestring = List U8).

Qqwy / Marten (Jul 08 2022 at 11:30):

I'm pretty sure that algorithms exist that do this comparison in a more clever way by casting both the scalar and the start of the list to a machine word size, and doing a bit masking check.

Richard Feldman (Jul 08 2022 at 12:25):

Qqwy / Marten said:

Richard Feldman said:

so the function that goes from U8 to Scalar would have to return a Result

I don't think this is correct. Valid unicode Unicode scalars are in "the ranges of integers 0x0 to 0xD7FF and 0xE000 to 10FFFF inclusive". So any U8 is a valid scalar.

kinda - so if the U8 is encoded as UTF-8, then if it's in the 0-127 range it's already a scalar, but if it's 128-255 then it's a multibyte character and you need to read at least 1 more byte to convert it to a scalar

Richard Feldman (Jul 08 2022 at 12:26):

so technically you could interpret a UTF-8 byte of 200 as a scalar directly, and it would be some valid scalar, but it wouldn't be the right one :sweat_smile:

Richard Feldman (Jul 08 2022 at 12:28):

Qqwy / Marten said:

and never actually converting anything to a scalar

Currently, I'm going the other way by the way: Converting scalars to mini-bytestrings (of 1-4 bytes in length) and then comparing the current start of the string against that mini-bytestring. (bytestring = List U8).

that should work, although I suspect there will be room for performance optimization in the future by doing more direct byte comparisons! :smiley:

also, converting from a Str of 4 bytes to a List of 4 bytes actually requires a heap allocation, because the 4-byte string is small enough to be stored on the stack (because we do the small string optimization) whereas List is always stored on the heap

Richard Feldman (Jul 08 2022 at 12:32):

so in general it should be (much) faster to find algorithms that can avoid converting from Str to List U8 when the Str is small

Richard Feldman (Jul 08 2022 at 12:33):

("small" being specifically under 24 bytes on a 64-bit target, and under 12 bytes on a 32-bit target like wasm)

Qqwy / Marten (Jul 08 2022 at 13:00):

Interesting. Was it a conscious choice to not have a 'small List optimization'?

Qqwy / Marten (Jul 08 2022 at 13:03):

Regardless: Conversion from Str to List U8 should only happen once per parser (maybe depending on how the compiler is able to optimize this) and once on the input string.

Qqwy / Marten (Jul 08 2022 at 13:05):

I am more and more convinced of the merits of using a {coll: List U8, index: Nat} 'slice-like' representation internally while parsing, as this would make the consumption of the prefix of the input a constant-time operation.

Qqwy / Marten (Jul 08 2022 at 13:08):

That would by the way be immensely helped by a native builtin along the lines of List.containsSublistAt : List a, List a, Nat -> Bool.

Folkert de Vries (Jul 08 2022 at 13:13):

why does that need to be a builtin?

Qqwy / Marten (Jul 08 2022 at 13:13):

For efficiency

Qqwy / Marten (Jul 08 2022 at 13:14):

At least, I think that this can be implemented more efficiently in native code than in pure Roc

Qqwy / Marten (Jul 08 2022 at 13:15):

Especially a specialization for List U8

Richard Feldman (Jul 08 2022 at 13:16):

Was it a conscious choice to not have a 'small List optimization'?

yeah I've thought about it a bunch, and it seems like it would make all List operations other than parsing run slower, in order to make just parsing not have to think about this :sweat_smile:

Qqwy / Marten (Jul 08 2022 at 13:21):

:open_mouth: That is unexpected. I'd expect it to allow better cache locality and stuff for small unique lists.

Folkert de Vries (Jul 08 2022 at 13:22):

well now every list operation would involve a conditional

Folkert de Vries (Jul 08 2022 at 13:22):

to check if it is small

Folkert de Vries (Jul 08 2022 at 13:22):

and when performance matters, lists are usually big

Qqwy / Marten (Jul 08 2022 at 13:37):

Ah, that makes sense

Qqwy / Marten (Jul 08 2022 at 13:43):

@Folkert de Vries I had something like the following in mind for the List.containsSublistAt. A Rust example, but probably is reproducable in LLVM itself as well. I am not very good at reading x64 assembly, but it seems like this gets optimized to the point where comparison happens in 4-byte (register-sized) chunks.
https://godbolt.org/z/b6beaYG5M

Folkert de Vries (Jul 08 2022 at 13:53):

I would kind of hope LLVM does something similar

Brendan Hansknecht (Jul 08 2022 at 14:32):

To the {list, index} idea

If the index never goes backwards, theoretically long term we will have slices. A list used there where you just keep subslicing so that the first element is at the index would be just as efficient and not require the index variable.

Brendan Hansknecht (Jul 08 2022 at 14:38):

Also, as for the small/large list question, I think in many cases lists stay reasonably small, but you need to be able to tweak and tune how large a small list is for it to work well. If you stay with a static number of elements or a static size of bytes, it can really lose all of the performance benefits.

Brendan Hansknecht (Jul 08 2022 at 14:39):

I think in many cases the branches around small/large list are extremely predictable and low cost especially compared to the following memory access.

Brendan Hansknecht (Jul 08 2022 at 14:40):

So I think the main probably with small lists is that they need to be tunable. And that is probably not something roc would want to expose to userland. Which means they wouldn't be worth adding.

Last updated: Jul 23 2026 at 13:15 UTC