So character constants like 'a' or '☕' are currently turned into U32s.
However, putting these later into a string requires calling Str.appendScalar : Str, U32 -> (Result Str [ InvalidScalar ]*).
This introduces a pesky Result type, even though we know that these are valid!
If we were to introduce a new numeric type wrapper around U32 that is guaranteed to only contain the values for valid scalars (c.f. https://unicode.org/glossary/#unicode_scalar_value ), we would make dealing with strings and individual chars nicer. :grinning:
Whether this should be called Char, UnicodeScalar or something else is a secondary matter :bike: 🛖 :stuck_out_tongue_wink:.
Cool. A private tag is probably the best way to do that.
If we define such a Char type inside the Str package such that it can only be constructed using functions from Str then it should all work out with no language level changes, just using stuff we already have.
I think we plan to add (maybe already added) byte characters that become U8. So that would theoretically also want to be wrapped as well. Just a note that we may have to add 2 different names
so there's a pretty deep rabbit hole here :big_smile:
one factor is that encouraging using scalars over plain bytes has a performance cost
like let's say I'm parsing some bytes I just read from disk as JSON
and I want to see if the next byte is { or " or [
today, I do a when on those (maybe I convert the byte to U32 first so I can use the single-quote literals, but that could easily end up compiling to 0 machine instructions depending on what LLVM does in terms of eliding unnecessary casts and also how the registers work out)
and I have branches like '{' ->, '[' ->, etc.
in comparison, let's say the single quote literals give you a Char or UnicodeScalar, which lets me skip a conditional later when adding it back into the string
now I have an incentive to take this raw byte and convert it to an opaque Scalar so that I can match on that
but of course, an individual byte might not necessarily be a valid Scalar - if it's between 0 and 127 it is, but otherwise it's not
so the function that goes from U8 to Scalar would have to return a Result
but that's okay ergonomically because I can do when Scalar.fromU8 byte is and then match on like Ok '{' ->, Ok '[' ->, etc
this has just moved the conditional from one place to another, which is not a huge deal in terms of performance
should be roughly a wash
however, if I do like walkScalars instead of walkUtf8Bytes, that is a big difference
because going from arbitrary sequences of raw bytes to arbitrary sequences of scalars is actually quite a few conditionals per step
when the scalars are over 127
so today, there's not much incentive to use U32 over U8 (except maybe in terms of using the single quote literals in pattern matches, but as previously noted, that runtime cost seems likely to be zero) because you need the same number of conditionals in the same places
but if a UnicodeScalar type let you avoid |> Result.withDefault "" # this can never happen (or explicit panics, for that matter) then it would incentivize using that over raw bytes, even though the latter is more efficient by default
so one possible solution to that problem is to have separate Scalar and AsciiByte types, or something like that
but you can't really do Str.walkAsciiBytes and have it give you a list of raw ASCII bytes, because often you'll have non-ASCII characters in there
so at each step it would have to give you something like [Ascii AsciiByte, NonAscii U8]
or else go ahead and parse the entire scalar, at which point we have something even less efficient than walkScalars
because really, at the end of the day, the fastest way to parse JSON (aside from SIMDjson which is its own entire rabbit hole) is by looking at either individual bytes (for byte-sized delimiters like { or ") or string slices (e.g. validating that everything between two "s is valid UTF-8, so we can convert it to a Str), and never actually converting anything to a scalar
but having a first-class scalar type encourages working with scalars instead, even though they're less efficient!
the only idea I had for a type that seemed like it wouldn't have this downside is to have just the AsciiByte type (give or take a different name), which has a literal syntax, and then have a compiler optimization where Result AsciiByte [NonAscii] monomorphizes specially
and in particular it monomorphizes to the result discriminant being the high bit
which would mean that AsciiByte.fromU8 : U8 -> Result AsciiByte [NonAscii]* would actually compile to the identity function (as long as the error union didn't end up picking up any other tags, which it typically wouldn't) and get optimized out
and also if you did when AsciiByte.fromU8 byte is and then matched individual branches, the matching on like Ok '{' -> could be the same as just matching on the number directly
so that design could encourage using bytes over scalars (which seems like the right thing to incentivize!) and could eliminate the |> Result.withDefault "" # this can never happen in those situations
however, it's kinda predicated on a pretty special-case optimization, and it also seems kinda niche - like I'm not sure what it would be useful for other than parsing textual formats with ASCII delimiters (which, granted, is pretty much all of them)
so considering all that, I figured "let's just leave it at plain numbers and revisit after we've gathered a substantial body of use cases and revisit then" :big_smile:
"let's just leave it at plain numbers and revisit after we've gathered a substantial body of use cases and revisit then" :big_smile:
This I definitely agree with! :happy:
I saw some article a while back talking about switching from one text encoding to the other, with the gist being that in practice characters in the ASCII set (e.g. < 256) are much more common by virtue of most non-binary serialization formats heavily using only those, and thus it makes sense to optimize code for that in many situations.
Let me see if I can find it again.
EDIT: Found it! https://www.channable.com/tech/so-long-surrogatesa
Very much worth reading.
Richard Feldman said:
so the function that goes from
U8toScalarwould have to return aResult
I don't think this is correct. Valid unicode Unicode scalars are in "the ranges of integers 0x0 to 0xD7FF and 0xE000 to 10FFFF inclusive". So any U8 is a valid scalar.
and never actually converting anything to a scalar
Currently, I'm going the other way by the way: Converting scalars to mini-bytestrings (of 1-4 bytes in length) and then comparing the current start of the string against that mini-bytestring. (bytestring = List U8).
I'm pretty sure that algorithms exist that do this comparison in a more clever way by casting both the scalar and the start of the list to a machine word size, and doing a bit masking check.
Qqwy / Marten said:
Richard Feldman said:
so the function that goes from
U8toScalarwould have to return aResultI don't think this is correct. Valid unicode Unicode scalars are in "the ranges of integers 0x0 to 0xD7FF and 0xE000 to 10FFFF inclusive". So any
U8is a valid scalar.
kinda - so if the U8 is encoded as UTF-8, then if it's in the 0-127 range it's already a scalar, but if it's 128-255 then it's a multibyte character and you need to read at least 1 more byte to convert it to a scalar
so technically you could interpret a UTF-8 byte of 200 as a scalar directly, and it would be some valid scalar, but it wouldn't be the right one :sweat_smile:
Qqwy / Marten said:
and never actually converting anything to a scalar
Currently, I'm going the other way by the way: Converting scalars to mini-bytestrings (of 1-4 bytes in length) and then comparing the current start of the string against that mini-bytestring. (bytestring =
List U8).
that should work, although I suspect there will be room for performance optimization in the future by doing more direct byte comparisons! :smiley:
also, converting from a Str of 4 bytes to a List of 4 bytes actually requires a heap allocation, because the 4-byte string is small enough to be stored on the stack (because we do the small string optimization) whereas List is always stored on the heap
so in general it should be (much) faster to find algorithms that can avoid converting from Str to List U8 when the Str is small
("small" being specifically under 24 bytes on a 64-bit target, and under 12 bytes on a 32-bit target like wasm)
Interesting. Was it a conscious choice to not have a 'small List optimization'?
Regardless: Conversion from Str to List U8 should only happen once per parser (maybe depending on how the compiler is able to optimize this) and once on the input string.
I am more and more convinced of the merits of using a {coll: List U8, index: Nat} 'slice-like' representation internally while parsing, as this would make the consumption of the prefix of the input a constant-time operation.
That would by the way be immensely helped by a native builtin along the lines of List.containsSublistAt : List a, List a, Nat -> Bool.
why does that need to be a builtin?
For efficiency
At least, I think that this can be implemented more efficiently in native code than in pure Roc
Especially a specialization for List U8
Was it a conscious choice to not have a 'small List optimization'?
yeah I've thought about it a bunch, and it seems like it would make all List operations other than parsing run slower, in order to make just parsing not have to think about this :sweat_smile:
:open_mouth: That is unexpected. I'd expect it to allow better cache locality and stuff for small unique lists.
well now every list operation would involve a conditional
to check if it is small
and when performance matters, lists are usually big
Ah, that makes sense
@Folkert de Vries I had something like the following in mind for the List.containsSublistAt. A Rust example, but probably is reproducable in LLVM itself as well. I am not very good at reading x64 assembly, but it seems like this gets optimized to the point where comparison happens in 4-byte (register-sized) chunks.
https://godbolt.org/z/b6beaYG5M
I would kind of hope LLVM does something similar
To the {list, index} idea
If the index never goes backwards, theoretically long term we will have slices. A list used there where you just keep subslicing so that the first element is at the index would be just as efficient and not require the index variable.
Also, as for the small/large list question, I think in many cases lists stay reasonably small, but you need to be able to tweak and tune how large a small list is for it to work well. If you stay with a static number of elements or a static size of bytes, it can really lose all of the performance benefits.
I think in many cases the branches around small/large list are extremely predictable and low cost especially compared to the following memory access.
So I think the main probably with small lists is that they need to be tunable. And that is probably not something roc would want to expose to userland. Which means they wouldn't be worth adding.
Last updated: Jun 16 2026 at 16:19 UTC