avoiding UTF-8 revalidation · ideas

Stream: ideas

Topic: avoiding UTF-8 revalidation

Richard Feldman (Apr 08 2024 at 15:13):

Brendan Hansknecht said:

Haven't done any analysis as to why we are 2.2x slower than Go. Might be a bad implementation of splitWhiteSpace on my part (probably this I think I am doing extra unicode verification that is costing, why a builtin may be needed here).

this led to #ideas > split whitespace but I wanted to start a separate thread to discuss the (likely) broader problem here and potential ideas for a more general solution

Richard Feldman (Apr 08 2024 at 15:14):

so let's say I want to split a Str into smaller strings based on some unusual delimiter - like something that would never be a builtin, for example "split it whenever we encounter a digit followed immediately by a digit that's 1 more than that digit was" - e.g. 0 followed by 1 means split, and so does 2 followed by 3

Richard Feldman (Apr 08 2024 at 15:15):

that's a contrived example, but the point is that a builtin like Str.split wouldn't work, because that requires a hardcoded separator - whereas this is a separator that varies based on what came before it - and also, unlike splitting on whitespace, it wouldn't make sense to have a builtin for something this specific

Richard Feldman (Apr 08 2024 at 15:16):

now today, I could do the splitting by walking over the string as UTF-8 bytes - that's no problem

Richard Feldman (Apr 08 2024 at 15:17):

however, what I can't do is end up with Str slices into the original string, like I can with Str.split

Richard Feldman (Apr 08 2024 at 15:17):

that's unfortunate, because for performance reasons that's definitely what I want it to do

Richard Feldman (Apr 08 2024 at 15:19):

so to get the best performance out of an operation like this, while still returning a List Str (of slices into the original string, like Str.split can return), I would need some way to say "traverse the UTF-8 bytes of this string, and then after each byte tell me whether you want to split or not"

Richard Feldman (Apr 08 2024 at 15:20):

and in fact if I want to end up with a List Str, then there's no way for me to avoid doing some amount of UTF-8 revalidation (in addition to one allocation per Str in that list), because either I have to convert a List slice into a Str (which requires revalidation) or else I have to build up the Str one code point at a time (which also requires validation)

Richard Feldman (Apr 08 2024 at 15:22):

one idea for how that could look:

Str.partitionByUtf8 : Str, state, (state, U8 -> (state, [Keep, Discard])) -> List Str

Richard Feldman (Apr 08 2024 at 15:23):

so basically you go 1 byte at a time through the UTF-8, and then return whether you want to keep that byte or discard it, and the discarded bytes end up being the separators between the strings (so, all the "keeps" result in one Str slice until you get to either a Discard or the end of the input Str)

Brendan Hansknecht (Apr 08 2024 at 15:24):

A note, here:

You totally can get seamless slices from the original string (as long as it isn't a small string). I am doing it in my function.

Str.toUtf8 This gives you a list sharing the underlying allocation with the string
List.walkWithIndex or whatever to whatever the bytes.
Very important, do not build new strings a character at a time. Instead track indices
List.sublist to get seamless slices of the original string.
Str.fromUtf8 is the only cost we currently can't remove. Will be seamless still, but will pay the cost of validating the utf8.

Richard Feldman (Apr 08 2024 at 15:25):

ha, that's clever! I didn't think of sublist and indices :thumbs_up:

Brendan Hansknecht (Apr 08 2024 at 15:26):

One idea that is safe and would help (but not completely fix) the perf issues would be to a Str.subStrUtf8Bytes(or whatever similar name). That function would only verify the first character and last character in the substring are valid utf8. All characters in between are guaranteed to be valid substrings due to being part of the original string. Just need to make sure not to split in the build of a character.

Brendan Hansknecht (Apr 08 2024 at 15:27):

Str.partitionByUtf8 : Str, state, (state, U8 -> (state, [Keep, Discard])) -> List Str

I assume you would want this to return a results and track the utf8 is split in valid locations that don't lead to invalid utf8 in strings?

Richard Feldman (Apr 08 2024 at 15:28):

so there would be a couple of annoying things about that Str.partitionByUtf8 API idea:

each individual UTF-8 byte would contain the usual UTF-8 markers about whether you have a multi-byte code point, and you'd have to translate those into code points yourself, which would be annoying in a "walk" style where you're maintaining partial code point parsing in a state
you could return Keep in the middle of a multi-byte code point, and then that would result in a returned Str being invalid UTF-8, so we'd still need to validate that

Richard Feldman (Apr 08 2024 at 15:28):

yeah

Richard Feldman (Apr 08 2024 at 15:29):

so (1) could be improved by changing (state, U8 -> ...) to (state, U32 -> ...) and having it receive a decoded code point instead of a raw UTF-8 byte, but now we're always paying for a bunch of code point decoding work that might be unnecessary to the partitioning algorithm

Brendan Hansknecht (Apr 08 2024 at 15:30):

the original api could have relatively cheap validation. ~~Just is the Discard within a utf multi-byte code point, but not repeated for the whole codepoint.~~ actually, it is more simple, just did we change state in the middle of a multi-byte codepoint

Richard Feldman (Apr 08 2024 at 15:30):

also it still wouldn't change the fact that some code points are not valid scalar values, so it wouldn't be valid to keep them either, so (2) would still apply

Richard Feldman (Apr 08 2024 at 15:33):

I have a vague sense that we could further improve validation performance by not getting Result involved

Brendan Hansknecht (Apr 08 2024 at 15:33):

Oh, big problem with partitionByUtf8 api if it only uses U8

You may not know if you want to discard until you read the entire codepoint. This api you have to specify keep/discard at each individual byte.

Richard Feldman (Apr 08 2024 at 15:33):

that's a good point!

Richard Feldman (Apr 08 2024 at 15:34):

also you might need multiple code points I guess

Richard Feldman (Apr 08 2024 at 15:38):

Brendan Hansknecht said:

One idea that is safe and would help (but not completely fix) the perf issues would be to a Str.subStrUtf8Bytes(or whatever similar name). That function would only verify the first character and last character in the substring are valid utf8. All characters in between are guaranteed to be valid substrings due to being part of the original string.

this is an interesting idea...the validation for the first byte would be trivial: "it's valid iff it's either less than 128 or else the start of a multibyte code point"

Richard Feldman (Apr 08 2024 at 15:39):

validating the end would be more work, because it would involve looking at up to 4 bytes

Brendan Hansknecht (Apr 08 2024 at 15:40):

I bet we could load all 4 as a u32 and do a smart super quick processing.

Richard Feldman (Apr 08 2024 at 15:41):

a nice thing about that design is that since we're always looking at the bytes from the original Str, you can walk over those bytes and then do Str.subStrUtf8Bytes (or whatever) as you go, while all that memory is in cache

Richard Feldman (Apr 08 2024 at 15:42):

and then you never even need to build up an intermediate List of indices to split on later

Richard Feldman (Apr 08 2024 at 15:42):

you just build up the final List Str as you go, and they're all slices

Last updated: Jul 23 2026 at 13:15 UTC