Brendan Hansknecht said:
Haven't done any analysis as to why we are 2.2x slower than Go. Might be a bad implementation of
splitWhiteSpaceon my part (probably this I think I am doing extra unicode verification that is costing, why a builtin may be needed here).
this led to #ideas > split whitespace but I wanted to start a separate thread to discuss the (likely) broader problem here and potential ideas for a more general solution
so let's say I want to split a Str into smaller strings based on some unusual delimiter - like something that would never be a builtin, for example "split it whenever we encounter a digit followed immediately by a digit that's 1 more than that digit was" - e.g. 0 followed by 1 means split, and so does 2 followed by 3
that's a contrived example, but the point is that a builtin like Str.split wouldn't work, because that requires a hardcoded separator - whereas this is a separator that varies based on what came before it - and also, unlike splitting on whitespace, it wouldn't make sense to have a builtin for something this specific
now today, I could do the splitting by walking over the string as UTF-8 bytes - that's no problem
however, what I can't do is end up with Str slices into the original string, like I can with Str.split
that's unfortunate, because for performance reasons that's definitely what I want it to do
so to get the best performance out of an operation like this, while still returning a List Str (of slices into the original string, like Str.split can return), I would need some way to say "traverse the UTF-8 bytes of this string, and then after each byte tell me whether you want to split or not"
and in fact if I want to end up with a List Str, then there's no way for me to avoid doing some amount of UTF-8 revalidation (in addition to one allocation per Str in that list), because either I have to convert a List slice into a Str (which requires revalidation) or else I have to build up the Str one code point at a time (which also requires validation)
one idea for how that could look:
Str.partitionByUtf8 : Str, state, (state, U8 -> (state, [Keep, Discard])) -> List Str
so basically you go 1 byte at a time through the UTF-8, and then return whether you want to keep that byte or discard it, and the discarded bytes end up being the separators between the strings (so, all the "keeps" result in one Str slice until you get to either a Discard or the end of the input Str)
A note, here:
You totally can get seamless slices from the original string (as long as it isn't a small string). I am doing it in my function.
Str.toUtf8 This gives you a list sharing the underlying allocation with the stringList.walkWithIndex or whatever to whatever the bytes. List.sublist to get seamless slices of the original string.Str.fromUtf8 is the only cost we currently can't remove. Will be seamless still, but will pay the cost of validating the utf8.ha, that's clever! I didn't think of sublist and indices :thumbs_up:
One idea that is safe and would help (but not completely fix) the perf issues would be to a Str.subStrUtf8Bytes(or whatever similar name). That function would only verify the first character and last character in the substring are valid utf8. All characters in between are guaranteed to be valid substrings due to being part of the original string. Just need to make sure not to split in the build of a character.
Str.partitionByUtf8 : Str, state, (state, U8 -> (state, [Keep, Discard])) -> List Str
I assume you would want this to return a results and track the utf8 is split in valid locations that don't lead to invalid utf8 in strings?
so there would be a couple of annoying things about that Str.partitionByUtf8 API idea:
stateKeep in the middle of a multi-byte code point, and then that would result in a returned Str being invalid UTF-8, so we'd still need to validate thatyeah
so (1) could be improved by changing (state, U8 -> ...) to (state, U32 -> ...) and having it receive a decoded code point instead of a raw UTF-8 byte, but now we're always paying for a bunch of code point decoding work that might be unnecessary to the partitioning algorithm
the original api could have relatively cheap validation. Just is the actually, it is more simple, just did we change state in the middle of a multi-byte codepointDiscard within a utf multi-byte code point, but not repeated for the whole codepoint.
also it still wouldn't change the fact that some code points are not valid scalar values, so it wouldn't be valid to keep them either, so (2) would still apply
I have a vague sense that we could further improve validation performance by not getting Result involved
Oh, big problem with partitionByUtf8 api if it only uses U8
You may not know if you want to discard until you read the entire codepoint. This api you have to specify keep/discard at each individual byte.
that's a good point!
also you might need multiple code points I guess
Brendan Hansknecht said:
One idea that is safe and would help (but not completely fix) the perf issues would be to a
Str.subStrUtf8Bytes(or whatever similar name). That function would only verify the first character and last character in the substring are valid utf8. All characters in between are guaranteed to be valid substrings due to being part of the original string.
this is an interesting idea...the validation for the first byte would be trivial: "it's valid iff it's either less than 128 or else the start of a multibyte code point"
validating the end would be more work, because it would involve looking at up to 4 bytes
I bet we could load all 4 as a u32 and do a smart super quick processing.
a nice thing about that design is that since we're always looking at the bytes from the original Str, you can walk over those bytes and then do Str.subStrUtf8Bytes (or whatever) as you go, while all that memory is in cache
and then you never even need to build up an intermediate List of indices to split on later
you just build up the final List Str as you go, and they're all slices
Last updated: Jun 16 2026 at 16:19 UTC