Stream: contributing

Topic: Optimizing Tokenization vs Canonicalization


view this post on Zulip Joshua Warner (Jul 17 2025 at 17:01):

By numbers parsing, you mean conversion to binary? I'd be a little hesitant about that specifically because it loses information about how the number was written - e.g. hex vs decimal. That's critical for formatting, and at least possibly useful for better error messages in some contexts. Lastly, it's extra work that the tokenizer doesn't really need to do (i.e. it can be done later instead, and isn't needed by all downstream consumers of tokens).
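A minimal sketch of the alternative being argued for here (illustrative Rust; the `Token` and `Region` types are hypothetical, not the real representation): the tokenizer records only the source span, so the exact spelling - including radix - stays recoverable, and conversion to binary is deferred until a consumer actually needs it.

```rust
// Hypothetical sketch: a numeric token that keeps only its source span,
// so radix (hex vs decimal) and exact spelling survive for the formatter
// and error messages; conversion to binary happens in a later stage.
#[derive(Debug, Clone, Copy)]
struct Region { start: usize, end: usize }

#[derive(Debug)]
enum Token {
    IntLiteral(Region), // raw text location, not yet converted
}

fn literal_text<'a>(src: &'a str, r: Region) -> &'a str {
    &src[r.start..r.end]
}

fn main() {
    let src = "x = 0xFF";
    let tok = Token::IntLiteral(Region { start: 4, end: 8 });
    let Token::IntLiteral(r) = tok;
    let text = literal_text(src, r);
    assert_eq!(text, "0xFF"); // spelling (and radix) preserved
    // binary value computed only when a later stage wants it:
    let value = i64::from_str_radix(text.trim_start_matches("0x"), 16).unwrap();
    assert_eq!(value, 255);
}
```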

view this post on Zulip Kiryl Dziamura (Jul 17 2025 at 17:03):

for the context:

#compiler development > casual conversation @ 💬
#contributing > Worklog (Draft PRs and coordination) @ 💬

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:18):

Hmm, I don't think this is the right approach

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:32):

I agree operating on things in cache is important, but I think @Richard Feldman is over-indexing on that. One of the things that makes SIMDJson fast is delaying a bunch of processing until later when it's actually needed - even if that means reloading that from cache.

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:33):

Carbon-lang, which has some very ambitious perf targets, copies out numbers and such to separate buffers during tokenization (lexing), but doesn't do the actual conversion to binary until later.

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:35):

Also, extra is IMO currently too large and needs to be trimmed down. Trying to store parsed 64-bit (or 128-bit!) values there will significantly increase memory usage and maybe even slow other things down.

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:38):

At a minimum, this processing should be separable / configurable, perhaps via an interface (anytype?) passed to the tokenizer with a no-op impl we can use for the formatter. And I would do everything we can to not add to extra
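One way the "interface passed to the tokenizer" idea could look (a hypothetical Rust sketch using a trait where the Zig version would use `anytype`; all names here are invented): the tokenizer reports each literal to a sink, and the formatter's sink is a no-op, so it pays nothing for value conversion.

```rust
// Hypothetical sketch: a sink trait the tokenizer calls for each literal,
// with a no-op impl for consumers (like the formatter) that only need
// token kinds and regions.
trait LiteralSink {
    fn int_literal(&mut self, text: &str);
}

// No-op impl: the tokenizer does zero conversion work for this consumer.
struct NoOpSink;
impl LiteralSink for NoOpSink {
    fn int_literal(&mut self, _text: &str) {}
}

// Eagerly-parsing impl, for consumers that do want binary values.
struct ParsingSink { values: Vec<i64> }
impl LiteralSink for ParsingSink {
    fn int_literal(&mut self, text: &str) {
        self.values.push(text.parse().unwrap());
    }
}

fn tokenize(src: &str, sink: &mut impl LiteralSink) {
    // Toy "tokenizer": every whitespace-separated digit run is an int literal.
    for word in src.split_whitespace() {
        if !word.is_empty() && word.bytes().all(|b| b.is_ascii_digit()) {
            sink.int_literal(word);
        }
    }
}

fn main() {
    let mut parsing = ParsingSink { values: Vec::new() };
    tokenize("1 22 333", &mut parsing);
    assert_eq!(parsing.values, vec![1, 22, 333]);

    let mut noop = NoOpSink;
    tokenize("1 22 333", &mut noop); // no conversion work done
}
```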

view this post on Zulip Brendan Hansknecht (Jul 17 2025 at 17:40):

I think you are likely right about most of this, especially given that individual hot loops that stay in SIMD are very powerful. It is important to remember that a branch miss can often be much worse than a cache miss.

view this post on Zulip Brendan Hansknecht (Jul 17 2025 at 17:46):

But I would guess both can be made to work with great perf...I mean tokenization is such a small part of E2E time anyway.

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:48):

Tokenization is currently more than half the time when benchmarking the parser. That’s non-trivial!

view this post on Zulip Joshua Warner (Jul 17 2025 at 17:50):

About a third of that time is spent interning identifiers currently.

view this post on Zulip Richard Feldman (Jul 17 2025 at 17:59):

I think this sounds right at a high level but is likely wrong in practice :smile:

the specific reason that the paper talks about for SIMDjson's speed is avoiding branch mispredictions

let's say I want to not do any processing and instead just store the number 12345 in a side table (like Carbon apparently does). what specifically happens to store those bytes in a side table?

we iterate through the digits in a for loop until we hit the end of the sequence, each time copying a byte over. there is 1 branch misprediction, at the end of the loop, when we hit the end.

suppose instead that I am processing 12345 from a string into an u64 and then putting that in a side table. what does the CPU do differently?

again, it iterates through each digit, does some extra arithmetic (no impact on branch predictions and essentially free), then at the end of the loop, hits the same 1 branch misprediction because we hit the end of the digits, and then copies the u64 into the target side table, which does not involve branching
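The two variants above can be sketched side by side (an illustrative Rust sketch, not the actual tokenizer code): both loops exit on the same end-of-digits condition, so each incurs the same single misprediction at the end, and the parsing version only adds branch-free arithmetic per digit.

```rust
// Variant 1: copy the digit bytes into a side table (the Carbon-style approach).
fn copy_digits(bytes: &[u8], side_table: &mut Vec<u8>) -> usize {
    let mut i = 0;
    // one mispredicted branch: the loop exit when digits run out
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        side_table.push(bytes[i]); // byte copy, no extra branching
        i += 1;
    }
    i
}

// Variant 2: fold the digits into a u64 and store that instead.
fn parse_digits(bytes: &[u8], side_table: &mut Vec<u64>) -> usize {
    let mut i = 0;
    let mut n: u64 = 0;
    // same single mispredicted branch at loop exit
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        n = n * 10 + (bytes[i] - b'0') as u64; // arithmetic only, branch-free
        i += 1;
    }
    side_table.push(n); // copying the u64 out involves no branching
    i
}

fn main() {
    let src = b"12345;";
    let mut raw = Vec::new();
    assert_eq!(copy_digits(src, &mut raw), 5);
    assert_eq!(raw, b"12345");

    let mut vals = Vec::new();
    assert_eq!(parse_digits(src, &mut vals), 5);
    assert_eq!(vals, vec![12345u64]);
}
```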

view this post on Zulip Richard Feldman (Jul 17 2025 at 17:59):

so if the reason SIMDjson is fast is (as the paper says) because it minimizes branch mispredictions, and processing the digits into u64 involves the exact same number of branch mispredictions (namely, 1) as copying it into a side table instead...where specifically is the anticipated slowdown coming from?

view this post on Zulip Richard Feldman (Jul 17 2025 at 18:02):

alternatively, if the idea is to not copy anything into a side table (so, don't do what Carbon does) and instead just write down the Region for a later stage to use, now we're saying that it's a good idea to trade a branch misprediction in tokenization for a cache miss in a later stage? that definitely does not sound like it would improve performance :sweat_smile:

view this post on Zulip Brendan Hansknecht (Jul 17 2025 at 18:03):

Joshua Warner said:

Tokenization is currently more that half the time when benchmarking the parser. That’s non-trivial!

Oh, I'm sure it will be important, but it is small potatoes compared to can and type checking. At least last time I measured

view this post on Zulip Kiryl Dziamura (Jul 17 2025 at 19:34):

Btw, I watched this talk yesterday and must say the part about how to avoid mispredictions and branching in general (in particular in json parsing) is a great exercise for an array language :smile:

view this post on Zulip Joshua Warner (Jul 17 2025 at 20:22):

we're saying that it's a good idea to trade a branch misprediction in tokenization for a cache miss in a later stage? that definitely does not sound like it would improve performance

There are multiple other possible effects here:

  1. Depending on tooling setup/etc, tokenization can easily happen much more frequently than canonicalization, because of things like format-on-save and possible later optimizations like (in a language server) realizing that all the tokens are the same and so we don't need to re-run the parser or canonicalization/etc.
  2. You might cache miss _regardless_ in that later stage. Perhaps not on that exact memory load, but an independent/neighboring one effectively covers this one.
  3. You're also paying a cost to copy this information around through the parser, can, etc - none of which it's relevant for. Neither the parser nor Can needs to know the value of an int/string/etc to do its job. An extra 64-bit value in the parse tree is a non-trivial extra bit of data to store and copy, when multiplied by a large number of tokens. In contrast, basically everything needs to keep the range around for error reporting, so if we just use that it's free.
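Point 3 can be illustrated with a toy size comparison (hypothetical Rust layouts; the field names are invented and this is not the real token representation): the region is carried anyway for error reporting, so reusing it is free, while an eagerly-parsed 64-bit value grows every token.

```rust
// Toy comparison: a token that reuses its region vs one that also
// carries an eagerly-parsed 64-bit value. Layouts are illustrative only.
use std::mem::size_of;

struct Region { start: u32, end: u32 }

struct LeanToken {
    kind: u8,
    region: Region, // needed anyway for error reporting
}

struct FatToken {
    kind: u8,
    region: Region,
    value: u64, // eagerly parsed literal value
}

fn main() {
    // The fat token is strictly larger; multiplied across a large number
    // of tokens, that's real memory traffic.
    assert!(size_of::<FatToken>() > size_of::<LeanToken>());
    println!(
        "lean: {} bytes, fat: {} bytes",
        size_of::<LeanToken>(),
        size_of::<FatToken>()
    );
}
```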

view this post on Zulip Notification Bot (Jul 17 2025 at 20:23):

17 messages were moved here from #contributing > Pull Request for Review by Joshua Warner.

view this post on Zulip Richard Feldman (Jul 17 2025 at 21:42):

Can needs to know because we report errors for literals that are out of bounds, which is where I first noticed this

view this post on Zulip Joshua Warner (Jul 17 2025 at 22:19):

Ah, interesting. Why do that in Can rather than after type checking? I assume you have to re-do that after type checking _anyway_, to handle cases where type checking refined the type to e.g. a U8 or whatever.

view this post on Zulip Richard Feldman (Jul 18 2025 at 01:08):

ah, it's just because we need to classify it as like "this frac fits in F32" vs "this frac doesn't fit in F32" so that we can inform the type system, so that during type checking we can unify with the appropriate constraints
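One way that classification could work (a hedged Rust sketch; the real check lives in canonicalize.zig and may differ): parse the literal as f64 and see whether it survives a round trip through f32 unchanged.

```rust
// Illustrative "does this frac fit in F32" check: a literal fits if
// narrowing its f64 value to f32 and widening back loses nothing.
fn fits_in_f32(literal: &str) -> bool {
    let as_f64: f64 = literal.parse().unwrap();
    (as_f64 as f32) as f64 == as_f64
}

fn main() {
    assert!(fits_in_f32("0.5"));        // exactly representable in f32
    assert!(!fits_in_f32("0.1"));       // f32 and f64 round 0.1 differently
    assert!(fits_in_f32("16777216.0")); // 2^24, still exact in f32
}
```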

view this post on Zulip Richard Feldman (Jul 18 2025 at 01:08):

I guess another way to say it is that we're "generating the constraints" during canonicalization that type unification will later turn into errors (or not) during normal unification

view this post on Zulip Brendan Hansknecht (Jul 18 2025 at 01:45):

this frac fits in F32

Don't all Fracs fit in an F32?

view this post on Zulip Brendan Hansknecht (Jul 18 2025 at 01:46):

Or I guess essentially all Fracs?

view this post on Zulip Joshua Warner (Jul 18 2025 at 01:55):

Wait, frac = float? And you're deciding between f32 and f64? (essentially?)

view this post on Zulip Richard Feldman (Jul 18 2025 at 02:18):

Frac means one of F32, F64, or Dec

(Dec is fixed-point, not floating-point, so the name "float" doesn't accurately describe it)

view this post on Zulip Richard Feldman (Jul 18 2025 at 02:20):

https://github.com/roc-lang/roc/blob/716f2294dbcd8777055ba3886efc1249e55630d3/src/check/canonicalize.zig#L3371

view this post on Zulip Kiryl Dziamura (Jul 18 2025 at 10:50):

I don't feel good about replacing region info with numeric data in extra. The other options are always pushing this data to a separate collection (as with interned identifiers), or extending the token size, which of course is a terrible idea.


Last updated: Jul 26 2025 at 12:14 UTC