By numbers parsing, you mean conversion to binary? I'd be a little hesitant about that specifically because it loses information about how the number was written - e.g. hex vs decimal. That's critical for formatting, and at least possibly useful for better error messages in some contexts. Lastly, it's extra work that the tokenizer doesn't really need to do (i.e. it can be done later instead, and isn't needed by all downstream consumers of tokens).
for context:
Hmm, I don't think this is the right approach
I agree operating on things in cache is important, but I think @Richard Feldman is over-indexing on that. One of the things that makes SIMDjson fast is delaying a bunch of processing until later, when it's actually needed - even if that means reloading the data into cache later.
Carbon-lang, which has some very ambitious perf targets, copies numbers and such out to separate buffers during tokenization (lexing), but doesn't do the actual conversion to binary until later.
Also, `extra` is IMO currently too large and needs to be trimmed down. Trying to store parsed 64-bit (or 128-bit!) values there will significantly increase memory usage and maybe even slow other things down.
At a minimum, this processing should be separable / configurable, perhaps via an interface (`anytype`?) passed to the tokenizer with a no-op impl we can use for the formatter. And I would do everything we can to not add to `extra`.
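Something like this minimal sketch of the idea in Zig - names like `NoOpSink` and `onNumber` are hypothetical, not the actual tokenizer API:

```zig
const std = @import("std");

// No-op sink for the formatter: the raw source bytes are all it needs.
const NoOpSink = struct {
    pub fn onNumber(_: *NoOpSink, _: []const u8) void {}
};

// Parsing sink for consumers that actually want the binary value.
const ParsingSink = struct {
    last: ?u64 = null,

    pub fn onNumber(self: *ParsingSink, bytes: []const u8) void {
        // only consumers that opt in pay for the conversion
        self.last = std.fmt.parseInt(u64, bytes, 10) catch null;
    }
};

// The tokenizer is generic over the sink, so the no-op version
// compiles down to nothing.
fn emitNumberToken(sink: anytype, bytes: []const u8) void {
    // ... record the token tag + region as usual ...
    sink.onNumber(bytes);
}
```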
I think you are likely right about most of this, especially given that individual hot loops that stay in SIMD are very powerful. It is important to remember that a branch miss can often be much worse than a cache miss.
But I would guess both can be made to work with great perf...I mean tokenization is such a small part of E2E time anyway.
Tokenization is currently more than half the time when benchmarking the parser. That’s non-trivial!
About a third of that time is spent interning identifiers currently.
I think this sounds right as a high level but is likely wrong in practice :smile:
the specific reason that the paper talks about for SIMDjson's speed is avoiding branch mispredictions
let's say I want to not do any processing and instead just store the number `12345` in a side table (like Carbon apparently does). what specifically happens to store those bytes in a side table?
we iterate through the digits in a `for` loop until we hit the end of the sequence, each time copying a byte over. there is 1 branch misprediction, at the end of the loop, when we hit the end.
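in code, that loop is roughly this (a sketch, not Carbon's actual implementation):

```zig
const std = @import("std");

// Copy the digit bytes of a number literal into a side buffer.
// The only data-dependent branch is the loop exit when digits run out.
fn copyDigits(src: []const u8, start: usize, dest: []u8) usize {
    var i: usize = 0;
    while (start + i < src.len and std.ascii.isDigit(src[start + i])) : (i += 1) {
        dest[i] = src[start + i]; // plain byte copy, no extra branches
    }
    return i; // ~1 misprediction total, at the loop exit
}
```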
suppose instead that I am processing `12345` from a string into a `u64` and then putting that in a side table. what does the CPU do differently?
again, it iterates through each digit, does some extra arithmetic (no impact on branch prediction and essentially free), then at the end of the loop hits the same 1 branch misprediction because we hit the end of the digits, and then copies the `u64` into the target side table, which does not involve branching
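sketched the same way, the parsing version has the identical loop shape:

```zig
const std = @import("std");

// Parse the same digit sequence into a u64 instead of copying it.
// The multiply-add per digit is branch-free, so the misprediction
// count matches the copy loop: ~1, at the same loop exit.
// (Overflow handling elided for the sketch.)
fn parseDigits(src: []const u8, start: usize) u64 {
    var acc: u64 = 0;
    var i: usize = start;
    while (i < src.len and std.ascii.isDigit(src[i])) : (i += 1) {
        acc = acc * 10 + (src[i] - '0');
    }
    return acc;
}
```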
so if the reason SIMDjson is fast is (as the paper says) because it minimizes branch mispredictions, and processing the digits into a `u64` involves the exact same number of branch mispredictions (namely, 1) as copying it into a side table instead...where specifically is the anticipated slowdown coming from?
alternatively, if the idea is to not copy anything into a side table (so, don't do what Carbon does) and instead just write down the `Region` for a later stage to use, now we're saying that it's a good idea to trade a branch misprediction in tokenization for a cache miss in a later stage? that definitely does not sound like it would improve performance :sweat_smile:
Joshua Warner said:
Tokenization is currently more than half the time when benchmarking the parser. That’s non-trivial!
Oh, I'm sure it will be important, but it is small potatoes compared to Can and type checking. At least last time I measured.
Btw, I watched this talk yesterday and must say the part about how to avoid mispredictions and branching in general (in particular in json parsing) is a great exercise for an array language :smile:
we're saying that it's a good idea to trade a branch misprediction in tokenization for a cache miss in a later stage? that definitely does not sound like it would improve performance
There are multiple other possible effects here.
Can needs to know because we report errors for out-of-bounds literals, which is where I first noticed this
Ah, interesting. Why do that in Can rather than after type checking? I assume you have to re-do that after type checking _anyway_, to handle cases where type checking refined the type to e.g. a U8 or whatever.
ah, it's just because we need to classify it as "this frac fits in F32" vs "this frac doesn't fit in F32" so that we can inform the type system, and then during type checking we can unify with the appropriate constraints
I guess another way to say it is that we're "generating the constraints" during canonicalization that type unification will later turn into errors (or not) during normal unification
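for a concrete picture, here's a hedged sketch of that classification (a hypothetical helper, not the actual compiler code): parse once at full width, then check whether narrowing round-trips losslessly.

```zig
const std = @import("std");

// Does this literal fit losslessly in an F32? Parse as f64, narrow
// to f32, widen back, and compare. Canonicalization could record the
// answer as a constraint for unification to check later.
fn fitsInF32(text: []const u8) bool {
    const wide = std.fmt.parseFloat(f64, text) catch return false;
    const narrow: f32 = @floatCast(wide);
    return @as(f64, narrow) == wide;
}
```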
this frac fits in F32
Don't all Fracs fit in an F32?
Or I guess essentially all Fracs?
Wait, frac = float? And you're deciding between f32 and f64? (essentially?)
`Frac` means one of `F32`, `F64`, or `Dec` (`Dec` is fixed-point, not floating-point, so the name "float" doesn't accurately describe it)
I don't feel good about replacing region info with numeric data in `extra`. the other options are always pushing this data to a separate collection (as with interned), or extending token size, which of course is a terrible idea.
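for reference, the trade-off under an assumed token layout (these structs are illustrative, not the real ones):

```zig
// Keeping tokens small is the point: widening every token to hold a
// 64/128-bit value bloats the hot token array, while a side collection
// only costs space for the tokens that actually need it.
const Token = struct {
    tag: u8,
    extra: u32, // region info, or an index into a side table, per tag
};

const NumberTable = struct {
    values: []u64, // parsed values live out-of-line, keyed by index
};
```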