I wanted to lay out a goal - not something we need to pursue right away, necessarily! - but rather a place I'd like us to end up when it comes to Regions
this is not about what their numbers represent (e.g. line numbers, indices, whatever) but rather how they're stored
this is also starting from Parse AST nodes - tokenization is out of scope (and decoupled from this)
so let's say we're parsing some tokens and we end up with this memory layout:
Idx:     0   1   2
AST:     {…} {…} {…}
Regions: {…} {…} {…}
this means we've parsed 3 AST nodes, and we have a Region for each of them
if I have an Idx, I can look up the AST node at that index, and also I can look up its Region
they're just stored in different arrays
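to make that concrete, here's a rough Zig sketch (type names and fields made up for illustration, not our actual code):

    const std = @import("std");

    const Region = struct { start: u32, end: u32 };
    const AstNode = struct { tag: u8, data: u32 }; // placeholder payload

    const Idx = u32;

    const ParseResult = struct {
        ast: std.ArrayListUnmanaged(AstNode) = .{},
        regions: std.ArrayListUnmanaged(Region) = .{},

        // appending to both arrays in lockstep keeps them index-aligned,
        // so one Idx addresses both the node and its Region
        fn push(self: *ParseResult, gpa: std.mem.Allocator, node: AstNode, region: Region) !Idx {
            const idx: Idx = @intCast(self.ast.items.len);
            try self.ast.append(gpa, node);
            try self.regions.append(gpa, region);
            return idx;
        }

        fn regionAt(self: *const ParseResult, idx: Idx) Region {
            return self.regions.items[idx];
        }
    };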
now let's say I'm doing canonicalization, and - in this world, although we don't have it today! - we have made it the case that there is exactly one CIR node for each AST node. They are exactly 1-to-1.
Now we've added 3 CIR nodes to the mix, one for each AST node, and memory looks like this:
Idx:     0   1   2
AST:     {…} {…} {…}
CIR:     {…} {…} {…}
Regions: {…} {…} {…}
importantly, I didn't have to touch Regions to do this
because we made AST and CIR 1-to-1, the entire canonicalization pass was able to just look at AST nodes and make CIR nodes, and in the end we automatically still have the property that, given an Idx for a CIR node, I can get its Region - even though we didn't have to do any work to copy Region info over or anything
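continuing the sketch from above, the whole canonicalization pass would only read AST nodes and append CIR nodes - the regions array never comes into it (again, hypothetical types):

    const CirNode = struct { tag: u8, data: u32 }; // placeholder payload

    fn canonicalize(gpa: std.mem.Allocator, ast: []const AstNode) !std.ArrayListUnmanaged(CirNode) {
        var cir: std.ArrayListUnmanaged(CirNode) = .{};
        try cir.ensureTotalCapacity(gpa, ast.len);
        for (ast) |node| {
            // each CIR node lands at the same Idx as the AST node it came from,
            // so that Idx still works for the untouched regions array
            cir.appendAssumeCapacity(.{ .tag = node.tag, .data = node.data });
        }
        return cir;
    }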
now we get to type-checking
it's already becoming the case (although we're not there yet) that there is one CIR node per Type Var
and if we use the same indexing scheme for those, then we get:
Idx:     0   1   2
AST:     {…} {…} {…}
CIR:     {…} {…} {…}
Types:   {…} {…} {…}
Regions: {…} {…} {…}
again, no Region copying was necessary
we didn't need to touch it, we just introduced the types and the indices just work out that we still have the same Regions we initially set up during parsing
and when we report type mismatch error messages, we have the region info easily accessible - just take the Idx of the type (aka the type var) and go look up the region, and that's the source location
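in sketch form (reusing the hypothetical types from above), reporting a mismatch is just two array lookups:

    const TypeVar = Idx;

    fn reportMismatch(regions: []const Region, a: TypeVar, b: TypeVar) void {
        // a type var is just an index, so its Region is a direct lookup
        const ra = regions[a];
        const rb = regions[b];
        std.debug.print(
            "type mismatch between bytes {d}..{d} and {d}..{d}\n",
            .{ ra.start, ra.end, rb.start, rb.end },
        );
    }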
there is one more case here, which is that during type-checking, we sometimes need to introduce new types - for example, when we do a polymorphic function call, we have to copy the function's type and then mutate the copy based on how it was called (e.g. if the function was Num(a), Num(a) -> Num(a), we need to be able to call it using an F32 as well as an I64, and each of those calls needs to make a separate copy of the type so that we can infer the return value at the call site correctly)
and when we're doing that, we still want the regions to refer back to the original source code location if there's an error
so at that point, we would actually finally make some new Regions, one per new type added during checking - e.g.
Idx:     0   1   2   3   4
AST:     {…} {…} {…}
CIR:     {…} {…} {…}
Types:   {…} {…} {…} {…} {…}
Regions: {…} {…} {…} {…} {…}
importantly, we don't need to go back and make new CIR nodes or AST nodes for this, it's just about types and regions
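sketched out (same hypothetical types as before), instantiating a polymorphic type would append one new type and one copied Region, and nothing else:

    const Type = struct { tag: u8, data: u32 }; // placeholder payload

    fn freshInstance(
        gpa: std.mem.Allocator,
        types: *std.ArrayListUnmanaged(Type),
        regions: *std.ArrayListUnmanaged(Region),
        original: TypeVar,
    ) !TypeVar {
        const idx: TypeVar = @intCast(types.items.len);
        // copy to locals first so a resize during append can't invalidate them
        const type_copy = types.items[original];
        const region_copy = regions.items[original];
        try types.append(gpa, type_copy); // fresh copy, to be mutated at the call site
        try regions.append(gpa, region_copy); // same source location as the original
        return idx;
    }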
so this is about as memory-efficient as I think we can get things, and it's pretty simple!
there is one more potential trick we could do, although I'm not 100% sure it will work (but I suspect it will), which is that if we have a 1-to-1 correspondence between AST nodes and CIR nodes, and we never use AST nodes again after canonicalization...
we could just have the canonicalization pass mutate the AST nodes in-place rather than creating new CIR nodes and then discarding the old AST nodes
like I said, I'm not 100% sure we could/should do that, but it's worth noting that it's something we could try too
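if we did try it, the shape might be something like this (very much unverified - assumes AstNode and CirNode can share a slot, e.g. via an untagged union):

    const Node = union {
        ast: AstNode,
        cir: CirNode,
    };

    fn canonicalizeInPlace(nodes: []Node) void {
        for (nodes) |*node| {
            // read the AST payload, then overwrite the same slot with the CIR payload
            const ast_node = node.ast;
            node.* = .{ .cir = .{ .tag = ast_node.tag, .data = ast_node.data } };
        }
    }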
anyway, the main thing here (like the title of the topic says) is regions!
I think if we can organize things such that regions are laid out in memory this way, it'll make a lot of things simpler and nicer and faster
any thoughts on this welcome!
Out of curiosity:
so from there on we'll need to do something else
Is it right to say the benefit here is that we reduce the size of our AST and CIR nodes by 64 bits, moving that memory off to the side so that it doesn't need to be touched until we are generating error reports or debug info like s-expressions?
So improving the likelihood that the IR for a module will fit completely in a smaller cache?
And if we ever achieved the modify in place idea, then we would be effectively halving the memory used again.
i was actually already planning on doing the first thing here in ParseAST!
not mainly for memory savings but because i was sick of all the region extracting code we have for every node type
but this could really reduce the node size and fit way more nodes per cache line
Is there a good way to measure/benchmark this? I'd be super interested to see the difference between our current implementation and after we make this change.
partly the benefits are reducing size of nodes, but also the lack of copying means that there's less memory traffic to/from the cache
like if I want to copy a Region from AST to CIR node, I need the source Region's address in cache, and also the destination's address in cache
so if we can skip the copy, canonicalization doesn't need either of those in cache
so the savings in terms of cache impact during canonicalization are actually @sizeOf(Region) * 2 per node
because the AST's Region can just get evicted from cache for other things and never needs to be brought back in, and there is no such thing as "CIR's Region" that needs to get brought in
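to put made-up numbers on it: if Region is 8 bytes and a module parses to 100,000 nodes, skipping the copy keeps 8 * 2 * 100,000 = 1.6 MB of reads and writes out of the cache during canonicalization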
is there an argument that during parsing this could cause more cache eviction of the node array since we are switching to writing to a different region (pun intended) of memory for adding the region? i guess the same could be said of extra data
But we'll see. It'd be nice if we had a ready-made benchmark set up to run when we make a change like this
Technically yes, depending on usage pattern, but I would guess it's quite unlikely to be a problem.
I don't think so
we're writing the same total number of bytes during parsing, and it's all being written sequentially (appending to arrays) so it probably doesn't matter much whether it's split between 2 address ranges or 1
Luke Boswell said:
Is there a good way to measure/benchmark this? I'd be super interested to see the difference between our current implementation and after we make this change.
Cache misses can be benchmarked like here
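on Linux, for example, perf can count them directly (binary name and args made up here):

    perf stat -e cache-references,cache-misses ./roc check examples/hello.roc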