At some point we need to really investigate closure data copying in Roc, especially to compare it to either a manual implementation in C/Zig or to Rust async (or similar).
I am still kinda stunned that in an example like rocci-bird, essentially all of the code size seems to come from simply copying data from one closure to the next. Which probably also means that a significant portion of the runtime is just copying data between closures.
It fundamentally feels off to me. Maybe we are simply generating code in a way that LLVM can't optimize. Maybe we need to unify our closure captures differently to reduce data movement. I'm really not sure, but this really feels off to me.
I wonder how hard it would be to start passing them around by reference if they're over a certain size
I don't think that's the issue.
I think the issue is that essentially every new await captures slightly different data (because of new local variables). Instead of having space and just adding the single new variable to the closure capture, we create a totally new closure capture with a slightly different layout.
So like:
```roc
state = SomeBigRecord
# captures {state}
clickedMouse = getMouse!
# captures {state, clickedMouse}
pressedW = getKey! 'w'
activateAbility = calc state pressedW
if clickedMouse && activateAbility then
    # captures {state, activateAbility}
    changeColorPalette!
nextState =
    if activateAbility then
        # something that uses state ...
    else
        # something that uses state ...
# captures {nextState}
displayState! nextState
Task.ok nextState
```
Optimally, this would be one shared closure capture. It would have enough space for everything and state would never move around.
Instead we make 4 unique closure captures: tons of copying and data movement.
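To make the copying concrete, here is a rough Rust sketch of the situation described above (all struct and field names are made up for illustration; this is not Roc's actual codegen): four slightly different capture layouts, one per await point, so the big record gets copied at every step.

```rust
// 128-byte record standing in for SomeBigRecord.
#[derive(Clone, Copy, PartialEq, Debug)]
struct BigState {
    data: [u64; 16],
}

// One capture struct per await point, each with a slightly different layout.
struct Cap1 { state: BigState }
struct Cap2 { state: BigState, clicked_mouse: bool }
struct Cap3 { state: BigState, activate_ability: bool }
struct Cap4 { next_state: BigState }

// Threading `state` through the chain copies the full record every time.
fn run(cap1: Cap1) -> Cap4 {
    let cap2 = Cap2 { state: cap1.state, clicked_mouse: true }; // copy #1
    let cap3 = Cap3 { state: cap2.state, activate_ability: cap2.clicked_mouse }; // copy #2
    Cap4 { next_state: cap3.state } // copy #3
}

fn main() {
    let s = BigState { data: [7; 16] };
    let out = run(Cap1 { state: s });
    // Three full copies of a 128-byte record just to carry it forward.
    assert_eq!(out.next_state, s);
}
```

If LLVM fails to collapse these moves, both code size and runtime go into `memcpy`-like work that a single shared layout would avoid entirely.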
ahhhh interesting
not saying we should do this, but if we had nested closures capture strictly more as they got more and more nested, then:
and that would mainly affect tasks, parsers, etc.
Due to each closure capture being its own alloca and everything being nested, I think we also have all 4 closure captures alive for a really long time (so much more memory use than sharing, even in the naive way).
sure
but some amount of this is inherent to async I/O
like the state in between operations has to be saved somewhere, and that either results in a big chunk of memory that lives a long time, or else copying between minimally-sized chunks of memory
I guess there could be some cleverness to try to reduce that, e.g. choosing layouts where values that refer to the same thing will be in the same place from one capture to the next, so if we're able to reuse that memory then we don't need to copy those particular fields because they're already in the right place
I don't know how hard of a problem that is to automate (seems like the type of thing where there's an off-the-shelf algorithm somewhere which optimizes it) but it could be a helpful exercise to try to find a manual organization of the rocci bird captures that would minimize copying and memory usage
like if you were doing all of it by hand in Zig or something, what's the most efficient layout and copying strategy you could come up with
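As a sketch of what that hand-written version might look like (in Rust rather than Zig, with hypothetical names): one long-lived frame whose field offsets never change across await points, so each step writes only its new fields and the big record never moves.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct BigState {
    data: [u64; 16], // 128 bytes standing in for SomeBigRecord
}

// A single frame with room for everything the whole chain will ever capture.
struct Frame {
    state: BigState,        // written once, stays at a fixed offset
    clicked_mouse: bool,    // filled in after the first await
    activate_ability: bool, // filled in after the second await
}

fn after_get_mouse(f: &mut Frame, clicked: bool) {
    f.clicked_mouse = clicked; // 1-byte write; `state` is untouched
}

fn after_get_key(f: &mut Frame, pressed_w: bool) {
    // Reads `state` in place, writes only the small new field.
    f.activate_ability = f.state.data[0] > 0 && pressed_w;
}

fn main() {
    let mut f = Frame {
        state: BigState { data: [3; 16] },
        clicked_mouse: false,
        activate_ability: false,
    };
    let before = f.state;
    after_get_mouse(&mut f, true);
    after_get_key(&mut f, true);
    assert_eq!(f.state, before); // the big record never moved or copied
    assert!(f.clicked_mouse && f.activate_ability);
}
```

The trade-off is exactly the one mentioned above: the frame is as large as the union of all captures and lives for the whole chain, but in exchange there is essentially zero copying.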
For sure, some of this is inherent. That said, I think we currently use more memory overall and do a bunch of copying.
But yeah, let me try to port it to Zig while still using an async IO style with capturing lambdas. Not sure when I will have time, but it should be a really good exercise in what we could theoretically generate.
As a note, I think the "Layout variants" section of this post on Rust async/await is the optimization we are fundamentally missing:
https://docs.roblab.la/asyncawait/posts/optimizing-await-1/
ahh yes!
that's exactly the optimization I was talking about
with this part:
Richard Feldman said:
> I guess there could be some cleverness to try to reduce that, e.g. choosing layouts where values that refer to the same thing will be in the same place from one capture to the next, so if we're able to reuse that memory then we don't need to copy those particular fields because they're already in the right place
so I guess not only is it optimizable, but rustc already implemented it! :laughing:
Yeah, in the Roc context it would mean collapsing the state of nested closures that tail-call each other, which is most often seen in Task.await, but also Result.try and similar.
In current Roc, LLVM often manages to inline these chains, but it still creates a state for every single closure and copies around tons of data.
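Loosely, the "variant" layout idea from that post looks like this in Rust (names invented for illustration): capture states that are never live at the same time overlap in memory, like enum variants, instead of each getting its own alloca.

```rust
#[derive(Clone, Copy)]
struct BigState {
    data: [u64; 16], // 128 bytes
}

// One overlapped state machine instead of three separate capture structs.
// Only one variant is live at a time, so they can share storage.
enum StepState {
    AwaitingMouse { state: BigState },
    AwaitingKey { state: BigState, clicked_mouse: bool },
    Done { next_state: BigState },
}

fn main() {
    // The enum is roughly the size of its largest variant (plus a tag),
    // not the sum of all three variants' payloads.
    assert!(std::mem::size_of::<StepState>() < 3 * std::mem::size_of::<BigState>());
    let _start = StepState::AwaitingMouse { state: BigState { data: [0; 16] } };
}
```

If the compiler additionally places `state` at the same offset in every variant, transitioning between steps only writes the tag and the new small fields, which is the "same value, same place" property discussed above.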
I learned some new things about this. First of all, the original post is https://tmandry.gitlab.io/blog/posts/optimizing-await-1/ and it has some follow-up posts.
Also, this problem is still not fully solved in the rust compiler. The latest attempt is https://github.com/rust-lang/rust/pull/120168 which seems like it's another step in the right direction, but notes that there are still cases that are not covered.
what's upvar?
not much how bout u
Upvar is a variable captured by a closure
There are a couple ideas to resolve this that would likely work well.
One is to more aggressively specialize unique lambdas to the same type. Part of the issue is that lambda sets monomorphize very aggressively, so two closures may have similar (but not the same) layouts and be forced into different types. This can be seen, for example, with nested Task.await calls, where each nested Task.await is a separate closure. It may be better to force all nested calls to be the same closure type.
Another is to store closures that are additive as a linked list. If you indicate in the closure type the name of a symbol that the closure references (which is already done, at least at the type-checking level), you can avoid unpacking and repacking closure sets by instead representing each closure set as a linked list of records. When you need to add new data, append to the linked list. This used to be an approach for closure compilation, though I don't know how popular it is today.
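A minimal sketch of that linked-list idea (in Rust, with invented names): each new capture is a small record pointing at the previous one, so adding a variable appends a node instead of unpacking and repacking the whole capture.

```rust
use std::rc::Rc;

// Each capture extension is a cons cell; earlier captures are shared
// by reference rather than copied into a new flat layout.
enum Capture {
    Nil,
    Cons {
        name: &'static str,
        value: u64,
        rest: Rc<Capture>,
    },
}

// Adding a variable is O(1): allocate one node, point at the old list.
fn extend(rest: Rc<Capture>, name: &'static str, value: u64) -> Rc<Capture> {
    Rc::new(Capture::Cons { name, value, rest })
}

// Lookup walks the list from newest to oldest capture.
fn lookup(cap: &Capture, want: &str) -> Option<u64> {
    match cap {
        Capture::Nil => None,
        Capture::Cons { name, value, rest } => {
            if *name == want { Some(*value) } else { lookup(rest, want) }
        }
    }
}

fn main() {
    let c0 = Rc::new(Capture::Nil);
    let c1 = extend(c0, "state", 42);
    let c2 = extend(c1.clone(), "clickedMouse", 1); // c1 is shared, not copied
    assert_eq!(lookup(&c2, "state"), Some(42));
    assert_eq!(lookup(&c1, "clickedMouse"), None);
}
```

The trade-off is indirection on every capture access instead of copying on every capture creation, which is why flat layouts won out in most modern compilers.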
Yeah, I think the nested closures being the same type would be a huge win and fix this issue for the most part.
Last updated: Jul 06 2025 at 12:14 UTC