Just curious if anyone has ideas of what might be causing this. I add a few fields to a struct and suddenly roc is trying to access invalid memory and crashing.
is it necessarily the size or maybe the layout? Could try adding some extra unused fields in different places, or changing the types of the added fields to see if it still reproduces
That's fair. My gut feeling is that it is broken either way, but one version is exposing the error.
I'll test a few layouts/extra fields.
Interestingly, it fails before even running that function. So that suggests the real issue is with the tag union that it is a part of.
Model is:
Model : [
TitleScreen TitleScreenState,
Game GameState,
GameOver GameOverState,
]
and gets boxed
I am gonna guess the root of the bug is in one of these crazy long chains of data movement. IIUC, this is just copying data from one lambda capture to another (with of course a few small effects)
https://gist.github.com/bhansconnect/e10db8aafb8250740bbff22a18d58d78
basically the entire IR is just this repeated a bunch of times, and this is the optimized IR. So I guess all this app does is copy data between captures while occasionally performing effects.
@Folkert de Vries or @Ayaz Hafiz any thoughts on trying to debug or work around something like this?
This is the full.ll file: https://gist.github.com/bhansconnect/abdfa39c195482ca502555c0fc8e6306
full mono after refcount: https://gist.github.com/bhansconnect/995ea9f0d4f5e9c57aac0f8d0c9b4a5d
definitely some pretty intense captures:
procedure : `w4.Effect.always` [C {{U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {U32, I32}}, {Int1, Int1, Int1, Int1, Int1, Int1}} {{}, {}}, C [C {U64, U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {I32, I32}, {Float32, Float32}, Int1, U8}, C {U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {I32, I32}, {Float32, Float32}, U8}, C {U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {U32, I32}}]]
Ok, so I found a fix. Apparently Roc is assuming certain memory is zero when it actually isn't. So by always zeroing allocated memory, the issue is fixed (that said, I also made a few other allocator cleanups; they may have been partial fixes)
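For reference, the shape of the workaround is roughly this (a minimal Rust sketch; the `roc_alloc` signature is the standard Roc host ABI, but the malloc-based body is illustrative, since the actual roc-wasm4 host is a small bump allocator written in Zig):

```rust
use core::ffi::c_void;

// Zero-fill allocations before handing them to Roc, since the crash came
// from Roc assuming freshly allocated memory was already zeroed.
#[no_mangle]
pub unsafe extern "C" fn roc_alloc(size: usize, _alignment: u32) -> *mut c_void {
    let ptr = libc::malloc(size);
    if !ptr.is_null() {
        // The actual fix: zero the memory before Roc sees it.
        core::ptr::write_bytes(ptr as *mut u8, 0, size);
    }
    ptr
}
```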
The only problem is that the fixed binary is way too large. The limit is 64KB; it is 124KB.
The actual roc wasm emitted is 644KB, but apparently most of that is dead code that zig removes when linking.
Essentially all of that code is just copying closure captures one struct field at a time. Maybe we need to change those to memcpys, because they generate way too much code. Though there is probably a much smarter solution that avoids these copies altogether and instead makes closure captures reuse space.
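To make the contrast concrete, here is a minimal Rust sketch (the capture struct is hypothetical) of the two code shapes: per-field copies grow linearly with field count, while a bulk copy stays constant-size:

```rust
// Hypothetical stand-in for a closure capture struct.
#[derive(Clone, Copy)]
struct Capture {
    a: u64,
    b: u64,
    c: [u32; 4],
    d: bool,
}

// Field-by-field, the shape the generated code currently has: one load/store
// pair per leaf field, so code size grows with the number of fields.
fn copy_fields(dst: &mut Capture, src: &Capture) {
    dst.a = src.a;
    dst.b = src.b;
    dst.c = src.c;
    dst.d = src.d;
}

// Bulk copy: constant code size no matter how many fields the capture has;
// for large structs this lowers to a single memcpy.
fn copy_bulk(dst: &mut Capture, src: &Capture) {
    *dst = *src;
}
```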
The other decent cost in binary size is that constant lists are built via store instructions instead of just being actual constants in the binary.
This is probably indirectly my fault. I believe that due to surgical linker limitations we started doing this as a workaround. Technically we only need to do it for data that contains pointers, but I think we do it for all lists currently. I guess that would suggest using strings instead of lists for the constants would be more efficient as a workaround. (though it isn't guaranteed valid utf8...sooo hmm)
Oh actually I guess we had an exception for at least integer lists, but it was removed due to a morphic bug: https://github.com/roc-lang/roc/blob/ae0e3593a405e77a039fc10c613eb45ef83b8ab5/crates/compiler/gen_llvm/src/llvm/build.rs#L2980-L2982
I haven't dug in yet, will do so later, but thought I would just ask here first. Anyone know where the code is for copying data into and out of captures in the LLVM backend? I want to try converting it to memcpy the data around if possible.
captures don't exist at that point, it's just data
If it is just data, I guess that means we really need to optimize how we copy around structs. For example, if we are loading a large (greater than 2 usizes) struct from another struct, we are just gonna put it on the stack anyway and pass it around by reference, so we probably should just load and pass around a pointer to the field instead of actually loading the struct to somewhere else.
On top of that, instead of copying each individual sub-field one at a time, we should either be copying in larger chunks or just calling memcpy (though preferably we would avoid the data copy altogether in a number of these cases). We probably need a way to reuse closure storage space to avoid the copying entirely.
By removing 4 sprites from the game state and instead reloading them on every frame, I cut the generated executable size down by 64KB. This is not a large game. That is literally half of the executable size including the 8KB used for allocations and all of the zig host.
I guess that is 9 fewer load and store instructions per sprite per capture.
estimated load store count and executable bloat
seems like a good next step would be to do the reference and memcpy optimizations, because those would have benefits in the shared case even if we also implemented in-place updates for unique structs on the stack
Quick question about lambda sets: do they unify all closure captures in a chain? I would assume yes, they have to, because the platform only sees closure captures as a statically sized byte allocation.
We probably have a lot of waste that is just copying the exact same (only occasionally modified) data from the input closure capture to the output closure capture. For example, captured in essentially every task is the previous version of the state. Even if we don't do any sort of true in-place updates, we probably need to find a way to recognize that the exact same large chunk of data is captured by many lambdas in a row and avoid ever moving it at all. This will probably require some form of boxing such that we can just give a pointer to the platform, but we should definitely think more about the ramifications of large closure captures.
Any sort of stateful application that uses tasks will end up capturing the entire state in every single task closure. I guess we will need to do some performance measurements of memcpy vs malloc. Is it better to implicitly box a 100-byte struct in a capture or just pay the cost of copying it around over and over again?
It may turn out that we want boxed lambda sets such that the capture can be reused between calls (always the same size in bytes), but we only end up passing around a single pointer instead of giant captures. Kinda in between current lambda sets and the option to use erased closures that are individually dynamically allocated.
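A rough Rust sketch of the trade-off being described (names hypothetical): copying pays a full-size move per step of the chain, while boxing pays one allocation up front and then moves only a pointer:

```rust
// Hypothetical 100-byte state captured by every task in a chain.
struct State {
    bytes: [u8; 100],
}

// Unboxed capture: every step of the task chain re-copies all 100 bytes
// into the next closure's capture struct.
fn step_by_copy(state: State) -> State {
    state // a 100-byte move per step
}

// Boxed capture: one heap allocation up front; each subsequent step moves
// a pointer (4 bytes on wasm32) instead of the whole struct.
fn step_by_box(state: Box<State>) -> Box<State> {
    state // a pointer-sized move per step
}
```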
:thinking: could it be worth automatically heap allocating after a certain size threshold?
Yeah, definitely something to consider
also, shouldn't the 4 sprites each be 12B Lists on wasm32, which would take 2-3 instructions each to copy? If so, 64KB seems like a lot!
look at the spoiler above for an estimate
ah gotcha!
@Folkert de Vries do you know where the code is that generates loading and storing captured data? I assume the generation currently breaks everything down recursively for some reason. In reality, we would prefer to just load the first layer and store the same thing. That hopefully would enable more optimization on the backend.
As it stands currently, I think the backend is given the broken-down version of the data copy, where each individual field of the struct is loaded and then a new struct is made from all the individual fields. Though maybe I am missing something here.
Fundamentally, I want to figure out how to make sure things are grouped so that we can either use memcpy directly or else something smarter to enable copying without recursively loading and storing each field.
so, again, captured data is just data. You could write your program without any higher-order functions at all by just passing the lambda sets manually. So the question is really about data in general
so, I guess the problem is that

x = foo.bar.baz
f x

means that we actually do

tmp1 = foo.bar
x = tmp1.baz
f x

so the whole bar struct is moved to the stack even if only one field is actually used. We break this up in the parser already, where foo.bar.baz is really just (foo.bar).baz. The fact that this is a chain is lost very early on. We cannot currently represent the chain in either can or mono IR.
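In Rust terms, the difference looks roughly like this (types invented for illustration):

```rust
#[derive(Clone, Copy)]
struct Baz {
    value: u64,
}

#[derive(Clone, Copy)]
struct Bar {
    baz: Baz,
    rest: [u64; 16],
}

struct Foo {
    bar: Bar,
}

// What the IR does today: `tmp1 = foo.bar` copies all 136 bytes of Bar to
// the stack just to read one u64 out of it.
fn field_via_copy(foo: &Foo) -> u64 {
    let tmp1 = foo.bar;
    tmp1.baz.value
}

// What we'd prefer: offset through a pointer and load only the field.
fn field_via_pointer(foo: &Foo) -> u64 {
    foo.bar.baz.value
}
```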
I see. I was assuming the pattern was more of
prevCapture = someLargeStruct
newData = 7
newCapture = { a: prevCapture.a, b: prevCapture.b, ..., zz: prevCapture.zz, newData }
f newCapture
I was hoping there might be some way to help copy the data in a way that doesn't generate an individual copy for every single field. I would assume the final IR generated is: load N symbols, then call the struct-builder IR.
yes, likely, but there is no specific code for that
that just naturally falls out of the current implementation
When are the captures actually generated? Like when do we build the capture structs?
eh, intuitively a closure definition turns into a struct/tag union literal
there is a function construct_closure_data (in ir.rs) which sounds promising
So trying to make a minimal example to figure out what is going on. All built with roc-wasm4.
Just kinda made an arbitrary model:
Model : {
data1: U64,
data2: U64,
data3: U64,
data4: U64,
r: I32,
r2: I32,
r3: I32,
fg : W4.Palette,
bg : W4.Palette,
}
Starter function:
update = \model ->
{} <- W4.setTextColors {fg : model.fg, bg: Color2} |> Task.await
Task.ok model
All looks good:
define void @roc__mainForHost_2_caller(ptr nocapture readnone %0, ptr nocapture readonly %1, ptr nocapture writeonly %2) local_unnamed_addr {
entry:
%result_value.i.i = alloca { i64, i64, i64, i64, i32, i32, i32, i8, i8 }, align 8
call void @llvm.lifetime.start.p0(i64 48, ptr nonnull %result_value.i.i)
call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(48) %result_value.i.i, ptr noundef nonnull align 8 dereferenceable(48) %1, i64 48, i1 false)
%struct_field1.i.i.sroa.4.0..sroa_idx = getelementptr inbounds i8, ptr %1, i32 48
%struct_field1.i.i.sroa.4.0.copyload = load i16, ptr %struct_field1.i.i.sroa.4.0..sroa_idx, align 8
tail call void @roc_fx_setDrawColors(i16 %struct_field1.i.i.sroa.4.0.copyload)
%call_builtin.i.i.i.i.i = tail call fastcc ptr @roc_builtins.utils.allocate_with_refcount()
call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(48) %call_builtin.i.i.i.i.i, ptr noundef nonnull align 8 dereferenceable(48) %result_value.i.i, i64 48, i1 false)
call void @llvm.lifetime.end.p0(i64 48, ptr nonnull %result_value.i.i)
store ptr %call_builtin.i.i.i.i.i, ptr %2, align 4
ret void
}
Add just a single layer of extra awaiting:
update = \model ->
{} <- W4.setTextColors {fg : model.fg, bg: Color2} |> Task.await
{} <- W4.setTextColors {fg : Color1, bg: model.bg} |> Task.await
Task.ok model
Suddenly, we are loading every single individual struct field and treating them all individually.
llvm ir
Given we are capturing the entire model, I don't think there is any reason this should be handling each piece individually. Maybe our captures aren't nested correctly. As in, feels like we are treating the closure capture as {a, b, c} instead of { myStruct }.
Definitely will have to dig into the final ir that is given to llvm.
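In other words, the suspicion is the difference between these two capture layouts (sketched in Rust, types hypothetical):

```rust
struct Model {
    data1: u64,
    data2: u64,
    data3: u64,
    data4: u64,
    // ... remaining fields
}

// What the codegen appears to be doing: the capture is flattened into one
// slot per model field, so each chain step loads and stores every field
// separately.
struct FlatCapture {
    data1: u64,
    data2: u64,
    data3: u64,
    data4: u64,
    // ... remaining fields
}

// What we'd expect: the capture holds the model as a single unit, so one
// memcpy (or a pointer hand-off) moves the whole thing.
struct NestedCapture {
    model: Model,
}
```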
Due to this current setup, if we do anything more complex, we get tons of code gen.
Ex, 1 extra step:
update = \model ->
r <- W4.rand |> Task.await
{} <- W4.setTextColors {fg : model.fg, bg: Color2} |> Task.await
{} <- W4.setTextColors {fg : Color1, bg: model.bg} |> Task.await
Task.ok {model & r}
That leads to literally double the number of lines of llvm ir compared to the example above.
Huh... so I think I at least partially misdiagnosed this problem.
The debug llvm ir is generating the memcpy calls that I would expect (though a ton of them).
So I guess llvm is choosing to remove the memcpy calls at the cost of binary size.
Kinda surprised it is still doing it with opt-size
All that said, we give llvm a really hard job. We have tons of extra unneeded allocas and memcpy calls.
Not sure where that leaves this issue though. Cause it is still generating tens of KB of binary size for just adding a field or two to a closure capture. There has to be some way to avoid this for cases where Roc wants to be used on more constrained systems.
Potentially reducing and removing the extra memcpy calls would help llvm reason about things, but I'm not sure, it might be actually reasoning correctly and generating what it expects to.
We also still have artifacts like this that look totally wrong.
IIUC, this is:
repeated copying
I’m pretty sure I implemented an optimization earlier to pass large structs by pointer
So it’s not clear to me why this is happening
I think the current state of things is lots of allocas and memcpys, but everything passed around by pointer. LLVM, when optimizing, introduces most of this direct data movement as it tries to remove memcpys. At least that is my current understanding.
One example of extra allocas is when we return by pointer: instead of writing straight to the output pointer, we make an alloca, write to that, then memcpy from the alloca to the output. LLVM seems to optimize that reasonably, but I'm not sure if it always figures out more complex cases.
One other thought:
The real issue may be that we need our own inlining or smarter closure data sharing.
Cause fundamentally, in one of these chains, we should only actually need to copy and store all of this extra closure data when going to the host.
As it stands currently, for every modification to a task, I think we add another copying of the closure data.
Take calling the W4.rand task
Actually that seems wrong: if I just chain a number of random-generation tasks, I won't actually have any issues. (Though maybe llvm just understands that basic direct form of optimization).
That said, if we were smart enough to realize the data being captured was the same, maybe we could share it and elide the copy.
Those are at least my ramblings as I try to figure this out without really understanding how all of the pieces fit together.
Ok...just found something useful. I am pretty sure the issue is inlining with captures.
Big captures -> larger function -> less inlining -> missing optimizations and tons of copying
Really simple function:
update = \model ->
r1 <- W4.rand |> Task.await
r2 <- W4.rand |> Task.await
Task.ok
{
model &
r1: Num.addWrap model.r1 r1,
r2: Num.addWrap model.r2 r2,
}
This generates essentially what I would expect.
Super small change to this:
update = \model ->
r1 <- W4.rand |> Task.await
r2 <- W4.rand |> Task.await
Task.ok
{
model &
r1: model.r1 + r1,
r2: model.r2 + r2,
}
Now we have a "real" Num.add function that has to deal with panicking. Normally it would be inlined, but it is more complex, especially when considering closure captures.
So here is what is generated:
I think fundamentally, this is what I am hitting that is leading to crazy amounts of code gen.
But yeah, I do think we start by emitting memcpys for this, but llvm removes them and instead inserts element-by-element copies cause the size is small enough and known.
Also, I guess the final version that I listed here with the 7 steps is technically what we expect it to do, cause the captures are being modified.
I guess I am left thinking that we need either (possibly multiple of):
Anyway, sorry for the wall of text, just really want to figure this out and unblock any reasonably sized roc-wasm4 game.
I'm probably missing it, so sorry about that, but what exactly is the problem?
If we're passing by pointer and doing a memcpy and LLVM is desugaring that to stack moves, is that worth trying to avoid? Is it a compile time increase or something else?
We could implement a destination-driven compilation scheme, where you pass around the location you want a value to be compiled to, rather than taking an opaque value (or pointer) and working off of that. I've done that in the past and it's very effective at removing trivial loads/stores
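A toy illustration of destination-driven compilation in Rust (the "IR" here is just a flat buffer of i64s; all names are made up, this is only the shape of the scheme, not a compiler API):

```rust
enum Expr {
    Int(i64),
    Pair(Box<Expr>, Box<Expr>),
}

// Value-returning style: each Pair materializes a fresh buffer that the
// caller must then copy into its own destination.
fn compile_value(e: &Expr) -> Vec<i64> {
    match e {
        Expr::Int(n) => vec![*n],
        Expr::Pair(a, b) => {
            let mut v = compile_value(a);
            v.extend(compile_value(b)); // copies b's buffer into v
            v
        }
    }
}

// Destination-driven style: the caller passes where the result should live,
// so nested values are written in place and no intermediate copies happen.
fn compile_into(e: &Expr, dest: &mut Vec<i64>) {
    match e {
        Expr::Int(n) => dest.push(*n),
        Expr::Pair(a, b) => {
            compile_into(a, dest);
            compile_into(b, dest);
        }
    }
}
```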
Fundamentally, I am trying to compile Roc to target smaller binary size.
For roc-wasm4, the entire application is constrained to 64KB. I have been digging into this cause I was really surprised when adding a couple of fields to a struct (like 5 U32s or so) led to my binary growing by 30+ KB. This was before even using the data for anything.
Originally, I thought that was a bug. I still think it is behavior that shouldn't happen, but I guess it may be working as expected. So it may be more a case of opt-size not functioning because of tons and tons of binary bloat from just copying data to and from closure captures.
Found at least one ok sized win:
In the case where we are building a struct that will be passed by reference, build it directly in the alloca instead of building a giant SSA value and then storing it. When building it, directly memcpy values in instead of loading and then storing them.
before: 76KB
after: 68KB
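A sketch of that win in Rust terms (raw-pointer version; types and names are hypothetical):

```rust
use std::ptr::addr_of_mut;

struct Big {
    a: [u64; 8],
    b: [u64; 8],
}

// Build-then-store: materialize the whole struct as a stack temporary, then
// copy it into the destination -- an extra 128-byte move.
unsafe fn build_then_store(dest: *mut Big, a: [u64; 8], b: [u64; 8]) {
    unsafe {
        let tmp = Big { a, b };
        dest.write(tmp);
    }
}

// Build-in-place: write each field directly into the destination alloca,
// no temporary and no second copy.
unsafe fn build_in_place(dest: *mut Big, a: [u64; 8], b: [u64; 8]) {
    unsafe {
        addr_of_mut!((*dest).a).write(a);
        addr_of_mut!((*dest).b).write(b);
    }
}
```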
Yes, this is what I refer to by destination driven compilation
that is kind of what TRMC does so we may have a lot of the primitives for that already actually. Though with TRMC you don't cross function boundaries
It may not matter for most types, but from some LLVM forum thread I was reading yesterday: apparently LLVM sucks at optimizing aggregate values (structs and arrays) that are used in SSA registers unless they actually fit in a single passable value (so 2 or fewer real registers, and actually passed around in registers).
The recommendation is to always use them from alloca/as pointers and never materialize them in SSA register form.