Just curious if anyone has ideas of what might be causing this. I add a few fields to a struct and suddenly roc is trying to access invalid memory and crashing.
is it necessarily the size or maybe the layout? Could try adding some extra unused fields in different places, or changing the types of the added fields to see if it still reproduces
That's fair. My gut feeling is that it is broken either way, but one version is exposing the error.
I'll test a few layouts/extra fields.
Interestingly, it fails before even running that function. So that suggests the real issue is with the tag union that it is a part of.
Model is:
Model : [
TitleScreen TitleScreenState,
Game GameState,
GameOver GameOverState,
]
and gets boxed
I am gonna guess the root of the bug is in one of these crazy long chains of data movement. IIUC, this is just copying data from one lambda capture to another (with of course a few small effects)
https://gist.github.com/bhansconnect/e10db8aafb8250740bbff22a18d58d78
basically the entire IR is just this repeated a bunch of times, and this is the optimized IR. So I guess all this app does is copy data between captures while occasionally performing effects.
@Folkert de Vries or @Ayaz Hafiz any thoughts on trying to debug or work around something like this?
This is the full.ll file: https://gist.github.com/bhansconnect/abdfa39c195482ca502555c0fc8e6306
full mono after refcount: https://gist.github.com/bhansconnect/995ea9f0d4f5e9c57aac0f8d0c9b4a5d
definitely some pretty intense captures:
procedure : `w4.Effect.always` [C {{U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {U32, I32}}, {Int1, Int1, Int1, Int1, Int1, Int1}} {{}, {}}, C [C {U64, U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {I32, I32}, {Float32, Float32}, Int1, U8}, C {U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {I32, I32}, {Float32, Float32}, U8}, C {U64, {U64, U64, List {U64, {List U8, {U32, U32, U32, U32}, U32, Int1}}, U8}, {List U8, {U32, U32, U32, U32}, U32, Int1}, {List U8, {U32, U32, U32, U32}, U32, Int1}, List {U32, I32}}]]
Ok, so I found a fix. Apparently Roc is assuming certain memory is zero when it actually isn't. So by always zeroing allocated memory, the issue is fixed (that said, I also made a few other allocator cleanups; they may have been partial fixes)
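For reference, the shape of the workaround is roughly this (a minimal Rust sketch; the `roc_alloc` signature is the standard Roc host ABI, but the malloc-based body is illustrative, since the actual roc-wasm4 host is a small bump allocator written in Zig):

```rust
use core::ffi::c_void;

// Zero-fill allocations before handing them to Roc, since the crash came
// from Roc assuming freshly allocated memory was already zeroed.
#[no_mangle]
pub unsafe extern "C" fn roc_alloc(size: usize, _alignment: u32) -> *mut c_void {
    let ptr = libc::malloc(size);
    if !ptr.is_null() {
        // The actual fix: zero the memory before Roc sees it.
        core::ptr::write_bytes(ptr as *mut u8, 0, size);
    }
    ptr
}
```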
The only problem is that the fixed binary is way too large. The limit is 64KB; it is 124KB.
The actual roc wasm emitted is 644KB, but apparently most of that is dead code that zig removes when linking.
Essentially all of that code is just copying closure captures one struct field at a time. Maybe we need to change those to memcpys, because they generate way too much code. Though there is probably a much smarter solution that avoids these copies altogether and instead makes closure captures reuse space.
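To make the contrast concrete, here is a minimal Rust sketch (the capture struct is hypothetical) of the two code shapes: per-field copies grow linearly with field count, while a bulk copy stays constant-size:

```rust
// Hypothetical stand-in for a closure capture struct.
#[derive(Clone, Copy)]
struct Capture {
    a: u64,
    b: u64,
    c: [u32; 4],
    d: bool,
}

// Field-by-field, the shape the generated code currently has: one load/store
// pair per leaf field, so code size grows with the number of fields.
fn copy_fields(dst: &mut Capture, src: &Capture) {
    dst.a = src.a;
    dst.b = src.b;
    dst.c = src.c;
    dst.d = src.d;
}

// Bulk copy: constant code size no matter how many fields the capture has;
// for large structs this lowers to a single memcpy.
fn copy_bulk(dst: &mut Capture, src: &Capture) {
    *dst = *src;
}
```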
The other decent cost in binary size is that constant lists are built via store instructions instead of just being actual constants in the binary.
This is probably indirectly my fault. I believe that due to surgical linker limitations we started doing this as a workaround. Technically we only need to do it for data that contains pointers, but I think we do it for all lists currently. I guess that would suggest using strings instead of lists for the constants would be more efficient as a workaround. (though it isn't guaranteed valid utf8...sooo hmm)
Oh actually I guess we had an exception for at least integer lists, but it was removed due to a morphic bug: https://github.com/roc-lang/roc/blob/ae0e3593a405e77a039fc10c613eb45ef83b8ab5/crates/compiler/gen_llvm/src/llvm/build.rs#L2980-L2982
I haven't dug in yet, will do so later, but thought I would just ask here first. Anyone know where the code is for copying data into and out of captures in the LLVM backend? I want to try converting it to memcpy the data around if possible.
captures don't exist at that point, it's just data
If it is just data, I guess that means we really need to optimize how we copy around structs. For example, if we are loading a large (greater than 2 usizes) struct from another struct, we are just gonna put it on the stack anyway and pass it around by reference, so we probably should just load and pass around a pointer to the field instead of actually loading the struct to somewhere else.
On top of that, instead of copying each individual sub-field one at a time, we should either be copying in larger chunks or just calling memcpy (though preferably we would avoid the data copy altogether in a number of these cases). We probably need a way to reuse closure storage space to avoid the copying entirely.
By removing 4 sprites from the game state and instead reloading them on every frame, I cut the generated executable size down by 64KB. This is not a large game. That is literally half of the executable size including the 8KB used for allocations and all of the zig host.
I guess that is 9 fewer load and store instructions per sprite per capture.
estimated load store count and executable bloat
seems like a good next step would be to do the reference and memcpy optimizations, because those would have benefits in the shared case even if we also implemented in-place updates for unique structs on the stack
Quick question about lambda sets: do they unify all closure captures in a chain? I would assume yes, they have to, because the platform only sees closure captures as a statically sized byte allocation.
We probably have a lot of waste that is just copying the exact same (only occasionally modified) data from the input closure capture to the output closure capture. For example, captured in essentially every task is the previous version of the state. Even if we don't do any sort of true in-place updates, we probably need to find a way to recognize that the exact same large chunk of data is captured by many lambdas in a row and avoid ever moving it at all. This will probably require some form of boxing such that we can just give a pointer to the platform, but we should definitely think more about the ramifications of large closure captures.
Any sort of stateful application that uses tasks will end up capturing the entire state in every single task closure. I guess we will need to do some performance measurements of memcpy vs malloc. Is it better to implicitly box a 100-byte struct in a capture or just pay the cost of copying it around over and over again?
It may turn out that we want boxed lambda sets such that the capture can be reused between calls (always the same size in bytes), but we only end up passing around a single pointer instead of giant captures. Kinda in between current lambda sets and the option to use erased closures that are individually dynamically allocated.
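A rough Rust sketch of the trade-off being described (names hypothetical): copying pays a full-size move per step of the chain, while boxing pays one allocation up front and then moves only a pointer:

```rust
// Hypothetical 100-byte state captured by every task in a chain.
struct State {
    bytes: [u8; 100],
}

// Unboxed capture: every step of the task chain re-copies all 100 bytes
// into the next closure's capture struct.
fn step_by_copy(state: State) -> State {
    state // a 100-byte move per step
}

// Boxed capture: one heap allocation up front; each subsequent step moves
// a pointer (4 bytes on wasm32) instead of the whole struct.
fn step_by_box(state: Box<State>) -> Box<State> {
    state // a pointer-sized move per step
}
```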
:thinking: could it be worth automatically heap allocating after a certain size threshold?
Yeah, definitely something to consider
also, shouldn't the 4 sprites each be 12B Lists on wasm32, which would take 2-3 instructions each to copy? If so, 64KB seems like a lot!
look at the spoiler above for an estimate
ah gotcha!
@Folkert de Vries do you know where the code is that generates loading and storing captured data? I assume the generation currently breaks everything down recursively for some reason. In reality, we would prefer to just load the first layer and store the same thing. That hopefully would enable more optimization on the backend.
As it stands currently, I think the backend is given the broken-down version of the data copy, where each individual field of the struct is loaded and then a new struct is made from all the individual fields. Though maybe I am missing something here.
Fundamentally, I want to figure out how to make sure things are grouped so that we can either use memcpy directly or else something smarter to enable copying without recursively loading and storing each field.
so, again, captured data is just data. You could write your program without any higher-order functions at all by just passing the lambda sets manually. So the question is really about data in general
so, I guess the problem is that

x = foo.bar.baz
f x

means that we actually do

tmp1 = foo.bar
x = tmp1.baz
f x

so the whole bar struct is moved to the stack even if only one field is actually used. We break this up in the parser already, where foo.bar.baz is really just (foo.bar).baz. The fact that this is a chain is lost very early on. We cannot currently represent the chain in either can or mono IR.
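In Rust terms, the difference looks roughly like this (types invented for illustration):

```rust
#[derive(Clone, Copy)]
struct Baz {
    value: u64,
}

#[derive(Clone, Copy)]
struct Bar {
    baz: Baz,
    rest: [u64; 16],
}

struct Foo {
    bar: Bar,
}

// What the IR does today: `tmp1 = foo.bar` copies all 136 bytes of Bar to
// the stack just to read one u64 out of it.
fn field_via_copy(foo: &Foo) -> u64 {
    let tmp1 = foo.bar;
    tmp1.baz.value
}

// What we'd prefer: offset through a pointer and load only the field.
fn field_via_pointer(foo: &Foo) -> u64 {
    foo.bar.baz.value
}
```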
I see. I was assuming the pattern was more of
prevCapture = someLargeStruct
newData = 7
newCapture = { a: prevCapture.a, b: prevCapture.b, ..., zz: prevCapture.zz, newData }
f newCapture
I was hoping there might be some way to help copy the data in a way that doesn't generate an individual copy for every single field. I would assume the final IR generated is: load N symbols, then call the struct-builder IR.
yes, likely, but there is no specific code for that
that just naturally falls out of the current implementation
When are the captures actually generated? Like when do we build the capture structs?
eh, intuitively a closure definition turns into a struct/tag union literal
there is a function construct_closure_data (in ir.rs) which sounds promising
So trying to make a minimal example to figure out what is going on. All built with roc-wasm4.
Just kinda made an arbitrary model:
Model : {
data1: U64,
data2: U64,
data3: U64,
data4: U64,
r: I32,
r2: I32,
r3: I32,
fg : W4.Palette,
bg : W4.Palette,
}
Starter function:
update = \model ->
{} <- W4.setTextColors {fg : model.fg, bg: Color2} |> Task.await
Task.ok model
All looks good:
define void @roc__mainForHost_2_caller(ptr nocapture readnone %0, ptr nocapture readonly %1, ptr nocapture writeonly %2) local_unnamed_addr {
entry:
%result_value.i.i = alloca { i64, i64, i64, i64, i32, i32, i32, i8, i8 }, align 8
call void @llvm.lifetime.start.p0(i64 48, ptr nonnull %result_value.i.i)
call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(48) %result_value.i.i, ptr noundef nonnull align 8 dereferenceable(48) %1, i64 48, i1 false)
%struct_field1.i.i.sroa.4.0..sroa_idx = getelementptr inbounds i8, ptr %1, i32 48
%struct_field1.i.i.sroa.4.0.copyload = load i16, ptr %struct_field1.i.i.sroa.4.0..sroa_idx, align 8
tail call void @roc_fx_setDrawColors(i16 %struct_field1.i.i.sroa.4.0.copyload)
%call_builtin.i.i.i.i.i = tail call fastcc ptr @roc_builtins.utils.allocate_with_refcount()
call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(48) %call_builtin.i.i.i.i.i, ptr noundef nonnull align 8 dereferenceable(48) %result_value.i.i, i64 48, i1 false)
call void @llvm.lifetime.end.p0(i64 48, ptr nonnull %result_value.i.i)
store ptr %call_builtin.i.i.i.i.i, ptr %2, align 4
ret void
}
Add just a single layer of extra awaiting:
update = \model ->
{} <- W4.setTextColors {fg : model.fg, bg: Color2} |> Task.await
{} <- W4.setTextColors {fg : Color1, bg: model.bg} |> Task.await
Task.ok model
Suddenly, we are loading every single individual struct field and treating them all individually.
llvm ir
Given we are capturing the entire model, I don't think there is any reason this should be handling each piece individually. Maybe our captures aren't nested correctly. As in, feels like we are treating the closure capture as {a, b, c} instead of { myStruct }.
Definitely will have to dig into the final ir that is given to llvm.
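In other words, the suspicion is the difference between these two capture layouts (sketched in Rust, types hypothetical):

```rust
struct Model {
    data1: u64,
    data2: u64,
    data3: u64,
    data4: u64,
    // ... remaining fields
}

// What the codegen appears to be doing: the capture is flattened into one
// slot per model field, so each chain step loads and stores every field
// separately.
struct FlatCapture {
    data1: u64,
    data2: u64,
    data3: u64,
    data4: u64,
    // ... remaining fields
}

// What we'd expect: the capture holds the model as a single unit, so one
// memcpy (or a pointer hand-off) moves the whole thing.
struct NestedCapture {
    model: Model,
}
```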
Due to this current setup, if we do anything more complex, we get tons of code gen.
Ex, 1 extra step:
update = \model ->
r <- W4.rand |> Task.await
{} <- W4.setTextColors {fg : model.fg, bg: Color2} |> Task.await
{} <- W4.setTextColors {fg : Color1, bg: model.bg} |> Task.await
Task.ok {model & r}
That leads to literally double the number of lines of llvm ir compared to the example above.
Huh... so I think I at least partially misdiagnosed this problem.
The debug llvm ir is generating the memcpy calls that I would expect (though a ton of them).
So I guess llvm is choosing to remove the memcpy calls at the cost of binary size.
Kinda surprised it is still doing it with opt-size
All that said, we give llvm a really hard job. We have tons of extra unneeded allocas and memcpy calls.
Not sure where that leaves this issue though. Cause it is still generating tens of KB of binary size for just adding a field or two to a closure capture. There has to be some way to avoid this for cases where Roc wants to be used on more constrained systems.
Potentially reducing and removing the extra memcpy calls would help llvm reason about things, but I'm not sure, it might be actually reasoning correctly and generating what it expects to.
We also still have artifacts like this that look totally wrong.
IIUC, this is:
repeated copying
I’m pretty sure I implemented an optimization earlier to pass large structs by pointer
So it’s not clear to me why this is happening
I think the current state of things is lots of allocas and memcpys, but everything passed around by pointer. LLVM, when optimizing, introduces most of this direct data movement as it tries to remove memcpys. At least that is my current understanding.
One example of extra allocas is when we return by pointer: instead of writing straight to the output pointer, we make an alloca, write to that, then memcpy from the alloca to the output. LLVM seems to optimize that reasonably, but I'm not sure if it always figures out more complex cases.
One other thought:
The real issue may be that we need our own inlining or smarter closure data sharing.
Cause fundamentally, in one of these chains, we should only actually need to copy and store all of this extra closure data when going to the host.
As it stands currently, for every modification to a task, I think we add another copying of the closure data.
Take calling the W4.rand task
Actually that seems wrong: if I just chain a number of random-generation tasks, I won't actually have any issues. (Though maybe llvm just understands that basic direct form of optimization).
That said, if we were smart enough to realize the data being captured was the same, maybe we could share it and elide the copy.
Those are at least my ramblings as I try to figure this out without really understanding how all of the pieces fit together.
Ok...just found something useful. I am pretty sure the issue is inlining with captures.
Big captures -> larger function -> less inlining -> missing optimizations and tons of copying
Really simple function:
update = \model ->
r1 <- W4.rand |> Task.await
r2 <- W4.rand |> Task.await
Task.ok
{
model &
r1: Num.addWrap model.r1 r1,
r2: Num.addWrap model.r2 r2,
}
This generates essentially what I would expect.
Super small change to this:
update = \model ->
r1 <- W4.rand |> Task.await
r2 <- W4.rand |> Task.await
Task.ok
{
model &
r1: model.r1 + r1,
r2: model.r2 + r2,
}
Now we have a "real" Num.add function that has to deal with panicking. Normally it would be inlined, but it is more complex, especially when considering closure captures.
So here is what is generated:
I think fundamentally, this is what I am hitting that is leading to crazy amounts of code gen.
But yeah, I do think we start by emitting memcpys for this, but llvm removes them and instead inserts element-by-element copies cause the size is small enough and known.
Also, I guess the final version that I listed here with the 7 steps is technically what we expect it to do, cause the captures are being modified.
I guess I am left thinking that we need either (possibly multiple of):
Anyway, sorry for the wall of text, just really want to figure this out and unblock any reasonably sized roc-wasm4 game.
I'm probably missing it, so sorry about that, but what exactly is the problem?
If we're passing by pointer and doing a memcpy and LLVM is desugaring that to stack moves, is that worth trying to avoid? Is it a compile time increase or something else?
We could implement a destination-driven compilation scheme, where you pass around the location you want a value to be compiled to, rather than taking an opaque value (or pointer) and working off of that. I've done that in the past and it's very effective at removing trivial loads/stores
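A toy illustration of destination-driven compilation in Rust (the "IR" here is just a flat buffer of i64s; all names are made up, this is only the shape of the scheme, not a compiler API):

```rust
enum Expr {
    Int(i64),
    Pair(Box<Expr>, Box<Expr>),
}

// Value-returning style: each Pair materializes a fresh buffer that the
// caller must then copy into its own destination.
fn compile_value(e: &Expr) -> Vec<i64> {
    match e {
        Expr::Int(n) => vec![*n],
        Expr::Pair(a, b) => {
            let mut v = compile_value(a);
            v.extend(compile_value(b)); // copies b's buffer into v
            v
        }
    }
}

// Destination-driven style: the caller passes where the result should live,
// so nested values are written in place and no intermediate copies happen.
fn compile_into(e: &Expr, dest: &mut Vec<i64>) {
    match e {
        Expr::Int(n) => dest.push(*n),
        Expr::Pair(a, b) => {
            compile_into(a, dest);
            compile_into(b, dest);
        }
    }
}
```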
Fundamentally, I am trying to compile Roc to target smaller binary size.
For roc-wasm4, the entire application is constrained to 64KB. I have been digging into this cause I was really surprised when adding a couple of fields to a struct (like 5 U32s or so) led to my binary growing by 30+ KB. This was before even using the data for anything.
Originally, I thought that was a bug. I still think it is behavior that shouldn't happen, but I guess it may be working as expected. So it may be more a case of opt-size not functioning because of tons and tons of binary bloat from just copying data to and from closure captures.
Found at least one ok sized win:
In the case where we are building a struct that will be passed by reference, build it directly in the alloca instead of building a giant SSA value and then storing it. When building it, directly memcpy values in instead of loading and then storing them.
before: 76KB
after: 68KB
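A sketch of that win in Rust terms (raw-pointer version; types and names are hypothetical):

```rust
use std::ptr::addr_of_mut;

struct Big {
    a: [u64; 8],
    b: [u64; 8],
}

// Build-then-store: materialize the whole struct as a stack temporary, then
// copy it into the destination -- an extra 128-byte move.
unsafe fn build_then_store(dest: *mut Big, a: [u64; 8], b: [u64; 8]) {
    unsafe {
        let tmp = Big { a, b };
        dest.write(tmp);
    }
}

// Build-in-place: write each field directly into the destination alloca,
// no temporary and no second copy.
unsafe fn build_in_place(dest: *mut Big, a: [u64; 8], b: [u64; 8]) {
    unsafe {
        addr_of_mut!((*dest).a).write(a);
        addr_of_mut!((*dest).b).write(b);
    }
}
```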
Yes, this is what I refer to by destination driven compilation
that is kind of what TRMC does so we may have a lot of the primitives for that already actually. Though with TRMC you don't cross function boundaries
It may not matter for most types, but from some LLVM forum thread I was reading yesterday: apparently LLVM sucks at optimizing aggregate values (structs and arrays) that are used in SSA registers unless they actually fit in a single passable value (so 2 or fewer real registers, and actually passed around in registers).
The recommendation is to always use them from alloca/as pointers and never materialize them in SSA register form.