As a spin-off of the discussions in #ideas > Purity inference proposal v3 and #ideas>`crash` as state machine entry, I think instead of having the platform pass in explicit allocators, we should consider just having the allocation as part of the effect state machine.
The platform still has the same amount of control. We no longer have to deal with passing around an extra data structure to all functions that might allocate.
Benefits: Simplicity and consistency.
Cost: Suspension of intermediate state at each allocation.
This feels like something we should consider in general. It may be too heavy of a perf cost to be worthwhile. It really depends on how fast suspension can be (assuming we reuse the same buffer and minimize data movement, it may not be a big deal). If allocations are in the hot loop, perf is probably already rough, and this could push it over the edge. So it may not be worth it.
I think this would mainly be a huge boon in terms of simplicity and maintainability. But it may not be the right tradeoff for perf.
To clarify, the platform has the same amount of control because each time it calls into Roc, it is starting a state machine execution. For each of those calls it can use a different state machine with its own arena allocator or whatever other allocator design.
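A rough sketch of the "own arena per call" idea, in Rust. Everything here is hypothetical (`BumpArena`, its methods, and the use of buffer offsets instead of raw pointers); it only illustrates that a host which services allocation requests itself can back each call into Roc with a fresh or reset arena.

```rust
// Hypothetical bump arena a host could create for one call into Roc.
// Allocations are returned as offsets into `buf`; a real host would also
// have to make sure the buffer's base address is suitably aligned.
struct BumpArena {
    buf: Vec<u8>,
    used: usize,
}

impl BumpArena {
    fn with_capacity(cap: usize) -> Self {
        BumpArena { buf: vec![0; cap], used: 0 }
    }

    // Returns the offset of the new allocation, or None if the arena is full.
    // `align` must be non-zero.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        // Round the current position up to the next multiple of `align`.
        let start = (self.used + align - 1) / align * align;
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.used = end;
        Some(start)
    }

    // Called once the state machine for this call has finished:
    // everything allocated during the call is released at once.
    fn reset(&mut self) {
        self.used = 0;
    }
}
```

With this shape, individual deallocation requests can be ignored and the whole arena is reclaimed with `reset` when the call completes.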
This would be great! We eventually want to reduce the platform authoring complexity to the point that developers feel they can implement their own platforms. I think if this was added, plus more glue being added, we could end up doing 90% of the scaffolding work for users.
we should consider just having the allocation as part of the effect state machine
Can someone explain what this means? Do we still have a roc_alloc that is called by roc?
Eventual future with effect interpreters.
The platform roughly sees a single tag union:
Op: [
StdoutLine Str \{} -> Op,
StdinLine \Str -> Op,
]
I am suggesting that we automatically add:
RocAlloc U64 U32 \Ptr -> Op
RocDealloc Ptr \{} -> Op
RocEtc ...
The effect interpreter implemented on the host would just handle this as part of a big switch statement.
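To make the "big switch statement" concrete, here is a minimal host-side sketch in Rust. It is not the real Roc ABI: the `Op` enum, the boxed-closure continuations, the `Done` variant, the `Host` struct, and the use of integer handles instead of real pointers are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical host-side mirror of the `Op` tag union. Each variant carries
// its payload plus a continuation that produces the next `Op`.
// `Done` is an assumed terminal variant, not part of the original sketch.
enum Op {
    StdoutLine(String, Box<dyn FnOnce() -> Op>),
    StdinLine(Box<dyn FnOnce(String) -> Op>),
    RocAlloc { size: u64, align: u32, next: Box<dyn FnOnce(usize) -> Op> },
    RocDealloc { ptr: usize, next: Box<dyn FnOnce() -> Op> },
    Done,
}

// Hypothetical host state. "Pointers" are handles into `heap` to keep the
// sketch free of unsafe code.
struct Host {
    stdin: Vec<String>,
    stdout: Vec<String>,
    heap: HashMap<usize, Vec<u8>>,
    next_handle: usize,
}

impl Host {
    fn new(stdin: Vec<String>) -> Host {
        Host { stdin, stdout: Vec::new(), heap: HashMap::new(), next_handle: 1 }
    }

    // The interpreter loop: allocation is handled like any other effect.
    fn run(&mut self, mut op: Op) {
        loop {
            op = match op {
                Op::StdoutLine(line, next) => {
                    self.stdout.push(line);
                    next()
                }
                Op::StdinLine(next) => {
                    let line = if self.stdin.is_empty() {
                        String::new()
                    } else {
                        self.stdin.remove(0)
                    };
                    next(line)
                }
                Op::RocAlloc { size, align: _, next } => {
                    let handle = self.next_handle;
                    self.next_handle += 1;
                    self.heap.insert(handle, vec![0u8; size as usize]);
                    next(handle)
                }
                Op::RocDealloc { ptr, next } => {
                    self.heap.remove(&ptr);
                    next()
                }
                Op::Done => return,
            };
        }
    }
}

// Stand-in for a compiled Roc program: read a line, allocate, print, free.
fn demo_program() -> Op {
    Op::StdinLine(Box::new(|name: String| Op::RocAlloc {
        size: 16,
        align: 8,
        next: Box::new(move |ptr: usize| {
            Op::StdoutLine(
                format!("hello {}", name),
                Box::new(move || Op::RocDealloc { ptr, next: Box::new(|| Op::Done) }),
            )
        }),
    }))
}
```

Running `Host::new(...).run(demo_program())` drives the program to completion; swapping the `RocAlloc` arm for an arena or a pool changes the allocator without touching anything on the Roc side.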
The alternative (and current plan) is explicit allocators, where the platform creates an allocator struct and passes it into every single call to Roc. Roc would implicitly thread that struct behind the scenes to all functions that need to allocate. The struct would hold function pointers to roc_alloc, roc_dealloc, etc., plus a pointer to arbitrary data (void*) that contains any allocator state.
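For comparison, the explicit-allocator design could look roughly like this on a Rust host. The struct layout, the field names, the extra `size` argument to `dealloc`, and the `CountingState` example are assumptions for illustration, not the actual planned ABI.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::ffi::c_void;

// Hypothetical allocator struct: function pointers plus an opaque state pointer.
#[repr(C)]
struct RocAllocator {
    alloc: unsafe extern "C" fn(state: *mut c_void, size: usize, align: u32) -> *mut c_void,
    dealloc: unsafe extern "C" fn(state: *mut c_void, ptr: *mut c_void, size: usize, align: u32),
    state: *mut c_void,
}

// Example allocator state: just counts live allocations.
struct CountingState {
    live: usize,
}

// `size` must be non-zero: the global allocator does not accept zero-sized layouts.
unsafe extern "C" fn counting_alloc(state: *mut c_void, size: usize, align: u32) -> *mut c_void {
    let layout = Layout::from_size_align(size, align as usize).expect("valid layout");
    unsafe {
        (*(state as *mut CountingState)).live += 1;
        alloc(layout) as *mut c_void
    }
}

unsafe extern "C" fn counting_dealloc(state: *mut c_void, ptr: *mut c_void, size: usize, align: u32) {
    let layout = Layout::from_size_align(size, align as usize).expect("valid layout");
    unsafe {
        (*(state as *mut CountingState)).live -= 1;
        dealloc(ptr as *mut u8, layout);
    }
}

// Stand-ins for compiled Roc functions: the allocator is an extra argument
// that has to be threaded through every function that might allocate.
fn roc_make_buffer(allocator: &RocAllocator, size: usize) -> *mut c_void {
    unsafe { (allocator.alloc)(allocator.state, size, 8) }
}

fn roc_free_buffer(allocator: &RocAllocator, ptr: *mut c_void, size: usize) {
    unsafe { (allocator.dealloc)(allocator.state, ptr, size, 8) }
}

// Reads the counter back through the opaque state pointer.
fn live_allocations(allocator: &RocAllocator) -> usize {
    unsafe { (*(allocator.state as *const CountingState)).live }
}
```

The extra `allocator` parameter on every stand-in function is the threading cost that the state machine approach would remove.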
This is a really neat idea!
Yeah, definitely want good glue for tag unions with payloads, but this would simplify platform development a bit I think.
definitely want good glue for tag unions with payload
Yeah, this definitely will be required for effect interpreter in general.
Looking through the basic-cli host lib, we still have a few functions like roc_dbg and roc_panic that aren't yet proposed for addition to the state machine. Would we want to add them as well, or is there anything that shouldn't/couldn't be supported?
Debug (and expect_failed, when we add it) do printing and are slow anyway, so they should be fine as part of the state machine.
Allocations are probably fine.
Panic (and crash) are probably fine if we implement exception handling for pure function hot loops with crash in them.
@Brendan Hansknecht fine here means we can support them right?
We could make them all use the state machine.
There are zero concerns with making debug and expect_failed use the state machine.
There are minor concerns around perf for the allocation functions, but my guess is they probably would not be an issue if using the state machine.
With crash, there are extra complexities to handle around hot loops in pure functions that call crash (those absolutely 100% need to stay fast and have essentially zero cost unless the crash is hit). That is the discussion happening in #ideas>`crash` as state machine entry, and I think we have a path forward to making it work fine.
I think a pretty interesting argument for this direction is that there's both a perf benefit and an implementation simplification benefit to not having to thread through any kind of struct to say what all the allocators are etc
like if we can get rid of all of them and the state machine is all there is
so a potential thing we could try is switching over to just state machine on current platforms, and just see how the perf is in practice
compared to status quo
and if we're unhappy with that perf in practice, then we'd have a clear motivation to implement the other design
Yeah. I think that will be worth testing. My guess is that it will be noticeably worse unless we do some smart optimizations around layout and minimizing copying of data captured in the state machine. The tag layout algorithm today focuses on minimizing memory usage. For the state machine, we really want to minimize data movement more than memory use (so a lifetime-based API of some sort). Like if I see:
someTask! {}
x = 3
someTask! {}
I would hope that 3 is just stuck somewhere in the state machine without any other data movement. In current roc, placing the 3 in the state machine capture could lead to a shift of all data in the capture.
Aside: obviously in a perfect world, we would actually never capture the 3 in this case. We would just move the definition to where it needs to be used.
yeah I also would anticipate that it would be noticeably worse
but I do wonder if there are any interesting platform designs that would otherwise be impossible
Yeah, I think this can be pretty easily seen by the amount of data movement generated by the captures in wasm4. They actually still run pretty darn fast, but they add up to a crazy amount of code bloat. Most code in a wasm4 app is spent moving data from one closure capture to the next.
hm yeah
oh yeah replacing all of those with direct ffi-style calls would be epic :big_smile:
bc they don't want to be async anyway, right?
Yeah, though in most cases the capture is just slowly growing, so preallocating one giant capture would also fix the issue pretty well (obviously not as well as just making everything sync).
So with reusing the same capture allocation and laying out to minimize data movement, I think it would be very very close to the sync code. But definitely still have extra overhead.
Not sure what the percentages would actually be. Very very small would be my guess.
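A simplified model of the two capture strategies discussed above, in Rust. Both `grow_capture` and `Frame` are hypothetical; real captures are laid out by the compiler, and the `u64` words here only stand in for captured values.

```rust
// Strategy 1 (roughly today's behaviour in this model): each suspension
// builds a new capture by copying the old one and appending the new value.
fn grow_capture(old: &[u64], new_value: u64, words_copied: &mut usize) -> Vec<u64> {
    let mut next = Vec::with_capacity(old.len() + 1);
    next.extend_from_slice(old);
    *words_copied += old.len(); // every previously captured word moves
    next.push(new_value);
    next
}

// Strategy 2: one frame sized up front and reused across suspensions.
// Every captured value has a fixed slot, so storing it is a single write.
struct Frame {
    slots: Vec<u64>,
}

impl Frame {
    fn new(slot_count: usize) -> Self {
        Frame { slots: vec![0; slot_count] }
    }

    fn store(&mut self, slot: usize, value: u64) {
        self.slots[slot] = value;
    }
}
```

In this model, capturing `x = 3` after two earlier values costs two copied words under the first strategy and zero under the second, which is the data movement the layout work would aim to eliminate.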
:thinking: is there some way we could facilitate reusing the same capture allocation for async effects?
I don't see why not. I guess we would need to refcount the capture allocation as a whole to make sure the host isn't holding on to an extra copy.
eh we could just tell the host not to
there are plenty of host rules around "don't do this or else UB"
What about platforms that do not want to use tasks?
Currently, it is relatively easy to write a platform where the Roc function takes some argument and returns something without doing effects, like main : Str -> { part1: Str, part2: Str } for Advent of Code. At the moment, calling Roc from the host is like calling any other function; you just have to transform the memory.
Would this proposal (or the effect interpreter proposal in general) mean that this is no longer possible and you always have to handle the state machine?
In any case, I like this proposal. I think it will make interesting stuff easy, like arena allocators or zero-allocation calls. In Go, I currently have to use C.malloc to make the garbage collector happy. With this proposal, it should be possible to use a native Go []byte for the Roc memory.
Oskar Hahn said:
Would this proposal (or the effect interpreter proposal in general) mean that this is no longer possible and you always have to handle the state machine?
Yes, but I'm pretty sure that Str -> { part1: Str, part2: Str } would need to use the effect interpreter, because it will be needed for allocations and such.
Though hopefully it has more consistent glue generation and other simplifications
Last updated: Jun 16 2026 at 16:19 UTC