here's a thing that occurred to me recently, and I made a little proof-of-concept (linked in the intro section) - https://docs.google.com/document/d/1mcIWe6qcfOYyGGdDDVdKObss7Mc3ypZZHO7E3C8vvic/edit
any thoughts and feedback welcome!
Presumably the reason Go starts with very small stacks (0.2MB compared to 0.5MB in macOS threads, 1MB on Windows, and 8MB on Linux)
I'm guessing you meant 0.02MB
I still have all 4GB available
Guessing you meant 6GB ?
Regarding madvise with MEM_RESET, iirc, at least on Linux, there's a flag you can use to indicate that a page can be reclaimed by the kernel if it can make better use of it, but otherwise, it would remain with the same process if there's ample unused memory. Probably a write would rescind that madvise effect if not already reclaimed by that point
fixed, thanks!
Note, modern Mac has working ulimit
So you can change the stack size
On my m1 default was 8176 KB with a default hard limit of 65520 KB stack size
can you change it from within the currently running process?
However, this would be in such a distant part of memory, most lookups into it would likely be cache misses.
It could also end up taking up an undesirable amount of binary size
If this is just function metadata, couldn't we arrange for that metadata to live just before the function instructions (and within the same 4k page)?
Redis does something similar for strings, whereby the pointer points at the start of a normal C-style string, though any code aware of Redis strings can look at one word earlier before the start of the string data to find the string length, thus making some operations safer and faster, while libc string functions continue to work (there's still a null byte).
can you change it from within the currently running process?
Seems to work the same as linux
I'm guessing you meant 0.02MB
Still off by a factor of ten. 0.002MB is 2KB.
depending on the amount of function meta, you could even precede it with a halt instruction to add a little bit more security
@Richard Feldman I presume you did some verification using a deliberately contrived program with a bunch of threads that would sample as using N*8MB VIRT, but indeed only worked out to scale to N*4KB RES?
Calculating the maximum stack size number for a given function seems straightforward
Only a function without recursion anywhere in the call stack? Or guaranteed tail recursion I guess. Oh, and with estimates for stack size of every zig builtin.
I mean I'm general every function knows it's own stack size (at least after love optimizations) and what functions it calls. After that, it is just a simple dynamic programming problem, but it is exceptionally likely to hit some form of recursion or unbounded stack size.
Why would Roc need a scheduler in the case it's just using threads and relying on modern page mapping to keep memory low? Go does it really only because it strictly knows better than the kernel how some concurrency primitives relate (timers, channels, locks), and because it uses async i/o for pretty much everything it can (epoll, kqueue, etc), and thus doesn't have any need for threads to block on syscalls.
Part of that, I think, is also to avoid a bunch of overhead by having X memory for Y threads, but as you pointed out, that old wisdom probably hasn't been accurate for a long time.
Thus, unless there are some pretty bold plans for Roc to have specialized concurrency, or other internals that would make a Roc scheduler significantly better than just relying on the kernel alone, then it seems prudent to just rely on the kernel.
Because of Roc's emphasis on platforms, it suggests to me that it'll be harder to do those clever things (because it'd significantly complicate the process of writing a platform), thus lowering the benefit of a scheduler.
The alternative I could see is something like a hybrid platform, wherein memory management, threading, and other common or tricky bits are just provided by default unless overridden, leaving the user-specified platform to just provide the entrypoint, i/o, effects, etc
@Brendan Hansknecht iirc, Richard mentioned maxing out with the default (or configured) stack size, so if there's any non-fixed-bound non-tail recursion, it'd just end up assuming 8MB on Linux for example
That makes some sense, but roc still calls out to host functions which it won't know the stack size of. So I don't think this is likely to work out fully in practice.
Kevin Gillette said:
Richard Feldman I presume you did some verification using a deliberately contrived program with a bunch of threads that would sample as using
N*8MBVIRT, but indeed only worked out to scale toN*4KBRES?
all I did was make sure that the functions could be called if they had the relevant characteristics of a compiled Roc program, and that they could be run on different OS threads while still having their own stacks and preserving backtraces and debugging
Brendan Hansknecht said:
Calculating the maximum stack size number for a given function seems straightforward
Only a function without recursion anywhere in the call stack? Or guaranteed tail recursion I guess. Oh, and with estimates for stack size of every zig builtin.
I mean I'm general every function knows it's own stack size (at least after love optimizations) and what functions it calls. After that, it is just a simple dynamic programming problem, but it is exceptionally likely to hit some form of recursion or unbounded stack size.
I don't think it's all that likely to hit something that actually recurses at runtime (so, tail recursion that will be optimized into a loop doesn't count) but I could be wrong!
and just to note, I don't think it's very likely that we'd end up needing to do that part
mainly wanted to note that if it turned out that 4KB minimum stack size was really just too much and it needed to be 2KB, there at least exists a potential path to get to ~2KB while keeping the rest of the design
Kevin Gillette said:
Why would Roc need a scheduler in the case it's just using threads and relying on modern page mapping to keep memory low? Go does it really only because it strictly knows better than the kernel how some concurrency primitives relate (timers, channels, locks), and because it uses async i/o for pretty much everything it can (epoll, kqueue, etc), and thus doesn't have any need for threads to block on syscalls.
so if you just use OS threads for everything, the cost of switching threads is significantly more expensive because it involves a roundtrip through the kernel, which involves another save/restore of (even more) registers plus some cache flushing
but someone can always try that and find out! :big_smile:
Technically, the platform could be forced to implement the green threads/system threads instead of roc having a scheduler?
Roc just does everything sync and has a map2 that the platform implements that spawns tasks/threads however it likes?
@Richard Feldman how do you see this working in single threaded apps or low resource systems (like embedded)?
For example, if I want to run on embedded, I would most likely want this direct call style, but I probably want to opt out of the code required for concurrency and green threads. I might still have promotes for running tasks on different cores, but it would be much more explicit and separate from green threads (to avoid resource consumption of green threads).
On top of that, if I am building a platform on a system that controls the threading (maybe a webserver, maybe built on go, idk), I most likely would prefer if roc didn't use any green threads. I would prefer it to stay on the confines of my threading solution instead of having it's own separate threading solution.
I guess I'm wondering if we can just have a single threaded direct call roc and leave the entire green thread solution to the platform. Roc would call a task to spawn another thread instead of queueing something in its own green thread implementation.
I think having our own green thread implementation that is on by default could still be very useful. Though no matter what it needs to work with the platform I think. Cause the platform will have to enable roc green threads to run on multiple real threads. Otherwise there is only concurrency and no parallelism.
I'm thinking about this more right now cause I think it would be something really useful to implement sooner rather than later. It also solves most of our code bloat and data movement issues around captures.
And I personally want it for roc-wasm4 and embedded. Though a full state machine solution with a smart data management solution to avoid closure copy problem would also solve most of the problems hit by embedded and roc-wasm4 use cases
Brendan Hansknecht said:
I guess I'm wondering if we can just have a single threaded direct call roc and leave the entire green thread solution to the platform. Roc would call a task to spawn another thread instead of queueing something in its own green thread implementation.
oh this was always the idea! :smiley:
sorry if that wasn't clear
the whole point of this is that it's something where we can just code gen direct calls and then hosts can implement coroutines using that as a primitive! (or not)
so yeah embedded can just be like direct calls, no parallelism, the end
Ah, I thought the coroutine part was something you were suggesting would be generated by roc as part of a roc runtime that is just embedded in every app.
nah, it's more like "what if we only had direct calls and no state machine...what kind of parallelism could hosts build on top of that?"
(we could still do state machine, of course - but this would unlock a very direct path to a really nice compiler implementation that could also have a really nice parallelism story!)
Yep. And that means the host could also just use regular threads if they want.
absolutely, yeah
or start with threads and then switch to coroutines later as a nonbreaking chanage
The only sad part is that it can't work with async as well. Or at least not easily.
But that really only matters for rust platforms which are the main world of async platforms
Go, zig, c++ all would probably map more directly to the coroutine world
yeah
but of course state machine would map directly to Rust async
and we could still offer that in the future if there's demand
I read the proposal. I really like the goals. I am happy, that you plan to support this in Roc.
When reading the proposal, I though, that you want to implement greenthreads in Roc. I am happy to read, that you don't plan this, but want to give this task to the platform. This will fit much better in the environment of the platform-language.
What I don't understand is, how this is possible without returning Task-like lambdas to the platform. I don't understand, how this could work, if Roc calls the effects immediately, instead of returning something, that can called by the platform.
For example, to support a function like List.for_each_parallel! : List a, (a => b) => b`, I would guess, that I have to implement it in a Go platform like this:
func roc_fx_for_each_parallel(roc_effect_list RocList[RocEffect]) RocList[RocValue] {
// Create a WaitGroup: https://pkg.go.dev/sync#WaitGroup
var wg sync.WaitGroup
// Allocate a big enough memory so every effect can write into in parallel.
results := make([]RocValue, len(roc_effect_list))
for i, effect := range roc_effect_list.List() {
// Increment the WaitGroup counter.
wg.Add(1)
// Launch a goroutine for each effect.
go func() {
// Decrement the counter when the goroutine completes.
defer wg.Done()
// Call the effect
results[i] = roc_call_effect(effect)
}()
}
// Wait for all effects to complete.
wg.Wait()
return NewRocList(results)
}
But for this to work, Roc has to give a list of RocEffect to the platform, that can be called. This was the case before purity inference, but it not the way, Roc currently calls effects.
Is this what you plan? Or is there something, that I am missing?
If this is what you plan, then I think, it will work with any language. Even if the language wants to do its effect with async/await. You just have to start a new async-loop and wait for all the results.
And if the language does not support any kind of parallel functions (for example on embedded or simple wasm), it could just call any effect in the list after each other to support List.for_each_parallel! in a non-parallel way.
so this already works - https://github.com/bhansconnect/roc-coro-webserver
Oskar Hahn said:
for this to work, Roc has to give a list of
RocEffectto the platform, that can be called. This was the case before purity inference, but it not the way, Roc currently calls effects.Is this what you plan? Or is there something, that I am missing?
we can still introduce something like this in the future - basically, compiling effectful function calls into a state machine (like we were planning to do with Task)
but purity inference means we aren't blocked on that; it's possible to do parallelism in Roc right now, as Brendan's repo demonstrates
I still don't get it.
I was not able to build the platform. So I can only look at the source. The platform implements one effect sleepMillis! : U64 => {}. The zig-implementation of this effect is here. There are basically three examples:
The first example only does CPU-calculations in roc. The comment says, it takes around 1-5 ms.
The second example calls Sleep.millis! 50. I would guess, that it calls all of them in parallel, so it would take around 50 ms in total.
The third example first calls Sleep.millis! 20 and then does the CPU calculation, that takes around 1-5 ms. I would guess, that it does the sleep and the CPU calculation in parallel and therefore takes 20 ms in total.
This is impressive. But most effects have to return something. Like a Response, or the content of a file.
How would this look like, when an effect reads a file?
respond! = \_ ->
content = File.read_line! file_name
# This takes about 1-5ms of compute.
x = fact 1000000
if x == 12 then
"Goodbye, World! ${content!}"
else
"Hello, World! ${content!}"
Would this still run in parallel? Is the return type like a future and Roc waits when it tries to read it? Could you do List.for_each_parallel([File.read_line! file1, File.read_line! file2]) and Roc would return the content of both files in parallel?
So there are two different sides here and that coroutine example only deals with one.
The coroutine example there simply deals with managing stackfil coroutines and suspending them when they have Io requests. This also could be made to work with file Io via non blocking requests (or via running the requests on background threads)
This gives roc no control over the concurrency
In something like basic CLI (and even basic webserver depending on the use case), the application will want control over the concurrency
In those cases, we likely would create an effect like Thread.spawn : ({} => {}) => {}
Relatedly we would probably want effects to create channels and send data over them.
Roc would be given the platform a lambda via an effect. The lambda would then be launched on another thread by the platform
Would would then continue on the original thread while the other thread is running concurrent.
So it is still possible to send lazy units of work to the platform. The platform just gets a lambda/closure to call.
Brendan Hansknecht said:
Roc would be given the platform a lambda via an effect. The lambda would then be launched on another thread by the platform
My take away from your post is, that for concurrency to work with Roc, Roc has to return a "function" to the platform and also gives the platform a way to call that function. For example by generate closure callers for effects.
My follow-up question is: Does Roc need lambda sets for this? The reason I ask is, that @Richard Feldman is writing here that lambda sets will be removed. Will the current idea of concurrency in Roc still be possible?
lambda sets are not being removed. They are being fixed with the rewrite
To do that, we put all values a function captures into a single datatype (either a struct or a tagged union) and pass that as a new first argument to the function. At runtime, functions are just normal functions this way
And yeah, I think that roc should generate a cffi interface to call each closure. This way, a user does not need to know how to interact with lambda sets. On top of that, we should generate functions to refcount the closure and change it to atomic refcount mode.
Ahh thank you for the clarification. After reading the post from Richard again, it says "removing lambda set inference", which is probably something different then "removing lambda sets". On another place, he even writes "Monomorphization needs a rewrite to fix lambda sets."
So it was my mistake. Sorry.
In the new design, closures don't exist, unless you're referring to the tagged union that represents the captures for a closure?
Not sure how we'd refcount them
unless you're referring to the tagged union that represents the captures for a closure?
Yes, that, but boxed when exposed to the host
and they refcount the same as the underlying captured data.
If your capture is a string, you just refcount the string.
yeah maybe I should clarify: lambda set inference is being removed from the type inference step because it turns out that's too early to do it. We're still going to do it, just in its own pass later on. :big_smile:
Last updated: Jun 16 2026 at 16:19 UTC