Stream: ideas

Topic: stackful coroutines in hosts


view this post on Zulip Richard Feldman (Sep 27 2024 at 02:21):

here's a thing that occurred to me recently, and I made a little proof-of-concept (linked in the intro section) - https://docs.google.com/document/d/1mcIWe6qcfOYyGGdDDVdKObss7Mc3ypZZHO7E3C8vvic/edit

any thoughts and feedback welcome!

view this post on Zulip Kevin Gillette (Sep 28 2024 at 03:15):

Presumably the reason Go starts with very small stacks (0.2MB compared to 0.5MB in macOS threads, 1MB on Windows, and 8MB on Linux)

I'm guessing you meant 0.02MB

view this post on Zulip Kevin Gillette (Sep 28 2024 at 03:16):

I still have all 4GB available

Guessing you meant 6GB ?

view this post on Zulip Kevin Gillette (Sep 28 2024 at 03:31):

Regarding madvise with MEM_RESET, iirc, at least on Linux, there's a flag you can use to indicate that a page can be reclaimed by the kernel if it can make better use of it, but otherwise, it would remain with the same process if there's ample unused memory. Probably a write would rescind that madvise effect if not already reclaimed by that point

view this post on Zulip Richard Feldman (Sep 28 2024 at 16:36):

fixed, thanks!

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 22:30):

Note, modern Mac has working ulimit

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 22:30):

So you can change the stack size

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 22:31):

On my m1 default was 8176 KB with a default hard limit of 65520 KB stack size

view this post on Zulip Richard Feldman (Sep 28 2024 at 22:32):

can you change it from within the currently running process?

view this post on Zulip Kevin Gillette (Sep 28 2024 at 22:46):

However, this would be in such a distant part of memory, most lookups into it would likely be cache misses.
It could also end up taking up an undesirable amount of binary size

If this is just function metadata, couldn't we arrange for that metadata to live just before the function instructions (and within the same 4k page)?

Redis does something similar for strings, whereby the pointer points at the start of a normal C-style string, though any code aware of Redis strings can look at one word earlier before the start of the string data to find the string length, thus making some operations safer and faster, while libc string functions continue to work (there's still a null byte).

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 22:53):

can you change it from within the currently running process?

Seems to work the same as linux

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 22:53):

I'm guessing you meant 0.02MB

Still off by a factor of ten. 0.002MB is 2KB.

view this post on Zulip Kevin Gillette (Sep 28 2024 at 22:53):

depending on the amount of function meta, you could even precede it with a halt instruction to add a little bit more security

view this post on Zulip Kevin Gillette (Sep 28 2024 at 23:00):

@Richard Feldman I presume you did some verification using a deliberately contrived program with a bunch of threads that would sample as using N*8MB VIRT, but indeed only worked out to scale to N*4KB RES?

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 23:02):

Calculating the maximum stack size number for a given function seems straightforward

Only a function without recursion anywhere in the call stack? Or guaranteed tail recursion I guess. Oh, and with estimates for stack size of every zig builtin.

I mean I'm general every function knows it's own stack size (at least after love optimizations) and what functions it calls. After that, it is just a simple dynamic programming problem, but it is exceptionally likely to hit some form of recursion or unbounded stack size.

view this post on Zulip Kevin Gillette (Sep 28 2024 at 23:16):

Why would Roc need a scheduler in the case it's just using threads and relying on modern page mapping to keep memory low? Go does it really only because it strictly knows better than the kernel how some concurrency primitives relate (timers, channels, locks), and because it uses async i/o for pretty much everything it can (epoll, kqueue, etc), and thus doesn't have any need for threads to block on syscalls.

Part of that, I think, is also to avoid a bunch of overhead by having X memory for Y threads, but as you pointed out, that old wisdom probably hasn't been accurate for a long time.

Thus, unless there are some pretty bold plans for Roc to have specialized concurrency, or other internals that would make a Roc scheduler significantly better than just relying on the kernel alone, then it seems prudent to just rely on the kernel.

Because of Roc's emphasis on platforms, it suggests to me that it'll be harder to do those clever things (because it'd significantly complicate the process of writing a platform), thus lowering the benefit of a scheduler.

The alternative I could see is something like a hybrid platform, wherein memory management, threading, and other common or tricky bits are just provided by default unless overridden, leaving the user-specified platform to just provide the entrypoint, i/o, effects, etc

view this post on Zulip Kevin Gillette (Sep 28 2024 at 23:18):

@Brendan Hansknecht iirc, Richard mentioned maxing out with the default (or configured) stack size, so if there's any non-fixed-bound non-tail recursion, it'd just end up assuming 8MB on Linux for example

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 23:20):

That makes some sense, but roc still calls out to host functions which it won't know the stack size of. So I don't think this is likely to work out fully in practice.

view this post on Zulip Richard Feldman (Sep 28 2024 at 23:46):

Kevin Gillette said:

Richard Feldman I presume you did some verification using a deliberately contrived program with a bunch of threads that would sample as using N*8MB VIRT, but indeed only worked out to scale to N*4KB RES?

all I did was make sure that the functions could be called if they had the relevant characteristics of a compiled Roc program, and that they could be run on different OS threads while still having their own stacks and preserving backtraces and debugging

view this post on Zulip Richard Feldman (Sep 28 2024 at 23:47):

Brendan Hansknecht said:

Calculating the maximum stack size number for a given function seems straightforward

Only a function without recursion anywhere in the call stack? Or guaranteed tail recursion I guess. Oh, and with estimates for stack size of every zig builtin.

I mean I'm general every function knows it's own stack size (at least after love optimizations) and what functions it calls. After that, it is just a simple dynamic programming problem, but it is exceptionally likely to hit some form of recursion or unbounded stack size.

I don't think it's all that likely to hit something that actually recurses at runtime (so, tail recursion that will be optimized into a loop doesn't count) but I could be wrong!

view this post on Zulip Richard Feldman (Sep 28 2024 at 23:48):

and just to note, I don't think it's very likely that we'd end up needing to do that part

view this post on Zulip Richard Feldman (Sep 28 2024 at 23:48):

mainly wanted to note that if it turned out that 4KB minimum stack size was really just too much and it needed to be 2KB, there at least exists a potential path to get to ~2KB while keeping the rest of the design

view this post on Zulip Richard Feldman (Sep 28 2024 at 23:50):

Kevin Gillette said:

Why would Roc need a scheduler in the case it's just using threads and relying on modern page mapping to keep memory low? Go does it really only because it strictly knows better than the kernel how some concurrency primitives relate (timers, channels, locks), and because it uses async i/o for pretty much everything it can (epoll, kqueue, etc), and thus doesn't have any need for threads to block on syscalls.

so if you just use OS threads for everything, the cost of switching threads is significantly more expensive because it involves a roundtrip through the kernel, which involves another save/restore of (even more) registers plus some cache flushing

view this post on Zulip Richard Feldman (Sep 28 2024 at 23:51):

but someone can always try that and find out! :big_smile:

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 23:51):

Technically, the platform could be forced to implement the green threads/system threads instead of roc having a scheduler?

view this post on Zulip Brendan Hansknecht (Sep 28 2024 at 23:52):

Roc just does everything sync and has a map2 that the platform implements that spawns tasks/threads however it likes?

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:26):

@Richard Feldman how do you see this working in single threaded apps or low resource systems (like embedded)?

For example, if I want to run on embedded, I would most likely want this direct call style, but I probably want to opt out of the code required for concurrency and green threads. I might still have promotes for running tasks on different cores, but it would be much more explicit and separate from green threads (to avoid resource consumption of green threads).

On top of that, if I am building a platform on a system that controls the threading (maybe a webserver, maybe built on go, idk), I most likely would prefer if roc didn't use any green threads. I would prefer it to stay on the confines of my threading solution instead of having it's own separate threading solution.


I guess I'm wondering if we can just have a single threaded direct call roc and leave the entire green thread solution to the platform. Roc would call a task to spawn another thread instead of queueing something in its own green thread implementation.

I think having our own green thread implementation that is on by default could still be very useful. Though no matter what it needs to work with the platform I think. Cause the platform will have to enable roc green threads to run on multiple real threads. Otherwise there is only concurrency and no parallelism.

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:27):

I'm thinking about this more right now cause I think it would be something really useful to implement sooner rather than later. It also solves most of our code bloat and data movement issues around captures.

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:28):

And I personally want it for roc-wasm4 and embedded. Though a full state machine solution with a smart data management solution to avoid closure copy problem would also solve most of the problems hit by embedded and roc-wasm4 use cases

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:30):

Brendan Hansknecht said:

I guess I'm wondering if we can just have a single threaded direct call roc and leave the entire green thread solution to the platform. Roc would call a task to spawn another thread instead of queueing something in its own green thread implementation.

oh this was always the idea! :smiley:

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:30):

sorry if that wasn't clear

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:31):

the whole point of this is that it's something where we can just code gen direct calls and then hosts can implement coroutines using that as a primitive! (or not)

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:32):

so yeah embedded can just be like direct calls, no parallelism, the end

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:33):

Ah, I thought the coroutine part was something you were suggesting would be generated by roc as part of a roc runtime that is just embedded in every app.

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:34):

nah, it's more like "what if we only had direct calls and no state machine...what kind of parallelism could hosts build on top of that?"

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:35):

(we could still do state machine, of course - but this would unlock a very direct path to a really nice compiler implementation that could also have a really nice parallelism story!)

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:35):

Yep. And that means the host could also just use regular threads if they want.

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:35):

absolutely, yeah

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:35):

or start with threads and then switch to coroutines later as a nonbreaking chanage

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:36):

The only sad part is that it can't work with async as well. Or at least not easily.

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:36):

But that really only matters for rust platforms which are the main world of async platforms

view this post on Zulip Brendan Hansknecht (Oct 03 2024 at 00:37):

Go, zig, c++ all would probably map more directly to the coroutine world

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:39):

yeah

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:39):

but of course state machine would map directly to Rust async

view this post on Zulip Richard Feldman (Oct 03 2024 at 00:39):

and we could still offer that in the future if there's demand

view this post on Zulip Oskar Hahn (Jan 12 2025 at 10:39):

I read the proposal. I really like the goals. I am happy, that you plan to support this in Roc.

When reading the proposal, I though, that you want to implement greenthreads in Roc. I am happy to read, that you don't plan this, but want to give this task to the platform. This will fit much better in the environment of the platform-language.

What I don't understand is, how this is possible without returning Task-like lambdas to the platform. I don't understand, how this could work, if Roc calls the effects immediately, instead of returning something, that can called by the platform.

For example, to support a function like List.for_each_parallel! : List a, (a => b) => b`, I would guess, that I have to implement it in a Go platform like this:

func roc_fx_for_each_parallel(roc_effect_list RocList[RocEffect]) RocList[RocValue] {
    // Create a WaitGroup: https://pkg.go.dev/sync#WaitGroup
    var wg sync.WaitGroup

    // Allocate a big enough memory so every effect can write into in parallel.
    results := make([]RocValue, len(roc_effect_list))
    for i, effect := range roc_effect_list.List() {
        // Increment the WaitGroup counter.
        wg.Add(1)
        // Launch a goroutine for each effect.
        go func() {
            // Decrement the counter when the goroutine completes.
            defer wg.Done()
            // Call the effect
            results[i] = roc_call_effect(effect)
        }()
    }
    // Wait for all effects to complete.
    wg.Wait()
    return NewRocList(results)
}

But for this to work, Roc has to give a list of RocEffect to the platform, that can be called. This was the case before purity inference, but it not the way, Roc currently calls effects.

Is this what you plan? Or is there something, that I am missing?

view this post on Zulip Oskar Hahn (Jan 12 2025 at 10:49):

If this is what you plan, then I think, it will work with any language. Even if the language wants to do its effect with async/await. You just have to start a new async-loop and wait for all the results.

And if the language does not support any kind of parallel functions (for example on embedded or simple wasm), it could just call any effect in the list after each other to support List.for_each_parallel! in a non-parallel way.

view this post on Zulip Richard Feldman (Jan 12 2025 at 13:52):

so this already works - https://github.com/bhansconnect/roc-coro-webserver

view this post on Zulip Richard Feldman (Jan 12 2025 at 13:53):

Oskar Hahn said:

for this to work, Roc has to give a list of RocEffect to the platform, that can be called. This was the case before purity inference, but it not the way, Roc currently calls effects.

Is this what you plan? Or is there something, that I am missing?

we can still introduce something like this in the future - basically, compiling effectful function calls into a state machine (like we were planning to do with Task)

view this post on Zulip Richard Feldman (Jan 12 2025 at 13:53):

but purity inference means we aren't blocked on that; it's possible to do parallelism in Roc right now, as Brendan's repo demonstrates

view this post on Zulip Oskar Hahn (Jan 12 2025 at 15:13):

I still don't get it.

I was not able to build the platform. So I can only look at the source. The platform implements one effect sleepMillis! : U64 => {}. The zig-implementation of this effect is here. There are basically three examples:

The first example only does CPU-calculations in roc. The comment says, it takes around 1-5 ms.

The second example calls Sleep.millis! 50. I would guess, that it calls all of them in parallel, so it would take around 50 ms in total.

The third example first calls Sleep.millis! 20 and then does the CPU calculation, that takes around 1-5 ms. I would guess, that it does the sleep and the CPU calculation in parallel and therefore takes 20 ms in total.

This is impressive. But most effects have to return something. Like a Response, or the content of a file.

How would this look like, when an effect reads a file?

respond! = \_ ->
    content = File.read_line! file_name
    # This takes about 1-5ms of compute.
    x = fact 1000000
    if x == 12 then
        "Goodbye, World! ${content!}"
    else
        "Hello, World! ${content!}"

Would this still run in parallel? Is the return type like a future and Roc waits when it tries to read it? Could you do List.for_each_parallel([File.read_line! file1, File.read_line! file2]) and Roc would return the content of both files in parallel?

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:49):

So there are two different sides here and that coroutine example only deals with one.

The coroutine example there simply deals with managing stackfil coroutines and suspending them when they have Io requests. This also could be made to work with file Io via non blocking requests (or via running the requests on background threads)

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:50):

This gives roc no control over the concurrency

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:50):

In something like basic CLI (and even basic webserver depending on the use case), the application will want control over the concurrency

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:52):

In those cases, we likely would create an effect like Thread.spawn : ({} => {}) => {}

Relatedly we would probably want effects to create channels and send data over them.

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:52):

Roc would be given the platform a lambda via an effect. The lambda would then be launched on another thread by the platform

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:53):

Would would then continue on the original thread while the other thread is running concurrent.

view this post on Zulip Brendan Hansknecht (Jan 12 2025 at 16:54):

So it is still possible to send lazy units of work to the platform. The platform just gets a lambda/closure to call.

view this post on Zulip Oskar Hahn (Feb 04 2025 at 07:32):

Brendan Hansknecht said:

Roc would be given the platform a lambda via an effect. The lambda would then be launched on another thread by the platform

My take away from your post is, that for concurrency to work with Roc, Roc has to return a "function" to the platform and also gives the platform a way to call that function. For example by generate closure callers for effects.

My follow-up question is: Does Roc need lambda sets for this? The reason I ask is, that @Richard Feldman is writing here that lambda sets will be removed. Will the current idea of concurrency in Roc still be possible?

view this post on Zulip Brendan Hansknecht (Feb 04 2025 at 07:37):

lambda sets are not being removed. They are being fixed with the rewrite

view this post on Zulip Sam Mohr (Feb 04 2025 at 07:38):

To do that, we put all values a function captures into a single datatype (either a struct or a tagged union) and pass that as a new first argument to the function. At runtime, functions are just normal functions this way

view this post on Zulip Brendan Hansknecht (Feb 04 2025 at 07:39):

And yeah, I think that roc should generate a cffi interface to call each closure. This way, a user does not need to know how to interact with lambda sets. On top of that, we should generate functions to refcount the closure and change it to atomic refcount mode.

view this post on Zulip Oskar Hahn (Feb 04 2025 at 07:40):

Ahh thank you for the clarification. After reading the post from Richard again, it says "removing lambda set inference", which is probably something different then "removing lambda sets". On another place, he even writes "Monomorphization needs a rewrite to fix lambda sets."

So it was my mistake. Sorry.

view this post on Zulip Sam Mohr (Feb 04 2025 at 07:40):

In the new design, closures don't exist, unless you're referring to the tagged union that represents the captures for a closure?

view this post on Zulip Sam Mohr (Feb 04 2025 at 07:40):

Not sure how we'd refcount them

view this post on Zulip Brendan Hansknecht (Feb 04 2025 at 07:41):

unless you're referring to the tagged union that represents the captures for a closure?

Yes, that, but boxed when exposed to the host

view this post on Zulip Brendan Hansknecht (Feb 04 2025 at 07:42):

and they refcount the same as the underlying captured data.

view this post on Zulip Brendan Hansknecht (Feb 04 2025 at 07:43):

If your capture is a string, you just refcount the string.

view this post on Zulip Richard Feldman (Feb 04 2025 at 14:21):

yeah maybe I should clarify: lambda set inference is being removed from the type inference step because it turns out that's too early to do it. We're still going to do it, just in its own pass later on. :big_smile:


Last updated: Jun 16 2026 at 16:19 UTC