stackful coroutines in hosts · ideas

Regarding madvise with MEM_RESET, iirc, at least on Linux, there's a flag you can use to indicate that a page can be reclaimed by the kernel if it can make better use of it, but otherwise, it would remain with the same process if there's ample unused memory. Probably a write would rescind that madvise effect if not already reclaimed by that point

Richard Feldman (Sep 28 2024 at 16:36):

Brendan Hansknecht (Sep 28 2024 at 22:30):

Brendan Hansknecht (Sep 28 2024 at 22:31):

Richard Feldman (Sep 28 2024 at 22:32):

Kevin Gillette (Sep 28 2024 at 22:46):

If this is just function metadata, couldn't we arrange for that metadata to live just before the function instructions (and within the same 4k page)?

Redis does something similar for strings, whereby the pointer points at the start of a normal C-style string, though any code aware of Redis strings can look at one word earlier before the start of the string data to find the string length, thus making some operations safer and faster, while libc string functions continue to work (there's still a null byte).

Brendan Hansknecht (Sep 28 2024 at 22:53):

Kevin Gillette (Sep 28 2024 at 22:53):

depending on the amount of function meta, you could even precede it with a halt instruction to add a little bit more security

Kevin Gillette (Sep 28 2024 at 23:00):

@Richard Feldman I presume you did some verification using a deliberately contrived program with a bunch of threads that would sample as using N*8MB VIRT, but indeed only worked out to scale to N*4KB RES?

Brendan Hansknecht (Sep 28 2024 at 23:02):

Only a function without recursion anywhere in the call stack? Or guaranteed tail recursion I guess. Oh, and with estimates for stack size of every zig builtin.

I mean I'm general every function knows it's own stack size (at least after love optimizations) and what functions it calls. After that, it is just a simple dynamic programming problem, but it is exceptionally likely to hit some form of recursion or unbounded stack size.

Kevin Gillette (Sep 28 2024 at 23:16):

Why would Roc need a scheduler in the case it's just using threads and relying on modern page mapping to keep memory low? Go does it really only because it strictly knows better than the kernel how some concurrency primitives relate (timers, channels, locks), and because it uses async i/o for pretty much everything it can (epoll, kqueue, etc), and thus doesn't have any need for threads to block on syscalls.

Part of that, I think, is also to avoid a bunch of overhead by having X memory for Y threads, but as you pointed out, that old wisdom probably hasn't been accurate for a long time.

Thus, unless there are some pretty bold plans for Roc to have specialized concurrency, or other internals that would make a Roc scheduler significantly better than just relying on the kernel alone, then it seems prudent to just rely on the kernel.

Because of Roc's emphasis on platforms, it suggests to me that it'll be harder to do those clever things (because it'd significantly complicate the process of writing a platform), thus lowering the benefit of a scheduler.

The alternative I could see is something like a hybrid platform, wherein memory management, threading, and other common or tricky bits are just provided by default unless overridden, leaving the user-specified platform to just provide the entrypoint, i/o, effects, etc

Kevin Gillette (Sep 28 2024 at 23:18):

@Brendan Hansknecht iirc, Richard mentioned maxing out with the default (or configured) stack size, so if there's any non-fixed-bound non-tail recursion, it'd just end up assuming 8MB on Linux for example

Brendan Hansknecht (Sep 28 2024 at 23:20):

That makes some sense, but roc still calls out to host functions which it won't know the stack size of. So I don't think this is likely to work out fully in practice.

Richard Feldman (Sep 28 2024 at 23:46):

all I did was make sure that the functions could be called if they had the relevant characteristics of a compiled Roc program, and that they could be run on different OS threads while still having their own stacks and preserving backtraces and debugging

Richard Feldman (Sep 28 2024 at 23:47):

I don't think it's all that likely to hit something that actually recurses at runtime (so, tail recursion that will be optimized into a loop doesn't count) but I could be wrong!

Richard Feldman (Sep 28 2024 at 23:48):

and just to note, I don't think it's very likely that we'd end up needing to do that part

Richard Feldman (Sep 28 2024 at 23:48):

mainly wanted to note that if it turned out that 4KB minimum stack size was really just too much and it needed to be 2KB, there at least exists a potential path to get to ~2KB while keeping the rest of the design

Richard Feldman (Sep 28 2024 at 23:50):

so if you just use OS threads for everything, the cost of switching threads is significantly more expensive because it involves a roundtrip through the kernel, which involves another save/restore of (even more) registers plus some cache flushing

Richard Feldman (Sep 28 2024 at 23:51):

Brendan Hansknecht (Sep 28 2024 at 23:51):

Technically, the platform could be forced to implement the green threads/system threads instead of roc having a scheduler?

Brendan Hansknecht (Sep 28 2024 at 23:52):

Roc just does everything sync and has a map2 that the platform implements that spawns tasks/threads however it likes?

Brendan Hansknecht (Oct 03 2024 at 00:26):

@Richard Feldman how do you see this working in single threaded apps or low resource systems (like embedded)?

For example, if I want to run on embedded, I would most likely want this direct call style, but I probably want to opt out of the code required for concurrency and green threads. I might still have promotes for running tasks on different cores, but it would be much more explicit and separate from green threads (to avoid resource consumption of green threads).

On top of that, if I am building a platform on a system that controls the threading (maybe a webserver, maybe built on go, idk), I most likely would prefer if roc didn't use any green threads. I would prefer it to stay on the confines of my threading solution instead of having it's own separate threading solution.

I guess I'm wondering if we can just have a single threaded direct call roc and leave the entire green thread solution to the platform. Roc would call a task to spawn another thread instead of queueing something in its own green thread implementation.

I think having our own green thread implementation that is on by default could still be very useful. Though no matter what it needs to work with the platform I think. Cause the platform will have to enable roc green threads to run on multiple real threads. Otherwise there is only concurrency and no parallelism.

Brendan Hansknecht (Oct 03 2024 at 00:27):

I'm thinking about this more right now cause I think it would be something really useful to implement sooner rather than later. It also solves most of our code bloat and data movement issues around captures.

Brendan Hansknecht (Oct 03 2024 at 00:28):

And I personally want it for roc-wasm4 and embedded. Though a full state machine solution with a smart data management solution to avoid closure copy problem would also solve most of the problems hit by embedded and roc-wasm4 use cases

Richard Feldman (Oct 03 2024 at 00:30):

Richard Feldman (Oct 03 2024 at 00:31):

the whole point of this is that it's something where we can just code gen direct calls and then hosts can implement coroutines using that as a primitive! (or not)

Richard Feldman (Oct 03 2024 at 00:32):

Brendan Hansknecht (Oct 03 2024 at 00:33):

Ah, I thought the coroutine part was something you were suggesting would be generated by roc as part of a roc runtime that is just embedded in every app.

Richard Feldman (Oct 03 2024 at 00:34):

nah, it's more like "what if we only had direct calls and no state machine...what kind of parallelism could hosts build on top of that?"

Richard Feldman (Oct 03 2024 at 00:35):

(we could still do state machine, of course - but this would unlock a very direct path to a really nice compiler implementation that could also have a really nice parallelism story!)

Brendan Hansknecht (Oct 03 2024 at 00:35):

Richard Feldman (Oct 03 2024 at 00:35):

or start with threads and then switch to coroutines later as a nonbreaking chanage

Brendan Hansknecht (Oct 03 2024 at 00:36):

The only sad part is that it can't work with async as well. Or at least not easily.

Brendan Hansknecht (Oct 03 2024 at 00:36):

But that really only matters for rust platforms which are the main world of async platforms

Brendan Hansknecht (Oct 03 2024 at 00:37):

Richard Feldman (Oct 03 2024 at 00:39):

Oskar Hahn (Jan 12 2025 at 10:39):

I read the proposal. I really like the goals. I am happy, that you plan to support this in Roc.

When reading the proposal, I though, that you want to implement greenthreads in Roc. I am happy to read, that you don't plan this, but want to give this task to the platform. This will fit much better in the environment of the platform-language.

What I don't understand is, how this is possible without returning Task-like lambdas to the platform. I don't understand, how this could work, if Roc calls the effects immediately, instead of returning something, that can called by the platform.

For example, to support a function like List.for_each_parallel! : List a, (a => b) => b`, I would guess, that I have to implement it in a Go platform like this:

func roc_fx_for_each_parallel(roc_effect_list RocList[RocEffect]) RocList[RocValue] {
    // Create a WaitGroup: https://pkg.go.dev/sync#WaitGroup
    var wg sync.WaitGroup

    // Allocate a big enough memory so every effect can write into in parallel.
    results := make([]RocValue, len(roc_effect_list))
    for i, effect := range roc_effect_list.List() {
        // Increment the WaitGroup counter.
        wg.Add(1)
        // Launch a goroutine for each effect.
        go func() {
            // Decrement the counter when the goroutine completes.
            defer wg.Done()
            // Call the effect
            results[i] = roc_call_effect(effect)
        }()
    }
    // Wait for all effects to complete.
    wg.Wait()
    return NewRocList(results)
}

But for this to work, Roc has to give a list of RocEffect to the platform, that can be called. This was the case before purity inference, but it not the way, Roc currently calls effects.

Oskar Hahn (Jan 12 2025 at 10:49):

If this is what you plan, then I think, it will work with any language. Even if the language wants to do its effect with async/await. You just have to start a new async-loop and wait for all the results.

And if the language does not support any kind of parallel functions (for example on embedded or simple wasm), it could just call any effect in the list after each other to support List.for_each_parallel! in a non-parallel way.

Richard Feldman (Jan 12 2025 at 13:52):

Richard Feldman (Jan 12 2025 at 13:53):

we can still introduce something like this in the future - basically, compiling effectful function calls into a state machine (like we were planning to do with Task)

Richard Feldman (Jan 12 2025 at 13:53):

but purity inference means we aren't blocked on that; it's possible to do parallelism in Roc right now, as Brendan's repo demonstrates

Oskar Hahn (Jan 12 2025 at 15:13):

I was not able to build the platform. So I can only look at the source. The platform implements one effect sleepMillis! : U64 => {}. The zig-implementation of this effect is here. There are basically three examples:

The first example only does CPU-calculations in roc. The comment says, it takes around 1-5 ms.

The second example calls Sleep.millis! 50. I would guess, that it calls all of them in parallel, so it would take around 50 ms in total.

The third example first calls Sleep.millis! 20 and then does the CPU calculation, that takes around 1-5 ms. I would guess, that it does the sleep and the CPU calculation in parallel and therefore takes 20 ms in total.

This is impressive. But most effects have to return something. Like a Response, or the content of a file.

respond! = \_ ->
    content = File.read_line! file_name
    # This takes about 1-5ms of compute.
    x = fact 1000000
    if x == 12 then
        "Goodbye, World! ${content!}"
    else
        "Hello, World! ${content!}"

Would this still run in parallel? Is the return type like a future and Roc waits when it tries to read it? Could you do List.for_each_parallel([File.read_line! file1, File.read_line! file2]) and Roc would return the content of both files in parallel?

Brendan Hansknecht (Jan 12 2025 at 16:49):

So there are two different sides here and that coroutine example only deals with one.

The coroutine example there simply deals with managing stackfil coroutines and suspending them when they have Io requests. This also could be made to work with file Io via non blocking requests (or via running the requests on background threads)

Brendan Hansknecht (Jan 12 2025 at 16:50):

In something like basic CLI (and even basic webserver depending on the use case), the application will want control over the concurrency

Brendan Hansknecht (Jan 12 2025 at 16:52):

In those cases, we likely would create an effect like Thread.spawn : ({} => {}) => {}

Relatedly we would probably want effects to create channels and send data over them.

Brendan Hansknecht (Jan 12 2025 at 16:52):

Roc would be given the platform a lambda via an effect. The lambda would then be launched on another thread by the platform

Brendan Hansknecht (Jan 12 2025 at 16:53):

Would would then continue on the original thread while the other thread is running concurrent.

Brendan Hansknecht (Jan 12 2025 at 16:54):

So it is still possible to send lazy units of work to the platform. The platform just gets a lambda/closure to call.

Oskar Hahn (Feb 04 2025 at 07:32):

My take away from your post is, that for concurrency to work with Roc, Roc has to return a "function" to the platform and also gives the platform a way to call that function. For example by generate closure callers for effects.

My follow-up question is: Does Roc need lambda sets for this? The reason I ask is, that @Richard Feldman is writing here that lambda sets will be removed. Will the current idea of concurrency in Roc still be possible?

Brendan Hansknecht (Feb 04 2025 at 07:37):

Sam Mohr (Feb 04 2025 at 07:38):

To do that, we put all values a function captures into a single datatype (either a struct or a tagged union) and pass that as a new first argument to the function. At runtime, functions are just normal functions this way

Brendan Hansknecht (Feb 04 2025 at 07:39):

And yeah, I think that roc should generate a cffi interface to call each closure. This way, a user does not need to know how to interact with lambda sets. On top of that, we should generate functions to refcount the closure and change it to atomic refcount mode.

Oskar Hahn (Feb 04 2025 at 07:40):

Ahh thank you for the clarification. After reading the post from Richard again, it says "removing lambda set inference", which is probably something different then "removing lambda sets". On another place, he even writes "Monomorphization needs a rewrite to fix lambda sets."

Sam Mohr (Feb 04 2025 at 07:40):

In the new design, closures don't exist, unless you're referring to the tagged union that represents the captures for a closure?

Sam Mohr (Feb 04 2025 at 07:40):

Brendan Hansknecht (Feb 04 2025 at 07:41):

Brendan Hansknecht (Feb 04 2025 at 07:42):

Brendan Hansknecht (Feb 04 2025 at 07:43):

Richard Feldman (Feb 04 2025 at 14:21):

yeah maybe I should clarify: lambda set inference is being removed from the type inference step because it turns out that's too early to do it. We're still going to do it, just in its own pass later on. :big_smile:

Stream: ideas

Topic: stackful coroutines in hosts

Richard Feldman (Sep 27 2024 at 02:21):

Kevin Gillette (Sep 28 2024 at 03:15):

Kevin Gillette (Sep 28 2024 at 03:16):

Kevin Gillette (Sep 28 2024 at 03:31):

Richard Feldman (Sep 28 2024 at 16:36):

Brendan Hansknecht (Sep 28 2024 at 22:30):

Brendan Hansknecht (Sep 28 2024 at 22:30):

Brendan Hansknecht (Sep 28 2024 at 22:31):

Richard Feldman (Sep 28 2024 at 22:32):

Kevin Gillette (Sep 28 2024 at 22:46):

Brendan Hansknecht (Sep 28 2024 at 22:53):

Brendan Hansknecht (Sep 28 2024 at 22:53):

Kevin Gillette (Sep 28 2024 at 22:53):

Kevin Gillette (Sep 28 2024 at 23:00):

Brendan Hansknecht (Sep 28 2024 at 23:02):

Kevin Gillette (Sep 28 2024 at 23:16):

Kevin Gillette (Sep 28 2024 at 23:18):

Brendan Hansknecht (Sep 28 2024 at 23:20):

Richard Feldman (Sep 28 2024 at 23:46):

Richard Feldman (Sep 28 2024 at 23:47):

Richard Feldman (Sep 28 2024 at 23:48):

Richard Feldman (Sep 28 2024 at 23:48):

Richard Feldman (Sep 28 2024 at 23:50):

Richard Feldman (Sep 28 2024 at 23:51):

Brendan Hansknecht (Sep 28 2024 at 23:51):

Brendan Hansknecht (Sep 28 2024 at 23:52):

Brendan Hansknecht (Oct 03 2024 at 00:26):

Brendan Hansknecht (Oct 03 2024 at 00:27):

Brendan Hansknecht (Oct 03 2024 at 00:28):

Richard Feldman (Oct 03 2024 at 00:30):

Richard Feldman (Oct 03 2024 at 00:30):

Richard Feldman (Oct 03 2024 at 00:31):

Richard Feldman (Oct 03 2024 at 00:32):

Brendan Hansknecht (Oct 03 2024 at 00:33):

Richard Feldman (Oct 03 2024 at 00:34):

Richard Feldman (Oct 03 2024 at 00:35):

Brendan Hansknecht (Oct 03 2024 at 00:35):

Richard Feldman (Oct 03 2024 at 00:35):

Richard Feldman (Oct 03 2024 at 00:35):

Brendan Hansknecht (Oct 03 2024 at 00:36):

Brendan Hansknecht (Oct 03 2024 at 00:36):

Brendan Hansknecht (Oct 03 2024 at 00:37):

Richard Feldman (Oct 03 2024 at 00:39):

Richard Feldman (Oct 03 2024 at 00:39):

Richard Feldman (Oct 03 2024 at 00:39):

Oskar Hahn (Jan 12 2025 at 10:39):

Oskar Hahn (Jan 12 2025 at 10:49):

Richard Feldman (Jan 12 2025 at 13:52):

Richard Feldman (Jan 12 2025 at 13:53):

Richard Feldman (Jan 12 2025 at 13:53):

Oskar Hahn (Jan 12 2025 at 15:13):

Brendan Hansknecht (Jan 12 2025 at 16:49):

Brendan Hansknecht (Jan 12 2025 at 16:50):

Brendan Hansknecht (Jan 12 2025 at 16:50):

Brendan Hansknecht (Jan 12 2025 at 16:52):

Brendan Hansknecht (Jan 12 2025 at 16:52):

Brendan Hansknecht (Jan 12 2025 at 16:53):

Brendan Hansknecht (Jan 12 2025 at 16:54):

Oskar Hahn (Feb 04 2025 at 07:32):

Brendan Hansknecht (Feb 04 2025 at 07:37):

Sam Mohr (Feb 04 2025 at 07:38):

Brendan Hansknecht (Feb 04 2025 at 07:39):

Oskar Hahn (Feb 04 2025 at 07:40):

Sam Mohr (Feb 04 2025 at 07:40):

Sam Mohr (Feb 04 2025 at 07:40):

Brendan Hansknecht (Feb 04 2025 at 07:41):

Brendan Hansknecht (Feb 04 2025 at 07:42):

Brendan Hansknecht (Feb 04 2025 at 07:43):

Richard Feldman (Feb 04 2025 at 14:21):