`crash` as state machine entry · ideas


Richard Feldman (Sep 18 2024 at 00:50):

splitting this off from:

Richard Feldman said:

so in this design, I think we have a type variable behind the scenes which tracks which of these 4 function types a given function has (only 2 of the types are visible, namely pure vs effectful - but we need to track all 4 as distinct from one another behind the scenes, in order to compile the way we want):

effectful and synchronous - compiles to what we have today, where at some point the application just straight-up calls a function in the host and then continues. Examples of where we'd want this: Time.now! and just about everything in the wasm4 platform :big_smile:

effectful and async - compiles to state machine. This is the really involved one because we need to convert everything that's happened up to that point into a Task equivalent so that the function actually returns at that point, and one of the things it returns is "everything that happens next" inside a closure. (Today we call that Task.await.) Examples: Http.get!, Result.parallel!, etc.

pure and sequential - the normal thing we do today, no special compilation necessary

pure and concurrent - for List.mapParallel and such - from a type-checking perspective, this counts as pure, but from a compilation perspective, it's exactly the same as "effectful and async" in that it compiles down to a state machine entry. (The host doesn't care about the distinction between pure and effectful, it just cares about the distinction between sync and async.)
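A minimal sketch of the four-way split described above, written in Rust purely for illustration. The `FnKind` name and its methods are hypothetical and are not taken from the Roc compiler; the point is only that the type checker looks at one axis (pure vs. effectful) while the backend looks at the other (sync vs. async).

```rust
/// Hypothetical tag for the four function kinds tracked behind the scenes.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FnKind {
    EffectfulSync,
    EffectfulAsync,
    PureSequential,
    PureConcurrent,
}

impl FnKind {
    /// The only distinction visible to the user: pure vs. effectful.
    pub fn is_effectful(self) -> bool {
        matches!(self, FnKind::EffectfulSync | FnKind::EffectfulAsync)
    }

    /// The only distinction the host cares about: does calling this
    /// function hand back a state machine entry?
    pub fn compiles_to_state_machine(self) -> bool {
        matches!(self, FnKind::EffectfulAsync | FnKind::PureConcurrent)
    }
}
```

Under this reading, "pure and concurrent" is pure for the type checker but is compiled exactly like "effectful and async".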

Richard Feldman (Sep 18 2024 at 00:50):

so today, when a crash happens, we immediately call the host and say "whatever is going on in the call stack right now, you need to stop what you're doing and deal with this crash"

Richard Feldman (Sep 18 2024 at 00:51):

another design we could go with is to treat crash as an async state machine entry as described above :point_up:

Richard Feldman (Sep 18 2024 at 00:51):

essentially putting it in the "pure and concurrent" bucket

Richard Feldman (Sep 18 2024 at 00:52):

such that when a crash occurs, the host just sees a normal-looking return with a state machine entry that has no continuation in it
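To make "a state machine entry that has no continuation in it" concrete, here is a rough Rust sketch. Every name in it (`Entry`, `drive`, `program`) is an assumption made for illustration; the actual layout Roc would use is not specified in this thread.

```rust
/// Hypothetical shape of what a Roc entry point hands back to the host.
pub enum Entry {
    /// The program finished and produced a value.
    Done(i64),
    /// The program wants the host to run an async operation, then resume.
    Await {
        request: String,
        resume: Box<dyn FnOnce(String) -> Entry>,
    },
    /// A `crash` happened: there is a message but nothing left to resume.
    Crashed(String),
}

/// Host-side driver loop: keep resuming until the program finishes or crashes.
pub fn drive(mut entry: Entry, mut run_io: impl FnMut(&str) -> String) -> Result<i64, String> {
    loop {
        entry = match entry {
            Entry::Done(value) => return Ok(value),
            Entry::Crashed(message) => return Err(message),
            Entry::Await { request, resume } => resume(run_io(&request)),
        };
    }
}

/// A tiny stand-in for a compiled program: fetch a number, crash if it is not one.
pub fn program() -> Entry {
    Entry::Await {
        request: "GET /answer".to_string(),
        resume: Box::new(|body: String| match body.trim().parse::<i64>() {
            Ok(n) => Entry::Done(n + 1),
            Err(_) => Entry::Crashed(format!("not a number: {}", body)),
        }),
    }
}
```

From the host's side, `drive(program(), |_| "41".to_string())` and a run that crashes look the same: an ordinary return whose tag says which case occurred.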

Richard Feldman (Sep 18 2024 at 00:54):

off the top of my head, some of the tradeoffs involved here:

makes it easier for hosts to recover from crash if that's a thing they want to do, e.g. no setjmp/longjmp
there's some performance cost compared to status quo, although I really have no idea how to estimate how much it would be

Richard Feldman (Sep 18 2024 at 00:55):

(this is essentially the "RocResult" design from a while back, except rolling it into the state machine instead of wrapping the state machine, since the state machine already has a discriminant, so why introduce a second one?)

Brendan Hansknecht (Sep 18 2024 at 00:55):

Oh, and then Roc would clean everything up before the return. That would actually be really awesome (though wasteful on some platforms with arenas)

Richard Feldman (Sep 18 2024 at 00:56):

yeah, there are also implications for stack traces

Sam Mohr (Sep 18 2024 at 00:56):

Though crash should be sparing enough that we don't really care about said waste

Richard Feldman (Sep 18 2024 at 00:56):

e.g. right now hosts can grab a backtrace right inside the crash handler, and the stack still exists

Richard Feldman (Sep 18 2024 at 00:56):

whereas if we wanted to get a trace to them, we'd need to capture it before returning the state machine entry etc.

Richard Feldman (Sep 18 2024 at 00:57):

but I think that's something we'll want to figure out anyway for async stack traces, so I'm not considering it a tradeoff really

Richard Feldman (Sep 18 2024 at 00:57):

Sam Mohr said:

Though crash should be sparing enough that we don't really care about said waste

I haven't thought it through all the way, but I think there could be a perf impact even if the crash doesn't occur

Richard Feldman (Sep 18 2024 at 00:58):

although actually that might not be true anymore if it's just one more discriminant in the state machine :thinking:

Richard Feldman (Sep 18 2024 at 00:58):

certainly it was true in the RocResult design

Richard Feldman (Sep 18 2024 at 00:58):

but maybe doesn't apply anymore

Brendan Hansknecht (Sep 18 2024 at 00:58):

I feel like in my mind, the perfect implementation would:

let the host pick whether cleanup is run or not
be implemented using an exception handling mechanism that has basically zero runtime cost in the good case
also let the host choose if roc will generate a nice stack trace for the crash (would just be an optional part of the crash tag)

Brendan Hansknecht (Sep 18 2024 at 00:59):

although actually that might not be true anymore if it's just one more discriminant in the state machine

In an already effectful function, it is essentially no extra cost.
In a pure function, it is extra cost.

Richard Feldman (Sep 18 2024 at 00:59):

yeah @Folkert de Vries and I had the exception handling thing implemented a long time ago...it didn't go great :sweat_smile:

Brendan Hansknecht (Sep 18 2024 at 01:01):

Yeah, for proper exceptions we need full debug info with exception frames that track everything that must be refcounted if an exception is thrown. It also has to walk the stack a frame at a time as it unwinds

Brendan Hansknecht (Sep 18 2024 at 01:01):

or, I guess we just need exception frames and not full debug info

Brendan Hansknecht (Sep 18 2024 at 01:02):

Some of this stuff is built at least partially into llvm, but I don't think it is simple to implement

Richard Feldman (Sep 18 2024 at 01:03):

yeah based on our experience last time I don't think we want to go down that road again :laughing:

Brendan Hansknecht (Sep 18 2024 at 01:04):

Honestly I think that is a mistake. Essentially all programming languages have exceptions. Most llvm supported languages have them. So they can't be that hard to implement even if it is opaque and a general pain to do so.

Richard Feldman (Sep 18 2024 at 01:05):

what about dev backend?

Brendan Hansknecht (Sep 18 2024 at 01:07):

In the worst case, dev backend could go the result under the hood route. But I assume once we figure things out in llvm, it will be easier to figure things out in the dev backend.

Richard Feldman (Sep 18 2024 at 01:08):

that's interesting, I hadn't thought about that!

Richard Feldman (Sep 18 2024 at 01:09):

one of the problems as I recall was that LLVM basically requires you to link libcpp for the exceptions to work

Richard Feldman (Sep 18 2024 at 01:09):

and trying to remove that dependency was...not straightforward haha

Brendan Hansknecht (Sep 18 2024 at 01:10):

I'm imagining a hot loop that is using a dictionary. It has a crash for the impossible case of loading an out of bounds element. If that function and everything that calls it has to be turned into a result under the hood to deal with crash, it will lead to major perf regressions. Any hot loop with a crash in it would hit this.
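A rough illustration of that concern, in Rust rather than Roc. The function names are made up for the example, and `panic!` is only a stand-in for Roc's `crash`. The first pair shows code where the impossible case aborts outright; the second pair shows the same logic once the crash has been turned into a result under the hood, so every caller has to branch on it.

```rust
use std::collections::HashMap;

/// Direct style: the "impossible" branch aborts the computation, so callers
/// receive a plain integer and never check for failure.
pub fn lookup(counts: &HashMap<u32, u64>, key: u32) -> u64 {
    match counts.get(&key) {
        Some(n) => *n,
        None => panic!("key {} missing", key), // stand-in for `crash`
    }
}

pub fn total(counts: &HashMap<u32, u64>, keys: &[u32]) -> u64 {
    keys.iter().map(|k| lookup(counts, *k)).sum()
}

/// Result style: the crash becomes an error value, so this function and
/// everything that calls it has to propagate it.
pub fn lookup_checked(counts: &HashMap<u32, u64>, key: u32) -> Result<u64, String> {
    counts
        .get(&key)
        .copied()
        .ok_or_else(|| format!("key {} missing", key))
}

pub fn total_checked(counts: &HashMap<u32, u64>, keys: &[u32]) -> Result<u64, String> {
    let mut sum = 0;
    for k in keys {
        // One extra check per iteration of the hot loop, even though the
        // error branch is never expected to be taken.
        sum += lookup_checked(counts, *k)?;
    }
    Ok(sum)
}
```

Whether the extra check is measurable depends on the loop and the optimizer; the sketch only shows where the additional work would come from.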

Brendan Hansknecht (Sep 18 2024 at 01:10):

one of the problems as I recall was that LLVM basically requires you to link libcpp for the exceptions to work

... That would really suck

Brendan Hansknecht (Sep 18 2024 at 01:10):

I wonder what rust does

Richard Feldman (Sep 18 2024 at 01:11):

do they use llvm exceptions? :thinking:

Brendan Hansknecht (Sep 18 2024 at 01:12):

I'm not sure, but they definitely have catchable unwinds, an llvm backend, and I didn't think they always linked libc++.

Brendan Hansknecht (Sep 18 2024 at 01:14):

As a note, we technically could add crash to the task state machine, but still use setjmp and longjmp for pure functions. Just jump back to where we generate the crash for the state machine. Of course, that wouldn't deal with cleanup, but if we don't have a good way to deal with cleanup, that still could be a nicer interface for platforms without harming perf.
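A sketch of that hybrid, with one loudly stated substitution: Rust's `std::panic::catch_unwind` is used here in place of setjmp/longjmp, because it is the closest thing that can be shown in safe, self-contained code. Unlike longjmp, unwinding does run destructors, so this example cleans up where the setjmp approach would not. `Outcome`, `checked_div`, and `call_roc` are hypothetical names.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical result of calling into Roc from the host.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Done(u32),
    Crashed(String),
}

/// A "pure" function compiled the fast way: no result threading,
/// the impossible case just bails out.
fn checked_div(a: u32, b: u32) -> u32 {
    if b == 0 {
        panic!("division by zero"); // stand-in for `crash`
    }
    a / b
}

/// The single recovery point at the host boundary. Code below it pays nothing
/// on the happy path; only here is a crash turned into an ordinary return value.
pub fn call_roc(a: u32, b: u32) -> Outcome {
    match panic::catch_unwind(AssertUnwindSafe(|| checked_div(a, b))) {
        Ok(value) => Outcome::Done(value),
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown crash".to_string());
            Outcome::Crashed(message)
        }
    }
}
```

The host only ever sees `Outcome`, so the interface matches the state machine design even though the pure code underneath was compiled without any extra checks.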

Richard Feldman (Sep 18 2024 at 01:15):

true!

Richard Feldman (Sep 18 2024 at 01:17):

as I recall, the basic way that they deal with cleanup is that each function gets a little header in the machine code that runs to perform cleanup if it's unwinding

Richard Feldman (Sep 18 2024 at 01:17):

so you specify that in llvm and it puts it in the machine code

Richard Feldman (Sep 18 2024 at 01:18):

also, there's a "personality function" that is also a little header, and it's for catch - basically a way to say "here's what my class is" or something like that, so your code can detect whether it's time to stop unwinding and run the catch code

Richard Feldman (Sep 18 2024 at 01:18):

but we wouldn't need that aspect

Richard Feldman (Sep 18 2024 at 01:18):

or rather we'd only need it at the entrypoint from the host

Richard Feldman (Sep 18 2024 at 01:19):

anyway, I agree that this would be the best for both perf and host ergonomics if we could make it work

Richard Feldman (Sep 18 2024 at 01:19):

one important prerequisite would be figuring out how to do it without libcpp :big_smile:

Richard Feldman (Sep 18 2024 at 01:19):

I think that was where we got stuck last time

Richard Feldman (Sep 18 2024 at 01:19):

because I think we had the other stuff working

Brendan Hansknecht (Sep 18 2024 at 01:20):

Good to know

Brendan Hansknecht (Sep 18 2024 at 01:24):

I wonder if we'll have to do something like statically link libunwind or something.

Richard Feldman (Sep 18 2024 at 01:26):

I think libunwind is only part of it

Richard Feldman (Sep 18 2024 at 01:26):

but we could prob just get the sources from that and import them into our zig builtin code, because zig is awesome like that :grinning_face_with_smiling_eyes:

Richard Feldman (Sep 18 2024 at 01:26):

we didn't have zig back when we tried this last time haha

Richard Feldman (Sep 18 2024 at 01:26):

we may want libunwind regardless for async backtraces

Brendan Hansknecht (Sep 18 2024 at 01:36):

does zig have exceptions? Can they just tell us how to do everything?

Brendan Hansknecht (Sep 18 2024 at 01:37):

:tears: They tend to be super helpful and have low dependency ways of doing things.

Brendan Hansknecht (Sep 18 2024 at 01:45):

Oh, it looks like they just have printing an error, dumping a stack trace, and then hanging.

Brendan Hansknecht (Sep 18 2024 at 01:45):

So no unwind and what not

Brendan Hansknecht (Sep 18 2024 at 02:16):

Not that I understand the pieces yet, but rust's implementation seems to exist in these locations and only depends on libunwind (or libgcc), not libc++ or libstdc++ (at least from what I can tell).

https://github.com/rust-lang/rust/blob/master/library/std/src/panicking.rs
https://github.com/rust-lang/rust/tree/master/library/panic_unwind/src
https://github.com/rust-lang/rust/tree/master/library/unwind/src

Brendan Hansknecht (Sep 18 2024 at 02:22):

Then this just walks the landing pads and what not created by the LLVM IR

Brendan Hansknecht (Sep 18 2024 at 02:47):

And an example using only c and llvm. No linking to anything c++:
https://youtu.be/gH5-lITYrMg?si=nf7DFINdmhxDBQRl&t=1110

Brendan Hansknecht (Sep 18 2024 at 02:48):

Source before they switch to using c++ (so just c and llvm): https://github.com/AlexDenisov/llvm-social-exception-handling/tree/main/05

Richard Feldman (Sep 18 2024 at 02:49):

Brendan Hansknecht said:

And an example using only c and llvm. No linking to anything c++:
https://youtu.be/gH5-lITYrMg?si=nf7DFINdmhxDBQRl&t=1110

whoa! :open_mouth:

Richard Feldman (Sep 18 2024 at 02:50):

:thinking: so if desired, we could theoretically switch to that already, if we wanted to switch from roc_panic to RocResult?

Brendan Hansknecht (Sep 18 2024 at 02:50):

yes

Richard Feldman (Sep 18 2024 at 02:50):

oh I guess dev backend wouldn't love that though

Brendan Hansknecht (Sep 18 2024 at 02:51):

also yes. Need to figure out generating these landing pads and EH headers from the dev backend as well.

Brendan Hansknecht (Sep 18 2024 at 02:52):

Also, I'm guessing the issue was needing to implement your own personality functions and what not instead of depending on the c++ ones.

Brendan Hansknecht (Sep 18 2024 at 02:52):

Also, no idea how this all works in wasm

Richard Feldman (Sep 18 2024 at 02:53):

how do we do crashes in wasm today?

Brendan Hansknecht (Sep 18 2024 at 02:55):

we call roc_panic and then let the host language figure it out. So we let zig deal with generating it.

Brendan Hansknecht (Sep 18 2024 at 02:56):

And I think it calls some sort of wasm halt instruction

Richard Feldman (May 18 2025 at 15:11):

I realized something about the whole "automatic unwinding such that host calls to Roc functions return a Result" idea: there's basically no way to have Roc handle stack overflows automatically in this way

Richard Feldman (May 18 2025 at 15:14):

that is, the way stack overflow handling works (and has to work) is:

the host does a one-time mprotect on a stack guard page and installs a signal handler for SIGSEGV, which is raised when something writes to the read-only guard page
Roc cannot reasonably set this up automatically, partly because it needs to happen exactly once (and not just once per Roc call), but also because the host might want to have host-specific logic in there which Roc doesn't know about
That SIGSEGV handler runs in the middle of a Roc program's execution, and needs to handle cleanup right away - so there's no opportunity for Roc to convert things into Result

Richard Feldman (May 18 2025 at 15:14):

in other words, if hosts want to gracefully handle stack overflows in Roc programs (which they should!) then they already need to deal with the circumstances of today's roc_panic

Richard Feldman (May 18 2025 at 15:15):

so it's actually better to not do the whole "Roc functions return a Result to the host" because the host needs to deal with the "gracefully clean up a Roc program, including unwinding the stack and dealing with heap resources/file handles/etc. in the middle of the Roc program's execution" thing no matter what because of stack overflows

Richard Feldman (May 18 2025 at 15:16):

so the roc_panic design lets the host reuse code between the stack overflow handling logic and the "Roc executed a crash" handling logic
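A small sketch of what "reuse code between the two handlers" could look like on the host side. Nothing here is a real Roc or platform API: `Abort`, `CallResources`, and `abandon_call` are invented for illustration, and the actual signal handling and the `roc_panic` hook are left out so the example stays self-contained.

```rust
/// Why the host had to abandon a Roc call partway through.
#[derive(Debug, Clone, PartialEq)]
pub enum Abort {
    /// Roc executed `crash` and called the host's panic hook.
    Crash(String),
    /// The host's guard-page signal handler fired.
    StackOverflow,
}

/// Hypothetical per-call resources the host tracks so that it can release
/// them without any help from the Roc program.
#[derive(Default)]
pub struct CallResources {
    pub arena_bytes: usize,
    pub open_files: Vec<String>,
}

/// One cleanup path, reused no matter how the call was abandoned.
pub fn abandon_call(resources: &mut CallResources, reason: Abort) -> String {
    // Reset the arena and close whatever the call had opened.
    let freed = std::mem::take(&mut resources.arena_bytes);
    let closed = resources.open_files.drain(..).count();
    let what = match reason {
        Abort::Crash(message) => format!("crash: {}", message),
        Abort::StackOverflow => "stack overflow".to_string(),
    };
    format!("{} (freed {} bytes, closed {} files)", what, freed, closed)
}
```

Both the crash hook and the overflow handler would call `abandon_call`, differing only in the `Abort` value they pass.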

Brendan Hansknecht (May 18 2025 at 15:51):

I kinda agree, kinda don't. I think in practice, most programs accept that crash on stack overflow is fine behaviour.

Brendan Hansknecht (May 18 2025 at 15:51):

But I do agree that it is important to be able to handle it.

Richard Feldman (May 18 2025 at 18:01):

Brendan Hansknecht said:

I think in practice, most programs accept that crash on stack overflow is fine behaviour.

sure, but for those programs it's presumably fine to crash on crash too :smile:

Sky Rose (May 18 2025 at 18:02):

Richard Feldman said:

Roc cannot reasonably set this up automatically, partly because it needs to happen exactly once (and not just once per Roc call), but also because the host might want to have host-specific logic in there which Roc doesn't know about

These don't seem unsolvable.

To the first point: Can Roc provide the host an init function it has to call before it can call any other roc entry points? Or alternatively, can Roc maintain some state (at the top of the stack?) about whether init has been called?

To the second: can the host provide a function to Roc for a custom stack overflow handler? Roc gets the stack overflow first, wraps it in a result, and then passes the result to the host's callback.

These would make the interface between host and Roc more complex, so there's a tradeoff. But if that can provide a better abstraction boundary over Roc crashes, it could be worth considering.
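For concreteness, the two suggestions above could be read as an interface contract along these lines. This is only a sketch of the proposal, not something Roc provides: `RocRuntime`, `init`, and `call` are hypothetical, and the stack overflow is simulated with a depth limit because a real one cannot be caught in portable code.

```rust
/// Hypothetical host-facing handle for the proposed interface.
pub struct RocRuntime {
    on_overflow: Option<Box<dyn Fn(&str) -> String>>,
}

impl RocRuntime {
    pub fn new() -> Self {
        RocRuntime { on_overflow: None }
    }

    /// First suggestion: an init function the host must call exactly once,
    /// which also registers the host's custom overflow handler.
    pub fn init(&mut self, handler: Box<dyn Fn(&str) -> String>) -> Result<(), &'static str> {
        if self.on_overflow.is_some() {
            return Err("init called twice");
        }
        self.on_overflow = Some(handler);
        Ok(())
    }

    /// Second suggestion: Roc sees the overflow first, wraps it in a result,
    /// and passes it through the host's callback.
    pub fn call(&self, depth: u32, limit: u32) -> Result<u32, String> {
        let handler = self.on_overflow.as_ref().ok_or("init not called")?;
        if depth > limit {
            return Err(handler("stack overflow")); // simulated overflow
        }
        Ok(depth)
    }
}
```

The sketch shows the extra surface area the proposal adds (an init step plus a callback), which is the tradeoff mentioned above.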

Richard Feldman (May 18 2025 at 18:02):

Sky Rose said:

Richard Feldman said:

Roc cannot reasonably set this up automatically, partly because it needs to happen exactly once (and not just once per Roc call), but also because the host might want to have host-specific logic in there which Roc doesn't know about

These don't seem unsolvable.

:thinking: can you give an example of how that could be done?

Richard Feldman (May 18 2025 at 18:05):

like for example, let's say the host wants to do its own custom stack overflow handling via a segfault handler (for stack overflows in the host itself), and wants to incorporate into that handler the logic for handling a stack overflow in a call to a roc function

Richard Feldman (May 18 2025 at 18:08):

also, in wasm there is no way to do this in wasm itself; the best you can (apparently) do is to have a try/catch in the JavaScript code that invokes the wasm, and then it can inspect the error message string to try to guess whether it was a stack overflow

Richard Feldman (May 18 2025 at 18:09):

anyway, the reason I ask is because I started from the premise that these seemed solvable and then (after a lot of investigation) concluded that this was the best way to go...it's very possible that I missed something, but if so, I need to know the specific design that I missed! :smile:

Sky Rose (May 18 2025 at 18:10):

Okay, I don't have a solution in mind. I was just unconvinced by your short summary. If there's a bigger proof or a previous attempt backing up that argument, then I certainly don't have anything better.

Sky Rose (May 18 2025 at 18:12):

"The host needs its own stack overflow handler for the host stack, and so roc can't have a stack overflow handler for the Roc stack" is a more convincing reason than the bullet point I quoted.

Brendan Hansknecht (May 18 2025 at 18:35):

Richard Feldman said:

Brendan Hansknecht said:

I think in practice, most programs accept that crash on stack overflow is fine behaviour.

sure, but for those programs it's presumably fine to crash on crash too :smile:

I think there is overlap, but I wouldn't call this necessarily correct. Different classes of errors with different expectations. Like taking down a server due to an int overflow is very different than taking it down from a stack overflow in my opinion....but I see your point.

Brendan Hansknecht (May 18 2025 at 18:37):

All this to say, I think it would be reasonable to turn crashes into results, but still have stack overflows.

Brendan Hansknecht (May 18 2025 at 18:39):

That being said, I don't feel strongly either way at this point....but the concept of a simple recovery from a crash is important....currently in roc, that is not easy. And forcing arenas is not necessarily the solution....so we may want to think deeper about that.

Anthony Bullard (May 18 2025 at 19:54):

I think a platform that supports concurrency through coroutines wants a way to keep a stack overflow in one coroutine from crashing the entire system

Brendan Hansknecht (May 18 2025 at 19:55):

Oh sure, but a platform can always do that no matter how we design Roc. Really the question is whether, after a stack overflow, they can clean up the garbage left behind.

Last updated: Jul 23 2026 at 13:15 UTC