I want to write down some notes on how hosts should handle the roc program either performing a crash
or stack overflowing. this is going to be kinda unstructured; I just want to get it down in one place and we can talk about whatever aspects of it :smile:
so wasm and non-wasm targets need totally different things, so let's start with Windows because it's the most straightforward
outside wasm, the way stack overflows get detected is that the host decides how much stack space to reserve, and then "reserves" it by telling the OS to mark a page of memory just past that point as readonly
e.g. if you want 2MB of stack space, you set the memory page right after the 2MB mark to be readonly
the CPU is really fast and efficient at bumping the stack pointer, but it's also really dumb. it doesn't do any checking whatsoever about whether it has run out of stack space
so what happens is that eventually it bumps its way into the readonly memory, causing a segfault
(in contrast, heap allocation - which makes sure not to allocate into the 2MB of reserved space, or the readonly guard page, and which can explicitly do conditionals and give an OOM error if it has run out - does not need to cause a segfault on purpose like this)
so, to recover from the stack overflow, there are a few things that need to happen: we have to detect it (via the segfault), grab a backtrace while we still can, get execution back to a safe point in the host, and clean up whatever resources the Roc call was using
at this point, when the "handle stack overflow" logic is running, the system is in a weird state: we're basically out of stack space, and the Roc program may have heap allocations and open file handles in flight
also, once we're done handling everything, we need to clean up all the resources and get back to a normal host state
I mention this stack overflow case first because for `crash` we could do a bunch of automatic fancy cleanup stuff for the host (e.g. making every Roc call return a `Result`), but even if we did that, the host would already need to do all this tricky stuff to handle stack overflows anyway
so for `crash` (where we just call a host function that says "Roc crashed; do whatever you want to do to handle that right now!") it seems best to not attempt to automatically do fancy cleanup stuff, and instead just let the host reuse code for stack overflow handling and `crash` handling
ok so let's start with recovering and cleaning up, and get to showing a trace later
the simplest way to recover from a stack overflow (and this works for `crash` too) is:
- before calling into Roc, do a `setjmp` - which basically just copies all the current CPU registers, including the stack pointer, to some threadlocal memory
- when a stack overflow or `crash` occurs, do a `longjmp` - which basically just takes that memory we stored earlier and puts everything back in the registers again, resuming execution right after the `setjmp`
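a minimal sketch of that shape in C (names like `roc_entrypoint` and `call_roc_guarded` are made up for illustration; real hosts will differ):

```c
/* minimal sketch; `roc_entrypoint` is a stand-in for whatever compiled Roc
   function the host is calling */
#include <setjmp.h>

static _Thread_local jmp_buf roc_recovery_point;

extern void roc_entrypoint(void); /* hypothetical compiled Roc entrypoint */

/* called by the crash / stack-overflow handler once it has recorded a
   backtrace somewhere safe */
void roc_abort_current_call(void) {
    longjmp(roc_recovery_point, 1);
}

/* returns 0 if Roc finished normally, 1 if it crashed or overflowed */
int call_roc_guarded(void) {
    if (setjmp(roc_recovery_point) == 0) {
        roc_entrypoint();
        return 0;
    } else {
        /* we got here via longjmp: reset the Roc-only allocator, close any
           tracked file handles, print the recorded backtrace, etc. */
        return 1;
    }
}
```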
so we can use that to get execution back where we want it, but there's still the matter that the Roc program may have allocated heap memory or opened file handles, and this doesn't clean any of that up
I think the best solution there is that the host should not share a single global allocator between itself and Roc, but rather provide a dedicated allocator just for the Roc program
and if there are multiple Roc entrypoints running concurrently, they should each have their own allocators
this way, if one of them crashes or stack overflows, the host can just reset that allocator and cleanup has been achieved
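here's a rough sketch of the per-call allocator idea, using a dumb bump arena just to show the shape - a real host would more likely hand Roc its own mimalloc/jemalloc heap instance, and the `roc_alloc` signature shown is just the usual size + alignment convention, so check it against the actual platform glue:

```c
/* sketch only: a per-call bump arena that roc_alloc draws from, so that after
   a crash the host reclaims everything Roc allocated by resetting one offset.
   (roc_dealloc would be a no-op here; a real host would more likely use a
   dedicated heap instance instead of a bump arena.) */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base; /* reserved up front by the host (malloc/mmap at startup) */
    size_t   cap;
    size_t   used;
} RocArena;

static _Thread_local RocArena roc_arena; /* one per concurrently-running Roc entrypoint */

/* host allocation hook Roc calls (alignment assumed to be a power of two) */
void *roc_alloc(size_t size, unsigned int alignment) {
    size_t start = (roc_arena.used + (alignment - 1)) & ~((size_t)alignment - 1);
    if (start + size > roc_arena.cap) {
        return NULL; /* this Roc call is out of memory */
    }
    roc_arena.used = start + size;
    return roc_arena.base + start;
}

/* after a crash or stack overflow (post-longjmp), cleanup is just: */
void roc_arena_reset(void) {
    roc_arena.used = 0;
}
```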
separately, if it's doing file handle stuff, that should be tracked somewhere the host can access it, so that it can iterate through the open file handles and close them
we can model all of this in `basic-cli` and `basic-webserver` so host authors can have a good example to follow
ok, so at this point we have a way for the host to recover from a stack overflow or `crash`, and to clean up afterwards
of note, an alternative to `setjmp`/`longjmp` is to do stack unwinding - e.g. `libunwind` (~100KB dependency) lets you avoid the `setjmp` up front (maybe like a dozen CPU instructions), but that seems not worth it to me considering `setjmp` and `longjmp` are much simpler
ok so now the backtrace
on Windows, there's a built-in API for getting backtraces - https://learn.microsoft.com/en-us/windows/win32/debug/capturestackbacktrace
however, it isn't aware of debuginfo, so it wouldn't give things like line numbers etc.
however, there's a more advanced one that does get you debug info - https://learn.microsoft.com/en-us/windows/win32/api/dbghelp/nf-dbghelp-stackwalk64 - but to get that, the host has to dynamically link dbghelp.dll
(which does ship with Windows, fortunately)
unfortunately, this is a pretty complex function that does use stack memory itself, and if we're in the middle of a stack overflow, it's possible that trying to walk the stack could result in another stack overflow, at which point we're toast
so we need to be really really conservative in what we do
one thing that helps is that Windows automatically converts the guard page to be writable again (so, no longer a guard page) at which point we can grow our stack into it
so we get one more page (at least 4096B) of memory to work with, for just the stack walking and recording it somewhere else (most likely a threadlocal that we pre-reserved and can access after the longjmp to display the recorded stack trace to the end user in whatever way we like)
however, this does mean that if we don't want to "leak" guard pages, we need to manually change that page back to a guard page after the longjmp - which is doable, it's just something the host needs to do
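a rough Windows sketch of that (untested, and whether it's safe to longjmp straight out of the handler is debated further down in this thread - Brendan suggests bouncing through a trampoline instead):

```c
/* Windows sketch: capture raw return addresses from inside the stack-overflow
   exception handler into a preallocated threadlocal buffer, longjmp out, then
   re-arm the guard page and symbolize the frames with dbghelp afterwards */
#include <windows.h>
#include <malloc.h>  /* _resetstkoflw */
#include <setjmp.h>

#define MAX_FRAMES 62
static __declspec(thread) void   *captured_frames[MAX_FRAMES];
static __declspec(thread) USHORT  captured_count;
static __declspec(thread) jmp_buf roc_recovery_point; /* setjmp before calling Roc */

static LONG WINAPI overflow_handler(PEXCEPTION_POINTERS info) {
    if (info->ExceptionRecord->ExceptionCode == EXCEPTION_STACK_OVERFLOW) {
        /* do as little as possible here: just record raw addresses; the
           dbghelp symbolization happens later, on a healthy stack */
        captured_count = CaptureStackBackTrace(0, MAX_FRAMES, captured_frames, NULL);
        longjmp(roc_recovery_point, 1);
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

void install_overflow_handler(void) {
    AddVectoredExceptionHandler(1, overflow_handler);
}

/* back in normal host code, after the longjmp: */
void after_recovery(void) {
    _resetstkoflw(); /* MSVC CRT helper that turns the page back into a guard page */
    /* ...symbolize captured_frames (dbghelp / StackWalk64) and log the trace... */
}
```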
one more note: the interpreter just does all this stuff itself and reports a stack overflow as a normal crash
because it's managing its own "stack"
ok so that's Windows - we can recover from stack overflows and `crash`es, including showing the end user a backtrace which includes debuginfo, so function names and line numbers etc.
on UNIX, we can follow the same basic strategy but with less built-in behavior
specifically, for the backtrace we can use `libbacktrace` (adds ~50KB to the binary), which parses DWARF debuginfo
since the backtrace-obtaining functionality is completely different for interpreter vs optimized build, I think the compiled roc application should expose a `backtrace()` symbol the host can link and call, and it will have the same API regardless of whether we're interpreting or running an optimized build (the optimized build will use `libbacktrace`, which we'll need to bundle along with our builtins, and then the interpreter will do its own thing with its own stack memory)
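a rough sketch of the "less built-in behavior" part on UNIX - unlike Windows, the host has to set up an alternate signal stack itself or the SIGSEGV handler has nowhere to run when the overflow happens (and again, the direct `siglongjmp` from the handler is the debatable part, see the discussion further down):

```c
/* UNIX sketch: give the SIGSEGV handler its own little stack (sigaltstack),
   record a trace, and siglongjmp back to the host's recovery point */
#include <signal.h>
#include <setjmp.h>
#include <string.h>

static _Thread_local sigjmp_buf roc_recovery_point;  /* sigsetjmp(...) before calling Roc */
static _Thread_local char alt_stack[64 * 1024];      /* handler stack, reserved up front */

static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
    /* real code should check si_addr against the known guard-page range to tell
       a stack overflow apart from an ordinary segfault, and record a raw trace
       into preallocated memory before jumping out */
    siglongjmp(roc_recovery_point, 1);
}

void install_unix_overflow_handler(void) {
    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof alt_stack;
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL); /* macOS can report stack overflow as SIGBUS */
}
```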
ok so that's Windows and UNIX...now for wasm!
wasm is totally different from either of those
the wasm binary is not allowed to look at its own bytes, so it can't do its own stack unwinding or generate its own backtrace
however, the things we need to happen can be done outside the wasm vm (so, in the browser that means JS can do it)
for simplicity I'll talk about the browser, but hopefully other wasm vms will have equivalent ways of doing these things
when a stack overflow occurs in wasm, execution immediately halts and JS gets a runtime exception, which it can catch with `try`/`catch`
that exception includes the backtrace info, and as long as the binary was built with source maps (which can either be included in the wasm binary itself or hosted separately from the binary, where the browser can download them), JS can use the source maps to get line numbers etc.
for `crash`, wasm can call out to an external JS function, and JS can inspect the current wasm stack while it's paused - and again can use source maps to get debuginfo
JS can also just cancel the entire wasm process, which takes care of stack memory cleanup, execution, etc.
for the interpreter inside the browser, it gets slightly trickier because we can't just be like "here is my `backtrace()` function and I'll give you what you want regardless of whether I'm an interpreter or an optimized build"
because although the interpreter can offer that, the optimized build can't (because it can't see its own executable bytes like it can outside of wasm)
so instead, basically I think what we need to do in wasm is make it so that when we call the "I crashed" host handler (in JS), if we're the interpreter, we automatically create a backtrace and pass it to JS, so JS can just use that
but if we're an optimized build, we just pass `null`, and then JS in the host can check for that and see "oh, there's a `null` backtrace here, so that must mean I need to go inspect the currently-running wasm stack myself and build my own trace"
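a tiny sketch of what that convention could look like from the wasm module's side (the import name `roc_host_crashed` is made up for illustration, not an existing symbol):

```c
/* wasm-module-side sketch of the "null means walk the stack yourself" convention */
#include <stddef.h>

/* imported from the JS host */
extern void roc_host_crashed(const char *msg, const char *backtrace_or_null);

/* interpreter: it built its own backtrace, so it hands it over directly */
void report_crash_from_interpreter(const char *msg, const char *interp_backtrace) {
    roc_host_crashed(msg, interp_backtrace);
}

/* optimized build: it can't see its own stack, so it passes NULL and the JS
   host walks the paused wasm stack itself (using source maps for debuginfo) */
void report_crash_from_optimized_build(const char *msg) {
    roc_host_crashed(msg, NULL);
}
```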
for the stack overflow scenario it's actually more straightforward, because that's naturally a thrown exception with backtrace info in the exception, so in theory we can have the interpreter just handcraft one of those exceptions, except including all the backtrace info from the interpreter
ok I think that's everything!
I think the key here is having good examples of how to do all of that in `basic-cli`, `basic-webserver`, and wasm hosts too, so that other platform authors can treat everything I just described as just some general boilerplate stuff they set up, and then backtraces for stack overflows and `crash` Just Work as if this were an interpreted language or a language with a VM, except without the VM overhead :smile:
Couple thoughts:
Joshua Warner said:
- We should use a frame pointer, which (as long as all code on the stack is Roc code) makes backtraces relatively trivial to compute.
the problem with that is that then inlined functions disappear completely, whereas e.g. DWARF has metadata about "here's the function that was inlined here" so you can get a more useful stack trace even if things were inlined. also we want the DWARF metadata anyway for line numbers etc.
I haven't actually implemented a stack walker with or without libbacktrace, but from what I've read, if you want to get the DWARF metadata anyway, that's the whole hard part - and at that point you might as well rely on that as the source of truth instead of the frame pointer, even if you have one
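for reference, the frame-pointer walk Joshua is describing is roughly this (x86-64-style layout, and it only works when every frame actually preserved the frame pointer - inlined functions just won't show up, which is the tradeoff being discussed):

```c
/* frame-pointer walk sketch: each frame starts with (saved caller fp, return
   address); it gives raw addresses only - names/lines still need DWARF or
   addr2line afterwards */
#include <stdio.h>

struct frame {
    struct frame *prev;     /* saved caller frame pointer */
    void         *ret_addr; /* return address pushed by the call */
};

void print_raw_backtrace(void) {
    /* GCC/Clang builtin; returns the current frame pointer */
    struct frame *fp = (struct frame *)__builtin_frame_address(0);
    for (int depth = 0; fp != NULL && depth < 64; depth++) {
        printf("#%d %p\n", depth, fp->ret_addr);
        /* real code should also check fp stays inside the known stack bounds */
        fp = fp->prev;
    }
}
```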
Joshua Warner said:
- We may want to consider abstracting stack overflow a _bit_, such that the host doesn't have to know about both the interpreter and the compiler.
in theory I'd like to do this, but it doesn't seem possible for the `crash` scenario in wasm (at least based on my research) to avoid the host having to do something different in one scenario vs the other one
and outside of wasm, I think the stack overflow handler has to be installed by the host, because Roc functions are (and ought to be) stateless, plus the host might have other signal handlers that could conflict with ours if we installed it ourselves
and the interpreter can just do a normal `crash` for signalling that a stack overflow has happened
so from that perspective, the only thing I think we could really abstract is `backtrace()` - but honestly I actually think I'd rather do the same thing we do with wasm there, where we say "here's a backtrace if I'm an interpreter, but otherwise I just give you `null` and you need to go walk the stack yourself"
because that way we don't have to staple a 50KB `libbacktrace` static dependency to every optimized Roc app, when the host might already have that dependency themselves and just be able to use the one they already have
Hmm, tbh I wouldn't optimize for making backtraces perfect in the presence of inlining and optimization.
They're never going to be perfect anyway
I'd just keep track of which functions have other functions inlined into them, and make a note of that in the backtrace
e.g.
foo
bar (*contains inlined functions)
baz
In other words, what I'd do is: use the frame pointer to walk the stack, keep track of which functions have other functions inlined into them, and note that in the backtrace
ah that's interesting
so basically give up a slight amount of perf (reserving 1 register for the duration of the roc call) for the benefit of hosts having the option to do simple backtrace without line numbers on optimized builds
I could also see that being a `platform` module knob maybe
like they can request a frame pointer or not (if they're not gonna use it anyway, might as well take the extra perf improvement)
lol, utm outing me :sweat_smile:
I would actually say that unless we find a strong case that really really needs that register, we should not even make it an option
Note that line numbers can work just fine with frame pointer backtraces
All you need is `addr2line` at that point (either the tool itself, or functionality thereof)
TIL about `addr2line` - but apparently it's basically the same thing as `libbacktrace` except it's designed to be invoked in a separate process on a file that's a memory dump of the frame?
Yep!
And, most importantly IMO, it doesn't require linking, and can even be done after the fact
(doesn't require symbols being embedded in the binary, or even being on the machine where the crash happened)
hm, but wouldn't the best UX normally be getting the stack trace immediately?
I mean if all we have to do is enable stack frames to get it, then sure, might as well, but wouldn't host authors pretty much always want to use `libbacktrace` anyway for runtime stack traces? :thinking:
I don't think it is safe to longjmp from a segfault handler, but you can modify the return address to go to a temp function and then longjmp.
Also, of note, you want to leave segfault handlers as fast as possible. So all of the stuff discussed here should be deferred and outside of the segfault handler.
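a sketch of that trampoline trick on Linux x86-64 (register names and the `uc_mcontext` layout are glibc-specific; other OSes and arches differ):

```c
/* Linux x86-64 / glibc sketch of the "bounce through a trampoline" idea:
   rewrite the interrupted registers so that returning from the signal handler
   lands in a trampoline running on a fresh reserve stack, and do the
   backtrace + longjmp there, outside of signal context */
#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>
#include <stdint.h>

void overflow_trampoline(void); /* must not return; it records a trace and longjmps */

static char reserve_stack[64 * 1024]; /* fresh stack for the trampoline to run on */

static void on_segv(int sig, siginfo_t *info, void *ctx_raw) {
    (void)sig; (void)info;
    ucontext_t *ctx = (ucontext_t *)ctx_raw;
    /* point the resumed thread at the trampoline, on a stack we know has room
       (ABI alignment / red-zone details glossed over) */
    uintptr_t top = ((uintptr_t)reserve_stack + sizeof reserve_stack) & ~(uintptr_t)0xF;
    ctx->uc_mcontext.gregs[REG_RSP] = (greg_t)top;
    ctx->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)overflow_trampoline;
    /* returning from the handler now "resumes" in overflow_trampoline */
}
```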
and if there are multiple Roc entrypoints running concurrently, they should each have their own allocators
I wonder how expensive that would be... A lot of perf is gained from things like tcmalloc (thread cache malloc) with its smart reuse of memory.
if it's doing file handle stuff, that should be tracked somewhere the host can access it, so that it can iterate through the open file handles and close them
Or we implement standard exceptions in roc. Use debug info to walk the stack and run destructors on the roc values. Stack walking is slower when there is an exception but greatly reduces the cost when no exceptions happen.
of note, an alternative to `setjmp`/`longjmp` is to do stack unwinding - e.g. `libunwind` (~100KB dependency) lets you avoid the `setjmp` up front (maybe like a dozen CPU instructions), but that seems not worth it to me considering `setjmp` and `longjmp` are much simpler
I'm not sold due to all the extra cost of the host needing to manually track every resource that roc uses....
one more note: the interpreter just does all this stuff itself and reports a stack overflow as a normal `crash`
Yeah, interpreter should be super simple here. And should be able to give nicer errors.
since the backtrace-obtaining functionality is completely different for interpreter vs optimized build, I think the compiled roc application should expose a `backtrace()` symbol the host can link and call, and it will have the same API regardless of whether we're interpreting or running an optimized build (the optimized build will use `libbacktrace`, which we'll need to bundle along with our builtins, and then the interpreter will do its own thing with its own stack memory)
If compiled binaries are just for optimized builds, why not just not have backtraces? It is really common to leave out debug info in release builds anyway. Leave that for the interpreter.
guarantee you can always get some kind of backtrace via frame pointers
Frame pointer is a waste in my opinion (hurts perf for little gain most of the time, should be optional and not default). Just don't have debug info in release builds and use the interpreter which is better suited for this anyway. Give a super crude trace for anything optimized
Brendan Hansknecht said:
I don't think it is safe to longjmp from a segfault handler, but you can modify the return address to go to a temp function and then longjmp.
Also, of note, you want to leave segfault handlers as fast as possible. So all of the stuff discussed here should be deferred and outside of the segfault handler.
hm, how would this work for purposes of getting a backtrace? :thinking:
modify return address, go to temp function, then do backtrace, then longjmp?
I assume by default `libbacktrace` would not work post-longjmp, since the backtrace code itself would be potentially stomping over the previous stack memory :sweat_smile:
Brendan Hansknecht said:
guarantee you can always get some kind of backtrace via frame pointers
Frame pointer is a waste in my opinion (hurts perf for little gain most of the time, should be optional and not default). Just don't have debug info in release builds and use the interpreter which is better suited for this anyway. Give a super crude trace for anything optimized
I think that would be fine for some use cases but pretty bad for others - e.g. if my web server `crash`es, I really want as real a backtrace in my logs as I can get without sacrificing optimizations :smile:
Brendan Hansknecht said:
if it's doing file handle stuff, that should be tracked somewhere the host can access it, so that it can iterate through the open file handles and close them
Or we implement standard exceptions in roc. Use debug info to walk the stack and run destructors on the roc values. Stack walking is slower when there is an exception but greatly reduces the cost when no exceptions happen.
I dunno, I know we've talked about this a bunch over the years, but I just keep coming back to:
it just feels like exceptions would be a way to make host authors do a bit less work for the tradeoff of having worse perf when there's an exception and a ton of added complexity
btw I don't actually think cleaning up file handles would be much work for the host
they already need to be tracked somewhere for the whole "refcounted fd via `roc_alloc` that cleans itself up" design, so even in the worst case scenario where you have a bunch of roc functions running at once and they're passing the handles between one another, all you have to do is make sure to keep a running list of which fds a given roc function is using, and you just go through and decref those if it crashes.
feels like stack overflow handling is the big burden here, and I don't know of a way to make that burden easier without sacrificing flexibility (e.g. host can have their own signal handlers without worrying about roc installing its own that conflict with the host's) and/or some huge perf penalty like introducing a vm or something
Richard Feldman said:
hm, how would this work for purposes of getting a backtrace? :thinking:
modify return address, go to temp function, then do backtrace, then longjmp?
I assume by default `libbacktrace` would not work post-longjmp, since the backtrace code itself would be potentially stomping over the previous stack memory :sweat_smile:
Yeah, I think something like that. Saving the old stack address when you do so
Maybe all the ceremony isn't required, but I remember seeing something like this in Go when dealing with stack swapping (which happens on a signal from a timer if a goroutine runs for too long)
Richard Feldman said:
btw I don't actually think cleaning up file handles would be much work for the host
No, I assume the memory allocator cleanup would be the bigger hassle (and having state per thread). Specifically if you just want a gpa and not arenas
it just feels like exceptions would be a way to make host authors do a bit less work for the tradeoff of having worse perf when there's an exception and a ton of added complexity
I mean I would hope exceptions are truly exceptional and roc rarely crashes in production, but I guess that may vary a lot by use case.
And for backtraces in release builds, it definitely is true that having full DWARF info and backtraces from debug info is the most friendly.
Brendan Hansknecht said:
Richard Feldman said:
btw I don't actually think cleaning up file handles would be much work for the host
No, I assume the memory allocator cleanup would be the bigger hassle (and having state per thread). Specifically if you just want a gpa and not arenas
hm, why would that be hard? seems like e.g. mimalloc, jemalloc, snmalloc all have a concept of making an allocator instance
I don't think you actually want a new allocator per coroutine
That could be hundreds of thousands of allocator instances in a webserver.
not per coroutine, per roc entrypoint
so per request handler
Yeah, that is per coroutine. Each coroutine would be running a roc request
hm, why would that be bad? :thinking:
having a separate heap for each of them
e.g. seems to work great for Erlang :smile:
Maybe it wouldn't be, but it would be a lot more overhead than what is normal.
Cause they won't share pages, for example. So each will end up claiming some number of pages even if very little memory is used by roc.
yeah that's fair
I mean we can try it out and see how it does
can also try that out alongside arenas per request
I mean, to be clear, Folkert and I spent a ton of hours trying to get C++ style exceptions working without a libc++ dependency, everyone I asked about it said what we were trying to do was a huge mistake, and scrapping it really made things a lot simpler - so I'm not saying we could never do it, just that I think the downside is very serious, so the upside really needs to justify it :sweat_smile:
Yeah, totally fair around exceptions.
Also, to be fair, most languages just don't recover from stack overflows. Like no attempt is made at all
Same as OOMs
Just classes of bugs that are accepted as fatal generally
But I guess you still need some solution to calls to crash....so :shrug:, may need this kind of heavy handed solution.
Brendan Hansknecht said:
Also, to be fair, most languages just don't recover from stack overflows. Like no attempt is made at all
that's true of like C and C++ and Rust, sure, but not the garbage collected languages we're competing directly with
like if you hit a stack overflow in Java, C#, JavaScript, Ruby, Python, they all definitely let you gracefully recover - and normally they translate it to a normal exception
I don't know about Go but given its approach to stacks I assume so too
like a whole webserver going down because one request handler stack overflowed would be a deal-breaker I think
and if we solve it for hosts in that use case, then we have a solution for all hosts :smile:
I don't know about Go but given its approach to stacks I assume so too
Go doesn't have stack overflows. Just OOMs. Stack will keep growing until OOM. At least that is my recollection
ahh right
Also, it might just be the case that we have to hack something like mimalloc or tcmalloc to instead of being thread local, being coroutine local. That may work to avoid the overhead, but give good cleanup.
ooh interesting!