Errors with stack traces · ideas

Kasper Møller Andersen (Jan 01 2025 at 11:48):

Having worked with functional error handling a bunch in Elm and Scala, I feel confident in saying that it is quite good at discerning what went wrong and adding context at different layers. But functional error handling is fairly poor at discerning where the error happened, and certainly much inferior to the stack trace you get when throwing an exception (regardless of any other deficiencies exceptions have).

I'm not suggesting that stack traces are the end-all-be-all of error locations, but I do think they would be a big step up from what we have today. I'll write out a bit about what's wrong with the current setup further down, but I'll write my proposal first:

Proposal

We have talked a bunch about how to manage errors with try, ?, ??, and map_err, all centered around Result where you can put in anything as an error. But what if the error variant, instead of being Err e, was Err (Error e), where Error is a custom type provided by the standard library that can contain extra information, like a stack trace?

The reason we need to store the stack trace is because we're going to build it as we pass errors around with the various operators. I'll admit that I'm only confident in what semantics map_err has, and the others I just find confusing at the moment, so I'm going to suggest my own operators here to avoid getting tangled up in existing semantics. But in the real world, they should just be merged into the existing operators.

Operator 1, aka new_error (just pretend it's an operator and not a keyword):
The new_error operator takes any value e and creates an Error e from it. When the compiler sees the new_error operator, it creates a string with the location of the current file and line, e.g. "MyRocProject/MyCode.roc:234", and inserts it as the first element of the stack trace in the Error, which stores the stack trace as just a List String.

After that comes operator 2, aka propagate_error:
The propagate_error operator works like new_error, except it doesn't create a new error but just propagates an existing one. But when the compiler sees this operator, it will still create the next level of the stack trace and append it to the Error.

Then there's operator 3, map_error_operator (named to not be confused with the existing function):
Again it's only special in that it builds the stack trace behind the scenes when used, but otherwise it just maps the contents of the given error as usual.

In other words, Error can only be constructed and mapped by using operators that build out the stack trace, and the existing map_err function would be removed. Users can still return an existing Error normally which would not build out the stack trace, but I think that could produce a warning when building.
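For comparison, here is a minimal Rust sketch of the three proposed operators. It is not Roc and not part of the proposal: `Traced`, `new`, `propagate`, `map`, and the `read_file`/`read_config`/`load` functions are hypothetical names. It relies on Rust's `#[track_caller]` attribute and `std::panic::Location::caller()` to have the compiler supply the call-site location, which is roughly what the proposal asks the Roc compiler to do.

```rust
use std::panic::Location;

/// Hypothetical stand-in for the proposed `Error e`:
/// a payload plus a trace of source locations, oldest first.
#[derive(Debug)]
pub struct Traced<E> {
    pub payload: E,
    pub trace: Vec<String>,
}

fn here(loc: &Location<'_>) -> String {
    format!("{}:{}", loc.file(), loc.line())
}

impl<E> Traced<E> {
    /// Plays the role of `new_error`: wraps a payload and records the call site.
    #[track_caller]
    pub fn new(payload: E) -> Self {
        Traced { payload, trace: vec![here(Location::caller())] }
    }

    /// Plays the role of `propagate_error`: same payload, one more trace entry.
    #[track_caller]
    pub fn propagate(mut self) -> Self {
        self.trace.push(here(Location::caller()));
        self
    }

    /// Plays the role of `map_error_operator`: maps the payload and records the call site.
    #[track_caller]
    pub fn map<F, G: FnOnce(E) -> F>(mut self, f: G) -> Traced<F> {
        self.trace.push(here(Location::caller()));
        Traced { payload: f(self.payload), trace: self.trace }
    }
}

// Hypothetical call chain: the error is created, mapped once, then propagated once.
fn read_file() -> Result<String, Traced<&'static str>> {
    Err(Traced::new("FileReadError"))
}

fn read_config() -> Result<String, Traced<String>> {
    read_file().map_err(|e| e.map(|p| format!("CouldNotReadConfigError({})", p)))
}

pub fn load() -> Result<String, Traced<String>> {
    read_config().map_err(|e| e.propagate())
}
```

Calling `load()` yields an error whose trace has three entries, one per operator use. Note that every entry costs a heap-allocated string in this sketch, which is the kind of overhead raised later in the thread.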

Problems in current setup

Imagine an error like this, which is nested according to current best practices:

ConfigError
    CouldNotReadConfigError
        FileReadError

FileReadError is the original error, which has been wrapped up a few times.

This structure is a "jack of all trades, master of none" error structure. That is:

FileReadError is just the original "symptom" of what went wrong, as many errors are only symptoms of the real problem somewhere else anyway.
CouldNotReadConfigError actually tells you what went wrong at the context level that, in this example, has the most information about the error.
ConfigError exists mainly as a form of stack trace, to tell you where the error occurred, but it doesn't add any useful information about what went wrong.

And in the real world you'd have potentially many more levels mixing both context and tracing together. I think it would be preferable for this error to be something like

{ payload = CouldNotReadConfigError (FileReadError)
, trace = [
    "MyRocProject/Config.roc:234"
    , "MyRocProject/Setup.roc:123"
    , "MyRocProject/Main.roc:23"
  ]
}

where the contents of payload is what Error is generic over (so Error payload essentially).

The benefits here are twofold:

The payload is less noisy than the original error, as it is only responsible for providing error context.
The stack trace is a much clearer trace than what the old error provided.

There's a few reasons the stack trace is superior to the old trace:

Error variants need to be named uniquely across the code base to be fully traceable (e.g. ConfigError may be applied in multiple places in the code, and it might not be clear from the trace which one was hit)
With Roc, propagating errors without wrapping them is easy, so the old kind of tracing can easily be missing information that's needed to fully trace the error.
Asking people to map their errors everywhere easily leads to very stuttering code a la Config.get!().map_err(ConfigError). This code is harder to read without any real benefit. Roc affords us easy propagation of errors, but we're not really getting the most out of it when the best practice is to map the errors everywhere, even when it doesn't add value to do so.

All the issues above can also be fixed with developers writing disciplined code, but in that case, it still feels like it needs a lot of effort compared to just getting a stack trace for free, and only mapping your errors when you actually have something useful to add to it.

Anton (Jan 01 2025 at 14:00):

I'm not sure about the specifics with the proposed operators but I would like to have errors with traces as well :)

Brendan Hansknecht (Jan 01 2025 at 14:45):

One of the reasons this isn't done is because the performance is generally awful.

The issue is that errors in results are generally not exceptional. They are actually pretty common in many cases. Adding locations is both bloat to the binary and extra allocation with data movement.

That said, I completely understand the goal here. Location traces can be great. Debugging can be a pain without them.

This is the one advantage of exceptions in my opinion. They essentially have free error traces. That said, exceptions are also slower in the error cases where errors are often handled by default. They really are meant for the exceptional case where errors are very uncommon.

Brendan Hansknecht (Jan 01 2025 at 14:47):

I generally find that either the error is expected to be handled, at which point any sort of error trace or even wrapping is pretty wasteful. (It might still be worthwhile to wrap a little, but not a ton.)
Or the error is truly exceptional and is not expected to be handled, at which point crashing can give you a backtrace and has low cost.

Brendan Hansknecht (Jan 01 2025 at 14:49):

Yet, as shown by error context wrapping in go and rust, clearly some form of nested error that kinda half holds a stack trace will exist in pretty much any language that defaults to using errors instead of exceptions. (My gut feeling is that this is a bad design pattern)

One main advantage of simply wrapping is that it is way cheaper than strings and exceptions. You get a string in the form of the tag name, but it is nearly free to add.
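As a rough illustration of that cost difference, the Rust sketch below compares the in-memory size of a nested-tag error with an error that carries a list of location strings. Rust enums are only an analogy for Roc tags here, and all type names are hypothetical.

```rust
use std::mem::size_of;

// Nested wrapping: each layer is just another tag around the inner error.
#[allow(dead_code)]
pub enum FileReadError {
    NotFound,
    PermissionDenied,
}

pub enum CouldNotReadConfigError {
    FileRead(FileReadError),
}

pub enum ConfigError {
    CouldNotRead(CouldNotReadConfigError),
}

// String-based trace: every recorded location is a heap-allocated string.
pub struct StringTraced {
    pub payload: FileReadError,
    pub trace: Vec<String>,
}

/// Returns (size of the wrapped error, size of the string-traced error) in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<ConfigError>(), size_of::<StringTraced>())
}
```

The wrapped error stays a small tag with no allocation, while the traced one holds at least a vector header and allocates as soon as a location is pushed.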

Anton (Jan 01 2025 at 14:52):

Could we go with something like RUST_BACKTRACE=1?

Brendan Hansknecht (Jan 01 2025 at 14:52):

General question. What is your plan for the trace? Is it ever actionable in code?

Anton (Jan 01 2025 at 14:53):

What is your plan for the trace?

I would want it just to understand the path the code has taken so I can understand and solve the bug more quickly

Brendan Hansknecht (Jan 01 2025 at 14:56):

Yeah, so not actionable in code.

Brendan Hansknecht (Jan 01 2025 at 14:58):

Also, looking at the state of anyhow in Rust and error wrapping in Go: they are basically akin to nested tags, but with slightly more free-form error strings. Also, anyhow will apparently store a backtrace in the root error if RUST_BACKTRACE=1.

Brendan Hansknecht (Jan 01 2025 at 15:00):

That definitely is an interesting idea. Not sure what it would take to orchestrate, but just grabbing a backtrace (not converting it to string yet) and holding onto that for printing when the error is printed.
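In Rust terms, that idea could look roughly like the sketch below. `TracedError` is a hypothetical type. `std::backtrace::Backtrace::capture()` only captures a trace when `RUST_BACKTRACE` or `RUST_LIB_BACKTRACE` enables it, and the trace is only rendered to text when the error is printed.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};
use std::fmt;

/// Hypothetical error type that holds a backtrace next to its message.
pub struct TracedError {
    pub message: String,
    backtrace: Backtrace,
}

impl TracedError {
    pub fn new(message: &str) -> Self {
        TracedError {
            message: message.to_string(),
            // Captures only if RUST_BACKTRACE / RUST_LIB_BACKTRACE enable it;
            // otherwise this yields a disabled backtrace.
            backtrace: Backtrace::capture(),
        }
    }

    pub fn has_backtrace(&self) -> bool {
        self.backtrace.status() == BacktraceStatus::Captured
    }
}

impl fmt::Display for TracedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)?;
        if self.has_backtrace() {
            // The backtrace is turned into text here, at print time.
            write!(f, "\n{}", self.backtrace)?;
        }
        Ok(())
    }
}
```

Running a program that prints a `TracedError` with `RUST_BACKTRACE=1` would include the trace; without it, only the message is printed.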

Brendan Hansknecht (Jan 01 2025 at 15:04):

The only way to get an equivalent trace in roc today would be to crash at the error generation site. Which is actually really easy to do with ??.

Brendan Hansknecht (Jan 01 2025 at 15:04):

Which, if you don't plan to handle the error in your app at all sounds totally reasonable

Richard Feldman (Jan 01 2025 at 15:37):

now that we have purity inference, platforms can offer a backtrace! function

Richard Feldman (Jan 01 2025 at 15:37):

which applications could use to log backtraces immediately when desired, separately from error handling

Richard Feldman (Jan 01 2025 at 15:38):

in other words, do something like log_error!(backtrace!(), "Something really unexpected happened") and then return Err

Richard Feldman (Jan 01 2025 at 15:39):

so decoupling the logging of the backtrace from the handling of the error
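A minimal Rust sketch of this decoupling, under the assumption that the logger is an effect handed in from outside (standing in for a platform-provided `log_error!`). `load_config` and `ConfigError` are hypothetical; `Backtrace::force_capture()` is the standard library call that captures regardless of environment variables.

```rust
use std::backtrace::Backtrace;

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    Missing,
}

/// Logs a backtrace right where the problem is detected, then returns a plain
/// error value that carries no trace at all.
pub fn load_config(
    path: &str,
    mut log_error: impl FnMut(&Backtrace, &str),
) -> Result<String, ConfigError> {
    if path.is_empty() {
        // The trace is taken on demand, at the error site.
        let backtrace = Backtrace::force_capture();
        log_error(&backtrace, "Something really unexpected happened");
        return Err(ConfigError::Missing);
    }
    Ok(format!("config from {}", path))
}
```

The caller sees only `Err(ConfigError::Missing)`; the trace lives in the log, not in the returned value.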

Anthony Bullard (Jan 01 2025 at 15:40):

If backtrace is a => Str function, you _could_ even store the backtrace in your Err if you wanted

Brendan Hansknecht (Jan 01 2025 at 15:42):

Switching to logging does work, but it also loses some value due to requiring an effectful function chain.

Anthony Bullard (Jan 01 2025 at 15:42):

That's a strong point

Anthony Bullard (Jan 01 2025 at 15:43):

Can't we just have a dbg-like command add a stack-trace for development?

Anthony Bullard (Jan 01 2025 at 15:43):

Or something of the sort?

Richard Feldman (Jan 01 2025 at 15:45):

yeah there's an ongoing open question as to whether there will be demand in practice for logging in the middle of pure functions, or if it's fine to have logging (which is obviously an effect) only allowed in effectful functions

Richard Feldman (Jan 01 2025 at 15:46):

that makes the most sense by default, of course

Brendan Hansknecht (Jan 01 2025 at 15:48):

Yeah, I guess backtraces are still kinda tangential to this. If we exposed logging as a special non-effectful builtin like dbg, a platform could choose to log a backtrace on every error log.

Richard Feldman (Jan 01 2025 at 15:48):

the argument for "allow logging in pure functions as a special case exception" is that it's an effect that isn't supposed to affect the rest of the program, and is also theoretically only supposed to be recording what's happening, so if the compiler decides to optimize the pure function away (e.g. evaluate it at compile time) then the fact that the logging gets skipped should also be harmless

Anthony Bullard (Jan 01 2025 at 15:48):

I think it's fine with something like dbg that's stripped from release builds

Brendan Hansknecht (Jan 01 2025 at 15:48):

I guess the biggest disadvantage of solutions like logging is that they are more verbose...though it could just be my_fn(a, b, c) ? log_err!

Richard Feldman (Jan 01 2025 at 15:49):

certainly I expect webservers to do lots of logging (and/or spans/traces/etc)

Brendan Hansknecht (Jan 01 2025 at 15:49):

Yep

Anthony Bullard (Jan 01 2025 at 15:49):

I think a keyword like trace would be nice paired with ??

Richard Feldman (Jan 01 2025 at 15:49):

not sure how much that will vary by use case, and how much other use cases want logging

Anthony Bullard (Jan 01 2025 at 15:50):

I think it's fine if the handler type is => for a webserver

Anthony Bullard (Jan 01 2025 at 15:50):

As long as your core logic is pure

Anthony Bullard (Jan 01 2025 at 15:50):

That keeps effects on the edges

Anthony Bullard (Jan 01 2025 at 15:51):

If you don't allow for logging in request handlers, I think your webserver use cases are dead in the water

Richard Feldman (Jan 01 2025 at 15:52):

well request handlers are usually full of I/O, so certainly those are effectful :big_smile:

Anthony Bullard (Jan 01 2025 at 15:52):

Exactly

Brendan Hansknecht (Jan 01 2025 at 15:54):

I think it is very important to note that the goal of this idea is to get a backtrace from any function (including a pure function), not just an effectful one. The pure function returns a result and as part of the error, a trace is included. This enables an effectful logger to capture the full context that starts at the error root in a pure function.

Brendan Hansknecht (Jan 01 2025 at 15:55):

If it only works for effectful functions it is a lot less useful for debugging and understanding the full stack trace when first developing code.

Richard Feldman (Jan 01 2025 at 16:02):

yeah so it seems to me like:

if we treat logging as special, and permit it inside pure functions even though it's an effect (with the understanding that logging statements inside pure functions may get optimized away and are never guaranteed to run), then having those logging functions be able to include backtraces seems fine
having a supposedly pure function which returns a backtrace seems super scary to me. A pure function absolutely should never return a different value when given the same arguments, and it feels like we'd be in UB territory (conceptually) in the sense that I can't even begin to imagine all the assumptions having a function like that would break. So I definitely don't think we should do that.
Having a language keyword that evaluates to "the source code location where the keyword was used" would not be a backtrace, but wouldn't break purity. It would be a constant that never changes at runtime. It would still be risky to have in the language (because it would suddenly become true that you can add a newline somewhere and now your program breaks because some code somewhere was relying on source position at runtime for some reason) but it at least wouldn't break purity.
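Rust's built-in `file!()` and `line!()` macros are an existing example of the third option: they expand at compile time to the location where they are written, so the function below stays pure. The function name is hypothetical.

```rust
/// Returns "file:line" for the line on which the macros appear below.
/// The value is fixed at compile time, so every call returns the same string.
pub fn config_error_location() -> String {
    format!("{}:{}", file!(), line!())
}
```

This also shows the risk mentioned above: inserting a line above the `format!` call changes the returned string, so any code that depends on its exact value would break.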

Richard Feldman (Jan 01 2025 at 16:03):

then there's the separate issue of "I'm just trying to debug the program I'm running right now, I don't care about persistent logging"

Richard Feldman (Jan 01 2025 at 16:04):

for that use case, one obvious question is "if we had a really nice debugger, how much demand would remain for backtraces inside pure functions?"

Richard Feldman (Jan 01 2025 at 16:04):

I'm not sure what the answer would be there

Richard Feldman (Jan 01 2025 at 16:06):

the nice thing about doing effectful logging of backtraces via the platform is that it's already doable today, so there's no blocker to trying it out and seeing what use cases remain in practice when you already have that

Brendan Hansknecht (Jan 01 2025 at 16:07):

Richard Feldman said:

for that use case, one obvious question is "if we had a really nice debugger, how much demand would remain for backtraces inside pure functions?"

I kinda have an answer to this. There are no good debuggers that exist on all platforms and are easy to use. As such, there is almost always a demand to easily add a backtrace at least for people who aren't used to debuggers. As someone used to debuggers (albeit mostly stuck in cli with lldb), I still often would rather just get a good backtrace and never have to open a debugger

Brendan Hansknecht (Jan 01 2025 at 16:08):

Yeah, I definitely think we should add a logging effect to basic CLI and basic webserver that can also log a backtrace. That might alleviate a lot of the pain.

Kasper Møller Andersen (Jan 01 2025 at 16:48):

The goal is really to create more useful errors. Both in stopping people from trying to come up with makeshift alternatives to stack traces inside their errors, which dilute the error itself, but also in actually giving people the information they need.

Generally I find the types of errors reported with Result break down into three kinds:

The ones you actually recover from
The ones you surface to the user, because the error pertains to something they did wrong
The ones you surface to the user because something completely unexpected went wrong, and e.g. they need to file an error report.

Before you figure out which kind of error you have, I think every error goes through roughly these stages:

Every Result error starts as a potentially recoverable error
As you return the error through the stack, you map it and accumulate context
At some point, you decide whether you will in fact recover from it or not. If you recover, your work is finished.
If you don't recover, you're reaching a point where the code has concluded its investigation of the error, and there's nothing left but to propagate it out.

In the ideal world, I think I would want the following:

When the code has decided that it cannot recover from the error, it will log the complete "investigation" that has been done up until that point. That is, both the useful error tags like CouldNotReadConfigError, as well as any context it's received, along with the full stack trace of where the error originated.
All code handling this error after this point should only propagate it, until there's some code that actually shows it to the user. This could be crash, but I think it's important that crash won't give the stack trace that I want. It only has the stack trace of where the investigation concluded, not where the error originated.

Brendan Hansknecht (Jan 01 2025 at 16:54):

Is a stack trace actually useful to an end user though?

Kasper Møller Andersen (Jan 01 2025 at 16:54):

I'm alright with logging context along the way, rather than at a single point, but it requires more discipline from developers and more tooling to get right, since you need to correlate more logs, rather than having a single log entry that contains everything you need, and you need to remember to log everywhere that might be relevant.

Kasper Møller Andersen (Jan 01 2025 at 16:56):

No, I wouldn't want a stack trace when reporting something end users messed up (e.g. permissions, configuration, etc.). But if the error is for the end user to report to the developer, you definitely want the stack trace available for the user to report. Like "please open a GitHub issue at #link and include file applogs.log"

Brendan Hansknecht (Jan 01 2025 at 17:07):

Yeah, this is one of those surprisingly annoying problems of results and error returns. You really don't want to pay extra cost on every error, but stack traces are amazing when needed.

Brendan Hansknecht (Jan 01 2025 at 17:07):

And you can't get a full stack trace if you first return through a chain of pure functions.

Jasper Woudenberg (Jan 02 2025 at 08:48):

Kasper Møller Andersen said:

I'm alright with logging context along the way, rather than at a single point, but it requires more discipline from developers and more tooling to get right, since you need to correlate more logs, rather than having a single log entry that contains everything you need, and you need to remember to log everywhere that might be relevant.

Instrumenting code with tracing instead of logging might be a good alternative for this. If an operation fails you'll end up with a single trace of the request/operation in which the error happened.

Because trace frames, like logs, are added manually, they don't capture as many frames as a stack trace, so that's a downside.

But an upside is that a trace of an error can contain frames about code branches that were completed before the error happened, which can provide a ton of useful information when debugging. Plus, you can use that trace for other types of debugging as well, such as looking into performance problems.
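To make the difference from a stack trace concrete, here is a small hand-rolled Rust sketch of manual span tracing. It does not use any real tracing library; `Trace`, `SpanStatus`, `span`, and `handle_request` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum SpanStatus {
    Ok,
    Failed(String),
}

/// Collects one entry per instrumented operation, in completion order.
#[derive(Default)]
pub struct Trace {
    pub spans: Vec<(String, SpanStatus)>,
}

impl Trace {
    /// Runs `op`, records whether it succeeded, and passes its result through.
    pub fn span<T>(
        &mut self,
        name: &str,
        op: impl FnOnce() -> Result<T, String>,
    ) -> Result<T, String> {
        let result = op();
        let status = match &result {
            Ok(_) => SpanStatus::Ok,
            Err(e) => SpanStatus::Failed(e.clone()),
        };
        self.spans.push((name.to_string(), status));
        result
    }
}

/// Hypothetical request: the first step succeeds, the second one fails.
pub fn handle_request(trace: &mut Trace) -> Result<u32, String> {
    let user = trace.span("load_user", || Ok(42u32))?;
    trace.span("load_config", || Err::<(), String>("FileReadError".to_string()))?;
    Ok(user)
}
```

After the failure, the trace still contains the successful `load_user` span, which a stack trace taken at the error site would not show. Only the operations that were wrapped in `span` appear, which is the downside mentioned above.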

Tobias Steckenborn (Jan 02 2025 at 09:10):

Yet tracing without some correlated logs (or events) isn't really helpful either, is it?

I like what e.g. they are doing here:
https://effect.website/blog/releases/effect/311/#effectfn
https://effect.website/blog/releases/effect/312/#effectfn-improvements
https://effect.website/docs/observability/tracing/

Kasper Møller Andersen (Feb 02 2025 at 11:53):

I wanna bang this drum again, because I think stack traces are still really important. Roc's error handling is pretty similar to Rust in a lot of ways, and it's very easy to find people asking about how they get stack traces in Rust, like so and so. Not having stack traces by default is essentially a big deficiency and eats a good chunk of weirdness budget.

I also still think it nullifies a good deal of the benefit of Roc's tag unions, because it essentially forces people to wrap their error types a whole bunch to try and recreate stack traces. So even though you can just propagate errors in Roc, you might not really want to, because you lose location information that way. Roc wants to make it easy to map errors, but I think that's partially about fixing this symptom, rather than fixing the root problem, because we would be mapping errors way less if we didn't have to manually build up types to mimic stack traces.

Kasper Møller Andersen (Feb 02 2025 at 11:53):

So here's a revised proposal for how this might look!

Result looks like this:

Result ok err :
    [
        Ok ok,
        Err StackTrace err
    ]

where StackTrace is a nominal type that stores stack trace lines (we'll get back to how it does so later). Those lines would all be the type StackTraceLine a la:

StackTraceLine :
    [
        MyRocProject__Config U16,
        MyRocProject__Setup U16,
        MyRocProject__Main U16,
        ...
    ]

where there exists a function line_to_string that does this:

when line is
    MyRocProject__Config n -> Str.concat("MyRocProject/Config.roc:", Num.to_str(n))
    ...

StackTraceLine and line_to_string would be generated by the compiler and not something the developer would deal with. It does have a few implications:

Developers can't construct an Err themselves, because they would need to insert the right stack trace line when doing so.
Err should have a function trace_to_string that converts the full stack trace to a human readable string.

In other words, we would need a way to construct an Err with a keyword probably. Like fail or (please don't shoot me) throw :big_smile:

Taking an example from the tutorial might then look like:

|str|
    if Str.is_empty(str) then
        Ok "it was empty"
    else
        fail ["it was not empty"]

And any time you use ? and whatever else we have to handle errors these days, the compiler basically desugars that to the same code as today, except it also inserts the corresponding StackTraceLine into the StackTrace.

What would StackTrace look like though? Ideally it would be an array on the stack, as that would be the simplest and most performant solution I think. We would have to spill the lines onto the heap at some point of course, but given that a single StackTraceLine would only take up something like 32 bits, there's at least room for a chunk of them on the stack.

Without arrays though, how might StackTrace look? It might just be

{ line1, line2, line3, line4 ... }

and keeping track of which line to use with an integer and something like

when previousLine is
    1 -> set_line_2(...)
    2 -> set_line_3(...)
    ...

Not as nice as an array, but workable at least. Alternatively, I don't know if it's been considered to have List be able to start off on the stack when there are only a few elements in it, and only spilling to the heap as needed?

And then there's the question of how many errors should be storable before we spill onto the heap. I don't have a good answer to this. On the one hand, a small number might be sufficient, because if you're not handling the error in short order, you're probably going to let it bubble all the way out anyway. On the other hand, a library might have a deep stack of its own before the error reaches the user of the library, and it would be nice not to use the heap before they've had the chance to deal with the error. So I'm not really sure what makes the most sense there.
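The layout described above can be sketched in Rust as follows. This is an illustration under assumptions, not the proposed Roc implementation: a trace line is packed into 32 bits (a 16-bit module id plus a 16-bit line number), the inline capacity of 4 is an arbitrary choice, and all names are hypothetical.

```rust
/// One trace line in 32 bits: module id in the high half, line number in the low half.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct StackTraceLine(u32);

impl StackTraceLine {
    pub fn new(module_id: u16, line: u16) -> Self {
        StackTraceLine(((module_id as u32) << 16) | (line as u32))
    }
    pub fn module_id(self) -> u16 {
        (self.0 >> 16) as u16
    }
    pub fn line(self) -> u16 {
        (self.0 & 0xFFFF) as u16
    }
}

const INLINE_CAPACITY: usize = 4; // arbitrary, see the open question above

/// Stores the first few lines inline and spills the rest to the heap.
pub struct StackTrace {
    inline: [StackTraceLine; INLINE_CAPACITY],
    len: usize,
    spilled: Vec<StackTraceLine>, // an empty Vec does not allocate
}

impl StackTrace {
    pub fn new() -> Self {
        StackTrace {
            inline: [StackTraceLine(0); INLINE_CAPACITY],
            len: 0,
            spilled: Vec::new(),
        }
    }

    pub fn push(&mut self, line: StackTraceLine) {
        if self.len < INLINE_CAPACITY {
            self.inline[self.len] = line;
        } else {
            self.spilled.push(line); // first heap allocation happens here
        }
        self.len += 1;
    }

    pub fn has_spilled(&self) -> bool {
        !self.spilled.is_empty()
    }

    /// All recorded lines, oldest first.
    pub fn lines(&self) -> Vec<StackTraceLine> {
        let inline_len = self.len.min(INLINE_CAPACITY);
        self.inline[..inline_len]
            .iter()
            .chain(self.spilled.iter())
            .copied()
            .collect()
    }
}

/// Counterpart of `line_to_string`: the path table would be compiler-generated.
pub fn line_to_string(line: StackTraceLine, module_paths: &[&str]) -> String {
    format!("{}:{}", module_paths[line.module_id() as usize], line.line())
}
```

With this layout, pushing the first four lines touches no heap memory; the fifth push is the first allocation. Strings are only built when `line_to_string` is called.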

Richard Feldman (Feb 02 2025 at 13:07):

I think adding this amount of runtime overhead to every error operation is too incompatible with Roc's goals of running fast

Richard Feldman (Feb 02 2025 at 13:08):

for example, this would mean that doing div_checked potentially does a bunch of string copying and dynamic array resizing - that's just wildly out of bounds for an acceptable performance cost compared to today where it's a single branch

Richard Feldman (Feb 02 2025 at 13:11):

of the options we've discussed, this seems like the frontrunner to me:

Richard Feldman said:

now that we have purity inference, platforms can offer a backtrace! function

then error tracing libraries like bugsnag can take an effectful "get backtrace" function during init, so when they log errors they automatically include stack traces just like they do in e.g. JavaScript. "Log an error to an external service, including stack trace" is the most common scenario I've seen for stack traces being useful for debugging after the fact, and we already have full support for that use case today!
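A rough Rust sketch of that shape: the reporting library never captures traces itself, it is handed a "get backtrace" function once at init. `ErrorReporter` is hypothetical and does not reflect any real error-tracking client API; `sent` stands in for the network call.

```rust
/// Hypothetical error-reporting client that receives its backtrace source at init.
pub struct ErrorReporter {
    get_backtrace: Box<dyn Fn() -> String>,
    /// Stand-in for "sent to an external service".
    pub sent: Vec<String>,
}

impl ErrorReporter {
    pub fn init(get_backtrace: impl Fn() -> String + 'static) -> Self {
        ErrorReporter {
            get_backtrace: Box::new(get_backtrace),
            sent: Vec::new(),
        }
    }

    /// Attaches a freshly captured trace to every reported error.
    pub fn report(&mut self, message: &str) {
        let trace = (self.get_backtrace)();
        self.sent.push(format!("{}\n{}", message, trace));
    }
}
```

In a real Rust program the function passed to `init` could be `|| std::backtrace::Backtrace::force_capture().to_string()`; in the Roc setting it would be whatever effectful function the platform exposes.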

if you're debugging locally, there are other options (e.g. setting a breakpoint and seeing what the trace is at that point)

Kasper Møller Andersen (Feb 02 2025 at 13:14):

Note that I specifically addressed those performance concerns with the new proposal. Creating an error shouldn’t allocate strings or lists on the heap, require anything to be resized, etc.

Richard Feldman (Feb 02 2025 at 13:41):

:thinking: how would it prevent resizing?

Richard Feldman (Feb 02 2025 at 13:42):

sorry, I think I should be more direct about this: regardless of performance, I don't think we should do this.

Richard Feldman (Feb 02 2025 at 13:43):

I don't think Result should store stack trace information, period, and I think the Rust error handling libraries that do similar things are the wrong design

Richard Feldman (Feb 02 2025 at 13:47):

I think there are two scenarios where we want stack traces:

we want to log it immediately for later analysis, right at the point where an error occurred. I think this can be done in an effectful function just fine.
we're in the middle of debugging a locally running program. I think this can be done using breakpoints or similar

Richard Feldman (Feb 02 2025 at 13:49):

I agree that stack traces are valuable information, but I disagree with the premise that we should store them eagerly and accumulate them and pass them around, just in case we want them later. I think we explicitly should do the opposite of that, and only retrieve a trace on demand, right at the point where we've determined we want it.

Brendan Hansknecht (Feb 02 2025 at 19:39):

Note that I specifically addressed those performance concerns with the new proposal. Creating an error shouldn’t allocate strings or lists on the heap, require anything to be resized, etc.

I work on a system that has to store a stack trace on creation of nodes. It just stores the raw reference to the stack trace and is doing pretty minimal work. It is still quite costly. Much more expensive than the old version that just returned errors without nice locations (like 1.5 to 2x slower, and it is not storing that many stack trace references).

I am really curious to see what mojo ends up doing in this space. They currently only have an error type and not an exception type cause they have not found a performant enough way to do exceptions (though I think they had some ideas). Due to wanting to be a superset of python in the long term, they definitely want exceptions eventually. Currently the solution is to run code in the debugger and make it so that any time an error is generated the debugger adds the stack trace and treats it like an exception (that or manually grabbing the stack trace and adding it to an error explicitly). It is currently pretty painful to work with.

Brendan Hansknecht (Feb 02 2025 at 19:41):

I wanna bang this drum again, because I think stack traces are still really important.

Yeah, rust and go often deal with this by repeated wrapping and adding of more and more context. It is definitely not as nice as a stack trace in most cases.

Richard Feldman (Feb 02 2025 at 19:47):

I think it really depends on the application

Richard Feldman (Feb 02 2025 at 19:48):

like in the Roc compiler I want context so I can report them to the user

Richard Feldman (Feb 02 2025 at 19:48):

I wouldn't want to spit out a stack trace even if it were free

Richard Feldman (Feb 02 2025 at 19:50):

in a web server I want my logged error events to have stack traces but I don't think having the stack trace be passed around as a value is in any way useful to me (although it's a security concern if it's inspectable)

Brendan Hansknecht (Feb 02 2025 at 19:53):

I wonder if we can enable getting a stack trace cleanly (even if only for crash messages).

With debug info and the llvm backend, backtraces should work if grabbed from the host. If running via the interpreter, a host backtrace would be useless.

Richard Feldman (Feb 02 2025 at 20:10):

hm yeah that's true

Richard Feldman (Feb 02 2025 at 20:11):

actually the compiled Roc app could expose a function to the host for getting the current roc backtrace

Richard Feldman (Feb 02 2025 at 20:12):

which the host could call, both for its own use and also as a way to provide it to the app

Brendan Hansknecht (Feb 02 2025 at 20:12):

Yep, though I assume that would add a dependency on libunwind to roc. Which might be ok.

Richard Feldman (Feb 02 2025 at 20:12):

and then that function could silently either ask the interpreter or else walk stack frames

Richard Feldman (Feb 02 2025 at 20:13):

yeah that seems fine

Richard Feldman (Feb 02 2025 at 20:13):

like we want every host to be able to support backtraces

Brendan Hansknecht (Feb 02 2025 at 20:13):

And then we could also expose that functionality to the app (though only as an effect?)

Richard Feldman (Feb 02 2025 at 20:13):

and right now you kind of have to know the tricks

Richard Feldman (Feb 02 2025 at 20:14):

I think it should be up to the platform to provide that functionality to the application (or not), but we should make it trivial for platforms to offer it

Luke Boswell (Feb 02 2025 at 20:15):

Richard Feldman said:

like we want every host to be able to support backtraces

Even in a use case where roc is fully embedded in a larger host, like a game engine?

Richard Feldman (Feb 02 2025 at 20:15):

I think it's just simpler if all effectful functions come from the platform, no exceptions

Brendan Hansknecht (Feb 02 2025 at 20:15):

fair

Richard Feldman (Feb 02 2025 at 20:16):

@Luke Boswell sure, like if the roc plugin crashes, you want to be able to know what chain of calls led to the crash

Kasper Møller Andersen (Feb 02 2025 at 20:42):

Brendan Hansknecht said:

I work on a system that has to store a stack trace on creation of nodes. It just stores the raw reference to the stack trace and is doing pretty minimal work. It is still quite costly. Much more expensive than the old version that just returned errors without nice locations (like 1.5 to 2x slower, and it is not storing that many stack trace references).

What is the reference to in this case? I assume it's a heap allocated collection (whether string or something else), so the cost is for building that initial trace, rather than just holding on to it?

Kasper Møller Andersen (Feb 02 2025 at 20:44):

Richard Feldman said:

I don't think Result should store stack trace information, period, and I think the Rust error handling libraries that do similar things are the wrong design

I'm curious what you see as being wrong about that design? Not that I disagree necessarily, I just want to make sure we're talking about the same things :blush:

Brendan Hansknecht (Feb 02 2025 at 20:46):

What is the reference to in this case?

A traceback object, which should just be a list of function pointers extracted from the stack. No strings have been created yet. But I assume it has to walk the stack and make a list. I guess you could minorly amortize the cost if you grab it one step at a time on every return, but I think it is fundamentally the same amount of extra cost.

Richard Feldman (Feb 02 2025 at 21:06):

Kasper Møller Andersen said:

Richard Feldman said:

I don't think Result should store stack trace information, period, and I think the Rust error handling libraries that do similar things are the wrong design

I'm curious what you see as being wrong about that design? Not that I disagree necessarily, I just want to make sure we're talking about the same things :blush:

in no particular order:

it complicates the type
it breaks referential transparency (you can call the same function with the same args from 2 different functions and get 2 different answers)
it's either a security concern (packages being able to see the trace in which they were used) or else the API is even more complicated to prevent that
it's redundantly storing info that's already available right at the moment you're most likely to want it (right at the moment where the error occurred, because you either want to log it right away or else pause the program right away and explore the state) and then passing it around just in case you also want it later too

Richard Feldman (Feb 02 2025 at 21:09):

Result is a simple and flexible type, and including stack trace information seems like massive scope creep for it with really unclear benefits in comparison to alternative ways of getting stack traces that don't involve Result

Richard Feldman (Feb 02 2025 at 21:12):

even if it were free, the idea that a Dict.get saying a key wasn't present in the dictionary triggers an automatic walking of the entire stack frame feels wrong in a visceral way

Kasper Møller Andersen (Feb 02 2025 at 21:30):

Richard Feldman said:

I agree that stack traces are valuable information, but I disagree with the premise that we should store them eagerly and accumulate them and pass them around, just in case we want them later. I think we explicitly should do the opposite of that, and only retrieve a trace on demand, right at the point where we've determined we want it.

My problem with this approach is that it relies on discipline to get a lot of things right, and you don't really know ahead of time when you're going to need it. Since capturing a trace is not the default, you end up having to decide between paying the performance cost or the debugging cost without knowing what the debugging cost is (because you have to understand every way a piece of code can fail in order to know that cost).

I would personally much rather lug around a stack trace, and be able to opt out of collecting it in the few places where I know this performance matters.

Kasper Møller Andersen (Feb 02 2025 at 21:36):

Richard Feldman said:

even if it were free, the idea that a Dict.get saying a key wasn't present in the dictionary triggers an automatic walking of the entire stack frame feels wrong in a visceral way

This seems like you're thinking the Err should walk the entire stack upon creation, which isn't what I'm proposing. Instead I'm proposing that the stack trace is built up as the error is propagated through the stack anyway. In this sense, creating an Err still has no logic attached (no branches, no heap allocations). You only have to deal with this as you start propagating the error.

Just so I'm sure you're disagreeing with the right thing :smiley:

Richard Feldman (Feb 02 2025 at 22:05):

fair, but I don't see how that would work without the possibility of reallocation if that gets too big

Kasper Møller Andersen (Feb 02 2025 at 22:06):

I guess the underlying problem is that stack traces attached to Result are an imperfect approximation anyway. What we really want is a way to retrace the exact steps the code took to get to a certain point, and it just happens that Result is usually the place where the breakage becomes visible.

Having said that, I do worry that Roc's strength of allowing you to do whatever you want with errors is also a great weakness. Because it means you are free to do nothing at all with the error until you are far away from its origin. It's kind of like exceptions in that regard, except you don't get a stack trace either, so you're truly in trouble when you have to debug where it came from. And it's not like making this error easily debuggable is a one-off effort. It requires continuous discipline at every level the error gets passed around.

Richard Feldman (Feb 02 2025 at 22:07):

I think the history of errors in programming is that they are mostly ignored way more often than they should be

Richard Feldman (Feb 02 2025 at 22:07):

I've never seen any system that really fixes this

Kasper Møller Andersen (Feb 02 2025 at 22:08):

Richard Feldman said:

fair, but I don't see how that would work without the possibility of reallocation if that gets too big

You would need to reallocate at some point. I'm just distinguishing between:

initial creation (which costs about the same as today, but maybe using slightly more stack space)
first few propagations (which we can also keep on the stack up to some limit)
later propagations (where you need to reallocate, but you may already have propagated it out enough times that you're already sending the error all the way out to the user anyway)
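
A rough sketch of these three phases, under loud assumptions: Python cannot express stack versus heap allocation, so the two lists below only model the bookkeeping, and both the `Trace` class and the capacity of 4 are made up for illustration (the thread names no limit).

```python
from dataclasses import dataclass, field
from typing import List

INLINE_CAPACITY = 4  # assumed limit; not taken from the discussion


@dataclass
class Trace:
    # Stands in for a fixed number of slots that travel with the error value.
    inline: List[str] = field(default_factory=list)
    # Stands in for a heap buffer that is only needed once the slots are full.
    overflow: List[str] = field(default_factory=list)

    def push(self, location: str) -> None:
        """Record one propagation step."""
        if len(self.inline) < INLINE_CAPACITY:
            self.inline.append(location)    # first few propagations
        else:
            self.overflow.append(location)  # later propagations

    @property
    def uses_heap(self) -> bool:
        return bool(self.overflow)

    def lines(self) -> List[str]:
        return self.inline + self.overflow


trace = Trace()  # initial creation: nothing recorded yet
for depth in range(6):
    trace.push(f"Module.roc:{10 + depth}")
```

In this model only the fifth and sixth propagation touch `overflow`, which is the point being argued: the expensive path is reached only by errors that have already travelled far.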

Richard Feldman (Feb 02 2025 at 22:08):

exception-throwing systems and null-based systems seem to result in more unintentionally unhandled errors than Result/Option/Maybe

Richard Feldman (Feb 02 2025 at 22:11):

maybe a better way to frame my thinking on this is:

"Okay, so this will break purity, but hear me out..."
"Yikes, this had better be the most incredible upside I've ever heard of to compensate for that downside"
"Well it has a bunch of other tradeoffs"
"Okay then absolutely not"

Richard Feldman (Feb 02 2025 at 22:11):

like I don't really think it's worth spending more time talking about it, sorry :sweat_smile:

Kasper Møller Andersen (Feb 02 2025 at 22:15):

I have, if nothing else, won myself the right to feel smug the day people start complaining about not having stack traces :stuck_out_tongue:

Richard Feldman (Feb 02 2025 at 22:15):

hahaha :joy:

Kasper Møller Andersen (Feb 03 2025 at 07:28):

I went to bed feeling weird about this, because I think most arguments against this proposal are not based on the proposal itself, but rather just perceptions of what it is. Maybe that's on me for communicating it poorly, so I want to try again!

Just to get it out of the way: my proposal does not mess with purity in any way @Richard Feldman

As I laid out in the original post, today it is up to users to construct their error types such that they can actually be traced back to their origin. You do this by wrapping layers upon layers of error types, with the associated risks that you forget to wrap in some places and/or you reuse names of these error wrapper types. This makes it very easy to have an error that is only partially traceable, because you weren't 100% disciplined about the tracing.

My proposal takes that work that users need to be doing today themselves, and automates it. It's the same fundamental mechanism, just handled by the language as opposed to the user. And because it doesn't rely on effects, it works just as well for libraries as for applications (where backtrace! needs to be hooked up in a library for example).

Kasper Møller Andersen (Feb 03 2025 at 07:40):

Regarding security, the only way a library would be able to read from a stack trace is if you pass it a Result as input (but it's still pure!). This is actually less invasive than calling backtrace!, because you can only see where the code has been since it became an Err, whereas backtrace! will give the code the full trace.

Kasper Møller Andersen (Feb 03 2025 at 07:42):

One thing that's not really clear to me here, though, is whether we would want to encourage people to use backtrace! instead of creating these ad hoc tracing structures in their errors.

Kasper Møller Andersen (Feb 03 2025 at 07:47):

I think that becomes important for the argument of the type complexity at least. Because the argument that the stack trace complicates Err is kind of fair, but also ignores a bunch of other complexity.

The type signature Err err is obviously simpler than Err StackTrace err of course, but that's also glossing over the complexity of err. Without StackTrace in there, users are asked to build that structure themselves and contain it in err (and most likely do a somewhat poor job with it). Introducing the StackTrace type is not about introducing new complexity. It's about taking complexity away from err, and by extension the user, and automating it.

Richard Feldman (Feb 03 2025 at 11:07):

Kasper Møller Andersen said:

Just to get it out of the way: my proposal does not mess with purity in any way Richard Feldman

Kasper Møller Andersen said:

So here's a revised proposal for how this might look!

Result looks like this:
Result ok err :
    [
        Ok ok,
        Err StackTrace err
    ]
where StackTrace is a nominal type that stores stack trace lines (we'll get back to how it does so later).

[...]

where there exists a function line_to_string that does this:
when line is
    MyRocProject__Config line -> Str.concat("MyRocProject/Config.roc:", Num.to_str line)
    ...
[...]

Err should have a function trace_to_string that converts the full stack trace to a human readable string.

[...]

And any time you use ? and whatever else we have to handle errors these days, the compiler basically desugars that to the same code as today, except it also inserts the corresponding StackTraceLine into the StackTrace.

the parts I just quoted mean:

I can take a "pure function" which returns a Result and call it from two different functions, passing the same arguments
I can then call trace_to_string on the returned Result
I will end up with two different strings, even though I passed the same arguments, because I called the two functions from two different places in my code base
pure functions always return the same values when given the same arguments, and now no function that returns Result has this property

what am I missing? :sweat_smile:

Richard Feldman (Feb 03 2025 at 11:13):

oh I guess it's that the trace only starts when you call fail for the first time?

in that case, you get way less info than with backtrace!() because you don't get to see which calls led to the error in the first place.

Richard Feldman (Feb 03 2025 at 11:14):

assuming that's correct, this still has the problem that now reorganizing pure functions can break them

Richard Feldman (Feb 03 2025 at 11:14):

like I can take working code, add a comment somewhere, and now the code breaks

Richard Feldman (Feb 03 2025 at 11:15):

because pure functions that return Result now incorporate their own source path and line numbers into their own return values
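
A minimal sketch of that concern, written in Python rather than Roc. The `Err` tuple and the `fail` helper are hypothetical stand-ins for what the compiler would insert, and the locations are passed by hand to imitate the same function before and after an unrelated comment shifts it down one line.

```python
from typing import NamedTuple, Tuple


class Err(NamedTuple):
    error: str
    trace: Tuple[str, ...]


def fail(error: str, location: str) -> Err:
    """Stand-in for compiler-inserted code: records where the Err was created."""
    return Err(error, (location,))


# Same logic twice; only the recorded line number differs (10 versus 11).
def parse_port_before(text: str):
    return int(text) if text.isdigit() else fail("NotANumber", "Config.roc:10")


def parse_port_after(text: str):
    return int(text) if text.isdigit() else fail("NotANumber", "Config.roc:11")


before = parse_port_before("abc")
after = parse_port_after("abc")
print(before.error == after.error)  # True: the logical error is the same
print(before == after)              # False: the values differ by location only
```

Any test or cache that compares the whole returned value would see these two results as different, even though no logic changed.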

Richard Feldman (Feb 03 2025 at 11:17):

Kasper Møller Andersen said:

The type signature Err err is obviously simpler than Err StackTrace err of course, but that's also glossing over the complexity of err. Without StackTrace in there, users are asked to build that structure themselves and contain it in err (and most likely do a somewhat poor job with it). Introducing the StackTrace type is not about introducing new complexity. It's about taking complexity away from err, and by extension the user, and automating it.

I'm just not convinced by the fundamental premise that accumulating and passing around stack traces is actually the right way to organize error handling code

Richard Feldman (Feb 03 2025 at 11:19):

like if an error happens in my webserver, and I want to log a stack trace to a reporting service, literally what I want is to call bugsnag.error!("This should never happen...") and have it capture a stack trace for me. This is what error logging services do in languages that support getting a stack trace anywhere.

Richard Feldman (Feb 03 2025 at 11:20):

I actively do not want to pass the stack trace around anywhere in that scenario

Richard Feldman (Feb 03 2025 at 11:20):

I just want bugsnag to put it in my logs and then I want to move on

Richard Feldman (Feb 03 2025 at 11:20):

I'm never going to do anything with it again

Richard Feldman (Feb 03 2025 at 11:24):

moreover, I'm often not going to return a Result

Richard Feldman (Feb 03 2025 at 11:25):

I'm going to try to gracefully handle the error for the end user, but I still want to have captured the trace of how I got to the point where gracefully recovering was necessary, so I can fix it next time and not have to recover

Richard Feldman (Feb 03 2025 at 11:28):

in contrast, other times I'm translating from one error type to another because I want to report it to the end user without a stack trace, since a stack trace would not help them

Richard Feldman (Feb 03 2025 at 11:28):

again, in that scenario I want to log that the problem happened, right where it happened (possibly just locally via log levels rather than to an external service, depending on what the program does) and then after that point I'm not going to use the trace ever again

Richard Feldman (Feb 03 2025 at 11:29):

so I think if Roc changed Result in this way, I would be inconvenienced by the type and performance costs and then never use the thing it's encouraging me to do

Anton (Feb 03 2025 at 11:39):

Richard Feldman said:

like if an error happens in my webserver, and I want to log a stack trace to a reporting service, literally what I want is to call bugsnag.error!("This should never happen...") and have it capture a stack trace for me. This is what error logging services do in languages that support getting a stack trace anywhere.

How would it get the stack trace in this case?

Richard Feldman (Feb 03 2025 at 11:51):

by calling backtrace!() - as the application author, I'd pass in both that function and the function to do an HTTP request, and it would store both of them

Richard Feldman (Feb 03 2025 at 11:51):

(on initialization, when I'm providing the API key - not every time)

Richard Feldman (Feb 03 2025 at 11:57):

one possible design:

Bugsnag.add_backtrace : Request, List TracedCall -> Request

Bugsnag.init : (Request => {}) -> Bugsnag

example:

Bugsnag.init(|req|
    _ =
        req
        .(Bugsnag.add_backtrace(backtrace!()))
        .(Bugsnag.add_api_key(key))
        .(Http.send!)

    {} # if logging fails, do nothing
)

Richard Feldman (Feb 03 2025 at 11:58):

nice properties of this design:

the bugsnag library never actually sees the API key, so it can't be compromised even if the package is compromised (maybe not critical for Bugsnag in particular, but seems like a good pattern for libraries with more sensitive API keys)
one arg to the init function, that's it.
backtraces in all error logs (the package can trim off the outermost fn call from the trace)
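
The shape of this design can be sketched in Python, with several assumptions: `Reporter` and `make_reporter` are hypothetical names, a list stands in for the HTTP request, and `traceback.format_stack()` stands in for `backtrace!()`. Python closures can be introspected, so this shows only the API shape, not a real security boundary.

```python
import traceback
from typing import Callable, Dict, List

Request = Dict[str, object]


class Reporter:
    """Hypothetical logging package: it only ever holds the `send` callback."""

    def __init__(self, send: Callable[[Request], None]) -> None:
        self._send = send

    def error(self, message: str) -> None:
        self._send({"message": message})


def make_reporter(api_key: str, outbox: List[Request]) -> Reporter:
    """Application code: the key and the backtrace are attached here."""

    def send(request: Request) -> None:
        enriched = dict(request)
        enriched["backtrace"] = traceback.format_stack()  # captured on demand
        enriched["api_key"] = api_key
        outbox.append(enriched)  # stands in for sending an HTTP request

    return Reporter(send)


outbox: List[Request] = []
reporter = make_reporter("secret-key", outbox)


def handle_request() -> None:
    reporter.error("This should never happen...")


handle_request()
```

The trace is captured once, at the moment of logging, and is never threaded through any return value.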

Brendan Hansknecht (Feb 03 2025 at 16:39):

Note, even if this doesn't break purity cause it accumulates one function at a time as a result is returned, it still breaks correctness in roc.

A pure roc function should return the same thing no matter the compilation mode or target. These back traces would depend on inlining, file location, and potentially target (due to different debug info on windows and wasm)

Kasper Møller Andersen (Feb 04 2025 at 19:20):

Richard Feldman said:

because pure functions that return Result now incorporate their own source path and line numbers into their own return values

It feels weird because Roc doesn't have other meta programming built in I suppose. If it was clearer that this actually takes the entire code base as input, then it would still be considered pure, because any code relying on the output would only "break" if the input was changed. Not that I think Roc should have more meta programming, it just still makes sense to call this a pure function to me :big_smile:

Brendan Hansknecht said:

Note, even if this doesn't break purity cause it accumulates one function at a time as a result is returned, it still breaks correctness in roc.

A pure roc function should return the same thing no matter the compilation mode or target. These back traces would depend on inlining, file location, and potentially target (due to different debug info on windows and wasm)

I don't get how this would break across targets? Wouldn't it work exactly the same as the current error mapping and propagation does with respect to inlining and so on? This wouldn't rely on debuginfo at all in my mind.

Kasper Møller Andersen (Feb 04 2025 at 19:39):

Richard Feldman said:

like if an error happens in my webserver, and I want to log a stack trace to a reporting service, literally what I want is to call bugsnag.error!("This should never happen...") and have it capture a stack trace for me. This is what error logging services do in languages that support getting a stack trace anywhere.

I think about it this way:

At some point I get an error, but it's not always easy to tell if you're at the point where the error "just happened". Maybe you called a function from a library that failed, so from your point of view, the error just occurred, but inside the library, the error may have occurred 20 layers away.
I may not have all the information I need to decide if I want to recover from an error or not right where it happened, and I may need to propagate the error out some layers before this can be decided. I only want to report the error to bugsnag if I determine that it is in fact not an error I can handle.

In this view, backtrace! and my proposal are actually complementary. backtrace! tells you how you reached function x, whereas the Result stack trace states what went on inside of x.

Kasper Møller Andersen (Feb 04 2025 at 20:24):

I think it would help me to break this down by error type here too.

The two overall categories of errors are predictable (Result) and unpredictable (crash essentially).

When talking about unpredictable errors, I think we concluded that those aren't recoverable in Roc unless the platform specifically gives the tools needed for that (spawn a new process that is allowed to crash for example).

Predictable errors on the other hand can be either recoverable or unrecoverable. That depends entirely on the error. And when I say "recoverable", I mean the error is entirely within the normal functioning of the system, and all ends up being good.

I assume the different approaches we're talking about here pertain to predictable unrecoverable errors. But I think I'm missing parts of the larger picture still.

Kasper Møller Andersen (Feb 04 2025 at 20:35):

In the thread about unpredictable errors with crashes and recoveries, I also believe we discussed that the platform would be the stability boundary, and if you had an unrecoverable error (even from a Result), it was better to crash and let the platform handle the error.

The reason I made this proposal is really to help capture what goes on before you decide to crash. That is, when you get a Result, you will presumably do some analysis and propagate it around a bit before you conclude that it is indeed unrecoverable and you decide to crash. But capturing the stack trace inside the Result from the first Err instance through to calling crash helps provide a fuller stack trace. Because the stack trace from crash is only the trace from when you finished your analysis, and not from where the original error actually occurred.

Kasper Møller Andersen (Feb 04 2025 at 20:35):

But maybe I got the wrong impression from that thread, and crash is no longer the preferred option for dealing with unrecoverable errors?

Kasper Møller Andersen (Feb 04 2025 at 20:43):

I know there's also a lot of moving parts here and that it's uncharted territory for Roc, so I'm not blaming anyone if it's all down to things having yet to settle :big_smile:

Brendan Hansknecht (Feb 04 2025 at 21:59):

Note, even if this doesn't break purity cause it accumulates one function at a time as a result is returned, it still breaks correctness in roc.

A pure roc function should return the same thing no matter the compilation mode or target. These back traces would depend on inlining, file location, and potentially target (due to different debug info on windows and wasm)

I don't get how this would break across targets? Wouldn't it work exactly the same as the current error mapping and propagation does with respect to inlining and so on? This wouldn't rely on debuginfo at all in my mind.

Let's not even think across targets. Let's just think solely across optimization levels. An optimized build can inline functions. This changes the stack trace. An optimized build can also remove debug info. That also changes the names in a stack trace.

Unless an effect occurs, roc will return the exact same result across targets and optimization levels. Adding stack traces to results without some sort of explicit backtrace! effect would break that.
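
To illustrate the inlining point only: the sketch below uses Python's runtime stack, with inlining done by hand, as a stand-in for what an optimizer might do. The function names are invented for the example, and it does not model a design where the compiler inserts source locations instead of walking the stack.

```python
import traceback
from typing import List, Tuple


def frame_names() -> List[str]:
    """Function names currently on the call stack, outermost first."""
    return [frame.name for frame in traceback.extract_stack()]


def double_it(x: int) -> Tuple[int, List[str]]:
    return x * 2, frame_names()


def compute_unoptimized(x: int) -> Tuple[int, List[str]]:
    # Separate call: `double_it` shows up as its own frame.
    return double_it(x)


def compute_inlined(x: int) -> Tuple[int, List[str]]:
    # Same logic after inlining `double_it` by hand: that frame is gone.
    return x * 2, frame_names()


value_a, trace_a = compute_unoptimized(21)
value_b, trace_b = compute_inlined(21)
```

Both calls compute the same number, but `trace_a` contains a `double_it` frame and `trace_b` does not, so a value that embeds the trace would differ between the two builds.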

Joshua Warner (Feb 12 2025 at 16:05):

What if there was a syntax that generated a unique tag that the roc toolchain/debugger/etc could easily map back to a specific source location? Something like $MyTagName, with some tooling/library functions that could extract the name and source location it was instantiated from.

You'd use it like so:

File.readUtf8!(my_path) ? $ErrorReadingConfig

If that read fails you get back something that'd print as Err($ErrorReadingConfig:1234(FileNotFound)). The actual representation would be a guaranteed-unique tag value, even unique across callsites where the same $tag name is used.

And you could match on that (relatively) normally, like so:

match err {
  $ErrorReadingConfig(FileNotFound) -> Stdout.line!("Config file ${config_path} was not found")
  # etc
}

That would use a global table to grab the "canonical" tag id and match on that (mapping potentially several source locations where $ErrorReadingConfig is constructed, down to a single value suitable for matching).

You'd have something like Tag.source_location(err), which would use a similar global table to map back to the file name and line/col number and return that.

You'd need some utilities that could take a Roc binary and a string "backtrace" like the above, and map that back to source locations. Or perhaps that just gets compiled into the binary.

That gets you 70% of the way to full backtraces and encourages wrapping errors with useful, match-able context.
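
A small Python model of the two table lookups, offered as a sketch rather than a design. Everything here is assumed for illustration: the `SITES` table, the second id 1235, and the file locations are invented; only the tag name and the id 1234 echo the example above.

```python
from typing import Dict, NamedTuple


class Site(NamedTuple):
    name: str      # canonical tag name, shared by all sites using that $tag
    location: str  # where this particular tag was constructed


# Hypothetical table emitted by the toolchain: one unique id per construction site.
SITES: Dict[int, Site] = {
    1234: Site("ErrorReadingConfig", "Main.roc:12:5"),
    1235: Site("ErrorReadingConfig", "Reload.roc:40:9"),
}


class Tagged(NamedTuple):
    site_id: int  # the guaranteed-unique runtime value
    payload: str


def canonical_name(err: Tagged) -> str:
    return SITES[err.site_id].name


def source_location(err: Tagged) -> str:
    return SITES[err.site_id].location


def describe(err: Tagged) -> str:
    # Matching goes through the canonical name, so both sites hit this branch.
    if canonical_name(err) == "ErrorReadingConfig" and err.payload == "FileNotFound":
        return "Config file was not found"
    return "Unknown error"


startup_err = Tagged(1234, "FileNotFound")
reload_err = Tagged(1235, "FileNotFound")
```

The two errors match the same branch in `describe` yet report different source locations, which is the split between matching and tracing that the proposal relies on.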

Brendan Hansknecht (Feb 12 2025 at 16:19):

What enables this mapping from source location to backtraces? Is it simply recursive with tons of tag wrapping?

Joshua Warner (Feb 12 2025 at 16:22):

With this, the back trace -is- the ‘inspect’ output. You’d have another tool that maps the generated ids parsed from that string back into source locations that are more human readable.

Brendan Hansknecht (Feb 12 2025 at 16:26):

Generally stack traces represent recursion and mutual recursion, but tags don't.

Brendan Hansknecht (Feb 12 2025 at 16:29):

If you make your stacktrace represent recursion or mutual recursion, you are back to allocating a ton for an error that likely will just get thrown away and handled.

On top of that, you still lose information if some chunk of code doesn't opt into this form of error. I know one important point called out above is that you may want a stack trace from a library, but this would be opt-in at a perf cost, so libraries probably wouldn't opt in.

I'm not sure this would turn out well in practice, but maybe.

Joshua Warner (Feb 12 2025 at 16:30):

Yeah, you’d have to be careful with recursive cases. I’d probably just not add tags there.

Joshua Warner (Feb 12 2025 at 16:32):

The perf cost should be very low, if the backtrace isn’t inspected.

Brendan Hansknecht (Feb 12 2025 at 16:32):

Even for flat tags, you likely would devour stack space. Assuming no recursion, but a large branching tree of functions, the top level main has to allocate stack space for a nested tag that can represent any possible call chain

Brendan Hansknecht (Feb 12 2025 at 16:32):

And the next function down the same but with one less wrapping and same again

Brendan Hansknecht (Feb 12 2025 at 16:32):

So you would get gigantic error result payloads

Joshua Warner (Feb 12 2025 at 16:33):

Hence why this would be judicious and manually added

Joshua Warner (Feb 12 2025 at 16:33):

Don’t want that to happen automatically for all calls, just semantically important ones

Joshua Warner (Feb 12 2025 at 16:35):

This solves the problem of getting “just enough” source location info back to understand errors from production.

Joshua Warner (Feb 12 2025 at 16:36):

And it’s less annoying than manually choosing unique tag names

Brendan Hansknecht (Feb 12 2025 at 16:49):

Possibly. It still blocks any library introspection (without author opt-in), but it is definitely something.

Last updated: Jul 23 2026 at 13:15 UTC