Stream: ideas

Topic: Dealing with logic errors in Roc


view this post on Zulip Kasper Møller Andersen (Oct 02 2024 at 11:52):

In the world of software, I think it's fair to divide the kinds of errors we need to deal with into two camps:

Predictable errors are where we model things with Result primarily, because we know all the error conditions ahead of time, roughly.

But unpredictable errors are the things where we don't know what's going to happen. Strictly speaking, that also includes things like segfaults, but I primarily want to view the unpredictable errors as being logic errors here. Cases where Roc encourages using crash today if you ever reach an impossible branch for example.

There are a couple of important facets of logic errors as they are handled, not only in Roc, but in other languages too:

crash and expect provide this experience in Roc, because they allow developers to be quickly notified that something is wrong during development, and we don't have to bend the type system to model something that shouldn't actually be happening in the first place.

However, I think there are a couple of issues that would be useful to address:

To be clear, I think there is a real need for something like crash and expect. They are important for making it easy to spot logic errors during development, because they are very loud. But if I'm developing some web service, I might be serving many REST endpoints from the same service. If one of those endpoints calls a function that happens to crash, then I don't want it to bring down my entire service. At the very least, I just want to make the single endpoint unavailable while I investigate, so the rest of the service can remain live. So even if I can't do anything useful with the logic error itself, I can still offer a degraded service that remains useful.

So my first question is: should there be a way to catch a crash, a la catch_unwind in Rust?

One of my problems with catch_unwind is that it's very general, and you can really put it anywhere. But for logic errors specifically, I think there's usually a clear boundary of where you want it:

Maybe there's a way to (optionally) lint that you are remembering to call catch in only these places?

My next question though, is whether the behaviors in crash and expect are even the right primitives to begin with? Maybe there are better tools waiting to be found here?

One small evolution might be to make crash and expect parametrized over the behavior we expect them to trigger. For example, what if module params were used to pass in a "crasher" that could specialise how they would crash and define any guards where the crash would be caught? I don't really know if this would be feasible or even sensible, but I want to float the idea at least.

Or maybe we could define something more targeted at dealing with logic errors, e.g. LogicErrors.detectedError : Severity, Str -> .... This function would allow you to report that you hit an impossible error case to the platform, and then the platform could decide what to do. For example, if you pass in a severity of Fatal, it might crash as today. Or it might decide to block access to the specific REST endpoint where this error was detected, and automatically return a 503 error on it, until it can be manually unblocked.
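
To make the detectedError idea above concrete, here is a minimal sketch in Python rather than Roc. Everything in it is hypothetical: the `Platform` class, the `detected_error`, `handle`, and `unblock` names, and the two severity levels are illustrative assumptions, not an existing Roc or platform API. It only shows the proposed behavior: a fatal report crashes as today, while a lesser report blocks one endpoint with a 503 until it is manually unblocked.

```python
from enum import Enum
from typing import Callable


class Severity(Enum):
    DEGRADED = "degraded"  # hypothetical level: keep serving, block the endpoint
    FATAL = "fatal"        # hypothetical level: crash as today


class FatalLogicError(Exception):
    """Stand-in for the platform deciding to crash."""


class Platform:
    """Hypothetical platform that decides what a reported logic error means."""

    def __init__(self) -> None:
        self.blocked: set[str] = set()

    def detected_error(self, endpoint: str, severity: Severity, message: str) -> None:
        if severity is Severity.FATAL:
            raise FatalLogicError(message)
        # Anything less than fatal: take only this endpoint out of service.
        self.blocked.add(endpoint)

    def unblock(self, endpoint: str) -> None:
        # Manual step once the problem has been investigated.
        self.blocked.discard(endpoint)

    def handle(self, endpoint: str, handler: Callable[[], int]) -> int:
        if endpoint in self.blocked:
            return 503
        return handler()
```

With this shape, a report against one endpoint leaves every other endpoint untouched, which is the degraded-service outcome described earlier in the post.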

view this post on Zulip Hannes (Oct 02 2024 at 12:16):

One thing to keep in mind, I'm pretty sure platforms can choose how to handle crashes, e.g. a web server can choose to let each request crash separately
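
As a rough illustration of this point, the sketch below shows a host-side dispatcher that contains a failure to the request that caused it. It is written in Python as an analogy only: a Python exception stands in for a crash reaching the host, and the `dispatch` function and handler table are hypothetical names, not part of any Roc platform.

```python
from typing import Callable


def dispatch(handlers: dict[str, Callable[[], str]], path: str) -> tuple[int, str]:
    """Run one request; a failure in its handler affects only this request."""
    handler = handlers.get(path)
    if handler is None:
        return 404, "not found"
    try:
        return 200, handler()
    except Exception as err:
        # Assumption: the host can intercept the failure at this boundary.
        # The application code itself never sees or catches it.
        return 500, f"internal error: {type(err).__name__}"
```

The boundary lives entirely in the host here, so application handlers contain no recovery code of their own.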

view this post on Zulip Richard Feldman (Oct 02 2024 at 13:44):

yeah I think this is the answer!

view this post on Zulip Richard Feldman (Oct 02 2024 at 13:52):

I definitely think we should never add anything to the language that applications or packages could use to recover from crash

view this post on Zulip Richard Feldman (Oct 02 2024 at 13:52):

but platform authors already can, and absolutely should!

view this post on Zulip Kasper Møller Andersen (Oct 02 2024 at 14:14):

To be clear, let’s not get hung up on web servers. But you’re saying that the platform is responsible for insulating the software from crashes?

As another example, let’s say I’m building a game, and there’s a crash in my animation subsystem. If the platform is running animations isolated from the rest of the code, then it can recover by just disabling animations for example. But if the animation is just running with the rest of the game loop, can the platform do that?

view this post on Zulip Richard Feldman (Oct 02 2024 at 14:44):

I think the best way to think about this is in terms of errors, e.g. mistakes

view this post on Zulip Richard Feldman (Oct 02 2024 at 14:45):

"if there is a mistake in this part of the code, what's the worst that can happen? how can we insulate against that to make it less bad if it happens? what's the cost of that insulation?"

view this post on Zulip Richard Feldman (Oct 02 2024 at 14:45):

forget about crash

view this post on Zulip Kasper Møller Andersen (Oct 02 2024 at 17:11):

Sure, and nobody sets out to overflow the stack or run out of memory. Before I go on, I would like to better understand why you don’t want developers to write this insulation code in Roc though?

view this post on Zulip Richard Feldman (Oct 02 2024 at 17:29):

if you can recover from crash in Roc code, then it has become throw and we have added try/catch, and it will become used for recoverable error handling

view this post on Zulip Kasper Møller Andersen (Oct 06 2024 at 11:04):

Is that a problem in Rust, since it has that exact setup? It's also something that can be designed in a few different ways in my mind, to discourage such use if that's really a worry.

To better motivate why I think it makes sense, I think software in general has really awful user-facing errors. Part of that, I think, is that programming languages and frameworks make it very easy to "create" an error (crash, Err, etc.), but then leave it up to developers to make it good for users.

The reason I wanted to single out logic errors, is because Roc has effectively given the tooling to create errors, with no chance for developers to create a good recovery story for users, outside of modifying the platform directly.

It's of course alright to say that these logic errors should be rare enough that it's not something Roc developers are actually exposed to dealing with. But having a way to trigger errors without a way of creating a user friendly recovery story just makes me uneasy.

view this post on Zulip Richard Feldman (Oct 06 2024 at 11:15):

Kasper Møller Andersen said:

Is that a problem in Rust, since it has that exact setup?

I wouldn't say Rust has that exact setup - if you look at the docs for panic recovery they say things like "this doesn't actually catch all panics, and also here are a bunch of things to be careful of if you use this with other language features"

view this post on Zulip Richard Feldman (Oct 06 2024 at 11:19):

Kasper Møller Andersen said:

The reason I wanted to single out logic errors, is because Roc has effectively given the tooling to create errors, with no chance for developers to create a good recovery story for users

right, but if you read "logic error" as "logic mistake" then the feature request becomes "Roc needs a way to correct arbitrary mistakes in other people's code without modifying that code"

view this post on Zulip Richard Feldman (Oct 06 2024 at 11:21):

Roc originally did not have a crash keyword; if you made a mistake that crashed the program, it was probably because you ran it out of memory or tried to do integer division by zero

view this post on Zulip Richard Feldman (Oct 06 2024 at 11:22):

and actually we tried having division return Result, but that didn't go well

view this post on Zulip Richard Feldman (Oct 06 2024 at 11:24):

so I think if there's a case to be made for a feature like this, it has to be made without reference to crash

view this post on Zulip Richard Feldman (Oct 06 2024 at 11:26):

for example, something like "if I call library code and it overflows the stack, or the heap, or goes into an infinite loop, or does integer division by zero, I want my application code to be able to defensively write code which recovers from that possibility, and spawning a new process to run that code is not a sufficient solution"

view this post on Zulip Kasper Møller Andersen (Oct 06 2024 at 11:50):

Richard Feldman said:

right, but if you read "logic error" as "logic mistake" then the feature request becomes "Roc needs a way to correct arbitrary mistakes in other people's code without modifying that code"

I think there's a gap between the two that doesn't need to be crossed though. I'm not interested in "fixing" the error, but it's still quite important that:

Going back to the example of a game where the animation system crashes, you probably want something like the following to happen:

Those are the kinds of scenarios I think are worth being able to deal with, which right now, I would need to fork the platform to do (as I understand it).

view this post on Zulip Richard Feldman (Oct 06 2024 at 12:19):

Kasper Møller Andersen said:

Going back to the example of a game where the animation system crashes

why did it crash?

view this post on Zulip Richard Feldman (Oct 06 2024 at 12:20):

(e.g. did it run out of stack space? heap space? integer division by zero?)

view this post on Zulip Richard Feldman (Oct 06 2024 at 12:21):

I think maybe another way to frame this is: for the purposes of this discussion, let's assume that crash has been removed from the language. There is no longer a keyword for that, or any equivalent of it.

view this post on Zulip Richard Feldman (Oct 06 2024 at 12:22):

in that world, what's the specific scenario we're trying to address?

view this post on Zulip Richard Feldman (Oct 06 2024 at 12:22):

the animation system did not do a crash because that doesn't exist. So what did it do that we're trying to recover from?

view this post on Zulip Kasper Møller Andersen (Oct 06 2024 at 14:21):

I guess it doesn’t actually matter. What’s important to me is the experience of recovering from something deemed unrecoverable. I would expect it to be the same in all those cases.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 15:32):

with no chance for developers to create a good recovery story for users, outside of modifying the platform directly.

I think it is important to remember that platforms are not set in stone. If you are working on a game, you definitely own the platform. So you can modify it to crash however you like. Even if you are working on basic-cli, you can fork it and change how panics are handled.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 15:34):

I do agree that crash is scary. There is a reason that I push for essentially no one ever using it (especially in libraries) unless they somehow hit an unreachable state. I think the world before crash was much worse than the world we have now with crash. At some point there are going to be:

  1. libraries with unreachable states (like a dictionary loading from the underlying list which will never fail)
  2. application authors writing quick scripts who just want to throw away errors because they don't care.

Without crash, hacks have to be used that generate terrible error messages in order to cause a crash in an unreachable state: 255u8 + 1.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 15:37):

All this said, I think crash should be used as sparingly as possible. If crash becomes common in libraries and actually gets hit semi-often, I would hope the community as a whole would put pressure on those libraries to remove crash, or switch to different libraries. If that fails, I think that Roc has a major issue. That would be proof to me that adding crash to Roc was probably a mistake.

In other words, in my view (as one of the main people that pushed for crash existing), feeling like you need to catch crashes would mean adding crash was a mistake. It would not be a reason to add some form of catch.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 15:48):

As a point of comparison, catch_unwind in rust.

The general advice is to only use it at the ffi (eg rust plugins for godot) or stability boundary (eg stop web request from killing webserver). Both of these exist in the platform layer for roc.

view this post on Zulip Richard Feldman (Oct 06 2024 at 16:36):

Kasper Møller Andersen said:

I guess it doesn’t actually matter. What’s important to me is the experience of recovering from something deemed unrecoverable. I would expect it to be the same in all those cases.

interesting - so in other languages, do you use things like try/catch around library calls in case they stack overflow, or heap overflow?

view this post on Zulip Richard Feldman (Oct 06 2024 at 16:36):

I don't, but maybe others do! :big_smile:

view this post on Zulip Kasper Møller Andersen (Oct 06 2024 at 18:09):

Brendan Hansknecht said:

I think it is important to remember that platforms are not set in stone. If you are working on a game, you definitely own the platform. So you can modify it to crash however you like. Even if you are working on basic-cli, you can fork it and change how panics are handled.

This may also just be somewhere where my mental model is off. With what I know about platforms today, it seems like they are responsible for a whole heap of things that I don't want to take responsibility for necessarily (like building good stack traces). So my intuition around forking a platform is that it's analogous to forking the JVM because you want different crash handling in your Java code. I know the JVM is a much bigger beast than a Roc platform, but in my mind, they fill a similar space.

view this post on Zulip Kasper Møller Andersen (Oct 06 2024 at 19:03):

I think the term "stability boundary" describes well what I feel is kind of missing.

Richard Feldman said:

interesting - so in other languages, do you use things like try/catch around library calls in case they stack overflow, or heap overflow?

Not for those kinds of crashes, but I've definitely seen Scala and Java code with extra catches in it to catch various runtime exceptions (which are basically the Java equivalents of crash). Part of that is also just Java having poor standards around what kinds of exceptions should actually be runtime exceptions rather than checked exceptions though.

As @Brendan Hansknecht was saying, if crash becomes a common thing in libraries, it can potentially make for a poor experience of using Roc libraries. But the way to minimize the problem is to encourage people to model their code in such a way that explicit crashing is not needed. But I'm not confident that people can be relied on enough to do so. Going back to Rust, it's basically the argument for social pressure on libraries to avoid unsafe code. This has kind of worked, but also had times where people came under a lot of pressure to get rid of their unsafe usage (i.e. the Actix debacle).

The thing that came to the fore there was that:

So I guess my position is that I really want to keep crash around, but I'm also looking for tooling to help me define and maintain my stability boundaries. Even in a web server where a request causes a crash, what if that means I've written some data into the database but not completed the rest of the required work? Is all the work being done for that request being seen as one atomic transaction that will get rolled back, or do I need to do something myself in case of a crash?

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:12):

Kasper Møller Andersen said:

As Brendan Hansknecht was saying, if crash becomes a common thing in libraries, it can potentially make for a poor experience of using Roc libraries. But the way to minimize the problem is to encourage people to model their code in such a way that explicit crashing is not needed. But I'm not confident that people can be relied on enough to do so.

just to be totally honest, I think if we get to the point where crash is overused in libraries because library authors don't want to handle errors via Result (or otherwise), my ordering of preferences would put "go back to not having crash again" as preferable to "add a way to recover from crash"

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:14):

basically, there is always going to be some point where library authors can make mistakes

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:14):

and I don't think we should try to have a language feature that corrects arbitrary mistakes

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:14):

in third-party code

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:15):

a mistake in third-party code is going to cause a degraded user experience no matter what

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:15):

maybe that's because a calculation is incorrect

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:17):

for some category of mistakes, it's possible to recover from them gracefully

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:18):

for example, you can recover from an accidental infinite loop by running that code on a different thread, timing it, and killing the thread if it runs longer than some configured timeout

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:19):

this has downsides such as performance and code complexity, but it can result in a less bad user experience (an error message) than an infinite loop
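
A minimal sketch of the timeout approach described above, with one technique swapped in: it uses a child process instead of a thread, because Python threads cannot be killed from outside. The `run_with_timeout` name and its return strings are hypothetical. It relies on `subprocess.run`, which kills the child and raises `TimeoutExpired` when the timeout elapses.

```python
import subprocess
import sys


def run_with_timeout(source: str, timeout_s: float) -> str:
    """Run untrusted Python source in a child process with a time limit."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if this elapses
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    if done.returncode != 0:
        return f"error: exited with status {done.returncode}"
    return done.stdout.strip()
```

The cost mentioned above is visible even in this small form: every call pays for a process launch, and the caller has to turn a timeout into some kind of error message for the user.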

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:19):

so we could totally do that, but I don't think it's worth it

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:19):

I do think there is an important point brought up in terms of stability boundary.

If in basic webserver I start a transaction and then a crash happens, I believe it will completely clog up sqlite. So a user that starts a transaction and then runs 3rd party code would be left in a fundamentally broken state if a crash happens. They have no way to fix this other than to avoid calling any code that might crash after a transaction is started on a thread.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:20):

the platform can already solve that

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:20):

because it knows about transactions

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:20):

No it doesn't

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:21):

It just knows about statements (which would all clean up but that transaction would still be open)

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:21):

It also definitely wouldn't know about a transaction to something like postgres over TCP.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:22):

In sqlite, one statement requests beginning the transaction. Then you run n statements in the transaction. Then you run a final statement to close the transaction.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:22):

So even if you clean up all statements you may still be in the middle of a transaction.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:30):

oh you're assuming the platform isn't aware of SQLite and is instead just offering a socket primitive

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:30):

I see

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:30):

So for sqlite, I am just noting the state today in basic webserver. For postgres, I am trying to point out the general issue.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:31):

so for that issue, let's frame it in terms of a stack overflow

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:32):

so an individual request handler overflows the stack and we want to close its SQLite transactions, right?

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:32):

I don't think that is reasonable because in many systems, stack overflows always crash the program. Like completely crash it.

So I would prefer to keep it in terms of something that might happen but would generally recover with a 500. So numeric overflow or any sort of exception otherwise.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:33):

Or division by zero

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:34):

division by zero we can choose to handle in other ways, e.g. by having it return 0 like pony does

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:34):

I'd specifically like to focus on stack overflow because it is unquestionably unrecoverable

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:35):

That's exactly why it isn't useful to talk about

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:35):

You want something in the grey area. How about integer overflow, which we want to crash on due to all of the correctness implications?

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:36):

Like on stack overflow, we aren't even going to 500, the entire webserver is probably going to crash.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:36):

ok so I think this actually gets to the fundamental difference in perspective here

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:37):

to me, the only cases we are talking about are situations like stack overflows where if it happens, we have decided it should be game over and there's no recovering from it

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:38):

Stack overflow doesn't even call roc_panic though. It literally doesn't hit the path we are discussing

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:38):

I cannot overstate how important that distinction is to me in this discussion

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:39):

yes, exactly!

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:39):

that is exactly why I want to focus on it

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:40):

I don't follow.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:43):

if people are misusing crash, then to me the default best response to that is to try to educate people about what it's for.

if that doesn't work, then the next best response is to remove crash from the language so that if you're writing a library and hit a "this should never happen" situation, then your only option is to use one of the hacks we used to use (e.g. inducing a stack overflow) to address it, at which point it at least becomes obvious that you should never use that technique for error handling

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:44):

the reason I think it's useful to talk about how to handle SQLite transactions in the presence of a stack overflow is that it's not a situation where either of those solutions would help

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:44):

because the problem isn't that people have misused crash

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:45):

and we also couldn't fix it by changing how integer overflow or division by zero works

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:48):

I think there are 3 states:

  1. returns result: user can do cleanup
  2. completely crashes the app (oom, stack overflow): due to full reset and all connections getting killed, everything gets cleaned up. This should get reported, but the service will be auto restarted and load balancing will send to other servers, so this is all fine.
  3. we crash through roc_panic and partially recover. If we can't do cleanup this may leave the app in a continually partially failing state.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:49):

So personally, I don't care much about 1 or 2. 3 is the important problematic case.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:54):

As a concrete case of what could happen in basic webserver. We can even pretend crash is gone:

  1. I get a request that turns out to be invalid, but I miss that
  2. I start an sqlite transaction
  3. I get an integer overflow when calculating something due to missed validation on the initial request.
  4. roc_panic on overflow which cleans up all statements but doesn't know anything about the open transaction.
  5. From that point forward all threads are blocked waiting on the still open transaction to finish. So we either get a hang, or more likely, all threads fail anytime they interact with sqlite and the user keeps returning 500s.
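
The steps above can be approximated with Python's standard `sqlite3` module. This is a sketch under assumptions, not a reproduction of basic webserver: an `OverflowError` stands in for the overflow crash, two connections stand in for two request threads, and the `ledger` table and function names are made up. It shows that a transaction left open by a failed handler makes another connection's write attempt fail until someone rolls it back.

```python
import os
import sqlite3
import tempfile


def open_conn(path: str) -> sqlite3.Connection:
    # isolation_level=None: the module issues no implicit BEGIN or COMMIT,
    # so transactions are controlled only by the statements we run.
    # timeout=0: fail immediately instead of waiting for a lock.
    return sqlite3.connect(path, timeout=0, isolation_level=None)


def handle_request(conn: sqlite3.Connection, amount: int) -> None:
    conn.execute("BEGIN IMMEDIATE")  # one statement opens the transaction
    conn.execute("INSERT INTO ledger (amount) VALUES (?)", (amount,))
    if amount > 255:
        # Stand-in for an integer overflow that was not validated earlier.
        raise OverflowError("amount does not fit in a u8")
    conn.execute("COMMIT")  # never reached when the handler fails


def demo() -> list[str]:
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.db")
        crashed = open_conn(path)
        other = open_conn(path)
        crashed.execute("CREATE TABLE ledger (amount INTEGER)")
        try:
            handle_request(crashed, 256)
        except OverflowError:
            events.append("handler crashed")
        events.append(f"transaction still open: {crashed.in_transaction}")
        try:
            other.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as err:
            events.append(f"other request failed: {err}")
        crashed.execute("ROLLBACK")  # the cleanup nobody ran
        other.execute("BEGIN IMMEDIATE")
        other.execute("COMMIT")
        events.append("after rollback: other request succeeded")
        crashed.close()
        other.close()
    return events
```

Calling `demo()` returns the sequence of events. In this sketch only write attempts are blocked, so it models the "all threads fail" outcome for requests that need to write.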

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:54):

well state 2 doesn't have to kill the entire root process - the host can install a signal handler, unwind, and return 500

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:55):

and that's true for both stack overflow and running out of heap memory

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:55):

Sure, then 2 falls into 3 and it matters, but it specifically only matters if the host might recover in a potentially invalid state.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:56):

yeah exactly

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:57):

Personally, I would like to pin to the concrete example. I think it is what would lead to users wanting some form of catch

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 19:57):

Or at least some form of errdefer or finally to clean up in the case of an exception.

view this post on Zulip Richard Feldman (Oct 06 2024 at 19:59):

so how would this specific case be implemented in the compiler?

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:00):

C++ exception style?

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 20:04):

I'm just trying to point it out as a motivating example. One solution would be c++ style exceptions then enable finally to clean up.

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:05):

this specific case being (just to make sure we're on the same page):

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:05):

that's the scenario, right?

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 20:05):

Or it could even be done without C++ exceptions, before calling roc_panic. Like register functions to call to clean up before calling roc_panic. I just assume a platform won't be able to deal with all of these cases, but maybe that is wrong. It at least is worth thinking about critically

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 20:06):

we want the outcome to be that the request handler unwinds, responds with a 500, and the transaction is closed and nothing is leaked

That would be optimal. Also, the platform may know about sqlite, but sqlite transactions are built into standard statements. So the platform may have zero control over transactions (state of current basic webserver)

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:09):

ok so the way I'm approaching this is "I want to avoid having exception handling semantics in Roc applications, so how can we solve this in a way that gets to a good outcome in this scenario without doing that?"

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:09):

one thing that comes to mind is that it seems like the best experience if the author of the SQLite library can solve this

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:10):

so application authors don't need to code defensively for this scenario

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 20:10):

Yep

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:13):

one idea for how that could work: have the platform expose a function that accepts a boxed "cleanup function" and returns a token value that works like a file descriptor

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:14):

in terms of how the host tracks it

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 20:15):

We could even make roc_panic take a list of cleanup closures that the platform can use if they want. Add some sort of Task.finally that just adds onto the list. Not sure when exactly it would be cleared out though. But yeah. Something like that which gives platform cleanup control should work

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:16):

so then the SQLite author can make sure that token has the same lifetime as the transaction normally, so when it gets deallocated it closes the transaction

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:17):

and if the request handler stack overflows, the host has a list of tokens that never got resolved, and can run them all before 500ing

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:18):

I think that can be implemented today with no implementation changes needed

view this post on Zulip Richard Feldman (Oct 06 2024 at 20:18):

in the same way that file descriptors can be
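
A rough sketch of the cleanup-token idea from the last few messages, in Python rather than Roc. All names (`RequestScope`, `register_cleanup`, `resolve`, `run_pending`, `serve_request`) are hypothetical, and a Python exception stands in for the crash. One assumption is made about the normal path: the library finishes its own transaction and then resolves the token, so the registered cleanup only runs when the handler fails.

```python
import itertools
from typing import Callable


class RequestScope:
    """Host-side registry of pending cleanup functions for one request."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: dict[int, Callable[[], None]] = {}

    def register_cleanup(self, cleanup: Callable[[], None]) -> int:
        # The token is an opaque integer, much like a file descriptor.
        token = next(self._ids)
        self._pending[token] = cleanup
        return token

    def resolve(self, token: int) -> None:
        # The work finished normally, so its cleanup is no longer needed.
        self._pending.pop(token, None)

    def run_pending(self) -> None:
        # Run unresolved cleanups, newest first, removing each as it runs.
        for token in sorted(self._pending, reverse=True):
            self._pending.pop(token)()


def serve_request(handler: Callable[[RequestScope], None]) -> int:
    scope = RequestScope()
    try:
        handler(scope)
        return 200
    except Exception:
        # Stand-in for the host regaining control after a crash.
        scope.run_pending()
        return 500
```

A database library could register a rollback when it opens a transaction and resolve the token after committing, so application authors would not need to write defensive code for this case.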

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 20:19):

Ah, I see. Yeah. I think that works.

view this post on Zulip Kasper Møller Andersen (Oct 06 2024 at 20:51):

Through all this, I’ve had a hard time figuring out what users are expected to do with their platforms. It feels like there’s two different ways of looking at platforms, which I find hard to reconcile:

  1. Platforms are sort of complicated, handling a lot of general Roc logic (like unwinding, creating stack traces, etc.)
  2. Platforms can also handle logic which is very application specific, like defining your stability boundaries. Users are encouraged to own their platform for this if they want it changed.

To me, this sounds like there will be an explosion of very specific, and not necessarily well maintained, platforms a la basic-webserver-mssql-windowsserver2022, basic-webserver-postgres-ubuntu, etc.
This doesn’t feel like a healthy place to end up in though. It feels like any single platform consists of a number of choices, but an application author can’t easily compose those choices without forking a complete platform, and handling the full responsibility this entails.

I think a healthier position would be for some “base” platforms to be incredibly small and stable, to provide the raw primitives that a full platform needs (reusable Roc setup, IO, etc), and then have another layer on top where you define the interface and behavior that you want to work against. In other words, you may have a “base” ubuntu platform, which exposes the capabilities you get in Ubuntu, and then someone would build a basic-cli API on top of that. Does that make sense, or am I misunderstanding platforms here?

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 22:14):

you may have a “base” ubuntu platform

This could be done but would probably be a terrible experience and wouldn't be worth trying to do. If we need an ubuntu platform, the platform concept has probably failed. That sounds like a generic standard library with all of the io primitives.

platforms a la basic-webserver-mssql-windowsserver2022

These probably will exist for certain power users, but not for the average user. That said, forking an existing platform and slightly modifying it may be reasonably common. Not some super fork like integrating an entire mssql library into the platform, but a tiny fork that may modify the panic handler to send requests to your logging server.


I think that long term, most platforms will be in the vein of basic-webserver in terms of complexity. They won't be trying to own the world, but they will be trying to have a robust set of primitives (for a webserver, these primitives are most likely for file io and for socket io). Once you have those two, you can do essentially everything a webserver would want to do. The postgres library can be built right in roc.

view this post on Zulip Brendan Hansknecht (Oct 06 2024 at 22:16):

The last part of what richard and I were talking about is how a platform like basic-webserver could offer the ability to run cleanup code after a panic. This would enable a postgres library to hook in and request for transactions to be rolled back on panic. It would enable more control from the end user in terms of adding a finally to any crash. This is all within a single platform.
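
To make the idea concrete, here is a minimal host-side sketch in Rust of a per-request cleanup hook that only runs when the handler crashes. `CrashHooks`, `on_crash` and `run_request` are made-up names for illustration, not part of basic-webserver or any existing platform, and a real host would be reacting to a Roc crash rather than a Rust panic.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Cleanup callbacks registered while a single request is being handled.
/// (Hypothetical type, for illustration only.)
#[derive(Default)]
pub struct CrashHooks {
    hooks: Vec<Box<dyn FnOnce()>>,
}

impl CrashHooks {
    /// Register work that should run only if the handler crashes.
    pub fn on_crash(&mut self, hook: impl FnOnce() + 'static) {
        self.hooks.push(Box::new(hook));
    }
}

/// Run one request handler. If it panics, run the registered hooks in
/// reverse registration order and return the crash message as an error.
pub fn run_request<T>(handler: impl FnOnce(&mut CrashHooks) -> T) -> Result<T, String> {
    let mut hooks = CrashHooks::default();
    let outcome = catch_unwind(AssertUnwindSafe(|| handler(&mut hooks)));
    match outcome {
        // Normal completion: the hooks are dropped without being called.
        Ok(value) => Ok(value),
        Err(payload) => {
            for hook in hooks.hooks.drain(..).rev() {
                hook();
            }
            // Panic payloads are usually a `&str` or a `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| (*s).to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown crash".to_string());
            Err(message)
        }
    }
}
```

Under these assumptions, a Postgres library could call `hooks.on_crash(...)` when it opens a transaction and issue the rollback from inside the hook. The sketch relies on panics unwinding; with `panic = "abort"` nothing is caught.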

view this post on Zulip Richard Feldman (Oct 06 2024 at 22:40):

yeah, to elaborate, here are some scenarios I imagine happening long-term:

view this post on Zulip Richard Feldman (Oct 06 2024 at 22:41):

I wouldn't expect many platforms to be operating system specific (although it's always possible) outside of maybe embedded systems

view this post on Zulip Richard Feldman (Oct 06 2024 at 22:41):

so I don't think it would be basic-webserver-postgres-ubuntu, but I could imagine a postgres-webserver platform existing

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:22):

Kasper Møller Andersen said:

Platforms can also handle logic which is very application specific, like defining your stability boundaries. Users are encouraged to own their platform for this if they want it changed.

just to clarify on this part, I don't actually think this is the right way to think about it

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:22):

for example, let's say I am an application author and I spawn an OS subprocess and run some logic in there

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:23):

that's an example of creating a very hard boundary where anything can go wrong and I can definitely recover from it, and the platform doesn't need to be customized to do that

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:28):

however, the platform is in charge of whether to offer primitives for spawning processes

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:52):

maybe a bit of relevant context I should have started with: in Elm, you can't publish a package with Elm's equivalent of a crash in it

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:52):

so in Elm, if you really want to publish a package that has an "unreachable" state in it, you really do have to do one of the hacks

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:54):

Elm applications have a reputation for approximately never crashing in practice, and I think this is part of the reason why: aside from mistakes (e.g. of course libraries can still overflow the stack or get into an infinite loop) you really have to go out of your way to have a library crash

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:54):

compare this to, for example, unwrap in Rust - which I think would be a mistake to include in Roc, precisely because it makes it so easy to introduce an unnecessary crash

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:56):

the outcome Elm has seen from not having a crash equivalent is a major reason that Roc did not have crash at first, and why I was hesitant to introduce it (but I did find the argument compelling, and continue to, that if you think it's going to be unreachable, but it actually is reached in practice somehow, it's definitely best to be able to at least include some context on what happened that you thought couldn't happen)

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:56):

it's also a reason that I am so strongly resistant to adding recovery mechanisms in userspace: it seems like every ecosystem that has these gets more crashes in practice than the Elm ecosystem does

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:57):

and I think a contributing factor there is that culturally it's not only okay, but expected in many languages to throw for error handling

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:58):

and if it doesn't end up being caught, shrug, not my problem; someone else should have handled it

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:58):

whereas in Elm, there is only one way to say "it's someone else's job to handle this," which is Result (or some equivalent)

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:58):

and that's the way I want it to be in Roc

view this post on Zulip Richard Feldman (Oct 06 2024 at 23:59):

in order to get the same "applications essentially never crash in practice" outcome that Elm has gotten, and which systems with try/catch (and Rust too) are extremely far away from getting

view this post on Zulip Richard Feldman (Oct 07 2024 at 00:00):

so this is also why my ordering of preferences is:

  1. Have crash, so if the inconceivable happens, at least there's a helpful message to explain what happened
  2. If it ends up being overused, try to solve that by changing the culture and emphasizing that it really should be used almost never in library code
  3. If that doesn't work, consider banning it in the package index
  4. If that still doesn't work (e.g. people just start using direct URLs, or make a competing index), remove crash from the language
  5. Add a way to recover from crash in applications

view this post on Zulip Richard Feldman (Oct 07 2024 at 00:01):

because based on the experience of Elm vs every other language, I can't possibly see how #5 would do anything other than explode the number of crashes that actually happen to real end users

view this post on Zulip Richard Feldman (Oct 07 2024 at 00:01):

if it did anything other than that, Roc would be the first language where that turned out to be true, so I'd be really curious to see why we should expect it to be different from all the others :big_smile:

view this post on Zulip Oskar Hahn (Oct 07 2024 at 06:32):

Richard Feldman said:

because based on the experience of Elm vs every other language, I can't possibly see how #5 would do anything other than explode the number of crashes that actually happen to real end users

I like that a Roc application cannot recover from a crash. But there is a counterexample. Go has panic and recover. The normal way to handle errors in Go is by returning an error value. Many people do not like the way you have to handle errors in Go. But still, (nearly) nobody is reaching for panic and recover.

I think the only time recover is used in Go is when you are writing a framework and you do not expect your users to write high-quality code. So you have to expect runtime panics like division by zero, nil pointer dereferences or out-of-bounds accesses. In the context of Roc, the framework is the platform. So only the platform should use recover. This is already possible.

view this post on Zulip Oskar Hahn (Oct 07 2024 at 06:44):

If the pressure to add recover gets too high, then one way to add it could be that you can only clean up your own mess with it. What I mean is that it can only recover crashes from your own module/application, but never across the module boundary.

So you would implement unwind, but internally check the stack trace. If the stack-trace only contains entries from builtins or the module calling recover, then it returns a Result. But if the stack contains other modules, it continues the crash.

With a recover like this, it would be impossible for library authors to expect their users to recover. But it would be possible to use it for your own code. Like:

doSomeCalculation = \a, b, c ->
  # This could overflow or divide by zero
  a * b / c

exportedFunction = \a, b, c ->
  recover (doSomeCalculation a b c)
  |> Result.mapErr \_-> OverflowOrZeroDivision
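
For comparison, here is roughly the same shape in Rust, where `std::panic::catch_unwind` plays the role of the proposed `recover`. This is only an analogy: it does not implement the stack-trace check described above, and `CalcError` is a made-up type.

```rust
use std::panic::catch_unwind;

#[derive(Debug, PartialEq)]
pub enum CalcError {
    OverflowOrZeroDivision,
}

fn do_some_calculation(a: i64, b: i64, c: i64) -> i64 {
    // `checked_mul` + `expect` panics on overflow in every build profile
    // (plain `*` only panics when overflow checks are enabled).
    let product = a.checked_mul(b).expect("multiplication overflowed");
    // Integer division panics when `c` is zero.
    product / c
}

pub fn exported_function(a: i64, b: i64, c: i64) -> Result<i64, CalcError> {
    // Catch a panic raised by our own code and turn it into a Result.
    catch_unwind(|| do_some_calculation(a, b, c))
        .map_err(|_| CalcError::OverflowOrZeroDivision)
}
```

This only works when panics unwind (the default); the default panic hook still prints the message to stderr before the `Result` is returned.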

view this post on Zulip Richard Feldman (Oct 07 2024 at 11:16):

that's an interesting idea, although it seems likely that with that restriction there wouldn't be any demand for it in practice anyway! :big_smile:

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:01):

Okay, I wasn't sure if you were implying that the database driver had to be integrated into the platform to give a robust experience. And yeah, having a platform per OS probably doesn't make sense, but what I was really going for was just a way to have shared primitives that you don't need to take responsibility for when forking a platform. Like you could have a Rust crate that defines a lot of common Roc operations (IO, panic handling, etc.), and then it would be easy to write your own platform by just calling out to that library for anything where you just want the "standard" behavior. This way you also get all the usual benefits of version tracking, and you don't have to worry about keeping your fork up to date.
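
As a rough sketch of what I mean, in Rust: a shared crate could expose a trait with default implementations, and a custom platform would override only the parts it cares about. `HostDefaults`, `StandardPlatform` and `LoggingPlatform` are hypothetical names, and writing to a `Write` stands in for whatever the real host would do.

```rust
use std::io::{self, Write};

/// Hypothetical shared crate: the "standard" host behavior a platform can reuse.
pub trait HostDefaults {
    /// Standard behavior when Roc code crashes: report the message.
    fn on_crash(&self, out: &mut dyn Write, message: &str) -> io::Result<()> {
        writeln!(out, "Roc crashed: {message}")
    }

    /// Standard behavior for writing a line of output.
    fn write_line(&self, out: &mut dyn Write, text: &str) -> io::Result<()> {
        writeln!(out, "{text}")
    }
}

/// A platform that takes every default as-is.
pub struct StandardPlatform;
impl HostDefaults for StandardPlatform {}

/// A platform that customizes crash handling only and inherits the rest.
pub struct LoggingPlatform {
    pub service: &'static str,
}

impl HostDefaults for LoggingPlatform {
    fn on_crash(&self, out: &mut dyn Write, message: &str) -> io::Result<()> {
        writeln!(out, "[{}] crash: {message}", self.service)
    }
}
```

The custom platform then tracks the shared crate through normal version bumps instead of maintaining a fork.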

view this post on Zulip Richard Feldman (Oct 07 2024 at 15:04):

ahh gotcha!

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:06):

Richard Feldman said:

it's also a reason that I am so strongly resistant to adding recovery mechanisms in userspace: it seems like every ecosystem that has these gets more crashes in practice than the Elm ecosystem does

I do agree it's nice that Elm libraries tend to not crash, but it's actually also something that makes me a bit nervous. Having worked for a number of years on a big and complicated Elm codebase, I know there will be times when you just end up in these branches, and the best thing you can really do is return some default value when it happens, because Elm doesn't give you tools to do better. So I don't think people would use their own custom stack overflow in general when they have to handle such a branch, but instead, I think the most common behavior is to just eat the failure, and try to keep going anyway. Which is also something I don't think you can really measure how often it happens across the ecosystem, so that's something of a ghost that's bothering me there.

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:08):

Also, fun side note: this morning I got into work and the first thing I see is someone with a PR doing handling of stack overflow exceptions :big_smile:

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:14):

I'm not sure how much I can say, so I'll be light on details, but we have a runtime that is executing queries on the JVM, and those can blow the stack. And we'd rather give the queries access to as much stack memory as possible, and recover when they exceed it, rather than limit how much stack they can use in the first place and not let them use the hardware to its full potential.

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:27):

But anyway, summing up the current state of affairs:

view this post on Zulip Richard Feldman (Oct 07 2024 at 15:29):

Kasper Møller Andersen said:

Roc code is not supposed to be able to deal with runtime errors

I'd say something more like "runtime crashes" (errors in general are for Result!) but yeah :big_smile:

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:32):

Fair enough :stuck_out_tongue:

view this post on Zulip Richard Feldman (Oct 07 2024 at 15:32):

Kasper Møller Andersen said:

Having worked for a number of years on a big and complicated Elm codebase, I know there will be times when you just end up in these branches, and the best thing you can really do is return some default value when it happens, because Elm doesn't give you tools to do better. So I don't think people would use their own custom stack overflow in general when they have to handle such a branch, but instead, I think the most common behavior is to just eat the failure, and try to keep going anyway. Which is also something I don't think you can really measure how often it happens across the ecosystem, so that's something of a ghost that's bothering me there.

I do think graceful recovery (when possible) is beneficial to user experience, although ideally it would be accompanied by logging so someone can find out it happened!

Silently giving an incorrect answer, on the other hand (e.g. division by zero silently returning a zero and continuing as if nothing wrong had happened) is definitely bad, and I'd say worse than crashing
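
A small Rust sketch of the difference, using made-up function names: the first version eats the failure and returns a default, the second surfaces it so the caller has to decide what to do.

```rust
#[derive(Debug, PartialEq)]
pub enum MathError {
    DivisionByZero,
}

/// Silently returns 0 when there is nothing to average. The caller cannot
/// tell "no data" apart from a genuine average of 0.
pub fn average_silent(total: u32, count: u32) -> u32 {
    if count == 0 {
        0
    } else {
        total / count
    }
}

/// Surfaces the failure as a Result instead of picking a default.
pub fn average_checked(total: u32, count: u32) -> Result<u32, MathError> {
    // `checked_div` returns None when the divisor is zero.
    total.checked_div(count).ok_or(MathError::DivisionByZero)
}
```

With `average_silent`, both `(10, 0)` and `(0, 5)` produce `0`, so the bad input is indistinguishable from a real result; `average_checked` keeps the two cases apart without crashing.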

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 15:40):

Yup, and those are the ones I’m worried about, because they can be really hard to spot and I think many people may not be aware enough of them. But anyway, that’s all just good arguments for why Roc does as it does today. I just wanted to be sure it’s not about stability at all costs :blush:

view this post on Zulip Richard Feldman (Oct 07 2024 at 15:48):

cool, thanks for talking through all that! :smiley:

view this post on Zulip Richard Feldman (Oct 07 2024 at 15:49):

I appreciate your patience with it :heart:

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 19:24):

Likewise! It’s always nice to get to go through a subject in a thorough manner, when everyone is gracious enough to talk through rookie understandings :smiling_face:

view this post on Zulip Kasper Møller Andersen (Oct 07 2024 at 20:01):

One example of this becoming an awkward fit I think is with GraphQL. GraphQL is most often used over HTTP, but it doesn’t need to be. HTTP is just one of many potential transports you can use for GraphQL. Because GraphQL is transport agnostic, it includes its own error handling mechanism. So ideally, if you have a webserver serving a GraphQL API, the GraphQL layer is actually the stability boundary. To have that be the case in Roc, you need a platform that is specifically for serving a GraphQL API with a webserver, as opposed to using a webserver platform where you just plug in a GraphQL library.

It’s still not a huge deal, because these errors should be quite rare of course. But I do think it works as an example of having a stability boundary above the platform.

view this post on Zulip Brendan Hansknecht (Oct 07 2024 at 23:03):

I think as long as you abstract away the protocol, you could have a basic-webserver type platform that is separate from a GraphQL library. The library will just accept primitives that are protocol generic to build its requests.

view this post on Zulip Brendan Hansknecht (Oct 07 2024 at 23:04):

That said, basic-webserver may not enable websockets. So you may still need to switch platforms if you want to serve graphql over websockets. But you should still be able to use a protocol agnostic graphql library either way.

view this post on Zulip Richard Feldman (Oct 07 2024 at 23:53):

I could be wrong, but my understanding is that websockets can be implemented on top of a normal TCP socket

view this post on Zulip Brendan Hansknecht (Oct 08 2024 at 00:12):

I think so, but I have never looked into it. Just trying to point out that whatever graphql is implemented on top of can be decoupled from a graphql library in roc. A platform doesn't need to support all protocols. The library doesn't need to be specialized to the protocol. Should be able to make it generic and flexible.

view this post on Zulip Kasper Møller Andersen (Oct 08 2024 at 05:03):

My point related to the topic was that the GraphQL layer is where you want to catch a crash though :blush:

view this post on Zulip Brendan Hansknecht (Oct 08 2024 at 05:08):

I don't actually understand that part. Why would graphql be catching the crash?

Oh... you want to send the graphql equivalent of an http 500 status code if there is a crash. And you want to send it over whatever protocol the platform supports (http, websocket, etc)

view this post on Zulip Kasper Møller Andersen (Oct 08 2024 at 05:19):

Precisely! A GraphQL response should encode the error in its own format. For that reason, GraphQL over HTTP will generally send errors with an HTTP 200 status, for example.

view this post on Zulip Brendan Hansknecht (Oct 08 2024 at 05:31):

Yeah, so to support this case in a generic form, you would at least need to be able to register what the crash response should be. Graphql could register a 200 message with the text:

{ "errors": [{ "message": "Server error" }]}

view this post on Zulip Kasper Møller Andersen (Oct 08 2024 at 05:38):

Any webserver can serve both GraphQL and REST endpoints at the same time, so the error handling would need to differentiate on which kind of endpoint was reached

view this post on Zulip Brendan Hansknecht (Oct 08 2024 at 05:40):

yeah, would need to be path aware.
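
A minimal sketch of what a path-aware crash response registry might look like on the host side, in Rust. `Response`, `CrashResponses` and `handle` are hypothetical names, not an existing basic-webserver API, and a Rust panic stands in for a Roc crash.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

#[derive(Debug, Clone, PartialEq)]
pub struct Response {
    pub status: u16,
    pub body: String,
}

/// Crash responses registered per path prefix, plus a generic fallback.
pub struct CrashResponses {
    by_prefix: Vec<(String, Response)>,
    fallback: Response,
}

impl CrashResponses {
    pub fn new() -> Self {
        Self {
            by_prefix: Vec::new(),
            fallback: Response {
                status: 500,
                body: "Internal Server Error".to_string(),
            },
        }
    }

    /// Register the response to send when a handler under `prefix` crashes.
    pub fn register(&mut self, prefix: &str, response: Response) {
        self.by_prefix.push((prefix.to_string(), response));
    }

    /// Pick the response for `path`: the longest matching prefix wins,
    /// otherwise the generic 500 fallback is used.
    pub fn for_path(&self, path: &str) -> &Response {
        self.by_prefix
            .iter()
            .filter(|(prefix, _)| path.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, response)| response)
            .unwrap_or(&self.fallback)
    }
}

/// Run a handler; if it panics, answer with the registered crash response.
pub fn handle(
    crash: &CrashResponses,
    path: &str,
    handler: impl FnOnce() -> Response,
) -> Response {
    match catch_unwind(AssertUnwindSafe(handler)) {
        Ok(response) => response,
        Err(_) => crash.for_path(path).clone(),
    }
}
```

A GraphQL library could then register a 200 response with the error body above for `/graphql`, while REST endpoints keep getting the plain 500.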

view this post on Zulip Kasper Møller Andersen (Oct 08 2024 at 07:11):

While this error handling would work in this case, this also reflects what I was trying to get at with platforms not being composable enough. That is, if a user has an existing webserver (based on any general web server platform like basic-webserver, nea, or something else) which doesn’t offer this kind of error handling, and they now want to serve GraphQL, they’ll need to fork the platform to do so. It’s still completely reasonable for a platform author to not have this error handling in their web server though. So it feels to me like this error handling should be application code, and not something platform authors have the only say in.

Again, not a huge deal for this particular example, but it just makes my spidey sense tingle nonetheless :blush:


Last updated: Jun 16 2026 at 16:19 UTC