Related to the Q&A, I wonder how hard it would be to make a tree-walking or bytecode interpreter that runs on one of our IRs.
not that hard but you'd have to make it work with the platform
just for the repl that would work fine but not for actual executables
You could generate a shim to deal with that, but it wouldn't be pretty. Basically the interpreter would be a C library. When calling roc someapp.roc, it would figure out all the symbols it needs to expose and build a shared library that essentially exposes those symbols, converts to a generic format, and then calls the underlying interpreter's generic execution symbol.
So not pretty, but probably not too hard to do. Though you would definitely want to manually build that shared library to not waste 100s of milliseconds just on linking before you can even start running your interpreter.
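A rough sketch of the shim idea above, in Python rather than as a real shared library. Every name here (`interpret`, `build_shim`, the two example symbols) is hypothetical and only illustrates the shape: one wrapper per exposed symbol, each converting its arguments to a generic form and forwarding to a single generic entry point.

```python
# Hypothetical sketch: none of these names exist in the Roc compiler.

def interpret(symbol, generic_args):
    """Stand-in for the interpreter's single generic execution entry point."""
    program = {
        "addOne": lambda args: args[0] + 1,
        "greet": lambda args: "hi " + args[0],
    }
    return program[symbol](generic_args)


def build_shim(exposed_symbols):
    """Return one wrapper per exposed symbol, as the generated library would."""
    def make_wrapper(symbol):
        def wrapper(*args):
            generic_args = list(args)  # "convert to a generic format"
            return interpret(symbol, generic_args)
        return wrapper

    return {symbol: make_wrapper(symbol) for symbol in exposed_symbols}


shim = build_shim(["addOne", "greet"])
```

A host would then call `shim["addOne"](41)` without knowing an interpreter sits behind it.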
I really want to build such an interpreter. At least we could reuse it for constant eval
I really want to build such an interpreter.
For any specific reason?
For constant eval. Also easy way to validate correctness of the IR compiler
Also seems useful for debugging (for roc users).
I'm open to being wrong about this, but I actually think we'll want to use the dev backend for constant eval
in dev builds, I suspect in a lot of cases it would result in almost exclusively compiling things we're compiling anyway, plus then the execution would be faster!
and in optimized builds, once we have caching, it might be able to reuse previously built and cached dev artifacts to avoid rebuilding them (before executing the constant eval)
I'm interested how the mechanics of that would work. The useful thing about an interpreter at the IR level is you can partially evaluate anything (including partial evaluation inside a function) and get out the same IR that you can then generate to any machine code. How would that work with the dev backend? My intuition is it would only be able to evaluate top-level values, and only generate platform-specific values, unless you also have some read-back mechanism.
oh yeah that was my intention, just evaluate top-level values
(or values that can be hoisted to top level because they don't depend on anything else)
wouldn't partial evaluation that could be done by an interpreter need to be kicked off from a top level value anyway? :thinking:
interpreting as opposed to like eta reduction
17 messages were moved here from #show and tell > RustNL compiler performance talk by Richard Feldman.
If you have a function
foo = \x ->
    heavyCompileTimeComputation = \{} -> ... + x
    heavyCompileTimeComputation {}
you could partially evaluate "heavyCompileTimeComputation" with an IR-level interpreter, but I don't think you could with a dev backend
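To make the partial evaluation point concrete, here is a minimal sketch over a toy expression IR (tuples, not Roc's actual mono IR). The expression used is a made-up stand-in for the elided body above: everything that does not depend on `x` gets folded, and what comes out is still IR that any backend could compile.

```python
def partial_eval(expr, env):
    """Fold whatever can be computed now; leave the rest as residual IR."""
    kind = expr[0]
    if kind == "num":
        return expr
    if kind == "var":
        name = expr[1]
        return ("num", env[name]) if name in env else expr
    if kind in ("add", "mul"):
        left = partial_eval(expr[1], env)
        right = partial_eval(expr[2], env)
        if left[0] == "num" and right[0] == "num":
            value = left[1] + right[1] if kind == "add" else left[1] * right[1]
            return ("num", value)
        return (kind, left, right)
    raise ValueError(f"unknown node: {kind}")


# Hypothetical body: (2 * 3 + 4) + x, where x is only known at runtime.
body = ("add", ("add", ("mul", ("num", 2), ("num", 3)), ("num", 4)), ("var", "x"))
residual = partial_eval(body, {})
```

Here `residual` is `("add", ("num", 10), ("var", "x"))`: the constant part is gone and the result is target-independent.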
You also need to implement readback from the dev backend if you want the computed values to be architecture-independent
as for the mechanics, I was thinking that the goal would be to turn all top-level values into bytes in the readonly section - so basically:
roc_alloc. They can be called concurrently.
I was thinking they'd be target-specific
hm that's interesting - I wonder about the tradeoffs around total compile time
the main thing I'm thinking about here is dev build times :big_smile:
but maybe if we had an interpreter that could do partial evaluation during --optimize builds that would be the best of both worlds?
I mean in dev builds do you want partial evaluation at all?
Presumably it would be faster except in the case where you have a hot loop of expensive computation to just execute the code as-is
I don't think we want partial evaluation in dev builds, but there's a "consistency between dev and optimized builds" issue if we don't evaluate top-level values at compile time in both cases
e.g. if there's a crash or dbg in them, when do you see it (or do you see it at all) etc.
also potentially in the future there might be value to being able to expose top-level static values directly to hosts instead of putting them behind a thunk (not sure, but it's a consideration)
but there's a "consistency between dev and optimized builds" issue if we don't evaluate top-level values at compile time in both cases
I worry about supporting consistency of this form in the first place - this means if we ever want to change the scope of what can and cannot be evaluated at compile time, this is a problem, right?
I think one reasonable alternative is to not evaluate anything with a dbg/crash in it at compile time, or make it an error to have that in a toplevel value.
potentially, although I know having crash in toplevel values is useful - e.g. in Elm this would come up when people wanted to compile regular expressions as toplevel values and then immediately get rid of the Result from parsing the regex string, so if it failed you'd know about it right away when the program started up, and otherwise you could use it without the Result across the whole program
so I guess any top level value that would involve parsing something would want crash
and disallowing dbg could make it hard to debug nontrivial function calls in the top level values
one reasonable alternative is to not evaluate anything with a dbg/crash in it at compile time
This part of Ayaz's suggestion doesn't cause a mismatch between dev and release though.
And it's compatible with what Richard is describing from Elm.
The main difference is that when Elm generates constant values in JS, they are dynamically evaluated once, when the JS engine reads the elm.js script.
To get that behaviour, we could generate an init_constants function that evaluates all the constant thunks on starting the program. I'm not sure how it would get called from the host though.
I suppose we could insert some code in mainForHost to call the init thunks.
But the current behaviour of crashing on first usage probably fits better with other things in Roc (like running with type errors as long as you don't hit that path)
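The two behaviours being compared can be sketched like this. It is only an illustration: `ConstantThunk`, `init_constants`, and `parse_port` are made-up names, and a Python exception stands in for a Roc crash.

```python
class ConstantThunk:
    """A top-level value that is computed at most once, on first use."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value


def parse_port(text):
    # Stand-in for "parse something and crash if it is invalid".
    if not text.isdigit():
        raise RuntimeError(f"crash: invalid port {text!r}")
    return int(text)


def init_constants(thunks):
    """Eager variant: force every constant at program start (Elm-like)."""
    for thunk in thunks:
        thunk.force()


good = ConstantThunk(lambda: parse_port("8080"))
bad = ConstantThunk(lambda: parse_port("oops"))
```

With lazy thunks, `bad` only crashes if some code path calls `bad.force()`. Calling `init_constants([good, bad])` at startup surfaces the crash immediately instead.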
In any case, that’s probably a separate concern from the method of evaluation itself. I’m still pretty concerned about the dev backend’s architecture-specific output (doesn’t that mean cross compilation is limited?) and only being able to work on the top level, so no partial constant folding etc
Couldn’t constant expressions in functions be lifted to the top so that they can be evaluated independently (with the dev backend)? Sorry if this is a dumb question, I’m just curious :)
@Agus - I'm assuming so, yes :big_smile:
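A small sketch of what that lifting could look like on a toy expression IR (tuples, not Roc's real IR). The `("global", name)` node and the `constN` naming are assumptions made up for the example: any subexpression with no free variables is moved to a fresh top-level definition and replaced by a reference.

```python
import itertools


def free_vars(expr):
    """Variables an (un-hoisted) expression depends on."""
    kind = expr[0]
    if kind == "num":
        return set()
    if kind == "var":
        return {expr[1]}
    return free_vars(expr[1]) | free_vars(expr[2])


def hoist_constants(expr, hoisted, counter):
    """Replace maximal closed subexpressions with references to new globals."""
    kind = expr[0]
    if kind in ("num", "var"):
        return expr
    if not free_vars(expr):
        name = f"const{next(counter)}"
        hoisted[name] = expr  # can now be evaluated independently
        return ("global", name)
    return (
        kind,
        hoist_constants(expr[1], hoisted, counter),
        hoist_constants(expr[2], hoisted, counter),
    )


hoisted = {}
body = ("add", ("mul", ("num", 2), ("num", 3)), ("var", "x"))
rewritten = hoist_constants(body, hoisted, itertools.count())
```

After this, `hoisted` holds `const0 = 2 * 3` as its own top-level value, and `rewritten` only refers to it by name.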
Ayaz Hafiz said:
I’m still pretty concerned about the dev backend’s architecture-specific output (doesn’t that mean cross compilation is limited?)
I'm not sure I follow this part - do you mean that, for example, something might crash when building due to OOM on a 32-bit target but not on a 64-bit target? Or something else?
as an aside, I didn't say it explicitly earlier, but one of the things I like about the dev backend for top-level constant evaluation idea is that it essentially adds nothing to the build+run cost:
so in terms of dev builds, where you basically always build and then immediately run, I think the cost of evaluating them at compile-time ends up being essentially zero. Really it's the difference between however we end up storing them (after evaluating them) at build time versus at runtime, but if anything that probably makes the build-time version faster because it always gets to bump allocate the entire evaluation, whereas the runtime version only can in some situations (depending on the platform)
in contrast, I'd expect using an interpreter at build time to slow down dev builds by some amount. Probably for most constants it would be trivial, but I can imagine people trying to do some ambitious things at build time using this feature, especially in conjunction with being able to bring in files as Str or List U8 constants.
but like I said, I think using an interpreter in --optimize for partial constant folding seems reasonable - and in fact we could potentially do both in --optimize builds if that's faster: use the dev backend for evaluating all the top-level constants (none of which will be partial) and then afterwards do a pass with the interpreter to handle all the remaining (potentially partial) things
Richard Feldman said:
Ayaz Hafiz said:
I’m still pretty concerned about the dev backend’s architecture-specific output (doesn’t that mean cross compilation is limited?)
I'm not sure I follow this part - do you mean that, for example, something might crash when building due to OOM on a 32-bit target but not on a 64-bit target? Or something else?
Layouts can be different between targets. Alignment of 128 bit ints, for example, which could affect field order. Or pointer size 64 vs 32. So if you evaluate constants in the dev backend, you will get bytes that are appropriate for the machine you compiled on. But maybe not on the final target machine.
But I suppose you could do what we do in the REPL to turn it back into an AST value. Then turn that into IR and swap _that_ into the target program.
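To illustrate the layout point, here is a toy layout calculator. The two targets and their size/alignment numbers are hypothetical, and the field ordering rule (descending alignment, then name) is an assumption chosen for the example rather than a statement of what the compiler does.

```python
# (size, alignment) in bytes per type. Both targets are hypothetical.
TARGET_A = {"ptr": (8, 8), "u64": (8, 8), "u128": (16, 16)}
TARGET_B = {"ptr": (4, 4), "u64": (8, 8), "u128": (16, 8)}


def record_layout(fields, target):
    """fields maps name -> type. Returns ({name: (offset, size)}, total_size)."""
    # Assumed ordering rule: highest alignment first, ties broken by name.
    ordered = sorted(fields, key=lambda name: (-target[fields[name]][1], name))
    offsets, offset = {}, 0
    for name in ordered:
        size, align = target[fields[name]]
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = (offset, size)
        offset += size
    max_align = max(target[t][1] for t in fields.values())
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


fields = {"count": "u64", "id": "u128", "next": "ptr"}
layout_a, size_a = record_layout(fields, TARGET_A)
layout_b, size_b = record_layout(fields, TARGET_B)
```

The same record puts `id` at offset 0 on target A but at offset 8 on target B, and `next` is 8 bytes on one and 4 on the other, so bytes produced on one cannot be used verbatim on the other.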
Brian Carroll said:
if you evaluate constants in the dev backend, you will get bytes that are appropriate for the machine you compiled on. But maybe not on the final target machine.
ahh that makes sense! :thumbs_up:
I think that wouldn't be a problem in practice because the steps would be:
even when compiling for the same target as the current machine, the second step couldn't be as simple as one big memcpy because the pointer locations would need to change—so there would always need to be some transformations during the copying.
but I think the key thing is that all the same information is there regardless of target architecture, if that makes sense - like, all the numbers should be the same, all the strings should be the same, all the records and tag unions should have the same number of fields/variants, etc. (even if they have different orderings on the target)
so if the layouts happen to be identical, then more things can be memcpy'd over, but otherwise there might need to be some reordering (e.g. copy from offset 0-15 in the source address into offset 16-31 in the destination address because the field ordering is different on this target) and also pointer size differences, but I think that's it!
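The reordering step could look roughly like the sketch below, assuming each layout is given as a hypothetical `{field: (offset, size)}` map. It only handles plain scalar fields; pointers would additionally need relocating and possibly resizing, which is left out.

```python
def transcode_record(src_bytes, src_layout, dst_layout, dst_size):
    """Copy each scalar field from its source offset to its target offset."""
    dst = bytearray(dst_size)
    for name, (src_off, size) in src_layout.items():
        dst_off, dst_field_size = dst_layout[name]
        if size != dst_field_size:
            raise ValueError(f"field {name!r} changes size; not a plain copy")
        dst[dst_off:dst_off + size] = src_bytes[src_off:src_off + size]
    return bytes(dst)


# Two 16-byte fields whose order is swapped on the target.
src_layout = {"a": (0, 16), "b": (16, 16)}
dst_layout = {"b": (0, 16), "a": (16, 16)}
src = (1).to_bytes(16, "little") + (2).to_bytes(16, "little")
dst = transcode_record(src, src_layout, dst_layout, 32)
```

Field `a` moves from offsets 0-15 in the source to offsets 16-31 in the destination, matching the example in the message above. When both layouts are identical the loop degenerates into what a single memcpy would do.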
That kind of readback is doable, but to me it feels really complicated - now you need a separate pass in the compiler that's similar to what the REPL/glue do, for converting the bytes to the target (in my mind this is more complicated than an interpreter given our IR is very small, but I appreciate this could be a wrong opinion). I guess I'm just not convinced that the complexity and the limitations of the only-toplevel evaluator are better than having an IR-level interpreter, given that most of the time the expressions/partial programs to evaluate will be pretty small in size (of course there are exceptions), and the performance is unlikely to differ much.
Another consideration is that to perform any meaningful evaluation over the IR (including partial evaluation) we would need to have a pass that performs analysis over the constant subset (i.e. similar to CSE)
there might need to be some reordering ... and also pointer size differences, but I think that's it!
Not quite. For some tag unions, we store the tag ID in the lower bits of the pointer, if the number of tags is less than the pointer width in bytes. So if you have a 5-tag union that you're cross compiling from x86_64 to Wasm, that complicates things a bit.
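A sketch of the rule as stated above, with made-up helper names (these are not the compiler's functions): the tag ID goes in the low pointer bits only when the number of tags is less than the pointer width in bytes.

```python
def tag_id_fits_in_pointer(num_tags, pointer_width_bytes):
    """Rule as described in the message above."""
    return num_tags < pointer_width_bytes


def pack(address, tag_id, pointer_width_bytes):
    """Store the tag ID in the unused low bits of an aligned address."""
    assert address % pointer_width_bytes == 0, "address must be aligned"
    assert tag_id < pointer_width_bytes
    return address | tag_id


def unpack(tagged, pointer_width_bytes):
    """Split a tagged pointer back into (address, tag_id)."""
    mask = pointer_width_bytes - 1
    return tagged & ~mask, tagged & mask


on_x86_64 = tag_id_fits_in_pointer(5, 8)  # 8-byte pointers
on_wasm32 = tag_id_fits_in_pointer(5, 4)  # 4-byte pointers
```

For the 5-tag union, `on_x86_64` is true and `on_wasm32` is false, so the same value has a different representation on each target: readback has to unpack the tag from the pointer on one side and write it out separately on the other.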
We are all probably thinking of different implementations
The implementation I imagined earlier was this:
But I think Richard is describing something like this:
layout.stack_size(target_info), layout.alignment_bytes(target_info), layout.stores_tag_id_in_pointer_bits(target_info).
And I think the IR interpreter would work like this
2+2 to 4 etc.
Note that the dev backend and the interpreter both have to traverse the IR. But after that pass, the "interpreter based" system is finished and can immediately start generating the final target. Whereas in the "dev backend based" solution we have to do a bunch of extra steps - executing, transforming bytes, etc.
that's true if the top level expression is trivial, but it could be very involved - e.g. it might call functions, those might conditionally call other functions, etc.
so maybe another way to express my concern is that if people start doing more involved things with constant evaluation, if we're executing all of that with the dev backend then it's going at dev backend speeds, and if interpreted, it's going at slower speeds
so it could be fine (or even faster) to interpret when the top level expression is not doing much, but maybe not if it's doing a lot
I suppose one thing we could try is interpreting by default and then if in the future people run into perf problems in practice, consider dev backend as a way to improve that
if we're executing all of that with the dev backend then it's going at dev backend speeds, and if interpreted, it's going at slower speeds
I don't think that's right.
I know we associate "dev backend" with "fast" and we associate "interpreter" with "slow".
But I think it's a false comparison.
It is comparing "fast" compile time to "slow" execution time.
But those times are actually about the same.
The dev backend traverses the IR, accumulates some stuff, and then serializes that stuff. And we are saying that's fast.
The IR interpreter just needs to traverse the IR and accumulate some stuff. That's a similar operation to the dev backend. But now just because it's an "interpreter", we are saying it's slow.
Especially when after running the dev backend you are still not finished, you still have to execute it and transform the bytes.
Hmm... although I just realised that in the case of an iteration or recursion that might not be true.
The IR interpreter would have to actually do the iteration but the code generator wouldn't.
so it could be fine (or even faster) to interpret when the top level expression is not doing much, but maybe not if it's doing a lot
OK yes I see this point now, it could go either way.
Maybe the IR interpreter should bail out of recursions/iterations after some limit. Or bail out of function calls after some stack depth limit.
Though maybe that makes it hard for a user to know what to expect.
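For reference, a step limit like that could be sketched as follows on a toy IR. The `repeat_add` node is a made-up stand-in for a loop, and the budget numbers are arbitrary. When the budget runs out, the expression is simply left unevaluated for runtime.

```python
class OutOfFuel(Exception):
    """Raised when the step budget is exhausted."""


def evaluate(expr, budget):
    """Evaluate a closed toy expression. budget is a one-element list used
    as a mutable step counter; every node visit costs one step."""
    if budget[0] <= 0:
        raise OutOfFuel
    budget[0] -= 1
    kind = expr[0]
    if kind == "num":
        return expr[1]
    if kind == "add":
        return evaluate(expr[1], budget) + evaluate(expr[2], budget)
    if kind == "repeat_add":  # ("repeat_add", times, step_expr)
        total = 0
        for _ in range(expr[1]):
            total += evaluate(expr[2], budget)
        return total
    raise ValueError(f"unknown node: {kind}")


def try_fold(expr, max_steps):
    """Fold to a literal if it finishes within budget, else leave it alone."""
    try:
        return ("num", evaluate(expr, [max_steps]))
    except OutOfFuel:
        return expr


cheap = ("add", ("num", 2), ("num", 2))
loop = ("repeat_add", 1000, ("num", 1))
```

`try_fold(cheap, 100)` folds to `("num", 4)`, while `try_fold(loop, 100)` gives up and returns the loop unchanged. Which side of the limit a given expression lands on is exactly the part a user cannot easily predict.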
yeah some systems do that, but I don't like the idea of doing that - it means you can take working code, extract it to the top level, and now it no longer builds
in general I don't think top level evaluations should have any restrictions beyond what they'd have if they were evaluated anywhere else
an implication of this is that they can hang your build if there's an infinite loop in them, but I think that's worth accepting as a downside
Which IR would we run this interpreter on? Canonical or monomorphized?
I suppose the canonical one should contain less data, so it would be quicker to traverse.
Better to evaluate each def once generically rather than redoing it for each specialization.
So we transform a complicated generic expression into a simpler generic expression, and then generate specializations of that simplified version.
Probably worth at least spiking both and profiling? The nice thing from the IR one is you are likely to be able to evaluate deeper, inline more, etc. since you may have non-generic specialization calls that aren’t visible before type specialization (e.g. ability members). And you can trivially write a stack-based interpreter instead of a tree one.
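As an illustration of the stack-based option, here is a minimal sketch over a toy expression tree (not the real mono IR, and the op names are invented): the tree is flattened once into linear ops, which are then run in a simple loop with an explicit value stack.

```python
def compile_to_stack_ops(expr, ops=None):
    """Flatten a toy expression tree into postfix stack operations."""
    if ops is None:
        ops = []
    kind = expr[0]
    if kind == "num":
        ops.append(("push", expr[1]))
    elif kind in ("add", "mul"):
        compile_to_stack_ops(expr[1], ops)
        compile_to_stack_ops(expr[2], ops)
        ops.append((kind,))
    else:
        raise ValueError(f"unknown node: {kind}")
    return ops


def run(ops):
    """Execute the linear ops with an explicit value stack."""
    stack = []
    for op in ops:
        if op[0] == "push":
            stack.append(op[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op[0] == "add" else left * right)
    return stack.pop()


expr = ("add", ("num", 2), ("mul", ("num", 3), ("num", 4)))
ops = compile_to_stack_ops(expr)
result = run(ops)
```

The execution loop never recurses, so deep expressions cost stack entries in a list rather than host call frames.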
The nice thing from the IR one
again, which IR? We are comparing two IRs! :laughing:
Oh OK you mean mono IR
Maybe nobody refers to can as an IR. I thought we did but I haven't really worked that close to the front end very much so I don't know.
I suppose AST is probably a better description of can, and IR usually implies something more linearised.
So this is probably just me still learning compiler terminology 2 years in!
Sorry, yes, the mono IR
cross-posting since it's relevant to this discussion:
Richard Feldman said:
I wrote up a proposal for this!
https://docs.google.com/document/d/1hp12UnSdS0cIIaUjLhp2NzmzUewwBxp2Mtwk3CNtnI4/edit?usp=sharing
(let's discuss that proposal on the other thread, since this one is about IR interpreting and that proposal proposes using a different design!)
Last updated: Jun 16 2026 at 16:19 UTC