Here is my solution to part 1:
day 3 part 1
It's quite slow (about two and a half minutes on my machine), and I had to overcome the following bugs I found: #8662 #8663 #8664
Also, is a break keyword planned for exiting loops early?
Nice, I’m working on mine too :)
PS: you can put a code block inside the spoiler block to have the content formatted correctly
Here is my day 3 part 1 too :)
day 3 part 1
It took 6 minutes to solve and used up to roughly 8 GB of RAM. What's (not so) funny is watching the interpreter slow down little by little, even though in theory there is no reason for the slowdown: no reason why processing each line of the puzzle input should take longer and longer as time passes. So I suppose there is something in the interpreter VM that grows linearly over time and needs to be accessed for every instruction.
Here is a video recording of the solution being computed. I cut the video in the middle (where nothing happens) and sped it up to 8x realtime so as not to bore you. But you can still clearly see how much slower the last few lines of the puzzle input are to process compared to the first few.
Here is my puzzle input in case you want to try to reproduce it exactly.
ok I'm pretty convinced it's worth going back to a dev backend :sweat_smile:
what’s a dev backend?
Backend in the sense of compiler backend: the part of the compiler that generates the actual machine code. We will use LLVM for this in release builds because it has a lot of optimizations and generates fast code. But it takes forever to generate that code, so the old Rust compiler has a dev backend that quickly generates less optimized code to run during development. The idea was that an interpreter might be better than a dev backend, but as you showed, it's slow.
Isn't there an alternative to LLVM specialized for this use case, focused only on the optimizations that are the most worth it and the least time-consuming, to get to an executable as fast as possible?
Quick search seems to suggest using cranelift for this, as it’s apparently quite fast and supports multiple architectures. Is that what you had in mind?
Cranelift is in between LLVM and what we already did for the Rust compiler (which can be ported over to Zig), which is to go straight from our monomorphized IR to machine code
Richard Feldman said:
ok I'm pretty convinced it's worth going back to a dev backend :sweat_smile:
Hmmm... Why? For perf?
Pretty clearly, languages have made relatively performant interpreters. I'm sure roc could do this.
We currently have a young, likely quite slow interpreter. I mean, currently it is a tree-walking interpreter that is very unoptimized in how it handles refcounts (and thus allocations).
Though I guess if you care maximally about the dev mode experience, it would probably be an interpreter with a JIT, for fast startup and fast peak perf.
I guess I just feel that there are likely large, well-trodden paths to making a way more optimized interpreter.
A good metric would be to port the code to equivalent python or pypy or js or lua etc. See where perf can be and how that compares to a debug build of some compiled language with refcounting (might have to manually add in the refcounting to some language).
Yeah, ported to python or node, this runs in like 60 ms. So I see this as clearly our interpreter is slow, not that we need a dev backend.
That said, part of building a better interpreter may overlap with dev backend like constructions (e.g. compiling to a monomorphized linear bitcode).
yeah like - previously we had a dev backend with zero perf problems for either small or big examples
we went with an interpreter this time partly for constant folding but also bc we didn't have any other backend
A big reason for the interpreter is fast startup time without dealing with the surgical linker.
but it's like - yeah we could try to optimize this interpreter and JIT and all that...but wouldn't it be simpler to go back to the design that had neither slow startup problems in practice nor performance problems once it got running?
slow startup problems in practice
It will have slow startup problems without a surgical linker on all platforms.
we don't need to do surgical linking anymore though - the current "embed the interpreter shim" design can just as easily do "embed an even thinner shim that just executes the bytes we give it"
Hmm. Launch the same way a jit would launch a compiled function
That is a nice simple idea.
Ok yeah, then same cost of one linked unit that is more costly, but it is one off.
yeah so in that world, hosts are still pre-linked like they are today
You may still want an interpreter over the lowest-level bitcode that would get converted into dev backend assembly (and it might be best to start by working towards that).
The interpreter at that level would make it so you immediately support all platforms (x86, arm, risc5, wasm, etc). Then if you want more perf, you could nearly 1:1 map from the low-level interpreter to assembly.
As an aside, I had Claude hack together some really awful C++ code that tries to model what debug-backend Roc would actually run here. It takes ~150ms to run.
since we already have something that works (although too slowly), I was thinking we'd do the same progression as last time:
--optimize working for real
I would advise not just monomorphizing, but also flattening the IR. I think that will make it easier to generate both LLVM IR and dev backend assembly. The flattened IR might also be a better setup for refcounting and some optimizations.
And yeah, llvm first sounds totally reasonable
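To make the flattening idea concrete, here is a minimal sketch (hypothetical types, not the real compiler's IR): all nodes live in one contiguous array and refer to their children by index instead of by pointer, so a backend can lower them in a single forward pass.

```zig
// Hypothetical flattened IR, for illustration only (not the real compiler types).
const NodeIdx = u32;

const Node = union(enum) {
    int_literal: i64,
    add: struct { lhs: NodeIdx, rhs: NodeIdx },
    call: struct { target: u32, first_arg: NodeIdx, arg_count: u16 },
};

// (1 + 2) + 3 flattened so that children always precede their parents:
const example = [_]Node{
    .{ .int_literal = 1 },               // 0
    .{ .int_literal = 2 },               // 1
    .{ .add = .{ .lhs = 0, .rhs = 1 } }, // 2
    .{ .int_literal = 3 },               // 3
    .{ .add = .{ .lhs = 2, .rhs = 3 } }, // 4
};
```

Because the array is already in dependency order, the idea is that an LLVM lowering, a dev backend, and refcount insertion could each be a linear pass over it.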
which is to go straight from our monomorphized IR to machine code
Isn't that a lot of maintenance, having something running on a few different architectures? I mean, I could imagine x86, arm, and RiscV sharing non-negligible shares in a few years. I'm curious: with something like cranelift (Roc IR -> Cranelift IR -> binary) plus regular linking done with good practices (reduce object count, static linking, ...) and a fast linker like mold, how far is this from the targets you have for Roc?
"Sharing non-negligible shares in a few years"
What do you mean by this?
I mean, if a few years down the line we see market shares like 50% / 30% / 20% or similar, with no dominant one.
I think there is some risk with the long tail of odd variations of arm/risc5 for embedded, but having a dev backend for x86, arm, and risc5 shouldn't be particularly hard.
The dev backend in the rust compiler supports x86 and arm already. Not totally filled out, but definitely the core. And we had a separate dev backend for wasm. So many of these things already worked.
The core part is that these are dev backends and have essentially no optimizations. They really just dump a handful of raw machine code bytes for any given Roc IR node.
They do have a few minor optimizations, but those are all shared between all backends.
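As a rough caricature of "dump a handful of raw machine code bytes for any given IR node" (toy ops and x86-64 encodings chosen for illustration; this is not the real backend's code):

```zig
const std = @import("std");

// Toy ops standing in for lowered IR nodes (names made up for this sketch).
const Op = union(enum) {
    load_const: u64, // rax = imm
    add_arg, // rax += rdi
    ret,
};

// Each op is a fixed byte template copied straight into the output buffer
// (x86-64 encodings). Returns how many bytes were written.
fn emit(out: []u8, op: Op) usize {
    switch (op) {
        .load_const => |imm| {
            out[0] = 0x48; // REX.W prefix
            out[1] = 0xB8; // mov rax, imm64
            std.mem.writeInt(u64, out[2..10], imm, .little);
            return 10;
        },
        .add_arg => {
            @memcpy(out[0..3], &[_]u8{ 0x48, 0x01, 0xF8 }); // add rax, rdi
            return 3;
        },
        .ret => {
            out[0] = 0xC3; // ret
            return 1;
        },
    }
}
```

There is no register allocation or instruction selection to speak of, which is what keeps compile times low enough to compete with interpreter startup.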
I suppose there are still things like TCO, in-place memory mutations, and other important optimizations like that happening for all backends, right?
yes
The way I see cranelift is that it is not particularly compelling for dev builds, because it is not particularly fast at compilation. It is great as a slightly-less-optimizing alternative to LLVM, but when you really just want to blit out machine code and have compilation fast enough to compete with interpreter startup time, it is really slow.
dev build -> fastest compilation time, compete with optimized interpreters for execution speed
cranelift -> ok compilation time, rather solid base optimization for execution speed
llvm -> slow to exceptionally slow compilation time, any execution speed level you want to target
For linkers, mold is mostly faster for larger link jobs. So for small apps and dev builds it is not a huge help. Even really basic linking tends to really hurt experience for tiny apps. You feel it if every run of the compiler has an extra 500ms to link before execution.
In my mind, for dev builds we compete with a python like experience. In release builds, we compete with a go like experience.
I mean, in the end, as a user I don't care if my program takes 50ms or 200ms. It's technically cool to have super-fast responses, but yeah, user experience mostly feels instantaneous below 0.2s.
Yeah, but your bar was rightfully somewhere around 200ms. Linking the full dependencies that a platform takes will often push you to 500ms to 1000ms right from the starting point.
I used to do a lot of image processing and data processing, using Rust. In that context, not compiling in --release mode was not even an option, because programs would be something like 100x slower in debug builds. That's why I'm always suspicious of the actual usefulness of unoptimized compilation.
That's totally fair. There are many cases where you always need at least the equivalent of -O.
One nice thing with roc is that the platform can always be compiled with -O3 equivalents. So you get a very python-style tradeoff in dev builds: if enough of the heavy lifting is in the platform, then roc perf doesn't really matter, it is just glue.
For cases like you mentioned above, in general our hope is that our release compile times can compete with go/D/other compilers considered fast. Different target for total compilation time, but trying to keep things fast and smooth for you.
That may be hard when comparing to compilers that do not use LLVM, but as long as we break down our compilation unit to enable more incremental compilation and caching, I'm sure we can manage a lot.
In your opinion, how fast could the roc interpreter be, without spending way too much on optimizing it? For reference, I converted my solution to python, doing the same thing, and the python solution took 30ms. (Roc took 6 min).
But obviously there is the question of whether it's worth it if a dev backend is the goal. I suppose you still need something to compute all the constants, right?
I think it definitely should be able to reach a similar speed to python.
yeah, but the dev backend approach should be much faster than Python :smile:
For sure. I tried to port this to the old compiler's dev backend (but it segfaulted). So instead: old compiler with a dev LLVM build, which should be pretty similar. Python takes in the range of 50-60ms. The dev backend executable takes in the range of 10-20ms.
That said, the old compiler takes like 60ms for the frontend to process all of the basic-cli roc files and builtins, so in total time that would make us slower than python for this app. Roughly in the same ballpark though.
But why is our interpreter so much slower than Python? I feel like there is probably some very low-hanging fruit we can go after that will speed our current interpreter up heaps
I tried asking Claude to give me a review of the overall architecture of the compiler, which was a very interesting read for me. Then, with that context in mind, I asked it to assess what potential issues might be making this program slow. Here is what it gave me.
INTERPRETER_PERFORMANCE_ANALYSIS.md
I wouldn't take anything for granted, but with your knowledge of the compiler, you can probably easily tell what makes sense and what doesn't in its report.
When writing the roc code, I always assumed that some of the things Richard presented are in place. But if some things, like in-place replacement of lists when the RC is 1, are not in the interpreter, that could make the way I wrote it very costly. Considering the program took 8 GB of RAM, there is also most likely an issue with intermediate values not being deallocated; there might be degradation of the stack or other parts of the interpreter that keep references to things.
Yeah these are the kinds of low hanging fruit I'm thinking about...
the analysis seems pretty bad to me haha
as in, seems like a lot of naive guessing
could be that in-place optimizations are not working, but also could be that the analysis isn't aware they exist :smile:
that one's certainly worth looking into though! I forget if in-place is in list.zig or outside it
The pushInPlace function in list.zig, for example, is not used anywhere in src/ other than in list.zig's own tests. So I suppose the interpreter never does the push-in-place optimization.
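(For context, the in-place optimization being discussed follows roughly this pattern; the types and function below are a simplified sketch, not the actual list.zig implementation:)

```zig
const std = @import("std");

// Simplified stand-in for a refcounted list (not the real RocList layout).
const List = struct {
    elements: []i64, // allocated buffer; elements.len is the capacity
    length: usize, // number of elements currently in use
    refcount: usize,
};

// If we hold the only reference and there is spare capacity, mutate the
// existing buffer in place instead of allocating and copying a new one.
// (The sketch skips freeing/decrementing the old buffer in the copy path.)
fn push(gpa: std.mem.Allocator, list: *List, value: i64) !void {
    if (list.refcount == 1 and list.length < list.elements.len) {
        list.elements[list.length] = value; // in place, no allocation
        list.length += 1;
        return;
    }
    // Shared or full: allocate a larger buffer and copy the old contents over.
    const old_len = list.length;
    const new_buf = try gpa.alloc(i64, @max(list.elements.len * 2, 8));
    @memcpy(new_buf[0..old_len], list.elements[0..old_len]);
    new_buf[old_len] = value;
    list.* = .{ .elements = new_buf, .length = old_len + 1, .refcount = 1 };
}
```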
Yeah, I wasn't saying I agree with that analysis... just that the general vibe I have is that we have a lot of optimisations we can work through that should speed us up a lot. Like, we don't have tail recursion yet; the plan is to support TRMC, etc.
What is the appropriate way to instrument the interpreter to add some stderr output during program execution? When I try to add some, the compiler is unhappy because it's not wasm-compatible.
If you do something like != .freestanding or whatever, so it doesn't try to build it in the playground, you should be fine.
Here's an example of how the tracing for refcounting is done:
```zig
// Guard: only trace when the flag is enabled and we're not building for a
// freestanding target (e.g. the wasm playground), where stderr doesn't exist.
if (comptime trace_refcount and builtin.os.tag != .freestanding) {
    const stderr_file: std.fs.File = .stderr();
    var buf: [256]u8 = undefined;
    // Format into a stack buffer; fall back to a fixed message if it doesn't fit.
    const msg = std.fmt.bufPrint(&buf, "[INTERP] upsertBinding from temp_binds ptr=0x{x}\n", .{
        @intFromPtr(binding.value.ptr),
    }) catch "[INTERP] upsertBinding\n";
    stderr_file.writeAll(msg) catch {};
}
```
But why is our interpreter so much slower than Python?
Yeah, we must have a crazy amount of low-hanging fruit. That said, we are also a tree-walking interpreter, which is generally the slowest form of interpreter... but yeah, low-hanging fruit too.
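For anyone following along, the difference looks roughly like this (toy code, not the actual interpreter): a tree-walker recurses through pointer-linked nodes, while a bytecode-style VM is a tight loop over a flat instruction array.

```zig
// Tree-walking: one recursive call and pointer dereferences per node.
const Expr = union(enum) {
    lit: i64,
    add: struct { lhs: *const Expr, rhs: *const Expr },
};

fn walk(e: *const Expr) i64 {
    return switch (e.*) {
        .lit => |v| v,
        .add => |a| walk(a.lhs) + walk(a.rhs),
    };
}

// Bytecode: a stack machine dispatching over a contiguous instruction slice.
const Instr = union(enum) { push: i64, add };

fn run(code: []const Instr, stack: []i64) i64 {
    var sp: usize = 0;
    for (code) |instr| {
        switch (instr) {
            .push => |v| {
                stack[sp] = v;
                sp += 1;
            },
            .add => {
                stack[sp - 2] += stack[sp - 1];
                sp -= 1;
            },
        }
    }
    return stack[sp - 1];
}
```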
Looks like a lot of random interpreter overhead in a few methods. Not something like cloning or etc. But also, I don't really know what these methods do
Screenshot 2025-12-14 at 4.59.43 PM.png
Definitely the first place to look would be applyContinuation which seems to have a lot of cost within itself.
Something super off is that like 95+% of the time is system time.
So this is almost not running roc code at all, just kernel calls. I guess that could be tons of allocations or copies or something.
Something super off is that like 95+% of the time is system time.
Yeah, my htop bar was like 90% red instead of being green
I think we may call UnifyWithConf 1,022 times per line processed. And I assume it is the same unification every line. So this may be cacheable, or something that should be calculated ahead of time before running the interpreter at all.
call UnifyWithConf 1,022 times per line processed
that sounds like a bug
Maybe it's something strange we are doing in our runtime evaluation or something related to polymorphism in the interpreter
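One shape the caching idea could take, purely as a hypothetical sketch (UnifyWithConf and its surrounding machinery may not allow this at all): remember which pairs of type variables have already been unified during this run, so identical unifications aren't redone for every line of input.

```zig
const std = @import("std");

// Hypothetical type-variable index; the real compiler's representation differs.
const TypeVar = u32;

const UnifyCache = struct {
    seen: std.AutoHashMap(u64, void),

    fn init(gpa: std.mem.Allocator) UnifyCache {
        return .{ .seen = std.AutoHashMap(u64, void).init(gpa) };
    }

    // Order-independent key for the pair of type variables.
    fn key(a: TypeVar, b: TypeVar) u64 {
        return (@as(u64, @min(a, b)) << 32) | @max(a, b);
    }

    // Returns true if this pair was already unified, so the caller can skip the work.
    fn alreadyUnified(self: *UnifyCache, a: TypeVar, b: TypeVar) !bool {
        const gop = try self.seen.getOrPut(key(a, b));
        return gop.found_existing;
    }
};
```

Since real unification mutates the type store, the better fix is probably the one suggested above: do the work once ahead of time rather than caching it per call.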
May be related I'm not sure -- but I'm currently investigating an issue with infinite recursion in store.Store.addTypeVar which has come up from cross-module opaque types
This is a lot of allocations for 30 seconds
Screenshot 2025-12-14 at 6.33.01 PM.png
This is time spent in the specific function itself. It does not include time spent in called functions/children.
Screenshot 2025-12-14 at 6.34.51 PM.png
likely better claude analysis
yeah those all sound good! :tada:
I also am working on a PR to revamp the tracy hooks. Adding tons more, making it work with executables from roc build. Some other improvements.
Now with finer grain traces:
Screenshot 2025-12-14 at 7.52.08 PM.png
Yeah, clearly the issue is allocations. Look at the time spent in the allocator alloc function:
Screenshot 2025-12-14 at 8.27.02 PM.png
This is in a 60s run
This is great news!
Haha. I think I just figured out a major part of the issue. Just ran in 5s for me
I'm gonna just merge the fix into my tracy pr cause it is allocator related and I fixed it while also adding some more tracy allocator things.
Top costs after fix:
Screenshot 2025-12-14 at 8.58.47 PM.png
I think with claude's help I see another huge win after this. Hopefully this can get us down to sub 1s here.
Crazy how much you can do with very carefully crafted prompts and guiding with claude. I don't fully know this code, but I know a lot about what would definitely be wrong and how to be very specific around what I want.
Yay! sub 1s!
Will put this in a different PR cause I am less sure it is valid/correct
The improvement is quite amazing for my day03.roc program! Thanks Brendan! Here are my timings:
This is quite an extraordinary speedup, almost 1000x. I don't know if you are interested in looking at another point. I made another benchmark that doesn't show such a dramatic improvement. For that benchmark, the TL;DR is:
For this benchmark, I made a couple of files, one roc file and one python file, plus a wrapper python script that runs both benches with input sizes varying from 50 to 6400, doubling each step. Finally, I plot the timings for all these sizes.
Here is old main:
Here is PR 8680:
And here is PR 8681:
If you are ok to have a look at these: here are the files I used for it.
Also, this is not explicitly perf-related and is maybe more of a bug for @Richard Feldman, but when I run the above script with sizes up to 12800, I get a crash with a memory leak detected.
Running benchmarks...
Sizes to test: [50, 100, 200, 400, 800, 1600, 3200, 6400, 12800]
============================================================
[1/9] Testing size 50... ✓ Roc: 0.0315s, Python: 0.0124s
[2/9] Testing size 100... ✓ Roc: 0.0366s, Python: 0.0127s
[3/9] Testing size 200... ✓ Roc: 0.0472s, Python: 0.0128s
[4/9] Testing size 400... ✓ Roc: 0.0740s, Python: 0.0130s
[5/9] Testing size 800... ✓ Roc: 0.1385s, Python: 0.0138s
[6/9] Testing size 1600... ✓ Roc: 0.3593s, Python: 0.0150s
[7/9] Testing size 3200... ✓ Roc: 1.1039s, Python: 0.0179s
[8/9] Testing size 6400... ✓ Roc: 3.6303s, Python: 0.0233s
[9/9] Testing size 12800... Error running benchmark with size 12800:
error(gpa): memory address 0x342710000 leaked:
Unable to print stack trace: Unable to open debug info: MissingDebugInfo
error: Memory leak detected!
I compiled #8681 with ReleaseFast and I don't see any significant speedup. I didn't measure it, but it still takes several minutes to run. I have an x64 machine, so maybe that's the reason.
M1 Max:
Run 1: 0.49s user 0.20s system 29% cpu 2.350 total
Run 2: 0.45s user 0.16s system 74% cpu 0.817 total
NixOS x86 (AMD 9950X):
Run 1: 4.32s user 13.01s system 97% cpu 17.816 total
Run 2: 4.07s user 12.94s system 99% cpu 17.073 total
Testing more than two runs made no difference on either machine. Used #8681, ran git clean -dfx before building, built with zig build roc -Doptimize=ReleaseFast. Roc source and input file from https://roc.zulipchat.com/#narrow/channel/358903-advent-of-code/topic/2025.20Day.203/near/563633329.
Found my own speedup (#8682), now the whole thing runs in under a second for me as well.
Using #8682 gave me slower M1 times and faster x86 times:
M1 Max:
Run 1: 2.49s user 0.16s system 51% cpu 5.184 total
Run 2: 2.45s user 0.12s system 96% cpu 2.670 total
NixOS x86 (AMD 9950X):
Run 1: 2.13s user 0.18s system 79% cpu 2.914 total
Run 2: 2.03s user 0.18s system 99% cpu 2.221 total
You have to rebase it on top of the new main, which now includes #8681, @Niclas Ahden.
The PR #8682 branch is based on #8680, not #8681, so that may be the slowdown you see on the M1 Max.
Ah, indeed, thanks! Rebased results:
M1 Max:
Run 1: 0.45s user 0.14s system 17% cpu 3.334 total
Run 2: 0.41s user 0.10s system 83% cpu 0.616 total
NixOS x86 (AMD 9950X):
Run 1: 0.46s user 0.15s system 58% cpu 1.028 total
Run 2: 0.34s user 0.14s system 96% cpu 0.495 total