Stream: advent of code

Topic: 2025 Day 3


view this post on Zulip Fabian Schmalzried (Dec 13 2025 at 15:56):

Here is my solution to part 1:

day 3 part1

It's quite slow (about 2 and a half minutes on my machine) and I had to overcome the following bugs I found: #8662 #8663 #8664

Also, is a break keyword planned for exiting loops early?

view this post on Zulip Matthieu Pizenberg (Dec 13 2025 at 16:49):

Nice, I’m working on mine too :)
PS: you can put a code block inside the spoiler block to have the content formatted correctly.

view this post on Zulip Matthieu Pizenberg (Dec 13 2025 at 21:36):

Here is my day 3 part 1 too :)

day 3 part 1

It took 6 min to solve and used up to roughly 8 GB of RAM. What's (not so) funny is watching the interpreter slow down little by little, even though in theory there is no reason for the slowdown: no reason why processing each line of the puzzle input should take longer and longer as time passes. So I suspect there is something in the interpreter VM that grows over time and has to be accessed for every instruction.
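
To make the hypothesis concrete: if, purely hypothetically, the interpreter resolved bindings by scanning a list that only ever grows, each lookup would cost O(bindings so far) and per-line time would climb exactly like this. A Python sketch of that model (not the interpreter's actual data structures):

import time

# Hypothetical model of the suspected behavior: bindings stored in an
# ever-growing list that is scanned linearly on every lookup, versus a
# dict with O(1) lookups.
def run(lines, lookups_per_line, linear_scan):
    bindings = []   # grows for the whole run
    index = {}
    start = time.perf_counter()
    for line in range(lines):
        name = f"v{line}"
        bindings.append((name, line))
        index[name] = line
        for _ in range(lookups_per_line):
            if linear_scan:
                _ = next(v for n, v in bindings if n == name)  # O(len(bindings))
            else:
                _ = index[name]                                # O(1)
    return time.perf_counter() - start

print("linear scan:", run(500, 20, True))   # each line slower than the last
print("dict lookup:", run(500, 20, False))  # stays flat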

view this post on Zulip Matthieu Pizenberg (Dec 13 2025 at 21:38):

Here is a video recording of the solution being computed. I cut the video in the middle (where nothing happens) and made it 8x faster than realtime so as not to bore you. But you can still clearly see how much slower the last few lines of the puzzle input are processed compared to the first few.

roc-day3-part1-8x-short.mp4

view this post on Zulip Matthieu Pizenberg (Dec 13 2025 at 21:41):

Here is my puzzle input in case you want to try to reproduce it exactly.

03.txt

view this post on Zulip Richard Feldman (Dec 13 2025 at 23:17):

ok I'm pretty convinced it's worth going back to a dev backend :sweat_smile:

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 09:30):

what’s a dev backend?

view this post on Zulip Fabian Schmalzried (Dec 14 2025 at 09:36):

Backend in the sense of compiler backend: the part of the compiler that generates the actual machine code. We will use LLVM for this in release builds because it has a lot of optimizations and generates fast code. But it takes forever to generate that code, so the old Rust compiler had a dev backend that quickly generated less-optimized code for use during development. The idea was that an interpreter might be better than a dev backend, but as you showed, it's slow.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 09:42):

Isn't there an alternative to LLVM specialized for this use case? One focused only on the optimizations that are the most worthwhile and the least time-consuming, to get to an executable as fast as possible?

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 09:52):

A quick search suggests using Cranelift for this, as it's apparently quite fast and supports multiple architectures. Is that what you had in mind?

view this post on Zulip Richard Feldman (Dec 14 2025 at 14:28):

Cranelift is in between LLVM and what we already did for the Rust compiler (which can be ported over to Zig), which is to go straight from our monomorphized IR to machine code.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:05):

Richard Feldman said:

ok I'm pretty convinced it's worth going back to a dev backend :sweat_smile:

Hmmm... Why? For perf?

Plenty of languages have clearly managed to build relatively performant interpreters. I'm sure Roc could too.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:06):

We currently have a young, likely quite slow interpreter. Currently it is a tree-walking interpreter that is very unoptimized in how it handles refcounts (and thus allocations).
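
For readers unfamiliar with the term, a tree-walking interpreter evaluates the program by recursing directly over the AST, paying a type-dispatch cost at every node on every evaluation. A minimal illustrative sketch in Python (not Roc's actual interpreter):

from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Add:
    left: object
    right: object

def eval_node(node):
    # One type dispatch per AST node, repeated every time the node runs.
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Add):
        return eval_node(node.left) + eval_node(node.right)
    raise TypeError(f"unknown node: {node!r}")

print(eval_node(Add(Num(1), Add(Num(2), Num(3)))))  # 6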

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:09):

Though I guess if you care maximally about the dev-mode experience, it would probably be an interpreter with a JIT, for fast startup and fast peak perf.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:10):

I guess I just feel that there are likely well-trodden paths to a much more optimized interpreter.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:12):

A good metric would be to port the code to equivalent Python or PyPy or JS or Lua etc., see where perf can be, and compare that to a debug build of some compiled language with refcounting (you might have to manually add the refcounting to some languages).

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:35):

Yeah, ported to Python or Node, this runs in about 60 ms. So to me this clearly says our interpreter is slow, not that we need a dev backend.

That said, part of building a better interpreter may overlap with dev-backend-like constructions (e.g. compiling to a monomorphized linear bitcode).

view this post on Zulip Richard Feldman (Dec 14 2025 at 17:42):

yeah like - previously we had a dev backend with zero perf problems for either small or big examples

view this post on Zulip Richard Feldman (Dec 14 2025 at 17:42):

we went with an interpreter this time partly for constant folding but also because we didn't have any other backend

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:43):

A big reason for the interpreter is fast startup time without dealing with the surgical linker.

view this post on Zulip Richard Feldman (Dec 14 2025 at 17:44):

but it's like - yeah we could try to optimize this interpreter and JIT and all that...but wouldn't it be simpler to go back to the design that had neither slow startup problems in practice nor performance problems once it got running?

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:44):

slow startup problems in practice

It will have them without a surgical linker on all platforms.

view this post on Zulip Richard Feldman (Dec 14 2025 at 17:45):

we don't need to do surgical linking anymore though - the current "embed the interpreter shim" design can just as easily do "embed an even thinner shim that just executes the bytes we give it"

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:46):

Hmm. Launch the same way a JIT would launch a compiled function.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:46):

That is a nice simple idea.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:47):

Ok yeah, then it's the same cost: one linked unit that is more costly, but it's a one-off.

view this post on Zulip Richard Feldman (Dec 14 2025 at 17:47):

yeah so in that world, hosts are still pre-linked like they are today

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 17:49):

You may still want an interpreter on the lowest-level bitcode, the one that would get converted into dev-backend assembly (and it might be best to start by working towards that).

An interpreter at that level would mean you immediately support all platforms (x86, arm, risc5, wasm, etc). Then if you want more perf, you could map nearly 1:1 from the low-level interpreter to assembly.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:00):

As an aside, I had Claude hack together some really awful C++ code that tries to model what dev-backend Roc would actually run here. It takes ~150ms to run.

view this post on Zulip Richard Feldman (Dec 14 2025 at 18:04):

since we already have something that works (although too slowly) I was thinking we'd do the same progression as last time:

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:14):

I would advise not just monomorphizing, but also flattening the IR. I think that will make it easier to generate both LLVM IR and dev-backend assembly. The flattened IR might also be a better setup for refcounting and some optimizations.
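
For illustration, "flattening" here means turning the nested expression tree into a flat instruction list that a simple loop steps through. A hypothetical Python sketch (instruction names invented):

# The expression (1 + 2) * 3 as a flat stack-machine program.
code = [
    ("push", 1),
    ("push", 2),
    ("add", None),
    ("push", 3),
    ("mul", None),
]

def run(code):
    stack = []
    for op, arg in code:   # one flat loop, no recursion per node
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

print(run(code))  # 9

Each flat instruction also maps almost directly onto a few LLVM IR instructions or raw machine-code bytes, which is why flattening helps both targets.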

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:15):

And yeah, llvm first sounds totally reasonable

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 18:23):

which is to go straight from our monomorphized IR to machine code

Isn't that a lot of maintenance, to have something running on a few different architectures? I mean, I could imagine x86, ARM, and RISC-V sharing non-negligible shares in a few years. I'm curious: with something like Cranelift (Roc IR -> Cranelift IR -> binary), plus regular linking done with good practices (reduced object count, static linking, ...) and a fast linker like mold, how far is that from the targets you have for Roc?

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:25):

"Sharing non-negligible shares in a few years"

What do you mean by this?

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 18:25):

I mean that a few years down the line, we might see market shares like 50% / 30% / 20% or similar, with no dominant architecture.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:27):

I think there is some risk with the long tail of odd arm/risc5 variations for embedded, but having a dev backend for x86, arm, and risc5 shouldn't be particularly hard.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:29):

The dev backend in the Rust compiler already supports x86 and arm. Not totally filled out, but definitely the core. And we had a separate dev backend for wasm. So many of these things already worked.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:30):

The key point is that these are dev backends and have essentially no optimizations. They really just dump a handful of raw machine-code bytes for any given Roc IR node.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:30):

They do have a few minor optimizations, but those are all shared between all backends.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 18:32):

I suppose there are still things like TCO, in-place memory mutations, and other important optimizations like that happening for all backends, right?

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:32):

yes
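
For context, TCO (tail-call optimization) rewrites a self-call in tail position into a jump, so the recursion runs as a loop with constant stack. A rough Python illustration of the transformation (hypothetical example, not Roc code):

def sum_to_recursive(n, acc=0):
    if n == 0:
        return acc
    return sum_to_recursive(n - 1, acc + n)  # tail call

def sum_to_loop(n, acc=0):
    # What TCO effectively compiles the function above into.
    while n != 0:
        n, acc = n - 1, acc + n
    return acc

print(sum_to_loop(1_000_000))  # fine: constant stack
# sum_to_recursive(1_000_000) would overflow Python's call stack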

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:35):

The way I see Cranelift is that it's not particularly compelling for dev builds, because it's not particularly fast at compilation. It's great as a lightly-optimizing alternative to LLVM, but when you really just want to blit out machine code fast enough to compete with interpreter startup time, it's really slow.

dev build -> fastest compilation time, competes with optimized interpreters on execution speed
cranelift -> ok compilation time, rather solid baseline optimization for execution speed
llvm -> slow to exceptionally slow compilation time, any execution speed level you want to target

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:38):

For linkers: mold is mostly faster for larger link jobs, so for small apps and dev builds it is not a huge help. Even really basic linking tends to really hurt the experience for tiny apps. You feel it if every run of the compiler has an extra 500ms of linking before execution.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:39):

In my mind, for dev builds we compete with a Python-like experience. In release builds, we compete with a Go-like experience.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 18:39):

I mean, in the end as a user, I don't care if my program takes 50ms or 200ms. It's technically cool to have superfast responses, but user experience mostly feels instantaneous below 0.2s.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:40):

Yeah, but your bar was rightfully somewhere around 200ms. Linking the full set of dependencies that a platform pulls in will often push you to 500-1000ms right from the start.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 18:41):

I used to do a lot of image and data processing, using Rust, and in that context, not compiling in --release mode was not even an option, because my programs would be something like 100x slower in debug builds. That's why I'm always suspicious of the actual usefulness of unoptimized compilation.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:41):

That's totally fair. There are many cases where you always need at least the equivalent of -O.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:42):

One nice thing with Roc is that the platform can always be compiled with -O3 equivalents. So you get a very Python-style tradeoff in dev builds: if enough of the heavy lifting is in the platform, then Roc perf doesn't really matter, it is just glue.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:44):

For cases like the one you mentioned above, our general hope is that our release compile times can compete with Go, D, and other compilers considered fast. It's a different target for total compilation time, but we're trying to keep things fast and smooth for you.

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 18:45):

That may be hard when comparing to compilers that do not use LLVM, but as long as we break down our compilation units to enable more incremental compilation and caching, I'm sure we can manage a lot.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 19:18):

In your opinion, how fast could the Roc interpreter be without spending way too much effort on optimizing it? For reference, I converted my solution to Python, doing the same thing, and the Python version took 30ms (Roc took 6 min).

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 19:22):

But obviously there is the question of whether it's worth it if a dev backend is the goal. I suppose you still need something to compute all the constants, right?

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 19:37):

I think it should definitely be able to reach a speed similar to Python's.

view this post on Zulip Richard Feldman (Dec 14 2025 at 22:08):

yeah, but the dev backend approach should be much faster than Python :smile:

view this post on Zulip Brendan Hansknecht (Dec 14 2025 at 22:57):

For sure. I tried to port this to the old compiler's dev backend (but it segfaults). So instead: old compiler with a dev LLVM build, which should be pretty similar. Python runs in the range of 50-60ms; the dev-backend executable runs in the range of 10-20ms.

That said, the old compiler takes about 60ms for the frontend to process all of the basic-cli Roc files and builtins, so that would make us slower than Python in total time for this app. Roughly in the same ballpark, though.

view this post on Zulip Luke Boswell (Dec 14 2025 at 23:01):

But why is our interpreter so much slower than Python? I feel like there is probably some very low-hanging fruit we can go after that will speed up our current interpreter heaps.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 23:14):

I tried asking Claude to give me a review of the overall architecture of the compiler, which was a very interesting read for me. Then, with that context in mind, I asked it to assess the potential issues making this program slow. Here is what it gave me.

INTERPRETER_PERFORMANCE_ANALYSIS.md

I wouldn't take anything in it for granted, but with your knowledge of the compiler, you can probably easily tell what does and doesn't make sense in its report.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 23:21):

When writing the Roc code, I always assumed that some of the things Richard has presented are in place. But if some of them, like in-place replacement of lists when the refcount is 1, are not in the interpreter, the way I wrote that code could be very costly. Considering the program took 8 GB of RAM, there is also most likely an issue with intermediate values not being deallocated; there may be degradation of the stack or other parts of the interpreter keeping references to things.
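
To spell out the optimization being referred to: when a list value has exactly one reference, an append can mutate it in place instead of copying, which turns a fold-with-append from O(n^2) copying into amortized O(n). A hypothetical Python model of the idea (not the interpreter's code):

def append(lst, refcount, item):
    # Unique owner: safe to reuse the allocation and mutate in place.
    if refcount == 1:
        lst.append(item)
        return lst
    # Shared: must copy so other holders see the old value unchanged.
    copy = list(lst)
    copy.append(item)
    return copy

acc = []
for i in range(5):
    acc = append(acc, 1, i)  # refcount stays 1: no copies at all
print(acc)  # [0, 1, 2, 3, 4]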

view this post on Zulip Luke Boswell (Dec 14 2025 at 23:22):

Yeah, these are the kinds of low-hanging fruit I'm thinking about...

view this post on Zulip Richard Feldman (Dec 14 2025 at 23:27):

the analysis seems pretty bad to me haha

view this post on Zulip Richard Feldman (Dec 14 2025 at 23:27):

as in, seems like a lot of naive guessing

view this post on Zulip Richard Feldman (Dec 14 2025 at 23:28):

could be that in-place optimizations are not working, but also could be that the analysis isn't aware they exist :smile:

view this post on Zulip Richard Feldman (Dec 14 2025 at 23:28):

that one's certainly worth looking into though! I forget if in-place is in list.zig or outside it

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 23:37):

The pushInPlace function in list.zig, for example, is not used anywhere in src/ other than in list.zig's own tests. So I suppose the interpreter never does the push-in-place optimization, for example.

view this post on Zulip Luke Boswell (Dec 14 2025 at 23:39):

Yeah, I wasn't saying I agree with that analysis... just that the general vibe I have is that we have a lot of optimisations we can work through that should speed us up a lot. Like, we don't have tail recursion yet; the plan is to support TRMC etc.

view this post on Zulip Matthieu Pizenberg (Dec 14 2025 at 23:45):

What is the appropriate way to instrument the interpreter, to add some stderr output during program execution? When I try to add some, the compiler is unhappy because it's not wasm-compatible.

view this post on Zulip Luke Boswell (Dec 14 2025 at 23:47):

If you guard it with something like != .freestanding or whatever, so it doesn't try to build it in the playground, you should be fine.

view this post on Zulip Luke Boswell (Dec 14 2025 at 23:48):

Here's an example of how the tracing for refcounting is done

if (comptime trace_refcount and builtin.os.tag != .freestanding) {
    // Compiled out on freestanding targets (e.g. the wasm playground),
    // which have no stderr to write to.
    const stderr_file: std.fs.File = .stderr();
    var buf: [256]u8 = undefined;
    const msg = std.fmt.bufPrint(&buf, "[INTERP] upsertBinding from temp_binds ptr=0x{x}\n", .{
        @intFromPtr(binding.value.ptr),
    }) catch "[INTERP] upsertBinding\n";
    stderr_file.writeAll(msg) catch {};
}

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 00:58):

But why is our interpreter so much slower than Python?

Yeah, we must have a crazy amount of low-hanging fruit. That said, we are also a tree-walking interpreter, which is generally the slowest form of interpreter... but yeah, low-hanging fruit too.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 01:00):

Looks like a lot of random interpreter overhead in a few methods, not something like cloning etc. But also, I don't really know what these methods do.

Screenshot 2025-12-14 at 4.59.43 PM.png

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 01:06):

Definitely the first place to look would be applyContinuation, which seems to carry a lot of cost within itself.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 01:07):

Something super off is that like 95+% of the time is system time.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 01:07):

So this is almost not running Roc code at all, just kernel calls. I guess that could be tons of allocations or copies or something.

view this post on Zulip Matthieu Pizenberg (Dec 15 2025 at 01:25):

Something super off is that like 95+% of the time is system time.

Yeah, my htop bar was like 90% red instead of green.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 01:37):

I think we may be calling UnifyWithConf 1,022 times per line processed. And I assume it is the same unification every line. So this may be cacheable, or something that should be calculated ahead of time, before running the interpreter at all.
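
If the same unification really does repeat every line, the shape of the fix could be as simple as memoizing on the query. A hypothetical Python sketch ("unify" stands in for whatever UnifyWithConf actually computes):

from functools import lru_cache

@lru_cache(maxsize=None)
def unify(left_type, right_type):
    # ... imagine expensive structural unification here ...
    return left_type == right_type

for _line in range(10_000):
    unify("List(U8)", "List(U8)")  # computed once, then a cache hit
print(unify.cache_info())          # hits=9999, misses=1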

view this post on Zulip Luke Boswell (Dec 15 2025 at 02:04):

call UnifyWithConf 1,022 times per line processed

that sounds like a bug

view this post on Zulip Luke Boswell (Dec 15 2025 at 02:12):

Maybe it's something strange we are doing in our runtime evaluation or something related to polymorphism in the interpreter

view this post on Zulip Luke Boswell (Dec 15 2025 at 02:19):

It may be related, I'm not sure -- but I'm currently investigating an infinite-recursion issue in store.Store.addTypeVar that has come up from cross-module opaque types.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 02:33):

This is a lot of allocations for 30 seconds

Screenshot 2025-12-14 at 6.33.01 PM.png

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 02:35):

This is the time spent in each specific function itself. It does not include time spent in called/child functions.

Screenshot 2025-12-14 at 6.34.51 PM.png

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 02:54):

likely better claude analysis

view this post on Zulip Richard Feldman (Dec 15 2025 at 03:33):

yeah those all sound good! :tada:

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 03:35):

I'm also working on a PR to revamp the Tracy hooks: adding tons more, making them work with executables from roc build, and some other improvements.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 03:53):

Now with finer-grained traces:
Screenshot 2025-12-14 at 7.52.08 PM.png

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 04:28):

Yeah, clearly the issue is allocations. Look at the time spent in the allocator alloc function:

Screenshot 2025-12-14 at 8.27.02 PM.png

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 04:28):

This is in a 60s run

view this post on Zulip Luke Boswell (Dec 15 2025 at 04:53):

This is great news!

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 04:53):

Haha. I think I just figured out a major part of the issue. It just ran in 5s for me.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 04:54):

I'm gonna just merge the fix into my Tracy PR, cause it's allocator-related and I fixed it while also adding some more Tracy allocator things.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 04:59):

Top costs after fix:

Screenshot 2025-12-14 at 8.58.47 PM.png

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 06:19):

I think with Claude's help I see another huge win after this. Hopefully this can get us down to sub-1s here.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 06:19):

Crazy how much you can do with very carefully crafted prompts and guidance with Claude. I don't fully know this code, but I know a lot about what would definitely be wrong, and how to be very specific about what I want.

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 06:36):

Yay! sub 1s!

view this post on Zulip Brendan Hansknecht (Dec 15 2025 at 06:36):

I'll put this in a different PR cause I am less sure it is valid/correct.

view this post on Zulip Matthieu Pizenberg (Dec 15 2025 at 10:51):

The improvement is quite amazing for my day03.roc program! Thanks Brendan! Here are my timings:

view this post on Zulip Matthieu Pizenberg (Dec 15 2025 at 11:19):

This is quite an extraordinary speedup, almost 1000x. I don't know if you are interested in looking at another data point, but I made another benchmark that doesn't show such a dramatic improvement. The TL;DR is in the plots below.

For this benchmark, I made a couple of files, one Roc file and one Python file, plus a wrapper Python script that runs both benches with input sizes varying from 50 to 6400, in 2x steps. Finally, I plot the timings for all these sizes.
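
Roughly, the wrapper does something like the following sketch (the actual run_benchmark.py is attached further down; the exact commands and flags here are assumptions):

import subprocess, time

SIZES = [50 * 2**i for i in range(8)]  # 50 .. 6400, doubling each step

def time_cmd(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

for size in SIZES:
    roc_s = time_cmd(["roc", "bench_fold_append.roc", str(size)])
    py_s = time_cmd(["python3", "bench_fold_append.py", str(size)])
    print(f"size {size:>5}: roc {roc_s:.4f}s  python {py_s:.4f}s")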

Here is old main:

benchmark_results_e8b2c4d.png

Here is PR 8680:

benchmark_results_8680.png

And here is PR 8681:

benchmark_results_8681.png

view this post on Zulip Matthieu Pizenberg (Dec 15 2025 at 11:21):

If you're OK with having a look at these, here are the files I used for it.

run_benchmark.py

bench_fold_append.roc

bench_fold_append.py

view this post on Zulip Matthieu Pizenberg (Dec 15 2025 at 11:29):

Also, this is not explicitly perf-related, and maybe more of a bug for @Richard Feldman, but when I run the above script with sizes up to 12800, I get a crash with a memory leak detected.

Running benchmarks...
Sizes to test: [50, 100, 200, 400, 800, 1600, 3200, 6400, 12800]
============================================================
[1/9] Testing size     50... ✓ Roc: 0.0315s, Python: 0.0124s
[2/9] Testing size    100... ✓ Roc: 0.0366s, Python: 0.0127s
[3/9] Testing size    200... ✓ Roc: 0.0472s, Python: 0.0128s
[4/9] Testing size    400... ✓ Roc: 0.0740s, Python: 0.0130s
[5/9] Testing size    800... ✓ Roc: 0.1385s, Python: 0.0138s
[6/9] Testing size   1600... ✓ Roc: 0.3593s, Python: 0.0150s
[7/9] Testing size   3200... ✓ Roc: 1.1039s, Python: 0.0179s
[8/9] Testing size   6400... ✓ Roc: 3.6303s, Python: 0.0233s
[9/9] Testing size  12800... Error running benchmark with size 12800:
error(gpa): memory address 0x342710000 leaked:
Unable to print stack trace: Unable to open debug info: MissingDebugInfo

error: Memory leak detected!

view this post on Zulip Fabian Schmalzried (Dec 15 2025 at 11:53):

I compiled #8681 with ReleaseFast and I don't see any significant speedup. I did not measure it, but it still takes several minutes to run. I have an x64 machine, so maybe that's a reason.

view this post on Zulip Niclas Ahden (Dec 15 2025 at 12:40):

M1 Max:
Run 1: 0.49s user 0.20s system 29% cpu 2.350 total
Run 2: 0.45s user 0.16s system 74% cpu 0.817 total

NixOS x86 (AMD 9950X):
Run 1: 4.32s user 13.01s system 97% cpu 17.816 total
Run 2: 4.07s user 12.94s system 99% cpu 17.073 total

Testing more than two runs made no difference on either machine. Used #8681, ran git clean -dfx before building, built with zig build roc -Doptimize=ReleaseFast. Roc source and input file from https://roc.zulipchat.com/#narrow/channel/358903-advent-of-code/topic/2025.20Day.203/near/563633329.

view this post on Zulip Fabian Schmalzried (Dec 15 2025 at 13:00):

Found my own speedup (#8682), now the whole thing runs in under a second for me as well.

view this post on Zulip Niclas Ahden (Dec 15 2025 at 13:27):

Using #8682 gave me slower M1 times and faster x86 times:

M1 Max:
Run 1: 2.49s user 0.16s system 51% cpu 5.184 total
Run 2: 2.45s user 0.12s system 96% cpu 2.670 total

NixOS x86 (AMD 9950X):
Run 1: 2.13s user 0.18s system 79% cpu 2.914 total
Run 2: 2.03s user 0.18s system 99% cpu 2.221 total

view this post on Zulip Matthieu Pizenberg (Dec 15 2025 at 13:31):

You have to rebase it on top of the new main, which now includes #8681, @Niclas Ahden.
The #8682 branch is based on #8680, not #8681, so that may be the slowdown you see on the M1 Max.

view this post on Zulip Niclas Ahden (Dec 15 2025 at 13:36):

Ah, indeed, thanks! Rebased results:

M1 Max:
Run 1: 0.45s user 0.14s system 17% cpu 3.334 total
Run 2: 0.41s user 0.10s system 83% cpu 0.616 total

NixOS x86 (AMD 9950X):
Run 1: 0.46s user 0.15s system 58% cpu 1.028 total
Run 2: 0.34s user 0.14s system 96% cpu 0.495 total


Last updated: Dec 21 2025 at 12:15 UTC