Here is my solution to part 1:
day 3 part 1
It's quite slow (about two and a half minutes on my machine), and I had to overcome the following bugs I found: #8662 #8663 #8664
Also, is a break keyword planned for exiting loops early?
Nice, I’m working on mine too :)
PS: you can put a code block inside the spoiler block to have the content formatted correctly
Here is my day 3 part 1 too :)
day 3 part 1
It took 6 minutes to solve and used up to roughly 8 GB of RAM. What's (not so) funny is watching the interpreter slow down little by little, even though in theory there is no reason for the slowdown: no reason why processing each line of the puzzle input should take longer and longer as time passes. So I suppose there is something in the interpreter VM that grows linearly over time and needs to be accessed for every instruction.
Here is a video recording of the solution being computed. I cut the video in the middle (where nothing happens) and sped it up to 8x realtime so as not to bore you. But you can still clearly see how much slower the last few lines of the puzzle input are to process compared to the first few.
Here is my puzzle input in case you want to try to reproduce it exactly.
ok I'm pretty convinced it's worth going back to a dev backend :sweat_smile:
what’s a dev backend?
Backend in the sense of compiler backend: the part of the compiler that generates the actual machine code. We will use LLVM for this in release builds because it has a lot of optimizations and generates fast code. But it takes forever to generate that code, so the old Rust compiler has a dev backend that quickly generates less optimized code to run during development. The idea was that an interpreter might be better than a dev backend, but as you showed, it's slow.
Isn't there an alternative to LLVM specialized for this use case, focused only on the optimizations that are the most worth it and the least time-consuming, to get to an executable as fast as possible?
Quick search seems to suggest using cranelift for this, as it’s apparently quite fast and supports multiple architectures. Is that what you had in mind?
Cranelift is in between LLVM and what we already did for the Rust compiler (which can be ported over to Zig), which is to go straight from our monomorphized IR to machine code
Richard Feldman said:
ok I'm pretty convinced it's worth going back to a dev backend :sweat_smile:
Hmmm... Why? For perf?
Pretty clearly, languages have made relatively performant interpreters. I'm sure roc could do this.
We currently have a young, likely quite slow interpreter. I mean, currently it is a tree-walking interpreter that is very unoptimized in how it handles refcounts (and thus allocations).
Though I guess if you care maximally about the dev mode experience, it would probably be an interpreter with a JIT, for fast startup and fast peak perf.
I guess I just feel that there are likely large, well-trodden paths to making a way more optimized interpreter.
A good metric would be to port the code to equivalent python or pypy or js or lua etc. See where perf can be and how that compares to a debug build of some compiled language with refcounting (might have to manually add in the refcounting to some language).
Yeah, ported to python or node, this runs in like 60 ms. So I see this as clearly our interpreter is slow, not that we need a dev backend.
That said, part of building a better interpreter may overlap with dev backend like constructions (e.g. compiling to a monomorphized linear bitcode).
yeah like - previously we had a dev backend with zero perf problems for either small or big examples
we went with an interpreter this time partly for constant folding but also bc we didn't have any other backend
A big reason for the interpreter is fast startup time without dealing with the surgical linker.
but it's like - yeah we could try to optimize this interpreter and JIT and all that...but wouldn't it be simpler to go back to the design that had neither slow startup problems in practice nor performance problems once it got running?
slow startup problems in practice
It will have slow startup problems without a surgical linker on all platforms.
we don't need to do surgical linking anymore though - the current "embed the interpreter shim" design can just as easily do "embed an even thinner shim that just executes the bytes we give it"
Hmm. Launch the same way a jit would launch a compiled function
That is a nice simple idea.
Ok yeah, then same cost of one linked unit that is more costly, but it is one off.
yeah so in that world, hosts are still pre-linked like they are today
You may still want an interpreter over the lowest-level bitcode that would get converted into dev backend assembly (and it might be best to start by working towards that).
The interpreter at that level would make it so you immediately support all platforms (x86, arm, risc5, wasm, etc). Then if you want more perf, you could nearly 1:1 map from the low-level interpreter to assembly.
As an aside, I had Claude hack together some really awful C++ code that tries to model what debug-backend Roc would actually run here. It takes ~150ms to run.
since we already have something that works (although too slowly), I was thinking we'd do the same progression as last time:
--optimize working for real
I would advise not just monomorphizing, but also flattening the IR. I think that will make it easier to generate both LLVM IR and dev backend assembly. The flattened IR might also be a better setup for refcounting and some optimizations.
And yeah, llvm first sounds totally reasonable
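To make the flattening idea concrete, here is a minimal sketch (hypothetical types, not the real compiler's IR): all nodes live in one contiguous array and refer to their children by index instead of by pointer, so a backend can lower them in a single forward pass.

```zig
// Hypothetical flattened IR, for illustration only (not the real compiler types).
const NodeIdx = u32;

const Node = union(enum) {
    int_literal: i64,
    add: struct { lhs: NodeIdx, rhs: NodeIdx },
    call: struct { target: u32, first_arg: NodeIdx, arg_count: u16 },
};

// (1 + 2) + 3 flattened so that children always precede their parents:
const example = [_]Node{
    .{ .int_literal = 1 },               // 0
    .{ .int_literal = 2 },               // 1
    .{ .add = .{ .lhs = 0, .rhs = 1 } }, // 2
    .{ .int_literal = 3 },               // 3
    .{ .add = .{ .lhs = 2, .rhs = 3 } }, // 4
};
```

Because the array is already in dependency order, the idea is that an LLVM lowering, a dev backend, and refcount insertion could each be a linear pass over it.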
which is to go straight from our monomorphized IR to machine code
Isn't that a lot of maintenance, having something running on a few different architectures? I mean, I could imagine x86, arm, and RiscV sharing non-negligible shares in a few years. I'm curious: with something like cranelift (Roc IR -> Cranelift IR -> binary) plus regular linking done with good practices (reduce object count, static linking, ...) and a fast linker like mold, how far is this from the targets you have for Roc?
"Sharing non-negligible shares in a few years"
What do you mean by this?
I mean, if a few years down the line we see market shares like 50% / 30% / 20% or similar, with no dominant one.
I think there is some risk with the long tail of odd variations of arm/risc5 for embedded, but having a dev backend for x86, arm, and risc5 shouldn't be particularly hard.
The dev backend in the rust compiler supports x86 and arm already. Not totally filled out, but definitely the core. And we had a separate dev backend for wasm. So many of these things already worked.
The core part is that these are dev backends and have essentially no optimizations. They really just dump a handful of raw machine code bytes for any given Roc IR node.
They do have a few minor optimizations, but those are all shared between all backends.
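As a rough caricature of "dump a handful of raw machine code bytes for any given IR node" (toy ops and x86-64 encodings chosen for illustration; this is not the real backend's code):

```zig
const std = @import("std");

// Toy ops standing in for lowered IR nodes (names made up for this sketch).
const Op = union(enum) {
    load_const: u64, // rax = imm
    add_arg, // rax += rdi
    ret,
};

// Each op is a fixed byte template copied straight into the output buffer
// (x86-64 encodings). Returns how many bytes were written.
fn emit(out: []u8, op: Op) usize {
    switch (op) {
        .load_const => |imm| {
            out[0] = 0x48; // REX.W prefix
            out[1] = 0xB8; // mov rax, imm64
            std.mem.writeInt(u64, out[2..10], imm, .little);
            return 10;
        },
        .add_arg => {
            @memcpy(out[0..3], &[_]u8{ 0x48, 0x01, 0xF8 }); // add rax, rdi
            return 3;
        },
        .ret => {
            out[0] = 0xC3; // ret
            return 1;
        },
    }
}
```

There is no register allocation or instruction selection to speak of, which is what keeps compile times low enough to compete with interpreter startup.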
I suppose there are still things like TCO, in-place memory mutations, and other important optimizations like that happening for all backends, right?
yes
The way I see cranelift is that it is not particularly compelling for dev builds, because it is not particularly fast at compilation. It is great as a slightly-less-optimizing alternative to LLVM, but when you really just want to blit out machine code and have compilation fast enough to compete with interpreter startup time, it is really slow.
dev build -> fastest compilation time, compete with optimized interpreters for execution speed
cranelift -> ok compilation time, rather solid base optimization for execution speed
llvm -> slow to exceptionally slow compilation time, any execution speed level you want to target
For linkers, mold is mostly faster for larger link jobs. So for small apps and dev builds it is not a huge help. Even really basic linking tends to really hurt experience for tiny apps. You feel it if every run of the compiler has an extra 500ms to link before execution.
In my mind, for dev builds we compete with a python like experience. In release builds, we compete with a go like experience.
I mean, in the end, as a user I don't care if my program takes 50ms or 200ms. It's technically cool to have super-fast responses, but yeah, user experience mostly feels instantaneous below 0.2s.
Yeah, but your bar was rightfully somewhere around 200ms. Linking the full dependencies that a platform takes will often push you to 500ms to 1000ms right from the starting point.
I used to do a lot of image processing and data processing, using Rust. In that context, not compiling in --release mode was not even an option, because programs would be something like 100x slower in debug builds. That's why I'm always suspicious of the actual usefulness of unoptimized compilation.
That's totally fair. There are many cases where you always need at least the equivalent of -O.
One nice thing with roc is that the platform can always be compiled with -O3 equivalents. So you get a very python-style tradeoff in dev builds: if enough of the heavy lifting is in the platform, then roc perf doesn't really matter, it is just glue.
For cases like you mentioned above, in general our hope is that our release compile times can compete with go/D/other compilers considered fast. Different target for total compilation time, but trying to keep things fast and smooth for you.
That may be hard when comparing to compilers that do not use LLVM, but as long as we break down our compilation unit to enable more incremental compilation and caching, I'm sure we can manage a lot.
In your opinion, how fast could the roc interpreter be, without spending way too much on optimizing it? For reference, I converted my solution to python, doing the same thing, and the python solution took 30ms. (Roc took 6 min).
But obviously there is the question of whether it's worth it if a dev backend is the goal. I suppose you still need something to compute all the constants, right?
I think it definitely should be able to reach a similar speed to python.
yeah, but the dev backend approach should be much faster than Python :smile:
For sure. I tried to port this to the old compiler's dev backend (but it segfaulted). So instead: old compiler with a dev LLVM build, which should be pretty similar. Python takes in the range of 50-60ms. The dev backend executable takes in the range of 10-20ms.
That said, the old compiler takes like 60ms for the frontend to process all of the basic-cli roc files and builtins, so in total time that would make us slower than python for this app. Roughly in the same ballpark though.
But why is our interpreter so much slower than Python? I feel like there is probably some very low-hanging fruit we can go after that will speed our current interpreter up heaps
I tried asking Claude to give me a review of the overall architecture of the compiler, which was a very interesting read for me. Then, with that context in mind, I asked it to assess what potential issues might be making this program slow. Here is what it gave me.
INTERPRETER_PERFORMANCE_ANALYSIS.md
I wouldn't take anything for granted, but with your knowledge of the compiler, you can probably easily tell what makes sense and what doesn't in its report.
When writing the roc code, I always assumed that some of the things Richard presented are in place. But if some things, like in-place replacement of lists when the RC is 1, are not in the interpreter, that could make the way I wrote it very costly. Considering the program took 8 GB of RAM, there is also most likely an issue with intermediate values not being deallocated; there might be degradation of the stack or other parts of the interpreter that keep references to things.
Yeah these are the kinds of low hanging fruit I'm thinking about...
the analysis seems pretty bad to me haha
as in, seems like a lot of naive guessing
could be that in-place optimizations are not working, but also could be that the analysis isn't aware they exist :smile:
that one's certainly worth looking into though! I forget if in-place is in list.zig or outside it
The pushInPlace function in list.zig, for example, is not used anywhere in src/ other than in list.zig's own tests. So I suppose the interpreter never does the push-in-place optimization.
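(For context, the in-place optimization being discussed follows roughly this pattern; the types and function below are a simplified sketch, not the actual list.zig implementation:)

```zig
const std = @import("std");

// Simplified stand-in for a refcounted list (not the real RocList layout).
const List = struct {
    elements: []i64, // allocated buffer; elements.len is the capacity
    length: usize, // number of elements currently in use
    refcount: usize,
};

// If we hold the only reference and there is spare capacity, mutate the
// existing buffer in place instead of allocating and copying a new one.
// (The sketch skips freeing/decrementing the old buffer in the copy path.)
fn push(gpa: std.mem.Allocator, list: *List, value: i64) !void {
    if (list.refcount == 1 and list.length < list.elements.len) {
        list.elements[list.length] = value; // in place, no allocation
        list.length += 1;
        return;
    }
    // Shared or full: allocate a larger buffer and copy the old contents over.
    const old_len = list.length;
    const new_buf = try gpa.alloc(i64, @max(list.elements.len * 2, 8));
    @memcpy(new_buf[0..old_len], list.elements[0..old_len]);
    new_buf[old_len] = value;
    list.* = .{ .elements = new_buf, .length = old_len + 1, .refcount = 1 };
}
```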
Yeah, I wasn't saying I agree with that analysis... just that the general vibe I have is that we have a lot of optimisations we can work through that should speed us up a lot. Like, we don't have tail recursion yet; the plan is to support TRMC, etc.
What is the appropriate way to instrument the interpreter to add some stderr output during program execution? When I try to add some, the compiler is unhappy because it's not wasm-compatible.
If you do something like != .freestanding or whatever, so it doesn't try to build it in the playground, you should be fine.
Here's an example of how the tracing for refcounting is done:
```zig
// Guard: only trace when the flag is enabled and we're not building for a
// freestanding target (e.g. the wasm playground), where stderr doesn't exist.
if (comptime trace_refcount and builtin.os.tag != .freestanding) {
    const stderr_file: std.fs.File = .stderr();
    var buf: [256]u8 = undefined;
    // Format into a stack buffer; fall back to a fixed message if it doesn't fit.
    const msg = std.fmt.bufPrint(&buf, "[INTERP] upsertBinding from temp_binds ptr=0x{x}\n", .{
        @intFromPtr(binding.value.ptr),
    }) catch "[INTERP] upsertBinding\n";
    stderr_file.writeAll(msg) catch {};
}
```
But why is our interpreter so much slower than Python?
Yeah, we must have a crazy amount of low-hanging fruit. That said, we are also a tree-walking interpreter, which is generally the slowest form of interpreter... but yeah, low-hanging fruit too.
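For anyone following along, the difference looks roughly like this (toy code, not the actual interpreter): a tree-walker recurses through pointer-linked nodes, while a bytecode-style VM is a tight loop over a flat instruction array.

```zig
// Tree-walking: one recursive call and pointer dereferences per node.
const Expr = union(enum) {
    lit: i64,
    add: struct { lhs: *const Expr, rhs: *const Expr },
};

fn walk(e: *const Expr) i64 {
    return switch (e.*) {
        .lit => |v| v,
        .add => |a| walk(a.lhs) + walk(a.rhs),
    };
}

// Bytecode: a stack machine dispatching over a contiguous instruction slice.
const Instr = union(enum) { push: i64, add };

fn run(code: []const Instr, stack: []i64) i64 {
    var sp: usize = 0;
    for (code) |instr| {
        switch (instr) {
            .push => |v| {
                stack[sp] = v;
                sp += 1;
            },
            .add => {
                stack[sp - 2] += stack[sp - 1];
                sp -= 1;
            },
        }
    }
    return stack[sp - 1];
}
```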
Looks like a lot of random interpreter overhead in a few methods. Not something like cloning or etc. But also, I don't really know what these methods do
Screenshot 2025-12-14 at 4.59.43 PM.png
Definitely the first place to look would be applyContinuation which seems to have a lot of cost within itself.
Something super off is that like 95+% of the time is system time.
So this is almost not running roc code at all, just kernel calls. I guess that could be tons of allocations or copies or something.
Something super off is that like 95+% of the time is system time.
Yeah, my htop bar was like 90% red instead of being green
I think we may call UnifyWithConf 1,022 times per line processed. And I assume it is the same unification every line. So this may be cacheable, or something that should be calculated ahead of time before running the interpreter at all.
call UnifyWithConf 1,022 times per line processed
that sounds like a bug
Maybe it's something strange we are doing in our runtime evaluation or something related to polymorphism in the interpreter
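One shape the caching idea could take, purely as a hypothetical sketch (UnifyWithConf and its surrounding machinery may not allow this at all): remember which pairs of type variables have already been unified during this run, so identical unifications aren't redone for every line of input.

```zig
const std = @import("std");

// Hypothetical type-variable index; the real compiler's representation differs.
const TypeVar = u32;

const UnifyCache = struct {
    seen: std.AutoHashMap(u64, void),

    fn init(gpa: std.mem.Allocator) UnifyCache {
        return .{ .seen = std.AutoHashMap(u64, void).init(gpa) };
    }

    // Order-independent key for the pair of type variables.
    fn key(a: TypeVar, b: TypeVar) u64 {
        return (@as(u64, @min(a, b)) << 32) | @max(a, b);
    }

    // Returns true if this pair was already unified, so the caller can skip the work.
    fn alreadyUnified(self: *UnifyCache, a: TypeVar, b: TypeVar) !bool {
        const gop = try self.seen.getOrPut(key(a, b));
        return gop.found_existing;
    }
};
```

Since real unification mutates the type store, the better fix is probably the one suggested above: do the work once ahead of time rather than caching it per call.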
May be related I'm not sure -- but I'm currently investigating an issue with infinite recursion in store.Store.addTypeVar which has come up from cross-module opaque types
This is a lot of allocations for 30 seconds
Screenshot 2025-12-14 at 6.33.01 PM.png
This is time spent in the specific function itself. It does not include time spent in called functions/children.
Screenshot 2025-12-14 at 6.34.51 PM.png
likely better claude analysis
yeah those all sound good! :tada:
I also am working on a PR to revamp the tracy hooks. Adding tons more, making it work with executables from roc build. Some other improvements.
Now with finer grain traces:
Screenshot 2025-12-14 at 7.52.08 PM.png
Yeah, clearly the issue is allocations. Look at the time spent in the allocator alloc function:
Screenshot 2025-12-14 at 8.27.02 PM.png
This is in a 60s run
This is great news!
Haha. I think I just figured out a major part of the issue. Just ran in 5s for me
I'm gonna just merge the fix into my tracy pr cause it is allocator related and I fixed it while also adding some more tracy allocator things.
Top costs after fix:
Screenshot 2025-12-14 at 8.58.47 PM.png
I think with claude's help I see another huge win after this. Hopefully this can get us down to sub 1s here.
Crazy how much you can do with very carefully crafted prompts and guiding with claude. I don't fully know this code, but I know a lot about what would definitely be wrong and how to be very specific around what I want.
Yay! sub 1s!
Will put this in a different PR cause I am less sure it is valid/correct
The improvement is quite amazing for my day03.roc program! Thanks Brendan! Here are my timings:
This is quite an extraordinary speedup, almost 1000x. I don't know if you are interested in looking at another point. I made another benchmark that doesn't show such a dramatic improvement. For that benchmark, the TL;DR is:
For this benchmark, I made a couple of files, one roc file and one python file, plus a wrapper python script that runs both benches with input sizes varying from 50 to 6400, doubling each step. Finally, I plot the timings for all these sizes.
Here is old main:
Here is PR 8680:
And here is PR 8681:
If you are ok to have a look at these: here are the files I used for it.
Also, this is not explicitly perf-related and is maybe more of a bug for @Richard Feldman, but when I run the above script with sizes up to 12800, I get a crash with a memory leak detected.
Running benchmarks...
Sizes to test: [50, 100, 200, 400, 800, 1600, 3200, 6400, 12800]
============================================================
[1/9] Testing size 50... ✓ Roc: 0.0315s, Python: 0.0124s
[2/9] Testing size 100... ✓ Roc: 0.0366s, Python: 0.0127s
[3/9] Testing size 200... ✓ Roc: 0.0472s, Python: 0.0128s
[4/9] Testing size 400... ✓ Roc: 0.0740s, Python: 0.0130s
[5/9] Testing size 800... ✓ Roc: 0.1385s, Python: 0.0138s
[6/9] Testing size 1600... ✓ Roc: 0.3593s, Python: 0.0150s
[7/9] Testing size 3200... ✓ Roc: 1.1039s, Python: 0.0179s
[8/9] Testing size 6400... ✓ Roc: 3.6303s, Python: 0.0233s
[9/9] Testing size 12800... Error running benchmark with size 12800:
error(gpa): memory address 0x342710000 leaked:
Unable to print stack trace: Unable to open debug info: MissingDebugInfo
error: Memory leak detected!
I compiled #8681 with ReleaseFast and I don't see any significant speedup. I didn't measure it, but it still takes several minutes to run. I have an x64 machine, so maybe that's the reason.
M1 Max:
Run 1: 0.49s user 0.20s system 29% cpu 2.350 total
Run 2: 0.45s user 0.16s system 74% cpu 0.817 total
NixOS x86 (AMD 9950X):
Run 1: 4.32s user 13.01s system 97% cpu 17.816 total
Run 2: 4.07s user 12.94s system 99% cpu 17.073 total
Testing more than two runs made no difference on either machine. Used #8681, ran git clean -dfx before building, built with zig build roc -Doptimize=ReleaseFast. Roc source and input file from https://roc.zulipchat.com/#narrow/channel/358903-advent-of-code/topic/2025.20Day.203/near/563633329.
Found my own speedup (#8682), now the whole thing runs in under a second for me as well.
Using #8682 gave me slower M1 times and faster x86 times:
M1 Max:
Run 1: 2.49s user 0.16s system 51% cpu 5.184 total
Run 2: 2.45s user 0.12s system 96% cpu 2.670 total
NixOS x86 (AMD 9950X):
Run 1: 2.13s user 0.18s system 79% cpu 2.914 total
Run 2: 2.03s user 0.18s system 99% cpu 2.221 total
You have to rebase it on top of the new main, which now includes #8681, @Niclas Ahden.
The PR #8682 branch is based on #8680, not #8681, so that may be the slowdown you see on the M1 Max.
Ah, indeed, thanks! Rebased results:
M1 Max:
Run 1: 0.45s user 0.14s system 17% cpu 3.334 total
Run 2: 0.41s user 0.10s system 83% cpu 0.616 total
NixOS x86 (AMD 9950X):
Run 1: 0.46s user 0.15s system 58% cpu 1.028 total
Run 2: 0.34s user 0.14s system 96% cpu 0.495 total