I feel like I'm missing something. Does anyone see anything obviously wrong with this code? (Forget about the parsing bit; it's immaterial to performance.)
app "AoC"
    packages {
        pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.0/bkGby8jb0tmZYsy2hg1E_B2QrCgcSTxdUlHtETwm5m4.tar.br",
        parser: "https://github.com/lukewilliamboswell/roc-parser/releases/download/0.2.0/dJQSsSmorujhiPNIvJKlQoI92RFIG_JQwUfIxZsCSwE.tar.br",
    }
    imports [
        pf.Stdout,
        pf.Task.{ Task },
        parser.Core.{
            Parser,
            const,
            keep,
            skip,
            oneOf,
            many,
            sepBy1,
        },
        parser.String.{
            string,
            parseStr,
            codeunitSatisfies,
        },
        "day-8-input.txt" as input : Str,
    ]
    provides [main] to pf
main : Task {} *
main =
    Stdout.line part1
    |> Task.await \_ -> Stdout.line ""

part1 : Str
part1 =
    input
    |> Str.trim
    |> parseMap
    |> stepsAAAToZZZ
    |> Num.toStr

Direction : [Left, Right]
Node : { left : Str, right : Str }
Network : Dict Str Node
Map : { directions : List Direction, network : Network }

findNode : Network, Str -> Node
findNode = \network, node ->
    when Dict.get network node is
        Ok n -> n
        Err _ -> crash "bad input: node \(node) not found"

stepsAAAToZZZ : Map -> Nat
stepsAAAToZZZ = \{ directions, network } ->
    go = \steps, ds, d, cur ->
        if cur == "ZZZ" then
            steps
        else
            when List.get ds (Num.rem d (List.len ds)) is
                Ok Left -> go (steps + 1) directions (d + 1) (findNode network cur).left
                Ok Right -> go (steps + 1) directions (d + 1) (findNode network cur).right
                Err _ -> crash "bad input"

    go 0 directions 0 "AAA"
parseMap : Str -> Map
parseMap = \str ->
    pDirection : Parser _ Direction
    pDirection = oneOf [
        const Left |> skip (string "L"),
        const Right |> skip (string "R"),
    ]

    pDirections = many pDirection

    pLabel : Parser _ Str
    pLabel =
        const \c ->
            when Str.fromUtf8 c is
                Ok s -> s
                Err _ -> crash "bad input: non-utf8 character"
        |> keep (many (codeunitSatisfies \b -> b >= 'A' && b <= 'Z'))

    pNode : Parser _ (Str, Node)
    pNode =
        const (\node -> \left -> \right -> (node, { left, right }))
        |> keep pLabel
        |> skip (string " = (")
        |> keep pLabel
        |> skip (string ", ")
        |> keep pLabel
        |> skip (string ")")

    pNetwork : Parser _ Network
    pNetwork =
        const Dict.fromList
        |> keep (sepBy1 pNode (string "\n"))

    pMap : Parser _ Map
    pMap =
        const (\directions -> \network -> { directions, network })
        |> keep pDirections
        |> skip (string "\n\n")
        |> keep pNetwork

    when parseStr pMap str is
        Ok g -> g
        Err (ParsingFailure e) | Err (ParsingIncomplete e) -> crash "bad input: '\(e)'"
If I run this as-is, without optimizations or #6216, it takes ~9 seconds. stepsAAAToZZZ#go makes 22K iterations.
If I compile with optimizations and #6216, it's at 860ms. But that's still slow compared to a simple Node.js version:
❯ hyperfine --warmup 2 ./day8 'node day8.js'
Benchmark 1: ./day8
  Time (mean ± σ):     867.2 ms ±  17.8 ms    [User: 856.2 ms, System: 5.4 ms]
  Range (min … max):   848.4 ms … 895.3 ms    10 runs

Benchmark 2: node day8.js
  Time (mean ± σ):      35.0 ms ±   1.5 ms    [User: 29.7 ms, System: 5.3 ms]
  Range (min … max):    31.5 ms …  38.6 ms    74 runs

Summary
  'node day8.js' ran
   24.77 ± 1.18 times faster than './day8'
const fs = require('fs');
const path = require('path');

const input = fs.readFileSync(path.resolve(__dirname, './day-8-input.txt'), 'utf8');

const dir = {left: 0, right: 1};

function parseInput(input) {
  const [directions, nodes] = input.trim().split('\n\n');
  const dirs = directions.split('').map(d => d === 'L' ? dir.left : dir.right);
  const network = {};
  nodes.split('\n').forEach(n => {
    const [name, lrs] = n.split(' = (');
    const left = lrs.substring(0, 3);
    const right = lrs.substring(3 + ', '.length, 3 + ', '.length + 3);
    network[name] = {left, right};
  });
  return {dirs, network};
}

const {dirs, network} = parseInput(input);

function stepsAAAToZZZ(dirs, network) {
  let steps = 0;
  let cur = "AAA";
  while (cur !== "ZZZ") {
    const {left, right} = network[cur];
    const dir = dirs[steps % dirs.length];
    cur = dir === 0 ? left : right;
    steps++;
  }
  return steps;
}

console.log(stepsAAAToZZZ(dirs, network));
I ran a profile and all the time is spent in Dict operations and Str refcounts. But 800ms for 22K iterations of that loop feels insane to me. The input file has ~700 nodes, so even in the absolute worst case, where every node must be scanned on each lookup, this Roc program amortizes to 800ms / 22K / 700 ≈ 52ns per scan. And even if a single scan costs that much, hitting that worst case on every lookup seems super unlikely in practice.
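A quick sanity check of that amortization arithmetic (just the division, nothing Roc-specific):

```rust
fn main() {
    let total_ns = 800.0e6; // 800 ms of total runtime, in nanoseconds
    let iterations = 22_000.0; // loop iterations of stepsAAAToZZZ#go
    let nodes = 700.0; // nodes in the input file
    let ns_per_scan = total_ns / (iterations * nodes);
    println!("{ns_per_scan:.0} ns per node scanned"); // ≈ 52 ns
}
```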
Dicts with refcounted keys or values are currently a pathological case.
Because we don't have any sort of smarter refcounting or borrowing, each call to Dict.get will increment the refcount and then decrement it.
This will be recursive.
So it will spend all of its time traversing all elements of the dict, incrementing and decrementing refcounts.
This is why I want automatic borrowing.
If the strings are small and you write a when to convert them to a statically sized tuple, it would run way, way faster.
Oh, also, it will chase the node and increment those two strings as well.
I tried it with an assoc list too and it was just as slow.
Did you make a get method? If so, it took ownership of the list, requiring a recursive increment when called and a recursive decrement in the function when the list is dropped. It has to be manually inlined to avoid this.
It's a huge perf problem for anything wrapping List.get in another function when the list contains refcounted types.
Also, before we add the borrowing, fixing this refcounting worst case is probably more important. I think I'll work on that next. Shouldn't be too hard.
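The ownership problem described above can be modeled in Rust (a hypothetical analogy, not Roc's actual implementation): a `get` that takes its container by value forces O(n) refcount increments at the call site and O(n) decrements when the argument is dropped, while a borrowing `get` touches only the returned element.

```rust
use std::rc::Rc;

// A `get` that takes the list by value: the caller must clone it, which
// bumps the refcount of every element (the recursive increment), and
// dropping `list` at the end of the call decrements them all again.
fn get_owned(list: Vec<Rc<String>>, i: usize) -> Option<Rc<String>> {
    list.get(i).cloned()
    // `list` is dropped here: O(n) refcount decrements
}

// A borrowing `get`: no element refcounts are touched except the one
// handed back to the caller.
fn get_borrowed(list: &[Rc<String>], i: usize) -> Option<Rc<String>> {
    list.get(i).cloned()
}

fn main() {
    let list: Vec<Rc<String>> = ["AAA", "BBB", "CCC"]
        .iter()
        .map(|s| Rc::new(s.to_string()))
        .collect();

    // Owned version: `list.clone()` is the recursive increment.
    let a = get_owned(list.clone(), 0);
    // Borrowed version: O(1) in the list length.
    let b = get_borrowed(&list, 0);
    assert_eq!(a.as_deref(), b.as_deref());
}
```

This is why the wrapper "has to be manually inlined to avoid this": inlining turns the owned call into direct indexing on a value the caller already holds.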
@Ayaz Hafiz
AoC Day 8 spoiler
Just to clearly show how fast this should be: I took Ayaz's code and changed the labels from a Str to a (U8, U8, U8):
Benchmark 1: /tmp/aoc
  Time (mean ± σ):     861.2 ms ±  24.9 ms    [User: 855.8 ms, System: 1.5 ms]
  Range (min … max):   828.2 ms … 894.9 ms    10 runs

Benchmark 2: /tmp/aoc-u8
  Time (mean ± σ):       7.9 ms ±   0.2 ms    [User: 6.4 ms, System: 0.9 ms]
  Range (min … max):     7.5 ms …   8.5 ms    270 runs

Summary
  '/tmp/aoc-u8' ran
  109.63 ± 4.14 times faster than '/tmp/aoc'
New version of the code
Once we fix the recursive list refcounting, these two should run at essentially the same speed
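The idea behind that change can be sketched in Rust (my own illustration with hypothetical names, not the actual Roc diff): packing a three-letter label into a fixed-size byte array gives a copyable key, so hashing and lookups never retain or release heap strings.

```rust
use std::collections::HashMap;

// Hypothetical helper: pack a 3-letter label like "AAA" into a Copy key,
// the Rust analogue of switching the Roc key from Str to (U8, U8, U8).
fn label_key(s: &str) -> Option<[u8; 3]> {
    let b = s.as_bytes();
    (b.len() == 3).then(|| [b[0], b[1], b[2]])
}

fn main() {
    let mut network: HashMap<[u8; 3], ([u8; 3], [u8; 3])> = HashMap::new();
    network.insert(
        label_key("AAA").unwrap(),
        (label_key("BBB").unwrap(), label_key("CCC").unwrap()),
    );

    // Each lookup copies 3 bytes instead of refcounting heap strings.
    let (left, _right) = network[&label_key("AAA").unwrap()];
    println!("{}", std::str::from_utf8(&left).unwrap()); // prints "BBB"
}
```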
Nice work Brendan! It's really cool to see how performance can be impacted like this, and that there are solutions.
Haha, now I just need to fix the underlying issue in the compiler
I have no contributions to this little optimisation saga, I just want to say that I'm loving it. There's something hypnotic about watching something get optimised.
Should be roughly:
Hmm, though on the heap I have to think about capacity vs size. Ideally we want to know the full capacity for the future, when roc_dealloc
is passed the full allocation size. But for refcounting purposes, we want to know the actual size, such that the slice can just decrease the refcounts of real elements, not old junk elements. I'll have to think about this a bit more, but something along those lines should free us from this problem (though it will keep some refcounted objects alive for longer).
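A rough sketch of the header layout being discussed (purely hypothetical field names, not Roc's actual heap layout): keep both the live length and the full capacity on the heap, so deallocation can report the whole allocation while refcount decrements walk only the live elements.

```rust
// Hypothetical heap-list header: `len` bounds the recursive refcount walk,
// while `cap` records the full allocation size for the eventual free.
struct ListHeader {
    refcount: usize,
    len: usize, // live elements whose refcounts must be decremented on drop
    cap: usize, // full allocation, needed when roc_dealloc gets the size
}

impl ListHeader {
    // Only slots 0..len hold live (refcounted) elements; slots in
    // len..cap are old junk and must not be touched.
    fn live_range(&self) -> std::ops::Range<usize> {
        0..self.len
    }
}

fn main() {
    let h = ListHeader { refcount: 1, len: 3, cap: 8 };
    println!("decrement {} of {} slots", h.live_range().len(), h.cap);
}
```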
@Folkert de Vries do my comments here sound roughly right to you? Anything I am missing? I know you have thought about this recursive refcounting issue before.
Well, fundamentally this is a simple change. But there are many places now where the assumption is made that an inc/dec on a list is recursive, and we have to track them all down and fix them.
For sure, there will be memory leaks and double frees for a while as I do this. Hopefully, with tests and valgrind, I'll be able to dig through those.
I'm not sure that refcounting is enough to explain the problem here. In a trace I ran, refcounts of the inner strings were 70ms of the 300ms profile. So I'm not sure that automatically translates to a 100x improvement.
I also wrote this program in Koka, and it ran in about the same time as the JS
the refcounts also mess with the cache though
I'd try running it with refcounts disabled (like they're a no-op) and see how that compares
Does Koka have the recursive refcount issue where, if you pass a List Str
to a function, it will end up recursively incrementing the refcount of all strings in the list? Or in this case Dict Str { Str, Str },
so 3x as many strings to increment. Then at the end of the function it will recursively decrement the refcount of all the strings. (In this case, technically it is just traversing all the strings, realizing they are small, then moving to the next one.)
You need full debug info from DWARF, and perf using it, to see the inlined functions. Almost all of the refcounting functions are inlined.
Purple is refcounting. It is 98.6% of the program execution:
Screenshot-2023-12-09-at-10.30.06AM.png
Flamegraph of my faster version with U8s. Purple is still refcounting. Essentially all of the time is spent in the parser:
Screenshot-2023-12-09-at-10.37.10AM.png
~5% of the 109x faster version of the application is spent in refcounting
Also, wow, the parser has a deep call stack.
Dicts with refcounts keys or values are currently a pathological case
Is this still a WIP issue? I ran into a problem where my Roc code was running almost 1000x slower than similar Go code. Playing around with it a bit more, I realized most of this overhead is coming from using Str
keys for my Dict.
Using a num type puts me back in Go ballpark numbers.
Yeah, I have a PR that is mostly done and I need to circle back to it.
That said, it may be slow to finish. I think most of the remaining failing tests are segfaults. (Probably also a number of memory leaks.)
Last updated: Jul 06 2025 at 12:14 UTC