does anything strike you all as particularly bad from a perf perspective here? https://gist.github.com/drewolson/9efeb114307c7323ef5d7bfcf4775a52
it's an identical algorithm to this gleam https://github.com/drewolson/aoc-gleam/blob/main/src/aoc/year2023/day14.gleam
and this ocaml https://github.com/drewolson/aoc-ocaml/blob/d3bd355365308c018ada088063895b2d4743975f/lib/year2023/day14.ml
and in both cases it is effectively an order of magnitude slower
this is for part 2. it's a problem that requires some type of caching / dynamic programming. i'm using a dict to cache in all cases
ocaml is using a Hashtbl
which is mutable, but the gleam solution is very similar (using an immutable dict)
i had a version using a State monad in roc initially, but i rewrote it with manual state threading because i was worried that was the performance issue
Are you just running it with roc test? is there a way to run it with the full input?
yes, roc run -- -y 2023 -d 14 -p 2
the full input is where the performance degradation is obvious
this assumes your input is at ./data/2023/day14.txt
do I need to download a full repo somewhere?
let me push it to a full repo, one sec
apologies, it's here https://github.com/drewolson/aoc-roc
for reference, here's roc:
$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2 2.56s user 0.02s system 93% cpu 2.751 total
and here's ocaml:
$ time ./_build/install/default/bin/aoc -y 2023 -d 14 -p 2
90982
./_build/install/default/bin/aoc -y 2023 -d 14 -p 2 0.40s user 0.02s system 98% cpu 0.430 total
spoiler alert for my answer :smile:
and here's gleam, even including compilation
$ time gleam run -- -y 2023 -d 14 -p 2
Compiled in 0.01s
Running aoc.main
90982
gleam run -- -y 2023 -d 14 -p 2 0.47s user 0.18s system 102% cpu 0.636 total
Wow, this program takes forever to compile an optimized build
lol
it’s relatively simple, compared to large programs, right?
40984032832 peak memory footprint
was using about 40GB of ram/swap before I killed it. kinda surprised it runs at all for you
just for roc run?
this was roc build --optimize main.roc. Then the crazy memory came from ./main -y 2023 -d 14 -p 2 with the input data I just downloaded.
i don’t understand how i keep hitting all of these things. i swear i’m just trying to write normal programs
This is the big problem location
aux = \m, acc ->
    when List.mapTry m List.first is
        Ok firsts ->
            rest = List.map m (\l -> List.dropFirst l 1)
            aux rest (List.append acc firsts)

        Err _ -> acc
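For reference, the same "peel the first column off every row" transpose can be sketched in Python (a hypothetical mirror of the Roc aux above, not the author's code; it stops, like List.mapTry with List.first, as soon as any row is empty):

```python
def transpose(m):
    """Transpose by repeatedly collecting the first element of every row."""
    acc = []
    # Stop once the matrix is empty or any row has run out of elements
    # (the equivalent of List.mapTry returning Err).
    while m and all(row for row in m):
        acc.append([row[0] for row in m])  # the "firsts" column
        m = [row[1:] for row in m]         # drop the first element of each row
    return acc

print(transpose([[1, 2, 3], [4, 5, 6]]))  # -> [[1, 4], [2, 5], [3, 6]]
```

Note that `m = [row[1:] for row in m]` rebuilds every row on every iteration, which is the duplication cost being discussed below.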
transpose?
weird
Specifically aux
did i write it strangely?
It makes sense. It is gonna be duplicating m a bunch
Cause you are mutating m twice on every iteration
It also is not tail recursive. So that will be a crazy amount of live copies for a large array
how is it not tail recursive?
all calls are in the tail position, yeah?
also re double mutation, does roc not share structure?
It is, my brain failed for a moment and thought the tail call was within a lambda
a la persistent data structures?
Roc does not use persistent data structures, so it requires different design from other ml style languages
Instead it uses flat memory dense data structures.
It just mutates in place wherever possible
oof, i had no idea.
that’s … tricky
also, transpose only happens once in a not that large list
so i’m still confused how it is the problem
looks to be 1361 by 17 for me
Also, I bet I am missing part of the problem, but not sure how yet
appreciate you taking a look
Hmm.... actually rotation looks to happen fast when just running a debug build with roc. Maybe something separate is breaking the optimized build for that function...
got it. i haven’t attempted an optimized build over here
Ok. So maybe optimized build is broken in a totally different way.
Let me look at a debug build
Ok yeah, optimized builds are just broken in a different way. Debug actually finishes executing without eating tons of memory
yep, but it is quite slow compared to other “not known as super fast” languages. but perhaps this is expected for debug builds?
No, I think there is still a deeper issue here. Also, your matrix is ragged, is that intentional? Normally transpose is done on rectangles, not something ragged.
is it? i think it is rectangular
it definitely is after the first transpose. and i was incorrect, i transpose every cycle
so very often
drew said:
is it? i think it is rectangular
From debug prints, it is not; you transpose the entire input, which is a list of matrices.
Maybe that is related to the root issue
Cause the input isn't like this:
It is like:
You are transposing things like the second spoiler here as if they were a single matrix
i think you have an issue with your input
it should be a single matrix
That is not what I get from the download button here: https://adventofcode.com/2023/day/13
i just rechecked mine, it is definitely rectangular
$ awk '{print length+1}' data/2023/day14.txt | uniq
101
Oh....I just realized...I downloaded day 13....ugh
So ignore my crazy results, they are all from bad input....
ha, phew!
Yeah, rotate does not matter for perf with proper input
about 50% of the time is the concat and prepend in tilt
prepend is not friendly to roc lists. They prefer append
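That asymmetry is the usual flat-array one, sketched here in Python for illustration (list.append is amortized O(1); inserting at the front shifts every existing element, so repeated prepends are quadratic):

```python
def build_append(n):
    # Linear overall: append never moves existing elements.
    out = []
    for i in range(n):
        out.append(i)
    return out

def build_prepend(n):
    # Quadratic overall: each insert(0, ...) shifts the whole array right.
    out = []
    for i in range(n):
        out.insert(0, i)
    return out

# Same contents, opposite order; the usual fix is to append
# everything and reverse once at the end.
assert build_prepend(5) == list(reversed(build_append(5)))
```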
Also, optimized build gives a 3 to 4x speed up. Makes the time about the same as gleam.
got it. i can try without prepend to see if that helps
i’m guessing native ocaml (instead of bytecode) would be even faster
but again, it is using mutation
Roughly the perf breakdown of a non-optimized build is:
60% tilt
30% rotate
10% hashing strings for dict
thanks. how are you getting this breakdown?
I'm currently on mac and using a profiler called samply. Should work on linux too I think. Just have to add --linker=legacy to roc.
Otherwise, I tend to just use perf
and a flamegraph tool if on linux
thanks!
FWIW i replaced all prepends with appends and it's about the same
just pushed the change
given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here
(my roc build --optimize seems to be taking roughly forever as well)
ah finished in 37 seconds
certainly faster though
$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2 0.42s user 0.01s system 69% cpu 0.616 total
not really related, but running roc check on almost every file in the project causes a compiler bug. i opened a minimal repro on github. thanks again for all the help!
(see https://github.com/roc-lang/roc/issues/7175)
A little off-topic -- but man that samply app is awesome!! Just tested it out on the roc-ray demo I'm working on. < 5% of the time is Roc stuff. :smiley:
drew said:
given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here
explanations I can think of for why that might be are:
- we have an issue somewhere (e.g. in the compiler or builtins) that's causing a slowdown
- the program doesn't run long enough for Gleam or Haskell to attempt a GC trace, whereas Roc pays for reference counting along the way, so that automatic memory management cost delta overwhelms the cost of the actual program (I'm not sure how much time is spent in refcounting here)
- the applications differ in a way that means the Roc one is doing something slower than the others (e.g. m being deeply cloned in the Roc version but shallowly cloned in the languages that use persistent data structures for that operation - although it sounded like from later on in the conversation that maybe the flame graph showed other places being the bigger issue?)
The time of a non-optimized run got cut in half on my machine due to the changes from prepend to append
optimized build is about the same though
About half of all samples in an optimized build are spent in some form of allocation function. That suggests tons of unnecessary data copying going on.
Oh, it looks like small string hashing today leads to an allocation, a copy, hashing, a free....ew
That's costing about 18% of the execution time, but is nowhere near the biggest contributor to extra allocations.
This example probably will be good to tinker on. Some of it will be wielding roc incorrectly (cause it isn't designed to run algorithms the same as gleam). Some of it will be things like extra allocations during string hashing.
Also, simply changing away from all of those single character strings to U8 is pretty huge for this app. (to be fair, fixing the str byte allocation thing would get a significant portion of the perf back as well, but still won't be as fast)
For me, changing from Str to U8 leads to more than a 2x improvement for the optimized build
Benchmark 1: ./main-str -y 2023 -d 14 -p 2
Time (mean ± σ): 367.6 ms ± 3.1 ms [User: 361.0 ms, System: 5.7 ms]
Range (min … max): 362.4 ms … 374.0 ms 30 runs
Benchmark 2: ./main-u8 -y 2023 -d 14 -p 2
Time (mean ± σ): 107.5 ms ± 0.4 ms [User: 104.5 ms, System: 2.4 ms]
Range (min … max): 106.8 ms … 108.4 ms 30 runs
Summary
./main-u8 -y 2023 -d 14 -p 2 ran
3.42 ± 0.03 times faster than ./main-str -y 2023 -d 14 -p 2
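The analogous move in Python terms would be storing the grid as rows of raw bytes instead of lists of one-character strings: one machine byte per cell rather than a heap object per cell, and a whole row hashes as one contiguous buffer. A rough illustration (the grid contents here are made up, not the puzzle input):

```python
# One heap object per cell: every lookup and hash touches a boxed string.
grid_str = [["O", ".", "#"], [".", "O", "."]]

# One byte per cell: dense in memory, cheap to compare and hash.
grid_u8 = [bytearray(b"O.#"), bytearray(b".O.")]

# A cache key for cycle detection: join the rows into one bytes value,
# so the whole board hashes as a single contiguous buffer.
key = bytes(b"".join(grid_u8))
print(key)  # -> b'O.#.O.'

# Both representations encode the same board.
assert key.decode() == "".join("".join(row) for row in grid_str)
```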
Note, with time these benchmarks are generally around 2x slower. By running in a hot loop with the benchmarking tool, everything runs faster due to all of the files being fresh and in cache
drew said:
i don’t understand how i keep hitting all of these things. i swear i’m just trying to write normal programs
Apologies for the frustrating bugs and rough edges @drew!
Roc is a very complicated project and we work with very limited resources.
of course, no apologies necessary. i just want to make sure i’m approaching the language correctly. i have baggage from using other fp languages.
Richard Feldman said:
drew said:
given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here
explanations I can think of for why that might be are:
- we have an issue somewhere (e.g. in the compiler or builtins) that's causing a slowdown
- the program doesn't run long enough for Gleam or Haskell to attempt a GC trace, whereas Roc pays for reference counting along the way, so that automatic memory management cost delta overwhelms the cost of the actual program (I'm not sure how much time is spent in refcounting here)
- the applications differ in a way that means the Roc one is doing something slower than the others (e.g. m being deeply cloned in the Roc version but shallowly cloned in the languages that use persistent data structures for that operation - although it sounded like from later on in the conversation that maybe the flame graph showed other places being the bigger issue?)
makes sense, thanks for the detailed breakdown
as an aside, given that roc doesn't use persistent data structures, should i be thinking differently about how to write my code? i'm not sure how to idiomatically use immutable data structures that aren't also persistent
Brendan Hansknecht said:
Also, simply changing away from all of those single character strings to U8 is pretty huge for this app. (to be fair, fixing the str byte allocation thing would get a significant portion of the perf back as well, but still won't be as fast)
For me, changing from Str to U8 leads to more than a 2x improvement for the optimized build
nice, i moved to U8 and it was quite easy.
is it normal for the optimized builds to take ~30 seconds for this size of codebase? or have i hit something weird?
it's very fast now :smile:
$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2 0.15s user 0.01s system 98% cpu 0.154 total
thanks for the help!
awesome! :smiley:
thanks for sticking with it and trying out different things
drew said:
is it normal for the optimized builds to take ~30 seconds for this size of codebase? or have i hit something weird?
we haven't spent much time trying to make the optimization itself run faster, but that does sound high to me too.
Sounds really high to me. It also spends almost all of that time in the frontend. Not sure what is confusing the compiler, but I definitely think it is a bug, or an algorithm (in the compiler passes) that is fundamentally mis-designed
Richard Feldman said:
thanks for sticking with it and trying out different things
thanks for the super positive engagement! i hope i’m not coming off as criticizing. i have very high hopes for the language.
drew said:
i hope i’m not coming off as criticizing
No worries, all cool :)
Definitely not criticizing. This is a really important discussion. Part of it is that we need to educate users better on how roc wants to be used for perf. Part of it is just perf bugs in roc. Both very important.
cool. i feel like i have a bunch of small "paper cut" things in my head, i may write them down in a private gist and share here
drew said:
should i be thinking differently about how to write my code?
In general, roc uses the same low level data structures that would be seen in C/C++/rust/zig. Generally one dense allocation. The big difference in roc is that everything is refcounted and immutable. As such, you have to be careful to design high performance sections of code to only have a single reference to the data structure. If you only have a single reference, roc can be significantly faster than persistent data structures. It will mutate in place. Also, flat data structures are much more cpu friendly than persistent ones. That said, if you have two references in a hot loop, roc has no option but to copy. This is where perf can get really bad.
We think that in practice, this will be faster overall where it matters (though it may require some perf tuning). A big example showing off that this should work is borrow inference in rust. In rust, the compiler enforces that there is only ever one mutable reference to an object. If you need to mutate something and can't get a mutable reference, you have to manually clone. For most code, you don't have to worry about this too much. Standard code design tends to have few references for most things. Roc does essentially the same thing with 2 caveats: Roc simplifies code greatly by checking at runtime instead of compile time, and Roc implicitly adds in the copy. So this is kinda the rust model but at runtime. It has similar theoretical perf limits (though it adds the cost of refcounting and has fewer super low level knobs), but it requires runtime benchmarking and/or deeper awareness of reference counts when writing code.
I definitely think it is a bug or algorithm (in the compiler passes) that is designed fundamentally wrong
84% of time spent in morphic. Yeah, that matches roughly what I expected. Generally when frontend time explodes, it is morphic
This call stack is crazy recursive
I think I have seen this exact thing before. I tried digging into it and fixing it, but I don't understand morphic enough to fix it. I had mostly tried minor changes cause a lot of the time is spent in hash maps and iterators. It could really use someone with more context to look into it. That said, maybe we could get a presentation on the morphic code from some of the morphic folks? That might help more people be able to look into this kind of issue.
also, is there a better place for me to put these threads? "beginners" feels weird to me, but i couldn't think of anything else that made sense.
Currently this is the best location, but maybe we should have another channel for things like this....not exactly sure what the split would be....performance optimization?
just "performance" maybe? So people can go there with performance problems too, not just for optimization.
I like that idea!
For the description, should we go with "To discuss performance of Roc programs and the compiler."
should there also be a "bugs" channel? i suppose that could go in "compiler development" but that also feels weird
Done!
drew has marked this topic as resolved.
Last updated: Jul 06 2025 at 12:14 UTC