does anything strike you all as particularly bad from a perf perspective here? https://gist.github.com/drewolson/9efeb114307c7323ef5d7bfcf4775a52
it's an identical algorithm to this gleam https://github.com/drewolson/aoc-gleam/blob/main/src/aoc/year2023/day14.gleam
and this ocaml https://github.com/drewolson/aoc-ocaml/blob/d3bd355365308c018ada088063895b2d4743975f/lib/year2023/day14.ml
and in both cases it is effectively an order of magnitude slower
this is for part 2. it's a problem that requires some type of caching / dynamic programming. i'm using a dict to cache in all cases
ocaml is using a Hashtbl
which is mutable, but the gleam solution is very similar (using an immutable dict)
i had a version using a State monad in roc initially, but i rewrote it with manual state threading because i was worried that was the performance issue
Are you just running it with roc test? is there a way to run it with the full input?
yes, roc run -- -y 2023 -d 14 -p 2
the full input is where the performance degradation is obvious
this assumes your input is at ./data/2023/day14.txt
do I need to download a full repo somewhere?
let me push it to a full repo, one sec
apologies, it's here https://github.com/drewolson/aoc-roc
for reference, here's roc:
$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2 2.56s user 0.02s system 93% cpu 2.751 total
and here's ocaml:
$ time ./_build/install/default/bin/aoc -y 2023 -d 14 -p 2
90982
./_build/install/default/bin/aoc -y 2023 -d 14 -p 2 0.40s user 0.02s system 98% cpu 0.430 total
spoiler alert for my answer :smile:
and here's gleam, even including compilation
$ time gleam run -- -y 2023 -d 14 -p 2
Compiled in 0.01s
Running aoc.main
90982
gleam run -- -y 2023 -d 14 -p 2 0.47s user 0.18s system 102% cpu 0.636 total
Wow, this program takes forever to compile an optimized build
lol
it’s relatively simple, compared to large programs, right?
40984032832 peak memory footprint
was using about 40GB of ram/swap before I killed it. kinda surprised it runs at all for you
just for roc run?
this was roc build --optimize main.roc. Then the crazy memory came from ./main -y 2023 -d 14 -p 2 with the input data I just downloaded.
i don’t understand how i keep hitting all of these things. i swear i’m just trying to write normal programs
This is the big problem location
aux = \m, acc ->
    when List.mapTry m List.first is
        Ok firsts ->
            rest = List.map m (\l -> List.dropFirst l 1)
            aux rest (List.append acc firsts)

        Err _ -> acc
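For reference, the same "peel the first column off every row" transpose can be sketched in Python (a hypothetical mirror of the Roc aux above, not the author's code; it stops, like List.mapTry with List.first, as soon as any row is empty):

```python
def transpose(m):
    """Transpose by repeatedly collecting the first element of every row."""
    acc = []
    # Stop once the matrix is empty or any row has run out of elements
    # (the equivalent of List.mapTry returning Err).
    while m and all(row for row in m):
        acc.append([row[0] for row in m])  # the "firsts" column
        m = [row[1:] for row in m]         # drop the first element of each row
    return acc

print(transpose([[1, 2, 3], [4, 5, 6]]))  # -> [[1, 4], [2, 5], [3, 6]]
```

Note that `m = [row[1:] for row in m]` rebuilds every row on every iteration, which is the duplication cost being discussed below.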
transpose?
weird
Specifically aux
did i write it strangely?
It makes sense. It is gonna be duplicating m a bunch
Cause you are mutating m twice on every iteration
It also is not tail recursive. So that will be a crazy amount of live copies for a large array
how is it not tail recursive?
all calls are in the tail position, yeah?
also re double mutation, does roc not share structure?
It is, my brain failed for a moment and thought the tail call was within a lambda
a la persistent data structures?
Roc does not use persistent data structures, so it requires different design from other ml style languages
Instead it uses flat memory dense data structures.
It just mutates in place wherever possible
oof, i had no idea.
that’s … tricky
also, transpose only happens once in a not that large list
so i’m still confused how it is the problem
looks to be 1361 by 17 for me
Also, I bet I am missing part of the problem, but not sure how yet
appreciate you taking a look
Hmm.... actually rotation looks to happen fast when just running a debug build with roc. Maybe something separate is breaking the optimized build for that function...
got it. i haven’t attempted an optimized build over here
Ok. So maybe optimized build is broken in a totally different way.
Let me look at a debug build
Ok yeah, optimized builds are just broken in a different way. Debug actually finishes executing without eating tons of memory
yep, but it is quite slow compared to other “not known as super fast” languages. but perhaps this is expected for debug builds?
No, I think there is still a deeper issue here. Also, your matrix is ragged, is that intentional? Normally transpose is done on rectangles, not something ragged.
is it? i think it is rectangular
it definitely is after the first transpose. and i was incorrect, i transpose every cycle
so very often
drew said:
is it? i think it is rectangular
From debug prints, it is not; you transpose the entire input, which is a list of matrices.
Maybe that is related to the root issue
Cause the input isn't like this:
It is like:
You are transposing things like the second spoiler here as if they were a single matrix
i think you have an issue with your input
it should be a single matrix
That is not what I get from the download button here: https://adventofcode.com/2023/day/13
i just rechecked mine, it is definitely rectangular
$ awk '{print length+1}' data/2023/day14.txt | uniq
101
Oh....I just realized...I downloaded day 13....ugh
So ignore my crazy results, they are all from bad input....
ha, phew!
Yeah, rotate does not matter for perf with proper input
about 50% of the time is the concat and prepend in tilt
prepend is not friendly to roc lists. They prefer append
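That asymmetry is the usual flat-array one, sketched here in Python for illustration (list.append is amortized O(1); inserting at the front shifts every existing element, so repeated prepends are quadratic):

```python
def build_append(n):
    # Linear overall: append never moves existing elements.
    out = []
    for i in range(n):
        out.append(i)
    return out

def build_prepend(n):
    # Quadratic overall: each insert(0, ...) shifts the whole array right.
    out = []
    for i in range(n):
        out.insert(0, i)
    return out

# Same contents, opposite order; the usual fix is to append
# everything and reverse once at the end.
assert build_prepend(5) == list(reversed(build_append(5)))
```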
Also, optimized build gives a 3 to 4x speed up. Makes the time about the same as gleam.
got it. i can try without prepend to see if that helps
i’m guessing native ocaml (instead of bytecode) would be even faster
but again, it is using mutation
Roughly the perf breakdown of a non-optimized build is:
60% tilt
30% rotate
10% hashing strings for dict
thanks. how are you getting this breakdown?
I'm currently on mac and using a profiler called samply. Should work on linux too I think. Just have to add --linker=legacy to roc.
Otherwise, I tend to just use perf
and a flamegraph tool if on linux
thanks!
FWIW i replaced all prepends with appends and it's about the same
just pushed the change
given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here
(my roc build --optimize seems to be taking roughly forever as well)
ah finished in 37 seconds
certainly faster though
$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2 0.42s user 0.01s system 69% cpu 0.616 total
not really related, but running roc check on almost every file in the project causes a compiler bug. i opened a minimal repro on github. thanks again for all the help!
(see https://github.com/roc-lang/roc/issues/7175)
A little off-topic -- but man that samply app is awesome!! Just tested it out on the roc-ray demo I'm working on. < 5% of the time is Roc stuff. :smiley:
drew said:
given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here
explanations I can think of for why that might be are:
- we have an issue somewhere (e.g. in the compiler or builtins) that's causing a slowdown
- the program doesn't run long enough for Gleam or Haskell to attempt a GC trace, whereas Roc pays for reference counting along the way, so that automatic memory management cost delta overwhelms the cost of the actual program (I'm not sure how much time is spent in refcounting here)
- the applications differ in a way that means the Roc one is doing something slower than the others (e.g. m being deeply cloned in the Roc version but shallowly cloned in the languages that use persistent data structures for that operation - although it sounded like from later on in the conversation that maybe the flame graph showed other places being the bigger issue?)
The time of a non-optimized run got cut in half on my machine due to the changes from prepend to append
optimized build is about the same though
About half of all samples in an optimized build are spent in some form of allocation function. That suggests tons of unnecessary data copying going on.
Oh, it looks like small string hashing today leads to an allocation, a copy, hashing, a free....ew
That's costing about 18% of the execution time, but is nowhere near the biggest contributor to extra allocations.
This example probably will be good to tinker on. Some of it will be wielding roc incorrectly (cause it isn't designed to run algorithms the same as gleam). Some of it will be things like extra allocations during string hashing.
Also, simply changing away from all of those single character strings to U8 is pretty huge for this app. (to be fair, fixing the str byte allocation thing would get a significant portion of the perf back as well, but still won't be as fast)
For me, changing from Str to U8 leads to more than a 2x improvement for the optimized build
Benchmark 1: ./main-str -y 2023 -d 14 -p 2
Time (mean ± σ): 367.6 ms ± 3.1 ms [User: 361.0 ms, System: 5.7 ms]
Range (min … max): 362.4 ms … 374.0 ms 30 runs
Benchmark 2: ./main-u8 -y 2023 -d 14 -p 2
Time (mean ± σ): 107.5 ms ± 0.4 ms [User: 104.5 ms, System: 2.4 ms]
Range (min … max): 106.8 ms … 108.4 ms 30 runs
Summary
./main-u8 -y 2023 -d 14 -p 2 ran
3.42 ± 0.03 times faster than ./main-str -y 2023 -d 14 -p 2
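The analogous move in Python terms would be storing the grid as rows of raw bytes instead of lists of one-character strings: one machine byte per cell rather than a heap object per cell, and a whole row hashes as one contiguous buffer. A rough illustration (the grid contents here are made up, not the puzzle input):

```python
# One heap object per cell: every lookup and hash touches a boxed string.
grid_str = [["O", ".", "#"], [".", "O", "."]]

# One byte per cell: dense in memory, cheap to compare and hash.
grid_u8 = [bytearray(b"O.#"), bytearray(b".O.")]

# A cache key for cycle detection: join the rows into one bytes value,
# so the whole board hashes as a single contiguous buffer.
key = bytes(b"".join(grid_u8))
print(key)  # -> b'O.#.O.'

# Both representations encode the same board.
assert key.decode() == "".join("".join(row) for row in grid_str)
```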
Note, with time these benchmarks are generally around 2x slower. By running in a hot loop with the benchmarking tool, everything runs faster due to all of the files being fresh and in cache
drew said:
i don’t understand how i keep hitting all of these things. i swear i’m just trying to write normal programs
Apologies for the frustrating bugs and rough edges @drew!
Roc is a very complicated project and we work with very limited resources.
of course, no apologies necessary. i just want to make sure i’m approaching the language correctly. i have baggage from using other fp languages.
Richard Feldman said:
drew said:
given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here
explanations I can think of for why that might be are:
- we have an issue somewhere (e.g. in the compiler or builtins) that's causing a slowdown
- the program doesn't run long enough for Gleam or Haskell to attempt a GC trace, whereas Roc pays for reference counting along the way, so that automatic memory management cost delta overwhelms the cost of the actual program (I'm not sure how much time is spent in refcounting here)
- the applications differ in a way that means the Roc one is doing something slower than the others (e.g. m being deeply cloned in the Roc version but shallowly cloned in the languages that use persistent data structures for that operation - although it sounded like from later on in the conversation that maybe the flame graph showed other places being the bigger issue?)
makes sense, thanks for the detailed breakdown
as an aside, given that roc doesn't use persistent data structures, should i be thinking differently about how to write my code? i'm not sure how to idiomatically use immutable data structures that aren't also persistent
Brendan Hansknecht said:
Also, simply changing away from all of those single character strings to U8 is pretty huge for this app. (to be fair, fixing the str byte allocation thing would get a significant portion of the perf back as well, but still won't be as fast)
For me, changing from Str to U8 leads to more than a 2x improvement for the optimized build
nice, i moved to U8 and it was quite easy.
is it normal for the optimized builds to take ~30 seconds for this size of codebase? or have i hit something weird?
it's very fast now :smile:
$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2 0.15s user 0.01s system 98% cpu 0.154 total
thanks for the help!
awesome! :smiley:
thanks for sticking with it and trying out different things
drew said:
is it normal for the optimized builds to take ~30 seconds for this size of codebase? or have i hit something weird?
we haven't spent much time trying to make the optimization itself run faster, but that does sound high to me too.
Sounds really high to me. It also spends almost all of that time in the frontend. Not sure what is confusing the compiler, but I definitely think it is a bug, or an algorithm (in the compiler passes) that is fundamentally mis-designed
Richard Feldman said:
thanks for sticking with it and trying out different things
thanks for the super positive engagement! i hope i’m not coming off as criticizing. i have very high hopes for the language.
drew said:
i hope i’m not coming off as criticizing
No worries, all cool :)
Definitely not criticizing. This is a really important discussion. Part of it is that we need to educate users better on how roc wants to be used for perf. Part of it is just perf bugs in roc. Both very important.
cool. i feel like i have a bunch of small "paper cut" things in my head, i may write them down in a private gist and share here
drew said:
should i be thinking differently about how to write my code?
In general, roc uses the same low level data structures that would be seen in C/C++/rust/zig. Generally one dense allocation. The big difference in roc is that everything is refcounted and immutable. As such, you have to be careful to design high performance sections of code to only have a single reference to the data structure. If you only have a single reference, roc can be significantly faster than persistent data structures. It will mutate in place. Also, flat data structures are much more cpu friendly than persistent ones. That said, if you have two references in a hot loop, roc has no option but to copy. This is where perf can get really bad.
We think that in practice, this will be faster overall where it matters (though it may require some perf tuning). A big example showing off that this should work is borrow inference in rust. In rust, the compiler enforces that there is only ever one mutable reference to an object. If you need to mutate something and can't get a mutable reference, you have to manually clone. For most code, you don't have to worry about this too much. Standard code design tends to have few references for most things. Roc does essentially the same thing with 2 caveats: Roc simplifies code greatly by checking at runtime instead of compile time, and Roc implicitly adds in the copy. So this is kinda the rust model but at runtime. It has similar theoretical perf limits (though it adds the cost of refcounting and has fewer super low level knobs), but it requires runtime benchmarking and/or deeper awareness of reference counts when writing code.
I definitely think it is a bug or algorithm (in the compiler passes) that is designed fundamentally wrong
84% of time spent in morphic. Yeah, that matches roughly what I expected. Generally when frontend time explodes, it is morphic
This call stack is crazy recursive
I think I have seen this exact thing before. I tried digging into it and fixing it, but I don't understand morphic enough to fix it. I had mostly tried minor changes cause a lot of the time is spent in hash maps and iterators. It could really use someone with more context to look into it. That said, maybe we could get a presentation on the morphic code from some of the morphic folks? That might help more people be able to look into this kind of issue.
also, is there a better place for me to put these threads? "beginners" feels weird to me, but i couldn't think of anything else that made sense.
Currently this is the best location, but maybe we should have another channel for things like this....not exactly sure what the split would be....performance optimization?
just "performance" maybe? So people can go there with performance problems too, not just for optimization.
I like that idea!
For the description, should we go with "To discuss performance of Roc programs and the compiler."
should there also be a "bugs" channel? i suppose that could go in "compiler development" but that also feels weird
Done!
drew has marked this topic as resolved.
Last updated: Jul 06 2025 at 12:14 UTC