Stream: beginners

Topic: ✔ performance question


view this post on Zulip drew (Oct 20 2024 at 20:03):

does anything strike you all is particularly bad from a perf perspective here? https://gist.github.com/drewolson/9efeb114307c7323ef5d7bfcf4775a52

view this post on Zulip drew (Oct 20 2024 at 20:03):

it's an identical algorithm to this gleam https://github.com/drewolson/aoc-gleam/blob/main/src/aoc/year2023/day14.gleam

view this post on Zulip drew (Oct 20 2024 at 20:04):

and this ocaml https://github.com/drewolson/aoc-ocaml/blob/d3bd355365308c018ada088063895b2d4743975f/lib/year2023/day14.ml

view this post on Zulip drew (Oct 20 2024 at 20:04):

and in both cases it is effectively an order of magnitude slower

view this post on Zulip drew (Oct 20 2024 at 20:04):

this is for part 2. it's a problem that requires some type of caching / dynamic programming. i'm using a dict to cache in all cases

view this post on Zulip drew (Oct 20 2024 at 20:05):

ocaml is using a Hashtbl which is mutable, but the gleam solution is very similar (using an immutable dict)

view this post on Zulip drew (Oct 20 2024 at 20:06):

i had a version using a State monad in roc initially, but i rewrote it with manual state threading because i was worried that was the performance issue

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:06):

Are you just running it with roc test? is there a way to run it will full input?

view this post on Zulip drew (Oct 21 2024 at 00:06):

yes, roc run — -y 2023 -d 14 -p 2

view this post on Zulip drew (Oct 21 2024 at 00:07):

the full input is where the performance degradation is obvious

view this post on Zulip drew (Oct 21 2024 at 00:08):

this assumes your input is at ./data/2023/day14.txt

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:09):

do I need to download a full repo somewhere?

view this post on Zulip drew (Oct 21 2024 at 00:09):

let me push it to the full report, one sec

view this post on Zulip drew (Oct 21 2024 at 00:10):

apologies, it's here https://github.com/drewolson/aoc-roc

view this post on Zulip drew (Oct 21 2024 at 00:12):

for reference, here's roc:

view this post on Zulip drew (Oct 21 2024 at 00:12):

$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2  2.56s user 0.02s system 93% cpu 2.751 total

view this post on Zulip drew (Oct 21 2024 at 00:12):

and here's ocaml:

view this post on Zulip drew (Oct 21 2024 at 00:12):

$ time ./_build/install/default/bin/aoc -y 2023 -d 14 -p 2
90982
./_build/install/default/bin/aoc -y 2023 -d 14 -p 2  0.40s user 0.02s system 98% cpu 0.430 total

view this post on Zulip drew (Oct 21 2024 at 00:13):

spoiler alert for my answer :smile:

view this post on Zulip drew (Oct 21 2024 at 00:13):

and here's gleam, even including compilation

view this post on Zulip drew (Oct 21 2024 at 00:13):

$ time gleam run -- -y 2023 -d 14 -p 2
   Compiled in 0.01s
    Running aoc.main
90982
gleam run -- -y 2023 -d 14 -p 2  0.47s user 0.18s system 102% cpu 0.636 total

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:18):

Wow, this program takes forever to compile an optimized build of

view this post on Zulip drew (Oct 21 2024 at 00:19):

lol

view this post on Zulip drew (Oct 21 2024 at 00:19):

it’s relatively simple, compared to large programs, right?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:22):

40984032832 peak memory footprint

was using about 40GB of ram/swap before I killed it. kinda surprised it runs at all for you

view this post on Zulip drew (Oct 21 2024 at 00:23):

just for roc run?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:24):

this was roc build --optimize main.roc . Then the crazy memory came from ./main -y 2023 -d 14 -p 2 with the input data I just downloaded.

view this post on Zulip drew (Oct 21 2024 at 00:34):

i don’t understand how i keep hitting all of these things. i swear i’m just trying to write normal programs

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:40):

This is the big problem location

    aux = \m, acc ->
        when List.mapTry m List.first is
            Ok firsts ->
                rest = List.map m (\l -> List.dropFirst l 1)
                aux rest (List.append acc firsts)

            Err _ -> acc

view this post on Zulip drew (Oct 21 2024 at 00:40):

transpose?

view this post on Zulip drew (Oct 21 2024 at 00:40):

weird

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:40):

Specifically aux

view this post on Zulip drew (Oct 21 2024 at 00:40):

did i write it strangely?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:40):

It makes sense. It is gonna be duplicating m a bunch

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:41):

Cause you are mutating m twice on every iteration

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:42):

It also is not tail recursive. So that will be a crazy about of live copies for a large array

view this post on Zulip drew (Oct 21 2024 at 00:43):

how is it not tail recusive?

view this post on Zulip drew (Oct 21 2024 at 00:43):

all calls are in the tail position, yeah?

view this post on Zulip drew (Oct 21 2024 at 00:44):

also re double mutation, does roc not share structure?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:44):

It is, my brain failed for a moment and thought the tail call was within a lambda

view this post on Zulip drew (Oct 21 2024 at 00:44):

a la persistent data structures?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:45):

Roc does not use persistent data structures, so it requires different design from other ml style languages

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:45):

Instead it uses flat memory dense data structures.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:45):

It just mutates in place wherever possible

view this post on Zulip drew (Oct 21 2024 at 00:46):

oof, i had no idea.

view this post on Zulip drew (Oct 21 2024 at 00:46):

that’s … tricky

view this post on Zulip drew (Oct 21 2024 at 00:46):

also, transpose only happens once in a not that large list

view this post on Zulip drew (Oct 21 2024 at 00:46):

so i’m still confused how it is the problem

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:50):

looks to be 1361 by 17 for me

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:50):

Also, I bet I am missing part of the problem, but not sure how yet

view this post on Zulip drew (Oct 21 2024 at 00:51):

appreciate you taking a look

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:53):

Hmm.... actually rotation looks to happen fast when just running a debug build with roc .... Maybe something separate is breaking the optimized build for that function...

view this post on Zulip drew (Oct 21 2024 at 00:53):

got it. i haven’t attempted an optimized build over here

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:54):

Ok. So maybe optimized build is broken in a totally different way.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:54):

Let me look at a debug build

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 00:56):

Ok yeah, optimized builds are just broken in a different way. Debug actually finishes executing without eating tons of memory

view this post on Zulip drew (Oct 21 2024 at 01:07):

yep, but it is quite slow compared to other “not known as super fast” languages. but perhaps this is expected for debug builds?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:17):

No, I think there is still a deeper issue here. Also, your matrix is ragged, is that intentional? Normally transpose is done on rectangles, not something ragged.

view this post on Zulip drew (Oct 21 2024 at 01:26):

is it? i think it is rectangular

view this post on Zulip drew (Oct 21 2024 at 01:28):

i definitely is after the first transpose. and i was incorrect, i transpose every cycle

view this post on Zulip drew (Oct 21 2024 at 01:28):

so very often

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:35):

is it? i think it is rectangular

From debug prints, it is not, you transpose the entire input. Which is a list a matrices.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:35):

Maybe that is related to the root issue

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:36):

Cause the input isn't like this:

It is like:

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:38):

You are transposing things like the second spoiler here as if they were a single matrix

view this post on Zulip drew (Oct 21 2024 at 01:38):

i think you have an issue with your input

view this post on Zulip drew (Oct 21 2024 at 01:38):

it should be a single matrix

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:38):

That is not what I get from the download button here: https://adventofcode.com/2023/day/13

view this post on Zulip drew (Oct 21 2024 at 01:39):

i just rechecked mine, it is definitely rectangular

view this post on Zulip drew (Oct 21 2024 at 01:40):

$ awk '{print length+1}' data/2023/day14.txt | uniq
101

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:41):

Oh....I just realized...I downloaded day 13....ugh

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:41):

So ignore my crazy results, they are all from bad input....

view this post on Zulip drew (Oct 21 2024 at 01:42):

ha, phew!

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:45):

Yeah, rotate does not matter for perf with proper input

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:47):

about 50% of the time is the concat and prepend in tilt

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:47):

prepend is not friendly to roc lists. They prefer append

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:49):

Also, optimized build gives a 3 to 4x speed up. Makes the time about the same as gleam.

view this post on Zulip drew (Oct 21 2024 at 01:49):

got it. i can try without prepend to see if that helps

view this post on Zulip drew (Oct 21 2024 at 01:50):

i’m guessing native ocaml (instead of bytecode) would be even faster

view this post on Zulip drew (Oct 21 2024 at 01:50):

but again, it is using mutation

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 01:55):

Roughly the perf breakdown of a non-optimized build is:
60% tilt
30% rotate
10% hashing strings for dict

view this post on Zulip drew (Oct 21 2024 at 01:59):

thanks. how are you getting this breakdown?

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 02:02):

I'm currently on mac and using a profiler called samply. Should work on linux too I think. Just have to add --linker=legacy to roc.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 02:02):

Otherwise, I tend to just use perf and a flamegraph tool if on linux

view this post on Zulip drew (Oct 21 2024 at 02:04):

thanks!

view this post on Zulip drew (Oct 21 2024 at 02:12):

FWIW i replaced all prepends with appends and it's about the same

view this post on Zulip drew (Oct 21 2024 at 02:12):

just pushed the change

view this post on Zulip drew (Oct 21 2024 at 02:13):

given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here

view this post on Zulip drew (Oct 21 2024 at 02:17):

(my roc build --optimize seems to be taking roughly forever as well)

view this post on Zulip drew (Oct 21 2024 at 02:17):

ah finished in 37 seconds

view this post on Zulip drew (Oct 21 2024 at 02:17):

certainly faster though

view this post on Zulip drew (Oct 21 2024 at 02:17):

$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2  0.42s user 0.01s system 69% cpu 0.616 total

view this post on Zulip drew (Oct 21 2024 at 02:21):

not really related, but running roc check on almost every file in the project causes a compiler bug. i opened a minimal repro on github. thanks again for all the help!

view this post on Zulip drew (Oct 21 2024 at 02:22):

(see https://github.com/roc-lang/roc/issues/7175)

view this post on Zulip Luke Boswell (Oct 21 2024 at 02:26):

A little off-topic -- but man that samply app is awesome!! Just tested it out on the roc-ray demo I'm working on. < 5% of the time is Roc stuff. :smiley:

view this post on Zulip Richard Feldman (Oct 21 2024 at 03:15):

drew said:

given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here

explanations I can think of for why that might be are:

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 04:52):

The time of a non-optimized run got cut in half on my machine due to the changes from prepend to append

optimized build is about the same though

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 04:55):

About half of all samples in an optimized build are spent in some form of allocation function. That suggests tons of unnecessary data copying going on.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 04:56):

Oh, it looks like small string hashing today leads to an allocation, a copy, hashing, a free....ew

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 04:57):

That's costing about 18% of the execution time, but is no where near the biggest contributor to extra allocations.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 04:59):

This example probably will be good to tinker on. Some of it will be wielding roc incorrectly (cause it isn't design to run algorithms the same as gleam). Some of it will be things like extra allocations during string hashing.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 05:47):

Also, simply changing away from all of those single character strings to U8 is pretty huge for this app. (to be fair, fixing the str byte allocation thing would get a significant portion of the perf back as well, but still won't be as fast)

For me, changing from Str to U8 leads to more than a 2x improvement for the optimized build

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 06:01):

Benchmark 1: ./main-str -y 2023 -d 14 -p 2
  Time (mean ± σ):     367.6 ms ±   3.1 ms    [User: 361.0 ms, System: 5.7 ms]
  Range (min … max):   362.4 ms … 374.0 ms    30 runs

Benchmark 2: ./main-u8 -y 2023 -d 14 -p 2
  Time (mean ± σ):     107.5 ms ±   0.4 ms    [User: 104.5 ms, System: 2.4 ms]
  Range (min … max):   106.8 ms … 108.4 ms    30 runs

Summary
  ./main-u8 -y 2023 -d 14 -p 2 ran
    3.42 ± 0.03 times faster than ./main-str -y 2023 -d 14 -p 2

Note, with time these benchmarks are generally around 2x slower. By running in a hot loop with the benchmarking tool, everything runs faster due to all of the files being fresh and in cache

view this post on Zulip Anton (Oct 21 2024 at 08:46):

i don’t understand how i keep hitting all of these things. i swear i’m just trying to write normal programs

Apologies for the frustrating bugs and rough edges @drew!
Roc is a very complicated project and we work with very limited resources.

view this post on Zulip drew (Oct 21 2024 at 11:52):

of course, no apologies necessary. i just want to make sure i’m approaching the language correctly. i have baggage from using other fp languages.

view this post on Zulip drew (Oct 21 2024 at 12:55):

Richard Feldman said:

drew said:

given all of the talk about competing with "fast" languages, i was thinking it would definitely beat gleam (erlang's performance) but perhaps my expectations are not aligned with reality here

explanations I can think of for why that might be are:

makes sense, thanks for the detailed breakdown

view this post on Zulip drew (Oct 21 2024 at 13:03):

as an aside, given that roc doesn't use persistent data structures, should i be thinking differently about how to write my code? i'm not sure how to idiomatically use immutable data structures that aren't also persistent

view this post on Zulip drew (Oct 21 2024 at 13:59):

Brendan Hansknecht said:

Also, simply changing away from all of those single character strings to U8 is pretty huge for this app. (to be fair, fixing the str byte allocation thing would get a significant portion of the perf back as well, but still won't be as fast)

For me, changing from Str to U8 leads to more than a 2x improvement for the optimized build

nice, i moved to U8 and it was quite easy.

view this post on Zulip drew (Oct 21 2024 at 13:59):

is it normal for the optimized builds to take ~30 seconds for this size of codebase? or have i hit something weird?

view this post on Zulip drew (Oct 21 2024 at 14:00):

it's very fast now :smile:

view this post on Zulip drew (Oct 21 2024 at 14:00):

$ time ./main -y 2023 -d 14 -p 2
90982
./main -y 2023 -d 14 -p 2  0.15s user 0.01s system 98% cpu 0.154 total

view this post on Zulip drew (Oct 21 2024 at 14:00):

thanks for the help!

view this post on Zulip Richard Feldman (Oct 21 2024 at 14:23):

awesome! :smiley:

view this post on Zulip Richard Feldman (Oct 21 2024 at 14:24):

thanks for sticking with it and trying out different things

view this post on Zulip Richard Feldman (Oct 21 2024 at 14:39):

drew said:

is it normal for the optimized builds to take ~30 seconds for this size of codebase? or have i hit something weird?

we haven't spent much time trying to make the optimization itself run faster, but that does sound high to me too.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 14:59):

Sounds really high to me. It also spends almost all of that time in the frontend. Not sure what is confusing the compiler, but I definitely think it is a bug or algorithm (in the compiler passes) that is designed fundamentally wrong

view this post on Zulip drew (Oct 21 2024 at 15:08):

Richard Feldman said:

thanks for sticking with it and trying out different things

thanks for the super positive engagement! i hope i’m not coming off as criticizing. i have very high hopes for the language.

view this post on Zulip Anton (Oct 21 2024 at 15:20):

i hope i’m not coming off as criticizing

No worries, all cool :)

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 16:07):

Definitely not criticizing. This is really important discussion. Part of it is that we need to educate users better on how roc wants to be used for perf. Part of it is just perf bugs in roc. Both very importnant.

view this post on Zulip drew (Oct 21 2024 at 16:09):

cool. i feel like i have a bunch of small "paper cut" things in my head, i may write them down in a private gist and share here

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 16:15):

should i be thinking differently about how to write my code?

In general, roc uses the same low level data structures that would be seen in C/C++/rust/zig. Generally one dense allocation. The big difference in roc is that everything is refcounted and immutable. As such, you have to be careful to design high performance sections of code to only have a single reference to the data structure. If you only have a single reference, roc can be significantly faster than persistent data structures. It will mutate in place. Also, flat data structures are much more cpu friendly than persistent ones. That said, if you have two reference in a hot loop, roc has no option but to copy. This is where perf can get really bad.

We think that in practice, this will be faster overall where it matters (though it may require some perf tuning). A big example showing off that this should work is borrow inference in rust. In rust, they compiler enforces that there is only ever one mutable reference to an object. If you need to mutate something and can't get a mutable reference, you have to manually clone. For most code, you don't have to worry about this too much. Standard code design tends to have few references for most things. Roc does essentially the same thing with 2 caveats: Roc simplifies code greatly by checking at runtime instead of compile time and Roc implicitly adds in the copy. So this is kinda the rust model but at runtime. Has similar theoretical perf limits (though has cost of refcounting added, and less super low level nobs), but requires runtime benchmarking and/or deeper awareness of reference counts when writing code at compile time.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 16:22):

I definitely think it is a bug or algorithm (in the compiler passes) that is designed fundamentally wrong

84% of time spent in morphic. Yeah, that matches roughly what I expected. Generally when frontend time explodes, it is morphic

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 16:23):

This call stack is crazy recursive

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 16:28):

I think I have seen this exact thing before. I tried digging into it and fixing, but I don't understand morphic enough to fix it. I had mostly tried minor changes cause a lot of the time is spent in hash maps and iterators. Probably would really use someone with more context to look into it. That said, maybe we could get a presentation on the morphic code from some of the morphic folks? That might help more people be able to look into this kind of issue.

view this post on Zulip drew (Oct 21 2024 at 16:30):

also, is there a better place for me to put these threads? "beginners" feels weird to me, but i couldn't think of anything else that made sense.

view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 16:35):

Currently this is the best location, but maybe we should have another channel for things like this....not exactly sure what the split would be....performance optimization?

view this post on Zulip Anton (Oct 21 2024 at 16:41):

just "performance" maybe? So people can go there with performance problems too, not just for optimization.

view this post on Zulip Richard Feldman (Oct 21 2024 at 17:11):

I like that idea!

view this post on Zulip Anton (Oct 21 2024 at 17:20):

For the description, should we go with "To discuss performance of Roc programs and the compiler."

view this post on Zulip drew (Oct 21 2024 at 17:21):

should there also be a "bugs" channel? i suppose that could go in "compiler development" but that also feels weird

view this post on Zulip Anton (Oct 21 2024 at 17:40):

Done!

view this post on Zulip Notification Bot (Oct 21 2024 at 18:58):

drew has marked this topic as resolved.


Last updated: Jul 06 2025 at 12:14 UTC