Stream: compiler development

Topic: faster sorting


view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 21:49):

Given I had already started on quadsort for blitsort, I am just transitioning over to working on quadsort for fluxsort. The differencen being that fluxsort uses more memory for more performance.

It will definitely be slow to port quadsort cause it has a lot of low level optimization to be branchless where possible. This tends to be a large perf win for random data where the branch predictor is just constantly wrong. It also has some low level routines for small arrays to make them really fast.

This is tiny_sort in zig: https://godbolt.org/z/q5PTnr13z

It is for arrays of length 0 to 7.

It is just really cool to see the assembly. It has one jump table for the size of the input. It has 4 conditional branches (all for early returns when data happens to be sorted, only 1 can be hit for a given size array). All other control flow is unconditional jumps/calls.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 21:53):

The other nice part of these primitives for quadsort is that they should be transferable to any other sort. Like if we use powersort/glidesort in the future, it still would be a gain to drop into these really optimized routines for sub arrays that are tiny.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 00:19):

promising first results:

mergesort from #beginners > Obvious inefficiencies in merge sort algorithm?:

List sorted correctly!
Sorted 1000000 integers in 100 milliseconds.

quadsort (not fully implemented, and still a few bugs, but enough to run first tests):

List sorted correctly!
Sorted 1000000 integers in 41 milliseconds.

And this is just quadsort. Fluxsort wrapper theoretically can make it even faster.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 00:33):

Main reason for gains:

branch_misses ... ⚡- 99.4% ±  0.0%

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 00:33):

almost 100% less branch misses

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 00:34):

Quadsort is actually more instructions overall:

instructions ... 💩 + 27.3% ±  0.0%

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 00:35):

also:

cache_misses ...  ⚡- 70.8% ±  1.0%

view this post on Zulip Luke Boswell (Jul 25 2024 at 03:03):

This stuff just feels like black magic to me.

view this post on Zulip Luke Boswell (Jul 25 2024 at 03:04):

I love how we can tune the performance and see such amazing results.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 03:04):

1 more point of comparison.

The raw C quadsort on the same size list of longs, task 27ms if the comparison is inlined and 99ms if the comparison is not inlined.

view this post on Zulip Richard Feldman (Jul 25 2024 at 04:41):

wow, so you got it to 41ms when a pure C implementation is 27ms? :smiley:

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 04:44):

Yeah, though llvm is doing a ton of heavy lifting. It manages to inline the comparison function even though we pass it into zig as a pointer.

Also, I haven't done any perf tuning yet. I know of at least one location where for sure the zig is not generating quite the same as the gotos in C. I bet there are other locations as well.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 04:47):

Also, I currently have a broken case. Not sure what is going wrong. Just disabled the case for now. Re-enabling it should be some form of perf gain assuming it was getting hit. With it disabled, the sort is falling back to branchless fully unordered mode more often.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 04:47):

Also, I guess it must have been getting hit, otherwise, it wouldn't be able to break things.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 06:23):

I don't want to admit how much time I spent to fix this bug:

-                        arr_ptr += 8;
+                        arr_ptr += 8 * element_width;

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 06:25):

Down to 34ms. So 34 vs 27ms.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 06:26):

Also, fluxsort is at 20ms.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 06:26):

So definitely still more gains from adding the wrapper layer (it also is much simpler than quadsort)

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 06:33):

So quadsort is fully working now.

I still need to:

  1. add a fallback path for when elements are super large. Instead of sorting the array, will generate and sort an array of pointers.
  2. wire through incrementing the refcount (will make it specialized based on if refcouting is required). This will definitely be a pain. The easy way would be to increment at every compare, but I will probably try to group incrementing where I can to lower the cost. 2 new args to every function and a lot of extra noise in conditionals. (could be done in a different PR, main is broken for this anyway)
  3. Add fluxsort wrapper (definitely could be done in a different PR)
  4. Maybe should fuzz it to make sure it is correct. Though I have a lot of tests, there are a ton of cases in general. DONE
  5. Performance tune? I mean it's pretty darn close to the c, but while it's fresh in my brain maybe should look at diff in more detail. (My biggest guess is most of the cost is related to the element width not being known at compile time, but maybe llvm is propagating it everywhere correctly, followed by a go-to that I know is a function call rn, but just guesses). (more debugging on M1 should be done, but perf is good enough that I'm leaving it be for now).

view this post on Zulip Richard Feldman (Jul 25 2024 at 11:20):

yooo this is awesome!!!

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 15:39):

Interesting, our perf is way worse on m1 mac than on x86.... still better than merge sort, but much farther from the C version.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 23:23):

I can't seem to get flamegraphs from roc into zig working on my mac. So pretty hard to debug the perf.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 23:26):

But yeah, on m1 mac we are like 90ms, but the C is 25ms. Even C with the comparison as no inline is 75ms.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 23:36):

If I average 10 runs, it is 67ms, which is a lot closer, but still not 25ms. and much farther away than the 34ms I get on linux.

view this post on Zulip Brendan Hansknecht (Jul 25 2024 at 23:40):

Sad Flamegraph...

view this post on Zulip Luke Boswell (Jul 26 2024 at 00:14):

Sad because all the time is spent in one function. And we can't see into that function, because that's the boundary where roc calls into zig?

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 00:16):

Yeah

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 00:16):

I'm pretty sure that function is just the sort function

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 00:17):

That's with me turning off strip on the zig builtins. So I feel like there should be debug info

view this post on Zulip Luke Boswell (Jul 26 2024 at 00:18):

How are you building it? Is roc driving the linking? Maybe you could --no-link and then use zig to build-exe?

view this post on Zulip Luke Boswell (Jul 26 2024 at 00:18):

Oh wait... nvm, this is inside roc

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 00:18):

I probably could build a zig app that just directly imports sort.zig. That probably would give me info.

view this post on Zulip Luke Boswell (Jul 26 2024 at 00:19):

That's a good way to isolate the sort impl

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 00:27):

Probably will want to do that anyway at some point for direct fuzzing.

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 01:32):

Apparently essentially all of the time is spent in the copy function...which confuses me cause I'm pretty sure it is getting inlined, so not really sure why it is taking up so much time.

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 01:39):

Hmm, I just hit a segfault on linux. Maybe I have a bug and sorting is looping like crazy sorting mem outside of the array or incrementing incorrectly somehow

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 07:01):

Using a zig app as the driver, I have managed to fix a few bugs. That said, I am still confused by the perf.

On M1 mac, the zig version takes around 70ms. The c++ around 25ms.
If I bench the zig version or run it with a profiler, it takes around 45ms.
If I use a cycle profiler to compare the two, they are within 10% cycle count wise.

I don't know of a good equivalent to perf on mac to get time spent on cache misses and waiting on data.
Probably gonna move on soon and just work on other things.

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 07:02):

Oh, also, in the profiler, almost all the time is spent copying data (which I wouldn't expect C to do better assuming I implemented the algorithm correctly and am copying the same data around).

Given it runs good on x86, that also suggests that I implemented the algorithm correctly.

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 07:03):

But maybe llvm is optimizing poorly for us on m1 mac for some reason

view this post on Zulip Brendan Hansknecht (Jul 26 2024 at 16:42):

Got fuzzing working. Ran it overnight. Definitely found a few edge cases. That said, for 213 million runs, a 552 test cases isn't bad. And I would guess that many test cases have the same root cause.

view this post on Zulip Brendan Hansknecht (Jul 27 2024 at 01:14):

So turns out the bug was a non-bug. I turned a comment in the original C into an assert. Turns out the comment was slightly off. Updated the comment and assert. Running more fuzzing. I think we may be bug free!

view this post on Zulip Brendan Hansknecht (Jul 27 2024 at 09:01):

Down to the last two pieces:

  1. wiring inc_n_data and data_is_owned through everything such that an owning closure could be used. (probably should fuzz that works correctly, basically, make the owned data an incrementing int and make sure it equals the number of times compare is called).
  2. adding the fluxsort wrapper around quadsort (and fuzzing it)

quadsort is working, fuzzed, and can fallback to sort indirectly for giant values.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 00:05):

Definitely a bit of a pervasive PR, but we have refcounting using a comptime bool to add the code or not. Also, did my best to minimize the times we actually call to increment the refcount. Attempt to batch wherever possible. So plenty of places where we increment by 4 or 16 for example.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 00:05):

Now just need to add that into the fuzzer.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 00:06):

Most places where we have to increment by only 1 are in compares that are used with short circuit evaluation. So if the first compare fails, the rest won't run.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 00:06):

And no way to get around that.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 00:37):

Ran it through my existing fuzz corpus and also a bit after. Missed one refcount location, but otherwise, refcounting seems to work happily.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 00:38):

Down to just the fluxsort wrapper and testing/fuzzing for it.

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 22:31):

Printing out the breakdown of sorting is kinda cool:

sort strategies

view this post on Zulip Brendan Hansknecht (Jul 28 2024 at 22:33):

With fluxsort, the more random the data is, the more it uses quicksort. If many elements of the input have the same value, it will use dual pivot quicksort to skip large chunks of sorting. If the array is relatively ordered or is simply small, will fallback to mergesort.

Also has some smart handling for reversed data.

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 01:05):

So besides an issue with the surgical linker for the fallback to sorting pointers if the element size is too large, I think I have full fluxsort working.

view this post on Zulip Richard Feldman (Jul 29 2024 at 01:24):

that's awesome!

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 03:33):

So final perf numbers.

Summary
x86_64:

M1:


M1 (Still not sure why roc is slower here)

x86_64 intel i7-8750H gaming laptop (amazing results)

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 03:39):

PR: #6932

view this post on Zulip Luke Boswell (Jul 29 2024 at 04:20):

Nice work, this is really cool. Love how you explored the performance and shared what you learnt along the way.

I found it super interesting to follow along with, and now I really want to understand the perf characteristics more with my code.

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 06:32):

I'm going to leave this fuzzing overnight, but everything is ready for review and merge here.

view this post on Zulip Luke Boswell (Jul 29 2024 at 11:12):

I thought I might try and run this (or something similar) and see what I see on my M2 mac.

Based on my results, I'm guessing this isn't the correct way to test this.

test program

I used this program inspired by #6294 -- it's not the same because the API has changed a lot since then.

app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.13.0/nW9yMRtZuCYf1Oa9vbE5XoirMwzLbtoSgv7NGhUlqYA.tar.br",
}

import pf.Task

main =
    n = 86896
    list = List.range { start: At n, end: Before 0 }

    actual = List.sortWith list Num.compare
    expected = List.range { start: At 1, end: At n }

    if actual == expected then
        Task.ok {}
    else
        Task.err NotSorted

results M2 mac

$ hyperfine ./sort
Benchmark 1: ./sort
  Time (mean ± σ):      40.4 ms ±  93.3 ms    [User: 10.0 ms, System: 1.3 ms]
  Range (min … max):     9.3 ms … 305.9 ms    10 runs

$ hyperfine ./sort-opt
Benchmark 1: ./sort-opt
  Time (mean ± σ):     164.6 ms ± 502.0 ms    [User: 3.9 ms, System: 1.7 ms]
  Range (min … max):     3.1 ms … 1593.4 ms    10 runs

$ hyperfine -m 1000 ./sort
Benchmark 1: ./sort
  Time (mean ± σ):       9.3 ms ±   0.2 ms    [User: 7.9 ms, System: 0.8 ms]
  Range (min … max):     8.8 ms …   9.7 ms    1000 runs

$ hyperfine -m 1000 ./sort-opt
Benchmark 1: ./sort-opt
  Time (mean ± σ):       3.0 ms ±   0.2 ms    [User: 1.8 ms, System: 0.7 ms]
  Range (min … max):     2.6 ms …   5.3 ms    1000 runs

view this post on Zulip Luke Boswell (Jul 29 2024 at 11:12):

sort-opt was built using --optimize

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 14:58):

Try running ./bench.sh from this repo: https://github.com/bhansconnect/roc-sorting

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 14:59):

And yeah, that benchmark isn't good cause fluxsort will detect the entire list is reversed and just flip it

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 15:39):

leave this fuzzing overnight

250 million executions later with 0 crashes. There were some timeouts (no hangs), but that's expected. Timeouts are expected cause with a large enough input, sorting will get slow. Timeouts just mean it took longer than expected, but still finished in a reasonable time. Hangs are if it takes so long that the fuzzer decides to kill it.

view this post on Zulip Brendan Hansknecht (Jul 29 2024 at 19:13):

So looked at the M1 story a bit more:

A raw zig version using our new builtin sort is essentially as fast as the C sort on M1.

Summary
  ./cc-fluxsort ran
    1.04 ± 0.02 times faster than ./zig-raw-builtinsort

So it is somehow due to the code roc is generating. Maybe llvm is not inlining the compare or copy for llvm on m1 (I don't think this is the case, would be way slower)? Maybe the compare isn't generating branchless? Maybe some other complexity of llvm ir is leading to an optimization failing to apply?

view this post on Zulip Isaac Van Doren (Jul 29 2024 at 21:37):

Super exciting! :grinning_face_with_smiling_eyes:

view this post on Zulip Brendan Hansknecht (Jul 30 2024 at 04:23):

@Luke Boswell if you get a chance, I'd be curious what your M2 results are for this:

Try running ./bench.sh from this repo: https://github.com/bhansconnect/roc-sorting

view this post on Zulip Brendan Hansknecht (Jul 30 2024 at 04:24):

Should run on latest main with zig 0.11 and any form of clang++.

view this post on Zulip Luke Boswell (Jul 30 2024 at 04:33):

M2 macbook air

Benchmark 1: ./roc-mergesort
  Time (mean ± σ):      71.1 ms ±   0.8 ms    [User: 68.1 ms, System: 1.9 ms]
  Range (min … max):    69.9 ms …  72.1 ms    100 runs

Benchmark 2: ./roc-builtinsort
  Time (mean ± σ):      36.6 ms ±   0.5 ms    [User: 34.4 ms, System: 1.5 ms]
  Range (min … max):    35.8 ms …  37.3 ms    100 runs

Benchmark 3: ./zig-raw-builtinsort
  Time (mean ± σ):      33.7 ms ±   0.4 ms    [User: 32.4 ms, System: 1.0 ms]
  Range (min … max):    33.0 ms …  34.3 ms    100 runs

Benchmark 4: ./cc-quadsort
  Time (mean ± σ):      27.1 ms ±   0.3 ms    [User: 25.3 ms, System: 1.5 ms]
  Range (min … max):    26.8 ms …  27.8 ms    100 runs

Benchmark 5: ./cc-fluxsort
  Time (mean ± σ):      19.0 ms ±   0.1 ms    [User: 17.3 ms, System: 1.3 ms]
  Range (min … max):    18.7 ms …  19.3 ms    100 runs

Summary
  ./cc-fluxsort ran
    1.43 ± 0.02 times faster than ./cc-quadsort
    1.78 ± 0.02 times faster than ./zig-raw-builtinsort
    1.93 ± 0.03 times faster than ./roc-builtinsort
    3.75 ± 0.05 times faster than ./roc-mergesort

view this post on Zulip Luke Boswell (Jul 30 2024 at 04:42):

And on the WIP LLVM 18 branch -- this is a great way to confirm that the optimisation passes I removed are definitely still required... so don't take these results seriously

Benchmark 1: ./roc-mergesort
  Time (mean ± σ):     431.0 ms ±   2.7 ms    [User: 425.4 ms, System: 2.4 ms]
  Range (min … max):   425.5 ms … 445.8 ms    100 runs

Benchmark 2: ./roc-builtinsort
  Time (mean ± σ):     203.1 ms ±   0.9 ms    [User: 200.1 ms, System: 1.4 ms]
  Range (min … max):   198.0 ms … 206.5 ms    100 runs

view this post on Zulip Brendan Hansknecht (Jul 30 2024 at 05:10):

ok, so still a gap from c++ to zig/roc. Not sure why the perf diff would be apple silicon specific, but good to know it is consistent.


Last updated: Jul 06 2025 at 12:14 UTC