Given I had already started on quadsort for blitsort, I am just transitioning over to working on quadsort for fluxsort. The difference being that fluxsort uses more memory for more performance.
It will definitely be slow to port quadsort cause it has a lot of low level optimization to be branchless where possible. This tends to be a large perf win for random data where the branch predictor is just constantly wrong. It also has some low level routines for small arrays to make them really fast.
This is tiny_sort in zig, for arrays of length 0 to 7: https://godbolt.org/z/q5PTnr13z
It is just really cool to see the assembly. It has one jump table for the size of the input. It has 4 conditional branches (all for early returns when data happens to be sorted, only 1 can be hit for a given size array). All other control flow is unconditional jumps/calls.
The other nice part of these primitives for quadsort is that they should be transferable to any other sort. Like if we use powersort/glidesort in the future, it still would be a gain to drop into these really optimized routines for sub arrays that are tiny.
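The branchless idea can be sketched roughly like this (a hedged sketch in C rather than zig; `swap_if_greater` and `sort3` are illustrative names, not quadsort's actual internals):

```c
// Branchless swap of two ints: the ternaries typically compile to
// conditional moves, so randomly ordered data costs no branch
// mispredictions.
static void swap_if_greater(int *a, int *b) {
    int x = *a, y = *b;
    int swap = x > y;     // 0 or 1, computed without branching
    *a = swap ? y : x;    // usually lowered to cmov
    *b = swap ? x : y;
}

// A tiny fixed-size sort is then just a fixed network of such swaps;
// control flow does not depend on the data at all.
static void sort3(int *v) {
    swap_if_greater(&v[0], &v[1]);
    swap_if_greater(&v[1], &v[2]);
    swap_if_greater(&v[0], &v[1]);
}
```

Because the swap sequence is fixed for each size, a dispatch on the array length (like the jump table in the godbolt link) is the only data-dependent control flow needed.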
promising first results:
mergesort from #beginners > Obvious inefficiencies in merge sort algorithm?:
List sorted correctly!
Sorted 1000000 integers in 100 milliseconds.
quadsort (not fully implemented, and still a few bugs, but enough to run first tests):
List sorted correctly!
Sorted 1000000 integers in 41 milliseconds.
And this is just quadsort. Fluxsort wrapper theoretically can make it even faster.
Main reason for gains:
branch_misses ... ⚡- 99.4% ± 0.0%
almost 100% fewer branch misses
Quadsort is actually more instructions overall:
instructions ... 💩 + 27.3% ± 0.0%
also:
cache_misses ... ⚡- 70.8% ± 1.0%
This stuff just feels like black magic to me.
I love how we can tune the performance and see such amazing results.
One more point of comparison: the raw C quadsort on the same size list of longs takes 27ms if the comparison is inlined and 99ms if the comparison is not inlined.
wow, so you got it to 41ms when a pure C implementation is 27ms? :smiley:
Yeah, though llvm is doing a ton of heavy lifting. It manages to inline the comparison function even though we pass it into zig as a pointer.
Also, I haven't done any perf tuning yet. I know of at least one location where for sure the zig is not generating quite the same as the gotos in C. I bet there are other locations as well.
Also, I currently have a broken case. Not sure what is going wrong. Just disabled the case for now. Re-enabling it should be some form of perf gain assuming it was getting hit. With it disabled, the sort is falling back to branchless fully unordered mode more often.
Also, I guess it must have been getting hit, otherwise, it wouldn't be able to break things.
I don't want to admit how much time I spent to fix this bug:
- arr_ptr += 8;
+ arr_ptr += 8 * element_width;
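The root cause: a generic sort walks the array as raw bytes, so advancing by N elements has to scale by the element width. A minimal illustration (`advance_elements` is my name for it, not the actual quadsort code):

```c
#include <stddef.h>

// A generic routine only sees bytes; "advance n elements" means
// n * element_width bytes, not n bytes. The bug was advancing
// 8 bytes when 8 elements were intended.
static const void *advance_elements(const void *ptr, size_t n,
                                    size_t element_width) {
    return (const unsigned char *)ptr + n * element_width;
}
```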
Down to 34ms. So 34 vs 27ms.
Also, fluxsort is at 20ms.
So definitely still more gains from adding the wrapper layer (it also is much simpler than quadsort).
So quadsort is fully working now.
I still need to:
yooo this is awesome!!!
Interesting, our perf is way worse on m1 mac than on x86.... still better than merge sort, but much farther from the C version.
I can't seem to get flamegraphs from roc into zig working on my mac. So pretty hard to debug the perf.
But yeah, on m1 mac we are like 90ms, but the C is 25ms. Even C with the comparison as no inline is 75ms.
If I average 10 runs, it is 67ms, which is a lot closer, but still not 25ms. and much farther away than the 34ms I get on linux.
Sad because all the time is spent in one function. And we can't see into that function, because that's the boundary where roc calls into zig?
Yeah
I'm pretty sure that function is just the sort function
That's with me turning off strip on the zig builtins. So I feel like there should be debug info
How are you building it? Is roc driving the linking? Maybe you could --no-link and then use zig to build-exe?
Oh wait... nvm, this is inside roc. I probably could build a zig app that just directly imports sort.zig. That probably would give me info.
That's a good way to isolate the sort impl
Probably will want to do that anyway at some point for direct fuzzing.
Apparently essentially all of the time is spent in the copy function...which confuses me cause I'm pretty sure it is getting inlined, so not really sure why it is taking up so much time.
Hmm, I just hit a segfault on linux. Maybe I have a bug and sorting is looping like crazy sorting mem outside of the array or incrementing incorrectly somehow
Using a zig app as the driver, I have managed to fix a few bugs. That said, I am still confused by the perf.
On M1 mac, the zig version takes around 70ms. The c++ around 25ms.
If I bench the zig version or run it with a profiler, it takes around 45ms.
If I use a cycle profiler to compare the two, they are within 10% cycle count wise.
I don't know of a good equivalent to perf on mac to get time spent on cache misses and waiting on data.
Probably gonna move on soon and just work on other things.
Oh, also, in the profiler, almost all the time is spent copying data (which I wouldn't expect C to do better assuming I implemented the algorithm correctly and am copying the same data around).
Given it runs well on x86, that also suggests that I implemented the algorithm correctly.
But maybe llvm is optimizing poorly for us on m1 mac for some reason
Got fuzzing working. Ran it overnight. Definitely found a few edge cases. That said, for 213 million runs, 552 failing test cases isn't bad. And I would guess that many test cases have the same root cause.
So turns out the bug was a non-bug. I turned a comment in the original C into an assert. Turns out the comment was slightly off. Updated the comment and assert. Running more fuzzing. I think we may be bug free!
Down to the last two pieces:
- Threading inc_n_data and data_is_owned through everything such that an owning closure could be used (probably should fuzz that this works correctly; basically, make the owned data an incrementing int and make sure it equals the number of times compare is called).
- The fluxsort wrapper around quadsort (and fuzzing it).

quadsort is working, fuzzed, and can fall back to sorting indirectly for giant values.
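That fuzz check for the owning-closure path can be sketched like this (a hedged C sketch; `CmpState`, `counting_compare`, and `sort_with_ctx` are illustrative names, not the real builtin API, and the plain insertion sort just keeps the example self-contained):

```c
#include <stddef.h>

// The "owned" closure data is just a counter; after sorting we can
// assert it matches the number of compare calls, standing in for the
// refcount bookkeeping being verified.
typedef struct { long calls; } CmpState;

static int counting_compare(const void *a, const void *b, void *state) {
    ((CmpState *)state)->calls += 1;
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

// Minimal context-taking sort so the check has something to drive.
static void sort_with_ctx(long *arr, size_t n,
                          int (*cmp)(const void *, const void *, void *),
                          void *ctx) {
    for (size_t i = 1; i < n; i++) {
        long key = arr[i];
        size_t j = i;
        while (j > 0 && cmp(&arr[j - 1], &key, ctx) > 0) {
            arr[j] = arr[j - 1];
            j--;
        }
        arr[j] = key;
    }
}
```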
Definitely a bit of a pervasive PR, but we have refcounting using a comptime bool to add the code or not. Also, did my best to minimize the times we actually call to increment the refcount. Attempt to batch wherever possible. So plenty of places where we increment by 4 or 16 for example.
Now just need to add that into the fuzzer.
Most places where we have to increment by only 1 are in compares that are used with short circuit evaluation. So if the first compare fails, the rest won't run.
And no way to get around that.
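The batching idea, roughly (a hedged illustration; `incref` and the global counter are stand-ins for the real refcount machinery): when a stretch of code will unconditionally run 16 compares, one increment of 16 up front replaces 16 separate increments, but a short-circuited compare may never run, so it has to pay for its own increment at the point of use.

```c
static long refcount = 0;

static void incref(long n) { refcount += n; }

// Batched: all 16 compares below run unconditionally, so one
// incref(16) covers them.
static void compare_16_pairs(const long *xs, const long *ys, int *out) {
    incref(16);
    for (int i = 0; i < 16; i++)
        out[i] = xs[i] < ys[i];
}

// Short-circuited: the second compare only runs on a tie, so each
// compare must increment by 1 right before it executes.
static int lex_less(long a1, long b1, long a2, long b2) {
    incref(1);
    if (a1 != b1) return a1 < b1;
    incref(1);
    return a2 < b2;
}
```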
Ran it through my existing fuzz corpus and also a bit after. Missed one refcount location, but otherwise, refcounting seems to work happily.
Down to just the fluxsort wrapper and testing/fuzzing for it.
Printing out the breakdown of sorting is kinda cool:
sort strategies
With fluxsort, the more random the data is, the more it uses quicksort. If many elements of the input have the same value, it will use dual pivot quicksort to skip large chunks of sorting. If the array is relatively ordered or is simply small, will fallback to mergesort.
Also has some smart handling for reversed data.
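The strategy selection described above looks roughly like this (a hypothetical sketch; the thresholds and names are illustrative guesses, not fluxsort's actual heuristics):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { REVERSE, SMALL_SORT, MERGE_SORT, DUAL_PIVOT, QUICK_SORT } Strategy;

// Pick a sort strategy from cheap pre-scan statistics of the input.
static Strategy choose_strategy(size_t len, bool fully_reversed,
                                double ordered_fraction,
                                double duplicate_fraction) {
    if (fully_reversed) return REVERSE;               // just flip the array
    if (len < 32) return SMALL_SORT;                  // tiny fixed routines
    if (ordered_fraction > 0.5) return MERGE_SORT;    // mostly ordered input
    if (duplicate_fraction > 0.25) return DUAL_PIVOT; // many equal values
    return QUICK_SORT;                                // random data
}
```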
So besides an issue with the surgical linker for the fallback to sorting pointers if the element size is too large, I think I have full fluxsort working.
that's awesome!
So final perf numbers.
M1 (Still not sure why roc is slower here)
x86_64 intel i7-8750H gaming laptop (amazing results)
PR: #6932
Nice work, this is really cool. Love how you explored the performance and shared what you learnt along the way.
I found it super interesting to follow along with, and now I really want to understand the perf characteristics more with my code.
I'm going to leave this fuzzing overnight, but everything is ready for review and merge here.
I thought I might try and run this (or something similar) and see what I see on my M2 mac.
Based on my results, I'm guessing this isn't the correct way to test this.
I used this program inspired by #6294 -- it's not the same because the API has changed a lot since then.
app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.13.0/nW9yMRtZuCYf1Oa9vbE5XoirMwzLbtoSgv7NGhUlqYA.tar.br",
}

import pf.Task

main =
    n = 86896
    list = List.range { start: At n, end: Before 0 }
    actual = List.sortWith list Num.compare
    expected = List.range { start: At 1, end: At n }
    if actual == expected then
        Task.ok {}
    else
        Task.err NotSorted
$ hyperfine ./sort
Benchmark 1: ./sort
Time (mean ± σ): 40.4 ms ± 93.3 ms [User: 10.0 ms, System: 1.3 ms]
Range (min … max): 9.3 ms … 305.9 ms 10 runs
$ hyperfine ./sort-opt
Benchmark 1: ./sort-opt
Time (mean ± σ): 164.6 ms ± 502.0 ms [User: 3.9 ms, System: 1.7 ms]
Range (min … max): 3.1 ms … 1593.4 ms 10 runs
$ hyperfine -m 1000 ./sort
Benchmark 1: ./sort
Time (mean ± σ): 9.3 ms ± 0.2 ms [User: 7.9 ms, System: 0.8 ms]
Range (min … max): 8.8 ms … 9.7 ms 1000 runs
$ hyperfine -m 1000 ./sort-opt
Benchmark 1: ./sort-opt
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.8 ms, System: 0.7 ms]
Range (min … max): 2.6 ms … 5.3 ms 1000 runs
sort-opt was built using --optimize
Try running ./bench.sh from this repo: https://github.com/bhansconnect/roc-sorting
And yeah, that benchmark isn't good cause fluxsort will detect the entire list is reversed and just flip it
> leave this fuzzing overnight
250 million executions later with 0 crashes. There were some timeouts (no hangs), but that's expected: with a large enough input, sorting gets slow. A timeout just means a run took longer than expected but still finished in a reasonable time; a hang is when a run takes so long that the fuzzer decides to kill it.
So looked at the M1 story a bit more:
A raw zig version using our new builtin sort is essentially as fast as the C sort on M1.
Summary
./cc-fluxsort ran
1.04 ± 0.02 times faster than ./zig-raw-builtinsort
So it is somehow due to the code roc is generating. Maybe llvm is not inlining the compare or copy on m1 (I don't think this is the case, that would be way slower)? Maybe the compare isn't generating branchless code? Maybe some other complexity of the llvm ir is leading to an optimization failing to apply?
Super exciting! :grinning_face_with_smiling_eyes:
@Luke Boswell if you get a chance, I'd be curious what your M2 results are for this:
Try running ./bench.sh from this repo: https://github.com/bhansconnect/roc-sorting
Should run on latest main with zig 0.11 and any form of clang++.
M2 macbook air
Benchmark 1: ./roc-mergesort
Time (mean ± σ): 71.1 ms ± 0.8 ms [User: 68.1 ms, System: 1.9 ms]
Range (min … max): 69.9 ms … 72.1 ms 100 runs
Benchmark 2: ./roc-builtinsort
Time (mean ± σ): 36.6 ms ± 0.5 ms [User: 34.4 ms, System: 1.5 ms]
Range (min … max): 35.8 ms … 37.3 ms 100 runs
Benchmark 3: ./zig-raw-builtinsort
Time (mean ± σ): 33.7 ms ± 0.4 ms [User: 32.4 ms, System: 1.0 ms]
Range (min … max): 33.0 ms … 34.3 ms 100 runs
Benchmark 4: ./cc-quadsort
Time (mean ± σ): 27.1 ms ± 0.3 ms [User: 25.3 ms, System: 1.5 ms]
Range (min … max): 26.8 ms … 27.8 ms 100 runs
Benchmark 5: ./cc-fluxsort
Time (mean ± σ): 19.0 ms ± 0.1 ms [User: 17.3 ms, System: 1.3 ms]
Range (min … max): 18.7 ms … 19.3 ms 100 runs
Summary
./cc-fluxsort ran
1.43 ± 0.02 times faster than ./cc-quadsort
1.78 ± 0.02 times faster than ./zig-raw-builtinsort
1.93 ± 0.03 times faster than ./roc-builtinsort
3.75 ± 0.05 times faster than ./roc-mergesort
And on the WIP LLVM 18 branch -- this is a great way to confirm that the optimisation passes I removed are definitely still required... so don't take these results seriously
Benchmark 1: ./roc-mergesort
Time (mean ± σ): 431.0 ms ± 2.7 ms [User: 425.4 ms, System: 2.4 ms]
Range (min … max): 425.5 ms … 445.8 ms 100 runs
Benchmark 2: ./roc-builtinsort
Time (mean ± σ): 203.1 ms ± 0.9 ms [User: 200.1 ms, System: 1.4 ms]
Range (min … max): 198.0 ms … 206.5 ms 100 runs
ok, so still a gap from c++ to zig/roc. Not sure why the perf diff would be apple silicon specific, but good to know it is consistent.
Last updated: Jul 06 2025 at 12:14 UTC