I've been investigating #6294, this basic sorting of 30k elements takes 1.8 seconds:
app "helloWorld"
packages { pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.1/Icc3xJoIixF3hCcfXrDwLCu4wQHtNdPyoJkEbkgIElA.tar.br" }
imports [
pf.Stdout,
pf.Task,
]
provides [main] to pf
main : Task.Task {} I32
main =
n = 30000
list = List.range { start: At n, end: Before 0 }
listSorted = List.sortAsc list
{} <- Stdout.line (Inspect.toStr (List.get listSorted 0)) |> Task.await
Stdout.line (Inspect.toStr (List.get listSorted (n-1)))
The corresponding python code takes 0.024 seconds:
n = 30001
lst = list(reversed(range(n)))
lst.sort()
print(lst[0])
print(lst[n-1])
The flamegraph is really interesting: it basically spends all its time doing memcpy before calling rust_main.
is that with debug symbols preserved? a flamegraph without opt might also help you.
This was built with:
./target/release/roc build examples/sortTest.roc --profiling --linker=legacy --emit-llvm-ir --optimize
I'll do one without optimize as well
flamegraph_no_optimize.svg
sortTest_no_optimize.ll
one quick observation here is that in practice List.range seems to not use withCapacity with the right amount. I see a bunch of reallocations happening
I think we just pick the partition index very poorly?
list is unique true
- realloc u8@7f3e6adbf690
before make unique [*]u8@7f3e6adbf848
list is unique true
after make unique [*]u8@7f3e6adbf848
partition 0 <-> 0 <-> 9
partition 1 <-> 9 <-> 9
partition 1 <-> 1 <-> 8
partition 2 <-> 8 <-> 8
partition 2 <-> 2 <-> 7
partition 3 <-> 7 <-> 7
partition 3 <-> 3 <-> 6
partition 4 <-> 6 <-> 6
partition 4 <-> 4 <-> 5
we can do better by not touching any memory in the swap, but fundamentally that input hits a bad pattern in how we select the pivot. That is how the standard quicksort is implemented, though, and without randomness there is no way to fundamentally do better. Still, picking the halfway element is likely better in practice than the current choice.
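To make the pivot pathology concrete, here's a toy Lomuto-partition quicksort that always picks the last element as the pivot (a simplified model for illustration, not Roc's actual builtin code). On a reversed list the swap count grows roughly quadratically:

```python
# Toy model of the behavior in the partition trace above: a Lomuto-partition
# quicksort that always picks the last element as the pivot. This is a
# simplification, not Roc's exact builtin implementation.
def quicksort_last_pivot(a, lo, hi, stats):
    if lo >= hi:
        return
    p = partition(a, lo, hi, stats)
    quicksort_last_pivot(a, lo, p - 1, stats)
    quicksort_last_pivot(a, p + 1, hi, stats)

def partition(a, lo, hi, stats):
    pivot = a[hi]  # last element as pivot: pathological for sorted/reversed input
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            stats["swaps"] += 1
            i += 1
    a[i], a[hi] = a[hi], a[i]
    stats["swaps"] += 1
    return i

def count_swaps(n):
    a = list(range(n, 0, -1))  # reversed input, like the example in this thread
    stats = {"swaps": 0}
    quicksort_last_pivot(a, 0, n - 1, stats)
    assert a == sorted(a)
    return stats["swaps"]

for n in (100, 200, 400):
    print(n, count_swaps(n))  # swap count roughly quadruples as n doubles
```

Doubling n roughly quadruples the swap count, which is the quadratic blowup the trace above is hitting.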
is this one of those situations where timsort might do better?
important disclaimer: I don't know anything about timsort
is the long term plan to switch to timsort or one of the quicksorts with recognition for many patterns?
As a note, python is just recognizing the list is in reverse and reversing it a second time
Python uses timsort which has that as a pattern.
so we are kinda comparing apples to oranges. Python isn't sorting, Roc is sorting.
As a note, python is just recognizing the list is in reverse and reversing it a second time
lol
Should I write a randomly shuffled list to a file and use that as input?
Anyway, many languages use timsort and I think it is a safe default for us to use as well. I know I have seen c++ talks about going even faster by recognizing more patterns, but it is probably fine to just switch to timsort for now. It is quite good overall.
Should I write a randomly shuffled list to a file and use that as input?
yeah, that would be a good test. Take a range, shuffle it, dump to file.
Then base sorting off of that random list.
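For reference, generating that input could look like this (the `shuffled.txt` name and one-number-per-line format are my assumptions), so both Roc and python would sort the exact same data:

```python
# Shuffle a range and dump it to a file, one number per line.
# File name and format are illustrative, not something the thread agreed on.
import random

n = 30_000
nums = list(range(n))
random.shuffle(nums)
with open("shuffled.txt", "w") as f:
    f.write("\n".join(map(str, nums)))
```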
As a note, python is just recognizing the list is in reverse and reversing it a second time
How did you figure that out btw @Brendan Hansknecht?
python uses timsort
timsort has a special pattern for descending runs
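For anyone curious, that descending-run handling can be sketched like this (a simplification of what CPython's listsort.txt describes, not the real implementation): timsort finds the natural run at the front of the list and, if it is strictly descending, reverses it in place, so a fully reversed list becomes a single sorted run in O(n).

```python
def count_run(a):
    """Return the length of the initial natural run; if the run is
    strictly descending, reverse it in place first (timsort-style)."""
    n = len(a)
    if n < 2:
        return n
    i = 1
    if a[1] < a[0]:  # strictly descending run
        while i < n - 1 and a[i + 1] < a[i]:
            i += 1
        a[: i + 1] = a[i::-1]  # reverse the run in place
    else:  # non-descending run
        while i < n - 1 and a[i + 1] >= a[i]:
            i += 1
    return i + 1

data = [5, 4, 3, 2, 1, 7, 8]
run_len = count_run(data)
print(run_len, data)  # → 5 [1, 2, 3, 4, 5, 7, 8]
```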
there is also https://github.com/orlp/glidesort
pattern-defeating quicksort is awesome. So if that actually does merge the two successfully, that sounds great.
not totally sure yet whether it needs extra space
Glidesort can use as much (up to n) or as little extra memory as you want. If given only O(1) memory the average and worst case become O(n (log n)^2), however in practice its performance is great for all but the most skewed data size / auxiliary space ratios. The default is to allocate up to n elements worth of data, unless this exceeds 1 MiB, in which case we scale this down to n / 2 elements worth of data up until 1 GiB after which glidesort uses n / 8 memory.
Hmm...interesting
yeah, that is a tradeoff to think about.
hi all! I wrote the original issue #6294, so comments are always welcome
Brendan Hansknecht said:
Should I write a randomly shuffled list to a file and use that as input?
yeah, that would be a good test. Take a range, shuffle it, dump to file.
a less rigorous check (for the sake of my lazy Saturday) is to make python randomize the numbers as well, which isn't fair to python, but python still runs much faster despite the handicap.
new code:
import random
n = 1000000
l = list(range(n))
random.shuffle(l)
l.sort()
print('sorting done')
timed this on my machine via:
$ time python3 -c "n = 1000000; import random; l = list(range(n)); random.shuffle(l); l.sort(); print('sorting done')"
sorting done
python3 -c 0.53s user 0.02s system 95% cpu 0.574 total
so altogether (on my machine):
-> the speed difference between Python sort and Roc sort is probably more than just pattern exploitation
oh, we swap elements with a buffer and memcpy. that is probably the root of the issues.
cause we are handling what should be an int load and int store as a dynamic memcpy.
that, or we are making way too many swaps for some reason, cause almost all time is spent in memcpy for swapping elements.
the way swapping is done (zig treating everything as just an opaque block of bytes with dynamic size) may be a big part of killing the perf.
It would be really awesome if we could give comptime parameters to zig from roc. (instead we may need to generate and expose many forms of the same function).
Not fast, but a lot better overall (a bit over 10x faster on my system):
$ time /tmp/sort
(Ok 1)
(Ok 30000)
/tmp/sort 0.44s user 0.01s system 99% cpu 0.450 total
Added this to the swap function to avoid the memcpy:
switch (element_width) {
    8 => {
        var i = @as(*u64, @ptrCast(@alignCast(element_at_i)));
        var j = @as(*u64, @ptrCast(@alignCast(element_at_j)));
        const tmp = i.*;
        i.* = j.*;
        j.* = tmp;
    },
    else => {
        // other element widths keep the original buffer + memcpy path (elided here)
    },
}
This is still terrible cause every single swap is dynamic on size when they really shouldn't be.
Cause it is an extra branch that should be 100% knowable that is repeated way way too much.
Right but then it's a very predictable branch so maybe not that bad
Need to do more testing, but Python sorted 1 million items in the same time (note comparing different machines that had similar times). I am only sorting 30 thousand.
in this example we do lots of swaps with the same source and target. Those could be skipped, but your code here would still do all the memcpys in that case I think
Simon Peleška has marked this topic as resolved.
Simon Peleška has marked this topic as unresolved.
Sorry, I was just reading this out of interest and accidentally resolved the topic
Quicksort has good average performance but has pathological failure cases. If I have understood the code correctly, roc's quicksort picks the last element of the range in question as the pivot; if I'm not mistaken, given that choice a sorted list is a pathological case for this implementation. A quick check would be to try sorting a "random" list (say, mapping hash over range) - if that runs at normal speed, the pivot choice is probably the problem.
I'm not a roc developer, but I think quicksort is a poor choice for roc. First, its performance is only good on average, which can lead to not-at-all delightful surprises. Second, it's not a stable sort - the order of elements that compare equal is not preserved; more less-than-delightful surprises.
Timsort was designed to replace python's existing sorting algorithm based on considerable real-world experience. It is based on mergesort, guaranteeing decent scaling for all inputs, and has considerable specialization for special cases (a simpler sort for small segments where that's faster, tuning to ensure sorting nearly sorted lists is fast, and so on).
One design consideration where roc may need to handle things differently is that python lists always store only pointers - fixed, small-size objects. If the list items were large (or maybe even had poor memory alignment), something like numpy's argsort would be faster: determine the correct order in an auxiliary array, then reshuffle in a single pass.
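The argsort idea, sketched in python (illustrative only; numpy's actual argsort works on arrays): sort indices by the sort key, then apply the permutation in one pass, so each (possibly large) element is moved only once.

```python
def argsort_sort(items, key=lambda x: x):
    """Sort via an auxiliary index array: comparisons touch only the keys,
    and the (possibly large) elements are moved once at the end."""
    order = sorted(range(len(items)), key=lambda i: key(items[i]))
    return [items[i] for i in order]

print(argsort_sort(["bbb", "a", "cc"], key=len))  # → ['a', 'cc', 'bbb']
```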
yes the input given here is a pathological case for our choice of pivot
but then, how do we decide on a better pivot, e.g. picking the first or center element has similar pathological input cases
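For what it's worth, one common mitigation (not what the builtin currently does, and any deterministic rule still has adversarial inputs) is median-of-three pivot selection, which at least handles already-sorted and reversed inputs well:

```python
def median_of_three(a, lo, hi):
    """Pick the median of the first, middle, and last elements as the pivot
    value. A common heuristic, sketched here for illustration."""
    mid = (lo + hi) // 2
    x, y, z = a[lo], a[mid], a[hi]
    if x > y:
        x, y = y, x
    if y > z:
        y, z = z, y
    if x > y:
        x, y = y, x
    return y  # the median of the three sampled values

# On a fully reversed list the sampled median is the true middle value:
print(median_of_three([9, 8, 7, 6, 5, 4, 3, 2, 1], 0, 8))  # → 5
```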
yeah timsort is interesting, but perhaps dated? The https://github.com/orlp/glidesort project is I think the current state of the art?
Anne Archibald said:
Quicksort has good average performance but has pathological failure cases. … A quick check would be to try sorting a "random" list … - if this is a normal speed, that's probably the problem.
The check you described (sorting a random list) sounds like an easy check to perform that would also provide some good insight. I’d be curious to see the results.
Though afaik, this specific problem is more likely related to what @Brendan Hansknecht was investigating. I do not expect a new sorting algorithm will fix this problem.
This is because I’ve also encountered other cases (e.g. writing a heap queue, other instances I can no longer remember, etc.) that use lists in what should be an efficient way (i.e. just using the O(1) operations like swap or dropLast), but exhibit notably poor runtime performance, much worse than what could be explained by just poor algorithm design.
This is just my hunch though, I don’t have a lot of data on this (aside from having written a heapsort in pure roc that outperforms the library sort function by only 2x, but I don’t think that’s super clear evidence for it).
Though afaik, this specific problem is more likely related to what @Brendan Hansknecht was investigating. I do not expect a new sorting algorithm will fix this problem.
My comment is part of the problem, but I think the sorting algorithm is likely a big part of the issue.
I think something that would be good to test is to pass an always-inline function into zig. That function would be the swap-element function, and it would be generated while knowing roc types. It would be a specialized function for every call to List.sortAsc/swap/etc.
Though I guess you probably can't always-inline a function pointer, so it is probably up to llvm to inline it still. That said, it should still be way faster than calling memcpy.
I thought that llvm has a pass that is meant for propagating constant function args, but I'm not sure how well it works in practice for something like this.
Ah okay, @Brendan Hansknecht thanks for clarifying.
I guess I’m showing my inexperience with both sorting algorithms and compiler magic, because I still suspect that it’s not a sorting-specific issue. Unfortunately I’m not well-equipped to explore this more. But I’m excited to see how this gets resolved! Thanks all for your work and kindness so far, and good luck bug-hunting!
I mean it definitely takes the combination of both. Bad sorting algorithm means way more swaps. Slow swaps mean way more time. Slow swaps but a good sorting algorithm should still be ok speed. But yeah, a multi-factor fix is needed.
I filed #6450 for the memory copy slowness.
Last updated: Jul 06 2025 at 12:14 UTC