I've been investigating #6294, this basic sorting of 30k elements takes 1.8 seconds:
app "helloWorld"
packages { pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.1/Icc3xJoIixF3hCcfXrDwLCu4wQHtNdPyoJkEbkgIElA.tar.br" }
imports [
pf.Stdout,
pf.Task,
]
provides [main] to pf
main : Task.Task {} I32
main =
n = 30000
list = List.range { start: At n, end: Before 0 }
listSorted = List.sortAsc list
{} <- Stdout.line (Inspect.toStr (List.get listSorted 0)) |> Task.await
Stdout.line (Inspect.toStr (List.get listSorted (n-1)))
The corresponding python code takes 0.024 seconds:
n = 30001
lst = list(reversed(range(n)))
lst.sort()
print(lst[0])
print(lst[n-1])
The flamegraph is really interesting: it basically spends all its time doing memcpy before calling rust_main.
is that with debug symbols preserved? a flamegraph without opt might also help you.
This was built with:
./target/release/roc build examples/sortTest.roc --profiling --linker=legacy --emit-llvm-ir --optimize
I'll do one without optimize as well
flamegraph_no_optimize.svg
sortTest_no_optimize.ll
one quick observation here is that in practice List.range seems to not use withCapacity with the right amount. I see a bunch of reallocations happening
I think we just pick the partition index very poorly?
list is unique true
- realloc u8@7f3e6adbf690
before make unique [*]u8@7f3e6adbf848
list is unique true
after make unique [*]u8@7f3e6adbf848
partition 0 <-> 0 <-> 9
partition 1 <-> 9 <-> 9
partition 1 <-> 1 <-> 8
partition 2 <-> 8 <-> 8
partition 2 <-> 2 <-> 7
partition 3 <-> 7 <-> 7
partition 3 <-> 3 <-> 6
partition 4 <-> 6 <-> 6
partition 4 <-> 4 <-> 5
we can do better by not touching any memory in the swap, but fundamentally that input hits a bad pattern in how we select the pivot. That is how the standard quicksort is implemented, though, and without randomness there is no way to fundamentally do better. Still, picking the halfway element is likely better in practice than the current choice.
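To make the pivot pathology concrete, here's a toy Lomuto-partition quicksort that always picks the last element as the pivot (a simplified model for illustration, not Roc's actual builtin code). On a reversed list the swap count grows roughly quadratically:

```python
# Toy model of the behavior in the partition trace above: a Lomuto-partition
# quicksort that always picks the last element as the pivot. This is a
# simplification, not Roc's exact builtin implementation.
def quicksort_last_pivot(a, lo, hi, stats):
    if lo >= hi:
        return
    p = partition(a, lo, hi, stats)
    quicksort_last_pivot(a, lo, p - 1, stats)
    quicksort_last_pivot(a, p + 1, hi, stats)

def partition(a, lo, hi, stats):
    pivot = a[hi]  # last element as pivot: pathological for sorted/reversed input
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            stats["swaps"] += 1
            i += 1
    a[i], a[hi] = a[hi], a[i]
    stats["swaps"] += 1
    return i

def count_swaps(n):
    a = list(range(n, 0, -1))  # reversed input, like the example in this thread
    stats = {"swaps": 0}
    quicksort_last_pivot(a, 0, n - 1, stats)
    assert a == sorted(a)
    return stats["swaps"]

for n in (100, 200, 400):
    print(n, count_swaps(n))  # swap count roughly quadruples as n doubles
```

Doubling n roughly quadruples the swap count, which is the quadratic blowup the trace above is hitting.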
is this one of those situations where timsort might do better?
important disclaimer: I don't know anything about timsort
is the long term plan to switch to timsort or one of the quicksorts with recognition for many patterns?
As a note, python is just recognizing the list is in reverse and reversing it a second time
Python uses timsort which has that as a pattern.
so we are kinda comparing apples to oranges. Python isn't sorting, Roc is sorting.
As a note, python is just recognizing the list is in reverse and reversing it a second time
lol
Should I write a randomly shuffled list to a file and use that as input?
Anyway, many languages use timsort and I think it is a safe default for us to use as well. I know I have seen c++ talks about going even faster by recognizing more patterns, but it is probably fine to just switch to timsort for now. It is quite good overall.
Should I write a randomly shuffled list to a file and use that as input?
yeah, that would be a good test. Take a range, shuffle it, dump to file.
Then base sorting off of that random list.
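For reference, generating that input could look like this (the `shuffled.txt` name and one-number-per-line format are my assumptions), so both Roc and python would sort the exact same data:

```python
# Shuffle a range and dump it to a file, one number per line.
# File name and format are illustrative, not something the thread agreed on.
import random

n = 30_000
nums = list(range(n))
random.shuffle(nums)
with open("shuffled.txt", "w") as f:
    f.write("\n".join(map(str, nums)))
```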
As a note, python is just recognizing the list is in reverse and reversing it a second time
How did you figure that out btw @Brendan Hansknecht?
python uses timsort
timsort has a special pattern for descending runs
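For anyone curious, that descending-run handling can be sketched like this (a simplification of what CPython's listsort.txt describes, not the real implementation): timsort finds the natural run at the front of the list and, if it is strictly descending, reverses it in place, so a fully reversed list becomes a single sorted run in O(n).

```python
def count_run(a):
    """Return the length of the initial natural run; if the run is
    strictly descending, reverse it in place first (timsort-style)."""
    n = len(a)
    if n < 2:
        return n
    i = 1
    if a[1] < a[0]:  # strictly descending run
        while i < n - 1 and a[i + 1] < a[i]:
            i += 1
        a[: i + 1] = a[i::-1]  # reverse the run in place
    else:  # non-descending run
        while i < n - 1 and a[i + 1] >= a[i]:
            i += 1
    return i + 1

data = [5, 4, 3, 2, 1, 7, 8]
run_len = count_run(data)
print(run_len, data)  # → 5 [1, 2, 3, 4, 5, 7, 8]
```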
there is also https://github.com/orlp/glidesort
pattern-defeating quicksort is awesome. So if that actually does merge the two successfully, that sounds great.
not totally sure yet whether it needs extra space
Glidesort can use as much (up to n) or as little extra memory as you want. If given only O(1) memory the average and worst case become O(n (log n)^2), however in practice its performance is great for all but the most skewed data size / auxiliary space ratios. The default is to allocate up to n elements worth of data, unless this exceeds 1 MiB, in which case we scale this down to n / 2 elements worth of data up until 1 GiB after which glidesort uses n / 8 memory.
Hmm...interesting
yeah, that is a tradeoff to think about.
hi all! I wrote the original issue #6294, so comments are always welcome
Brendan Hansknecht said:
Should I write a randomly shuffled list to a file and use that as input?
yeah, that would be a good test. Take a range, shuffle it, dump to file.
a less rigorous check (for the sake of my lazy Saturday) is to make python randomize the numbers as well, which isn't fair to python, but python still runs much faster despite the handicap.
new code:
import random
n = 1000000
l = list(range(n))
random.shuffle(l)
l.sort()
print('sorting done')
timed this on my machine via:
$ time python3 -c "n = 1000000; import random; l = list(range(n)); random.shuffle(l); l.sort(); print('sorting done')"
sorting done
python3 -c 0.53s user 0.02s system 95% cpu 0.574 total
so altogether (on my machine):
-> the speed difference between Python sort and Roc sort is probably more than just pattern exploitation
oh, we swap elements with a buffer and memcpy. that is probably the root of the issues.
cause we are handling what should be an int load and int store as a dynamic memcpy.
that, or we are making way too many swaps for some reason, cause almost all time is spent in memcpy for swapping elements.
the way swapping is done (zig treating everything as just an opaque block of bytes with dynamic size) may be a big part of killing the perf.
It would be really awesome if we could give comptime parameters to zig from roc. (instead we may need to generate and expose many forms of the same function).
Not fast, but a lot better overall (a bit over 10x faster on my system):
$ time /tmp/sort
(Ok 1)
(Ok 30000)
/tmp/sort 0.44s user 0.01s system 99% cpu 0.450 total
Added this to the swap function to avoid the memcpy:
switch (element_width) {
    8 => {
        var i = @as(*u64, @ptrCast(@alignCast(element_at_i)));
        var j = @as(*u64, @ptrCast(@alignCast(element_at_j)));
        const tmp = i.*;
        i.* = j.*;
        j.* = tmp;
    },
    else => {
        // other element widths keep the original buffer + memcpy path (elided here)
    },
}
This is still terrible cause every single swap is dynamic on size when they really shouldn't be.
Cause it is an extra branch that should be 100% knowable that is repeated way way too much.
Right but then it's a very predictable branch so maybe not that bad
Need to do more testing, but Python sorted 1 million items in the same time (note comparing different machines that had similar times). I am only sorting 30 thousand.
in this example we do lots of swaps with the same source and target. Those could be skipped, but your code here would still do all the memcpys in that case I think
Simon Peleška has marked this topic as resolved.
Simon Peleška has marked this topic as unresolved.
Sorry, I was just reading this out of interest and accidentally resolved the topic
Quicksort has good average performance but has pathological failure cases. If I have understood the code correctly, roc's quicksort picks the last element of the range in question as the pivot; if I'm not mistaken, given that choice a sorted list is a pathological case for this implementation. A quick check would be to try sorting a "random" list (say, mapping hash over range) - if that runs at normal speed, the pivot choice is probably the problem.
I'm not a roc developer, but I think quicksort is a poor choice for roc. First, its performance is only good on average, which can lead to not-at-all delightful surprises. Second, it's not a stable sort - the order of elements that compare equal is not preserved; more less-than-delightful surprises.
Timsort was designed to replace python's existing sorting algorithm based on considerable real-world experience. It is based on mergesort, guaranteeing decent scaling for all inputs, and has considerable specialization for special cases (a simpler sort for small segments where that's faster, tuning to ensure sorting nearly sorted lists is fast, and so on).
One design consideration where roc may need to handle things differently is that python lists always store only pointers - fixed, small-size objects. If the list items were large (or maybe even had poor memory alignment), something like numpy's argsort would be faster: determine the correct order in an auxiliary array, then reshuffle in a single pass.
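The argsort idea, sketched in python (illustrative only; numpy's actual argsort works on arrays): sort indices by the sort key, then apply the permutation in one pass, so each (possibly large) element is moved only once.

```python
def argsort_sort(items, key=lambda x: x):
    """Sort via an auxiliary index array: comparisons touch only the keys,
    and the (possibly large) elements are moved once at the end."""
    order = sorted(range(len(items)), key=lambda i: key(items[i]))
    return [items[i] for i in order]

print(argsort_sort(["bbb", "a", "cc"], key=len))  # → ['a', 'cc', 'bbb']
```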
yes the input given here is a pathological case for our choice of pivot
but then, how do we decide on a better pivot, e.g. picking the first or center element has similar pathological input cases
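For what it's worth, one common mitigation (not what the builtin currently does, and any deterministic rule still has adversarial inputs) is median-of-three pivot selection, which at least handles already-sorted and reversed inputs well:

```python
def median_of_three(a, lo, hi):
    """Pick the median of the first, middle, and last elements as the pivot
    value. A common heuristic, sketched here for illustration."""
    mid = (lo + hi) // 2
    x, y, z = a[lo], a[mid], a[hi]
    if x > y:
        x, y = y, x
    if y > z:
        y, z = z, y
    if x > y:
        x, y = y, x
    return y  # the median of the three sampled values

# On a fully reversed list the sampled median is the true middle value:
print(median_of_three([9, 8, 7, 6, 5, 4, 3, 2, 1], 0, 8))  # → 5
```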
yeah timsort is interesting, but perhaps dated? The https://github.com/orlp/glidesort project is I think the current state of the art?
Anne Archibald said:
Quicksort has good average performance but has pathological failure cases. … A quick check would be to try sorting a "random" list … - if this is a normal speed, that's probably the problem.
The check you described (sorting a random list) sounds like an easy check to perform that would also provide some good insight. I’d be curious to see the results.
Though afaik, this specific problem is more likely related to what @Brendan Hansknecht was investigating. I do not expect a new sorting algorithm will fix this problem.
This is because I’ve also encountered other cases (e.g. writing a heap queue, other instances I can no longer remember, etc.) that use lists in what should be an efficient way (i.e. just using the O(1) operations like swap or dropLast), but exhibit notably poor runtime performance, much worse than what could be explained by just poor algorithm design.
This is just my hunch though, I don’t have a lot of data on this (aside from having written a heapsort in pure roc that outperforms the library sort function by only 2x, but I don’t think that’s super clear evidence for it).
Though afaik, this specific problem is more likely related to what @Brendan Hansknecht was investigating. I do not expect a new sorting algorithm will fix this problem.
My comment is part of the problem, but I think the sorting algorithm is likely a big part of the issue.
I think something that would be good to test is to pass an always-inline function into zig. That function would be the swap-element function, and it would be generated while knowing roc types. It would be a specialized function for every call to List.sortAsc/swap/etc.
Though I guess you probably can't always-inline a function pointer, so it is probably up to llvm to inline it still. That said, it should still be way faster than calling memcpy.
I thought that llvm has a pass that is meant for propagating constant function args, but I'm not sure how well it works in practice for something like this.
Ah okay, @Brendan Hansknecht thanks for clarifying.
I guess I’m showing my inexperience with both sorting algorithms and compiler magic, because I still suspect that it’s not a sorting-specific issue. Unfortunately I’m not well-equipped to explore this more. But I’m excited to see how this gets resolved! Thanks all for your work and kindness so far, and good luck bug-hunting!
I mean it definitely takes the combination of both. Bad sorting algorithm means way more swaps. Slow swaps mean way more time. Slow swaps but a good sorting algorithm should still be ok speed. But yeah, a multi-factor fix is needed.
I filed #6450 for the memory copy slowness.
Last updated: Jul 06 2025 at 12:14 UTC