Can anyone see any obvious inefficiencies in my merge sort algorithm here? I implemented it just for fun in Roc (along with a quicksort). However, my mergesort is an order of magnitude (or several, depending on list size) worse than my quicksort, despite them having theoretically the same time complexity, O(n log n).
Note that in order to sort the 100k element list with mergesort, I had to quadruple my stack limit from 8mb to 32mb (quicksort has no issues with a 1m list).
The mergesort is of course using recursion instead of List.walk to do the main sorting, which will increase stack usage, but merge should still recurse at most O(n) times at each level of mergesort recursion, just like List.walk is O(n) at each level of recursion in quicksort.
Either way, the change from walking the list to recursion doesn't explain 3 orders of magnitude difference to me...
> List.sortWith: sorted 100k elements in 117ms
----------------------------------
> Sort.quicksort: sorted 100k elements in 248ms
----------------------------------
> Sort.mergesort: sorted 100k elements in 122340ms
mergesort : List a, (a, a -> [LT, EQ, GT]) -> List a
mergesort = \list, compare ->
    if List.len list < 2 then
        list
    else
        midpoint = (List.len list) // 2
        { before, others } = List.split list midpoint
        merge (mergesort before compare) (mergesort others compare) compare

merge : List a, List a, (a, a -> [LT, EQ, GT]) -> List a
merge = \left, right, compare ->
    when (left, right) is
        ([], _) -> right
        (_, []) -> left
        ([l, .. as ls], [r, .. as rs]) ->
            when compare l r is
                GT -> List.prepend (merge left rs compare) r
                _ -> List.prepend (merge ls right compare) l
        (_, _) ->
            crash "merge: The previous cases should be exhaustive."
prepend is a bad idea. You would be better off appending and then reversing at the last moment. Every prepend will copy the entire list.
Also use List.withCapacity. That way List.prepend/List.append won't be allocating a bunch.
As an extra note, List.sortWith is actually really slow compared to how fast it could be. See #compiler development > slow sorting.
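To make that append suggestion concrete, here is a minimal sketch of such a rewrite (hypothetical code, not from the thread): a tail-recursive merge that appends into an accumulator pre-sized with List.withCapacity, so there are no per-element full-list copies and the self-recursion can compile to a loop.
mergeAcc : List a, List a, (a, a -> [LT, EQ, GT]) -> List a
mergeAcc = \left, right, compare ->
    go = \ls, rs, acc ->
        when (ls, rs) is
            ([], _) -> List.concat acc rs
            (_, []) -> List.concat acc ls
            ([l, .. as ltl], [r, .. as rtl]) ->
                when compare l r is
                    GT -> go ls rtl (List.append acc r)
                    _ -> go ltl rs (List.append acc l)
            (_, _) -> acc # unreachable, but keeps the match exhaustive
    go left right (List.withCapacity (List.len left + List.len right))
Since go only ever appends, the output already comes out in order and no final reverse is needed; the append-then-reverse trick matters when a result can only be built back-to-front.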
Oh also, the stack limit issue is probably due to #6434.
Brendan Hansknecht said:
- If you actually want merge sort to be as fast as possible, you probably need to do it in place with indices
Except that until in-place List mutations get more efficient as per our "bit-twiddling prime sieving" discussion, it is still going to be about 20 times slower than it should be...
Just for a point of reference, sorting a million random integers takes only one or two hundred milliseconds using arrays in languages that support direct array access; with "boxed" values, such as Python handles, it is a bit slower, somewhere in the range of half a second. Roc doesn't really have the tools to be able to do this "natively", nor can it be properly tested without bringing in random data from outside Roc, as has been mentioned. An immutable Linked List implementation in Roc of the "bottom-up" merge sort is as follows:
app [main] { pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.10.0/vNe6s9hWzoTZtFmNkvEICPErI9ptji_ySjicO6CkucY.tar.br" }

import pf.Stdout
import pf.Task

cSIZE : I32
cSIZE = 1000000

LinkedList a : [Empty, Cons a (LinkedList a)]

showLL : LinkedList a -> Str where a implements Inspect
showLL = \ lst ->
    shw = \ ill, oslst ->
        when ill is
            Empty -> Str.joinWith oslst ", "
            Cons v nxtll -> shw nxtll (List.append oslst (Inspect.toStr v))
    shw lst []

reverseLL : LinkedList a -> LinkedList a
reverseLL = \ lst ->
    rvs = \ ilst, olst ->
        when ilst is
            Empty -> olst
            Cons v nxtll -> rvs nxtll (Cons v olst)
    rvs lst Empty

# counts down from size - 1, generating an ascending LinkedList of I32s from zero...
buildList : I32 -> LinkedList I32
buildList = \ size ->
    bld = \ n, lst ->
        if n < 0 then lst
        else bld (n - 1) (Cons n lst)
    bld (size - 1) Empty

makelsts : LinkedList I32, LinkedList (LinkedList I32)
           -> LinkedList (LinkedList I32)
makelsts = \ lst, rlstlst ->
    when lst is
        Empty -> rlstlst
        Cons f ftl ->
            when ftl is
                Empty -> Cons (Cons f Empty) rlstlst
                Cons s stl ->
                    if f <= s then makelsts stl (Cons (Cons f (Cons s Empty)) rlstlst)
                    else makelsts stl (Cons (Cons s (Cons f Empty)) rlstlst)

append : LinkedList a, LinkedList a -> LinkedList a
append = \ llst, rlst ->
    when llst is
        Empty -> reverseLL rlst
        Cons v vtl -> append vtl (Cons v rlst)

merge : LinkedList I32, LinkedList I32, LinkedList I32 -> LinkedList I32
merge = \ lstx, lsty, rlst ->
    when lstx is
        Empty -> append lsty rlst
        Cons x xtl ->
            when lsty is
                Empty -> append lstx rlst
                Cons y ytl ->
                    if x <= y then merge xtl lsty (Cons x rlst)
                    else merge lstx ytl (Cons y rlst)

pairs : LinkedList (LinkedList I32), LinkedList (LinkedList I32)
        -> LinkedList (LinkedList I32)
pairs = \ lstlst, rlstlst ->
    when lstlst is
        Empty -> reverseLL rlstlst
        Cons xs xstl ->
            when xstl is
                Empty -> reverseLL (Cons xs rlstlst)
                Cons ys ystl -> pairs ystl (Cons (merge xs ys Empty) rlstlst)

loop : LinkedList (LinkedList I32) -> LinkedList I32
loop = \ lstlst ->
    when lstlst is
        Empty -> Empty
        Cons lst lsttl ->
            when lsttl is
                Empty -> lst
                _ -> loop (pairs lstlst Empty)

# can't be generalized because there is no Sort ability...
mergeSort : LinkedList I32 -> LinkedList I32
mergeSort = \ lst ->
    loop (makelsts lst Empty)

testSort : LinkedList I32 -> Str
testSort = \ list ->
    tst = \ lst, last ->
        when lst is
            Empty -> "Linked List is sorted!"
            Cons v vtl ->
                if v >= last then tst vtl v
                else "Failed to sort Linked List!!!!!"
    when list is
        Empty -> "Empty Linked Lists are considered sorted."
        Cons f ftl -> tst ftl f

main =
    Stdout.line! "Merge Sorting $(Inspect.toStr cSIZE) 32-bit integers..."
    Stdout.line! (buildList 100 |> mergeSort |> showLL)
    Stdout.line! (buildList cSIZE |> mergeSort |> testSort)
The above sorts a million values in about a quarter of a second, but that isn't really the worst case, as they are already mostly sorted, meaning that there won't be many missed branch predictions or cache misses from list nodes scattered all over heap memory; I would expect it to get three times or more slower if we could provide random data...
Why are you showing a linked list?
That will just make performance even worse due to pointer chasing.
If you want a simple but kinda efficient mergesort, I would do this:
(collapsed code block "alg", not expanded in this transcript)
Brendan Hansknecht said:
Why are you showing a linked list?
That will just make performance even worse due to pointer chasing.
That's my point: this linked list implementation is actually showing better performance than the current built-in "native" implementation of List.sortWith (although that is using quicksort under the covers), and it is currently much faster than a custom merge sort using the current state of List indexing, due to the non-optimized in-place mutation...
In Elm, Linked Lists are all we have that are suitable for writing alternate sorting algorithms, and code such as this sorts a million random Ints in a little over a second, comparable in speed to Haskell's built-in (lazy) List sort doing the same. That is a little over twice as slow as Elm's built-in sort, which calls the JavaScript engine's array sort, usually a Timsort written in C/C++ (converting the linked list to a JavaScript array at the beginning and back at the end), although that is itself slower for having to deal with JavaScript's boxed objects...
So the following is an implementation of merge sort using List indexing:
buildList : U64 -> List I32
buildList = \ size ->
    loopi = \ lst, i ->
        if i >= size then List.reverse lst
        else loopi (List.set lst i (Num.intCast i)) (i + 1)
    List.repeat 0 size |> loopi 0

sortByTwos : List I32 -> List I32
sortByTwos = \ list ->
    len = List.len list
    loopi = \ lst, i ->
        if i >= len then lst else
        lft = List.get lst (i - 1) |> Result.withDefault 0
        rgt = List.get lst i |> Result.withDefault 0
        if lft > rgt then
            loopi (List.swap lst (i - 1) i) (i + 2)
        else loopi lst (i + 2)
    loopi list 1

copylst : List I32, U64, List I32, U64, U64 -> List I32
copylst = \ srclst, si, dstlst, di, num ->
    if num <= 0 then dstlst else
    sv = List.get srclst si |> Result.withDefault 0
    copylst srclst (si + 1)
            (List.set dstlst di sv) (di + 1)
            (num - 1)

merge : List I32, U64, U64, U64, U64, List I32, U64 -> List I32
merge = \ ilst, xi, xlmt, yi, ylmt, olst, oi ->
    if xi >= xlmt then copylst ilst yi olst oi (ylmt - yi)
    else if yi >= ylmt then copylst ilst xi olst oi (xlmt - xi)
    else
        x = List.get ilst xi |> Result.withDefault 0
        y = List.get ilst yi |> Result.withDefault 0
        if x <= y then merge ilst (xi + 1) xlmt yi ylmt (List.set olst oi x) (oi + 1)
        else merge ilst xi xlmt (yi + 1) ylmt (List.set olst oi y) (oi + 1)

pairs : List I32, List I32, U64 -> List I32
pairs = \ srclst, dstlst, mrgsz ->
    len = List.len srclst
    loopi = \ dlst, i ->
        if i >= len then dlst else
        if i + mrgsz >= len then copylst srclst i dlst i (len - i) else
        xlmt = i + mrgsz
        ylmt = Num.min len (xlmt + mrgsz)
        loopi (merge srclst i xlmt xlmt ylmt dlst i) ylmt
    loopi dstlst 0

loop : List I32, List I32, U64 -> List I32
loop = \ srclst, dstlst, lstsz ->
    len = List.len srclst
    if lstsz >= len then srclst
    else loop (pairs srclst dstlst lstsz) srclst (lstsz * 2)

mergeSort : List I32 -> List I32
mergeSort = \ lst ->
    altlst = List.repeat 0 (List.len lst)
    lst |> sortByTwos |> loop altlst 2

testSort : List I32 -> Str
testSort = \ lst ->
    len = List.len lst
    loopi = \ i ->
        if i >= len then "List sorted correctly!" else
        f = List.get lst (i - 1) |> Result.withDefault 0
        s = List.get lst i |> Result.withDefault 0
        if s < f then "Error in List sorting!!!!!"
        else loopi (i + 1)
    loopi 1
This segfaults for some reason for a million entries, but it can sort a list of half a million integers in about 0.6 seconds, meaning that a million entries would take something like 1.3 seconds (O(n log n)); this is quite a bit slower than the Linked List merge sort above, which also doesn't segfault. If the optimization for in-place mutation of Lists (and whatever causes the segfault) were fixed, an indexed List sort would of course be faster...
It is also quite trivial to add recognition of pre-existing runs to the initialization for a Linked List based merge sort, unlike for indexed List sorts.
I don't say that Roc should use Linked List sorts, just that the current state of Roc doesn't really support anything better...
In my testing (might be an m1 mac memory thing), the merge sort I wrote above is approximately the same speed as your linked list merge sort (which is pretty sad cause it should be faster, though this merge sort version does allocate for all intermediate lists). Naive merge sort is actually a pretty bad algorithm for sorting contiguous arrays, even the inplace version.
That said, I would still recommend it over using a linked list sort. Cause the linked list will hurt with everything else you do even if it does sort faster.
We probably should pull our sorting algorithm out of zig for now in order to get reasonable performance. That or if someone has time and is willing to contribute, we need to generate element swapping functions and pass them into zig. That hopefully would inline and optimize happily. It would at a minimum be much better than memcpy for swapping elements.
But yeah, overall, the horrible performance here is definitely a major roc bug.
Brendan Hansknecht said:
the horrible performance here is definitely a major roc bug.
That's exactly the point I was trying to make; but also that the current implementation of sorting in Zig isn't very good either as you say...
Brendan Hansknecht said:
Naive merge sort is actually a pretty bad algorithm for sorting contiguous arrays, even the in-place version.
Ah, I think that by "naive" you mean top-down and intermediate-allocating, as per your implementation - yes, that kind of "naive" is pretty bad, although simpler; you'll notice that I prefer bottom-up, partly in-place sorting for my implementation, although it does still need to alternate between two arrays, with the final destination being the resulting one. I think your poorer results for the Linked List version are indeed due to slower memory allocation on macOS, where I was running on Linux, as I often see this difference.
It's funny you should say that, considering that the most common sorting implementations in most languages are just variations of array merge sorting, which includes Timsort, the new Powersort that Python currently uses, etc. If one wants a reasonably fast general-purpose stable sort where worst-case performance is about the same as average performance for random data, these seem to be state of the art; if one is willing to give up stable sorting or to accept extreme worst-case performance, there are some exotic sorting algorithms that can deliver for some special cases...
GordonBGood said:
you notice that I prefer bottom-up for partly in-place sorting for my implementation although it does still need to alternate between two arrays with the final destination the resulting one
Just to finish this up, now that there is a TESTING branch of nightlies that has fixed the in-place mutation so it is now very fast, indexed List/array merge sorting works about as fast as in any of the fastest languages, as per the following code, which does a bottom-up merge sort:
app [main] { pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.10.0/vNe6s9hWzoTZtFmNkvEICPErI9ptji_ySjicO6CkucY.tar.br" }

import pf.Stdout
import pf.Task
import pf.Utc exposing [ now, deltaAsMillis ]

cLIMIT : U64
cLIMIT = 1_000_000

RandSeed : (U64, U64)

initRandSeed : RandSeed
initRandSeed = (0x69B4C98CB8530805u64, 0xFED1DD3004688D68u64)

nextRandI64 : RandSeed -> (RandSeed, I64)
nextRandI64 = \ r ->
    s0 = r.0
    s1 = r.1
    ns1 = Num.bitwiseXor s0 s1
    nr0 = Num.shiftLeftBy s0 55 |> Num.bitwiseOr (Num.shiftRightZfBy s0 9)
          |> Num.bitwiseXor ns1 |> Num.bitwiseXor (Num.shiftLeftBy ns1 14) # a, b
    nr1 = Num.shiftLeftBy ns1 36 |> Num.bitwiseOr (Num.shiftRightZfBy ns1 28) # c
    ((nr0, nr1), Num.intCast (Num.addWrap s0 s1))

buildList : U64 -> List I64
buildList = \ size ->
    loopi = \ lst, r, i ->
        if i >= size then lst else
        (nr, rv) = nextRandI64 r
        loopi (List.set lst i rv) nr (i + 1)
    List.repeat 0 size |> loopi initRandSeed 0

mergeSort : List I64 -> List I64
mergeSort = \ ilist ->
    sortByTwos = \ list ->
        len = List.len list
        loopi = \ lst, i ->
            if i >= len then lst else
            lft = List.get lst (i - 1) |> Result.withDefault 0
            rgt = List.get lst i |> Result.withDefault 0
            if lft > rgt then loopi (List.swap lst (i - 1) i) (i + 2)
            else loopi lst (i + 2)
        loopi list 1
    copylst = \ srclst, si, dstlst, di, num ->
        if num <= 0 then dstlst else
        sv = List.get srclst si |> Result.withDefault 0
        copylst srclst (si + 1)
                (List.set dstlst di sv) (di + 1)
                (num - 1)
    merge = \ ilst, xi, xlmt, yi, ylmt, olst, oi ->
        if xi >= xlmt then copylst ilst yi olst oi (ylmt - yi)
        else if yi >= ylmt then copylst ilst xi olst oi (xlmt - xi)
        else
            x = List.get ilst xi |> Result.withDefault 0
            y = List.get ilst yi |> Result.withDefault 0
            if x <= y then merge ilst (xi + 1) xlmt yi ylmt (List.set olst oi x) (oi + 1)
            else merge ilst xi xlmt (yi + 1) ylmt (List.set olst oi y) (oi + 1)
    pairs = \ srclst, dstlst, mrgsz ->
        len = List.len srclst
        loopi = \ dlst, i ->
            if i >= len then dlst else
            if i + mrgsz >= len then copylst srclst i dlst i (len - i) else
            xlmt = i + mrgsz
            ylmt = Num.min len (xlmt + mrgsz)
            loopi (merge srclst i xlmt xlmt ylmt dlst i) ylmt
        loopi dstlst 0
    loop = \ srclst, dstlst, lstsz ->
        len = List.len srclst
        if lstsz >= len then srclst
        else loop (pairs srclst dstlst lstsz) srclst (lstsz * 2)
    altlst = List.repeat 0 (List.len ilist)
    ilist |> sortByTwos |> loop altlst 2

testSort : List I64 -> Bool
testSort = \ lst ->
    len = List.len lst
    loopi = \ i ->
        if i >= len then Bool.true else
        f = List.get lst (i - 1) |> Result.withDefault 0
        s = List.get lst i |> Result.withDefault 0
        if s < f then Bool.false else loopi (i + 1)
    loopi 1

main =
    tstlst = buildList cLIMIT
    start = now!
    answrlst = tstlst |> mergeSort
    stop = now!
    elpsdStr = deltaAsMillis start stop |> Num.toStr
    shwlst = answrlst |> List.takeFirst 100
    Stdout.line! "$(Inspect.toStr shwlst)"
    Stdout.line! ( if testSort answrlst then "List sorted correctly!"
                   else "Failure in sorting list!!!" )
    Stdout.line! "Sorted $(cLIMIT |> Num.toStr) integers in $(elpsdStr) milliseconds."
The above code sorts a million pseudo-random 64-bit integers in about 89 milliseconds on my AMD 7840HS CPU running at 5.1 GHz, single-threaded as here, and that is fast!!! As mentioned in the thread, bottom-up sorting avoids the many allocations for partitioning that many top-down versions do; this version just alternates merging between the original array and one alternate allocated array, and returns the last destination one.
Most languages' standard-library sorting functions prefer a merge sort because it is a stable sort, where equal values maintain their order in the List, and because worst-case sorting time is still O(n log n), unlike quicksort, which is not a stable sort and can be very slow when given a pre-sorted or partially sorted List with the "pivot" point(s) chosen incorrectly. For production, one would probably translate Python's new "powersort", which accounts for pre-existing sorted runs in the List/array, although that increases memory use from a factor of twice the size of the List/array, as here, by up to half again as much for a "stack" of the lengths of the found runs...
Awesome to see this panning out in practice as real perf gains!
I need to look into powersort more. To my understanding, glidesort actually uses it as a primitive, for example. Cause generally speaking, if you are willing to allocate a little (sqrt of n), you can sort faster than what you get out of a merge sort.
But you always need the merge sort as the backup for bad cases where quicksort would recurse too far or require too much memory.
looks like power sort also uses extra memory: but they use a linear-size buffer for efficient merging.
very cool!
btw I don't think stable sorting matters in our case because Roc as a language never exposes reference equality
so I don't think the stability of a sort is observable in Roc code, just in the host
It is noticeable in Roc cause the users can set the compare function.
main =
    x = [2.4, 1.2, 2.3, 2.2, 2.1]
    List.sortWith x \a, b ->
        Num.compare (Num.round a) (Num.round b)
    |> Inspect.toStr
    |> Stdout.line!
[1.2, 2.1, 2.4, 2.3, 2.2]
Our sort is currently unstable. Stable would produce:
[1.2, 2.4, 2.3, 2.2, 2.1]
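One standard way to recover stability from an unstable sort is to decorate each element with its original index and break comparison ties on that index. A hedged illustrative sketch (stableSortWith is a hypothetical name, not Roc's planned implementation):
stableSortWith : List a, (a, a -> [LT, EQ, GT]) -> List a
stableSortWith = \list, compare ->
    List.mapWithIndex list (\elem, i -> (elem, i))
    |> List.sortWith
        (\pairA, pairB ->
            when compare pairA.0 pairB.0 is
                EQ -> Num.compare pairA.1 pairB.1 # tie: keep original order
                other -> other)
    |> List.map (\pair -> pair.0)
On the example above this yields the stable [1.2, 2.4, 2.3, 2.2, 2.1], at the cost of carrying an extra index per element.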
ah fair point :thumbs_up:
Brendan Hansknecht said:
I need to look into powersort more. To my understanding, glidesort actually uses it as a primitive, for example. Cause generally speaking, if you are willing to allocate a little (sqrt of n), you can sort faster than what you get out of a merge sort.
All of these "in-place" List/array indexing sorts that respect pre-existing "runs" of data are going to need to allocate a little extra space to keep track of the lengths of the "runs" yet to be merged; that is the cost of being able to take advantage of those pre-existing runs while still enjoying the advantages of an indexing, in-place, stable sort. Managing that extra allocated memory then adds an execution-time cost that detracts somewhat from the advantage of using the pre-existing "runs".
I think the choice between powersort, glidesort, and the other variations gets into nit-picky territory: all try to balance the merges so that the algorithm is more often merging runs of about equal length for efficiency (fewer copy moves on average), even though the pre-existing runs may differ greatly in length. Some of the more exotic variations try to make intelligent decisions between different sorting methods based on the "shape" of the data, but it seems to me there will always be cases where those decisions fail. The very competent Python contributor Tim Peters went through all the research and comparisons and chose powersort, but says that there may be (obscure - GBG) cases where other algorithms give better performance. I think it's a situation where slightly different algorithms may give somewhat better performance for certain run combinations but then worse for situations where powersort is better.
It seems to me that we could save time by respecting Tim Peters' research and just implementing a Roc version of the C code he uses for the new Python standard-library sort...
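To make the "runs" bookkeeping concrete, here is a small illustrative helper (hypothetical, not from powersort's actual code) that measures the length of the pre-existing non-descending run starting at a given index; run-aware sorts push such lengths onto a run stack instead of always merging fixed-size chunks:
runLength : List I64, U64 -> U64
runLength = \ lst, i -> # assumes i < List.len lst
    len = List.len lst
    go = \ j ->
        prev = List.get lst (j - 1) |> Result.withDefault 0
        cur = List.get lst j |> Result.withDefault 0
        if j < len && prev <= cur then go (j + 1)
        else j - i # the run covers indices i ..< j
    go (i + 1)
For example, runLength [3, 5, 7, 2, 4] 0 is 3. The extra memory debated below is essentially this stack of run lengths plus the merge buffer.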
I agree with the general sentiment here, and I think that powersort would be a reasonable algorithm for roc. Like it obviously won't be bad. Clearly will be better than what we currently have by a lot. From real world measurements, it is essentially strictly better than timsort. That said, timsort has had a well known really bad case for a long time where it ends up merging length ~1 arrays with giant arrays.
That said, I think there is definitely more to consider here:
- python is always dealing with python objects. That significantly changes the sorting tradeoff due to extra pointer indirection and the actual thing being sorted always being relatively small in size.
- Algorithms like glidesort, fluxsort, powersort, quadsort, etc are consistently able to beat timsort (by a large margin). So it isn't a good benchmark.
- Powersort does not require a little bit of extra memory, it requires a lot of extra memory. In the worst case, it requires an element buffer of size n/2. For any given merge of arrays A and B it requires min(len(A), len(B)) extra memory.
- Quadsort, which is roughly equivalent to powersort, can be significantly slower than fluxsort (a quicksort-like wrapper to quadsort) for certain cases where data is more organized.
- blitsort with sqrt of n memory has approximately equivalent perf to fluxsort.
Brendan Hansknecht said:
That said, timsort has had a well known really bad case for a long time where it ends up merging length ~1 arrays with giant arrays.
Yes, that has been known about timsort for quite some time, which is why most implementations are a modified timsort to avoid the corner case, and it is probably part of the drive to change to powersort, which can be proven not to have that case...
That said, I think there is definitely more to consider here:
- python is always dealing with python objects. That significantly changes the sorting tradeoff due to extra pointer indirection and the actual thing being sorted always being relatively small in size.
True, but that is a tradeoff for any sorting algorithm in digging down to the actual "key" comparison value, and the comparisons should still be valid; it's just that in some cases unboxed values will be faster, and some that don't box, such as tuples and records, may be slower...
- Algorithms like glidesort, fluxsort, powersort, quadsort, etc are consistently able to beat timsort (by a large margin). So it isn't a good benchmark.
Skipping timsort, we are only concerned with comparisons between powersort and the others...
- Powersort does not require a little bit of extra memory, it requires a lot of extra memory. In the worst case, it requires an element buffer of size n/2. For any given merge of arrays A and B it requires min(len(A), len(B)) extra memory.
Well, efficient "in-place" mergesort requires as much extra memory as the array to be sorted, so adding an extra half again doesn't seem all that bad - if memory use is a problem, one can switch to one of the others that use a little less, but as far as I can see, all of them use some extra if they are to be stable sorts...
- Quadsort, which is roughly equivalent to powersort, can be significantly slower than fluxsort (a quicksort-like wrapper to quadsort) for certain cases where data is more organized.
There are always cases...
- blitsort with sqrt of n memory has approximately equivalent perf to fluxsort.
If low memory use is an essential criteria, then that looks like it might be a good alternative, but what are the other trade offs?
If low memory use is an essential criteria
This is probably a very good question that we should be asking ourselves. How much memory are we ok with roc's default sort using? @Richard Feldman any thoughts?
I had always assumed the goal was essentially constant memory, but that was definitely a naive assumption. I guess in essentially all commonly used language sorting algorithms, the answer is roughly n/2 to n extra memory. I think the main gain of glidesort and blitsort is that they use essentially constant memory (can be restricted to truly constant, but sqrt of n is required for full perf) while being competitive with powersort/quadsort. Also, due to taking advantage of a few patterns and a top-level quicksort, they in general have slightly better perf.
but what are the other trade offs?
Main tradeoff is in very large merges. They will require more cycles of the buffer which is slower than just having a single large buffer. In practice, this is not much of a cost as long as you have the sqrt of n memory.
it's a good question; by default I'd say top priority is fastest overall sorting performance, but it's not the only consideration
for example, if the absolute fastest thing is 5x the memory usage of the next one but only 1% faster, that doesn't sound worth it to me
Yeah, I think the two main choices are n or basically constant.
so I'd say constant memory usage is not a requirement, but less memory usage is separately valuable from faster overall perf, although faster overall perf is the most valuable
also I think the default sorting algorithm will most often be used to sort things of under a thousand elements
Using n is definitely faster. Though most cases will be similar in performance, in the bad cases for using basically constant memory, it can be 25% slower (though that is 25% slower than something significantly faster than timsort, which most languages still use).
what's the coefficient here? like n times how many bytes?
Sorry, n elements. So 2 whole copies of the array.
As a note, no matter which algorithm we use (within the bounds of modern fast algorithms), my gut feeling is that the cost of the cmp function, the copy function (i.e. not giving zig the element size and alignment at compile time), and dealing with refcounts will be the biggest slowdowns. Those are all things we may need to specialize over to get perf.
when we say the one that doubles the elements is faster, is that in a microbenchmark?
or in real programs where the cost of cache misses due to evictions from that much memory usage is factored in
I would guess it holds in practice too, given most languages use sorting algorithms that duplicate the array or at least require a buffer of approximately half the length of the array.
That said, it may be a case that most arrays that get sorted easily fit at least L3 cache.
So the buffer as little cost.
Also, the buffers are used linearly, so cache locality is still good even with the large buffer.
Richard Feldman said:
or in real programs where the cost of cache misses due to evictions from that much memory usage is factored in
I think that's as in "real" programs, as most algorithms are using some sort of merge under the covers, which is feeding from two sources and writing the merged result to another location; since associative caches use multi-way association, this is well within their capability, and there will just be jumps between the different merge streams, which will quickly get the cache up to speed. In the above merge sort benchmark, the main cost is that of comparisons at about 24 CPU clock cycles each, and that is slow because the sorting data is seen as random and therefore unpredictable...
Brendan Hansknecht said:
Also, the buffers are used linearly, so cache locality is still good even with the large buffer.
As @Brendan Hansknecht says...
yeah, given all that, it sounds like the fastest one is worth the extra memory usage!
So in that case that leaves 3 main options:
Here are the rough pluses and minuses in my opinion:
fluxsort:
+ To my understanding, should be the fastest in most cases on most hardware
+ Can take advantage of patterns for more perf
+ Tuned specifically with the idea of passing in a comparison function
- Does more low-level optimizations that, while simple to understand, may not port properly
glidesort:
~ might be the fastest in the future (depends very heavily on hardware, has optimizations for very new hardware)
+ can take advantage of patterns for more perf
- definitely the most complex to port (sorry rust)
- has some extra tradeoffs made due to panic safety (but we technically don't need to port those)
powersort:
+ Definitely the simplest
+ Vetted by python adopting it
- Likely the slowest here, but still very fast overall. Worst cases:
1. Lots of equivalent elements (like `rand %100` or enums).
2. Patterns that are more complex than many repeated runs.
- More comparisons on average
Note, when I talk about powersort being the slowest in many cases, I would expect it to be within, let's say, 3%. In the specific edge cases like lots of equality, it can be a huge diff, potentially 2x.
And definitely take all of this with a solid grain of salt. This is from some local benchmarks of existing code, but mostly from trusting existing benchmarks, reading papers on algorithm implementations and tradeoffs, and intuition.
The only way to truly know these tradeoffs would be to implement them all with the constraints of roc.
I'm guessing it's a fair bit of work to implement these. Like you wouldn't just implement a couple and have a flag?
What about implementing the simple one first, and then potentially exploring the more experimental glidesort?
a fair bit of work
100%. In fact, the simple one is still pretty complex (if you want it fast and not just a naive port of the algorithm as a whole).
A note on the glidesort gains: I think they are a double-edged sword. To my understanding, they are likely what makes it slower on middle-end, low-end, and embedded hardware, though in 5 years it will probably be fine on middle-end. It merges in an interleaved manner to take advantage of out-of-order execution and multiple memory streams. This is just larger overhead and potential cache conflicts on older/lower-resource hardware.
That said, all algorithms use at least 2-way merging to help some with out-of-order execution and multiple memory streams.
In fact, the newest paper on powersort uses four-way merging, but it has too much overhead to gain perf unless you have sentinel values (values that always sort greater than all other values). It uses them to avoid slice/subarray bounds checks, just putting them at the end of the chunks being merged.
Glidesort looks to have managed multi-way merging wins, but they are the main reason the perf varies so much by hardware.
Brendan Hansknecht said:
and intuition...
My intuition says try powersort or some slight derivation thereof, and maybe compare it with fluxsort if that isn't too much more complex. At this stage, nothing is cast in stone and you can always revisit it later, and either of these, or even my simple merge sort as per the above, is better than the Zig code you have now...
Brendan Hansknecht said:
A note on the glidesort gains: I think they are a double-edged sword. To my understanding, they are likely what makes it slower on middle-end, low-end, and embedded hardware, though in 5 years it will probably be fine on middle-end. It merges in an interleaved manner to take advantage of out-of-order execution and multiple memory streams. This is just larger overhead and potential cache conflicts on older/lower-resource hardware.
I don't think that's a big downside
it's hard to imagine a few percentage points of sorting perf being a major issue for embedded use cases; I wouldn't expect sorting to be a significant percentage of what they spend their time on regardless
and definitely optimizing for where hardware is going seems like a better idea than optimizing for where it is today, when there aren't yet any big Roc deployments that could actually take advantage of that on today's hardware anyway :big_smile:
I do like that fluxsort is tuned for accepting a comparison function though; I definitely expect that to come up a lot!
based on that, fluxsort seems like the frontrunner to me unless there's some really significant implementation advantage to powersort being simpler, but it sounds like the simplicity of the algorithm itself is a relatively minor percentage of the overall project regardless, yeah?
Yeah, it will be curious to see once we have one of these in zig, what the real overhead ends up being.
I don't think we could write raw roc as optimally, but the overhead of closures in zig and lack of mono could hit hard.
Long term will certainly try multiple configs.
That said, at least currently, we don't have anything interesting in roc to benchmark it against. So we'll have to think about what benchmarks we actually care about when it comes to perf here. What are our "realworld" enough sorting benchmarks.
Brendan Hansknecht said:
What are our "realworld" enough sorting benchmarks.
Also as in Zig versus in "native" Roc...
There is also that it would be really nice to have the Sort Ability in place, rather than having to depend on whatever comparable behavior defaults there are for whatever Type one wants to use as a sorting key...
I don't quite follow that statement, can you elaborate?
The current plan for Sort is that it is a mapping from 2 inputs of the same type to [EQ, LT, GT].
Brendan Hansknecht said:
I don't quite follow that statement, can you elaborate?
I mean that in order to implement a sort function, the contents need to be known to be comparable unless everything is automatically comparable (unless it contains a function), so tuples, records, Str's (???), Char's, and custom tagged union types would all be comparable by these criteria (tuples, records, and unions not containing a function); if they aren't comparable, then one could use the sortWith or sortBy variation, but the usual protocol is to apply the constraint to the type variable as in Elm (sort : List comparable -> List comparable) or with an Ability (sort : List a -> List a where a implements Sort)...
Ah yeah, sort will be auto-derived, but not for strings, functions, or opaques. So most types will just have sort.
We will expose sort : List a -> List a where a implements Sort, but it will just wrap List.sortWith. The Sort ability will be used to generate the comparison functions.
We eventually may also want to add raw primitive sorts that automatically get dispatched to by List.sort, but that will depend on the gains it sees. If we are lucky, llvm might be smart enough to optimize sortWith on primitives and inline the indirect call to the constant compare function.
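As a rough sketch of that plan (hedged; the actual ability definition and derivation details are still up in the air, so treat the names as hypothetical):
Sort implements
    compare : a, a -> [LT, EQ, GT] where a implements Sort

sort : List a -> List a where a implements Sort
sort = \ list ->
    List.sortWith list compare
A caller could then just write sort myList for any auto-derivable element type, with List.sortWith still available for custom orderings.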