Stream: beginners

Topic: Obvious inefficiencies in merge sort algorithm?


view this post on Zulip Ian McLerran (Jun 17 2024 at 14:39):

Can anyone see any obvious inefficiencies in my merge sort algorithm here? Implemented just for fun in Roc (as well as quicksort). However, my mergesort is an order of magnitude worse than my quicksort (or several orders, depending on list size), despite the two having the same theoretical time complexity, O(n log n).

Note that in order to sort the 100k-element list with mergesort, I had to quadruple my stack limit from 8 MB to 32 MB (quicksort has no issues with a 1M-element list).

The mergesort is of course using recursion instead of List.walk to do the main sorting, which will increase stack usage, but merge should still recurse at most O(n) deep at each level of mergesort recursion, just like List.walk is O(n) at each level of recursion in quicksort.

Either way, the change from walking the list to recursion doesn't explain a 3-orders-of-magnitude difference to me...

> List.sortWith: sorted 100k elements in 117ms
----------------------------------
> Sort.quicksort: sorted 100k elements in 248ms
----------------------------------
> Sort.mergesort: sorted 100k elements in 122340ms

mergesort : List a, (a, a -> [LT, EQ, GT]) -> List a
mergesort = \list, compare ->
    if List.len list < 2 then
        list
    else
        midpoint = (List.len list) // 2
        { before, others } = List.split list midpoint
        merge (mergesort before compare) (mergesort others compare) compare

merge : List a, List a, (a, a -> [LT, EQ, GT]) -> List a
merge = \left, right, compare ->
    when (left, right) is
        ([], _) -> right
        (_, []) -> left
        ([l, .. as ls], [r, .. as rs]) ->
            when compare l r is
                GT -> List.prepend (merge left rs compare) r
                _ -> List.prepend (merge ls right compare) l
        (_,_) ->
            crash "merge: The previous cases should be exhaustive."

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 15:09):

  1. prepend is a bad idea. You would be better off appending and then reversing at the last moment. Every prepend will copy the entire list.

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 15:12):

  2. You really should preallocate the output lists, using some sort of structure where you start with List.withCapacity. That way List.prepend/List.append won't be allocating a bunch.
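
For illustration, a minimal sketch of a merge along those lines - assuming a hypothetical tail-recursive helper mergeInto that carries an accumulator preallocated with List.withCapacity (and since the smaller head is appended at each step, the output comes out already in order, so this particular function needs no final reverse):

# sketch only: mergeInto is a hypothetical helper, not code from the thread
mergeInto : List a, List a, (a, a -> [LT, EQ, GT]), List a -> List a
mergeInto = \left, right, compare, acc ->
    when (left, right) is
        ([], _) -> List.concat acc right
        (_, []) -> List.concat acc left
        ([l, .. as ls], [r, .. as rs]) ->
            when compare l r is
                GT -> mergeInto left rs compare (List.append acc r)
                _ -> mergeInto ls right compare (List.append acc l)
        (_, _) ->
            crash "mergeInto: The previous cases should be exhaustive."

merge : List a, List a, (a, a -> [LT, EQ, GT]) -> List a
merge = \left, right, compare ->
    # preallocate the full output so List.append never has to reallocate
    mergeInto left right compare (List.withCapacity (List.len left + List.len right))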

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 15:12):

  3. If you actually want merge sort to be as fast as possible, you probably need to do it in place with indices.

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 15:16):

As an extra note, List.sortWith is actually really slow compared to how fast it could be; see #compiler development > slow sorting.

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 15:17):

Oh also, the stack limit issue is probably due to #6434.

view this post on Zulip GordonBGood (Jun 18 2024 at 07:16):

Brendan Hansknecht said:

  3. If you actually want merge sort to be as fast as possible, you probably need to do it in place with indices

Except that until in-place List mutations get more efficient as per our "bit-twiddling prime sieving" discussion, it is still going to be about 20 times slower than it should be...

Just for a point of reference, sorting a million random integers takes only a hundred or two milliseconds using arrays in languages that support direct array access, while with "boxed" values such as Python handles it is a bit slower, somewhere in the range of half a second. Roc doesn't really have the tools to do this "natively", nor can it be properly tested without bringing in random data from outside Roc, as has been mentioned. An immutable linked-list implementation of "bottom-up" merge sort in Roc is as follows:

app [main] { pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.10.0/vNe6s9hWzoTZtFmNkvEICPErI9ptji_ySjicO6CkucY.tar.br" }

import pf.Stdout
import pf.Task

cSIZE : I32
cSIZE = 1000000

LinkedList a : [Empty, Cons a (LinkedList a)]

showLL : LinkedList a -> Str where a implements Inspect
showLL = \ lst ->
  shw = \ ill, oslst ->
    when ill is
      Empty -> Str.joinWith oslst ", "
      Cons v nxtll -> shw nxtll (List.append oslst (Inspect.toStr v))
  shw lst []

reverseLL : LinkedList a -> LinkedList a
reverseLL = \ lst ->
  rvs = \ ilst, olst ->
    when ilst is
      Empty -> olst
      Cons v nxtll -> rvs nxtll (Cons v olst)
  rvs lst Empty

# generates a LinkedList of I32's by consing from size - 1 down to zero (so the result is ascending)...
buildList : I32 -> LinkedList I32
buildList = \ size ->
  bld = \ n, lst ->
    if n < 0 then lst
    else bld (n - 1) (Cons n lst)
  bld (size - 1) Empty

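# splits the input into a (reversed) list of sorted one- or two-element runs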
makelsts : LinkedList I32, LinkedList (LinkedList I32)
             -> LinkedList (LinkedList I32)
makelsts = \ lst, rlstlst ->
  when lst is
    Empty -> rlstlst
    Cons f ftl ->
      when ftl is
        Empty -> Cons (Cons f Empty) rlstlst
        Cons s stl ->
          if f <= s then makelsts stl (Cons (Cons f (Cons s Empty)) rlstlst)
          else makelsts stl (Cons (Cons s (Cons f Empty)) rlstlst)

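# returns reverse(rlst) ++ llst; finishes a merge whose output was accumulated in reverse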
append : LinkedList a, LinkedList a -> LinkedList a
append = \ llst, rlst ->
  when llst is
    Empty -> reverseLL rlst
    Cons v vtl -> append vtl (Cons v rlst)

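# merges two sorted lists, accumulating the output in reverse in rlst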
merge : LinkedList I32, LinkedList I32, LinkedList I32 -> LinkedList I32
merge = \ lstx, lsty, rlst ->
  when lstx is
    Empty -> append lsty rlst
    Cons x xtl ->
      when lsty is
        Empty -> append lstx rlst
        Cons y ytl ->
          if x <= y then merge xtl lsty (Cons x rlst)
          else merge lstx ytl (Cons y rlst)

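# merges adjacent pairs of runs, halving the number of runs per pass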
pairs : LinkedList (LinkedList I32), LinkedList (LinkedList I32)
          -> LinkedList (LinkedList I32)
pairs = \ lstlst, rlstlst ->
  when lstlst is
    Empty -> reverseLL rlstlst
    Cons xs xstl ->
      when xstl is
        Empty -> reverseLL (Cons xs rlstlst)
        Cons ys ystl -> pairs ystl (Cons (merge xs ys Empty) rlstlst)

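# keeps merging pairs of runs until a single sorted run remains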
loop : LinkedList (LinkedList I32) -> LinkedList I32
loop = \ lstlst ->
  when lstlst is
    Empty -> Empty
    Cons lst lsttl ->
      when lsttl is
        Empty -> lst
        _ -> loop (pairs lstlst Empty)

# can't be generalized because there is no Sort ability...
mergeSort : LinkedList I32 -> LinkedList I32
mergeSort = \ lst ->
  loop (makelsts lst Empty)

testSort : LinkedList I32 -> Str
testSort = \ list ->
  tst = \ lst, last ->
    when lst is
      Empty -> "Linked List is sorted!"
      Cons v vtl ->
        if v >= last then tst vtl v
        else "Failed to sort Linked List!!!!!"
  when list is
    Empty -> "Empty Linked Lists are considered sorted."
    Cons f ftl -> tst ftl f

main =
  Stdout.line! "Merge Sorting \(Inspect.toStr cSIZE) 32-bit integers..."
  Stdout.line! (buildList 100 |> mergeSort |> showLL)
  Stdout.line! (buildList cSIZE |> mergeSort |> testSort)

The above sorts a million values in about a quarter of a second, but that isn't really the worst case, as they are already mostly sorted, meaning there won't be many missed branch predictions or cache misses from list nodes scattered all over heap memory; I would expect it to get three or more times slower if we could provide random data...

view this post on Zulip Brendan Hansknecht (Jun 18 2024 at 14:51):

Why are you showing a linked list?

view this post on Zulip Brendan Hansknecht (Jun 18 2024 at 14:52):

That will just make performance even worse due to pointer chasing.

view this post on Zulip Brendan Hansknecht (Jun 18 2024 at 21:55):

If you want a simple but kinda efficient mergesort, I would do this:

alg
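
A minimal sketch of one way that could look - assuming the top-down shape of the original mergesort combined with the append-based merge sketched earlier (it still allocates all the intermediate lists, as noted further down the thread):

# sketch: top-down merge sort, reusing the capacity-preallocating merge above
mergesort : List a, (a, a -> [LT, EQ, GT]) -> List a
mergesort = \list, compare ->
    if List.len list < 2 then
        list
    else
        midpoint = List.len list // 2
        { before, others } = List.split list midpoint
        merge (mergesort before compare) (mergesort others compare) compare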

view this post on Zulip GordonBGood (Jun 18 2024 at 23:17):

Brendan Hansknecht said:

Why are you showing a linked list?
That will just make performance even worse due to pointer chasing.

That's my point: this linked-list implementation is actually showing better performance than the current built-in "native" implementation of List.sortWith (although that is using quicksort under the covers), and it is currently much faster than a custom merge sort using the current state of List indexing, due to the non-optimized in-place mutation...

In Elm, linked lists are all we have that are suitable for writing alternate sorting algorithms, and code such as this sorts a million random Ints in a little over a second, comparable in speed to Haskell's built-in (lazy) list sort doing the same. That is a little over twice as slow as Elm's built-in sort, which calls the JavaScript engine's array sort - usually a Timsort written in C/C++ - converting the linked list to a JavaScript array at the beginning and back at the end, although it is slowed by having to deal with JavaScript's boxed objects...

So the following is an implementation of merge sort using List indexing:

buildList : U64 -> List I32
buildList = \ size ->
  loopi = \ lst, i ->
    if i >= size then List.reverse lst
    else loopi (List.set lst i (Num.intCast i)) (i + 1)
  List.repeat 0 size |> loopi 0

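# sorts each adjacent pair in place, producing sorted runs of length two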
sortByTwos : List I32 -> List I32
sortByTwos = \ list ->
  len = List.len list
  loopi = \ lst, i ->
    if i >= len then lst else
    lft = List.get lst (i - 1) |> Result.withDefault 0
    rgt = List.get lst i |> Result.withDefault 0
    if lft > rgt then
      loopi (List.swap lst (i - 1) i) (i + 2)
    else loopi lst (i + 2)
  loopi list 1

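# copies num elements from srclst starting at si into dstlst starting at di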
copylst : List I32, U64, List I32, U64, U64 -> List I32
copylst = \ srclst, si, dstlst, di, num ->
  if num <= 0 then dstlst else
  sv = List.get srclst si |> Result.withDefault 0
  copylst srclst (si + 1)
          (List.set dstlst di sv) (di + 1)
          (num - 1)

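# merges the sorted runs ilst[xi..xlmt) and ilst[yi..ylmt) into olst starting at oi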
merge : List I32, U64, U64, U64, U64, List I32, U64 -> List I32
merge = \ ilst, xi, xlmt, yi, ylmt, olst, oi ->
  if xi >= xlmt then copylst ilst yi olst oi (ylmt - yi)
  else if yi >= ylmt then copylst ilst xi olst oi (xlmt - xi)
  else
    x = List.get ilst xi |> Result.withDefault 0
    y = List.get ilst yi |> Result.withDefault 0
    if x <= y then merge ilst (xi + 1) xlmt yi ylmt (List.set olst oi x) (oi + 1)
    else merge ilst xi xlmt (yi + 1) ylmt (List.set olst oi y) (oi + 1)

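# merges adjacent runs of length mrgsz from srclst into dstlst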
pairs : List I32, List I32, U64 -> List I32
pairs = \ srclst, dstlst, mrgsz ->
  len = List.len srclst
  loopi = \ dlst, i ->
    if i >= len then dlst else
    if i + mrgsz >= len then copylst srclst i dlst i (len - i) else
    xlmt = i + mrgsz
    ylmt = Num.min len (xlmt + mrgsz)
    loopi (merge srclst i xlmt xlmt ylmt dlst i) ylmt
  loopi dstlst 0

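# doubles the run size each pass, alternating source and destination buffers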
loop : List I32, List I32, U64 -> List I32
loop = \ srclst, dstlst, lstsz ->
  len = List.len srclst
  if lstsz >= len then srclst
  else loop (pairs srclst dstlst lstsz) srclst (lstsz * 2)

mergeSort : List I32 -> List I32
mergeSort = \ lst ->
  altlst = List.repeat 0 (List.len lst)
  lst |> sortByTwos |> loop altlst 2

testSort : List I32 -> Str
testSort = \ lst ->
  len = List.len lst
  loopi = \ i ->
    if i >= len then "List sorted correctly!" else
    f = List.get lst (i - 1) |> Result.withDefault 0
    s = List.get lst i |> Result.withDefault 0
    if s < f then "Error in List sorting!!!!!"
    else loopi (i + 1)
  loopi 1

This segfaults for some reason for a million entries, but it can sort a list of half a million integers in about 0.6 seconds, meaning that a million entries would take something like 1.3 seconds (O(n log n)); that is quite a bit slower than the linked-list merge sort above, which also doesn't segfault. If the optimization for in-place mutation of Lists (and whatever causes the segfault) were fixed, an indexed List sort would of course be faster...

It is also quite trivial to add recognition of pre-existing runs to the initialization for a Linked List based merge sort, unlike for indexed List sorts.

I don't say that Roc should use Linked List sorts, just that the current state of Roc doesn't really support anything better...

view this post on Zulip Brendan Hansknecht (Jun 19 2024 at 00:49):

In my testing (might be an M1 Mac memory thing), the merge sort I wrote above is approximately the same speed as your linked list merge sort (which is pretty sad cause it should be faster, though this merge sort version does allocate for all intermediate lists). Naive merge sort is actually a pretty bad algorithm for sorting contiguous arrays, even the in-place version.

That said, I would still recommend it over using a linked list sort. Cause the linked list will hurt with everything else you do even if it does sort faster.

We probably should pull our sorting algorithm out of Zig for now in order to get reasonable performance. That, or if someone has time and is willing to contribute, we need to generate element-swapping functions and pass them into Zig. That hopefully would inline and optimize happily. It would at a minimum be much better than memcpy for swapping elements.

view this post on Zulip Brendan Hansknecht (Jun 19 2024 at 00:52):

But yeah, overall, the horrible performance here is definitely a major roc bug.

view this post on Zulip GordonBGood (Jun 19 2024 at 01:53):

Brendan Hansknecht said:

the horrible performance here is definitely a major roc bug.

That's exactly the point I was trying to make; but also that the current implementation of sorting in Zig isn't very good either, as you say...

Brendan Hansknecht said:

Naive merge sort is actually a pretty bad algorithm for sorting contiguous arrays, even the in-place version.

Ah, I think that by "naive" you mean top-down and allocating intermediate lists, as per your implementation - yes, that kind of "naive" is pretty bad, although simpler; you'll notice that I prefer bottom-up for partly in-place sorting in my implementation, although it does still need to alternate between two arrays, with the final destination being the resulting one. I think your poorer results for the linked-list version are indeed due to slower memory allocation on macOS, where I was running on Linux; I often see this difference.

It's funny you should say that, considering that the most common sorting implementations in most languages are just variations of array merge sorting, including Timsort, the new powersort that Python currently uses, etc.; if one wants a general-purpose stable sort with good performance whose worst case is about the same as its average on random data, these seem to be state-of-the-art. If one is willing to give up stable sorting or accept extreme worst-case performance, there are some exotic sorting algorithms that can deliver for some special cases...

view this post on Zulip GordonBGood (Jul 22 2024 at 18:51):

GordonBGood said:

you'll notice that I prefer bottom-up for partly in-place sorting in my implementation, although it does still need to alternate between two arrays, with the final destination being the resulting one

Just to finish this up, now that there is a TESTING branch of nightlies that has fixed the in-place mutation so it is now very fast, indexed List/array merge sorting works about as fast as in any of the fastest languages, as per the following code, which does a bottom-up merge sort:

app [main] { pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.10.0/vNe6s9hWzoTZtFmNkvEICPErI9ptji_ySjicO6CkucY.tar.br" }

import pf.Stdout
import pf.Task
import pf.Utc exposing [ now, deltaAsMillis ]

cLIMIT : U64
cLIMIT = 1_000_000

RandSeed : (U64, U64)

initRandSeed : RandSeed
initRandSeed = (0x69B4C98CB8530805u64, 0xFED1DD3004688D68u64)

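# xoroshiro128+ step; rotation constants a = 55, b = 14, c = 36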
nextRandI64 : RandSeed -> (RandSeed, I64)
nextRandI64 = \ r ->
  s0 = r.0
  s1 = r.1
  ns1 = Num.bitwiseXor s0 s1
  nr0 = Num.shiftLeftBy s0 55 |> Num.bitwiseOr (Num.shiftRightZfBy s0 9)
          |> Num.bitwiseXor ns1 |> Num.bitwiseXor (Num.shiftLeftBy ns1 14) # a, b
  nr1 = Num.shiftLeftBy ns1 36 |> Num.bitwiseOr (Num.shiftRightZfBy ns1 28) # c
  ((nr0, nr1), Num.intCast (Num.addWrap s0 s1))

buildList : U64 -> List I64
buildList = \ size ->
  loopi = \ lst, r, i ->
    if i >= size then lst else
    (nr, rv) = nextRandI64 r
    loopi (List.set lst i rv) nr (i + 1)
  List.repeat 0 size |> loopi initRandSeed 0

mergeSort : List I64 -> List I64
mergeSort = \ ilist ->
  sortByTwos = \ list ->
    len = List.len list
    loopi = \ lst, i ->
      if i >= len then lst else
      lft = List.get lst (i - 1) |> Result.withDefault 0
      rgt = List.get lst i |> Result.withDefault 0
      if lft > rgt then loopi (List.swap lst (i - 1) i) (i + 2)
      else loopi lst (i + 2)
    loopi list 1
  copylst = \ srclst, si, dstlst, di, num ->
    if num <= 0 then dstlst else
    sv = List.get srclst si |> Result.withDefault 0
    copylst srclst (si + 1)
            (List.set dstlst di sv) (di + 1)
            (num - 1)
  merge = \ ilst, xi, xlmt, yi, ylmt, olst, oi ->
    if xi >= xlmt then copylst ilst yi olst oi (ylmt - yi)
    else if yi >= ylmt then copylst ilst xi olst oi (xlmt - xi)
    else
      x = List.get ilst xi |> Result.withDefault 0
      y = List.get ilst yi |> Result.withDefault 0
      if x <= y then merge ilst (xi + 1) xlmt yi ylmt (List.set olst oi x) (oi + 1)
      else merge ilst xi xlmt (yi + 1) ylmt (List.set olst oi y) (oi + 1)
  pairs = \ srclst, dstlst, mrgsz ->
    len = List.len srclst
    loopi = \ dlst, i ->
      if i >= len then dlst else
      if i + mrgsz >= len then copylst srclst i dlst i (len - i) else
      xlmt = i + mrgsz
      ylmt = Num.min len (xlmt + mrgsz)
      loopi (merge srclst i xlmt xlmt ylmt dlst i) ylmt
    loopi dstlst 0
  loop = \ srclst, dstlst, lstsz ->
    len = List.len srclst
    if lstsz >= len then srclst
    else loop (pairs srclst dstlst lstsz) srclst (lstsz * 2)
  altlst = List.repeat 0 (List.len ilist)
  ilist |> sortByTwos |> loop altlst 2

testSort : List I64 -> Bool
testSort = \ lst ->
  len = List.len lst
  loopi = \ i ->
    if i >= len then Bool.true else
    f = List.get lst (i - 1) |> Result.withDefault 0
    s = List.get lst i |> Result.withDefault 0
    if s < f then Bool.false else loopi (i + 1)
  loopi 1

main =
    tstlst = buildList cLIMIT
    start = now!
    answrlst = tstlst |> mergeSort
    stop = now!
    elpsdStr = deltaAsMillis start stop |> Num.toStr
    shwlst = answrlst |> List.takeFirst 100
    Stdout.line! "$(Inspect.toStr shwlst)"
    Stdout.line! ( if testSort answrlst then "List sorted correctly!"
                   else "Failure in sorting list!!!" )
    Stdout.line! "Sorted $(cLIMIT |> Num.toStr) integers in $(elpsdStr) milliseconds."

The above code sorts a million pseudo-random 64-bit integers in about 89 milliseconds on my AMD 7840HS CPU running at 5.1 GHz, single-threaded as here, and that is fast!!! As mentioned in the thread, bottom-up sorting avoids the many allocations for partitioning that many top-down versions do; this version just alternates sorting between the original and the alternate allocated array and returns whichever was the last destination.

Most languages' standard library sorting functions prefer a merge sort because it is a stable sort, where equal values maintain their order in the List, and because worst-case sorting time is still O(n log n), unlike quicksort, which is not a stable sort and can be very slow when given a pre-sorted or partially sorted List with the "pivot" point(s) chosen incorrectly. For production, one would probably translate Python's new powersort, which takes advantage of pre-existing runs in the List/array, although it increases memory use beyond the factor of two of the List size used here, by up to half again as much, for a "stack" of the lengths of the found runs...

view this post on Zulip Brendan Hansknecht (Jul 22 2024 at 19:03):

Awesome to see this panning out in practice as real perf gains!


I need to look into powersort more. To my understanding, glidesort actually uses it as a primitive, for example. Cause generally speaking, if you are willing to allocate a little (sqrt of n), you can sort faster than what you get out of a merge sort.

But you always need the merge sort as the backup for bad cases where quicksort would recurse too far or require too much memory.

view this post on Zulip Brendan Hansknecht (Jul 22 2024 at 19:06):

Looks like powersort also uses extra memory; they use a linear-size buffer for efficient merging.

view this post on Zulip Richard Feldman (Jul 22 2024 at 19:09):

very cool!

view this post on Zulip Richard Feldman (Jul 22 2024 at 19:10):

btw I don't think stable sorting matters in our case because Roc as a language never exposes reference equality

view this post on Zulip Richard Feldman (Jul 22 2024 at 19:10):

so I don't think the stability of a sort is observable in Roc code, just in the host

view this post on Zulip Brendan Hansknecht (Jul 22 2024 at 19:14):

It is noticeable in Roc cause users can set the compare function.

main =
    x = [2.4, 1.2, 2.3, 2.2, 2.1]
    List.sortWith x \a, b ->
        Num.compare (Num.round a) (Num.round b)
    |> Inspect.toStr
    |> Stdout.line!

[1.2, 2.1, 2.4, 2.3, 2.2]

Our sort is currently unstable. Stable would produce:
[1.2, 2.4, 2.3, 2.2, 2.1]

view this post on Zulip Richard Feldman (Jul 22 2024 at 19:15):

ah fair point :thumbs_up:

view this post on Zulip GordonBGood (Jul 23 2024 at 03:24):

Brendan Hansknecht said:

I need to look into power sort more. To my understanding glidesort actually uses it as a primitive for example. Cause generally speaking if you are will to allocate a little (sqrt of N) you can sort faster than what you get out of a merge sort.

All of these "in-place" List/array indexing sorts that respect pre-existing "runs" of data are going to need to allocate a little extra space to keep track of the length of the "runs" yet to be sorted, which is their cost in order to be able to take advantage of those pre-existing runs yet enjoy the advantages of an indexing in-place stable sort; then, managing that extra allocated memory will add to their execution time cost that will detract from the advantage of using the pre-existing "runs".

I think that the choices between powersort, glidesort, and any other variations gets into nit-picky type of territory: All try to balance out the number of merges so that the sorting algorithm more often is merging runs of about equal length for efficiency (less copy moves on average) even though the pre-existing runs may differ greatly in length. Some of the more exotic variations try to do intelligent decisions between different sorting methods based on the "shape" of the data, but it seems to me there will always be cases where the decisions will fail. The very competent Python contributor, Tim Peters, went through all the research and comparisons and chose powersort, but says that there may be (obscure - GBG) cases where other algorithms give better performance. I think its a situation where other slightly different algorithms may give somewhat better performance for certain run combinations but then worse for some situations where power sort is better.

It seems to me that we could save time by respecting Tim Peter's research and just implementing a Roc version of the C code he uses for the new Python standard library sort...

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 04:15):

I agree with the general sentiment here, and I think that powersort would be a reasonable algorithm for Roc. Like, it obviously won't be bad. Clearly it will be better than what we currently have by a lot. From real-world measurements, it is essentially strictly better than timsort. That said, timsort has had a well-known really bad case for a long time where it ends up merging length-~1 arrays with giant arrays.

That said, I think there is definitely more to consider here:

  1. Python is always dealing with Python objects. That significantly changes the sorting tradeoffs due to extra pointer indirection and the actual thing being sorted always being relatively small in size.
  2. Algorithms like glidesort, fluxsort, powersort, quadsort, etc. are consistently able to beat timsort (by a large margin). So it isn't a good benchmark.
  3. Powersort does not require a little bit of extra memory; it requires a lot of extra memory. In the worst case, it requires an element buffer of size n/2. For any given merge of arrays A and B, it requires min(len(A), len(B)) extra memory.
  4. Quadsort, which is roughly equivalent to powersort, can be significantly slower than fluxsort (a quicksort-like wrapper around quadsort) for certain cases where data is more organized.
  5. blitsort, with sqrt of n memory, has approximately equivalent perf to fluxsort.

view this post on Zulip GordonBGood (Jul 23 2024 at 04:52):

Brendan Hansknecht said:

  That said, timsort has had a well-known really bad case for a long time where it ends up merging length-~1 arrays with giant arrays.

Yes, that has been known about timsort for quite some time, which is why most implementations are a modified timsort that avoids the corner case, and it is probably part of the drive to change to powersort, which can be proven not to have that case...

That said, I think there is definitely more to consider here:

  1. Python is always dealing with Python objects. That significantly changes the sorting tradeoffs due to extra pointer indirection and the actual thing being sorted always being relatively small in size.

True, but that is a trade-off for any sorting algorithm in digging down to the actual "key" comparison value, and the comparisons should still be valid; it's just that some cases of unboxed values will be faster, and some that don't box, such as tuples and records, may be slower...

  2. Algorithms like glidesort, fluxsort, powersort, quadsort, etc. are consistently able to beat timsort (by a large margin). So it isn't a good benchmark.

Skipping timsort, we are only concerned with comparisons between powersort and the others...

  3. Powersort does not require a little bit of extra memory; it requires a lot of extra memory. In the worst case, it requires an element buffer of size n/2. For any given merge of arrays A and B, it requires min(len(A), len(B)) extra memory.

Well, an efficient array mergesort requires as much extra memory as the array to be sorted, so adding up to half again as much doesn't seem all that bad - if memory use is a problem, one can switch to one of the others that uses a little less, but as far as I can see, all of them use some extra memory if they are to be stable sorts...

  4. Quadsort, which is roughly equivalent to powersort, can be significantly slower than fluxsort (a quicksort-like wrapper around quadsort) for certain cases where data is more organized.

There are always cases...

  5. blitsort, with sqrt of n memory, has approximately equivalent perf to fluxsort.

If low memory use is an essential criteria, then that looks like it might be a good alternative, but what are the other trade offs?

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:02):

If low memory use is an essential criteria

This is probably a very good question that we should be asking ourselves: how much memory are we OK with Roc's default sort using? @Richard Feldman any thoughts?

I had always assumed the goal was essentially constant memory, but that was definitely a naive assumption. I guess in essentially all commonly used languages' sorting algorithms, the answer is roughly n/2 to n extra memory. I think the main gain of glidesort and blitsort is that they use essentially constant memory (can be restricted to truly constant, but sqrt of n is required for full perf) while being competitive with powersort/quadsort. Also, due to taking advantage of a few patterns and a top-level quicksort, they in general have slightly better perf.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:05):

but what are the other trade offs?

Main tradeoff is in very large merges. They will require more cycles through the buffer, which is slower than just having a single large buffer. In practice, this is not much of a cost as long as you have the sqrt of n memory.

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:41):

it's a good question; by default I'd say top priority is fastest overall sorting performance, but it's not the only consideration

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:42):

for example, if the absolute fastest thing is 5x the memory usage of the next one but only 1% faster, that doesn't sound worth it to me

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:43):

Yeah, I think the two main choices are n or basically constant.

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:43):

so I'd say constant memory usage is not a requirement, but less memory usage is separately valuable from faster overall perf, although faster overall perf is the most valuable

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:44):

also I think the default sorting algorithm will most often be used to sort things of under a thousand elements

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:44):

Using n is definitely faster. Though most cases will be similar in performance, in the bad cases for using basically constant memory, it can be 25% slower (though that is 25% slower than something significantly faster than timsort, which most languages still use).

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:45):

what's the coefficient here? like n times how many bytes?

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:46):

Sorry, n elements. So 2 whole copies of the array.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:47):

As a note, no matter which algorithm we use (within the bounds of modern fast algorithms), my gut feeling is that the cost of the cmp function, the copy function (i.e. not giving Zig element size and alignment at compile time), and dealing with refcounts will be the biggest slowdowns. Those are all things we may need to specialize over to get perf.

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:49):

when we say the one that doubles the elements is faster, is that in a microbenchmark?

view this post on Zulip Richard Feldman (Jul 23 2024 at 05:50):

or in real programs where the cost of cache misses due to evictions from that much memory usage is factored in

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:53):

I would guess in practice, given that most languages use sorting algorithms that duplicate the array or at least require a buffer of approximately half the length of the array.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:56):

That said, it may be the case that most arrays that get sorted easily fit in at least L3 cache.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:57):

So the buffer has little cost.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 05:57):

Also, the buffers are used linearly, so cache locality is still good even with the large buffer.

view this post on Zulip GordonBGood (Jul 23 2024 at 05:58):

Richard Feldman said:

or in real programs where the cost of cache misses due to evictions from that much memory usage is factored in

I think that holds in "real" programs, as most algorithms use some sort of merge under the covers, feeding from two sources and writing the merged result to another location; since associative caches use multi-way association, this is well within their capability, and there will just be jumps between the different merge streams, which will quickly get the cache up to speed. In the above merge sort benchmark, the main cost is that of comparisons, at about 24 CPU clock cycles each, and this is slow because the sorting data is seen as random and therefore unpredictable...

view this post on Zulip GordonBGood (Jul 23 2024 at 05:59):

Brendan Hansknecht said:

Also, the buffers are used linearly, so cache locality is still good even with the large buffer.

As @Brendan Hansknecht says...

view this post on Zulip Richard Feldman (Jul 23 2024 at 06:00):

yeah, given all that, it sounds like the fastest one is worth the extra memory usage!

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 06:24):

So in that case that leaves 3 main options:

  1. fluxsort
  2. glidesort (with full n element memory)
  3. powersort

Here are the rough pluses and minuses in my opinion:

fluxsort:

+ To my understanding should be the fastest in the most cases on most hardware
+ Can take advantage of patterns for more perf
+ Tuned specifically with the idea of passing in a comparison function
- Does more low level optimizations that while simple to understand may not port properly

glidesort:

~ might be the fastest in the future (depends very heavily on hardware, has optimizations for very new hardware)
+ can take advantage of patterns for more perf
- definitely the most complex to port (sorry rust)
- has some extra tradeoffs made due to panic safety (but we technically don't need to port those)

powersort:

+ Definitely the simplest
+ Vetted by python adopting it
- Likely the slowest here, but still very fast overall. Worst cases:
    1. Lots of equivalent elements (like `rand % 100` or enums).
    2. Patterns that are more complex than many repeated runs.
- More comparisons on average

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 06:32):

Note, when I talk about powersort being the slowest in many cases, I would expect it to be within, let's say, 3%. In the specific edge cases like lots of equality, it can be a huge diff, potentially 2x.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 06:34):

And definitely take all of this with a solid grain of salt. This is from some local benchmarks of existing code, but mostly from trusting existing benchmarks, reading papers on algorithm implementations and tradeoffs, and intuition.

The only way to truly know these tradeoffs would be to implement them all with the constraints of roc.

view this post on Zulip Luke Boswell (Jul 23 2024 at 06:37):

I'm guessing it's a fair bit of work to implement these. Like you wouldn't just implement a couple and have a flag?

view this post on Zulip Luke Boswell (Jul 23 2024 at 06:37):

What about implementing the simple one first, and then potentially exploring the more experimental glidesort?

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 06:41):

a fair bit of work

100%. In fact, the simple one is still pretty complex (if you want it fast and not just a naive port of the algorithm as a whole).

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 07:00):

A note on the glidesort gains: I think they are a double-edged sword. To my understanding, they are likely what makes it slower on middle-end, low-end, and embedded hardware, though in 5 years it probably will be fine on middle-end. It merges in an interleaved manner to take advantage of out-of-order execution and multiple memory streams. This is just larger overhead and potential cache conflicts on older/lower-resource hardware.

That said, all these algorithms use at least 2-way merging to help some with out-of-order execution and multiple memory streams.

In fact, in the newest paper on powersort, it uses four-way merging, but that has too much overhead to gain perf unless you have sentinel values (values that always sort greater than all other values). It uses them to avoid slice/subarray bounds checks, just putting them at the end of the chunks being merged.

Glidesort looks to have managed multi-way merging wins, but they are the main reason its perf varies so much by hardware.

view this post on Zulip GordonBGood (Jul 23 2024 at 12:59):

Brendan Hansknecht said:

and intuition...

My intuition says try powersort or some slight variation thereof, and maybe compare it with fluxsort if that isn't too much more complex. At this stage, nothing is cast in stone and you can always revisit it later; either of these, or even my simple merge sort as above, is better than the Zig code you have now...

view this post on Zulip Richard Feldman (Jul 23 2024 at 13:18):

Brendan Hansknecht said:

  A note on the glidesort gains: I think they are a double-edged sword. To my understanding, they are likely what makes it slower on middle-end, low-end, and embedded hardware, though in 5 years it probably will be fine on middle-end. It merges in an interleaved manner to take advantage of out-of-order execution and multiple memory streams. This is just larger overhead and potential cache conflicts on older/lower-resource hardware.

I don't think that's a big downside

view this post on Zulip Richard Feldman (Jul 23 2024 at 13:20):

it's hard to imagine a few percentage points of sorting perf being a major issue for embedded use cases; I wouldn't expect sorting to be a significant percentage of what they spend their time on regardless

view this post on Zulip Richard Feldman (Jul 23 2024 at 13:22):

and definitely optimizing for where hardware is going seems like a better idea than optimizing for where it is today, when there aren't yet any big Roc deployments that could actually take advantage of that on today's hardware anyway :big_smile:

view this post on Zulip Richard Feldman (Jul 23 2024 at 13:24):

I do like that fluxsort is tuned for accepting a comparison function though; I definitely expect that to come up a lot!

view this post on Zulip Richard Feldman (Jul 23 2024 at 13:26):

based on that, fluxsort seems like the frontrunner to me unless there's some really significant implementation advantage to powersort being simpler, but it sounds like the simplicity of the algorithm itself is a relatively minor percentage of the overall project regardless, yeah?

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 14:51):

Yeah, it will be curious to see, once we have one of these in Zig, what the real overhead ends up being.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 14:53):

I don't think we could write raw Roc as optimally, but the overhead of closures in Zig and the lack of monomorphization could hit hard.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 14:55):

Long term, we will certainly try multiple configs.

view this post on Zulip Brendan Hansknecht (Jul 23 2024 at 15:28):

That said, at least currently, we don't have anything interesting in Roc to benchmark it against. So we'll have to think about what benchmarks we actually care about when it comes to perf here. What are our "real-world" enough sorting benchmarks?

view this post on Zulip GordonBGood (Jul 24 2024 at 02:31):

Brendan Hansknecht said:

What are our "realworld" enough sorting benchmarks.

Also, as in Zig versus in "native" Roc...

There is also the fact that it would be really nice to have the Sort ability in place, rather than having to depend on whatever default comparable behavior there is for whatever type one wants to use as a sorting key...

view this post on Zulip Brendan Hansknecht (Jul 24 2024 at 02:51):

I don't quite follow that statement, can you elaborate?

The current plan for Sort is that it is a mapping from 2 inputs of the same type to [EQ, LT, GT].

view this post on Zulip GordonBGood (Jul 24 2024 at 03:26):

Brendan Hansknecht said:

I don't quite follow that statement, can you elaborate?

I mean that in order to implement a sort function, the contents need to be known to be comparable, unless everything is automatically comparable (except when it contains a function); so tuples, records, Strs (???), Chars, and custom tagged union types would all be comparable by these criteria (tuples, records, and unions not containing a function). If they aren't comparable, one could use the sortWith or sortBy variation, but the usual protocol is to apply the constraint to the type variable, as in Elm (sort : List comparable -> List comparable) or with an ability (sort : List a -> List a where a implements Sort)...

view this post on Zulip Brendan Hansknecht (Jul 24 2024 at 03:33):

Ah yeah, sort will be auto-derived, but not for strings, functions, or opaques. So most types will just have sort.

We will expose sort : List a -> List a where a implements Sort, but it will just wrap List.sortWith. The Sort ability will be used to generate the comparison functions.

We eventually may also want to add raw primitive sorts that automatically get dispatched to by List.sort, but that will depend on the gains it sees. If we are lucky, LLVM might be smart enough to optimize sortWith on primitives and inline the indirect call to the constant compare function.
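
For instance, a minimal sketch of that wrapper, assuming an ability named Sort with a single compare member (illustrative names, not a settled API):

# sketch: a Sort ability whose single member supplies the comparison,
# with sort as a thin wrapper over List.sortWith
Sort implements
    compare : a, a -> [LT, EQ, GT] where a implements Sort

sort : List a -> List a where a implements Sort
sort = \list -> List.sortWith list compare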


Last updated: Jul 06 2025 at 12:14 UTC