faster sorting · compiler development

Brendan Hansknecht (Jul 23 2024 at 21:49):

Given I had already started on quadsort for blitsort, I am just transitioning over to working on quadsort for fluxsort. The difference is that fluxsort uses more memory in exchange for more performance.

It will definitely be slow to port quadsort because it has a lot of low-level optimizations to be branchless where possible. This tends to be a large perf win for random data, where the branch predictor is constantly wrong. It also has some low-level routines for small arrays to make them really fast.
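As a rough illustration of what "branchless" means here, the sketch below sorts four values with a fixed sequence of compare-and-swap steps that use arithmetic masks instead of an `if` on the comparison result. It is a generic 5-comparator sorting network, not quadsort's actual small-array routine, and Python is only used to show the arithmetic; the benefit shows up in compiled code, where each step can become a conditional move. The names `minmax` and `sort4` are made up for this example.

```python
def minmax(a, b):
    """Return (min, max) of two ints without branching on the comparison."""
    mask = -int(a > b)       # 0 when already ordered, -1 (all ones) when swapping
    diff = (a ^ b) & mask    # 0 when ordered, a ^ b when swapping
    return a ^ diff, b ^ diff


def sort4(values):
    """Sort exactly four ints with a fixed 5-step sorting network."""
    a, b, c, d = values
    a, b = minmax(a, b)
    c, d = minmax(c, d)
    a, c = minmax(a, c)
    b, d = minmax(b, d)
    b, c = minmax(b, c)
    return [a, b, c, d]
```

The same five steps run no matter what the input looks like, so there is nothing for the branch predictor to guess wrong.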

It is just really cool to see the assembly. It has one jump table for the size of the input. It has 4 conditional branches (all for early returns when data happens to be sorted, only 1 can be hit for a given size array). All other control flow is unconditional jumps/calls.

Brendan Hansknecht (Jul 23 2024 at 21:53):

The other nice part of these primitives for quadsort is that they should be transferable to any other sort. Like if we use powersort/glidesort in the future, it still would be a gain to drop into these really optimized routines for sub arrays that are tiny.
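A minimal sketch of that idea, assuming a plain top-down merge sort as the outer algorithm: once a sub-array is small enough, it hands off to a dedicated small-array routine. Here `tiny_sort` is just an insertion sort standing in for the optimized primitives, and the cutoff `TINY = 8` is an arbitrary value picked for illustration.

```python
TINY = 8  # hypothetical cutoff for switching to the small-array routine


def tiny_sort(items, lo, hi):
    """Insertion sort on items[lo:hi]; stand-in for a tuned small-array routine."""
    for i in range(lo + 1, hi):
        x = items[i]
        j = i - 1
        while j >= lo and items[j] > x:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = x


def merge_sort(items, lo=0, hi=None):
    """Sort items[lo:hi] in place, dropping into tiny_sort for small ranges."""
    if hi is None:
        hi = len(items)
    if hi - lo <= TINY:
        tiny_sort(items, lo, hi)
        return
    mid = (lo + hi) // 2
    merge_sort(items, lo, mid)
    merge_sort(items, mid, hi)
    left, right = items[lo:mid], items[mid:hi]
    i = j = 0
    k = lo
    while i < len(left) and j < len(right):
        if right[j] < left[i]:      # take from the left on ties to stay stable
            items[k] = right[j]
            j += 1
        else:
            items[k] = left[i]
            i += 1
        k += 1
    items[k:hi] = left[i:] + right[j:]  # one of these two slices is empty
```

Swapping the outer algorithm for something else would leave `tiny_sort` untouched, which is the transferability being described.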

Brendan Hansknecht (Jul 25 2024 at 00:19):

quadsort (not fully implemented, and still a few bugs, but enough to run first tests):

And this is just quadsort. The fluxsort wrapper can theoretically make it even faster.

Brendan Hansknecht (Jul 25 2024 at 00:33):

Brendan Hansknecht (Jul 25 2024 at 00:34):

Brendan Hansknecht (Jul 25 2024 at 00:35):

Luke Boswell (Jul 25 2024 at 03:03):

Luke Boswell (Jul 25 2024 at 03:04):

Brendan Hansknecht (Jul 25 2024 at 03:04):

The raw C quadsort on the same size list of longs takes 27ms if the comparison is inlined and 99ms if the comparison is not inlined.

Richard Feldman (Jul 25 2024 at 04:41):

Brendan Hansknecht (Jul 25 2024 at 04:44):

Yeah, though llvm is doing a ton of heavy lifting. It manages to inline the comparison function even though we pass it into zig as a pointer.

Also, I haven't done any perf tuning yet. I know of at least one location where for sure the zig is not generating quite the same code as the gotos in C. I bet there are other locations as well.

Brendan Hansknecht (Jul 25 2024 at 04:47):

Also, I currently have a broken case. Not sure what is going wrong. Just disabled the case for now. Re-enabling it should be some form of perf gain assuming it was getting hit. With it disabled, the sort is falling back to branchless fully unordered mode more often.

Brendan Hansknecht (Jul 25 2024 at 04:47):

Also, I guess it must have been getting hit, otherwise, it wouldn't be able to break things.

Brendan Hansknecht (Jul 25 2024 at 06:23):

Brendan Hansknecht (Jul 25 2024 at 06:25):

Brendan Hansknecht (Jul 25 2024 at 06:26):

So definitely still more gains from adding the wrapper layer (it also is much simpler than quadsort)

Brendan Hansknecht (Jul 25 2024 at 06:33):

Richard Feldman (Jul 25 2024 at 11:20):

Brendan Hansknecht (Jul 25 2024 at 15:39):

Interesting, our perf is way worse on m1 mac than on x86.... still better than merge sort, but much farther from the C version.

Brendan Hansknecht (Jul 25 2024 at 23:23):

I can't seem to get flamegraphs from roc into zig working on my mac. So pretty hard to debug the perf.

Brendan Hansknecht (Jul 25 2024 at 23:26):

But yeah, on m1 mac we are like 90ms, but the C is 25ms. Even C with the comparison as no inline is 75ms.

Brendan Hansknecht (Jul 25 2024 at 23:36):

If I average 10 runs, it is 67ms, which is a lot closer, but still not 25ms, and much farther away than the 34ms I get on linux.

Brendan Hansknecht (Jul 25 2024 at 23:40):

Luke Boswell (Jul 26 2024 at 00:14):

Sad because all the time is spent in one function. And we can't see into that function, because that's the boundary where roc calls into zig?

Brendan Hansknecht (Jul 26 2024 at 00:16):

Brendan Hansknecht (Jul 26 2024 at 00:17):

That's with me turning off strip on the zig builtins. So I feel like there should be debug info

Luke Boswell (Jul 26 2024 at 00:18):

How are you building it? Is roc driving the linking? Maybe you could --no-link and then use zig to build-exe?

Luke Boswell (Jul 26 2024 at 00:18):

Brendan Hansknecht (Jul 26 2024 at 00:18):

I probably could build a zig app that just directly imports sort.zig. That probably would give me info.

Luke Boswell (Jul 26 2024 at 00:19):

Brendan Hansknecht (Jul 26 2024 at 00:27):

Brendan Hansknecht (Jul 26 2024 at 01:32):

Apparently essentially all of the time is spent in the copy function...which confuses me cause I'm pretty sure it is getting inlined, so not really sure why it is taking up so much time.

Brendan Hansknecht (Jul 26 2024 at 01:39):

Hmm, I just hit a segfault on linux. Maybe I have a bug and sorting is looping like crazy sorting mem outside of the array or incrementing incorrectly somehow

Brendan Hansknecht (Jul 26 2024 at 07:01):

Using a zig app as the driver, I have managed to fix a few bugs. That said, I am still confused by the perf.

On M1 mac, the zig version takes around 70ms. The c++ around 25ms.
If I bench the zig version or run it with a profiler, it takes around 45ms.
If I use a cycle profiler to compare the two, they are within 10% cycle count wise.

I don't know of a good equivalent to perf on mac to get time spent on cache misses and waiting on data.
Probably gonna move on soon and just work on other things.

Brendan Hansknecht (Jul 26 2024 at 07:02):

Oh, also, in the profiler, almost all the time is spent copying data (which I wouldn't expect C to do better assuming I implemented the algorithm correctly and am copying the same data around).

Given it runs well on x86, that also suggests that I implemented the algorithm correctly.

Brendan Hansknecht (Jul 26 2024 at 07:03):

Brendan Hansknecht (Jul 26 2024 at 16:42):

Got fuzzing working. Ran it overnight. Definitely found a few edge cases. That said, for 213 million runs, 552 test cases isn't bad. And I would guess that many test cases have the same root cause.
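For readers who have not set up this kind of testing, here is a much simpler stand-in: plain random differential testing that compares a sort under test against Python's `sorted`. The real runs presumably used a proper fuzzer against the zig code, so treat the `fuzz` helper and its parameters as hypothetical.

```python
import random


def fuzz(sort_under_test, runs=1000, seed=0):
    """Return the inputs on which sort_under_test disagrees with sorted()."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        n = rng.randrange(0, 64)
        # A narrow value range forces many duplicates, which stresses tie handling.
        data = [rng.randrange(-8, 8) for _ in range(n)]
        try:
            ok = sort_under_test(list(data)) == sorted(data)
        except Exception:
            ok = False  # a crash counts as a failing test case
        if not ok:
            failures.append(data)
    return failures
```

Many failing inputs can share one root cause, so the length of the returned list overstates the number of distinct bugs.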

Brendan Hansknecht (Jul 27 2024 at 01:14):

So turns out the bug was a non-bug. I turned a comment in the original C into an assert. Turns out the comment was slightly off. Updated the comment and assert. Running more fuzzing. I think we may be bug free!

Brendan Hansknecht (Jul 27 2024 at 09:01):

quadsort is working, fuzzed, and can fall back to sorting indirectly for giant values.
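"Sorting indirectly" presumably means sorting small handles (pointers or indices) that refer to the elements, so that each giant value is only moved once at the end. A minimal sketch of that idea, using list indices as the handles; `sort_indirect` is a hypothetical name, not the builtin's API.

```python
def sort_indirect(items, key=lambda x: x):
    """Sort by shuffling small indices, then place each large item once."""
    order = list(range(len(items)))
    order.sort(key=lambda i: key(items[i]))  # only ints are moved during the sort
    return [items[i] for i in order]         # one final move per item
```

The trade-off is an extra indirection on every comparison, which is why this is usually reserved for elements that are expensive to copy.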

Brendan Hansknecht (Jul 28 2024 at 00:05):

Definitely a bit of a pervasive PR, but we have refcounting using a comptime bool to add the code or not. Also, did my best to minimize the times we actually call to increment the refcount. Attempt to batch wherever possible. So plenty of places where we increment by 4 or 16 for example.

Brendan Hansknecht (Jul 28 2024 at 00:05):

Brendan Hansknecht (Jul 28 2024 at 00:06):

Most places where we have to increment by only 1 are in compares that are used with short circuit evaluation. So if the first compare fails, the rest won't run.
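A toy model of the batching, under the assumption that every call to the comparison consumes one reference to whatever the comparison closure captured. The class `Captures` and the helper functions are invented for illustration and do not mirror the actual zig code.

```python
class Captures:
    """Hypothetical refcounted data captured by the comparison closure."""

    def __init__(self):
        self.refcount = 1
        self.inc_calls = 0  # how many times we paid for an increment call

    def inc(self, n=1):
        self.refcount += n
        self.inc_calls += 1


def less(caps, a, b):
    caps.refcount -= 1  # assumption: each compare call consumes one reference
    return a < b


def count_less_than(caps, block, pivot):
    """Every compare always runs, so a single increment covers the block."""
    caps.inc(len(block))
    return sum(less(caps, x, pivot) for x in block)


def is_ascending_triple(caps, a, b, c):
    """Short circuit: the second compare may not run, so increment by 1 each time."""
    caps.inc()
    if not less(caps, a, b):
        return False
    caps.inc()
    return less(caps, b, c)
```

In the batched case the number of increment calls stays at one regardless of block size, while the short-circuit case cannot know up front how many compares will execute.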

Brendan Hansknecht (Jul 28 2024 at 00:06):

Brendan Hansknecht (Jul 28 2024 at 00:37):

Ran it through my existing fuzz corpus and also a bit after. Missed one refcount location, but otherwise, refcounting seems to work happily.

Brendan Hansknecht (Jul 28 2024 at 00:38):

Brendan Hansknecht (Jul 28 2024 at 22:31):

sort strategies

quicksort 1000 elems
median of nine: 2938417487880084056
partition strategy: default
    quicksort 332 elems
    median of nine: 6817335681333689882
    partition strategy: default
        quicksort 120 elems
        median of nine: 7738655368416626995
        partition strategy: default
            mergesort 78 elems
            mergesort 42 elems
        quicksort 212 elems
        median of nine: 3983938187880039198
        partition strategy: default
            quicksort 163 elems
            median of nine: 5445429211482806183
            partition strategy: default
                mergesort 82 elems
                mergesort 81 elems
            mergesort 49 elems
    quicksort 668 elems
    median of nine: -4781184063276946074
    partition strategy: default
        quicksort 426 elems
        median of nine: 764764259046233900
        partition strategy: default
            quicksort 117 elems
            median of nine: 2096483146860794127
            partition strategy: default
                mergesort 45 elems
                mergesort 72 elems
            quicksort 309 elems
            median of nine: -1875810895047206237
            partition strategy: default
                quicksort 137 elems
                median of nine: -254467376449306342
                partition strategy: default
                    mergesort 48 elems
                    mergesort 89 elems
                quicksort 172 elems
                median of nine: -3930025406165151116
                partition strategy: default
                    quicksort 121 elems
                    median of nine: -3020111438722416420
                    partition strategy: default
                        mergesort 66 elems
                        mergesort 55 elems
                    mergesort 51 elems
        quicksort 242 elems
        median of nine: -7138956423631159883
        partition strategy: default
            quicksort 137 elems
            median of nine: -5931406857942076890
            partition strategy: default
                mergesort 68 elems
                mergesort 69 elems
            quicksort 105 elems
            median of nine: -7657208392615692120
            partition strategy: default
                mergesort 29 elems
                mergesort 76 elems

Brendan Hansknecht (Jul 28 2024 at 22:33):

With fluxsort, the more random the data is, the more it uses quicksort. If many elements of the input have the same value, it will use dual pivot quicksort to skip large chunks of sorting. If the array is relatively ordered or is simply small, it will fall back to mergesort.
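The decision logic can be sketched roughly as follows. This is a simplified model, not fluxsort itself: the cutoff `SMALL = 96` is an assumption (the trace above only shows that it lies somewhere between 89 and 105 elements), the "relatively ordered" test is a made-up heuristic, `sorted` stands in for the mergesort fallback, and a three-way split stands in for the equal-element handling.

```python
SMALL = 96  # assumed cutoff; the trace only bounds it between 89 and 105


def flux_like_sort(items, trace, depth=0):
    """Return a sorted copy of items, recording each strategy choice in trace."""
    n = len(items)
    if n < 2:
        return list(items)
    descents = sum(a > b for a, b in zip(items, items[1:]))
    if n <= SMALL or descents <= n // 8:       # small or already mostly ordered
        trace.append((depth, "mergesort", n))
        return sorted(items)                   # stand-in for the merge fallback
    samples = sorted(items[:: n // 9][:9])
    pivot = samples[4]                         # median of nine spaced samples
    trace.append((depth, "quicksort", n))
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]   # never revisited: the skipped chunk
    greater = [x for x in items if x > pivot]
    return (flux_like_sort(less, trace, depth + 1)
            + equal
            + flux_like_sort(greater, trace, depth + 1))
```

Printing `trace` with indentation proportional to `depth` gives output shaped like the strategy listing above, with quicksort near the root and mergesort at the leaves.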

Brendan Hansknecht (Jul 29 2024 at 01:05):

So besides an issue with the surgical linker for the fallback to sorting pointers if the element size is too large, I think I have full fluxsort working.

Richard Feldman (Jul 29 2024 at 01:24):

Brendan Hansknecht (Jul 29 2024 at 03:33):

M1 (Still not sure why roc is slower here)

Benchmark 1: ./roc-mergesort
  Time (mean ± σ):      77.6 ms ±   0.6 ms    [User: 73.6 ms, System: 2.5 ms]
  Range (min … max):    76.7 ms …  80.9 ms    100 runs

Benchmark 2: ./roc-builtinsort
  Time (mean ± σ):      40.3 ms ±   0.5 ms    [User: 36.9 ms, System: 2.1 ms]
  Range (min … max):    39.6 ms …  44.0 ms    100 runs

Benchmark 3: ./cc-quadsort
  Time (mean ± σ):      31.1 ms ±   0.9 ms    [User: 27.6 ms, System: 2.3 ms]
  Range (min … max):    30.1 ms …  35.6 ms    100 runs

Benchmark 4: ./cc-fluxsort
  Time (mean ± σ):      21.0 ms ±   0.4 ms    [User: 18.0 ms, System: 2.0 ms]
  Range (min … max):    20.5 ms …  23.1 ms    100 runs

Summary
  ./cc-fluxsort ran
    1.48 ± 0.05 times faster than ./cc-quadsort
    1.92 ± 0.04 times faster than ./roc-builtinsort
    3.70 ± 0.07 times faster than ./roc-mergesort

x86_64 intel i7-8750H gaming laptop (amazing results)

hyperfine

Benchmark 1: ./roc-mergesort
  Time (mean ± σ):      99.4 ms ±   2.9 ms    [User: 96.2 ms, System: 3.2 ms]
  Range (min … max):    96.2 ms … 121.6 ms    100 runs

Benchmark 2: ./roc-builtinsort
  Time (mean ± σ):      25.4 ms ±   0.5 ms    [User: 23.1 ms, System: 2.3 ms]
  Range (min … max):    24.7 ms …  27.7 ms    100 runs

Benchmark 3: ./cc-quadsort
  Time (mean ± σ):      35.9 ms ±   1.5 ms    [User: 31.6 ms, System: 4.3 ms]
  Range (min … max):    33.1 ms …  38.1 ms    100 runs

Benchmark 4: ./cc-fluxsort
  Time (mean ± σ):      26.5 ms ±   0.5 ms    [User: 22.4 ms, System: 4.1 ms]
  Range (min … max):    25.5 ms …  28.0 ms    100 runs

Summary
  ./roc-builtinsort ran
    1.04 ± 0.03 times faster than ./cc-fluxsort
    1.42 ± 0.07 times faster than ./cc-quadsort
    3.92 ± 0.14 times faster than ./roc-mergesort

poop

Benchmark 1 (50 runs): ./roc-mergesort
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          98.5ms ± 1.11ms    97.1ms …  104ms          5 (10%)        0%
  peak_rss           16.3MB ± 66.0KB    16.3MB … 16.4MB          0 ( 0%)        0%
  cpu_cycles          377M  ± 1.07M      376M  …  383M           1 ( 2%)        0%
  instructions        385M  ± 8.40       385M  …  385M           2 ( 4%)        0%
  cache_references   11.6M  ± 43.1K     11.5M  … 11.7M           0 ( 0%)        0%
  cache_misses       5.58M  ±  203K     5.40M  … 6.67M           4 ( 8%)        0%
  branch_misses      9.50M  ± 5.82K     9.49M  … 9.52M           0 ( 0%)        0%
Benchmark 2 (182 runs): ./roc-builtinsort
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          26.6ms ±  766us    25.1ms … 28.4ms          0 ( 0%)        ⚡- 73.0% ±  0.3%
  peak_rss           12.9MB ±  748KB    12.5MB … 14.7MB         29 (16%)        ⚡- 20.9% ±  1.3%
  cpu_cycles         95.5M  ± 2.94M     90.5M  …  102M           0 ( 0%)        ⚡- 74.7% ±  0.2%
  instructions        281M  ±  321K      280M  …  281M           2 ( 1%)        ⚡- 27.1% ±  0.0%
  cache_references   4.16M  ± 46.2K     4.06M  … 4.27M           0 ( 0%)        ⚡- 64.1% ±  0.1%
  cache_misses        637K  ± 60.0K      526K  …  956K           3 ( 2%)        ⚡- 88.6% ±  0.6%
  branch_misses       289K  ± 60.9K      187K  …  362K           0 ( 0%)        ⚡- 97.0% ±  0.2%
Benchmark 3 (139 runs): ./cc-quadsort
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          34.8ms ± 1.01ms    33.9ms … 39.2ms         12 ( 9%)        ⚡- 64.6% ±  0.3%
  peak_rss           18.4MB ± 76.4KB    18.2MB … 18.6MB          9 ( 6%)        💩+ 12.5% ±  0.1%
  cpu_cycles          117M  ±  807K      115M  …  120M          11 ( 8%)        ⚡- 69.0% ±  0.1%
  instructions        283M  ±  173       283M  …  283M          18 (13%)        ⚡- 26.5% ±  0.0%
  cache_references   6.70M  ± 59.4K     6.62M  … 6.88M           8 ( 6%)        ⚡- 42.2% ±  0.2%
  cache_misses       1.82M  ±  126K     1.60M  … 2.18M           2 ( 1%)        ⚡- 67.4% ±  0.9%
  branch_misses      53.3K  ± 2.28K     47.0K  … 61.0K          20 (14%)        ⚡- 99.4% ±  0.0%
Benchmark 4 (172 runs): ./cc-fluxsort
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          28.1ms ± 2.17ms    26.2ms … 49.8ms          8 ( 5%)        ⚡- 71.5% ±  0.6%
  peak_rss           18.4MB ± 80.3KB    18.2MB … 18.6MB          7 ( 4%)        💩+ 12.6% ±  0.1%
  cpu_cycles         92.5M  ± 4.48M     88.1M  …  130M           2 ( 1%)        ⚡- 75.5% ±  0.3%
  instructions        267M  ±  186K      267M  …  268M           3 ( 2%)        ⚡- 30.6% ±  0.0%
  cache_references   4.35M  ± 70.5K     4.21M  … 4.90M           3 ( 2%)        ⚡- 62.5% ±  0.2%
  cache_misses        822K  ±  120K      682K  … 2.04M           7 ( 4%)        ⚡- 85.3% ±  0.8%
  branch_misses       263K  ± 65.8K      193K  …  370K           0 ( 0%)        ⚡- 97.2% ±  0.2%

Brendan Hansknecht (Jul 29 2024 at 03:39):

Luke Boswell (Jul 29 2024 at 04:20):

Nice work, this is really cool. Love how you explored the performance and shared what you learnt along the way.

I found it super interesting to follow along with, and now I really want to understand the perf characteristics more with my code.

Brendan Hansknecht (Jul 29 2024 at 06:32):

I'm going to leave this fuzzing overnight, but everything is ready for review and merge here.

Luke Boswell (Jul 29 2024 at 11:12):

I thought I might try and run this (or something similar) and see what I see on my M2 mac.

test program

I used this program inspired by #6294 -- it's not the same because the API has changed a lot since then.

results M2 mac

$ hyperfine ./sort
Benchmark 1: ./sort
  Time (mean ± σ):      40.4 ms ±  93.3 ms    [User: 10.0 ms, System: 1.3 ms]
  Range (min … max):     9.3 ms … 305.9 ms    10 runs

$ hyperfine ./sort-opt
Benchmark 1: ./sort-opt
  Time (mean ± σ):     164.6 ms ± 502.0 ms    [User: 3.9 ms, System: 1.7 ms]
  Range (min … max):     3.1 ms … 1593.4 ms    10 runs

$ hyperfine -m 1000 ./sort
Benchmark 1: ./sort
  Time (mean ± σ):       9.3 ms ±   0.2 ms    [User: 7.9 ms, System: 0.8 ms]
  Range (min … max):     8.8 ms …   9.7 ms    1000 runs

$ hyperfine -m 1000 ./sort-opt
Benchmark 1: ./sort-opt
  Time (mean ± σ):       3.0 ms ±   0.2 ms    [User: 1.8 ms, System: 0.7 ms]
  Range (min … max):     2.6 ms …   5.3 ms    1000 runs

Luke Boswell (Jul 29 2024 at 11:12):

Brendan Hansknecht (Jul 29 2024 at 14:58):

Brendan Hansknecht (Jul 29 2024 at 14:59):

And yeah, that benchmark isn't good cause fluxsort will detect the entire list is reversed and just flip it
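To make the point concrete, here is a small sketch of a sort that checks for a fully descending input first, plus a helper that builds a shuffled input a benchmark could use instead. Both `sort_with_run_check` and `benchmark_input` are hypothetical names; the real detection in fluxsort is more involved than a single pass over the whole list.

```python
import random


def sort_with_run_check(items):
    """Return (sorted list, name of the strategy that was used)."""
    if all(a > b for a, b in zip(items, items[1:])):
        return items[::-1], "reverse"   # O(n) flip, no real sorting work
    return sorted(items), "full sort"


def benchmark_input(n, seed=0):
    """Build a shuffled list so the benchmark exercises the general path."""
    data = list(range(n, 0, -1))
    random.Random(seed).shuffle(data)
    return data
```

A fixed seed keeps runs comparable while still avoiding the reversed-input shortcut.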

Brendan Hansknecht (Jul 29 2024 at 15:39):

250 million executions later with 0 crashes. There were some timeouts (no hangs), but that's expected: with a large enough input, sorting will get slow. A timeout just means it took longer than expected but still finished in a reasonable time. A hang is when it takes so long that the fuzzer decides to kill it.

Brendan Hansknecht (Jul 29 2024 at 19:13):

A raw zig version using our new builtin sort is essentially as fast as the C sort on M1.

So it is somehow due to the code roc is generating. Maybe llvm is not inlining the compare or copy on m1 (I don't think this is the case, it would be way slower)? Maybe the compare isn't generating branchless code? Maybe some other complexity of the llvm ir is leading to an optimization failing to apply?

Isaac Van Doren (Jul 29 2024 at 21:37):

Brendan Hansknecht (Jul 30 2024 at 04:23):

@Luke Boswell if you get a chance, I'd be curious what your M2 results are for this:

Brendan Hansknecht (Jul 30 2024 at 04:24):

Luke Boswell (Jul 30 2024 at 04:33):

Benchmark 1: ./roc-mergesort
  Time (mean ± σ):      71.1 ms ±   0.8 ms    [User: 68.1 ms, System: 1.9 ms]
  Range (min … max):    69.9 ms …  72.1 ms    100 runs

Benchmark 2: ./roc-builtinsort
  Time (mean ± σ):      36.6 ms ±   0.5 ms    [User: 34.4 ms, System: 1.5 ms]
  Range (min … max):    35.8 ms …  37.3 ms    100 runs

Benchmark 3: ./zig-raw-builtinsort
  Time (mean ± σ):      33.7 ms ±   0.4 ms    [User: 32.4 ms, System: 1.0 ms]
  Range (min … max):    33.0 ms …  34.3 ms    100 runs

Benchmark 4: ./cc-quadsort
  Time (mean ± σ):      27.1 ms ±   0.3 ms    [User: 25.3 ms, System: 1.5 ms]
  Range (min … max):    26.8 ms …  27.8 ms    100 runs

Benchmark 5: ./cc-fluxsort
  Time (mean ± σ):      19.0 ms ±   0.1 ms    [User: 17.3 ms, System: 1.3 ms]
  Range (min … max):    18.7 ms …  19.3 ms    100 runs

Summary
  ./cc-fluxsort ran
    1.43 ± 0.02 times faster than ./cc-quadsort
    1.78 ± 0.02 times faster than ./zig-raw-builtinsort
    1.93 ± 0.03 times faster than ./roc-builtinsort
    3.75 ± 0.05 times faster than ./roc-mergesort

Luke Boswell (Jul 30 2024 at 04:42):

And on the WIP LLVM 18 branch -- this is a great way to confirm that the optimisation passes I removed are definitely still required... so don't take these results seriously

Brendan Hansknecht (Jul 30 2024 at 05:10):

ok, so still a gap from c++ to zig/roc. Not sure why the perf diff would be apple silicon specific, but good to know it is consistent.
