Stream: contributing

Topic: Set perf


view this post on Zulip Brendan Hansknecht (Dec 03 2023 at 22:44):

Elias Mulhall said:

Day 3, this one has a performance bug that I'll talk about below
Day 3, with a workaround

Performance Issue Details


I haven't root caused this or anything, but it honestly might just be a case of a bad use case for Set that got unlucky with perf (though there obviously must be some sort of bug, given the issue with --optimize).

Anyway, stats I have so far.

1) generates 1863 neighbor sets of size 8. 8 is an unlucky number: it means that every one of those sets will be allocated, have 7 values inserted (which requires hashing them), get reallocated and rehashed, then have the final value inserted. So every fromList call here will cost 2 allocations, 15 hashes, and some other data twiddling.

Fixing fromList perf, and at a minimum getting withCapacity properly working, should help a lot here. I forget why fromCapacity is not implemented currently, but this is probably pretty terrible for perf.

2) This generates 32222 sets of size 0 to 3 for the various indices. 29788 of which are just empty. Given these never rehash, this is just a ridiculous number of allocations (that wouldn't happen with an empty list) due to dictionaries having a default capacity of 8. And currently a dictionary is 2 allocations. So compared to using lists, this is 59576 extra allocations....probably this is the main issue?

3) Set.intersection is currently not written in an optimal way. It probably should walk one set and remove items; instead it generates a totally new set from empty.
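
(For illustration, a minimal sketch of that "walk one set and remove items" shape, not the actual builtin implementation: it walks the smaller set and drops anything not present in the other.)

intersection = \a, b ->
    if Set.len a <= Set.len b then
        Set.walk a a \acc, k ->
            if Set.contains b k then acc else Set.remove acc k
    else
        intersection b a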

view this post on Zulip Richard Feldman (Dec 03 2023 at 22:53):

love this investigation! :star_struck:

view this post on Zulip Elias Mulhall (Dec 03 2023 at 23:11):

Ahhh, the size 8 thing makes sense. It would be pretty straightforward to add a Set.withCapacity, right? Actually, in that case I'm doing Set.fromList... would it be bad for that function to eagerly allocate the set with the same capacity as the list? Depending on use case that might mean allocating way more space than needed, but not to a ridiculous degree.
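
(For illustration, a minimal sketch of that eager-allocation idea, assuming a working Set.withCapacity that reserves enough slots up front; this is not the current implementation.)

fromList = \list ->
    List.walk list (Set.withCapacity (List.len list)) Set.insert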

view this post on Zulip Elias Mulhall (Dec 03 2023 at 23:12):

Oops, I skipped the paragraph where you already said that

view this post on Zulip Elias Mulhall (Dec 03 2023 at 23:15):

Is the --optimize issue reproducible, or is it just me?

view this post on Zulip Brendan Hansknecht (Dec 03 2023 at 23:15):

Reproducible on my end

view this post on Zulip Brendan Hansknecht (Dec 03 2023 at 23:16):

I'll try to implement a few improvements and see how it goes.

view this post on Zulip Brendan Hansknecht (Dec 03 2023 at 23:18):

As an extra note, if we were really careful, the set could theoretically take ownership of the list and reuse the allocation. Though it would be a more complex algorithm: it would need to deal with duplicate elements in place.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:27):

So compared to using lists, this is 59576 extra allocations

...

So allocating is faster than not allocating. Cause if we don't allocate, we have to add an extra conditional to all insert/get/remove/etc calls.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:29):

about 10% faster to just always allocate with this example

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:29):

what's the extra conditional?

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:29):

like what does it do

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:29):

before it would just unsafe load an index from the metadata

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:29):

Now it has to check length is not 0

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:30):

and for insert of course, if not allocated, it has to allocate at the beginning of the function.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:33):

Also, due to the sets being super small (max size 8 elements), I don't think that sets will ever be faster than lists for this example. Lots of extra overhead from hashing, allocations, and extra conditionals with very, very small data.

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:35):

I wonder if there's some cmov trick that could work here

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:35):

like this part:

Brendan Hansknecht said:

before it would just unsafe load an index from the metadata

could use a cmov for "if length is 0, set the pointer to be a pointer to the address of the length itself" or something

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:36):

(I don't know what the metadata would need to be for "this is missing" though)

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:37):

probably not.

It is essentially:

if initialized:
    # do some gigantic compute
else:
    just return

Generally cmov is only good when both sides are small, roughly even amounts of compute. Branch prediction is much better for cases like this.

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:41):

well that's true for insert, right? But get presumably never initializes anything

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:41):

but I guess it's the others that are significantly slowed down here

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:42):

So insert is actually

if !initialized:
    initialize

# do rest of compute with initialized dict

for get, it is the case mentioned above. Cause hashing and calculating index related stuff is the large compute I was mentioning above.

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:44):

:thinking: doesn't insert need a branch anyway to see if it's over capacity?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:49):

Only if the key isn't found

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:50):

ok so then it branches on whether the key is found, right?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:50):

yep

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:51):

but checking indices assumes the dict is initialized

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:51):

So it never does any bounds checks

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:51):

gotcha, so what if that branch's conditional branchlessly checked length?

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:52):

e.g. do both the cmov thing I mentioned earlier and also an AND with it being nonempty

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 00:53):

That branch is after it has already unconditionally loaded from the metadata, which would deref a null pointer on an uninitialized dict.

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:53):

that's what I mean about doing a cmov to have the null pointer point to &length or whatever so it can be successfully dereferenced to something we'll ignore anyway

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:55):

e.g. (pretending && is not going to desugar to a condition here):

let isEmpty = len == 0;
let metadataPtr = if isEmpty { &len } else { normalMetadataPtr };

if !isEmpty && doNormalCheckOn(metadataPtr) { ... }

view this post on Zulip Richard Feldman (Dec 04 2023 at 00:55):

it's definitely possible that would still be more costly than the branch misprediction though haha

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:10):

Brendan Hansknecht said:

Also, due to the sets being super small (max size 8 elements), I don't think that sets will ever be faster than lists for this example. Lots of extra overhead from hashing, allocations, and extra conditionals with very, very small data.

something I've wondered about is whether we should consider having a "small hashmap optimization" where we don't use a hashmap if the collection is small enough

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:10):

and instead Dict (and therefore Set too) is backed by a flat List of tuples
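
(For illustration, a rough sketch of what a lookup against a flat list of pairs could look like; the names here are hypothetical, not an existing API.)

listDictGet = \pairs, key ->
    when List.findFirst pairs (\(k, _) -> k == key) is
        Ok (_, v) -> Ok v
        Err NotFound -> Err KeyNotFound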

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:17):

So I figured out how to remove the branch. Just use constant lists. Then it is always initialized

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:17):

Instead of before, where it was using List.repeat.
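
(A hypothetical illustration of the difference, not the actual Dict internals: a list literal can be emitted as constant data by the compiler, whereas List.repeat builds the list at runtime.)

defaultMetadata : List U8
defaultMetadata = [0, 0, 0, 0, 0, 0, 0, 0] # a literal like this can live in read-only data

oldDefaultMetadata : List U8
oldDefaultMetadata = List.repeat 0 8 # built at runtime instead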

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:17):

Also initialization changes seem to have fixed the --optimize bug.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:17):

In optimized, this is about 6x slower than the list version

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:18):

which honestly may be the expected perf. Probably should write something similar in C++ with absl::flat_hash_map and vector to compare.

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:21):

constant lists as in they're allocated in the readonly section of the binary?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:23):

Yeah, they should be. Though looking at the llvm, I think llvm is smart enough to make them magically disappear cause they are so small. Not fully sure

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:24):

very cool!

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:24):

what do you think of the "small hashmap optimization" idea?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:29):

Mostly questioning if it is a good idea for automatic application or a better idea for human optimization.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:30):

As in, we could expose both types with the same API (or make the list version available in a package). Explicitly choosing either one avoids the perf costs related to extra conditionals. So if you know you need large or know you need small, it makes sense to pick explicitly. That said, a merged version may avoid some worst cases like this.

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:30):

I think the pitch for it is that it's common to use Set or Dict not so much for performance but rather because they enforce a guarantee and clearly communicate intent

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:32):

and we could have separate types for that, but I do wonder whether the extra conditional matters in the "actually use a hash map" case when you have so many entries that you're probably getting cache misses

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:32):

not so much for performance

To that I would say, use --optimize and accept that it may not be the same perf as picking a ListSet or ListDict, but perf isn't the main concern.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:33):

Also, I need to do some measurements, but likely we should just rewrite dict as a zig builtin.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:33):

That probably will close a solid chunk of the gap.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:34):

But I can't really guarantee that currently; it is mostly speculation. That said, I was unable to implement part of the absl::flat_hash_map optimization in Roc because it turned out to be worse for performance.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:35):

Though I think we have a builtin to address part of that now, so maybe I should try again.

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:40):

I also wonder if a List.appendUnique or something might be useful

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:41):

like some quick way to use a list like a Set except with different performance characteristics

view this post on Zulip Richard Feldman (Dec 04 2023 at 01:42):

eh might be clearer to get a ListSet off the shelf or something

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:42):

oh, so essentially:

if List.contains list x then
    list
else
    List.append list x

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 01:49):

Just found a huge speed boost in the app...
from

|> Set.union state

to

|> \x -> Set.union state x

Made the optimized build about 2x faster

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:01):

Anyway, on a branch, I am down to 3x slower on an optimized build with set vs with list.

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:09):

Brendan Hansknecht said:

Just found a huge speed boost in the app...
from

|> Set.union state

to

|> \x -> Set.union state x

Made the optimized build about 2x faster

whoa! Why does that improve speed?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:11):

It is equivalent of going from

List.concat smallList largeList

To

List.concat largeList smallList

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:11):

So copying a few elements onto the end of something large instead of copying a ton of elements onto the end of something small.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:11):

In a very hot loop
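
(For reference, this comes down to how the pipe desugars; writing the piped-in set as smallSet for illustration, the two spellings flip which argument ends up being the large one that gets copied into.)

smallSet |> Set.union state           # desugars to Set.union smallSet state
smallSet |> \x -> Set.union state x   # desugars to Set.union state smallSet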

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:14):

ohhh

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:14):

:thinking: does that suggest we should swap the argument order there?

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:15):

oh nm, I guess this is application-specific

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:16):

might be too fancy, but I wonder if it's worth picking which one we insert into based on which is bigger, assuming they're both unique

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:16):

I added a conditional in the function. I think that will be worth it in general
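
(Roughly the shape of that conditional, as a sketch rather than the actual builtin: always walk the smaller set and insert its elements into the larger one.)

union = \a, b ->
    if Set.len a >= Set.len b then
        Set.walk b a Set.insert
    else
        Set.walk a b Set.insert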

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:17):

in Set.union you mean?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:17):

yeah

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:18):

cool, makes sense to me!

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:20):

@Elias Mulhall can you test on the set-perf branch? See if --optimize still hangs for you.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:21):

Anyway, patched up a handful of improvements in #6176

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:24):

Also, I wonder if the full, performant impl of absl::flat_hash_set might actually be faster than regular lists. The main feature we are missing currently is groups. Groups make it so a chunk of the hash can be compared for 8 indices at once. So it might do 1 hash and 1 comparison to avoid up to 8 calls to ==. That said, it still has the rest of the extra conditionals and overhead.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:25):

Now that we have popCount as a builtin, I should revisit implementing that in Roc and measure the perf gains.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 02:26):

Should definitely convert this or a similar example to C++ with absl::flat_hash_map to see what our real perf diff is.

view this post on Zulip Richard Feldman (Dec 04 2023 at 02:42):

that sounds fascinating! :smiley:

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 03:00):

As an aside, about the small dict impl:

The larger the key is, the worse a small dict is, since a large key means an expensive Eq. So it would be a lot faster to do more hash comparisons rather than key comparisons.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 03:00):

So probably ListDict is only better with both small size and small key.
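
(One hypothetical way to soften the expensive Eq for a list-backed dict, with made-up names: keep the precomputed hash next to each pair so a lookup compares the cheap hash first and only falls back to Eq on a hash match.)

# pairs : List (U64, k, v), where the U64 is the precomputed hash of the key
listDictGetWithHash = \pairs, key, keyHash ->
    List.findFirst pairs (\(h, k, _) -> h == keyHash && k == key)
    |> Result.map (\(_, _, v) -> v)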

view this post on Zulip Richard Feldman (Dec 04 2023 at 03:18):

yeah true, although we do know the size of that statically, so could have a heuristic involving both

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 03:20):

Sure, if roc exposed that at compile time or you add an extra runtime check

view this post on Zulip Elias Mulhall (Dec 04 2023 at 04:53):

Brendan Hansknecht said:

Elias Mulhall can you test on the set-perf branch? See if --optimize still hangs for you.

Unfortunately it still does! Anything I can do to debug further?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 04:59):

Oh wait, I lightly modified the code, with the original I can still repro...let me take a look at that specifically.

view this post on Zulip Elias Mulhall (Dec 04 2023 at 05:00):

Ah ha, good luck!

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 05:27):

So I'm not sure how easy this will be to turn into a minimized test case, but this is not a Set bug at all.

It is some sort of refcounting/uniqueness failure. The bug is specifically with the newPartNumber variable: https://gist.github.com/mulias/b07e012ced92406f59cc2548d0f01323#file-2023day03-roc-L66

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 05:28):

It should always have multiple references which means on this line here it should incur a deep copy:
https://gist.github.com/mulias/b07e012ced92406f59cc2548d0f01323#file-2023day03-roc-L88

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 05:29):

That is not happening. As such, we are getting these weird sets that have old data from the last iteration of the loop. Unsurprisingly, mixing old data into a new set breaks things.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 05:29):

A simple workaround is just to inline the variable uses:
current: { digits: [], indices: Set.empty {} },

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 05:34):

Doing a quick scan of the mono IR, it seems that on every iteration of Array2D.walk the value is decremented, but it doesn't look to ever be incremented. I assume it would get incremented through the closure capture somewhere or something like that, but I don't see it. (Definitely might just be missing it.)

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 05:39):

actually, it looks like Array2D.walk calls List.walkWithIndex, which calls List.walkWithIndexHelp, which increments the function's refcount. That should mean this all works correctly. Maybe a bug in the actually generated increment or decrement function?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 06:53):

So I was able to work around the refcount craziness (it was affecting both the list and set versions of the code). Fixing that made both versions about 4x faster. So on my machine the list version is now 40ms and the set version 150ms. Both quite respectable times, and the flamegraphs now look like they are spending their time mostly doing real work and not too much in refcounts.

All of the time is now essentially spent calculating set overlaps. (Again, these are small sets: neighbors is 8 elements and the digit sets are 1 to 3 elements.)

So if we optimize the set overlap calculation to early exit like the list overlap calculation, we get to the point where the set version is slightly more than 2x slower than the list version, at about 90ms:

isOverlapping : Set a, Set a -> Bool
isOverlapping = \a, b ->
    if Set.len b < Set.len a then
        isOverlapping b a
    else
        Set.walkUntil a Bool.false \_, k ->
            if Set.contains b k then
                Break Bool.true
            else
                Continue Bool.false

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 06:56):

Anyway, that is probably enough digging into the perf of all of this.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 06:56):

I should probably file some issue to track some of this, but I will leave that as an exercise for tomorrow Brendan cause I am currently quite tired.

view this post on Zulip Hannes Nevalainen (Dec 04 2023 at 07:56):

Richard Feldman said:

what do you think of the "small hashmap optimization" idea?

FWIW maps on the Beam are just arrays under the hood if size < 32

view this post on Zulip Elias Mulhall (Dec 04 2023 at 13:38):

Anyway, that is probably enough digging into the perf of all of this.

I would say so, you went much deeper into this than I anticipated! Thanks for all the details, very interesting.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 15:37):

Haha. I enjoy digging into performance and such

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 15:43):

Hannes Nevalainen said:

FWIW maps on the Beam are just arrays under the hood if size < 32

Do you have a link to info on this? Or source I guess.

view this post on Zulip Elias Mulhall (Dec 04 2023 at 18:11):

So I was able to work around the refcount craziness (was affecting both the list and set version of the code). Fixing that made both versions about 4x faster.

@Brendan Hansknecht I'm not sure I'm following how you did this. Was this a code change or a compiler change?

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 18:16):

Ah sorry, didn't share:

gearPartNumbers : Schematic -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic ->
    partNumbers = allPartNumbers schematic

    Array2D.walk schematic ([], partNumbers) { direction: Forwards } \(state, pns), elem, index ->
        when elem is
            Symbol '*' ->
                elemNeighbors = neighbors index
                neighboringPartNumbers =
                    List.keepIf pns \partNumber ->
                        isOverlapping partNumber.indices elemNeighbors

                when neighboringPartNumbers is
                    [pn1, pn2] -> (List.append state (pn1, pn2), pns)
                    _ -> (state, pns)

            _ -> (state, pns)
    |> .0

Same can be done to the list version

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 18:20):

I think if partNumbers is a capture of the closure, we aren't smart about refcounts and recursively increment the refcounts of everything contained in partNumbers.

If we pass it in instead, we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount. That is at least my guess.

view this post on Zulip Brendan Hansknecht (Dec 04 2023 at 18:21):

@Ayaz Hafiz or @Folkert de Vries does that sound plausible? If so, is there an actionable bug I can file?

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:16):

If we pass it in instead, we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount. That is at least my guess.

Can you elaborate what you mean by this? I don't follow

view this post on Zulip Elias Mulhall (Dec 05 2023 at 00:24):

This is the original code

gearPartNumbers : Schematic, List PartNumber -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic, partNumbers ->
    Array2D.walk schematic [] { direction: Forwards } \state, elem, index ->
        when elem is
            Symbol '*' ->
                adjacentPartNumbers =
                    List.keepIf partNumbers \partNumber ->
                        anyAdjacent partNumber.indices [index]

                when adjacentPartNumbers is
                    [pn1, pn2] -> List.append state (pn1, pn2)
                    _ -> state

            _ -> state

This code is faster because it seems to eliminate ref counts on partNumbers

gearPartNumbers : Schematic, List PartNumber -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic, partNumbers ->
    Array2D.walk schematic ([], partNumbers) { direction: Forwards } \(state, pns), elem, index ->
        when elem is
            Symbol '*' ->
                adjacentPartNumbers =
                    List.keepIf pns \partNumber ->
                        anyAdjacent partNumber.indices [index]

                when adjacentPartNumbers is
                    [pn1, pn2] -> (List.append state (pn1, pn2), pns)
                    _ -> (state, pns)

            _ -> (state, pns)
    |> .0

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 00:25):

we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount.

So that comment turned out to be wrong. I thought we had an optimization that enables sometimes just incrementing a list's outer refcount instead of incrementing it recursively, but even if we do, it doesn't apply here.


A more correct analysis

In the original version, partNumbers is a closure capture. On every iteration of Array2D.walk, we would recursively increment the closure captures and spend a lot of time incrementing refcounts in partNumbers. On top of that, at the end of the function, we would recursively decrement the refcount. Overall, just eating tons of time with refcounting.

If we explicitly make partNumbers part of the state, this goes away. Most of the time, pns is just passed through without getting used. This means its refcount is never incremented or decremented. The only time we have to deal with refcounts of pns in the new version of the code is when the symbol is *. In that case, the refcount of pns is recursively incremented. So this is just way, way less refcount fiddling overall. It turned out to not be a fancy optimization, just hugely cutting how often we have to deal with this.

Summary: recursive refcount increments/decrements can be super brutal. I wonder if there is a better way to deal with a case like this.

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:27):

Looking at the IR to confirm...

view this post on Zulip Elias Mulhall (Dec 05 2023 at 00:31):

Here's an updated gist if that helps https://gist.github.com/mulias/3e7fb28d934736155028d5e85c9bb3ca
One of the versions of the function is commented out

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:47):

Hmm yeah your analysis sounds right to me Brendan

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:48):

It's not totally clear to me why in the first case the capture is seen as borrowed, and in the second case it isn't though

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:48):

The thing is that on the surface, the capture is seen as owned, until array2d.Array2D.iterate is hit

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:50):

And then we have

# SLOW
procedure : `array2d.Array2D.iterate` List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}
procedure = `array2d.Array2D.iterate` (`#Derived_gen.IdentId(86)`: {List [C , C U8, C U8], {U64, U64}}, `#Derived_gen.IdentId(87)`: {U64, U64}, `#Derived_gen.IdentId(88)`:
List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, `#Derived_gen.IdentId(89)`: {}, `#Derived_gen.IdentId(90)`: List {List U8, List {U64, U64}}):
    joinpoint `array2d.Array2D.770` `array2d.Array2D.array` `array2d.Array2D.index` `array2d.Array2D.state` `array2d.Array2D.nextIndexFn` `array2d.Array2D.nextStateFn`:
        inc `array2d.Array2D.array`;
        let `array2d.Array2D.771` : [C {}, C [C , C U8, C U8]] = CallByName `array2d.Array2D.get` `array2d.Array2D.array` `array2d.Array2D.index`;
        let `array2d.Array2D.791` : U8 = 1i64;
        let `array2d.Array2D.792` : U8 = GetTagId `array2d.Array2D.771`;
        let `array2d.Array2D.793` : Int1 = lowlevel Eq `array2d.Array2D.791` `array2d.Array2D.792`;
        if `array2d.Array2D.793` then
            inc `array2d.Array2D.array`;
            let `array2d.Array2D.elem` : [C , C U8, C U8] = UnionAtIndex (Id 1) (Index 0) `array2d.Array2D.771`;
            let `array2d.Array2D.788` : [C {}, C {U64, U64}] = CallByName `array2d.Array2D.decY` `array2d.Array2D.array` `array2d.Array2D.index`;
            inc `array2d.Array2D.nextStateFn`; <<<<< THIS is the INC
            let `array2d.Array2D.789` : [C List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, C List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}] = CallByName `array2d.Array2D.200` `array2d.Array2D.state` `array2d.Array2D.elem` `array2d.Array2D.index` `array2d.Array2D.nextStateFn`;
# FAST
procedure : `array2d.Array2D.iterate` {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}
procedure = `array2d.Array2D.iterate` (`#Derived_gen.IdentId(48)`: {List [C , C U8, C U8], {U64, U64}}, `#Derived_gen.IdentId(49)`: {U64, U64}, `#Derived_gen.IdentId(50)`:
{List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, `#Derived_gen.IdentId(51)`: {}, `#Derived_gen.IdentId(52)`: {}):
    joinpoint `array2d.Array2D.1039` `array2d.Array2D.array` `array2d.Array2D.index` `array2d.Array2D.state` `array2d.Array2D.nextIndexFn` `array2d.Array2D.nextStateFn`:
        inc `array2d.Array2D.array`;
        let `array2d.Array2D.1040` : [C {}, C [C , C U8, C U8]] = CallByName `array2d.Array2D.get` `array2d.Array2D.array` `array2d.Array2D.index`;
        let `array2d.Array2D.1060` : U8 = 1i64;
        let `array2d.Array2D.1061` : U8 = GetTagId `array2d.Array2D.1040`;
        let `array2d.Array2D.1062` : Int1 = lowlevel Eq `array2d.Array2D.1060` `array2d.Array2D.1061`;
        if `array2d.Array2D.1062` then
            inc `array2d.Array2D.array`;
            let `array2d.Array2D.elem` : [C , C U8, C U8] = UnionAtIndex (Id 1) (Index 0) `array2d.Array2D.1040`;
            let `array2d.Array2D.1057` : [C {}, C {U64, U64}] = CallByName `array2d.Array2D.incX` `array2d.Array2D.array` `array2d.Array2D.index`;
            let `array2d.Array2D.1058` : [C {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, C {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}] = CallByName `array2d.Array2D.200` `array2d.Array2D.state` `array2d.Array2D.elem` `array2d.Array2D.index` `array2d.Array2D.nextStateFn`;

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 00:52):

In the fast example, the closure has no captures, so the inc is elided? It would just be a call to a no-op?

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:52):

So I'm not really sure why we're seeing it as borrowed in one case and not the other. Maybe the fact that the capture parameter passes out specializes it? But I would expect that it's seen as irrelevant in the first (capturing) case for sure, especially because it unwraps transparently to an extra List parameter

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:52):

Maybe @Folkert de Vries has an idea

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:52):

Well, but it would inc on something right? Like presumably the outer capture is passed in some way to iterate

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:53):

But that seems to be missing

view this post on Zulip Ayaz Hafiz (Dec 05 2023 at 00:53):

Although I haven't looked at the iterate implementation so I could be off

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 00:53):

It is pretty direct and just recursive: https://github.com/mulias/roc-array2d/blob/951b2a7ae561de88bf185bb61e26244dc9f5d508/package/Array2D.roc#L894-L906

view this post on Zulip Elias Mulhall (Dec 05 2023 at 01:05):

The IR looks weird to me for a different (and probably unrelated) reason -- Array2D.walk with default params should just be a wrapper for List.walkWithIndex, the iterate function shouldn't get involved unless you're walking in a different order :thinking:.

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 01:08):

Oh wait, that may be exactly why it changed to borrowing instead of owning

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 01:08):

Maybe in this dispatch function where there are two paths, we mess something up and switch from owning to borrowing?

https://github.com/mulias/roc-array2d/blob/951b2a7ae561de88bf185bb61e26244dc9f5d508/package/Array2D.roc#L415C1-L426C43

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 01:09):

Or maybe it is related to putting the closure in another closure?

\state, elem, listIndex ->
            fn state elem (arrayIndexOf listIndex arrayShape)

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 01:09):

Just guesses

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 01:12):

Maybe we should try to trim this down to just a simple list version and see if we can still repro.

view this post on Zulip Hannes Nevalainen (Dec 05 2023 at 09:19):

Brendan Hansknecht said:

Hannes Nevalainen said:

FWIW maps on the Beam are just arrays under the hood if size < 32

Do you have a link to info on this? Or source I guess.

It is apparently a "hash array mapped trie" for 32 entries or less. I don't really know more than that :) here are some links

https://twitter.com/akoutmos/status/1266034402422853633
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_map.h#L72-L76
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_map.c#L264-L268
https://elixirforum.com/t/big-maps-versus-small-maps-performance/31909

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 16:56):

Thanks!

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 17:00):

Maybe we should try to trim this down to just a simple list version and see if we can still repro.

I don't have a small repro currently, but I am pretty sure I hit this again when working on the implementation details of Dict. I have a hot loop with a .walk function in it that happens to capture a list. 45% of the hot loop's execution time is spent on refcounting that really shouldn't be needed (I'll probably have to manually inline this for perf reasons). To my understanding, the core issue is that we really need to do inlining before inserting refcounts. LLVM is inlining the lambda, but it still has a capture with a refcount inc before the lambda and a refcount dec in the lambda.

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 17:03):

I would guess a minimal repro will just be something simple like:

x = # Some large list of refcounted things, assume list of sets
y = # Some other list to walk, let's assume it has indices into x
List.walk y 0 \count, index ->
    when List.get x index is
        Ok set ->
            count + (Set.len set)
        Err OutOfBounds ->
            count

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 17:04):

Assuming I have this right, this will just thrash the refcounts of all elements in x.

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 23:53):

I am still diving down the trail of dictionary perf improvements and I found something interesting. listGetUnsafe is supposed to borrow its input. This means that no refcount changes are required to call the function.

This does not look to be working correctly in my dictionary code. I see lots of cases like this:

inc `Dict.metadata`;
let `Dict.group` : U64 = CallByName `Dict.listGetUnsafe` `Dict.metadata` `Dict.slotIndex`;

If I follow that link through, I find:

procedure : `Dict.listGetUnsafe` {U32, {}}
procedure = `Dict.listGetUnsafe` (`#Attr.#arg1`: List {U32, {}}, `#Attr.#arg2`: U64):
    let `Dict.960` : {U32, {}} = lowlevel ListGetUnsafe `#Attr.#arg1` `#Attr.#arg2`;
    dec `#Attr.#arg1`;
    ret `Dict.960`;

Even though the lowlevel does not require changing the refcount, we are wrapping it in a function call, and that function apparently requires changing the refcount. So this is defeating the whole point of the lowlevel not needing to change the refcount.

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 23:53):

Any ideas how to fix this?

view this post on Zulip Brendan Hansknecht (Dec 05 2023 at 23:54):

The same thing does not happen to List.getUnsafe:

procedure : `List.getUnsafe` U64
procedure = `List.getUnsafe` (`#Attr.#arg1`: List U64, `#Attr.#arg2`: U64):
    let `List.612` : U64 = lowlevel ListGetUnsafe `#Attr.#arg1` `#Attr.#arg2`;
    ret `List.612`;

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 00:14):

Oh, I figured out the missing piece. Refcounts gone!!!

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 00:45):

Also, I am so thankful for drop specialization...it is removing refcounts for me as well

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 00:46):

Now I just want closures to be able to borrow instead of own captures such that they can avoid refcounting.

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 05:23):

So ummm.....Dictionaries with refcounted keys are problematic:

Screenshot-2023-12-05-at-9.22.17PM.png

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 05:25):

The purple is refcounting. It is 80% of the execution time. The other 20% of the execution time is splitting a gigantic string.

So ~100% of the execution time spent in dictionary related functions is spent on recursive refcount increments and decrements.

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 05:27):

I thought morphic was supposed to enable a function like Dict.get to simply "borrow" its input, since we can statically guarantee it is never modified, only read. As such, we would elide all of this refcounting madness. Is that not the case? Is this an optimization we can add?

view this post on Zulip Richard Feldman (Dec 06 2023 at 11:52):

Brendan Hansknecht said:

Now I just want closures to be able to borrow instead of own captures such that they can avoid refcounting.

@Folkert de Vries any thoughts or info on this? I don't really know what's possible here :sweat_smile:

view this post on Zulip Folkert de Vries (Dec 06 2023 at 11:53):

this is not a morphic thing at all

view this post on Zulip Folkert de Vries (Dec 06 2023 at 11:53):

the refcounting approach is now to always own values, and push RC updates into the called function. That is optimal for in-place mutation, because borrowed values can never be mutated

view this post on Zulip Folkert de Vries (Dec 06 2023 at 11:54):

inferring borrowing is hard, and it's harder to bound the amount of code duplication for owned/borrowed arguments

view this post on Zulip Folkert de Vries (Dec 06 2023 at 11:55):

e.g. what if you have multiple arguments, they can all be owned or borrowed, this cascades

view this post on Zulip Richard Feldman (Dec 06 2023 at 12:26):

I wonder about that last part in practice :thinking:

view this post on Zulip Richard Feldman (Dec 06 2023 at 12:26):

like certainly it's possible to imagine that blowing up, but that doesn't necessarily mean it would in practice (after all, the same is true of monomorphization in general!)

view this post on Zulip Folkert de Vries (Dec 06 2023 at 12:51):

you would now do it twice though

view this post on Zulip Folkert de Vries (Dec 06 2023 at 12:51):

but to be clear, this is not some technical limitation, it's a tradeoff

view this post on Zulip Folkert de Vries (Dec 06 2023 at 12:54):

the core problem is that you don't know at any particular point whether to own or borrow: borrowing saves RC manipulation, but disallows mutation. In the middle of a call chain you don't know which is best. Even in most functions it's tricky to know whether being able to mutate is useful.

there is another problem here which is that with borrowing, in recursive functions, you may run into big spikes in memory use, because all the memory is borrowed and only released when all of those recursive functions return

view this post on Zulip Folkert de Vries (Dec 06 2023 at 12:54):

all of that is to say: you have to be careful with borrowing. If you can figure out a good heuristic then there should be no problem technically, but finding that heuristic is not trivial

view this post on Zulip Richard Feldman (Dec 06 2023 at 13:47):

I wonder how we might go about trying to figure out what a good heuristic might be :thinking:

view this post on Zulip Richard Feldman (Dec 06 2023 at 13:49):

now that we've run into concrete performance problems of the form "almost all this program does is increment and decrement refcounts, and then it does a bit of work which isn't that" it's probably worth starting to discuss things we might want to try :big_smile:

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 15:23):

So we have builtins that borrow things because they are read only, like List.getUnsafe. My wrong understanding was that if a function only called borrowing builtins, it would also just borrow the value. So it would kinda propagate out from those builtins.

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 15:24):

That would fix this specific problem

view this post on Zulip Richard Feldman (Dec 06 2023 at 15:25):

it seems like for that scenario, there would be no risk of a blowup, right?

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 15:25):

But it would add the delayed freeing of memory in recursive functions, like Folkert mentioned. (Though I'm not sure how realistic of a problem this is in practice.)

view this post on Zulip Richard Feldman (Dec 06 2023 at 15:26):

"if this function argument is only ever passed to things that borrow, then it too can be borrowed"

view this post on Zulip Richard Feldman (Dec 06 2023 at 15:26):

I'm more worried about the overhead of reference counting than the delaying of freed memory

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 15:28):

I agree.

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 15:29):

Also, I swear that at some point we talked about the idea of enabling just refcounting the outer list instead of recursively refcounting all elements in some cases. If we could do that here, it also would alleviate the issue. Though, I don't recall the context in which we thought a change like this might work.

view this post on Zulip Folkert de Vries (Dec 06 2023 at 15:42):

oh yeah we should totally change how lists do RC

view this post on Zulip Brendan Hansknecht (Dec 06 2023 at 16:03):

Do we have concrete ideas for that or is this a case where we need more research projects like Morphic?

view this post on Zulip Brendan Hansknecht (Dec 07 2023 at 15:41):

Looping back to the idea of borrowing an input to a function if all uses in the function also borrow the input: is that something we could just append onto the type checker? Like, for each value we want to know both the type and a bit flag of whether or not the value is potentially dirty. Any call in a sub-branch would make an entire body dirty. Of course there would need to be some custom propagation logic at function boundaries.

Would that make sense and be a relatively small cost to add onto the type checker? Should it go somewhere else? Is it just a bad idea overall? Any other suggestions on a solution?

view this post on Zulip Richard Feldman (Dec 07 2023 at 17:52):

I was talking to @Folkert de Vries about this the other day and he mentioned that doing the analysis on mono instead of during type checking would have various benefits

view this post on Zulip Folkert de Vries (Dec 07 2023 at 22:38):

well the main benefit is that you can reason about higher-order functions

view this post on Zulip Folkert de Vries (Dec 07 2023 at 22:38):

(with lambda sets, at least)

view this post on Zulip Brendan Hansknecht (Dec 07 2023 at 22:44):

Ah, then you could analyze closure captures and do something smarter?

view this post on Zulip Richard Feldman (Dec 08 2023 at 00:19):

@Brendan Hansknecht is this something you'd be interested in getting into in terms of compiler development? (With guidance of course!)

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 00:24):

Well, I do want Dict to run fast, and this will be fundamental to getting it closer to C++ speeds. So for sure.

Also, I think it will be helpful for me to understand the compiler more deeply. I have some vague performance ideas, but for many of them I am not sure they could work in Roc without really expensive analysis.

view this post on Zulip Richard Feldman (Dec 08 2023 at 00:35):

that would be amazing! @Folkert de Vries are you currently the only person with domain expertise in this? Or does anyone else know how it works/how to implement things to try?

view this post on Zulip Ayaz Hafiz (Dec 08 2023 at 01:10):

onto the type checker? Should it go somewhere else?

Doing it in the typechecker has other problems too, namely that you lose all specialization and are bound to the virality of unification. Really you want this over the SSA form.

view this post on Zulip Richard Feldman (Dec 08 2023 at 01:14):

also it slows down roc check unnecessarily

view this post on Zulip Richard Feldman (Dec 08 2023 at 01:14):

(and therefore editor integrations)

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 01:27):

also, I guess if we want to do this for closure captures as well, we need to check that the closure capture is essentially never saved and is just used directly, or something of that nature. Think of List.map and List.walk closures that may capture values. It would suck if List.walk capturing a refcounted value and then just using it in a borrowed manner had bad perf compared to passing in the value as part of the state.
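
(A small hypothetical illustration of that difference, with made-up names: in the first version the big list is a closure capture, in the second it is threaded through the walk state instead.)

countHits = \big, xs ->
    List.walk xs 0 \n, x ->
        if List.contains big x then n + 1 else n

countHitsThreaded = \big, xs ->
    List.walk xs (0, big) \(n, b), x ->
        if List.contains b x then (n + 1, b) else (n, b)
    |> .0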

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 01:28):

Cause if the closure capture is saved, it actually does need to make sure the value isn't freed.

view this post on Zulip Ayaz Hafiz (Dec 08 2023 at 01:29):

At the level of the type-specialized IR closure captures are identical to normal parameters

view this post on Zulip Ayaz Hafiz (Dec 08 2023 at 01:30):

So I guess I'm not sure why they have to be treated specially

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 01:30):

Wait, really. But captures can live for a long long time while regular params only live for the body of the function.

view this post on Zulip Ayaz Hafiz (Dec 08 2023 at 01:31):

How can the captures outlive the body of the function? Only if they're returned right? But that's the same as for the case of regular parameters

view this post on Zulip Ayaz Hafiz (Dec 08 2023 at 01:32):

After type specialization of lambda sets, all closures are unboxed and their captures are passed as an extra (unboxed) parameter

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 01:32):

yeah, I just realized that as I was trying to write out an example. So it can be borrowed if it is only used in a borrowing way and never possibly returned from the function. With those rules, I think it is the same for captures and params

view this post on Zulip Ayaz Hafiz (Dec 08 2023 at 01:32):

Right

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 02:45):

A note on the cost of this refcounting: I ported some hashmap benchmarks from C++ to Roc. The test where we are closest to C++ in perf (about 2x slower) is the insert/remove bench. The important part about insert and remove: because they mutate the dictionary, they just take ownership of it and never need to touch the refcount.
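
(Not the ported benchmark itself, just a sketch of its shape: each Dict.insert and Dict.remove receives the only reference to the dict, so it can mutate in place without any refcount traffic.)

insertRemoveBench = \keys ->
    filled = List.walk keys (Dict.empty {}) \dict, k -> Dict.insert dict k k
    List.walk keys filled \dict, k -> Dict.remove dict k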

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 02:46):

Still slower than I would hope, but it is a lot better than the 4x of some other cases.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 07:35):

I have only ported 1 benchmark so far, but I am a bit surprised. Decided to test out Go just for the fun of it. In the test of inserting 100 million keys, clearing the dict, inserting 100 million keys again, and finally deleting 100 million keys 1 by 1: Go is 4.5x slower than the current version of Roc.

Go spends a heck of a lot of system time. I wonder if it is doing something weird with memory that is just eating tons of kernel time.

Definitely will have to port a few more benchmarks and compare at some point.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 07:37):

We also use about 2x less memory, which makes sense cause Roc cleans up memory immediately instead of lazily

view this post on Zulip Jacob (Dec 08 2023 at 10:48):

I think Lobster (the language) infers borrowing/owning arguments or something, can't really remember. I remember the creator being asked about code blowup. Maybe relevant?

view this post on Zulip Richard Feldman (Dec 08 2023 at 11:54):

wait, so what does that mean in terms of comparing Go to C++ on these particular benchmarks? Go is 10x slower?

view this post on Zulip Elias Mulhall (Dec 08 2023 at 14:18):

I love that all this started with me thinking that creating thousands of small sets would be a performant way to solve a problem, then complaining when that wasn't the case. Really enjoying following this thread.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 15:24):

I may have an addiction

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 15:25):

Richard Feldman said:

wait so what does that means in terms of comparing Go to C++ on these particular benchmarks? Go is 10x slower?

Yep.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 15:27):

That said, something still feels wrong to me about that giant perf difference, need to dig in more.

view this post on Zulip Richard Feldman (Dec 08 2023 at 15:39):

Brendan Hansknecht said:

I may have an addiction

gottagofast.webp

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:15):

Looking into the Go hash impl, I was reminded that Go does not have custom equality or custom hashing (maybe that has changed with the new generics?) and instead always uses an autoderived impl for maps. On top of that, maps don't implement hash or equality (and Go has no set type).

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:18):

My guess as to why Go maps are slow, in one picture. The key is the bottom right corner.

164cba0e-0cf4-4d49-850d-029e37a0acf0.jpg

Honestly, their design is fine, but it is not totally flat. As such, when a bucket overflows, instead of placing elements in a different calculated bucket in the same array, they just allocate a new bucket on the heap. So it is kinda in between flat and traditional, since a bucket contains 8 elements and overflow should be rare.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:20):

I'm guessing the benchmark led to lots of random heap jumps.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:23):

Also, I never realized that all the generics in the Go runtime are implemented C-style. To call map.get, the runtime passes in a map that essentially has void* pointers (since it knows nothing about the data type), and a struct of function pointers that know about the data and how it works.

view this post on Zulip Richard Feldman (Dec 08 2023 at 17:25):

oh I assumed that was how they did it

view this post on Zulip Richard Feldman (Dec 08 2023 at 17:25):

monomorphization is very rare among languages with automatic memory management!

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:26):

Yeah, but most languages just use runtime type info, right?

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:28):

But yeah, the summary of the Go hash map: it's kinda like absl::flat_hash_map, but less flat and with more generic pointer shuffling.

Oh, and with metadata in the buckets instead of stored densely.

view this post on Zulip Richard Feldman (Dec 08 2023 at 17:29):

I think that varies. I don't know for sure but I believe Haskell does it the way Go does

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:39):

Interesting: Go maps, after growing, migrate to the new map in chunks over multiple calls to insert instead of doing it all at once.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:39):

So for a bit, it will be two partial maps.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:40):

Trying to make the perf of insert more consistent, probably at the cost of more total work over time.

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 17:45):

Hmm, sounds like Go may have a weird hash function choice too, to make it more secure against hash flooding attacks without needing a fallback. Will have to look into that. They used to; now they also just use wyhash.

view this post on Zulip timotree (Dec 08 2023 at 18:16):

Brendan Hansknecht said:

Interesting go maps after growing migrate to the new map in chunks over multiple calls to insert instead of doing it all at once.

This sounds like https://github.com/jonhoo/griddle

view this post on Zulip Brendan Hansknecht (Dec 08 2023 at 18:18):

Yep

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 02:08):

Ok, so it turns out that Go just has horrible perf on ARM. On x86, it is actually quite fast: in between Roc and C++ for this benchmark.

