Elias Mulhall said:
Day 3, this one has a performance bug that I'll talk about below
Day 3, with a workaround
Performance Issue Details
I haven't root caused this or anything, but it honestly might just be a case of a bad use case for set that got unlucky with perf (though obviously there must be some sort of bug, given the issue with --optimize).
Anyway, stats I have so far.
1) generates 1863 neighbor sets of size 8. 8 is an unlucky number: it means that every one of those sets will be allocated, have 7 values inserted (which requires hashing them), get reallocated and rehashed, and then have the final value inserted. So every fromList call here will cost 2 allocations, 15 hashes, and some other data twiddling.
Fixing fromList perf and, at a minimum, getting withCapacity properly working should help a lot here. I forget why fromCapacity is not implemented currently, but this probably is pretty terrible for perf.
2) This generates 32222 sets of size 0 to 3 for the various indices, 29788 of which are just empty. Given these never rehash, this is just a ridiculous number of allocations (which wouldn't happen with an empty list) due to dictionaries having a default capacity of 8. And currently a dictionary is 2 allocations. So compared to using lists, this is 59576 extra allocations... probably this is the main issue?
3) Set.intersection is currently not written in the optimal way. It probably should be walking one set and removing items, instead of generating a totally new set from empty (see the sketch below).
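For illustration, a rough sketch of that walk-and-remove shape (not the actual builtin; it assumes Set.walk, Set.contains, Set.remove, and Set.len with their usual shapes):
intersectSketch : Set a, Set a -> Set a
intersectSketch = \a, b ->
    # make sure we start from (and walk) the smaller set
    if Set.len b < Set.len a then
        intersectSketch b a
    else
        # drop anything not present in the other set,
        # instead of building a totally new set from empty
        Set.walk a a \acc, k ->
            if Set.contains b k then
                acc
            else
                Set.remove acc k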
love this investigation! :star_struck:
Ahhh, the size 8 thing makes sense. It would be pretty straightforward to add a Set.withCapacity, right? Actually in that case I'm doing Set.fromList... would it be bad for that function to eager-allocate the set to have the same capacity as the list? Depending on use case that might mean allocing way more space than needed, but not to a ridiculous degree.
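A minimal sketch of that eager-allocation idea, assuming a hypothetical Set.withCapacity along the lines discussed above:
fromListSketch : List a -> Set a
fromListSketch = \list ->
    # preallocate to the list's length so inserting never triggers a rehash
    List.walk list (Set.withCapacity (List.len list)) Set.insert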
Oops, I skipped the paragraph where you already said that
Is the --optimize issue reproducible, or just me?
Reproducible on my end
I'll try to implement a few improvements and see how it goes.
As an extra note, if we were really careful, the set could theoretically take ownership of the list and reuse the allocation. Though it would be a more complex algorithm; it would need to deal with duplicate elements in place.
So compared to using lists, this is 59576 extra allocations
...
So allocating is faster than not allocating. Cause if we don't allocate, we have to add an extra conditional to all insert/get/remove/etc calls.
about 10% faster to just always allocate with this example
what's the extra conditional?
like what does it do
before it would just unsafe load an index from the metadata
Now it has to check length is not 0
and for insert of course, if not allocated, it has to allocate at the beginning of the function.
Also, due to the sets being super small (max size 8 elements), I don't think that sets will ever be faster than lists for this example. Lots of extra overhead with hashing, allocations, and extra conditionals with very, very small data.
I wonder if there's some cmov trick that could work here
like this part:
Brendan Hansknecht said:
before it would just unsafe load an index from the metadata
could use a cmov for "if length is 0, set the pointer to be a pointer to the address of the length itself" or something
(I don't know what the metadata would need to be for "this is missing" though)
probably not.
It is essentially:
if initialized:
# do some gigantic compute
else:
just return
Generally cmov is only good for even-ish, small compute; branch prediction is much better for cases like this.
well that's true for insert, right? But get presumably never initializes anything
but I guess it's the others that are significantly slowed down here
So insert is actually
if !initialized:
initialize
# do rest of compute with initialized dict
for get, it is the case mentioned above, because hashing and calculating index-related stuff is the large compute I was mentioning.
:thinking: doesn't insert need a branch anyway to see if it's over capacity?
Only if the key isn't found
ok so then it branches on whether the key is found, right?
yep
but checking indices assumes the dict is initialized
So it never does any bounds checks
gotcha, so what if that branch's conditional branchlessly checked length?
e.g. do both the cmov thing I mentioned earlier and also an AND with it being nonempty
That branch is after it has already unconditionally loaded from the metadata. Which would deref a null pointer on an uninitialized dict.
that's what I mean about doing a cmov to have the null pointer point to &length or whatever so it can be successfully dereferenced to something we'll ignore anyway
e.g. (pretending && is not going to desugar to a condition here):
let isEmpty = len == 0;
let metadataPtr = if isEmpty { &len } else { normalMetadataPtr };
if !isEmpty && doNormalCheckOn(metadataPtr) { ... }
it's definitely possible that would still be more costly than the branch misprediction though haha
Brendan Hansknecht said:
Also, due to the sets being super small (max size 8 elements), I don't think that sets will ever be faster than lists for this example. Lots of extra overhead with hashing, allocations, and extra conditionals with very, very small data.
something I've wondered about is whether we should consider having a "small hashmap optimization" where we don't use a hashmap if the collection is small enough
and instead Dict (and therefore Set too) is backed by a flat List of tuples
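For a rough idea of the shape being discussed, a hedged sketch of a flat, list-backed dict (listDictGet and listDictInsert are hypothetical names, and both operations are linear scans, which is the tradeoff):
# under the hood this would just be a List (k, v)
listDictGet = \dict, key ->
    List.findFirst dict (\(k, _) -> k == key)
    |> Result.map (\(_, v) -> v)

listDictInsert = \dict, key, value ->
    when List.findFirstIndex dict (\(k, _) -> k == key) is
        Ok i -> List.set dict i (key, value)
        Err NotFound -> List.append dict (key, value)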
So I figured out how to remove the branch. Just use constant lists. Then it is always initialized
Instead of before where it was using list.repeat.
Also, initialization changes seem to have fixed the --optimize bug.
In optimized, this is about 6x slower than the list version, which honestly may be expected perf. Probably should write something similar in C++ with absl::flat_hash_map and vector to compare.
constant lists as in they're allocated in the readonly section of the binary?
Yeah, they should be. Though looking at the llvm, I think llvm is smart enough to make them magically disappear cause they are so small. Not fully sure
very cool!
what do you think of the "small hashmap optimization" idea?
Mostly questioning if it is a good idea for automatic application or a better idea for human optimization.
As in, we could expose both types with the same API (or make the list version a package). Explicitly choosing either one avoids perf costs related to extra conditionals. So if you know you need large, or know you need small, it makes sense to pick explicitly. That said, a merged version may avoid some worst cases like this.
I think the pitch for it is that it's common to use Set or Dict not so much for performance but rather because they enforce a guarantee and clearly communicate intent
and we could have separate types for that, but I do wonder whether the extra conditional matters in the "actually use a hash map" case when you have so many entries that you're probably getting cache misses
not so much for performance
To that I would say, use --optimize and accept that it may not be the same perf as picking a ListSet or ListDict, but perf isn't the main concern.
Also, I need to do some measurements, but likely we should just rewrite dict as a zig builtin.
That probably will close a solid chunk of the gap.
But I can't really guarantee that currently; it is mostly speculation. That said, I was unable to implement part of the absl::flat_hash_map optimization in roc because it performed worse.
Though I think we have a builtin to address part of that now, so maybe I should try again.
I also wonder if a List.appendUnique or something might be useful, like some quick way to use a list like a Set except with different performance characteristics
eh, might be clearer to get a ListSet off the shelf or something
oh, so essentially:
if !List.contains list x then
List.append list x
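As a standalone helper, that could look something like this (appendUnique is a hypothetical name; List.contains makes it a linear scan per insert):
appendUnique = \list, x ->
    if List.contains list x then
        list
    else
        List.append list x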
Just found a huge speed boost in the app...
from
|> Set.union state
to
|> \x -> Set.union state x
Made the optimized build about 2x faster
Anyway, on a branch, I am down to 3x slower on an optimized build with set vs with list.
Brendan Hansknecht said:
Just found a huge speed boost in the app...
from
|> Set.union state
to
|> \x -> Set.union state x
Made the optimized build about 2x faster
whoa! Why does that improve speed?
It is the equivalent of going from
List.concat smallList largeList
To
List.concat largeList smallList
So copying a few elements to the end of something large instead of copying a ton of elements to the end of something small.
In a very hot loop
ohhh
:thinking: does that suggest we should swap the argument order there?
oh nm, I guess this is application-specific
might be too fancy, but I wonder if it's worth picking which one we insert into based on which is bigger, assuming they're both unique
I added a conditional in the function. I think that will be worth it in general
in Set.union you mean?
yeah
cool, makes sense to me!
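For reference, the shape of that conditional might look roughly like this (a sketch of the idea, not the actual builtin implementation):
unionSketch = \a, b ->
    if Set.len a < Set.len b then
        # make sure a is the larger set
        unionSketch b a
    else
        # walk the smaller set and insert into the larger one,
        # so the hot loop copies a few elements instead of many
        Set.walk b a Set.insert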
@Elias Mulhall can you test on the set-perf branch? See if --optimize still hangs for you.
Anyway, patched up a handful of improvements in #6176
Also, I wonder if the full, performant impl of absl::flat_hash_set might actually be faster than regular lists. The main feature we are missing currently is groups. Groups make it so that a chunk of the hash can be compared for 8 indices at once. So it might do 1 hash and 1 comparison to avoid up to 8 calls to ==. That said, it still has the rest of the extra conditionals and overhead.
Now that we have popCount as a builtin, I should revisit implementing that in roc and measuring the perf gains.
Should definitely convert this or a similar example to C++ with absl::flat_hash_map to see what our real perf diff is.
that sounds fascinating! :smiley:
As an aside, about the small dict impl: the larger the key is, the worse a small dict is, because a large key means an expensive Eq. So it would be a lot faster to do more hash comparisons rather than key comparisons.
So probably ListDict is only better with both small size and small key.
yeah true, although we do know the size of that statically, so could have a heuristic involving both
Sure, if roc exposed that at compile time or you add an extra runtime check
Brendan Hansknecht said:
Elias Mulhall can you test on the set-perf branch? See if --optimize still hangs for you.
Unfortunately it still does! Anything I can do to debug further?
Oh wait, I lightly modified the code, with the original I can still repro...let me take a look at that specifically.
Ah ha, good luck!
So I'm not sure how easy this will be to make a minimized test case, but this is not a Set bug at all.
It is some sort of refcounting/uniqueness failure. The bug is specifically with the newPartNumber variable: https://gist.github.com/mulias/b07e012ced92406f59cc2548d0f01323#file-2023day03-roc-L66
It should always have multiple references which means on this line here it should incur a deep copy:
https://gist.github.com/mulias/b07e012ced92406f59cc2548d0f01323#file-2023day03-roc-L88
That is not happening. As such, we are getting these weird sets that have old data from the last iteration of the loop. Unsurprisingly, mixing old data into a new set breaks things.
A simple workaround is just to inline the variable uses:
current: { digits: [], indices: Set.empty {} },
Doing a quick scan of the mono IR, it seems that every iteration of Array2d.walk the value is decremented, but it doesn't look to ever be incremented. I assume it would get incremented through the closure capture somewhere or something like that, but I don't see it. (I definitely might just be missing it.)
actually, it looks like Array2D.walk calls List.walkWithIndex, which calls List.walkWithIndexHelp, which increments the function's refcount. That should mean this all works correctly. Maybe a bug in the actually generated increment or decrement function?
So I was able to work around the refcount craziness (it was affecting both the list and set versions of the code). Fixing that made both versions about 4x faster. So on my machine the list version is now 40ms and the set version 150ms. Both quite respectable times, and the flamegraphs now look like they are spending their time mostly doing real work and not too much in refcounts.
All of the time is now essentially spent in calculating set overlaps (again, these are small sets: neighbors is 8 elements and the digit sets are 1 to 3 elements).
So if we optimize the set overlap calculation to exit early like the list overlap calculation, we get to the point where the set version is slightly more than 2x slower than the list version, at about 90ms:
isOverlapping : Set a, Set a -> Bool
isOverlapping = \a, b ->
if Set.len b < Set.len a then
isOverlapping b a
else
Set.walkUntil a Bool.false \_, k ->
if Set.contains b k then
Break Bool.true
else
Continue Bool.false
Anyway, that is probably enough digging into the perf of all of this.
I should probably file some issues to track some of this, but I will leave that as an exercise for tomorrow-Brendan cause I am currently quite tired.
Richard Feldman said:
what do you think of the "small hashmap optimization" idea?
FWIW maps on the Beam are just arrays under the hood if size < 32
Anyway, that is probably enough digging into the perf of all of this.
I would say so, you went much deeper into this than I anticipated! Thanks for all the details, very interesting.
Haha. I enjoy digging into performance and such
Hannes Nevalainen said:
FWIW maps on the Beam are just arrays under the hood if size < 32
Do you have a link to info on this? Or source I guess.
So I was able to work around the refcount craziness (was affecting both the list and set version of the code). Fixing that made both versions about 4x faster.
@Brendan Hansknecht I'm not sure I'm following how you did this. Was this a code change or a compiler change?
Ah sorry, didn't share:
gearPartNumbers : Schematic -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic ->
partNumbers = allPartNumbers schematic
Array2D.walk schematic ([], partNumbers) { direction: Forwards } \(state, pns), elem, index ->
when elem is
Symbol '*' ->
elemNeighbors = neighbors index
neighboringPartNumbers =
List.keepIf pns \partNumber ->
isOverlapping partNumber.indices elemNeighbors
when neighboringPartNumbers is
[pn1, pn2] -> (List.append state (pn1, pn2), pns)
_ -> (state, pns)
_ -> (state, pns)
|> .0
Same can be done to the list version
I think if partNumbers is a capture of the closure, we aren't smart about refcounts and recursively increment the refcounts of everything contained in partNumbers.
If we pass it in instead, we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount. That is at least my guess.
@Ayaz Hafiz or @Folkert de Vries does that sound plausible? If so, is there an actionable bug I can file?
If we pass it in instead, we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount. That is at least my guess.
Can you elaborate what you mean by this? I don't follow
This is the original code
gearPartNumbers : Schematic, List PartNumber -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic, partNumbers ->
Array2D.walk schematic [] { direction: Forwards } \state, elem, index ->
when elem is
Symbol '*' ->
adjacentPartNumbers =
List.keepIf partNumbers \partNumber ->
anyAdjacent partNumber.indices [index]
when adjacentPartNumbers is
[pn1, pn2] -> List.append state (pn1, pn2)
_ -> state
_ -> state
This code is faster because it seems to eliminate ref counts on partNumbers
gearPartNumbers : Schematic, List PartNumber -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic, partNumbers ->
Array2D.walk schematic ([], partNumbers) { direction: Forwards } \(state, pns), elem, index ->
when elem is
Symbol '*' ->
adjacentPartNumbers =
List.keepIf pns \partNumber ->
anyAdjacent partNumber.indices [index]
when adjacentPartNumbers is
[pn1, pn2] -> (List.append state (pn1, pn2), pns)
_ -> (state, pns)
_ -> (state, pns)
|> .0
we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount.
So that comment turned out to be wrong. I thought we had an optimization that enables sometimes just incrementing a list's outer refcount instead of incrementing it recursively, but even if we do, it doesn't apply here.
In the original version, partNumbers is a closure capture. On every iteration of Array2D.walk, we would recursively increment the closure captures and spend a lot of time incrementing refcounts in partNumbers. On top of that, at the end of the function, we would recursively decrement the refcount. Overall, just eating tons of time with refcounting.
If we explicitly make partNumbers part of the state, this goes away. Most of the time, pns is just passed through without getting used. This means its refcount is never incremented or decremented. The only time we have to deal with refcounts of pns in the new version of the code is when the symbol is *. In that case, the refcount of pns is recursively incremented. So this is just way, way less refcount fiddling overall. It turned out not to be a fancy optimization, just hugely cutting how often we have to deal with this.
Summary: recursive refcount increments/decrements can be super brutal. I wonder if there is a better way to deal with a case like this.
Looking at the IR to confirm...
Here's an updated gist if that helps https://gist.github.com/mulias/3e7fb28d934736155028d5e85c9bb3ca
One of the versions of the function is commented out
Hmm yeah your analysis sounds right to me Brendan
It's not totally clear to me why in the first case the capture is seen as borrowed, and in the second case it isn't though
The thing is that on the surface, the capture is seen as owned, until array2d.Array2D.iterate is hit
And then we have
# SLOW
procedure : `array2d.Array2D.iterate` List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}
procedure = `array2d.Array2D.iterate` (`#Derived_gen.IdentId(86)`: {List [C , C U8, C U8], {U64, U64}}, `#Derived_gen.IdentId(87)`: {U64, U64}, `#Derived_gen.IdentId(88)`:
List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, `#Derived_gen.IdentId(89)`: {}, `#Derived_gen.IdentId(90)`: List {List U8, List {U64, U64}}):
joinpoint `array2d.Array2D.770` `array2d.Array2D.array` `array2d.Array2D.index` `array2d.Array2D.state` `array2d.Array2D.nextIndexFn` `array2d.Array2D.nextStateFn`:
inc `array2d.Array2D.array`;
let `array2d.Array2D.771` : [C {}, C [C , C U8, C U8]] = CallByName `array2d.Array2D.get` `array2d.Array2D.array` `array2d.Array2D.index`;
let `array2d.Array2D.791` : U8 = 1i64;
let `array2d.Array2D.792` : U8 = GetTagId `array2d.Array2D.771`;
let `array2d.Array2D.793` : Int1 = lowlevel Eq `array2d.Array2D.791` `array2d.Array2D.792`;
if `array2d.Array2D.793` then
inc `array2d.Array2D.array`;
let `array2d.Array2D.elem` : [C , C U8, C U8] = UnionAtIndex (Id 1) (Index 0) `array2d.Array2D.771`;
let `array2d.Array2D.788` : [C {}, C {U64, U64}] = CallByName `array2d.Array2D.decY` `array2d.Array2D.array` `array2d.Array2D.index`;
inc `array2d.Array2D.nextStateFn`; <<<<< THIS is the INC
let `array2d.Array2D.789` : [C List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, C List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}] = CallByName `array2d.Array2D.200` `array2d.Array2D.state` `array2d.Array2D.elem` `array2d.Array2D.index` `array2d.Array2D.nextStateFn`;
# FAST
procedure : `array2d.Array2D.iterate` {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}
procedure = `array2d.Array2D.iterate` (`#Derived_gen.IdentId(48)`: {List [C , C U8, C U8], {U64, U64}}, `#Derived_gen.IdentId(49)`: {U64, U64}, `#Derived_gen.IdentId(50)`:
{List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, `#Derived_gen.IdentId(51)`: {}, `#Derived_gen.IdentId(52)`: {}):
joinpoint `array2d.Array2D.1039` `array2d.Array2D.array` `array2d.Array2D.index` `array2d.Array2D.state` `array2d.Array2D.nextIndexFn` `array2d.Array2D.nextStateFn`:
inc `array2d.Array2D.array`;
let `array2d.Array2D.1040` : [C {}, C [C , C U8, C U8]] = CallByName `array2d.Array2D.get` `array2d.Array2D.array` `array2d.Array2D.index`;
let `array2d.Array2D.1060` : U8 = 1i64;
let `array2d.Array2D.1061` : U8 = GetTagId `array2d.Array2D.1040`;
let `array2d.Array2D.1062` : Int1 = lowlevel Eq `array2d.Array2D.1060` `array2d.Array2D.1061`;
if `array2d.Array2D.1062` then
inc `array2d.Array2D.array`;
let `array2d.Array2D.elem` : [C , C U8, C U8] = UnionAtIndex (Id 1) (Index 0) `array2d.Array2D.1040`;
let `array2d.Array2D.1057` : [C {}, C {U64, U64}] = CallByName `array2d.Array2D.incX` `array2d.Array2D.array` `array2d.Array2D.index`;
let `array2d.Array2D.1058` : [C {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, C {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}] = CallByName `array2d.Array2D.200` `array2d.Array2D.state` `array2d.Array2D.elem` `array2d.Array2D.index` `array2d.Array2D.nextStateFn`;
In the fast example, the closure has no captures, so the inc is elided? It would just be a call to a no-op?
So I'm not really sure why we're seeing it as borrowed in one case and not the other. Maybe the fact that the capture parameter passes out specializes it? But I would expect that it's seen as irrelevant in the first (capturing) case for sure, especially because it unwraps transparently to an extra List parameter
Maybe @Folkert de Vries has an idea
Well, but it would inc on something, right? Like presumably the outer capture is passed in some way to iterate
But that seems to be missing
Although I haven't looked at the iterate implementation so I could be off
It is pretty direct and just recursive: https://github.com/mulias/roc-array2d/blob/951b2a7ae561de88bf185bb61e26244dc9f5d508/package/Array2D.roc#L894-L906
The IR looks weird to me for a different (and probably unrelated) reason -- Array2D.walk with default params should just be a wrapper for List.walkWithIndex, the iterate function shouldn't get involved unless you're walking in a different order :thinking:.
Oh wait, that may be exactly why it changed to borrowing instead of owning
Maybe in this dispatch function where there are two paths, we mess something up and switch from owning to borrowing?
Or maybe it is related to putting the closure in another closure?
\state, elem, listIndex ->
fn state elem (arrayIndexOf listIndex arrayShape)
Just guesses
Maybe we should try to trim this down to just a simple list version and see if we can still repro.
Brendan Hansknecht said:
Hannes Nevalainen said:
FWIW maps on the Beam are just arrays under the hood if size < 32
Do you have a link to info on this? Or source I guess.
It is apparently a "hash array mapped trie" for 32 entries or less. I don't really know more than that :) here are some links
https://twitter.com/akoutmos/status/1266034402422853633
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_map.h#L72-L76
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_map.c#L264-L268
https://elixirforum.com/t/big-maps-versus-small-maps-performance/31909
Thanks!
Maybe we should try to trim this down to just a simple list version and see if we can still repro.
I don't have a small repro currently, but I am pretty sure I hit this again when working on the implementation details of Dict. I have a hot loop with a .walk function in it that happens to capture a list. 45% of the hot loop execution time is spent on refcounting that really shouldn't be needed (I'll probably have to manually inline this for perf reasons). To my understanding, the core issue is that we really need to do inlining before inserting refcounts. LLVM is inlining the lambda, but it still has a capture with a refcount inc before the lambda and a refcount dec in the lambda.
I would guess a minimal repro will just be something simple like:
x = # Some large list of refcounted things, assume a list of sets
y = # Some other list to walk, let's assume it has indices into x
List.walk y 0 \count, index ->
when List.get x index is
Ok set ->
count + (Set.len set)
Err OutOfBounds ->
count
Assuming I have this right, this will just thrash the refcounts of all elements in x.
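A filled-in version of that sketch, with arbitrary placeholder data just to make it concrete (the exact values don't matter):
x = List.repeat (Set.fromList [1, 2, 3]) 1000 # large list of refcounted things
y = List.repeat 500 100000 # indices into x, all in bounds

total =
    List.walk y 0 \count, index ->
        when List.get x index is
            Ok set -> count + Set.len set
            Err OutOfBounds -> count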
I am still diving down the trail of dictionary perf improvements and I found something interesting. listGetUnsafe is supposed to borrow its input. This means that no refcount changes are required to call the function.
This does not look to be working correctly in my dictionary code. I see lots of cases like this:
inc `Dict.metadata`;
let `Dict.group` : U64 = CallByName `Dict.listGetUnsafe` `Dict.metadata` `Dict.slotIndex`;
If I follow that link through, I find:
procedure : `Dict.listGetUnsafe` {U32, {}}
procedure = `Dict.listGetUnsafe` (`#Attr.#arg1`: List {U32, {}}, `#Attr.#arg2`: U64):
let `Dict.960` : {U32, {}} = lowlevel ListGetUnsafe `#Attr.#arg1` `#Attr.#arg2`;
dec `#Attr.#arg1`;
ret `Dict.960`;
Even though the lowlevel does not require changing the refcount, we are wrapping it in a function call, and that function apparently requires changing the refcount. So this is defeating the whole point of the lowlevel not needing to change the refcount.
Any ideas how to fix this?
The same thing does not happen to List.getUnsafe:
procedure : `List.getUnsafe` U64
procedure = `List.getUnsafe` (`#Attr.#arg1`: List U64, `#Attr.#arg2`: U64):
let `List.612` : U64 = lowlevel ListGetUnsafe `#Attr.#arg1` `#Attr.#arg2`;
ret `List.612`;
Oh, I figured out the missing piece. Refcounts gone!!!
Also, I am so thankful for drop specialization...it is removing refcounts for me as well
Now I just want closures to be able to borrow instead of own captures such that they can avoid refcounting.
So ummm.....Dictionaries with refcounted keys are problematic:
Screenshot-2023-12-05-at-9.22.17PM.png
The purple is refcounting. It is 80% of the execution time. The other 20% of the execution time is splitting a gigantic string.
So ~100% of the execution time spent in dictionary related functions is spent on recursive refcount increments and decrements.
I thought morphic was supposed to enable a function like Dict.get to simply "borrow" its input, since we can statically guarantee it is never modified, only read. As such, we would elide all of this refcounting madness. Is that not the case? Is this an optimization we can add?
Brendan Hansknecht said:
Now I just want closures to be able to borrow instead of own captures such that they can avoid refcounting.
@Folkert de Vries any thoughts or info on this? I don't really know what's possible here :sweat_smile:
this is not a morphic thing at all
the refcounting approach is now to always own values, and push RC updates into the called function. That is optimal for in-place mutation, because borrowed values can never be mutated
inferring borrowing is hard, and it's harder to bound the amount of code duplication for owned/borrowed arguments
e.g. what if you have multiple arguments, they can all be owned or borrowed, this cascades
I wonder about that last part in practice :thinking:
like certainly it's possible to imagine that blowing up, but that doesn't necessarily mean it would in practice (after all, the same is true of monomorphization in general!)
you would now do it twice though
but to be clear, this is not some technical limitation, it's a tradeoff
the core problem is that you don't know at any particular point whether to own or borrow: borrowing saves RC manipulation, but disallows mutation. In the middle of a call chain you don't know which is best. Even in most functions it's tricky to know whether being able to mutate is useful.
there is another problem here which is that with borrowing, in recursive functions, you may run into big spikes in memory use, because all the memory is borrowed and only released when all of those recursive functions return
all of that is to say: you have to be careful with borrowing. If you can figure out a good heuristic then there should be no problem technically, but finding that heuristic is not trivial
I wonder how we might go about trying to figure out what a good heuristic might be :thinking:
now that we've run into concrete performance problems of the form "almost all this program does is increment and decrement refcounts, and then it does a bit of work which isn't that" it's probably worth starting to discuss things we might want to try :big_smile:
So we have builtins that borrow things because they are read-only, like List.getUnsafe. My wrong understanding was that if a function only called borrowing builtins, it would also just borrow the value. So it would kinda propagate out from those builtins.
That would fix this specific problem
it seems like for that scenario, there would be no risk of a blowup, right?
But would add the delayed freeing of memory in recursive functions like mentioned by folkert. (Though not sure how realistic of a problem this is in practice)
"if this function argument is only ever passed to things that borrow, then it too can be borrowed"
I'm more worried about the overhead of reference counting than the delaying of freed memory
I agree.
Also, I swear that at some point we talked about the idea of enabling just refcounting the outer list instead of recursively refcounting all elements in some cases. If we could do that here, it also would alleviate the issue. Though, I don't recall the context in which we thought a change like this might work.
oh yeah we should totally change how lists do RC
Do we have concrete ideas for that or is this a case where we need more research projects like Morphic?
Looping back to the idea of borrowing an input to a function if all uses in the function also borrow the input, is that something we could just append onto the type checker? Like for each value we want to know both the type and a bit flag of whether or not the value is potentially dirty. Any call in a sub branch would make an entire body dirty. Of course there would need to be some custom propagation logic at function boundaries.
Would that make sense and be something of relatively small cost to add onto the type checker? Should it go somewhere else? Is it just a bad idea overall? Any other suggestions for a solution?
I was talking to @Folkert de Vries about this the other day and he mentioned that doing the analysis on mono instead of during type checking would have various benefits
well the main benefit is that you can reason about higher-order functions
(with lambda sets, at least)
Ah, then you could analyze closure captures and do something smarter?
@Brendan Hansknecht is this something you'd be interested in getting into in terms of compiler development? (With guidance of course!)
Well, I do want dict to run fast and this will be fundamental to getting it closer to c++ speeds. So for sure.
Also, I think it will be helpful for me to understand more deeply. I have some vague performance ideas, but many I am not sure could work in roc without really expensive analysis.
that would be amazing! @Folkert de Vries are you currently the only person with domain expertise in this? Or does anyone else know how it works/how to implement things to try?
onto the type checker? Should it go somewhere else?
Doing it in the typechecker has other problems too, namely that you lose all specialization and are bound to the virality of unification. Really you want this over the SSA form.
also it slows down roc check
unnecessarily
(and therefore editor integrations)
also, I guess if we want to do this for closure captures as well, we need to check that the closure capture is essentially never saved and is just directly called, or something of that nature. Think of List.map and List.walk closures that may capture values. It would suck if List.walk capturing a refcounted value and then just using it in a borrowed manner had bad perf compared to passing in the value as part of the state.
Cause if the closure capture is saved, it actually does need to make sure the value isn't freed.
At the level of the type-specialized IR closure captures are identical to normal parameters
So I guess I'm not sure why they have to be treated specially
Wait, really. But captures can live for a long long time while regular params only live for the body of the function.
How can the captures outlive the body of the function? Only if they're returned right? But that's the same as for the case of regular parameters
After type specialization of lambda sets, all closures are unboxed and their captures are passed as an extra (unboxed) parameter
yeah, I just realized that as I was trying to write out an example. So it can be borrowed if it is only used in a borrowing way and never possibly returned from the function. With those rules, I think it is the same for captures and params
Right
A note on the cost of this refcounting. I ported some hashmap benchmarks from C++ to roc. The test where we are closest to C++ in perf (about 2x slower) is the insert/remove bench. The important part about insert and remove: because they mutate the dictionary, they just take ownership of it and never need to touch the refcount.
Still slower than I would hope, but it is a lot better than the 4x of some other cases.
I have only ported 1 benchmark so far, but I am a bit surprised. Decided to test out Go just for the fun of it. In the test of inserting 100 million keys, clearing the dict, inserting 100 million keys again, and finally deleting 100 million keys 1 by 1: golang is 4.5x slower than the current version of roc.
Go spends a heck of a lot of system time. I wonder if it is doing something weird with memory that is just eating tons of kernel time.
Definitely will have to port a few more benchmarks and compare at some point.
We also use about 2x less memory, which makes sense cause Roc cleans up memory immediately instead of lazily
I think the Lobster language infers borrowing/owning arguments or something, can't really remember. I remember the creator being asked about code blowup. Maybe relevant?
wait so what does that means in terms of comparing Go to C++ on these particular benchmarks? Go is 10x slower?
I love that all this started with me thinking that creating thousands of small sets would be a performant way to solve a problem, then complaining when that wasn't the case. Really enjoying following this thread.
I may have an addiction
Richard Feldman said:
wait so what does that means in terms of comparing Go to C++ on these particular benchmarks? Go is 10x slower?
Yep.
That said, something still feels wrong to me about that giant perf difference, need to dig in more.
Brendan Hansknecht said:
I may have an addiction
Looking into the Go hash impl, I was reminded that Go does not have custom equality or custom hash (maybe that has changed with the new generics?) and instead always uses an autoderived impl with maps. On top of that, maps don't implement hash or equality (and they have no set type).
My guess as to why go maps are slow in one picture. The key is the bottom right corner.
164cba0e-0cf4-4d49-850d-029e37a0acf0.jpg
Honestly, their design is fine, but it is not totally flat. As such, in the case where a bucket overflows, instead of placing elements in a different calculated bucket in the same array, they just allocate a new bucket on the heap. So it is kinda in between flat and traditional, because a bucket contains 8 elements and overflow should be rare.
I'm guessing the benchmark led to lots of random heap jumps.
Also, I never realized that all the generics in the Go runtime are implemented C-style. To call map.get, the runtime passes in a map that essentially has void* pointers (because it knows nothing about the data type), and a struct of function pointers that know about the data and how it works.
oh I assumed that was how they did it
monomorphization is very rare among languages with automatic memory management!
Yeah, but most languages just use runtime type info, right?
But yeah, summary of go lang hash map is kinda like absl::flat_hash_map, but less flat and with more pointer generic shuffling.
Oh and with metadata in buckets instead of stored densely.
I think that varies. I don't know for sure but I believe Haskell does it the way Go does
Interesting: Go maps, after growing, migrate to the new map in chunks over multiple calls to insert instead of doing it all at once.
So for a bit, it will be two partial maps.
Trying to make the perf of insert more consistent, at the cost of probably worse perf overall.
Hmm, sounds like Go may have a weird hash function choice too, to make it more secure against hash flooding attacks without needing a fallback. Will have to look into that. They used to; now they also just use wyhash.
Brendan Hansknecht said:
Interesting go maps after growing migrate to the new map in chunks over multiple calls to insert instead of doing it all at once.
This sounds like https://github.com/jonhoo/griddle
Yep
OK, so it turns out that Go just has horrible perf on ARM. On x86, it is actually quite fast: in between roc and C++ for this benchmark.