Elias Mulhall said:
Day 3, this one has a performance bug that I'll talk about below
Day 3, with a workaround
Performance Issue Details
I haven't root caused this or anything, but it honestly might just be a case of a bad use case for set that got unlucky with perf (though obviously there must be some sort of bug, given the issue with --optimize).
Anyway, stats I have so far.
1) generates 1863 neighbor sets of size 8. 8 is an unlucky number: it means that every one of those sets will be allocated, have 7 values inserted (which requires hashing them), get reallocated and rehashed, and then have the final value inserted. So every fromList call here will cost 2 allocations, 15 hashes, and some other data twiddling.
Fixing fromList perf and, at a minimum, getting withCapacity properly working should help a lot here. I forget why fromCapacity is not implemented currently, but this probably is pretty terrible for perf.
2) This generates 32222 sets of size 0 to 3 for the various indices, 29788 of which are just empty. Given these never rehash, this is just a ridiculous number of allocations (which wouldn't happen with an empty list) due to dictionaries having a default capacity of 8. And currently a dictionary is 2 allocations. So compared to using lists, this is 59576 extra allocations... probably this is the main issue?
3) Set.intersection is currently not written in the optimal way. It probably should be walking one set and removing items, instead of generating a totally new set from empty (see the sketch below).
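For illustration, a rough sketch of that walk-and-remove shape (not the actual builtin; it assumes Set.walk, Set.contains, Set.remove, and Set.len with their usual shapes):
intersectSketch : Set a, Set a -> Set a
intersectSketch = \a, b ->
    # make sure we start from (and walk) the smaller set
    if Set.len b < Set.len a then
        intersectSketch b a
    else
        # drop anything not present in the other set,
        # instead of building a totally new set from empty
        Set.walk a a \acc, k ->
            if Set.contains b k then
                acc
            else
                Set.remove acc k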
love this investigation! :star_struck:
Ahhh, the size 8 thing makes sense. It would be pretty straightforward to add a Set.withCapacity, right? Actually in that case I'm doing Set.fromList... would it be bad for that function to eager-allocate the set to have the same capacity as the list? Depending on use case that might mean allocing way more space than needed, but not to a ridiculous degree.
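A minimal sketch of that eager-allocation idea, assuming a hypothetical Set.withCapacity along the lines discussed above:
fromListSketch : List a -> Set a
fromListSketch = \list ->
    # preallocate to the list's length so inserting never triggers a rehash
    List.walk list (Set.withCapacity (List.len list)) Set.insert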
Oops, I skipped the paragraph where you already said that
Is the --optimize issue reproducible, or just me?
Reproducible on my end
I'll try to implement a few improvements and see how it goes.
As an extra note, if we were really careful, the set could theoretically take ownership of the list and reuse the allocation. Though it would be a more complex algorithm; it would need to deal with duplicate elements in place.
So compared to using lists, this is 59576 extra allocations
...
So allocating is faster than not allocating. Cause if we don't allocate, we have to add an extra conditional to all insert/get/remove/etc calls.
about 10% faster to just always allocate with this example
what's the extra conditional?
like what does it do
before it would just unsafe load an index from the metadata
Now it has to check length is not 0
and for insert of course, if not allocated, it has to allocate at the beginning of the function.
Also, due to the sets being super small (max size 8 elements), I don't think that sets will ever be faster than lists for this example. Lots of extra overhead with hashing, allocations, and extra conditionals with very, very small data.
I wonder if there's some cmov trick that could work here
like this part:
Brendan Hansknecht said:
before it would just unsafe load an index from the metadata
could use a cmov for "if length is 0, set the pointer to be a pointer to the address of the length itself" or something
(I don't know what the metadata would need to be for "this is missing" though)
probably not.
It is essentially:
if initialized:
# do some gigantic compute
else:
just return
Generally cmov is only good for even-ish, small compute; branch prediction is much better for cases like this.
well that's true for insert, right? But get presumably never initializes anything
but I guess it's the others that are significantly slowed down here
So insert is actually
if !initialized:
initialize
# do rest of compute with initialized dict
for get, it is the case mentioned above, because hashing and calculating index-related stuff is the large compute I was mentioning.
:thinking: doesn't insert need a branch anyway to see if it's over capacity?
Only if the key isn't found
ok so then it branches on whether the key is found, right?
yep
but checking indices assumes the dict is initialized
So it never does any bounds checks
gotcha, so what if that branch's conditional branchlessly checked length?
e.g. do both the cmov thing I mentioned earlier and also an AND with it being nonempty
That branch is after it has already unconditionally loaded from the metadata. Which would deref a null pointer on an uninitialized dict.
that's what I mean about doing a cmov to have the null pointer point to &length or whatever so it can be successfully dereferenced to something we'll ignore anyway
e.g. (pretending && is not going to desugar to a condition here):
let isEmpty = len == 0;
let metadataPtr = if isEmpty { &len } else { normalMetadataPtr };
if !isEmpty && doNormalCheckOn(metadataPtr) { ... }
it's definitely possible that would still be more costly than the branch misprediction though haha
Brendan Hansknecht said:
Also, due to the sets being super small (max size 8 elements), I don't think that sets will ever be faster than lists for this example. Lots of extra overhead with hashing, allocations, and extra conditionals with very, very small data.
something I've wondered about is whether we should consider having a "small hashmap optimization" where we don't use a hashmap if the collection is small enough
and instead Dict (and therefore Set too) is backed by a flat List of tuples
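For a rough idea of the shape being discussed, a hedged sketch of a flat, list-backed dict (listDictGet and listDictInsert are hypothetical names, and both operations are linear scans, which is the tradeoff):
# under the hood this would just be a List (k, v)
listDictGet = \dict, key ->
    List.findFirst dict (\(k, _) -> k == key)
    |> Result.map (\(_, v) -> v)

listDictInsert = \dict, key, value ->
    when List.findFirstIndex dict (\(k, _) -> k == key) is
        Ok i -> List.set dict i (key, value)
        Err NotFound -> List.append dict (key, value)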
So I figured out how to remove the branch. Just use constant lists. Then it is always initialized
Instead of before where it was using list.repeat.
Also, initialization changes seem to have fixed the --optimize bug.
In optimized, this is about 6x slower than the list version, which honestly may be expected perf. Probably should write something similar in C++ with absl::flat_hash_map and vector to compare.
constant lists as in they're allocated in the readonly section of the binary?
Yeah, they should be. Though looking at the llvm, I think llvm is smart enough to make them magically disappear cause they are so small. Not fully sure
very cool!
what do you think of the "small hashmap optimization" idea?
Mostly questioning if it is a good idea for automatic application or a better idea for human optimization.
As in, we could expose both types with the same API (or make the list version a package). Explicitly choosing either one avoids perf costs related to extra conditionals. So if you know you need large, or know you need small, it makes sense to pick explicitly. That said, a merged version may avoid some worst cases like this.
I think the pitch for it is that it's common to use Set or Dict not so much for performance but rather because they enforce a guarantee and clearly communicate intent
and we could have separate types for that, but I do wonder whether the extra conditional matters in the "actually use a hash map" case when you have so many entries that you're probably getting cache misses
not so much for performance
To that I would say, use --optimize and accept that it may not be the same perf as picking a ListSet or ListDict, but perf isn't the main concern.
Also, I need to do some measurements, but likely we should just rewrite dict as a zig builtin.
That probably will close a solid chunk of the gap.
But I can't really guarantee that currently; it is mostly speculation. That said, I was unable to implement part of the absl::flat_hash_map optimization in roc because it performed worse.
Though I think we have a builtin to address part of that now, so maybe I should try again.
I also wonder if a List.appendUnique or something might be useful, like some quick way to use a list like a Set except with different performance characteristics
eh, might be clearer to get a ListSet off the shelf or something
oh, so essentially:
if !List.contains list x then
List.append list x
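As a standalone helper, that could look something like this (appendUnique is a hypothetical name; List.contains makes it a linear scan per insert):
appendUnique = \list, x ->
    if List.contains list x then
        list
    else
        List.append list x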
Just found a huge speed boost in the app...
from
|> Set.union state
to
|> \x -> Set.union state x
Made the optimized build about 2x faster
Anyway, on a branch, I am down to 3x slower on an optimized build with set vs with list.
Brendan Hansknecht said:
Just found a huge speed boost in the app...
from
|> Set.union state
to
|> \x -> Set.union state x
Made the optimized build about 2x faster
whoa! Why does that improve speed?
It is the equivalent of going from
List.concat smallList largeList
To
List.concat largeList smallList
So copying a few elements to the end of something large instead of copying a ton of elements to the end of something small.
In a very hot loop
ohhh
:thinking: does that suggest we should swap the argument order there?
oh nm, I guess this is application-specific
might be too fancy, but I wonder if it's worth picking which one we insert into based on which is bigger, assuming they're both unique
I added a conditional in the function. I think that will be worth it in general
in Set.union you mean?
yeah
cool, makes sense to me!
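For reference, the shape of that conditional might look roughly like this (a sketch of the idea, not the actual builtin implementation):
unionSketch = \a, b ->
    if Set.len a < Set.len b then
        # make sure a is the larger set
        unionSketch b a
    else
        # walk the smaller set and insert into the larger one,
        # so the hot loop copies a few elements instead of many
        Set.walk b a Set.insert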
@Elias Mulhall can you test on the set-perf branch? See if --optimize still hangs for you.
Anyway, patched up a handful of improvements in #6176
Also, I wonder if the full, performant impl of absl::flat_hash_set might actually be faster than regular lists. The main feature we are missing currently is groups. Groups make it so that a chunk of the hash can be compared for 8 indices at once. So it might do 1 hash and 1 comparison to avoid up to 8 calls to ==. That said, it still has the rest of the extra conditionals and overhead.
Now that we have popCount as a builtin, I should revisit implementing that in roc and measuring the perf gains.
Should definitely convert this or a similar example to C++ with absl::flat_hash_map to see what our real perf diff is.
that sounds fascinating! :smiley:
As an aside, about the small dict impl: the larger the key is, the worse a small dict is, because a large key means an expensive Eq. So it would be a lot faster to do more hash comparisons rather than key comparisons.
So probably ListDict is only better with both small size and small key.
yeah true, although we do know the size of that statically, so could have a heuristic involving both
Sure, if roc exposed that at compile time or you add an extra runtime check
Brendan Hansknecht said:
Elias Mulhall can you test on the set-perf branch? See if --optimize still hangs for you.
Unfortunately it still does! Anything I can do to debug further?
Oh wait, I lightly modified the code, with the original I can still repro...let me take a look at that specifically.
Ah ha, good luck!
So I'm not sure how easy this will be to make a minimized test case, but this is not a Set bug at all.
It is some sort of refcounting/uniqueness failure. The bug is specifically with the newPartNumber variable: https://gist.github.com/mulias/b07e012ced92406f59cc2548d0f01323#file-2023day03-roc-L66
It should always have multiple references which means on this line here it should incur a deep copy:
https://gist.github.com/mulias/b07e012ced92406f59cc2548d0f01323#file-2023day03-roc-L88
That is not happening. As such, we are getting these weird sets that have old data from the last iteration of the loop. Unsurprisingly, mixing old data into a new set breaks things.
A simple workaround is just to inline the variable uses:
current: { digits: [], indices: Set.empty {} },
Doing a quick scan of the mono IR, it seems that every iteration of Array2d.walk the value is decremented, but it doesn't look to ever be incremented. I assume it would get incremented through the closure capture somewhere or something like that, but I don't see it. (I definitely might just be missing it.)
actually, it looks like Array2D.walk calls List.walkWithIndex, which calls List.walkWithIndexHelp, which increments the function's refcount. That should mean this all works correctly. Maybe a bug in the actually generated increment or decrement function?
So I was able to work around the refcount craziness (it was affecting both the list and set versions of the code). Fixing that made both versions about 4x faster. So on my machine the list version is now 40ms and the set version 150ms. Both quite respectable times, and the flamegraphs now look like they are spending their time mostly doing real work and not too much in refcounts.
All of the time is now essentially spent in calculating set overlaps (again, these are small sets: neighbors is 8 elements and the digit sets are 1 to 3 elements).
So if we optimize the set overlap calculation to exit early like the list overlap calculation, we get to the point where the set version is slightly more than 2x slower than the list version, at about 90ms:
isOverlapping : Set a, Set a -> Bool
isOverlapping = \a, b ->
if Set.len b < Set.len a then
isOverlapping b a
else
Set.walkUntil a Bool.false \_, k ->
if Set.contains b k then
Break Bool.true
else
Continue Bool.false
Anyway, that is probably enough digging into the perf of all of this.
I should probably file some issues to track some of this, but I will leave that as an exercise for tomorrow-Brendan cause I am currently quite tired.
Richard Feldman said:
what do you think of the "small hashmap optimization" idea?
FWIW maps on the Beam are just arrays under the hood if size < 32
Anyway, that is probably enough digging into the perf of all of this.
I would say so, you went much deeper into this than I anticipated! Thanks for all the details, very interesting.
Haha. I enjoy digging into performance and such
Hannes Nevalainen said:
FWIW maps on the Beam are just arrays under the hood if size < 32
Do you have a link to info on this? Or source I guess.
So I was able to work around the refcount craziness (was affecting both the list and set version of the code). Fixing that made both versions about 4x faster.
@Brendan Hansknecht I'm not sure I'm following how you did this. Was this a code change or a compiler change?
Ah sorry, didn't share:
gearPartNumbers : Schematic -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic ->
partNumbers = allPartNumbers schematic
Array2D.walk schematic ([], partNumbers) { direction: Forwards } \(state, pns), elem, index ->
when elem is
Symbol '*' ->
elemNeighbors = neighbors index
neighboringPartNumbers =
List.keepIf pns \partNumber ->
isOverlapping partNumber.indices elemNeighbors
when neighboringPartNumbers is
[pn1, pn2] -> (List.append state (pn1, pn2), pns)
_ -> (state, pns)
_ -> (state, pns)
|> .0
Same can be done to the list version
I think if partNumbers is a capture of the closure, we aren't smart about refcounts and recursively increment the refcounts of everything contained in partNumbers.
If we pass it in instead, we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount. That is at least my guess.
@Ayaz Hafiz or @Folkert de Vries does that sound plausible? If so, is there an actionable bug I can file?
If we pass it in instead, we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount. That is at least my guess.
Can you elaborate what you mean by this? I don't follow
This is the original code
gearPartNumbers : Schematic, List PartNumber -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic, partNumbers ->
Array2D.walk schematic [] { direction: Forwards } \state, elem, index ->
when elem is
Symbol '*' ->
adjacentPartNumbers =
List.keepIf partNumbers \partNumber ->
anyAdjacent partNumber.indices [index]
when adjacentPartNumbers is
[pn1, pn2] -> List.append state (pn1, pn2)
_ -> state
_ -> state
This code is faster because it seems to eliminate ref counts on partNumbers
gearPartNumbers : Schematic, List PartNumber -> List (PartNumber, PartNumber)
gearPartNumbers = \schematic, partNumbers ->
Array2D.walk schematic ([], partNumbers) { direction: Forwards } \(state, pns), elem, index ->
when elem is
Symbol '*' ->
adjacentPartNumbers =
List.keepIf pns \partNumber ->
anyAdjacent partNumber.indices [index]
when adjacentPartNumbers is
[pn1, pn2] -> (List.append state (pn1, pn2), pns)
_ -> (state, pns)
_ -> (state, pns)
|> .0
we realize that we don't need to recursively increment the refcount based on how it is being used and instead just increment the outer list refcount.
So that comment turned out to be wrong. I thought we had an optimization that enables sometimes just incrementing a list's outer refcount instead of incrementing it recursively, but even if we do, it doesn't apply here.
In the original version, partNumbers is a closure capture. On every iteration of Array2D.walk, we would recursively increment the closure captures and spend a lot of time incrementing refcounts in partNumbers. On top of that, at the end of the function, we would recursively decrement the refcount. Overall, just eating tons of time with refcounting.
If we explicitly make partNumbers part of the state, this goes away. Most of the time, pns is just passed through without getting used. This means its refcount is never incremented or decremented. The only time we have to deal with refcounts of pns in the new version of the code is when the symbol is *. In that case, the refcount of pns is recursively incremented. So this is just way, way less refcount fiddling overall. It turned out not to be a fancy optimization, just hugely cutting how often we have to deal with this.
Summary: recursive refcount increments/decrements can be super brutal. I wonder if there is a better way to deal with a case like this.
Looking at the IR to confirm...
Here's an updated gist if that helps https://gist.github.com/mulias/3e7fb28d934736155028d5e85c9bb3ca
One of the versions of the function is commented out
Hmm yeah your analysis sounds right to me Brendan
It's not totally clear to me why in the first case the capture is seen as borrowed, and in the second case it isn't though
The thing is that on the surface, the capture is seen as owned, until array2d.Array2D.iterate is hit
And then we have
# SLOW
procedure : `array2d.Array2D.iterate` List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}
procedure = `array2d.Array2D.iterate` (`#Derived_gen.IdentId(86)`: {List [C , C U8, C U8], {U64, U64}}, `#Derived_gen.IdentId(87)`: {U64, U64}, `#Derived_gen.IdentId(88)`:
List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, `#Derived_gen.IdentId(89)`: {}, `#Derived_gen.IdentId(90)`: List {List U8, List {U64, U64}}):
joinpoint `array2d.Array2D.770` `array2d.Array2D.array` `array2d.Array2D.index` `array2d.Array2D.state` `array2d.Array2D.nextIndexFn` `array2d.Array2D.nextStateFn`:
inc `array2d.Array2D.array`;
let `array2d.Array2D.771` : [C {}, C [C , C U8, C U8]] = CallByName `array2d.Array2D.get` `array2d.Array2D.array` `array2d.Array2D.index`;
let `array2d.Array2D.791` : U8 = 1i64;
let `array2d.Array2D.792` : U8 = GetTagId `array2d.Array2D.771`;
let `array2d.Array2D.793` : Int1 = lowlevel Eq `array2d.Array2D.791` `array2d.Array2D.792`;
if `array2d.Array2D.793` then
inc `array2d.Array2D.array`;
let `array2d.Array2D.elem` : [C , C U8, C U8] = UnionAtIndex (Id 1) (Index 0) `array2d.Array2D.771`;
let `array2d.Array2D.788` : [C {}, C {U64, U64}] = CallByName `array2d.Array2D.decY` `array2d.Array2D.array` `array2d.Array2D.index`;
inc `array2d.Array2D.nextStateFn`; <<<<< THIS is the INC
let `array2d.Array2D.789` : [C List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, C List {{List U8, List {U64, U64}}, {List U8, List {U64, U64}}}] = CallByName `array2d.Array2D.200` `array2d.Array2D.state` `array2d.Array2D.elem` `array2d.Array2D.index` `array2d.Array2D.nextStateFn`;
# FAST
procedure : `array2d.Array2D.iterate` {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}
procedure = `array2d.Array2D.iterate` (`#Derived_gen.IdentId(48)`: {List [C , C U8, C U8], {U64, U64}}, `#Derived_gen.IdentId(49)`: {U64, U64}, `#Derived_gen.IdentId(50)`:
{List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, `#Derived_gen.IdentId(51)`: {}, `#Derived_gen.IdentId(52)`: {}):
joinpoint `array2d.Array2D.1039` `array2d.Array2D.array` `array2d.Array2D.index` `array2d.Array2D.state` `array2d.Array2D.nextIndexFn` `array2d.Array2D.nextStateFn`:
inc `array2d.Array2D.array`;
let `array2d.Array2D.1040` : [C {}, C [C , C U8, C U8]] = CallByName `array2d.Array2D.get` `array2d.Array2D.array` `array2d.Array2D.index`;
let `array2d.Array2D.1060` : U8 = 1i64;
let `array2d.Array2D.1061` : U8 = GetTagId `array2d.Array2D.1040`;
let `array2d.Array2D.1062` : Int1 = lowlevel Eq `array2d.Array2D.1060` `array2d.Array2D.1061`;
if `array2d.Array2D.1062` then
inc `array2d.Array2D.array`;
let `array2d.Array2D.elem` : [C , C U8, C U8] = UnionAtIndex (Id 1) (Index 0) `array2d.Array2D.1040`;
let `array2d.Array2D.1057` : [C {}, C {U64, U64}] = CallByName `array2d.Array2D.incX` `array2d.Array2D.array` `array2d.Array2D.index`;
let `array2d.Array2D.1058` : [C {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}, C {List {List U8, List {U64, U64}}, {List U8, List {U64, U64}}}] = CallByName `array2d.Array2D.200` `array2d.Array2D.state` `array2d.Array2D.elem` `array2d.Array2D.index` `array2d.Array2D.nextStateFn`;
In the fast example, the closure has no captures, so the inc is elided? It would just be a call to a no-op?
So I'm not really sure why we're seeing it as borrowed in one case and not the other. Maybe the fact that the capture parameter passes out specializes it? But I would expect that it's seen as irrelevant in the first (capturing) case for sure, especially because it unwraps transparently to an extra List parameter
Maybe @Folkert de Vries has an idea
Well, but it would inc on something, right? Like presumably the outer capture is passed in some way to iterate
But that seems to be missing
Although I haven't looked at the iterate implementation so I could be off
It is pretty direct and just recursive: https://github.com/mulias/roc-array2d/blob/951b2a7ae561de88bf185bb61e26244dc9f5d508/package/Array2D.roc#L894-L906
The IR looks weird to me for a different (and probably unrelated) reason -- Array2D.walk with default params should just be a wrapper for List.walkWithIndex, the iterate function shouldn't get involved unless you're walking in a different order :thinking:.
Oh wait, that may be exactly why it changed to borrowing instead of owning
Maybe in this dispatch function where there are two paths, we mess something up and switch from owning to borrowing?
Or maybe it is related to putting the closure in another closure?
\state, elem, listIndex ->
fn state elem (arrayIndexOf listIndex arrayShape)
Just guesses
Maybe we should try to trim this down to just a simple list version and see if we can still repro.
Brendan Hansknecht said:
Hannes Nevalainen said:
FWIW maps on the Beam are just arrays under the hood if size < 32
Do you have a link to info on this? Or source I guess.
It is apparently a "hash array mapped trie" for 32 entries or less. I don't really know more than that :) here are some links
https://twitter.com/akoutmos/status/1266034402422853633
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_map.h#L72-L76
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_map.c#L264-L268
https://elixirforum.com/t/big-maps-versus-small-maps-performance/31909
Thanks!
Maybe we should try to trim this down to just a simple list version and see if we can still repro.
I don't have a small repro currently, but I am pretty sure I hit this again when working on the implementation details of Dict. I have a hot loop with a .walk function in it that happens to capture a list. 45% of the hot loop execution time is spent on refcounting that really shouldn't be needed (I'll probably have to manually inline this for perf reasons). To my understanding, the core issue is that we really need to do inlining before inserting refcounts. LLVM is inlining the lambda, but it still has a capture with a refcount inc before the lambda and a refcount dec in the lambda.
I would guess a minimal repro will just be something simple like:
x = # Some large list of refcounted things, assume a list of sets
y = # Some other list to walk, let's assume it has indices into x
List.walk y 0 \count, index ->
when List.get x index is
Ok set ->
count + (Set.len set)
Err OutOfBounds ->
count
Assuming I have this right, this will just thrash the refcounts of all elements in x.
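A filled-in version of that sketch, with arbitrary placeholder data just to make it concrete (the exact values don't matter):
x = List.repeat (Set.fromList [1, 2, 3]) 1000 # large list of refcounted things
y = List.repeat 500 100000 # indices into x, all in bounds

total =
    List.walk y 0 \count, index ->
        when List.get x index is
            Ok set -> count + Set.len set
            Err OutOfBounds -> count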
I am still diving down the trail of dictionary perf improvements and I found something interesting. listGetUnsafe is supposed to borrow its input. This means that no refcount changes are required to call the function.
This does not look to be working correctly in my dictionary code. I see lots of cases like this:
inc `Dict.metadata`;
let `Dict.group` : U64 = CallByName `Dict.listGetUnsafe` `Dict.metadata` `Dict.slotIndex`;
If I follow that link through, I find:
procedure : `Dict.listGetUnsafe` {U32, {}}
procedure = `Dict.listGetUnsafe` (`#Attr.#arg1`: List {U32, {}}, `#Attr.#arg2`: U64):
let `Dict.960` : {U32, {}} = lowlevel ListGetUnsafe `#Attr.#arg1` `#Attr.#arg2`;
dec `#Attr.#arg1`;
ret `Dict.960`;
Even though the lowlevel does not require changing the refcount, we are wrapping it in a function call, and that function apparently requires changing the refcount. So this is defeating the whole point of the lowlevel not needing to change the refcount.
Any ideas how to fix this?
The same thing does not happen to List.getUnsafe:
procedure : `List.getUnsafe` U64
procedure = `List.getUnsafe` (`#Attr.#arg1`: List U64, `#Attr.#arg2`: U64):
let `List.612` : U64 = lowlevel ListGetUnsafe `#Attr.#arg1` `#Attr.#arg2`;
ret `List.612`;
Oh, I figured out the missing piece. Refcounts gone!!!
Also, I am so thankful for drop specialization...it is removing refcounts for me as well
Now I just want closures to be able to borrow instead of own captures such that they can avoid refcounting.
So ummm.....Dictionaries with refcounted keys are problematic:
Screenshot-2023-12-05-at-9.22.17PM.png
The purple is refcounting. It is 80% of the execution time. The other 20% of the execution time is splitting a gigantic string.
So ~100% of the execution time spent in dictionary related functions is spent on recursive refcount increments and decrements.
I thought morphic was supposed to enable a function like Dict.get to simply "borrow" its input, since we can statically guarantee it is never modified, only read. As such, we would elide all of this refcounting madness. Is that not the case? Is this an optimization we can add?
Brendan Hansknecht said:
Now I just want closures to be able to borrow instead of own captures such that they can avoid refcounting.
@Folkert de Vries any thoughts or info on this? I don't really know what's possible here :sweat_smile:
this is not a morphic thing at all
the refcounting approach is now to always own values, and push RC updates into the called function. That is optimal for in-place mutation, because borrowed values can never be mutated
inferring borrowing is hard, and it's harder to bound the amount of code duplication for owned/borrowed arguments
e.g. what if you have multiple arguments, they can all be owned or borrowed, this cascades
I wonder about that last part in practice :thinking:
like certainly it's possible to imagine that blowing up, but that doesn't necessarily mean it would in practice (after all, the same is true of monomorphization in general!)
you would now do it twice though
but to be clear, this is not some technical limitation, it's a tradeoff
the core problem is that you don't know at any particular point whether to own or borrow: borrowing saves RC manipulation, but disallows mutation. In the middle of a call chain you don't know which is best. Even in most functions it's tricky to know whether being able to mutate is useful.
there is another problem here which is that with borrowing, in recursive functions, you may run into big spikes in memory use, because all the memory is borrowed and only released when all of those recursive functions return
all of that is to say: you have to be careful with borrowing. If you can figure out a good heuristic then there should be no problem technically, but finding that heuristic is not trivial
I wonder how we might go about trying to figure out what a good heuristic might be :thinking:
now that we've run into concrete performance problems of the form "almost all this program does is increment and decrement refcounts, and then it does a bit of work which isn't that" it's probably worth starting to discuss things we might want to try :big_smile:
So we have builtins that borrow things because they are read-only, like List.getUnsafe. My wrong understanding was that if a function only called borrowing builtins, it would also just borrow the value. So it would kinda propagate out from those builtins.
That would fix this specific problem
it seems like for that scenario, there would be no risk of a blowup, right?
But would add the delayed freeing of memory in recursive functions like mentioned by folkert. (Though not sure how realistic of a problem this is in practice)
"if this function argument is only ever passed to things that borrow, then it too can be borrowed"
I'm more worried about the overhead of reference counting than the delaying of freed memory
I agree.
Also, I swear that at some point we talked about the idea of enabling just refcounting the outer list instead of recursively refcounting all elements in some cases. If we could do that here, it also would alleviate the issue. Though, I don't recall the context in which we thought a change like this might work.
oh yeah we should totally change how lists do RC
Do we have concrete ideas for that or is this a case where we need more research projects like Morphic?
Looping back to the idea of borrowing an input to a function if all uses in the function also borrow the input, is that something we could just append onto the type checker? Like for each value we want to know both the type and a bit flag of whether or not the value is potentially dirty. Any call in a sub branch would make an entire body dirty. Of course there would need to be some custom propagation logic at function boundaries.
Would that make sense and be something of relatively small cost to add onto the type checker? Should it go somewhere else? Is it just a bad idea overall? Any other suggestions for a solution?
I was talking to @Folkert de Vries about this the other day and he mentioned that doing the analysis on mono instead of during type checking would have various benefits
well the main benefit is that you can reason about higher-order functions
(with lambda sets, at least)
Ah, then you could analyze closure captures and do something smarter?
@Brendan Hansknecht is this something you'd be interested in getting into in terms of compiler development? (With guidance of course!)
Well, I do want dict to run fast and this will be fundamental to getting it closer to c++ speeds. So for sure.
Also, I think it will be helpful for me to understand more deeply. I have some vague performance ideas, but many I am not sure could work in roc without really expensive analysis.
that would be amazing! @Folkert de Vries are you currently the only person with domain expertise in this? Or does anyone else know how it works/how to implement things to try?
onto the type checker? Should it go somewhere else?
Doing it in the typechecker has other problems too, namely that you lose all specialization and are bound to the virality of unification. Really you want this over the SSA form.
also it slows down roc check
unnecessarily
(and therefore editor integrations)
also, I guess if we want to do this for closure captures as well, we need to check that the closure capture is essentially never saved and is just directly called, or something of that nature. Think of List.map and List.walk closures that may capture values. It would suck if List.walk capturing a refcounted value and then just using it in a borrowed manner had bad perf compared to passing in the value as part of the state.
Cause if the closure capture is saved, it actually does need to make sure the value isn't freed.
At the level of the type-specialized IR closure captures are identical to normal parameters
So I guess I'm not sure why they have to be treated specially
Wait, really. But captures can live for a long long time while regular params only live for the body of the function.
How can the captures outlive the body of the function? Only if they're returned right? But that's the same as for the case of regular parameters
After type specialization of lambda sets, all closures are unboxed and their captures are passed as an extra (unboxed) parameter
yeah, I just realized that as I was trying to write out an example. So it can be borrowed if it is only used in a borrowing way and never possibly returned from the function. With those rules, I think it is the same for captures and params
Right
A note on the cost of this refcounting. I ported some hashmap benchmarks from C++ to roc. The test where we are closest to C++ in perf (about 2x slower) is the insert/remove bench. The important part about insert and remove: because they mutate the dictionary, they just take ownership of it and never need to touch the refcount.
Still slower than I would hope, but it is a lot better than the 4x of some other cases.
I have only ported 1 benchmark so far, but I am a bit surprised. Decided to test out Go just for the fun of it. In the test of inserting 100 million keys, clearing the dict, inserting 100 million keys again, and finally deleting 100 million keys 1 by 1: golang is 4.5x slower than the current version of roc.
Go spends a heck of a lot of system time. I wonder if it is doing something weird with memory that is just eating tons of kernel time.
Definitely will have to port a few more benchmarks and compare at some point.
We also use about 2x less memory, which makes sense cause Roc cleans up memory immediately instead of lazily
I think the Lobster language infers borrowing/owning arguments or something, can't really remember. I remember the creator being asked about code blowup. Maybe relevant?
wait so what does that means in terms of comparing Go to C++ on these particular benchmarks? Go is 10x slower?
I love that all this started with me thinking that creating thousands of small sets would be a performant way to solve a problem, then complaining when that wasn't the case. Really enjoying following this thread.
I may have an addiction
Richard Feldman said:
wait so what does that means in terms of comparing Go to C++ on these particular benchmarks? Go is 10x slower?
Yep.
That said, something still feels wrong to me about that giant perf difference, need to dig in more.
Brendan Hansknecht said:
I may have an addiction
Looking into the Go hash impl, I was reminded that Go does not have custom equality or custom hash (maybe that has changed with the new generics?) and instead always uses an autoderived impl with maps. On top of that, maps don't implement hash or equality (and they have no set type).
My guess as to why go maps are slow in one picture. The key is the bottom right corner.
164cba0e-0cf4-4d49-850d-029e37a0acf0.jpg
Honestly, their design is fine, but it is not totally flat. As such, in the case where a bucket overflows, instead of placing elements in a different calculated bucket in the same array, they just allocate a new bucket on the heap. So it is kinda in between flat and traditional, because a bucket contains 8 elements and overflow should be rare.
I'm guessing the benchmark led to lots of random heap jumps.
Also, I never realized that all the generics in the Go runtime are implemented C-style. To call map.get, the runtime passes in a map that essentially has void* pointers (because it knows nothing about the data type), and a struct of function pointers that know about the data and how it works.
oh I assumed that was how they did it
monomorphization is very rare among languages with automatic memory management!
Yeah, but most languages just use runtime type info, right?
But yeah, summary of go lang hash map is kinda like absl::flat_hash_map, but less flat and with more pointer generic shuffling.
Oh and with metadata in buckets instead of stored densely.
I think that varies. I don't know for sure but I believe Haskell does it the way Go does
Interesting: Go maps, after growing, migrate to the new map in chunks over multiple calls to insert instead of doing it all at once.
So for a bit, it will be two partial maps.
Trying to make the perf of insert more consistent, at the cost of probably worse perf overall.
Hmm, sounds like Go may have a weird hash function choice too, to make it more secure against hash flooding attacks without needing a fallback. Will have to look into that. They used to; now they also just use wyhash.
Brendan Hansknecht said:
Interesting go maps after growing migrate to the new map in chunks over multiple calls to insert instead of doing it all at once.
This sounds like https://github.com/jonhoo/griddle
Yep
OK, so it turns out that Go just has horrible perf on ARM. On x86, it is actually quite fast: in between roc and C++ for this benchmark.