I wonder how much of the cost of (naive) atomic refcounting is due to the fact that branch predictors assume that a loop condition will be true the first time around, whereas if there's no contention, an atomic refcount's loop condition will actually fail the first time, leading to a branch misprediction
in other words, I wonder if that strategy would be faster by introducing an extra conditional at the front where it's assumed that the atomic operation succeeded, and then it only falls back on the loop if it failed
I do think that would help, but I don't think it would make much of a difference. I believe an old version (and slightly faster but technically wrong) version of our code did not have the loop. That said, it was not much faster. Most of the time was just in the atomic read/write and not in the on cpu conditional.
why would you have a cas loop around the refcounter? why not just do an atomic inc/dec. Or, are you talking about loops in the assembly/CPU for lowered atomic ops?
Oh wait, do we even have a loop currently, looking at the code, I don't see a loop currently.
I think our current code assumes that it is not possible to have enough references to roll over the 64bit count. Cause that would require 64bits worth of 64bit pointers to the object. Which is not possible. So we simplified our atomics to just be a check that the refcount is max (only happens for static values). Then just monotonically increment or decrement.
Also, I think we could remove a conditional in all of our refcounting.
due to limitations in memory space and number of possible references to the same value.
So currently, we can view our refcounting as a value from 0 to (2^63 - 1). When doing anything with refcounts, we first check if the value is (2^63 - 1). If so, we skip all refcounts because this is a constant static value.
If you pack a 64 bit memory space with 64bit pointers, you get 2^64 / 8 bytes. That is a maximum of 2^61 references. So we have a few bits of references that are never possible to flip. The system will run out of memory first. So what if instead of checking that a refcount is (2^63 - 1), we always just increment and decrement refcounts. For a constant static value, we will just initialize the refcount to (2^62 - 1). Given it is impossible ever increment or decrement that value more than (2 ^ 61) times, that value will always remain a valid refcount. Now refcounting no longer needs a branch to handle constant values. Instead they are handled like any other value. Just with special initialization such that they will never increment or decrement too far. That should remove a branch from all refcounting calls.
Does that sound accurate?
Thinking about this more, we don't have to even initialize constants to (2^62 - 1). We could initialize to just 2. Then when everything is decremented, there will be 1 extra refcount lying around. So we won't try freeing the constant. Then when we next load the constant, we would increment back to 2 again and start the cycle again.
So basically treat constants like normal values but with a single extra refcount so that we never free them. Given we can never increment so much that the counter overflows, that should just work.
So no special constant branch.
whoa, I never thought of that! :exploding_head:
(earlier, I was thinking we'd need a loop to enforce the lock; I had the misimpression that atomic increment required this, but apparently there are individual instructions for that!)
that's a really sweet design for constants and I can't think of a reason not to switch to it
I can think of 1 reason. Current constants can live in the read only sections. If we mess something up and accidentally write to them, the program will crash and the user will file a bug report.
With the new setup, they must be in a writable data section. If we mess something up, we will silently corrupt the constant and all uses of it afterwards. So the current form is minorly more protected from compiler bugs.
yeah that's fair, although if platform bugs are on the table, it's also always possible today that the platform author could forget to check for the special case
but either way, it seems like saving a conditional (which will sometimes be mispredicted!) from every single refcount is worth that downside
Yeah, totally agree, just noting
when talking to Matt Godbolt about this idea, I realized we'd need to store static things in the globals section of the binary instead of readonly, because if we're storing them as refcount = 2 instead of refcount = "never change this" then we wouldn't be able to tell that they're immortal anymore
but that actually seems good, because it means we can skip a check on every single refcount, and have the refcounting logic always be:
and that's it!
I realized we'd need to store static things in the globals section of the binary instead of readonly
Yeah, I mentioned that above
another thought about this: in a world where we do atomic refcounting, couldn't we have contention on our (unnecessary, but done anyway to save a conditional) atomic incr/decr on the global for a static allocation?
if so, that could in some situations end up being a higher runtime cost than a branch
Global would only have issues when growing and needing to reallocate. The rest of the time any number of accesses can happen at once. So only an issue when adding new refcounts and having no space remaining.
sorry, this is a different thread :laughing:
(I mean, I intentionally was revisiting a previous thread - this isn't the discussion we've been having earlier today!)
oops my bad
so the specific scenario I'm imagining here is that we always atomically increment or decrement the refcount without first having a branch to see if it's a static allocation...but then in some cases we end up with contention on those static allocations' refcounts, which might be more costly than the branch would have been
yes, yes we could
seems unlikely to happen often, but it's possible I guess
I think it is pretty easy to imagine a program with a global that is referenced all the time but never modified.
that would be a lot of unnecessary atomic updates to access static data
sure...I guess it would depend on how good we could get at eliding refcounts
And the global probably is data that is accessed in every thread.
actually maybe that wouldn't help, like if you wanted to put a global in a bunch of collections on different threads, that would do it
Given atomic refcounts are expensive. Maybe if we are going to do an atomic refount then it is worth adding an extra branch that the pointer location is not in the read only data section?
Or changing the generation completely back to what we currently have
We can pick two totally different schemes here. They don't have to both do the same thing
one thing I wonder: supposing we went with the "don't branch" design, and someone actually found themselves in a situation where contention on statics was a problem...what would be the best strategy for them to try to fix the performance problem? :thinking:
Do something to cause a local clone on all threads
reference that local clone
one potential idea that might be more complicated than it's worth: use the cpuid instruction to get an integer index of what core we're on, use that as an offset to some fixed location in memory (e.g. a global), and then use a branchless cmov to increment either that address (if this is a static) or else the requested one
so that would be even more instructions than a branch, but it would result in no contention for statics because they're always incrementing or decrementing an address that's unique per cpu core
and also no branch mispredictions (unlike the branchy version)
apparently cpuid takes multiple cycles though, so probably not worth it :sweat_smile:
I think maybe the way to go here is:
because it's very unclear whether this downside would actually ever come up in practice in a way that would be worth solving, and I don't think there's any other realistic way to find out whether it actually comes up besides just doing the experiment and finding out empirically
separately, I'm coming around to the idea that we might just want to always atomically refcount - unless maybe we want to have a global configuration option platforms can set which opts out of it
I'm increasingly skeptical that fancier designs would actually pay for themselves :sweat_smile:
in practice I mean
if there's a way we could do escape analysis internally to skip atomic refcounting if we know it doesn't escape to the platform, that seems reasonable (assuming we can actually pull off such an optimization), although technically the platform does make those allocations, so we'd need to make it clear that they shouldn't share arbitrary opaque roc_alloc allocations across threads, because some of them might be non-atomically refcounted
oh yeah, another consideration is that since atomic increments and decrements only take a few cycles if there's no contention, that both reduces the chance of contention in the first place and also means that the delay caused by contention is smaller
at which point it starts to become important to wonder which causes more delay: contention (to the tune of a few cycles at a time) or branching (where mispredictions also cause a delay of a few cycles at a time)
I can imagine scenarios where either one is much more impactful than the other, but the scenarios where contention is more impactful seem much less likely to come up in practice
one other thing potentially worth noting: I can imagine this not being worth it in practice, but theoretically we could generate additional specializations of functions based on their arguments needing atomic refcounts, non atomic refcounts, or no refcounting (static).
that would be more work for the compiler to track, and could substantially increase generated binary size
hm, I guess a point in favor of branching is that it reduces memory cache usage :thinking:
one branch misprediction per static might be less costly than having to load its refcount into cache
apparently Swift does a branch, and they also call it "immortal" memory https://forums.swift.org/t/pitch-core-team-publishes-results-of-performance-study-cooperative-scheduler-just-introduced-plus-compile-for-non-atomic-arc/65632/27
I would expect the branch to be far faster than the atomic refcount update (even without contention) on all but Apple M Processor.
With contention, I would expect the branch to always be faster.
This should generally be a trivial to predict branch
yeah unless there's a bunch of alternating refcounts of static and non static :sweat_smile:
so maybe we should keep the branch after all
with all that in mind
I guess we only skip the memory load for statics if we can know it's static without looking at the refcount
so for that we'd either need to look at the address of the pointer to see if it's in the readonly range, or else track refcount location separately
yeah unless there's a bunch of alternating refcounts of static and non static :sweat_smile:
I would still expect it to likely be in a predictable form when looking at a long enough global branch history (which modern process do). So I don't think alternating would likely be a problem unless you are constantly reordering the alternating pattern.
Last updated: Jun 16 2026 at 16:19 UTC