Stream: ideas

Topic: atomic refcounting and branch misprediction


view this post on Zulip Richard Feldman (Sep 25 2023 at 23:16):

I wonder how much of the cost of (naive) atomic refcounting is due to the fact that branch predictors assume that a loop condition will be true the first time around, whereas if there's no contention, an atomic refcount's loop condition will actually fail the first time, leading to a branch misprediction

view this post on Zulip Richard Feldman (Sep 25 2023 at 23:20):

in other words, I wonder if that strategy would be faster by introducing an extra conditional at the front where it's assumed that the atomic operation succeeded, and then it only falls back on the loop if it failed

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:27):

I do think that would help, but I don't think it would make much of a difference. I believe an old version (and slightly faster but technically wrong) version of our code did not have the loop. That said, it was not much faster. Most of the time was just in the atomic read/write and not in the on cpu conditional.

view this post on Zulip Ayaz Hafiz (Sep 25 2023 at 23:27):

why would you have a cas loop around the refcounter? why not just do an atomic inc/dec. Or, are you talking about loops in the assembly/CPU for lowered atomic ops?

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:30):

Oh wait, do we even have a loop currently, looking at the code, I don't see a loop currently.

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:33):

I think our current code assumes that it is not possible to have enough references to roll over the 64bit count. Cause that would require 64bits worth of 64bit pointers to the object. Which is not possible. So we simplified our atomics to just be a check that the refcount is max (only happens for static values). Then just monotonically increment or decrement.

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:35):

Also, I think we could remove a conditional in all of our refcounting.

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:35):

due to limitations in memory space and number of possible references to the same value.

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:47):

So currently, we can view our refcounting as a value from 0 to (2^63 - 1). When doing anything with refcounts, we first check if the value is (2^63 - 1). If so, we skip all refcounts because this is a constant static value.

If you pack a 64 bit memory space with 64bit pointers, you get 2^64 / 8 bytes. That is a maximum of 2^61 references. So we have a few bits of references that are never possible to flip. The system will run out of memory first. So what if instead of checking that a refcount is (2^63 - 1), we always just increment and decrement refcounts. For a constant static value, we will just initialize the refcount to (2^62 - 1). Given it is impossible ever increment or decrement that value more than (2 ^ 61) times, that value will always remain a valid refcount. Now refcounting no longer needs a branch to handle constant values. Instead they are handled like any other value. Just with special initialization such that they will never increment or decrement too far. That should remove a branch from all refcounting calls.

view this post on Zulip Brendan Hansknecht (Sep 25 2023 at 23:48):

Does that sound accurate?

view this post on Zulip Brendan Hansknecht (Sep 26 2023 at 00:09):

Thinking about this more, we don't have to even initialize constants to (2^62 - 1). We could initialize to just 2. Then when everything is decremented, there will be 1 extra refcount lying around. So we won't try freeing the constant. Then when we next load the constant, we would increment back to 2 again and start the cycle again.

view this post on Zulip Brendan Hansknecht (Sep 26 2023 at 00:10):

So basically treat constants like normal values but with a single extra refcount so that we never free them. Given we can never increment so much that the counter overflows, that should just work.

view this post on Zulip Brendan Hansknecht (Sep 26 2023 at 00:10):

So no special constant branch.

view this post on Zulip Richard Feldman (Sep 26 2023 at 01:54):

whoa, I never thought of that! :exploding_head:

view this post on Zulip Richard Feldman (Sep 26 2023 at 01:54):

(earlier, I was thinking we'd need a loop to enforce the lock; I had the misimpression that atomic increment required this, but apparently there are individual instructions for that!)

view this post on Zulip Richard Feldman (Sep 26 2023 at 01:57):

that's a really sweet design for constants and I can't think of a reason not to switch to it

view this post on Zulip Brendan Hansknecht (Sep 26 2023 at 02:14):

I can think of 1 reason. Current constants can live in the read only sections. If we mess something up and accidentally write to them, the program will crash and the user will file a bug report.

With the new setup, they must be in a writable data section. If we mess something up, we will silently corrupt the constant and all uses of it afterwards. So the current form is minorly more protected from compiler bugs.

view this post on Zulip Richard Feldman (Sep 26 2023 at 02:46):

yeah that's fair, although if platform bugs are on the table, it's also always possible today that the platform author could forget to check for the special case

view this post on Zulip Richard Feldman (Sep 26 2023 at 02:47):

but either way, it seems like saving a conditional (which will sometimes be mispredicted!) from every single refcount is worth that downside

view this post on Zulip Brendan Hansknecht (Sep 26 2023 at 03:20):

Yeah, totally agree, just noting

view this post on Zulip Richard Feldman (Sep 29 2023 at 17:58):

when talking to Matt Godbolt about this idea, I realized we'd need to store static things in the globals section of the binary instead of readonly, because if we're storing them as refcount = 2 instead of refcount = "never change this" then we wouldn't be able to tell that they're immortal anymore

view this post on Zulip Richard Feldman (Sep 29 2023 at 17:59):

but that actually seems good, because it means we can skip a check on every single refcount, and have the refcounting logic always be:

and that's it!

view this post on Zulip Brendan Hansknecht (Sep 29 2023 at 18:10):

I realized we'd need to store static things in the globals section of the binary instead of readonly

Yeah, I mentioned that above

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:32):

another thought about this: in a world where we do atomic refcounting, couldn't we have contention on our (unnecessary, but done anyway to save a conditional) atomic incr/decr on the global for a static allocation?

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:34):

if so, that could in some situations end up being a higher runtime cost than a branch

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:34):

Global would only have issues when growing and needing to reallocate. The rest of the time any number of accesses can happen at once. So only an issue when adding new refcounts and having no space remaining.

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:35):

sorry, this is a different thread :laughing:

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:35):

(I mean, I intentionally was revisiting a previous thread - this isn't the discussion we've been having earlier today!)

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:36):

oops my bad

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:37):

so the specific scenario I'm imagining here is that we always atomically increment or decrement the refcount without first having a branch to see if it's a static allocation...but then in some cases we end up with contention on those static allocations' refcounts, which might be more costly than the branch would have been

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:37):

yes, yes we could

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:37):

seems unlikely to happen often, but it's possible I guess

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:38):

I think it is pretty easy to imagine a program with a global that is referenced all the time but never modified.

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:38):

that would be a lot of unnecessary atomic updates to access static data

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:39):

sure...I guess it would depend on how good we could get at eliding refcounts

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:39):

And the global probably is data that is accessed in every thread.

view this post on Zulip Richard Feldman (Nov 04 2023 at 23:40):

actually maybe that wouldn't help, like if you wanted to put a global in a bunch of collections on different threads, that would do it

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:44):

Given atomic refcounts are expensive. Maybe if we are going to do an atomic refount then it is worth adding an extra branch that the pointer location is not in the read only data section?

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:45):

Or changing the generation completely back to what we currently have

view this post on Zulip Brendan Hansknecht (Nov 04 2023 at 23:46):

We can pick two totally different schemes here. They don't have to both do the same thing

view this post on Zulip Richard Feldman (Nov 05 2023 at 00:03):

one thing I wonder: supposing we went with the "don't branch" design, and someone actually found themselves in a situation where contention on statics was a problem...what would be the best strategy for them to try to fix the performance problem? :thinking:

view this post on Zulip Brendan Hansknecht (Nov 05 2023 at 00:04):

Do something to cause a local clone on all threads

view this post on Zulip Brendan Hansknecht (Nov 05 2023 at 00:04):

reference that local clone

view this post on Zulip Richard Feldman (Nov 05 2023 at 00:22):

one potential idea that might be more complicated than it's worth: use the cpuid instruction to get an integer index of what core we're on, use that as an offset to some fixed location in memory (e.g. a global), and then use a branchless cmov to increment either that address (if this is a static) or else the requested one

view this post on Zulip Richard Feldman (Nov 05 2023 at 00:23):

so that would be even more instructions than a branch, but it would result in no contention for statics because they're always incrementing or decrementing an address that's unique per cpu core

view this post on Zulip Richard Feldman (Nov 05 2023 at 00:23):

and also no branch mispredictions (unlike the branchy version)

view this post on Zulip Richard Feldman (Nov 05 2023 at 00:26):

apparently cpuid takes multiple cycles though, so probably not worth it :sweat_smile:

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:20):

I think maybe the way to go here is:

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:22):

because it's very unclear whether this downside would actually ever come up in practice in a way that would be worth solving, and I don't think there's any other realistic way to find out whether it actually comes up besides just doing the experiment and finding out empirically

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:24):

separately, I'm coming around to the idea that we might just want to always atomically refcount - unless maybe we want to have a global configuration option platforms can set which opts out of it

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:33):

I'm increasingly skeptical that fancier designs would actually pay for themselves :sweat_smile:

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:33):

in practice I mean

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:38):

if there's a way we could do escape analysis internally to skip atomic refcounting if we know it doesn't escape to the platform, that seems reasonable (assuming we can actually pull off such an optimization), although technically the platform does make those allocations, so we'd need to make it clear that they shouldn't share arbitrary opaque roc_alloc allocations across threads, because some of them might be non-atomically refcounted

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:50):

oh yeah, another consideration is that since atomic increments and decrements only take a few cycles if there's no contention, that both reduces the chance of contention in the first place and also means that the delay caused by contention is smaller

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:51):

at which point it starts to become important to wonder which causes more delay: contention (to the tune of a few cycles at a time) or branching (where mispredictions also cause a delay of a few cycles at a time)

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:53):

I can imagine scenarios where either one is much more impactful than the other, but the scenarios where contention is more impactful seem much less likely to come up in practice

view this post on Zulip Richard Feldman (Nov 05 2023 at 12:58):

one other thing potentially worth noting: I can imagine this not being worth it in practice, but theoretically we could generate additional specializations of functions based on their arguments needing atomic refcounts, non atomic refcounts, or no refcounting (static).

that would be more work for the compiler to track, and could substantially increase generated binary size

view this post on Zulip Richard Feldman (Nov 05 2023 at 14:54):

hm, I guess a point in favor of branching is that it reduces memory cache usage :thinking:

view this post on Zulip Richard Feldman (Nov 05 2023 at 14:56):

one branch misprediction per static might be less costly than having to load its refcount into cache

view this post on Zulip Richard Feldman (Nov 05 2023 at 14:57):

apparently Swift does a branch, and they also call it "immortal" memory https://forums.swift.org/t/pitch-core-team-publishes-results-of-performance-study-cooperative-scheduler-just-introduced-plus-compile-for-non-atomic-arc/65632/27

view this post on Zulip Brendan Hansknecht (Nov 05 2023 at 15:46):

I would expect the branch to be far faster than the atomic refcount update (even without contention) on all but Apple M Processor.

view this post on Zulip Brendan Hansknecht (Nov 05 2023 at 15:47):

With contention, I would expect the branch to always be faster.

view this post on Zulip Brendan Hansknecht (Nov 05 2023 at 15:48):

This should generally be a trivial to predict branch

view this post on Zulip Richard Feldman (Nov 05 2023 at 16:14):

yeah unless there's a bunch of alternating refcounts of static and non static :sweat_smile:

view this post on Zulip Richard Feldman (Nov 05 2023 at 16:14):

so maybe we should keep the branch after all

view this post on Zulip Richard Feldman (Nov 05 2023 at 16:15):

with all that in mind

view this post on Zulip Richard Feldman (Nov 05 2023 at 16:43):

I guess we only skip the memory load for statics if we can know it's static without looking at the refcount

view this post on Zulip Richard Feldman (Nov 05 2023 at 16:44):

so for that we'd either need to look at the address of the pointer to see if it's in the readonly range, or else track refcount location separately

view this post on Zulip Brendan Hansknecht (Nov 05 2023 at 16:46):

yeah unless there's a bunch of alternating refcounts of static and non static :sweat_smile:

I would still expect it to likely be in a predictable form when looking at a long enough global branch history (which modern process do). So I don't think alternating would likely be a problem unless you are constantly reordering the alternating pattern.


Last updated: Jun 16 2026 at 16:19 UTC