data oriented reference counting · ideas

Stream: ideas

Topic: data oriented reference counting

Richard Feldman (Nov 04 2023 at 20:07):

I had an idea, might be bad but might also be interesting to explore someday:

suppose the main cost of reference counting is just that as we pass around heap-allocated values, we have to keep loading their reference counts into cache just to increment or decrement them, and otherwise they wouldn't be in cache. In other words, suppose the main cost of reference counting in practice is increased cache misses.

Richard Feldman (Nov 04 2023 at 20:07):

if that's true, then it's conceivable there could be a benefit to storing all the refcounts densely in one section of memory, at the cost of each heap-allocated data structure needing to keep a separate pointer on the stack to its refcount (distinct from its pointer to its heap-allocated memory)

Richard Feldman (Nov 04 2023 at 20:07):

so overall memory usage would increase, but cache misses could go down

Richard Feldman (Nov 04 2023 at 20:09):

one reason this might not work out in practice: there would presumably be some refcount area fragmentation over time, as some refcount slots open up but others stay used

Richard Feldman (Nov 04 2023 at 20:09):

so the benefit compared to status quo would decrease over time

Richard Feldman (Nov 04 2023 at 20:11):

it could conceivably be compacted using https://youtu.be/c1UBJbfR-H0?si=YE_vqIHZaK-QGzqY but I'm not sure how much that would help

Brendan Hansknecht (Nov 04 2023 at 21:13):

That would be really interesting to mess with. Though you would need a free-list as well to enable reuse of refcount slots as items are freed.

Brendan Hansknecht (Nov 04 2023 at 21:13):

So extra overhead from that as well.

Richard Feldman (Nov 04 2023 at 21:14):

true, although I assume the free list would be cheap to manage in practice

Brendan Hansknecht (Nov 04 2023 at 21:19):

I agree. Especially given everything is the same size. No complicated fitting algorithm, just get the first open slot

Brendan Hansknecht (Nov 04 2023 at 21:22):

Also, technically could store an index instead of a pointer for the refcount. That could be 32 bit if a limit of 4 billion refcounted objects is fine (4 billion refcounted objects is at least 32 GiB of ram to store).

Brendan Hansknecht (Nov 04 2023 at 21:22):

That is the size of just refcounts and no data. So feels pretty safe overall

Richard Feldman (Nov 04 2023 at 21:23):

agreed :thumbs_up:

Richard Feldman (Nov 04 2023 at 21:24):

so that would mean List and Str would take up 28B on the stack instead of 24B

Richard Feldman (Nov 04 2023 at 21:24):

although they'd still have an alignment of 8, so often would be 32B in practice

Brendan Hansknecht (Nov 04 2023 at 21:24):

Yeah, so maybe no gains there.

Richard Feldman (Nov 04 2023 at 21:25):

could be relevant in a future where get fancier with packing our records

Richard Feldman (Nov 04 2023 at 21:25):

like instead of sorting fields by alignment and then alphabetically, if we tracked where alignment padding gaps would be and then filled those in opportunistically

Richard Feldman (Nov 04 2023 at 21:29):

actually, if we wanted to keep it at 24B, we could save a bunch of bytes by storing capacity as an exponent of 2, e.g. if the capacity is 4 we store 2, if it's 8 we store 3, etc.

Richard Feldman (Nov 04 2023 at 21:29):

so then capacity would only need 1 byte

Richard Feldman (Nov 04 2023 at 21:29):

we'd have to do an extra bit shift when comparing length to it, but that doesn't seem like a big deal

Richard Feldman (Nov 04 2023 at 21:31):

actually, capacity would only need 6 bits to store up to 64, so if we were okay only using 26 bits (so, up to 67 million refcounts) then we could fit the whole thing in 20B :big_smile:

Brendan Hansknecht (Nov 04 2023 at 21:31):

Capacity isn't a power of 2 though. Grows by 1.5 if I recall correctly

Brendan Hansknecht (Nov 04 2023 at 21:32):

So store as power of 1.5?

Richard Feldman (Nov 04 2023 at 21:32):

sure haha

Richard Feldman (Nov 04 2023 at 21:32):

or could just have it double instead of 1.5x

Richard Feldman (Nov 04 2023 at 21:32):

so it's a cheap bit shift

Brendan Hansknecht (Nov 04 2023 at 21:32):

doubling is bad cause you can never reuse allocations

Richard Feldman (Nov 04 2023 at 21:32):

hm, really? Why's that? :thinking:

Brendan Hansknecht (Nov 04 2023 at 21:33):

https://github.com/facebook/folly/blob/main/folly/docs/FBVector.md#memory-handling

Richard Feldman (Nov 04 2023 at 22:02):

hm, I don't follow the logic in this document:

When there's a request to grow the vector, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk of memory next to its current chunk, copy its existing data to the new chunk, and then deallocate the old chunk

this makes sense to me :thumbs_up:

This means that any new chunk requested will be larger than all previously used chunks combined, so the vector must crawl forward in memory; it can't move back to its deallocated chunks.

this seems to be talking about the situation where:

we're ineligible for in-place resizing because there's some in-use memory in the way
because of this, we have to make a new allocation somewhere that is not immediately adjacent to the allocation we have (because otherwise we would have just resized), and then of course deallocate our old allocation after copying it over, because it's no longer in use

ordinarily, this means we can never reuse our previously deallocated chunks, because if there had been enough contiguous room to fit our allocation, then we would have resized in the first place instead of doing the more expensive allocate-copy-deallocate

so it seems like it's talking about the specific scenario where:

we can't resize in place because there's another allocation in the way
later on, in a future growing operation, we discover by happy coincidence that the allocation that was previously in the way has been deallocated is no longer in the way, so we can now write into the previously deallocated chunk
so now we can write over that old chunk, which is better for cache, but somehow if our load factor had been 2 instead of 1.5 this wouldn't be true?

that's the part I'm not understanding (unless I've misunderstood something along the way) - basically it seems to be saying that "we can't resize in place because we don't have enough contiguous memory to do that, but then assuming each new allocation is right after the old one - which would have required that we had all that memory available contiguously, even though we already established that we weren't in that scenario - then 1.5 is better" :face_with_raised_eyebrow:

Richard Feldman (Nov 04 2023 at 22:03):

what am I missing? :big_smile:

Brendan Hansknecht (Nov 04 2023 at 22:14):

I think the simple way to look at this If we are growing by 2x, a larger allocation will never fit in the sum of many smaller allocations. If we grow by 1.5x, a larger allocation can fit in the sum of smaller allocations. As such, instead of going to new space for the allocation, there are more chances that we will fill in chunks of memory from the free list.

Richard Feldman (Nov 04 2023 at 22:18):

If we are growing by 2x, a larger allocation will never fit in the sum of many smaller allocations

importantly, it's many smaller allocations of the same vector, right?

(I mean, we could happen to have a bunch of allocations in a row of unrelated data that all get freed in time to be used for a 2x growth!)

Richard Feldman (Nov 04 2023 at 22:28):

and if we're only considering the same vector, what I'm not understanding is the scenario where this would happen:

we grow a Vec, and we can't resize it in place because there isn't enough contiugous memory available
therefore, we have to allocate a new chunk, copy over the elements, and deallocate the old chunk
later on, we can somehow reuse some of the memory freed during one of these "deallocate the old chunk" operations, because there is now enough contiguous memory available beginning at that address, even there wasn't back when we had to deallocate it, and now we require at least 1.5x more contiguous memory than we required back when there wasn't enough at that address

Richard Feldman (Nov 04 2023 at 22:29):

it seems like a scenario that could plausibly happen, but if so, then whether or not we can use the free list seems to depend mostly on the allocations that were "in the way" that prevented resizing in the first place, how big they are, whether they're deallocated before we want to attempt to use that space, etc.

Richard Feldman (Nov 04 2023 at 22:30):

maybe another way of expression my confusion is: "in the scenario where 1.5x is better, wouldn't we have been able to resize in the first place and not need to make new allocations? And if not, wouldn't whether we could use the free list depend mainly on other allocations rather than the sizes of our own past allocations?"

Brendan Hansknecht (Nov 04 2023 at 22:34):

maybe a very important note from the doc:

This comes from the now notorious design of realloc() to opaquely perform either in-place reallocation or an allocate-memcpy-deallocate cycle. Such lack of control subsequently forced all clib-based allocator designs to avoid in-place reallocation

Richard Feldman (Nov 04 2023 at 22:36):

hm maybe that explains it, but of course that doesn't apply to us because roc_realloc can call the allocator's realloc directly :big_smile:

Brendan Hansknecht (Nov 04 2023 at 22:38):

Yeah, I guess I'm not fully sure. Would need to do some more testing.

Richard Feldman (Nov 05 2023 at 22:06):

if the "store capacity in 1 byte as a power of 2" idea works, one interesting observation is that we can store it inline with length

Richard Feldman (Nov 05 2023 at 22:06):

at least on 64-bit targets

Richard Feldman (Nov 05 2023 at 22:06):

so we'd have 7 bytes for length and then 1 byte for capacity

Richard Feldman (Nov 05 2023 at 22:08):

that would mean max length is 2^56 instead of 2^63, which means a single List U8 can store a maximum of 72 petabytes, which is a cap I'm comfortable with :stuck_out_tongue:

Richard Feldman (Nov 05 2023 at 22:09):

at which point we could use either 4 or even an entire 8 bytes for the refcount pointer

Richard Feldman (Nov 05 2023 at 22:09):

and still have lists/strings take up 20 or 24 bytes on the stack

Brendan Hansknecht (Nov 05 2023 at 22:22):

I think power of 2 capacity would have unwanted side effects

Brendan Hansknecht (Nov 05 2023 at 22:23):

The most basic example is constants using tons of extra space

Richard Feldman (Nov 05 2023 at 22:23):

as in static things?

Brendan Hansknecht (Nov 05 2023 at 22:23):

Or a file loaded from disk

Brendan Hansknecht (Nov 05 2023 at 22:23):

Would suck if a 1.1GB file had to use 2GB of memory

Richard Feldman (Nov 05 2023 at 22:24):

fair

Kevin Gillette (Jan 01 2024 at 18:16):

I'm a little late to this, but we don't necessarily need a fixed growth factor. Go's slice growth behavior was tuned following empirical analysis at Google over quite a lot of code (both internal and external), and they settled on something that power-of-two doubles for trivially small slices, followed by tapering off to 25% growth. Demonstrated at https://go.dev/play/p/oyVrvrfPz7Y

The doubling doesn't matter at the low lengths because the slice is still typically on the order of kilobytes. The doubling issue, afaict, is only relevant to contiguous data that uses a significant fraction of available memory. I also wonder how much that issue applies to 64-bit address spaces, where, compared to the older computing days, you always have a lot more address space than you have physical space.

Granted, on a 32-bit system, it's plausible that you might need to reuse some of the virtual address space that this same semantic list formerly occupied, which large doubling would prevent. That seems to be what that anti-doubling article was getting at.

With a variable growth factor (even a simpler curve, such as doubling followed by 25% growth), that would cease to be an issue, and we'd get the benefit of better growth efficiency (fewer allocations) when lists are small-ish.

Brendan Hansknecht (Jan 01 2024 at 18:18):

This isn't about address space, it is about fragmentation of the physical ram

Brendan Hansknecht (Jan 01 2024 at 18:20):

Also, our currently growth pattern is just following the facebook c++ stardard growth pattern, which I would say is pretty equally battle tested to anything at Google.

Kevin Gillette (Jan 01 2024 at 18:20):

I'd need help understanding then. Do you mean a bunch of small allocations that stay live, pinning a bunch of ram?

Brendan Hansknecht (Jan 01 2024 at 18:24):

Some ram will be pinned for currently live lists, but it is more that the gaps left behind are less likely to be reusable.

Kevin Gillette (Jan 01 2024 at 18:25):

The article made it sound like it was about a single vector growing, in that it could never reuse its own previously allocated space. Physical fragmentation would make me think it's about systemic over-allocation and the inability to compact, like in https://youtu.be/c1UBJbfR-H0?si=zyyoneKI6E5nKU0s

Brendan Hansknecht (Jan 01 2024 at 18:26):

Yes, the article gives the simplest example of a single vector never being able to reuse space.

Brendan Hansknecht (Jan 01 2024 at 18:27):

But it also means that anything sized based on that list or grown to fill the same amount of space would be unable to reuse that space either.

Brendan Hansknecht (Jan 01 2024 at 18:32):

As an extra note, in roc, we are much more likely to hit the copy case due to any use of a list where it is not unique needing to copy. So being able to fill old space from growth may be more valuable in roc.

Kevin Gillette (Jan 01 2024 at 18:33):

That would only happen lists above a certain size, right? If you ask for 128 bytes for your list, an allocator typically wouldn't care that it's a list vs a record, and just give you a slot out of a 128 byte size class pool, which would be easy to reuse again once freed.

In contrast, once you're using multiple whole memory pages for this list, it again shouldn't matter too much, because once freed, the allocator can use those 2k pages for whatever it wants, and because of memory transitions, it also doesn't really matter from a space management perspective if there's a 2k hole in the middle of a 2m span of physical ram: the 2k hole can be mapped into a contiguous virtual range, transparently piecemeal, for the application use for its list or whatever.

Brendan Hansknecht (Jan 01 2024 at 18:34):

ah yeah, that's true

Brendan Hansknecht (Jan 01 2024 at 18:34):

Probably an important detail of the FB list growth strategy is that is only applies to medium sized lists

Kevin Gillette (Jan 01 2024 at 18:34):

once you get to 2k or above, the issue is really just utilization of those 2k chunks, iiuc

Brendan Hansknecht (Jan 01 2024 at 18:35):

Small size lists generally are in allocation classes where doubling is how the allocation class works. Medium sized lists hit this issue. Large lists get the benefit of virtual addresses and consuming entire pages of memory.

Brendan Hansknecht (Jan 01 2024 at 18:36):

So the 1.5 growth is only for medium sized lists.

Kevin Gillette (Jan 01 2024 at 18:36):

this was all a huge issue before MMUs and the TLB to be sure, especially on and before 32-bit architectures (which we still need to worry about), but it doesn't seem... dire

Kevin Gillette (Jan 01 2024 at 18:38):

Ah, that makes sense. In any case that's a case of dynamic growth factor, which it sounds like we're saying is a win, and that doubling at large sizes is pretty well agreed to be a bad idea these days

Brendan Hansknecht (Jan 01 2024 at 18:38):

I don't think anyone said it was dire. Just a large enough optimization that it effects gigantic fleets of computers like at facebook or google. For anyone else it is a relativily minor optimization.

Brendan Hansknecht (Jan 01 2024 at 18:41):

Also, doubling at large sizes should be just fine. Lets say you have a vector that uses 32 physical pages and tries to grow. You then claim 64 virtual pages. After copying over the data, you have claimed 32 physical pages as well. The other 32 pages should just remain virtual until they are written to. So besides claiming some virtual address space, there is little cost to doubling large lists.

Kevin Gillette (Jan 01 2024 at 18:44):

side note: even doubling at a large size shouldn't be too much of an issue on a 64-bit system with a semi-modern OS: you're not actually using ram until you attempt to write to it, in which case a soft-fault causes a jit mapping of, typically 2k of ram. Same happens on a decent OS/FS for any kind of memory mapping of a file (or seeking 2gb forward and writing a byte): stuff you haven't touched will not cost space.

So in practice, that 1.1GiB file (or mem buffer) out of a 2GB requested allocation almost certainly is only costing 1.1GiB rounded up to the next page or block size.

Kevin Gillette (Jan 01 2024 at 18:45):

Heh. It sounds like we're aligned. I need to stop typing on my phone :wink:

Brendan Hansknecht (Jan 01 2024 at 18:45):

Ah yeah, I guess I wasn't thinking about this when I commented that before.

Brendan Hansknecht (Jan 01 2024 at 18:46):

Modern hardware definitely has a lot of magic that can really solve certain problems.

Kevin Gillette (Jan 01 2024 at 18:47):

I wouldn't be surprised if FB is doing some select work on bare metal without virtual mem, in which case the choice of vector growth factors become a bit more critical again.

Last updated: Jul 23 2026 at 13:15 UTC