I had an idea, might be bad but might also be interesting to explore someday:
suppose the main cost of reference counting is just that as we pass around heap-allocated values, we have to keep loading their reference counts into cache just to increment or decrement them, and otherwise they wouldn't be in cache. In other words, suppose the main cost of reference counting in practice is increased cache misses.
if that's true, then it's conceivable there could be a benefit to storing all the refcounts densely in one section of memory, at the cost of each heap-allocated data structure needing to keep a separate pointer on the stack to its refcount (distinct from its pointer to its heap-allocated memory)
so overall memory usage would increase, but cache misses could go down
one reason this might not work out in practice: there would presumably be some refcount area fragmentation over time, as some refcount slots open up but others stay used
so the benefit compared to status quo would decrease over time
it could conceivably be compacted using https://youtu.be/c1UBJbfR-H0?si=YE_vqIHZaK-QGzqY but I'm not sure how much that would help
That would be really interesting to mess with. Though you would need a free-list as well to enable reuse of refcount slots as items are freed.
So extra overhead from that as well.
true, although I assume the free list would be cheap to manage in practice
I agree. Especially given everything is the same size. No complicated fitting algorithm, just get the first open slot
Also, technically could store an index instead of a pointer for the refcount. That could be 32 bit if a limit of 4 billion refcounted objects is fine (4 billion refcounted objects is at least 32 GiB of ram to store).
That is the size of just refcounts and no data. So feels pretty safe overall
agreed :thumbs_up:
so that would mean List and Str would take up 28B on the stack instead of 24B
although they'd still have an alignment of 8, so often would be 32B in practice
Yeah, so maybe no gains there.
could be relevant in a future where get fancier with packing our records
like instead of sorting fields by alignment and then alphabetically, if we tracked where alignment padding gaps would be and then filled those in opportunistically
actually, if we wanted to keep it at 24B, we could save a bunch of bytes by storing capacity as an exponent of 2, e.g. if the capacity is 4 we store 2, if it's 8 we store 3, etc.
so then capacity would only need 1 byte
we'd have to do an extra bit shift when comparing length to it, but that doesn't seem like a big deal
actually, capacity would only need 6 bits to store up to 64, so if we were okay only using 26 bits (so, up to 67 million refcounts) then we could fit the whole thing in 20B :big_smile:
Capacity isn't a power of 2 though. Grows by 1.5 if I recall correctly
So store as power of 1.5?
sure haha
or could just have it double instead of 1.5x
so it's a cheap bit shift
doubling is bad cause you can never reuse allocations
hm, really? Why's that? :thinking:
https://github.com/facebook/folly/blob/main/folly/docs/FBVector.md#memory-handling
hm, I don't follow the logic in this document:
When there's a request to grow the vector, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk of memory next to its current chunk, copy its existing data to the new chunk, and then deallocate the old chunk
this makes sense to me :thumbs_up:
This means that any new chunk requested will be larger than all previously used chunks combined, so the vector must crawl forward in memory; it can't move back to its deallocated chunks.
this seems to be talking about the situation where:
ordinarily, this means we can never reuse our previously deallocated chunks, because if there had been enough contiguous room to fit our allocation, then we would have resized in the first place instead of doing the more expensive allocate-copy-deallocate
so it seems like it's talking about the specific scenario where:
that's the part I'm not understanding (unless I've misunderstood something along the way) - basically it seems to be saying that "we can't resize in place because we don't have enough contiguous memory to do that, but then assuming each new allocation is right after the old one - which would have required that we had all that memory available contiguously, even though we already established that we weren't in that scenario - then 1.5 is better" :face_with_raised_eyebrow:
what am I missing? :big_smile:
I think the simple way to look at this If we are growing by 2x, a larger allocation will never fit in the sum of many smaller allocations. If we grow by 1.5x, a larger allocation can fit in the sum of smaller allocations. As such, instead of going to new space for the allocation, there are more chances that we will fill in chunks of memory from the free list.
If we are growing by 2x, a larger allocation will never fit in the sum of many smaller allocations
importantly, it's many smaller allocations of the same vector, right?
(I mean, we could happen to have a bunch of allocations in a row of unrelated data that all get freed in time to be used for a 2x growth!)
and if we're only considering the same vector, what I'm not understanding is the scenario where this would happen:
Vec, and we can't resize it in place because there isn't enough contiugous memory availableit seems like a scenario that could plausibly happen, but if so, then whether or not we can use the free list seems to depend mostly on the allocations that were "in the way" that prevented resizing in the first place, how big they are, whether they're deallocated before we want to attempt to use that space, etc.
maybe another way of expression my confusion is: "in the scenario where 1.5x is better, wouldn't we have been able to resize in the first place and not need to make new allocations? And if not, wouldn't whether we could use the free list depend mainly on other allocations rather than the sizes of our own past allocations?"
maybe a very important note from the doc:
This comes from the now notorious design of realloc() to opaquely perform either in-place reallocation or an allocate-memcpy-deallocate cycle. Such lack of control subsequently forced all clib-based allocator designs to avoid in-place reallocation
hm maybe that explains it, but of course that doesn't apply to us because roc_realloc can call the allocator's realloc directly :big_smile:
Yeah, I guess I'm not fully sure. Would need to do some more testing.
if the "store capacity in 1 byte as a power of 2" idea works, one interesting observation is that we can store it inline with length
at least on 64-bit targets
so we'd have 7 bytes for length and then 1 byte for capacity
that would mean max length is 2^56 instead of 2^63, which means a single List U8 can store a maximum of 72 petabytes, which is a cap I'm comfortable with :stuck_out_tongue:
at which point we could use either 4 or even an entire 8 bytes for the refcount pointer
and still have lists/strings take up 20 or 24 bytes on the stack
I think power of 2 capacity would have unwanted side effects
The most basic example is constants using tons of extra space
as in static things?
Or a file loaded from disk
Would suck if a 1.1GB file had to use 2GB of memory
fair
I'm a little late to this, but we don't necessarily need a fixed growth factor. Go's slice growth behavior was tuned following empirical analysis at Google over quite a lot of code (both internal and external), and they settled on something that power-of-two doubles for trivially small slices, followed by tapering off to 25% growth. Demonstrated at https://go.dev/play/p/oyVrvrfPz7Y
The doubling doesn't matter at the low lengths because the slice is still typically on the order of kilobytes. The doubling issue, afaict, is only relevant to contiguous data that uses a significant fraction of available memory. I also wonder how much that issue applies to 64-bit address spaces, where, compared to the older computing days, you always have a lot more address space than you have physical space.
Granted, on a 32-bit system, it's plausible that you might need to reuse some of the virtual address space that this same semantic list formerly occupied, which large doubling would prevent. That seems to be what that anti-doubling article was getting at.
With a variable growth factor (even a simpler curve, such as doubling followed by 25% growth), that would cease to be an issue, and we'd get the benefit of better growth efficiency (fewer allocations) when lists are small-ish.
This isn't about address space, it is about fragmentation of the physical ram
Also, our currently growth pattern is just following the facebook c++ stardard growth pattern, which I would say is pretty equally battle tested to anything at Google.
I'd need help understanding then. Do you mean a bunch of small allocations that stay live, pinning a bunch of ram?
Some ram will be pinned for currently live lists, but it is more that the gaps left behind are less likely to be reusable.
The article made it sound like it was about a single vector growing, in that it could never reuse its own previously allocated space. Physical fragmentation would make me think it's about systemic over-allocation and the inability to compact, like in https://youtu.be/c1UBJbfR-H0?si=zyyoneKI6E5nKU0s
Yes, the article gives the simplest example of a single vector never being able to reuse space.
But it also means that anything sized based on that list or grown to fill the same amount of space would be unable to reuse that space either.
As an extra note, in roc, we are much more likely to hit the copy case due to any use of a list where it is not unique needing to copy. So being able to fill old space from growth may be more valuable in roc.
That would only happen lists above a certain size, right? If you ask for 128 bytes for your list, an allocator typically wouldn't care that it's a list vs a record, and just give you a slot out of a 128 byte size class pool, which would be easy to reuse again once freed.
In contrast, once you're using multiple whole memory pages for this list, it again shouldn't matter too much, because once freed, the allocator can use those 2k pages for whatever it wants, and because of memory transitions, it also doesn't really matter from a space management perspective if there's a 2k hole in the middle of a 2m span of physical ram: the 2k hole can be mapped into a contiguous virtual range, transparently piecemeal, for the application use for its list or whatever.
ah yeah, that's true
Probably an important detail of the FB list growth strategy is that is only applies to medium sized lists
once you get to 2k or above, the issue is really just utilization of those 2k chunks, iiuc
Small size lists generally are in allocation classes where doubling is how the allocation class works. Medium sized lists hit this issue. Large lists get the benefit of virtual addresses and consuming entire pages of memory.
So the 1.5 growth is only for medium sized lists.
this was all a huge issue before MMUs and the TLB to be sure, especially on and before 32-bit architectures (which we still need to worry about), but it doesn't seem... dire
Ah, that makes sense. In any case that's a case of dynamic growth factor, which it sounds like we're saying is a win, and that doubling at large sizes is pretty well agreed to be a bad idea these days
I don't think anyone said it was dire. Just a large enough optimization that it effects gigantic fleets of computers like at facebook or google. For anyone else it is a relativily minor optimization.
Also, doubling at large sizes should be just fine. Lets say you have a vector that uses 32 physical pages and tries to grow. You then claim 64 virtual pages. After copying over the data, you have claimed 32 physical pages as well. The other 32 pages should just remain virtual until they are written to. So besides claiming some virtual address space, there is little cost to doubling large lists.
side note: even doubling at a large size shouldn't be too much of an issue on a 64-bit system with a semi-modern OS: you're not actually using ram until you attempt to write to it, in which case a soft-fault causes a jit mapping of, typically 2k of ram. Same happens on a decent OS/FS for any kind of memory mapping of a file (or seeking 2gb forward and writing a byte): stuff you haven't touched will not cost space.
So in practice, that 1.1GiB file (or mem buffer) out of a 2GB requested allocation almost certainly is only costing 1.1GiB rounded up to the next page or block size.
Heh. It sounds like we're aligned. I need to stop typing on my phone :wink:
Ah yeah, I guess I wasn't thinking about this when I commented that before.
Modern hardware definitely has a lot of magic that can really solve certain problems.
I wouldn't be surprised if FB is doing some select work on bare metal without virtual mem, in which case the choice of vector growth factors become a bit more critical again.
Last updated: Jun 16 2026 at 16:19 UTC