Stream: platform development

Topic: Ref count and parallel calls


view this post on Zulip Oskar Hahn (May 20 2024 at 09:01):

The kingfisher platform exports this function to the host: Request, Box Model -> Response. It takes a Model but does not return Model.

This call reduces the refcount of Model by one. Since a refcount of 0 would mean that the value gets deallocated, the host has to set the refcount of Model to 2 before calling Roc.

When this is done in parallel, a segmentation fault happens. Probably, because the host and roc try to manipulate the same memory at the same time.

A workaround is to set the refcount to a very high value any time Model gets manipulated. But this would mean that the platform could only serve that many read requests before the count runs out.

I thought I read somewhere that when the refcount is set to a magic number, Roc treats it as infinity and does not manipulate it. But I don't know where I read this or what that magic number is.

Is there a magic refcount number, that tells roc not to manipulate the ref count?

If not, is there another way to solve this?

view this post on Zulip Richard Feldman (May 20 2024 at 14:44):

@Brendan Hansknecht has done some atomic refcounting work

view this post on Zulip Brendan Hansknecht (May 20 2024 at 14:58):

When this is done in parallel, a segmentation fault happens. Probably, because the host and roc try to manipulate the same memory at the same time.

Yeah, our refcounting is not thread safe currently. Also, simply updating the box refcount may not be enough. It might have to recurse and update the refcounts of things in the box, but not sure.

Is there a magic refcount number, that tells roc not to manipulate the ref count?

Yes, I believe it is -1 as usize, but would need to double check.

view this post on Zulip Oskar Hahn (May 20 2024 at 16:56):

Thank you. The confirmation that there is a magic number helped me find it.

After playing around with different numbers, I think the magic number is 0.

As far as I know, the highest bit of a refcount has to be set to 1.

When I set the value to 2^63+1 ((1<<63)+1, the highest and lowest bits are 1, all other bits are 0), the refcount gets reduced by one (from 9223372036854775809 to 9223372036854775808). If I call the function again without modifying the new refcount, it changes to a random number, probably because the memory is invalid. If I call the function once more without modifying the refcount, I get a segmentation violation, since Model was deallocated.

When I set the refcount to 1 (the lowest bit is 1, all other bits are zero, including the highest bit) and I call the function, the refcount changes to 0. If I call the function again, the value stays at 0. The Model does not get deallocated.

So setting the refcount to 0 does what I want.
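The experiments above fit a simple rule, sketched here as a toy model in Python. This is an assumption inferred only from the observations in this thread, not Roc's actual implementation; the names `REFCOUNT_ONE`, `REFCOUNT_STATIC`, and `observed_decrement` are made up for illustration.

```python
MASK = (1 << 64) - 1       # a refcount occupies one 64-bit word
REFCOUNT_ONE = 1 << 63     # assumed: only the highest bit set means "last reference"
REFCOUNT_STATIC = 0        # assumed: 0 is never changed and never freed


def observed_decrement(refcount):
    """Return (new_refcount, freed) for one decrement under the assumed rule."""
    if refcount == REFCOUNT_STATIC:
        return refcount, False          # left untouched
    if refcount == REFCOUNT_ONE:
        return refcount, True           # last reference: the memory is released
    return (refcount - 1) & MASK, False


# Note the parentheses: in Python (and C), 1 << 63 + 1 parses as 1 << 64.
start = (1 << 63) + 1                   # 9223372036854775809
print(observed_decrement(start))        # (9223372036854775808, False)
print(observed_decrement(1 << 63))      # (9223372036854775808, True)
print(observed_decrement(1))            # (0, False)
print(observed_decrement(0))            # (0, False)
```

Under this model a stored 1 drops to 0 on the first call and then stays there, which matches what was seen; writing 0 directly skips that first decrement.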

view this post on Zulip Brendan Hansknecht (May 20 2024 at 16:58):

Awesome

view this post on Zulip Brendan Hansknecht (May 20 2024 at 16:58):

I guess I forgot the number

view this post on Zulip Richard Feldman (May 20 2024 at 18:42):

we've talked about this in various places; I wonder if the time has come to figure out how we want to do this

view this post on Zulip Brendan Hansknecht (May 20 2024 at 18:49):

My current vote is roughly: use a bit to decide whether or not to do atomic refcounting (this feature is off by default and must be enabled by a compile time or platform flag). Only do atomic refcounting if the platform sets that bit, because the only way for data to be shared between threads is through the platform. When sharing, it can set the bit.
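The proposed design could look roughly like the following Python sketch. Everything here is hypothetical: the choice of bit 62 as the flag, the `RefCell` class, and the lock standing in for a hardware atomic are illustration only, and the counts are plain integers rather than Roc's real encoding.

```python
import threading

ATOMIC_BIT = 1 << 62            # hypothetical flag bit, chosen only for this sketch
_guard = threading.Lock()       # stands in for a hardware atomic operation


class RefCell:
    def __init__(self, count, shared=False):
        # The platform sets the flag when it shares the value across threads.
        self.word = count | (ATOMIC_BIT if shared else 0)

    def count(self):
        return self.word & ~ATOMIC_BIT


def decrement(cell):
    """Take the synchronized path only when the flag bit is set."""
    if cell.word & ATOMIC_BIT:
        with _guard:
            cell.word -= 1
    else:
        cell.word -= 1          # cheap, unsynchronized path


def run_shared(threads=8, per_thread=1000):
    """Decrement one shared cell from several threads and return the final count."""
    cell = RefCell(threads * per_thread + 1, shared=True)

    def work():
        for _ in range(per_thread):
            decrement(cell)

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return cell.count()
```

Values that never leave a single thread keep the cheap path, so only data the platform explicitly shares pays for synchronization.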

view this post on Zulip Richard Feldman (May 20 2024 at 18:52):

I like that design!

view this post on Zulip Oskar Hahn (May 20 2024 at 20:04):

I have another bug when calling roc in parallel. I don't understand what is going on or if this is related to refcount. Maybe you have an idea what the cause is or how I could debug it?

First, the host calls a roc function, that returns a Box Model. Model has type List Str. If Model is Str, everything works.

Then, the host calls the function from above (Request, Box Model -> Response). If the function is called in sequence, there is no problem. But if it is called many times in parallel (around 10,000 calls "at the same time"), the memory gets corrupted. For example, if Model was ["hello"], then after the run, it is something like [ �GM��].

Could it be, that the Str inside the List is refcounted and roc manipulates that refcount?

view this post on Zulip Oskar Hahn (May 20 2024 at 20:47):

I tried to debug this by looking at the memory that gets allocated with roc_alloc.

After calling the function that returns the Model, Roc has allocated the following values:

0x5b3d0ea5c930: [0 0 0 0 0 0 0 128 8 202 165 14 61 91 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
0x5b3d0ea5ca00: [0 0 0 0 0 0 0 128 87 111 114 108 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 133]

The first 8 bytes of each value are the ref counter. When you know how the Roc types look in memory, you can see that the second value is a short string with the ASCII values for "World". From the two 1s in the first value, you can see that it could be a list with one element and a capacity of 1.

After calling the second function, the values look like this:
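The reading above can be checked by decoding the two dumps. This is a sketch that assumes a little-endian 64-bit machine, a list laid out as pointer, length, capacity, and a small string whose last byte carries a flag in its top bit and the length in the low bits; those layout details are assumptions, not something the dump proves.

```python
import struct

FIRST_ADDR, SECOND_ADDR = 0x5B3D0EA5C930, 0x5B3D0EA5CA00
first = bytes([0, 0, 0, 0, 0, 0, 0, 128, 8, 202, 165, 14, 61, 91, 0, 0,
               1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
second = bytes([0, 0, 0, 0, 0, 0, 0, 128, 87, 111, 114, 108, 100]
               + [0] * 18 + [133])

# First allocation: refcount, then what looks like pointer, length, capacity.
refcount, elem_ptr, length, capacity = struct.unpack("<QQQQ", first)
print(hex(refcount))                 # 0x8000000000000000
print(hex(elem_ptr))                 # 0x5b3d0ea5ca08
print(elem_ptr - SECOND_ADDR)        # 8
print(length, capacity)              # 1 1

# Second allocation: refcount, then 24 bytes that look like one small string.
str_bytes = second[8:]
last = str_bytes[-1]                 # 133 == 0x80 | 5
is_small = bool(last & 0x80)
text_len = last & 0x7F
print(is_small, str_bytes[:text_len].decode("ascii"))   # True World
```

The pointer stored in the first allocation lands 8 bytes into the second one, just past its refcount, which suggests the second allocation is the list's element buffer holding the small string.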

0x5b3d0ea5c930: [254 255 255 255 255 255 255 255 8 202 165 14 61 91 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
0x5b3d0ea5ca00: [0 0 0 0 0 0 0 128 87 111 114 108 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 133]

So the values are the same, but the refcount of the list has changed to a very big number. When the function is called again, the refcount of the list gets reduced by one.
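To put a number on "very big", the first 8 bytes of the dump can be decoded both ways. A minimal sketch, assuming little-endian byte order:

```python
before = bytes([0, 0, 0, 0, 0, 0, 0, 128])
after = bytes([254, 255, 255, 255, 255, 255, 255, 255])


def as_unsigned(raw):
    return int.from_bytes(raw, "little", signed=False)


def as_signed(raw):
    return int.from_bytes(raw, "little", signed=True)


print(as_unsigned(before))   # 9223372036854775808
print(as_unsigned(after))    # 18446744073709551614
print(as_signed(after))      # -2
```

Read as unsigned, the new value is 2^64 - 2; read as a signed 64-bit integer, it is -2.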

Should Roc change the refcount of the list? If so, it should probably not be such a high number but something like 0, 1, or the magic infinity. Is this a bug in Roc?

view this post on Zulip Oskar Hahn (May 20 2024 at 20:49):

By the way: it would be nice if roc_alloc had a debugging argument. For example, a pointer to a struct that contains the line number of the Roc code that triggered the alloc call, and a type ID of the value that gets allocated. In optimized builds, the argument could be a null pointer.

view this post on Zulip Brendan Hansknecht (May 20 2024 at 21:43):

Could it be, that the Str inside the List is refcounted and roc manipulates that refcount?

Yes, exactly that

view this post on Zulip Brendan Hansknecht (May 20 2024 at 21:43):

We need to remove this kind of recursive refcounting

view this post on Zulip Brendan Hansknecht (May 20 2024 at 21:43):

I am planning to fix it for list (mostly done minus final tests)

view this post on Zulip Brendan Hansknecht (May 20 2024 at 21:43):

We separately probably need to fix it for recursive tags and box.

view this post on Zulip Brendan Hansknecht (May 20 2024 at 21:44):

Oh also, it is probably freeing the list, not the string

view this post on Zulip Brendan Hansknecht (May 20 2024 at 21:45):

Cause the string is small and would not get freed

view this post on Zulip Oskar Hahn (May 21 2024 at 05:34):

Brendan Hansknecht said:

I am planning to fix it for list (mostly done minus final tests)

This sounds great. Is it the list-size-on-heap branch? I tried it and it fails with munmap_chunk(): invalid pointer

view this post on Zulip Brendan Hansknecht (May 21 2024 at 06:11):

It is.

view this post on Zulip Brendan Hansknecht (May 21 2024 at 06:12):

And yeah, unsurprising. Though more of the code is hopefully done, it definitely has bugs


Last updated: Jul 26 2025 at 12:14 UTC