Ref count and parallel calls · platform development

The kingfisher platform exports this function to the host: Request, Box Model -> Response. It takes a Model but does not return Model.

This call reduces the refcount of Model by one. Since a refcount of 0 would mean that the value gets deallocated, the host has to set the refcount of Model to 2 before calling roc.

When this is done in parallel, a segmentation fault happens, probably because the host and roc try to manipulate the same memory at the same time.
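One interleaving that could lead to this crash can be sketched deterministically. This is a hypothetical model, not roc's actual code: the `Model` class, `host_prepare`, `roc_call` and `run` are made-up names, and the sketch assumes only what is described above (the host writes 2, the call decrements once, and 0 means deallocation).

```python
# Hypothetical model of the protocol described above, not roc's actual code.
class Model:
    def __init__(self):
        self.refcount = 1
        self.freed = False

def host_prepare(model):
    # The host sets the refcount to 2 before calling roc.
    model.refcount = 2

def roc_call(model):
    # Assumed behaviour: the call decrements once and frees at 0.
    if model.freed:
        raise RuntimeError("use after free")
    model.refcount -= 1
    if model.refcount == 0:
        model.freed = True

def run(steps):
    # Apply the steps in the given order to a fresh Model.
    model = Model()
    for step in steps:
        step(model)
    return model

# One request at a time: the refcount never reaches 0.
sequential = run([host_prepare, roc_call, host_prepare, roc_call])
# Two requests overlap: both hosts write 2 before either call decrements.
interleaved = run([host_prepare, host_prepare, roc_call, roc_call])

print(sequential.refcount, sequential.freed)
print(interleaved.refcount, interleaved.freed)
```

In the overlapping order the second call drops the count to 0, so any later call would touch freed memory. A real race could also involve torn or lost updates, which this sketch does not model.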

A workaround is to set the refcount to a very high value any time Model gets manipulated. But this would mean that the platform could only serve that many read requests before the count runs out.

I thought I read somewhere that when the refcount is set to a magic number, roc treats it as infinity and does not manipulate it. But I don't know where I read this or what that magic number is.

Is there a magic refcount number that tells roc not to manipulate the refcount?

Richard Feldman (May 20 2024 at 14:44):

Brendan Hansknecht (May 20 2024 at 14:58):

Yeah, our refcounting is not thread safe currently. Also, simply updating the box refcount may not be enough. It might have to recurse and update the refcounts of things in the box, but not sure.

Oskar Hahn (May 20 2024 at 16:56):

When I set the value to 2^63+1 ((1<<63)+1: the highest and the lowest bit are 1, all other bits are 0), the refcount gets reduced by one (from 9223372036854775809 to 9223372036854775808). If I call the function again without modifying the new refcount, it changes to a random number, probably because the memory is invalid. If I call the function yet again without modifying the refcount, I get a segmentation violation, since Model was deallocated.

When I set the refcount to 1 (the lowest bit is 1, all other bits are zero, even the first bit) and I call the function, the refcount changes to 0. If I call the function again, the value stays at 0. The Model does not get deallocated.
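The numbers in this experiment are easier to follow when the refcount word is viewed both as an unsigned and as a signed 64-bit integer. The snippet below is only an arithmetic check; the helper `as_signed` is a made-up name, and nothing here says how roc interprets the word internally.

```python
import struct

def as_signed(value):
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]

start = (1 << 63) + 1      # highest and lowest bit set
after_one_call = start - 1  # only the highest bit set

print(start, as_signed(start))
print(after_one_call, as_signed(after_one_call))

# Operator precedence differs between languages. In Python, + binds tighter
# than <<, so the unparenthesised form shifts by 64 instead of adding 1.
shift_without_parens = 1 << 63 + 1
print(shift_without_parens == 1 << 64)
```

After one decrement the word equals the smallest signed 64-bit value, so a further decrement wraps around in signed arithmetic.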

Brendan Hansknecht (May 20 2024 at 16:58):

Richard Feldman (May 20 2024 at 18:42):

we've talked about this in various places; I wonder if the time has come to figure out how we want to do this

Brendan Hansknecht (May 20 2024 at 18:49):

My current vote is roughly: use a bit to decide whether or not to do atomic refcounting (this feature is off by default and must be enabled by a compile-time or platform flag). Only do atomic refcounting if the platform sets that bit, because the only way for data to be shared between threads is through the platform. When sharing, it can set the bit.
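A rough sketch of that idea, to make the control flow concrete. Everything here is hypothetical: the bit position `ATOMIC_BIT`, the `RefCell` class and the function names are invented for illustration, and a `threading.Lock` stands in for a hardware atomic operation, which Python does not expose directly.

```python
import threading

ATOMIC_BIT = 1 << 62          # hypothetical flag position in the refcount word
COUNT_MASK = ATOMIC_BIT - 1   # the remaining low bits hold the count

class RefCell:
    def __init__(self, count):
        self.word = count
        self.lock = threading.Lock()  # stand-in for an atomic instruction

def mark_shared(cell):
    # The platform sets the bit before handing the value to other threads.
    cell.word |= ATOMIC_BIT

def decrement(cell):
    # Slow, synchronized path only when the platform marked the value shared.
    if cell.word & ATOMIC_BIT:
        with cell.lock:
            cell.word -= 1
            return cell.word & COUNT_MASK
    # Fast path for values that never leave a single thread.
    cell.word -= 1
    return cell.word & COUNT_MASK

shared = RefCell(8001)
mark_shared(shared)

def worker():
    for _ in range(1000):
        decrement(shared)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

remaining = shared.word & COUNT_MASK
print(remaining)
```

The flag is set once, before the value is shared, so checking it outside the lock is safe in this sketch. Values that stay on one thread pay only for the extra bit test.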

Richard Feldman (May 20 2024 at 18:52):

Oskar Hahn (May 20 2024 at 20:04):

I have another bug when calling roc in parallel. I don't understand what is going on or whether this is related to the refcount. Maybe you have an idea what the cause is or how I could debug it?

First, the host calls a roc function that returns a Box Model. Model has type List Str. If Model is Str, everything works.

Then, the host calls the function from above (Request, Box Model -> Response). If the function is called in sequence, there is no problem. But if it is called many times in parallel (around 10.000 calls "at the same time"), the memory gets corrupted. For example, if Model was ["hello"], then after the run, it is something like [ �GM��].

Could it be that the Str inside the List is refcounted and roc manipulates that refcount?

Oskar Hahn (May 20 2024 at 20:47):

I tried to debug this by looking at the memory that gets allocated with roc_alloc.

After calling the function that returns the Model, roc has allocated the following values:

0x5b3d0ea5c930: [0 0 0 0 0 0 0 128 8 202 165 14 61 91 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
0x5b3d0ea5ca00: [0 0 0 0 0 0 0 128 87 111 114 108 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 133]

The first 8 bytes of each value are the refcount. When you know how the roc types look in memory, you can see that the second value is a short string with the ASCII values for "World". From the two 1s in the first value, you can see that it could be a list with one element and a capacity of 1.

0x5b3d0ea5c930: [254 255 255 255 255 255 255 255 8 202 165 14 61 91 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
0x5b3d0ea5ca00: [0 0 0 0 0 0 0 128 87 111 114 108 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 133]

So after calling the function, the values are the same, but the refcount of the list has changed to a very big number. When the function is called again, the refcount of the list gets reduced by one.
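The dumps can be decoded with a few lines of Python. This is a sketch under assumptions: a 64-bit little-endian machine, an 8-byte refcount followed by three 8-byte words in the first value, and a short-string encoding where the last byte is 0x80 plus the length. Those layout assumptions match the bytes above but are not taken from roc's documentation. The variable and function names are made up.

```python
import struct

def refcount_words(dump):
    """Return the first 8 bytes as (unsigned, signed) 64-bit integers."""
    raw = bytes(dump)
    unsigned, = struct.unpack_from("<Q", raw, 0)
    signed, = struct.unpack_from("<q", raw, 0)
    return unsigned, signed

# Bytes copied from the dumps above.
first_before = [0, 0, 0, 0, 0, 0, 0, 128, 8, 202, 165, 14, 61, 91, 0, 0,
                1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
first_after = [254, 255, 255, 255, 255, 255, 255, 255] + first_before[8:]
second = [0, 0, 0, 0, 0, 0, 0, 128] + list(b"World") + [0] * 18 + [133]

# Assumed layout of the first value after the refcount: pointer, length, capacity.
pointer, length, capacity = struct.unpack_from("<QQQ", bytes(first_before), 8)
print(hex(pointer), length, capacity)

# Assumed short-string layout: last byte is 0x80 | length, text at the front.
payload = bytes(second[8:])
small_len = payload[-1] & 0x7F
text = payload[:small_len].decode("ascii")
print(text)

print(refcount_words(first_before))
print(refcount_words(first_after))
```

Under these assumptions the pointer in the first value is the address of the second allocation plus 8, which is the byte right after its refcount. The changed refcount reads as -2 when interpreted as a signed integer.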

Should roc change the refcount of the list? If so, it should probably not be such a high number but something like 0, 1, or the magic infinity. Is this a bug in roc?

Oskar Hahn (May 20 2024 at 20:49):

By the way: it would be nice if roc_alloc had a debugging argument, for example a pointer to a struct that contains the line number of the roc code that triggered the alloc call and a type ID of the value that gets allocated. In optimized builds, the argument could be a null pointer.

Brendan Hansknecht (May 20 2024 at 21:43):