Stream: ideas

Topic: parallelization


view this post on Zulip Kevin Gillette (Mar 28 2022 at 16:52):

Immutable languages are famously [auto-]parallelizable in theory, whether or not that applies in practice. In a hypothetical 80s era machine that might not have on-cpu caching or other modern features (branch prediction, pipelining, SIMD/MIMD), yet does have multiple independent processors, auto-parallelization is probably a simpler problem with fewer caveats.

On modern computers, there may still be opportunities for auto-parallelization, to which optimistic mutation is both an opportunity and an obstacle (when it's an obstacle, opportunistic mutation would usually be the better choice).

A List.map over a large list (tens of megabytes or even gigabytes) could benefit from opportunistic parallel processing. In some cases, List.walk and similar (List.sum, List.min, and even List.find) could likewise benefit, particularly if the state transform is not order-sensitive.

Does Roc have a plan for either explicit or implicit parallel computation capabilities at the language level? I could see a green-thread implementation with auto-parallelization being potentially quite useful--such that if there's only a single core available, the overhead is minimal compared to no parallelization provided the unit-of-work (chunk) size is sufficiently large.

view this post on Zulip Folkert de Vries (Mar 28 2022 at 16:54):

automatic parallelization has never been shown to work well in practice (in the general case)

view this post on Zulip Folkert de Vries (Mar 28 2022 at 16:55):

the paper that should do it http://iu-parfunc.github.io/gibbon/public/icfp21.pdf makes a point of asking the programmer for where parallelism should be inserted

view this post on Zulip Folkert de Vries (Mar 28 2022 at 16:55):

they found that trying to guess was just ineffective

view this post on Zulip Kevin Gillette (Mar 28 2022 at 17:06):

Ah, cool, thanks!

What about any Roc plans for explicit parallelization? Would that just be a matter of returning a list of functions to the platform, on a platform-by-platform basis? If so, how would the compiler know to protect closed-over data from opportunistic mutation?

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 17:07):

Would be interesting to enable something like list.parallelMap, but I would be really concern about the implications:

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 17:12):

Would that just be a matter of returning a list of functions to the platform, on a platform-by-platform basis?

Can you expand on this. I don't understand the question.

If so, how would the compiler know to protect closed-over data from opportunistic mutation?

The reference count would just be checked. We only do opportunistic mutation when either we know at compile time an object is unique or we check the reference count at runtime and it is unique.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 18:02):

Is Roc's reference counting atomic when the counter may be accessed by multiple CPUs?

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:03):

no

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:03):

that would be slow

view this post on Zulip Kevin Gillette (Mar 28 2022 at 18:07):

iiuc, a counter would need to be parallel-safe if new references can be gained or lost across multiple threads in any case in which life cycles can't be statically proven. Alternatively the data could be copied, but that data might be quite large

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:08):

right, the solutions is not to (explicitly) have different threads

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:09):

you could do data-parallelism just fine with that

view this post on Zulip Kevin Gillette (Mar 28 2022 at 18:10):

Data-parallelism?

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:13):

it's the idea that in a pure language, you can in fact cut up an array and process chunks on different cores. or two branches of a tree, etc

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:13):

and rather than doing that explicitly, by spawning threads and sending the work over and joining the results

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:14):

you could do that in a more first-class way

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:14):

or less nuts-and-bolts in any case

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:14):

though now that I think of it that doesn't entirely fix the issue of atomic reference counts

view this post on Zulip Kevin Gillette (Mar 28 2022 at 18:18):

By returning lists of functions to the platform, I mean some mechanism to say "i have these 10 functions I want to run in parallel, if possible. platform: please make that happen"

Since, iiuc, effects in Roc are managed by returning, from Roc to the platform, a description of the effect to be achieved, it's meaningless to explicitly spawn a single thread because you need to give up your own thread to do it using such a mechanism. If the platform implements something like the Elm Architecture, then this is no longer a concern

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:18):

the paper our RC is based on does have a basic idea of how to fix the problem, by using one bit of the refcount to indicate "can this value be accessed by multiple threads"

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:18):

https://arxiv.org/pdf/1908.05647.pdf

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:18):

but then, you need to check that bit

view this post on Zulip Folkert de Vries (Mar 28 2022 at 18:19):

every time. So you add a branch that is hard to predict in the critical path, so... that's still slow

view this post on Zulip Jared Cone (Mar 28 2022 at 18:30):

Could you like... force the data to be copied when it gets passed from one thread to another? So the data starts with no references and the ref counter doesn't have to concern itself with threads

view this post on Zulip Kevin Gillette (Mar 28 2022 at 18:31):

When does Roc actually need to heap allocate in the single threaded case? Can't any return of data could be transformed into passing in a pointer that the data can be assigned to?

view this post on Zulip Kevin Gillette (Mar 28 2022 at 18:31):

Re copying: imagine a list that is gigabytes large and only read from after initialization

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 18:58):

By returning lists of functions to the platform, I mean some mechanism to say "i have these 10 functions I want to run in parallel, if possible. platform: please make that happen"

This should (maybe) work fine. If the functions don't share captures, it just works because they don't share data. If they do share captures, when they are created the reference count should get increased. Thus the reference count is increased before the functions are run. As such the functions should all make a copy of the data (if they want to modify it). Though this may have bugs when they go to lower the reference count.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 19:01):

I think that for the most part, the platform either has do directly deal with the parallelism: Platform calls both foo and bar in parallel. They are both functions exposed to the host at the global scope. They are guaranteed to be pure and have no closures. As such they can't share any data except what is passed in by the platform. It is the platforms job to make sure that any data passed in is done so in a thread safe way.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 19:03):

Or the platform needs to expose data parallel functions as tasks. These functions will be well formed and would not let roc define anything related to threading. For example, parallelMap would just expose the exact same api as map. The individual reference counted map would never be used in multiple threads except when done so safely by the platform. So no need to worry about roc having multiple references to it or atomic reference counting.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 19:05):

When does Roc actually need to heap allocate in the single threaded case? Can't any return of data could be transformed into passing in a pointer that the data can be assigned to?

Roc only needs to allocate if it is modifying something that is not unique. (or creating something new). A large list passed in from host would not require any allocations. The host could even give it uniquely to roc and roc would be able to modify it inplace.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 19:08):

Side note: Roc could probably have a flag that the platform sets to request atomic reference count update instructions.

view this post on Zulip Folkert de Vries (Mar 28 2022 at 19:10):

Brendan Hansknecht said:

Or the platform needs to expose data parallel functions as tasks. These functions will be well formed and would not let roc define anything related to threading. For example, parallelMap would just expose the exact same api as map. The individual reference counted map would never be used in multiple threads except when done so safely by the platform. So no need to worry about roc having multiple references to it or atomic reference counting.

that is true, but what if two list elements are really the same value, and the two threads both increment it?

view this post on Zulip Folkert de Vries (Mar 28 2022 at 19:10):

in that case you need at least those increments to be atomic

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 19:12):

I see. I guess the platform would have to deal with that and use atomics if necessary. That or delay freeing memory and cache the address to free or something of that nature.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 20:06):

Folkert de Vries said:

every time. So you add a branch that is hard to predict in the critical path, so... that's still slow

It seems that most of the time this can be optimized. A reference in which is not shared across threads need not ever be checked for that bit, just as some values don't need refcounting because they're either stack allocated, or they can be freed as part of the unwind of a specific stack frame. I suspect that relatively few counters ever need to be atomic, and indeed that most values don't need references: for example, if the compiler can prove that nodes within a tree live about as long, but never longer than the root node (i.e. no reference to a sub-tree is persisted), then there's little value in individually refcounting the sub-nodes.

Even in cases where a value is shared across threads, the initial increment of the shared data must happen before the second thread is created, since there's often not a guarantee (or at least not one worth paying for) that the first thread is still alive when the second starts running (and if that occurs with an "increment on receive" strategy, a memory leak has just occurred).

The real issue is that the decrements of inter-thread counters need to be atomic (or have some other race/leak-mitigating mechanism), and further, if a value that is created by thread A and shared with thread B, and then B further creates thread C and shares the value with it, that increment before the share with C generally needs to be atomic because it could be racing with the decrement from A.

It may also be possible to arena-allocate values that share together and which have similar lifetimes, and just refcount the arena itself.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 20:10):

Brendan Hansknecht said:

If the functions don't share captures, it just works because they don't share data. If they do share captures, when they are created the reference count should get increased.

I expect any non-trivial program will share captures. Hopefully it's read-only data compiled into the data, but there may be some data fetch that is then passed to parallel functions to refine.

As such the functions should all make a copy of the data (if they want to modify it). Though this may have bugs when they go to lower the reference count.

A copy of the data would mean the data is no longer shared, thus doesn't require a counter (or doesn't require the same counter). Copied data would work well when it is modified in-place, but wouldn't meaningfully solve any case where the data is used as a shared read-only input and of non-trivial size.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:19):

Note, the copy might not happen. It depends if the data needs to be modified. Also, some thread needs to deal with the eventual freeing of the captured data. So the reference count is still important. I was not suggesting always copying.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:19):

Once copied the data is simply thread local, but probably a copy is not desired in most cases.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:25):

Also the atomic reference counter problem is probably harder to optimized than said above. To ensure a value is never shared between threads, it must:

So I think there would be a ton of cases we can't optimize to remove atomic reference counting.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:27):

I expect any non-trivial program will share captures.

Why?
I would assume good parallel programming should explicitly share data instead of implicitly sharing through things like closure captures

view this post on Zulip Kevin Gillette (Mar 28 2022 at 20:38):

These functions will be well formed and would not let roc define anything related to threading. For example, parallelMap would just expose the exact same api as map.

This seems like a large burden for the platform to implement, especially if we're encouraging people to write their own platform. It makes sense to provide roc_alloc and such, but to also say "oh, and please implement roc_parallel_map because it's needed for the stdlib" is probably an unreasonable ask (and one that would lead to buggy results).

I think it'd be far more favorable to have a green-threading concept in the language: if given no threading capabilities from the host, Roc can still run one task to completion, or schedule a new task at times when the host needs to be involved to run effects (such as when a blocking syscall would've been involved anyways and the thread would be idle).

The host could additionally provide a roc_spawn_thread, at which point there are real threads in which Roc can directly run functions (or configurably, Roc can just run its scheduler trampoline on the thread, which finds green-threads/tasks to run).

iow, Roc could have a threading concept which benefits from host support, but which requires no additional host support (it'd work on 16 bit embedded controllers that have limited hardware capabilities, for example). The question of what to do if the host is directly invoking Roc in separate threads also still applies (but a "too many threads" issue can be mitigated the host taking a posture of "just run your sub-threads as green threads").

Whether it's a reasonable fit for the language is another question, but computers are scaling horizontally (more cores) faster than they're scaling vertically (faster clockspeed), so in a holistic sense, it would be hard for us to say "Roc is competitively fast," if we don't have some streamlined answer to running code on multiple CPUs, specifically one accessible to Roc programmers with minimal investment in the host, and which supports data-sharing in a consistent, safe way.

view this post on Zulip Richard Feldman (Mar 28 2022 at 20:38):

Brendan Hansknecht said:

Side note: Roc could probably have a flag that the platform sets to request atomic reference count update instructions.

this is generally what I've been thinking :thumbs_up:

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:50):

"oh, and please implement roc_parallel_map because it's needed for the stdlib"

I was thinking something very different more like: Roc has no support for parallelism. Platform authors don't have to think about it if they don't want to. If on the other hand, a platform author wants to support parallelism, they can decide to add it. This is not by any means a languages requirement.

I totally could see someone making a rust crate to abstract this away and make it easier for a platform to add this functionality, but still no requirement.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:52):

I think it'd be far more favorable to have a green-threading concept in the language: if given no threading capabilities from the host, Roc can still run one task to completion, or schedule a new task at times when the host needs to be involved to run effects

This can easily lead to deadlocks without the cost of a full runtime that can interrupt execution at random times.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 20:52):

Brendan Hansknecht said:

I would assume good parallel programming should explicitly share data instead of implicitly sharing through things like closure captures

I agree that explicit data dependencies are generally better than implicit, but I'm also generally used to dealing with the downsides of mutable languages as well, where that's probably more important. I'm not sure what that explicit passing should like like in a language where you don't imperatively ask the host to do something, but rather pass back a request for the host to do something. Perhaps instead of returning Spawn myFunc, you do Spawn myFunc myData, and the data is passed as the sole argument to myFunc when it's invoked? I don't see how that's really satisfactory though because it's probably more natural to share data via closure and it seems like it'd be draconian-programmer-punishment for the compiler to say "we see that you're referring to variable a in your function, but we require you to pass that separately, just because we wanted to make your life a bit harder."

I believe that, if the compiler can detect closure-captured data, and separately passing that data via a non-closure mechanism doesn't open up categorical correctness or optimization opportunities, then the compiler/language shouldn't require the separate passing of data. Part of being a high level language is you should be able to do things (within the language paradigm at least, i.e. functional) that feel natural and expect it to "just work."

You could say that forcing data to be passed explicitly gives the programmer a reason to review their data and share less, or restructure it to be more compact or efficient, but such measures benefit single-threaded code as well, so why are we only penalizing multi-threaded use-cases? Sometimes you just need to share a lot of data, and repackaging disparate data sources into an explicit data parameter just constitutes, in my mind, a misuse of the programmer's time, particularly since the language is semantically immutable, and thus there's never any danger associated with implicitly passing any or all data anywhere.

We could certainly provide editor functionality to highlight which data is being shared, and provide tools to help minimize sharing or at least help the programmer understand the runtime cost of such sharing. I believe that would be a better outcome than forcing the programmer to deviate from what they're used to doing, since captured data is similar to let-style bindings, and we encourage people to use those.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:52):

That requires system threads or interrupts or other pieces to be exposed from the platform.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:54):

so why are we only penalizing multi-threaded use-cases?

Because multithread is dangerous and single threaded is not: deadlocks, memory leaks, etc

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:55):

particularly since the language is semantically immutable, and thus there's never any danger associated with implicitly passing any or all data anywhere.

Very true

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:56):

We just don't want to end up penalizing single threaded code due to supporting multi-threaded code. That would happen with certain arc requirements

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 20:58):

so in a holistic sense, it would be hard for us to say "Roc is competitively fast," if we don't have some streamlined answer to running code on multiple CPUs, specifically one accessible to Roc programmers with minimal investment in the host, and which supports data-sharing in a consistent, safe way.

How I see it: Roc is for writing a pure calculation. A pure calculation can be run anywhere. The platform is for running it in parallel or scaling out. Easy when you control the platform, but I guess that may not be the general case.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 20:59):

Brendan Hansknecht said:

This can easily lead to deadlocks without the cost of a full runtime that can interrupt execution at random times.

True. Idris has totality detection; such functions would be guaranteed to be free from deadlocks. Preemption can reduce throughput if it happens too often, though it can also harm responsiveness if it doesn't happen often enough. Those seem to be clear concerns for the platform to decide (a number crunching application probably only cares about throughput, while a video game probably favors responsiveness. I don't like the idea of making the platform integrate yet more hooks just to configure/implement preemption and such, so that would be a reason, as you say, to not have the language deal with any of this directly.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 21:00):

Brendan Hansknecht said:

I totally could see someone making a rust crate to abstract this away and make it easier for a platform to add this functionality, but still no requirement.

Good thought!

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 21:02):

I don't like the idea of making the platform integrate yet more hooks just to configure/implement preemption and such, so that would be a reason, as you say, to not have the language deal with any of this directly.

Yeah, it is really hard, especially since if we want fast surgical linking, we literally can't access anything that isn't handed to us by the host. Without surgical linking we could theoretically have our own runtime.

view this post on Zulip Kevin Gillette (Mar 28 2022 at 21:11):

Brendan Hansknecht said:

How I see it: Roc is for writing a pure calculation. A pure calculation can be run anywhere. The platform is for running it in parallel or scaling out. Easy when you control the platform, but I guess that may not be the general case.

This could limit language adoption (which may or may not be a problem depending on language goals). I don't find a lot of programmers in my domain (SaaS companies) who are satisfied to hear "you just need to learn [at least] these two languages to do anything in this language."

If we end up having a "pretty good" CLI platform with access to networking, HTTP, disk I/O, etc, and a "pretty good" network service platform (or fleet of similar http, telnet, whatever platforms), and a "pretty good" game platform, and have figured out common needs fairly well (if you need to make an http client request, this is what it generally looks like across platforms), then adoption and usability won't really be a problem.

If instead the intent/vision is that every Roc programmer also, necessarily, be a host programmer, then the audience may be pretty niche (i.e. people who want something whose role has traditionally been fulfilled by an embedded interpreter for specialized tasks, but which happens to be compiled to machine code. Or you might end up with someone creating a "posix" platform that just receives descriptions of syscalls to perform, upon which a popular module ecosystem develops, and you reach a point where you have a general audience yet the platform becomes an unimportant, hidden detail.

view this post on Zulip Jared Cone (Mar 28 2022 at 21:41):

delay freeing memory and cache the address to free or something of that nature.

I have no idea how Roc's ref counting works so apologies if this completely misses the mark, but what if the ref count was local to threads but there was an additional "thread count" that gets threadsafe incremented when data is passed from one thread to another, and threadsafe decremented when a thread dies?

How it might look in c++. Note I didn't compile this, just using as an example.

struct Allocation
{
    size_t ThreadCount;
    void* Data;
}

struct Ref
{
    Allocation* Allocation;
    size_t RefCount;

    ~Ref()
    {
        RefCount--;

        if (RefCount == 0)
        {
            delete Allocation;
        }
    }
};

void ThreadA()
{
    // Allocations that may need to be freed when this thread dies
    vector<Allocation*> threadAllocations;

    {
        size_t PayloadSize = 5000;

        // Create an Allocation large enough to hold TheadCount and payload.
        Allocation* allocation = (Allocation*)malloc(sizeof(Allocation) + PayloadSize)
        aloocation->ThreadCount = 1;
        allocation->Data = allocation + (size_t)&((allocation*)nullptr)->Data;

        // Create a reference to that allocation on this thread
        Ref ref;
        ref.Allocation = Allocation;
        ref.RefCount = 1;

        // Pass this data off to some other thread
        {
            // Increment our RefCount to ensure the allocation stays alive at least until this thread dies.
            ref.RefCount++;

            // Try deallocating when ThreadA dies
            threadAllocations.push_back(ref.Allocation);

            // Create a RefCount that gets passed off to the new thread.
            Ref threadRef;
            threadRef.Allocation = ref.Allocation;

            // But no matter what our RefCount is, set ThreadB's RefCount to 2 so no operations on ThreadB will mutate the allocation
            threadRef.Refcount = 2

            // Make sure it can't be deallocated while more than one thread references it
            _InterlockedIncrement(&ref.Allocation->ThreadCount);

            StartThread(ThreadB, threadRef);
        }

        // ... do some work ...

        // ref.~Ref() gets called, decrements RefCount, but RefCount goes from 2 to 1 so no deletion occurs.
    }

    // This thread is done, see if we need to cleanup
    for (Allocation* threadalloc : threadAllocations)
    {
        // If it doesn't decrement to zero, then ThreadB is still running
        if (_InterlockedDecrement(&threadalloc->ThreadCount) == 0)
        {
            delete allocation;
        }
    }
}

void ThreadB(Ref data)
{
    vector<Allocation*> threadAllocations;

    // If we were given ownership of data then its RefCount would be 1.
    // If it's larger than that then the data may still be used by the other thread.
    if (data.RefCount > 1)
    {
        threadAllocations.push_back(data.Allocation);

        // data.RefCount is already 2, so nothing in this thread will try to mutate it
    }

    // ... do some work ...

    // This thread is done, see if we need to cleanup
    for (Allocation* threadalloc : threadAllocations)
    {
        // If it doesn't decrement to zero, then ThreadA is still running
        if (_InterlockedDecrement(&threadalloc->ThreadCount) == 0)
        {
            delete allocation;
        }
    }
}

Pro:

Con:

view this post on Zulip Jared Cone (Mar 28 2022 at 21:47):

Though I imagine this gets pretty hairy when dealing with nested allocations/refs

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:02):

Jared Cone said:

both threads will still increment and decrement those refcounts just from their own use, which is sufficient to create data races. For example, let's say each thread runs a function like this:

\ x -> [ x, x ]

this will result in a refcount increment because x is now being used twice in the same list

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:03):

when the refcount increase happens, that involves the following:

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:04):

if both threads are doing that, it's possible that they both do the "read the current refcount" step at the same time, then they each increment that refcount, and then write the same number back - so the refcount ends up being 1 lower than it should be, resulting in problems

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:04):

instead, it's possible to do an atomic instruction to read-increment-write

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:04):

this would mean it's safe to do across threads

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:05):

and the data race can't happen

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:05):

however, the atomic read-increment-write instruction is slower than the "read, then increment, then write" sequence of instructions

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:05):

hence why if we always did refcounting that way, it would decrease performance across all Roc applications, even single-threaded ones

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:06):

also worth noting that there are some applications of multithreading that don't require _applications_ to share data across threads

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:06):

for example, a web server platform can take care of managing a thread pool for web requests, in which each request handler is running on a different thread, but they never talk to each other or share data

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:07):

so it's still completely safe to do the faster non-atomic refcount even though all the CPU cores are potentially being used efficiently for massive parallelization (across requests, not within a single request)

view this post on Zulip Jared Cone (Mar 28 2022 at 23:08):

I was suggesting rather than each thread referring to the same ref counter, they would each have their own ref counters.

view this post on Zulip Jared Cone (Mar 28 2022 at 23:09):

and when data passes from one thread to another, the ref count on both threads is incremented so that the resource won't try to be deleted until the threads end

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:10):

So if you have 10 threads you have 10 refcounts?

view this post on Zulip Jared Cone (Mar 28 2022 at 23:13):

In the example above, there are two Ref's that refer to the same allocation but each have their own ref count. Single-threaded behavior is exactly the same, but when the data gets passed to another thread the RefCount gets force-incremented so neither thread is allowed to deallocate or mutate the allocation, until the end of the threads when they do the more expensive InterlockedDecrement to see if there are any more threads referring to the allocation

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:13):

so right now the exact location of the refcount is right next to the heap-allocated memory it's refcounting. For example, in a List we have all the elements in one contiguous chunk of memory, and then right before the first element is the refcount.

This means that the should be good cache locality for refcounts; if you're accessing the list's data, chances are good that the refcount is already in the CPU's cache because CPUs cache memory in 64B chunks, so it will often end up with the refcount being in memory just by coincidence of having accessed the list data (or vice versa)

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:13):

if instead we had N refcounts per thread, we couldn't do that anymore; the refcounts would have to be stored in a completely different part of memory from the data they were refcounting

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:13):

meaning more cache misses, which could easily (even in a multithreaded environment) be more expensive than the atomic increment operation

view this post on Zulip Jared Cone (Mar 28 2022 at 23:14):

yep good point

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:15):

it could be more efficient in some circumstances, e.g. when the refcount is being incremented super frequently, to the point where there's significant contention on the atomic increment operation (at which point the refcounts would be having so much traffic that they'd probably be in cache anyway) but that seems like it would be an uncommon case :big_smile:

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:18):

Another option, roc could just force all refcounts to be set to max when sharing between threads.

This would mean it is equivalent to static data. Each thread would view that data as a constant object.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:19):

Or nvm. Someone has to eventually free the data.

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:20):

that actually could work in some cases

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:21):

e.g. run this closure on N other threads, but first store all the refcounts for everything that will be accessible on those threads from the original threads (e.g. closed-over data), then set them to 0 so the individual threads will look at them and think they're immortal and not try to change them, then once all the closures have finished running, restore the refcounts to their original state

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:25):

The big issue is that this would somehow need to either be done by the platform or done in the background otherwise to keep mice roc ergonomics. I feel that most likely a user would just include something in a closure without thinking about it.

view this post on Zulip Jared Cone (Mar 28 2022 at 23:25):

Another option, roc could just force all refcounts to be set to max when sharing between threads.
This would mean it is equivalent to static data. Each thread would view that data as a constant object.
Or nvm. Someone has to eventually free the data.

could combine that with a threadsafe thread counter that lives next to the ref counter. When data gets passed from one thread to another the ref count doesn't really matter anymore, and instead threads check that thread counter when the thread dies

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:27):

Yeah, and the thread safe counter could be only for the top level closure captures since all internal data would be constant.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:28):

Though still not sure how to ergonomically just change a function to support this. Especially problematic given you don't really want to make all roc code thread aware. Like every closure would need to know if it might be run in parallel.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:29):

Oh, and if a closure is run multiple times you have other problems cause you still don't know when to free the data.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:30):

Would require the host to free the closure I guess.

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:32):

Oh wait, if the host owns the closure anyway, there is no need for any of the thread counting. The host has to free the closure and when it does, the closure should no longer be running. So all closure data can last until the host frees the closure

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:33):

If that is the case, the host just has to be able to modify any closures it will run in parallel to have all 0 refcounts before running them in parallel.

view this post on Zulip Jared Cone (Mar 28 2022 at 23:33):

Assuming it's a platform effect that takes a closure as input and creates a thread from it, perhaps that platform function could override the ref counters and increment the thread counter, though I don't know what kind of access platform functions have to those kinds of internals

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:34):

We could give them a helper to tell roc it wants to run the function in parallel

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:34):

That will just make all of the captures constants.

view this post on Zulip Jared Cone (Mar 28 2022 at 23:37):

I was thinking the Roc code would just be

myData = { ... }
PlatformThread.run (\{} -> DoExpensiveStuff myData)

and PlatformThread.run is some Rust or Zig function that takes in the closure, sets ref counts to max, increments thread count, starts thread

view this post on Zulip Richard Feldman (Mar 28 2022 at 23:37):

yeah that's what I meant in this idea

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:41):

Except you still have problems with a few things:

view this post on Zulip Brendan Hansknecht (Mar 28 2022 at 23:44):

I think this would lead to needing to check the thread count all over the place, which means that it would require atomics loads all over and would likely be the same or more costly than just using an atomic reference count.

view this post on Zulip Jared Cone (Mar 29 2022 at 00:05):

I think you would only need to check and alter thread count when passing data to a new thread and when a thread ends

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:06):

Yes, but you need to read it every time you do any operation that would read the normal refcount.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:06):

And that will be an atomic load

view this post on Zulip Jared Cone (Mar 29 2022 at 00:12):

Why would you need to read it with every operation?

view this post on Zulip Jared Cone (Mar 29 2022 at 00:12):

I was still under the assumption that when passing data to another thread RefCount would get set to max

view this post on Zulip Jared Cone (Mar 29 2022 at 00:14):

Actually setting to max would overflow as soon as a new ref gets added, so rather passing data to another thread should increment the ref count without a corresponding decrement which leaves the data immutable

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:22):

0 is static/infinite/max refcount. We would only have to check it for any potentially data modifying operations. For example every item in a List.map would check the thread refcount for that element.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:23):

Also, we check it every time there is a new reference to an item

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:23):

And every time a reference goes out of scope

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:24):

So this can actually happen all over the place in a chain of function even if we never modify something

view this post on Zulip Jared Cone (Mar 29 2022 at 00:25):

if 0 is infinite, wouldn't every ref increment have to make sure it isn't 0 before incrementing?

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:25):

yep. it can be done for free with saturating math. anyway, that is just an implementation detail that can be ignored for this thread.

view this post on Zulip Jared Cone (Mar 29 2022 at 00:27):

data modifying operations wouldn't need to check threadcount. If threadcount > 1 then refcount is infinite, so data modifying operations just need to check refcount and they know they can't modify the data because refcount is infinite

view this post on Zulip Jared Cone (Mar 29 2022 at 00:31):

threadcount should only need to be checked at the end of a thread that was given data that had threadcount > 1, and at the end of a thread that was responsible for incrementing threadcount from 1

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:31):

Ah, your right. I guess everything local to a closure can ignore thread count. All closures would just have to increment thread count on creation and decrement it on destruction.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:32):

Also all data to/from the host would also have to deal with thread count cause we can't know what state it is in.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:33):

This has a major loss still. We have no way to distinguish from a closure that will be used in a multi-threaded way and one that won't, so all closures would need to deal with threadcount.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:34):

So this would pessimize single threaded code.

view this post on Zulip Jared Cone (Mar 29 2022 at 00:34):

it wouldn't be up to the closure, it would be up to the PlatformThread.run hosted effect

view this post on Zulip Richard Feldman (Mar 29 2022 at 00:34):

it can be done for free with saturating math

we actually don't do that because sometimes the reason it's maxed is because the memory is in the readonly section of the binary, so if we tried to write to it we'd get an access violation :big_smile:

view this post on Zulip Jared Cone (Mar 29 2022 at 00:36):

a closure shouldn't have to worry about threadcount, it would be up to PlatformThread.run to iterate the captures inside the closure to set refcounts to infinite and threadcount++

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:36):

it wouldn't be up to the closure, it would be up to the PlatformThread.run hosted effect

But the closure passed to that could come from anywhere. Also, any closure returned to the host could be run in a different thread. So all captures most deal with this cost cause they can't know if the data is currently being used elsewhere as well.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:37):

Side question: how do we atomically update all uses of a refcount to 0? We have no idea when a thread will be launched and something else could currently be modifying that data when it still has a normal refcount. Before we set the refcount to 0.

view this post on Zulip Jared Cone (Mar 29 2022 at 00:40):

we know when a thread will be launched because launching a thread goes through PlatformThread.run. If code elsewhere in the platform starts spinning up threads on its own and passing Roc data into those threads without updating refcount and threadcount that's a fault of the platform

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:44):

PlatformThread.run could be run in parallel. Or it could be run in one thread while another thread is running List.map on data it closes over. I don't think it will come out so cleanly and simply.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:50):

It also still misses that thread count doesn't decide when we free the data. A closure can be run many times.

view this post on Zulip Jared Cone (Mar 29 2022 at 00:50):

it could be run in one thread while another thread is running List.map on data it closes over.

in order for that to happen, the data would have to have been passed from one thread to another at which point threadcount would have been incremented and refcount would have been set to infinite

view this post on Zulip Jared Cone (Mar 29 2022 at 00:51):

It also still misses that thread count doesn't decide when we free the data. A closure can be run many times.

once data gets passed from one thread to another, threadcount is in charge of when the data gets freed

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:53):

Another annoyance I just realized, structs are never reference counted. And they are normally just passed by const reference. I think that all struct uses would incur a copy or we would need to add a refcount field to all structs (and I guess thread count).

view this post on Zulip Jared Cone (Mar 29 2022 at 00:54):

they're probably passed by const ref to functions, but I assume they would be copied for closures

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:55):

Sure, but it mostly gets compiled away by llvm which is why that is ok. Not the case with threading. So suddenly a closure may have a huge cost

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 00:55):

Actually we have Box now, so that should be fine since box is refcounted.

view this post on Zulip Jared Cone (Mar 29 2022 at 00:58):

re: threadcount - when something gets allocated, it's getting allocated by some thread. Refcount only matters as long as only that thread references the allocation. You could think of that thread as the "owner" of the allocation. When that thread decides to pass the data to another thread (via PlatformThread.run), the refcount gets set to infinite and the threadcount gets incremented. From then on it doesn't matter who gets the data, the refcount is infinite until the very last thread decrements threadcount to zero at the end of its life, at which point the allocation gets freed

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:00):

the refcount is infinite until the very last thread decrements threadcount to zero at the end of its life, at which point the allocation gets freed

I think that also needs to include: "and the closure decrements it's refcount to 0 ensuring it can't be run again"

view this post on Zulip Jared Cone (Mar 29 2022 at 01:01):

the closure itself could be counted among the data that gets refcount set to infinite

view this post on Zulip Jared Cone (Mar 29 2022 at 01:02):

so the whole tree: the closure, its captures, the things the captures reference, and so on

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:03):

Oh, what about the case were we close over a list. An element is pulled out of the list and returned from the function. This element would have a thread and infinite reference count. It is currently only running in a single thread. We use it and then free it. Then the closure is run again. We panic when accessing an already free element.

view this post on Zulip Jared Cone (Mar 29 2022 at 01:05):

Do individual elements have refcounts?

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:05):

An element could be a list

view this post on Zulip Jared Cone (Mar 29 2022 at 01:05):

right

view this post on Zulip Jared Cone (Mar 29 2022 at 01:06):

so yes, if it's a list of lists, when that list gets given to a closure then its refcount and the refcount of all the element lists gets set to infinite

view this post on Zulip Jared Cone (Mar 29 2022 at 01:07):

so it doesn't get freed until the last thread ends and sets its threadcount to zero

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:10):

data = [[1,2,3],[4,5,6]]
closure = \{} -> List.get data 0

# increment data and sub list thread counts to 2 and set refcount to infinite
x = PlatformThread.run closure
# decrement data and sublist thread count back to 1 after we receive that data.
# Note: this may be split across the code,
# but eventually we wait on and get a result back before running code below

# x is now [1,2,3] with a threadcount of 1 and a refcount of infinite

# do something with x and then get rid of it
# x now has a thread count of 0 and a refcount of infinite : free it

# run the closure again, but [1,2,3] is freed and the program crashes
y = PlatformThread.run closure

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:15):

Extra note, this closure might be passed to the host and then back to Roc before being called again. So this function would be completely over and for sure would have cleaned up any data if we were using a thread count. That or the data is leaked because we don't clean it up and eventually it is never used again.

view this post on Zulip Jared Cone (Mar 29 2022 at 01:47):

"x now has a thread count of 0" I was suggesting thread count doesn't get decremented until the end of the thread. That way we aren't having to touch thread count every time ref count decrements

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:50):

So when does it get set to 0? This function could return to the platform as I mentioned above.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 01:50):

Would that mean the main thread never ends? Roc no longer has a view of this data and somewhere it has to be freed.

view this post on Zulip Jared Cone (Mar 29 2022 at 02:08):

It gets set to zero when the last thread referencing it ends. This includes the main thread.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 02:10):

When does the main thread end? The main roc function could have done something with x and then returned?

view this post on Zulip Jared Cone (Mar 29 2022 at 02:10):

Yep

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 02:10):

This sounds like the data would be leaked and around until the application ends.

view this post on Zulip Jared Cone (Mar 29 2022 at 02:14):

The "thread counts" could be scoped to anything, doesn't necessarily have to be scoped to just threads. The platform can create a context, calls a Roc function that may add thread count items to that context, then platform frees that context which decrements thread counts

view this post on Zulip Jared Cone (Mar 29 2022 at 02:15):

Similar idea to using arena allocator

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 02:19):

Sure, but then you don't just have PlatformThread.run closure you now have to pass a context around and everything gets way more explicit with potential bugs depending on how the end uses uses a context. That is very different than the proposed api with just being able to run a closure in parallel.

view this post on Zulip Jared Cone (Mar 29 2022 at 02:20):

No, the context is under the hood as far as Roc is concerned

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 02:20):

How?

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 02:21):

Also, if the context is just gonna free everything later anyway, do we need any threadcount? we can just do arena allocation and move on.

view this post on Zulip Jared Cone (Mar 29 2022 at 02:22):

Not if there is a thread still running when the roc function returns

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 02:22):

This would defeat the purpose of reference counting and any long living context may not be a memory leak, but it would just collect tons of data caues it wouldn't know if the data could be accessed again.

view this post on Zulip Jared Cone (Mar 29 2022 at 02:23):

Tho I'm not sure if it makes sense for a thread to run longer than the roc function that starts it. I'm thinking like, a thread to write to a file, but that would be an effect anyways

view this post on Zulip Kevin Gillette (Mar 29 2022 at 04:01):

For a typical platform, when would we want a platform sourced value to be passed by Roc to other threads? Couldn't the platform just pass this same data to each thread (as some kind of static record argument), rather than having Roc behave like it [falsely] own such data? Can we just treat platform sourced data as if it's static, and have the Roc compiler ensure that such values don't escape?

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 04:09):

It is actually really useful that roc can own data passed in from the host. It means we can do inplace mutation to it. Roc is also able to free the data when done with it.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 04:10):

So roc actually does own the data. Nothing false about it

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 04:10):

Of course the host could choose to pass it in a way such that roc would not own it.

view this post on Zulip Giovanni Zanandrea (Mar 29 2022 at 11:33):

I can't quite wrap my head around this dual threadcount/refcount, but would it be possible to do the following?

Let's say we have a maximum of 8 threads and we keep them in a pool. Each thread gets its own id, which is a number that goes from 0 to 7

Let's say we reserve a 64-bit word for the reference count. Each thread stores its own reference count in one of the 8 bytes: thread 0 in the least significant byte, thread 7 in the most significant one and so on. Bytes can be written atomically, so that should not create race conditions, right?

Data is deallocated when the entire 64-bit word is 0.

Sometimes the reference count of an individual thread can exceed the capacity of a single byte. To handle that, each thread keeps a second reference count in a separate dictionary. For most blocks of memory, we'll never need to use this dictionary at all, and when we need to do it, that can be done very efficiently (I'll post the detail later, if anyone is interested)

Could that work?

view this post on Zulip Kevin Gillette (Mar 29 2022 at 13:23):

It may be platform dependent whether sub-word-sized values can be written atomically?

view this post on Zulip Kevin Gillette (Mar 29 2022 at 13:45):

With your proposal, increments to the byte size counter after the first one perhaps do not need to be atomic, and perhaps even could be registerized (really only the non-zero-ness of the slot is what matters). The last decrement would need to be atomic or at least force a cache spill in the other processors.

Given that the other threads only care if your thread has any references, but don't care how many, your byte per thread may be reduced to a bit per thread, and on 64-bit platforms you could have one 64 threads represented in a single word. That would benefit from having atomic bitwise or and xor, which I'm not sure exist, or atomic load followed by atomic compare-and-swap in a loop (that'll usually succeed on the first try), or a mutex.

I believe the problem here is: how do you deterministically assign a given thread to a given byte or bit, and cheaply and without collisions (which would mean memory leaks and/or use-after-free), and how can a value be shared by more than 8/64 threads?

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 14:52):

Also, I think the reads and writes still have to be atomic across the entire 64bits. Otherwise, you could hit the case where 2 threads decrement their byte, read the counter, both see zero, and you double free the data.

view this post on Zulip Brendan Hansknecht (Mar 29 2022 at 15:15):

Two general thought on designs:

view this post on Zulip Giovanni Zanandrea (Mar 30 2022 at 11:50):

Kevin Gillette said:

The last decrement would need to be atomic or at least force a cache spill in the other processors.

Brendan Hansknecht said:

Also, I think the reads and writes still have to be atomic across the entire 64bits. Otherwise, you could hit the case where 2 threads decrement their byte, read the counter, both see zero, and you double free the data.

True, I missed that. This is the algorithm I had in mind:

  read word from memory to a register
  decrement byte in the register
  if register == 0
    release
  else
    write byte back to memory

but that could lead not to a double free but to a memory leak. I should have thought harder about it before posting, sorry. What about this modified version?

  read word from memory to a register
  decrement thread's own byte in the register
  if byte > 0
    write the byte back to memory
  else if register == 0
    release
  else
    write the entire word back to memory using CAS, and if that fails retry

It would still use CAS, but more sparingly. If N threads ever hold a reference to a block of memory, that would require no more than N-1 CAS operations (per block of memory), assuming that all of them are successful. Could that work?

Kevin Gillette said:

Given that the other threads only care if your thread has any references, but don't care how many, your byte per thread may be reduced to a bit per thread, and on 64-bit platforms you could have one 64 threads represented in a single word.

I don't understand this, for the same reason I couldn't see how the threadcount+refcount could possibly work. If threads don't keep a local (per-thread) reference count, how would they know when they can release their own bit?

Kevin Gillette said:

I believe the problem here is: how do you deterministically assign a given thread to a given byte or bit, and cheaply and without collisions (which would mean memory leaks and/or use-after-free)?

Assigning an id to a thread wouldn't be a problem, the problem is where to store it. It's just three bits, but as far as I can tell it would have to be kept in a CPU register. I've no idea if that would be feasible or worth it in practice, or if there are other solutions.

Kevin Gillette said:

... and how can a value be shared by more than 8 threads?

I've no answer to that. I guess a different technique would be required for a really large number of threads. Still, 8 threads is better than nothing, and that's just the threads that execute ROC code. The platform itself could have more threads that it uses for its own purposes.

Kevin Gillette said:

It may be platform dependent whether sub-word-sized values can be written atomically?

Is it? I was under the impression it was a universal feature, but I don't really know.

Brendan Hansknecht said:

we should likely avoid anything that requires a separate data structure like a map to track this information. It would mean there is a large cost to looking up and modifying reference counts which is a very common operation.

I think that can be done very cheaply at least in cases like this one. Let's say you have two counters, RC1 and RC2. RC1 would be stored next to the object itself and it would only take one byte. RC2 would be stored in the dictionary. The overall reference count (let's call it RC), would be calculated like this:

  RC = RC1 + 64 * RC2

with the constraint that if RC1 < 128, then RC2 must be 0. Here's how you add a reference:

  if RC1 == 255
    RC1 = 255 - 64 + 1
    RC2++
  else
    RC1++

and this is how you release it:

  if RC1 == 128 and RC2 != 0
    RC1 = 128 + 64 - 1
    RC2--
  else
    RC1--

  if RC1 == 0
    release

That way, the dictionary would be used at most once every 64 addref()/release() operations, and in the vast majority of cases it would not be touched at all.

Brendan Hansknecht said:

We should be very careful of any designs that have per thread data, even if they are faster, they will increase the memory consumption of the app for every piece of reference counted data. This could be a large cost especially with node based data structures.

Not sure I understand this, at least not in this context. The dictionary containing the extra reference count (RC2 in the above code) would be used very sparingly. Can you elaborate?

view this post on Zulip Kevin Gillette (Mar 30 2022 at 15:00):

I meant one-bit-per-thread in the shared U64 (up to 64 threads) to keep track of which threads still have references, and every thread that ever has more than one reference could keep its own local recount.

That said, I don't believe optimizing for or limiting to 8 threads or even 64 threads is meaningful. The techniques and tradeoffs we're discussing are meaningful, but in the context of solving this for any number of threads.

In terms of C style use of OS threads (POSIX equivalent model), being able to manage hundreds or low-thousands of threads is competitive with other languages. The limiting factor is thread stack size, because C-style stacks are fixed in size (on a thread-by-thread basis at least), and since the default size is usually 1 MiB, 1000 threads will use about a GiB of virtual memory. Since the thread handed to us by the platform is really only going to be used by Roc itself and the Zig builtin code for data structures and such (though I'm not sure if returning for effects will use the same thread to run the effect), we may be able to set a lower default depending on the memory characteristics of Roc code.

In languages with green threads (Stackless Python and Go follow this model), being and to handle millions of green threads is competitive. In Go's case, it attaches its green threads onto real threads via a runtime scheduler with language-specific knowledge and optimizations (communication channels, locks, timers, network via epoll, etc), and has growable stacks starting at 2 KiB in size (so discounting some amortized overhead, Go can have up to 512 green threads for the same memory as a single C thread, and Go typically just has as many OS threads as there are CPU cores).

Anyways, I believe we would need to support reference counting over any number of threads, but there are techniques we should be able to find for that. Swift does ARC and has threads, for example

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 15:33):

What about this modified version? write the entire word back to memory using CAS, and if that fails retry

This would mean decrementing a refcount would be loop over your entire modified function. In high contention scenarios with short lived threads, a thread could loop many many times before getting to write and continue (theoretically forever). Cause it would have to loop over your entire modified version not just the CAS write. Since if the register is change it needs to recalculate if the register would be zero after it's decrement.

and this is how you release it: if RC1 == 128 and RC2 != 0

Every decrement would need to look up the value in the dictionary. This would be the high cost operation.

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 15:38):

Not sure I understand this, at least not in this context. The dictionary containing the extra reference count (RC2 in the above code) would be used very sparingly. Can you elaborate?

For my comment on haveing per thread data. I meant the per thread byte, not specifically the dictionary. With the per thread byte, every reference count grows by 8 bytes if you have 8 threads. That just doubled the memory usage of the reference count. With more threads it would use even more data. Even if we only supported up to 1 thread per core, some systems have 100+ cores. So that would be 100 extra bytes per reference count. In a node based data structure, that is 100 extra bytes per node.

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 15:43):

Anyways, I believe we would need to support reference counting over any number of threads, but there are techniques we should be able to find for that. Swift does ARC and has threads, for example

100% agree. With the swift comment, do they just force any shared data to be atomically reference counted? Or is it more that you can use ARC and by default it uses regualar RC and is buggy if you aren't careful?

Side note: apple has more optimized ARC instruction than other CPUs. theirs have no extra costs compared to the memory reads/writes related to them.

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 17:22):

Looked into his a bit more swift just uses ARC everywhere by default. All classes uses ARC even in a single threaded programs.

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 17:52):

Based on the numbers posted on a Twitter thread related to this in 2020. Apple silicon is about 5x faster than x86-64 hardware when running retain and release (very similar but not exactly the same as atomic increment and decrement due to how they store the counters).

For Intel the cost is about 30ns (90 cycles) and for apple it is 6.5ns (20 cycles). These instructions always have to access L2 cache, and they are doing it twice.

So thinking only about intel we can estimate the cost of atomic reference counting that many roc users would experience.

Atomic (just from number mentioned above):

Non atomic (assuming refcount in L2 cache):

The big advantage of a non-atomic refcount is that it can be in L1 cache. From L1 cache the load instruction becomes about 5 cycles.
Non atomic (assuming refcount in L1 cache):

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 17:54):

So in the worst case, atomic reference counts are probably a ~10x cost per increment/decrement.

Note: if we decrement and free, the cost of freeing is probably so large that this doesn't matter. So in a sense, if roc does a good job of keeping things unique and doing inplace mutation, a lot of this cost is hidden.

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 17:56):

Of course when dealing with large data structures, the cost is greatly reduced, while waiting on update the first reference count, we can start updating the second and third and etc. So the reference count updates get pipelined due to no dependencies.

view this post on Zulip Brendan Hansknecht (Mar 30 2022 at 18:03):

One more note: except for the case where we decrement and deallocate or we decrement and the branch predictor is wrong, every use of a refcount should pipeline with no dependencies to the code after it. That will also greatly lower/hide the cost.

view this post on Zulip Giovanni Zanandrea (Mar 31 2022 at 08:30):

Brendan Hansknecht said:

and this is how you release it: if RC1 == 128 and RC2 != 0

Every decrement would need to look up the value in the dictionary. This would be the high cost operation.

I understand now why what I proposed wouldn't work, but with that said, that's obviously a short-circuit and. The dictionary lookup would be needed only when RC1 == 128.

view this post on Zulip Giovanni Zanandrea (Mar 31 2022 at 08:31):

While we're on the subject of reference counting, there's a couple questions I wanted to ask.

The first one is, what's the typical performance cost associated with RC in ROC? That is, if you compile the same program twice, the first time with RC and the second time with no GC at all, and compare their performance on tasks that are small enough for the second version to complete without running out of memory, what's the typical performance difference in a realistic scenario? I'm asking specifically about the single-threaded case, with regular (non-atomic) RC.

Second, how fundamental is RC to ROC's opportunistic mutation strategy? That is, we can determine if a given object can be safely updated either by checking its reference count at runtime, or by a static analysis of the source code. Obviously the former is more accurate, but how much more accurate in practice? Do we have any data? Another way of asking this would be, could ROC be implemented with a different GC strategy (say, arena/stack allocation) without significantly reducing the effectiveness of opportunistic mutation?

view this post on Zulip Kevin Gillette (Mar 31 2022 at 14:12):

I can't answer the relative performance cost question, but regarding the accuracy of static analysis vs RC, generally a compiler in an RC language doesn't omit RC for a value unless it can prove the lifecycle of the value is bound to the stack frame in which it's created (i.e. it's either a constant sized value or, in a pure FP like Roc, it's never returned from the function which created it). Thus static analysis and RC are both 100% accurate if implemented without bugs.

view this post on Zulip Kevin Gillette (Mar 31 2022 at 14:42):

So the difference is just in whether there's CPU and memory overhead at runtime.

view this post on Zulip Kevin Gillette (Mar 31 2022 at 14:47):

In theory the speed would be both a function of 1) how often references are acquired and released, and 2) the memory overhead of each counter.

For 1, if you acquire a reference, then crunch the data in a hot loop, the relative cost should be low, but if you're constantly acquiring and releasing data, or duplicating references (this data is in two lists now), then the relative cost can be high.

For 2, in theory, at least in a mutating, pointerized language with an RC implementation, if a programmer needs to create a list of n U8 "application level counters" to share 1 each to n threads (with some orchestrator thread seeing the whole list), then the compiler might take the Roc approach of placing the RC in memory immediately preceding the value it counts. If the language uses U32 for RC, then in this case the list might be allocated as:

[U32RC U8App U24Pad, U32RC U8App U24Pad, ...]

The U24Pad is just 3 bytes of padding to make sure the next U32RC is 4 bytes aligned (since unaligned numerics ops are typically either prohibited by the architecture or more expensive: 2 loads, some shifts, and a bitwise or to read, then the reverse to write). This means that just to create a reference counted pointer to a byte value, 8 bytes in memory are consumed (16 bytes if the reference counter is a U64, though that seems less likely). This is less of an issue for sharing larger values like strings, but in any case with many reference counters, less application data fits in a CPU cache line. Compared to an unshared list of U8, in which 64 elements might fit in a cache line, only 8 of the above elements fit in a cache line, probably leading to less cache utilization and more memory access thrash.

Aside: some compilers might be super advanced and pack their bytes and RCs so that the array has chunks of:

[U32RC U32RC U32RC U32RC U8App U8App U8App U8App]

Thus eliminating padding. If this memory optimization is applied, at runtime it would need to check check addrOfU8 % 4 to figure out which counter applies, thus incurring a higher CPU cost, but getting much better cache utilization. This optimization probably isn't worth the time for anyone to implement though, because it's not often that programmers want to share a value smaller than the ref counter.

view this post on Zulip Kevin Gillette (Mar 31 2022 at 14:48):

Also, that kind of thing would be especially hard to express in Roc, and thus would hopefully never happen anyways

view this post on Zulip Folkert de Vries (Mar 31 2022 at 14:53):

as a note, the static analysis is inherently not as precise as RC for in-place mutation

view this post on Zulip Folkert de Vries (Mar 31 2022 at 14:53):

If I do List.repeat myString 1 then that string (assuming it as unique) has a refcount of 1, and can be updated in-place

view this post on Zulip Folkert de Vries (Mar 31 2022 at 14:53):

but List.repeat myString 2 means that is no longer true

view this post on Zulip Folkert de Vries (Mar 31 2022 at 14:54):

then, for List.repeat myString n the static analysis says: nope, cannot prove that myString can be updated in-place

view this post on Zulip Folkert de Vries (Mar 31 2022 at 14:54):

while at runtime we may discover that actually n == 1 and we can update in-place

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 14:58):

I think the other big thing with static analysis is that it will never work for data from the platform. We just don't have any guarantees about it.
So even if it was perfect elsewhere, the interface between platform and application needs runtime rc to get inplace mutation. If the platform is passing the application lots of data, this could kill the performance of the application.

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 15:09):

The first one is, what's the typical performance cost associated with RC in ROC?

I don't think this has ever been tested. Also, most current roc projects barely use reference counting. As in most of them pass around a bunch of single level tag unions and records, both which are value types and have no reference count. I think if we wanted a simple test for a pathological case, it would be some form of graph or tree with really small nodes that get passed around a lot (maybe even duplicated). I guess having many small lists or strings could also be a bad case.

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 15:15):

[U32RC U8App U24Pad, U32RC U8App U24Pad, ...]

I am pretty sure this kind of structure is impossible in Roc. Either the U8 would be a value type in a reference counted list. Or because it is boxed or otherwise turned into a reference type, it would just be a pointer. And you would get [ { RC, U8}*, { RC, U8}*, { RC, U8}*, { RC, U8}*, ...]. Though the best for that use case would probably just be a list of atomic integers.

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 15:17):

that's obviously a short-circuit and

Sorry I missed that. Must have not been fully awake when I wrote that yesterday morning.

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 15:24):

@Folkert de Vries If we wanted to test programs with RC mostly off, what do you think the easiest way would be? Initialize all RCs to infinite and then delete all increment and decrement calls? Then we just have the cost of checking it before we copy data?

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 23:02):

So reference counting and deallocating is actually more expensive than I expected. I just ran a version of the false-interpretter without reference counting and deallocating memory (assuming I disabled it correctly).

At the small cost of 817x memory consumption, the application can run approximately 0.5x the time ( 2x faster ).

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 23:03):

Definitely will try to dig into the assembly, this feels like too large of a gap in performance, but maybe it is hitting a pathelogical case due to recursive unions in the false interpretter.

view this post on Zulip Brendan Hansknecht (Mar 31 2022 at 23:04):

This was done on an M1 mac by making false calculate the cksum of a random 1k file.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 01:31):

So I just did the same tests on my x86_64 machine and also added in an atomic reference counted version:

The memory story is for some reason worse. No reference count uses 1800x memory compared to the reference counted version. Maybe this is just a difference in how the memory is counted(full page vs used memory?).

Anyway, the results:
no refcount is 1.59x faster than the refcounted version
the refcounted version is 1.06x faster than the atomic refcount version.

This is actually way better results than I expected for atomic refcounts on x86_64.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 05:02):

Ok. I am running more tests with atomic references counts, and it is less pretty. I ran our benchmarks and nqueens with the cons list is definitely the worst case. The reference count version is 2.1x faster than the atomic reference counted version. Most other benchmarks are 1.5x slower. In general, recursive structures and small structures are where we are hit the most by using atomics. Things like false and quicksort are barely effect. For quicksort, both results are within 2% of each other.

To be fair, these are pretty small benchmark where all of their performance relies on single data structures that all use reference counting heavily. I would bet that in general larger apps would be less affected, but it does depend on how they were built.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 05:11):

What's really cool is that some of these benchmarks are slower without a reference count than even with an atomic reference count. The reason is that reference count enables in place mutation, but no reference count means you can never get inplace mutation.

Over all benchmarks and the false cksum, no reference count actually averages(geomean) 1% slower than normal reference counts with some benchmarks being significantly slower and some being significantly faster.

Atomic reference counts average(geomean) 37% slower than normal reference counts with a few examples being basically the same, but most being ~50% slower.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 05:14):

My guess would be that most large programs that don't heavily use recursive data structures will see ~5% slowdown with atomic reference counts and programs that significantly use recursive data structures ~20-30% slow down. Though smaller programs that essentially just interact with a recursive data structure could see up to ~100% slow down with atomic reference counts.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 05:53):

And just to be more complete with the data. On an M1 mac with the faster atomic operations, atomic reference counts are only 7% slower on average (geomean) than normal reference counts. With the largest difference being 12%. Truly a totally different picture than my x86 machine.

view this post on Zulip Giovanni Zanandrea (Apr 01 2022 at 09:25):

Brendan Hansknecht said:

So reference counting and deallocating is actually more expensive than I expected. I just ran a version of the false-interpreter without reference counting and deallocating memory (assuming I disabled it correctly).

At the small cost of 817x memory consumption, the application can run approximately 0.5x the time ( 2x faster ).

Definitely will try to dig into the assembly, this feels like too large of a gap in performance, but maybe it is hitting a pathelogical case due to recursive unions in the false interpretter.

If those numbers are confirmed they would actually match my experience with my own language. When running the compiler (which is self-hosted) I too saw a large performance gap between the RC version and the no-GC one. So much so that I gave up on reference counting altogether and switched to other memory management techniques (although of course a compiler, with all those highly recursive data structures, is probably a worst-case scenario). That's why I was curious about RC's impact on performance in ROC's case.

view this post on Zulip Giovanni Zanandrea (Apr 01 2022 at 09:27):

Brendan Hansknecht said:

What's really cool is that some of these benchmarks are slower without a reference count than even with an atomic reference count. The reason is that reference count enables in place mutation, but no reference count means you can never get inplace mutation.

Over all benchmarks and the false cksum, no reference count actually averages(geomean) 1% slower than normal reference counts with some benchmarks being significantly slower and some being significantly faster.

Did you entirely disable opportunistic mutation, or did you still enable it when it was possible to statically determine that objects where safe to update? As I see it, there are four possibilities:

    1) Enable RC and use it for both GC and opportunistic mutation
    2) Enable RC only for GC and rely exclusively on static analisys for opportunistic mutation
    3) Disable RC and GC entirely, and rely on static analysis for opportunistic mutation
    4) Disable RC, GC and opportunistic mutation

To figure out the real hidden cost of RC I think one should compare 2) and 3), not 1) and 4). Although what matters in terms of deciding what GC strategy to use is 1) vs 3) of course.

view this post on Zulip Giovanni Zanandrea (Apr 01 2022 at 09:29):

Brendan Hansknecht said:

Definitely will try to dig into the assembly, ...

If I wanted to look at the code generated by the compiler, would checking the generated LLVM code be the only option? There's no higher-level, intermediate format that one could peruse instead, I guess?

view this post on Zulip Giovanni Zanandrea (Apr 01 2022 at 09:30):

Brendan Hansknecht said:

I think the other big thing with static analysis is that it will never work for data from the platform. We just don't have any guarantees about it.
So even if it was perfect elsewhere, the interface between platform and application needs runtime rc to get inplace mutation. If the platform is passing the application lots of data, this could kill the performance of the application.

Wouldn't it be possible to stipulate some contract regarding the ownership of the data in at least some specific cases, like the CLI or the Web platforms?

view this post on Zulip Giovanni Zanandrea (Apr 01 2022 at 09:31):

Kevin Gillette said:

For 2, in theory, at least in a mutating, pointerized language with an RC implementation, if a programmer needs to create a list of n U8 "application level counters" to share 1 each to n threads (with some orchestrator thread seeing the whole list), then the compiler might take the Roc approach of placing the RC in memory immediately preceding the value it counts. If the language uses U32 for RC, then in this case the list might be allocated as:

[U32RC U8App U24Pad, U32RC U8App U24Pad, ...]

...

Brendan Hansknecht said:

[U32RC U8App U24Pad, U32RC U8App U24Pad, ...]

I am pretty sure this kind of structure is impossible in Roc. Either the U8 would be a value type in a reference counted list. Or because it is boxed or otherwise turned into a reference type, it would just be a pointer. And you would get [ { RC, U8}*, { RC, U8}*, { RC, U8}*, { RC, U8}*, ...]. Though the best for that use case would probably just be a list of atomic integers.

What happens when one creates a list of records, but then needs to pass around those records individually? Are they always passed around by value, even when they are large?

view this post on Zulip Kevin Gillette (Apr 01 2022 at 13:47):

The scenario I had in mind with the list of values, admittedly, assumed but did not state that the reference to the list would provably (via static analysis) outlast all the element references, since some orchestrator would review the list later after all the subtasks are done. In such a specific case, the relationship between elements and the list wouldn't need to be tracked at runtime, but it's a pretty niche case.

I believe Brendan called out that in general, passing references to elements would need to say something about the lifetime of the list as well. Perhaps in some special cases, such as when the list itself is never manipulated after initialization (never resized), perhaps the list can be forgotten, and the n elements can be treated as if they're n separate allocations; that assumes a lot about the allocator though.

Given that Roc doesn't have a references concept, since the distinction between references and "values" only applies, iiuc, in a language with semantic mutation, this only comes up when Roc is using mutation as a hidden optimization. iow, since Roc doesn't need to deal with programmer requests for especially awkward mutation scenarios, the compiler can avoid these tricky tradeoffs entirely merely by choosing not to fall into traps of its own making.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 15:19):

too saw a large performance gap between the RC version and the no-GC one. So much so that I gave up on reference counting altogether and switched to other memory management techniques

Just to note, as I said later in my chain of post, RC was hit or miss on speed differences. In some cases it was significantly faster and in other it was significantly slower, but on average was basically the same speed.

I think that false is a pathological case for the roc compiler as a whole. It is super recursive (not exactly tail recursive) with a metric ton of nested lambdas that capture various amounts of data. If it were to be rewritten with some of the changes to roc since then, it could be way faster in general. The biggest problem is that it is one giant chain of tasks. It should be much faster and more understandable to the Roc compiler as a model update structure where the host loops over many smaller tasks. Also, it still has compilation bugs despite generating "working" code. Those could be related to its high performance rc cost.

To figure out the real hidden cost of RC I think one should compare 2) and 3), not 1) and 4). Although what matters in terms of deciding what GC strategy to use is 1) vs 3) of course.

I was comparing 1) and 3). I never changed anything about static analysis.

If I wanted to look at the code generated by the compiler, would checking the generated LLVM code be the only option? There's no higher-level, intermediate format that one could peruse instead, I guess?

We have a high level IR, it would be slightly better than the raw code given it adds in all compiler functions and reference counting. I don't think we have a flag to print it out. Would need to modify the source to get it. That being said, even looking at the high level IR is not very useful with Roc. So much gets inlined and optimized away that the final code is far removed from even the unoptimized LLVM IR.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 15:24):

Wouldn't it be possible to stipulate some contract regarding the ownership of the data in at least some specific cases, like the CLI or the Web platforms?

We could probably force the platform to annotate it and then just trust that it is never wrong.

What happens when one creates a list of records, but then needs to pass around those records individually? Are they always passed around by value, even when they are large?

Yep. Definitely need to be copied out of the list (it could get de-allocated later), though for function calling, they can be passed by const reference if large. Of course, a user can just box the struct to avoid this.

view this post on Zulip Brendan Hansknecht (Apr 01 2022 at 15:26):

The scenario I had in mind with the list of values, admittedly, assumed but did not state that the reference to the list would provably (via static analysis) outlast all the element references

I don't quite follow: if it was a list of values, why would each value have a reference count?

view this post on Zulip Folkert de Vries (Apr 01 2022 at 15:54):

We can still improve RC further too. it's doing too much work when RC'd things are contained in records/tags. False in particular would benefit from those improvements

view this post on Zulip Kevin Gillette (Apr 02 2022 at 03:20):

Brendan Hansknecht said:

I don't quite follow: if it was a list of values, why would each value have a reference count?

I was speaking to reference counting in general, not necessarily the kind of reference counted algorithms that would come up or make sense in Roc. The scenario I had in mind is plausible in imperative concurrent programming, but would be difficult to express in Roc (it'd be solved in other ways).

view this post on Zulip Kevin Gillette (Apr 02 2022 at 03:27):

Brendan Hansknecht said:

Wouldn't it be possible to stipulate some contract regarding the ownership of the data in at least some specific cases, like the CLI or the Web platforms?

We could probably force the platform to annotate it and then just trust that it is never wrong.

I'll suggest this is pretty reasonable. Roc already relies on the platform doing the right thing, since the platform:

  1. Provides memory management, and we hope it does not allocate memory already in use or free the memory before Roc asks it to.
  2. Can arbitrarily read or mutate Roc's memory, and we hope it does not.
  3. Can make roc_panic no-op, and thus in theory continue on with undefined behavior.

I don't see why it'd be problematic for the platform to claim that the memory is platform owned and will exceed the lifetime of the Roc program, or to claim that Roc owns the memory now (and thus roc_free may be called on it), or that the memory belongs to a mmap'd file on disk, or is exclusively owned and referenced by Roc but must not be written to for whatever reason, etc.

view this post on Zulip Kevin Gillette (Apr 02 2022 at 03:42):

Brendan Hansknecht said:

What's really cool is that some of these benchmarks are slower without a reference count than even with an atomic reference count. The reason is that reference count enables in place mutation, but no reference count means you can never get inplace mutation.

Reading this surprised me, though not necessarily in a bad way. After watching Richard Feldman's talks, I [mis]inferred that opportunistic mutation was generally used when the compiler could statically prove exclusive ownership of a value, or when it could reasonably determine that mutation was beneficial enough to warrant copying the data.

What you mentioned suggests that, in general, opportunistic mutation is a runtime decision based on reference count, leading to branching code paths: the compiler might generate into the same binary both an in-place mutation version of a function, as well as a non-mutating or copy-mutating version; for some tree or graph algorithms, I imagine that mutating and non-mutating assembly code might look quite different.
This would mean approximately double the generated code for many functions leading to possibly more instruction cache churn, as well as the [possibly atomic] refcount check and the jmp.

I had figured those downsides were probably not worth runtime mutation/non-mutation branching costs, and that statically identifiable cases, where refcounting is unneeded and mutation is assured, would be common. It sounds like I may have been wrong on both counts.

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 03:48):

For roc, I think it is just for the case where either we need to copy the data and then edit it or we just can edit it. Though I guess roc could also edit while copying, but I don't think that happens in practice. So I think it doesn't bloat up the instruction cache much.

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 03:48):

Would if we added more edit while copying type operations. Not sure how those would perform in practice.

view this post on Zulip Richard Feldman (Apr 02 2022 at 03:51):

Kevin Gillette said:

What you mentioned suggests that, in general, opportunistic mutation is a runtime decision based on reference count, leading to branching code paths: the compiler might generate into the same binary both an in-place mutation version of a function, as well as a non-mutating or copy-mutating version; for some tree or graph algorithms, I imagine that mutating and non-mutating assembly code might look quite different.
This would mean approximately double the generated code for many functions leading to possibly more instruction cache churn, as well as the [possibly atomic] refcount check and the jmp.

I had figured those downsides were probably not worth runtime mutation/non-mutation branching costs, and that statically identifiable cases, where refcounting is unneeded and mutation is assured, would be common. It sounds like I may have been wrong on both counts.

so we do both - there are two situations where we do opportunistic mutation:

  1. We check the refcount at runtime and it's currently 1. (If it's not, we need to clone first and then mutate the clone.)
  2. We determine at compile time that the refcount would definitely be 1, so there's no need to check. We just mutate in-place. This is what the Morphic solver does.

there are some circumstances where it's impossible to detect statically because the answer changes depending on the running state of the program; sometimes within the same code path the refcount will be 1, and other times it won't be.

this implies that some programs will benefit more or less than others from the runtime refcount check. It really depends on what the application is doing!

view this post on Zulip Richard Feldman (Apr 02 2022 at 03:53):

of note, if the runtime refcount check determines that we can't mutate in place, we have to deep clone the entire data structure. So it is absolutely worth checking every time! We can potentially avoid doing multiple heap allocations and/or memcpys for the cost of a single conditional, if it says the refcount was indeed 1 :big_smile:

view this post on Zulip Richard Feldman (Apr 02 2022 at 03:54):

one of the things we can't know until larger Roc programs exist is what the tradeoffs will end up being around this runtime refcount optimization

view this post on Zulip Richard Feldman (Apr 02 2022 at 03:55):

e.g. maybe it works all the time and is super great. Maybe it depends on what coding style you use, and that influences what comes to be considered best practice coding styles in Roc. Since no other production-targeted language has ever done this approach before, it's impossible to predict how it will go! We're the first to do the experiment, so we'll be the first to learn about all this. :smiley:

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 04:28):

So turns out I have to update my numbers. Even though we switched to using zig for refcounting, 2 core llvm functions hadn't been converted to the new refcounts. This lead to 2 main issues:

  1. Most refcount increments where emitted no matter what and as non-atomic increments
  2. No rc got to do implace mutation in a few locations where it shouldn't have been able to do so

So I am just gonna straight to the summary. Looking at our benchmarks geomeans:

The overall picture for no reference count vs reference count is about the same. Some apps are significantly faster, some are significantly slower. With something like quicksort that has good static analysis and lots of data movement work doing basically the same.

Atomic reference counts took a much bigger hit from this. This is probably mostly due to all increments now actually being atomic. That being said, we also fixed a refcounting overflow related bug that leads to an extra branch for atomics, but not for regular reference counts. That could be a large part of the cost as well.

So besides atomic reference counts being a lot worse on x86 with this update, I feel the picture is about the same here.

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 04:31):

Also, I left the false interpretter out of those number because in simple terms it is very broken. Something must be going wrong with compilation and the refcount generation. My best guess is the issue is due to deeply nested task closures that are updating tons of refcounts that they shouldn't be touching.

~97% of the executed instructions in the false interpreter are due to refcounting. With refcounting off, it runs about 40x faster, has an about 5000x peak memory footprint, and executes 43x less instructions while generating the exact same result.

view this post on Zulip Kevin Gillette (Apr 02 2022 at 15:25):

40x faster while using 5000x peak memory does sound like something's broken.

view this post on Zulip Kevin Gillette (Apr 02 2022 at 15:49):

Regarding atomic vs non-atomic refcounts: is all code that needs refcounts generating a high-bit check branch with a conditional atomic/non-atomic read/increment? If so, by the same means in which the compiler can determine that some allocations can provably omit refcounting, can we also provably make some refcounts strictly non-atomic (such as when we need to heap allocate, but that allocation passes strictly deeper into the stack and never returns to the caller, thus cannot possibly be shared by other threads)?

Unless the Roc compiler itself opportunistically introduces parallelization, then given the tedium of creating additional threads and "sharing" data between them, I suspect non-atomic refcounts will be much more common than may-be-atomic and must-be-atomic refcounts. It sounds like the benchmark variants mentioned are the result of switching the whole compilation to atomic refcounts vs non-atomic refcounts vs non-refcounting, though I would hope that, eventually, the answer would be contextually determined (even within a single program).

iiuc, the only source of or path to parallelization is the platform itself, and putting aside some possible future in which Roc calls thread-management hooks in the platform, we could just have the platform itself set that "use atomics" high bit on refcounts for any data it receives and may send to other threads, or data the platform generates and passes to Roc. For data in the readonly segment, I'm guessing the refcount would be all one bits to signal that no increments/decrements should occur?

Perhaps a compile-time constant, provided by the host, can signal to Roc whether any values can ever be accessed simultaneously from multiple threads, so that a non-threaded platform (or one in which freeable data will never be shared) can signal that Roc never needs to generate high-bit checks nor branching to atomic logic. Alternatively perhaps the platform can optionally provide functions which take a refcount and return whether it's read-only (don't refcount) or atomic; if such functions aren't provided, then Roc can supply its own version that always returns false, so that llvm will inline and branch eliminate if false { ... } atomic refcounting code and such.

view this post on Zulip Kevin Gillette (Apr 02 2022 at 15:56):

Richard Feldman said:

of note, if the runtime refcount check determines that we can't mutate in place, we have to deep clone the entire data structure. So it is absolutely worth checking every time! We can potentially avoid doing multiple heap allocations and/or memcpys for the cost of a single conditional, if it says the refcount was indeed 1 :big_smile:

It sounds like the strategy is to always mutate, rather than to generate alternate mutating and true-immutable algorithms, and that there's merely a check whether the data can already be mutated in place, or if it needs to be copied and references restitched to point to the copy, at which point the code will proceed as before. If so, the mutate-in-place and copy-and-mutate variants sound like they'd share almost all the same code (with just a jmp-to-mutate shortcut in the case of being able to mutate without copy), and so I'd be less worried about binary bloat compared to having completely different versions of every non-trivial algorithm.

view this post on Zulip Kevin Gillette (Apr 02 2022 at 15:58):

Does the Roc compiler have enough visibility into data flow, and is it feasible enough to do such analysis, to know when only a partial copy is actually needed? For example, if given a function that takes a tree structure and only updates the value of the root node, while never modifying any child nodes, would Roc foreseeably be able to just copy the root node itself while merely bumping the ref-counts of the child nodes, or is that too difficult to analyze reliably?

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 15:59):

is all code that needs refcounts generating a high-bit check branch with a conditional atomic/non-atomic read/increment?

Currently no. It is either doing atomic refcounts everywhere or regular refcounts everywhere. That being said, we do save the high bit such that in the future it could be used to convert and atomic refcount to a regular refcount.

If so, by the same means in which the compiler can determine that some allocations can provably omit refcounting, can we also provably make some refcounts strictly non-atomic (such as when we need to heap allocate, but that allocation passes strictly deeper into the stack and never returns to the caller, thus cannot possibly be shared by other threads)?

Yeah, this should be doable. Though it assumes the functions we pass to are either always passed thread local data or have 2 version, 1 for maybe thread local and 1 for definitely thread local. Or I guess are always inlined.

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 16:09):

suspect non-atomic refcounts will be much more common than may-be-atomic and must-be-atomic refcounts.

Hopefully, but depends heavily on the platform and what guarantees it can give us. What is the state of data passed in? Can x be run in parallel by the platform (not by a function called inside the roc app)? What happens when we interact with tasks/pass data to the platform?

For data in the readonly segment, I'm guessing the refcount would be all one bits to signal that no increments/decrements should occur?

Essentially, but we store refcounts weird so it would probably be all zero bits. Or a 0 followed by all 1 bits.

Alternatively perhaps the platform can optionally provide functions which take a refcount and return whether it's read-only (don't refcount) or atomic

As long as those functions are special and written in roc so that we can inline them and avoid the cost of a function call for every refcount check, that could be doable.

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 16:10):

would Roc foreseeably be able to just copy the root node itself while merely bumping the ref-counts of the child nodes, or is that too difficult to analyze reliably?

Should definitely be doable. I don't think it is done currently.

view this post on Zulip Kevin Gillette (Apr 02 2022 at 16:12):

I'm presuming inlined cases. I don't know where the performance threshold lies between saving a jmp by generating two assembly compilations of the same function that are each 20 instructions long (but in theory, though maybe or maybe not in practice having worse icache utilization), or having a single version with the jmp that probably branch predicts well but also may not wholly fit in the icache due to being a few instructions too long

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 16:21):

40x faster while using 5000x peak memory does sound like something's broken.

It's important to note that either version is allocating a similar amount of memory here. One is just taking the time to refcount and free it while the other is just leaking it. So 5000x peak memory is a bit misleading in terms of work done. Just checked with valgrind on calculating the cksum of just "hi".

Refcount: ~25,000 allocs, ~25,000 frees, 38MB allocated
No refcount: ~50,000 allocs, 8 frees, 45MB allocated

view this post on Zulip Brendan Hansknecht (Apr 02 2022 at 16:23):

25,000 -> that just insane. to calculate the cksum of "hi" in this interpreter, we are refcounting 25,000 different memory locations.


Last updated: Jun 16 2026 at 16:19 UTC