A design for parallelism in Roc · ideas

Stream: ideas

Topic: A design for parallelism in Roc

Asier Elorz (he/him) (Dec 17 2023 at 21:27):

I have been working this weekend on what could be an idea for a design for parallelism in Roc. I have a compiler fork that implements a proof of concept (almost), and a text explaining what it is about.

I was going to just paste the text here but it is too long for Zulip apparently :sweat_smile: so I am linking it instead:

https://github.com/asielorz/roc/blob/main/presentation.md

TLDR

An idea for how Roc could offer parallelism in the language and standard library without having to involve tasks and in a platform agnostic manner, and links to a fork of the compiler implementing the idea.

Brendan Hansknecht (Dec 17 2023 at 21:58):

Intriguing

Brendan Hansknecht (Dec 17 2023 at 21:58):

I like the area of exploration for sure

Brendan Hansknecht (Dec 17 2023 at 21:59):

Definitely something I have mostly written off as best left to the platform fully, but cool to explore it as a language builtin.

Brendan Hansknecht (Dec 17 2023 at 22:01):

Also, probably par would fit well with record builder syntax

Brendan Hansknecht (Dec 17 2023 at 22:01):

Instead of a totally custom systacx, but would need to think through that more

Luke Boswell (Dec 17 2023 at 22:09):

Nice work! This isn't something I can provide useful commentary on sorry, but I really enjoy reading these discussions :+1: . I love it when we challenge and share ideas like this to make things better.

Brendan Hansknecht (Dec 17 2023 at 22:30):

As a side note, I think the complexities will come around reductions and expanding the functionality past just parallel mapping.

Brendan Hansknecht (Dec 17 2023 at 22:32):

Cause in reductions, order often matters even though it shouldn't, so something like this could lead to different results that the sequential code in some cases (mostly due to floats, but there may be other cases). So it may not just be safe and guaranteed the same results, which does lead to some complexities and user concerns.

Brendan Hansknecht (Dec 17 2023 at 22:33):

This of course would happen with a true threading API that uses tasks, but it feels more ok for it to happen in that case then it does for it to happen with a language builtin.

Brendan Hansknecht (Dec 17 2023 at 22:39):

Another part of the complexity to really add something like this is that it has no guarantees or introspection about what it is running on. (Multiple os threads, multi green threads, an async executor, just sequential). This is often very important for performance. For example, on an single threaded async executor this will strictly be a slow down for work that isn't waiting on a task anyway. For os threads, you might want a large chunk size. For green threads, is it a thread per task or thread pools?

Note, a platform threading implementation will hit the exact same issues. The big difference is that when building for a specific platform, you will hopefully think about the specific platform implementation and tailor to that.

Brendan Hansknecht (Dec 17 2023 at 22:40):

Lastly, I am not sure enabling something like a par keyword in arbitrary libraries is a good idea. As an end user of a library, I have less idea of its performance characteristics. I also might be running on a platform that runs one roc process per thread. So all of these extra thread spawning could be really problematic cause there are no CPU cores to run all the extra threads on.

Brendan Hansknecht (Dec 17 2023 at 22:41):

None of this to say we shouldn't add a feature like this to roc, just kinda brainstorming my top potential concerns that would affect the design.

Kevin Gillette (Dec 18 2023 at 02:57):

A thought in terms of a par keyword: could a platform provide a function that takes a similar signature to roc_parallel_context_register_task, and returns a handle (just an integer):

if the platform returns a non-zero handle, the task is being run concurrently and the handle can be used to block for or check completion.
if the platform returns a zero-valued handle, then the platform refused to run the task concurrently, at which point the currently running roc code will just run that function directly.

That has the benefit of centralizing the orchestration of running tasks in Roc rather than in the platform.

It also allows for bounded parallelism. For example, imagine a platform chooses to allocate 16 threads for use by Roc (in this case assume Roc is the only meaningful code doing application work). Perhaps it's a web server.

Each time a connection is accepted, the platform decrements a counter (initialized to 16), and allocates a thread.
If any of those connection handlers uses par, the platform checks if the counter is positive, and if so, decrements the counter and allocates another thread.
If the counter was exhausted at the time par was called, the platform returns a zero handle, and the Roc code just invokes that task sequentially.
After each task that was made to run sequentially, the Roc code will still request the next one to be run concurrently (maybe a thread has become available).

Of course a platform with no parallelization support will just implement the function as always returning a zero handle.

Kevin Gillette (Dec 18 2023 at 03:20):

Ideally the Roc compiler would estimate the overall cost to run each task (if a task also internally uses par, it'd be an estimate to run the task sequentially, i.e. estimate of total work, _not_ estimate of wall-clock-duration). This could be just a big-O determination (i.e. linear time is more expensive than constant time), but could evolve to be more sophisticated.

It would then request parallel evaluation of the most expensive-seeming expressions first, and fall back to sequentially ruining the least expensive-seeming functions first.

For example, given:

(a, b, c) = par (midExpensive, leastExpensive, mostExpensive)

The Roc compiler would rearrange the order:

It'd first ask the platform to run mostExpensive.
If the platform returned a non-zero handle, it'd then ask the platform to run midExpensive.
It'd then directly run leastExpensive (without platform assistance).

Conversely, consider a resource-starved platform at the time this code is run:

The "parent" Roc code would first ask the platform to run mostExpensive, and the platform would return zero (not presently available to run a parallel function).
In response, the parent Roc code would then directly evaluate leastExpensive (without platform assistance).
Upon synchronous completion of leastExpensive, the parent code would again ask the platform to run mostExpensive, which, in this scenario would return zero again.
In response, the parent Roc code would synchronously run midExpensive and mostExpensive in either order.

If at step 3, the platform had provided a non-zero handle for the parallel run of mostExpensive, then mostExpensive would have been running on a separate thread (or whatever) at the same time that the parent code was synchronously running midExpensive.

Kevin Gillette (Dec 18 2023 at 03:25):

The benefit would be more pronounced when the parent code has more tasks to run in parallel, but overall this strategy gives the a good opportunity for the shortest wall-clock runtime while still being able to make progress when compute resources are contended.

The reason for running the cheapest task synchronously when a parallel run is [perhaps temporarily] unavailable, is that minimizes the time waited before attempting parallelization again, while spending that time doing actual work, rather than blocking/spin-waiting just to have another thread do real work (we already have a thread, so we might as well use it).

Kevin Gillette (Dec 18 2023 at 03:28):

I'm not as convinced about the par-via-tuples approach, either syntactically or in that it is limited to a constant number of parallel tasks (a list would be more flexible even if no other design change were made). The above notes would be applicable regardless of syntax.

Brendan Hansknecht (Dec 18 2023 at 04:12):

I think all of these complexities are what Richard was talking about when he commented:

how this feature would make the Roc optimizer a very unpredictable and unstable target, and writing efficient code for it would be a matter of tweaking the implementation to trigger the optimizer's internal heuristics, and that would be a bad position to be in because these heuristics could change from a version of the compiler to the next, effectively invalidating all this effort.

Brendan Hansknecht (Dec 18 2023 at 04:16):

Though the above could be really cool when it works right, it is a ton of complexity and complications that could make it exceptionally hard to understand performance of a roc app. Also, likely would be very hard to accurately estimate the cost of a function call.

Anton (Dec 18 2023 at 10:23):

Impressive work @Asier Elorz (he/him)!

Anton (Dec 18 2023 at 10:24):

This proposal made me think of parallel collections in Scala which were very convenient to use.

Anton (Dec 18 2023 at 10:25):

Being able to avoid Task is a nice feature!

Anton (Dec 18 2023 at 10:37):

I would prefer to avoid taking on the complexity that comes with parallelization until we've improved the reliability of Roc. In general, I believe we should hold off on adding features that increase complexity until we can go through something like advent of code with only a few bugs reported. We want people to use the language, which means it must be reliable.

Asier Elorz (he/him) (Dec 18 2023 at 17:13):

I would prefer to avoid taking on the complexity that comes with parallelization until we've improved the reliability of Roc.

This makes a lot of sense and is a very good idea.

Richard Feldman (Dec 19 2023 at 13:56):

ok, I finally got around to digging into this :sweat_smile:

Richard Feldman (Dec 19 2023 at 13:56):

this is really cool @Asier Elorz (he/him)! :smiley:

Richard Feldman (Dec 19 2023 at 13:57):

I think the discussion here raises some good points around tradeoffs and timing, but it's awesome to be able to discuss these in the context of a concrete idea

Richard Feldman (Dec 19 2023 at 13:58):

and of course it's always risky to make an implementation before talking through the design, but it seems like you had the right attitude about it - e.g.

The main point of this exercise is not to adopt my changes into Roc. I am not making a merge request with these. I am trying to open a conversation about the topic and suggest a direction. [...] The design could also be completely disregarded and that is fine too. It was a fun weekend project.

I really appreciate that you came at this with such a great mindset! :smiley:

Richard Feldman (Dec 19 2023 at 13:59):

separately, I am really impressed that you managed to get this working with no prior experience in the roc code base :big_smile:

Richard Feldman (Dec 19 2023 at 13:59):

As I said before, please do not merge my code. This is my first time working on the Roc codebase. I don't know what I am doing.

I wouldn't sell yourself short there - it seems like you figured it how to be productive in the code base impressively quickly! :rocket:

LoipesMas (Dec 20 2023 at 14:54):

I like the idea! Even though parallelism shouldn't be a priority right now, I think it's still valuable to talk about it sooner rather than later.
In my opinion, even if the builtin solution worked good for 80% of cases, decent for 10% and bad for 10%, it's still worth it. Performance always should be measured, so I'd expect the developer to measure the sequential solution, then the easy builtin one (s/List.map/List.parallelMap/) and if it's not good enough look for other solutions. If Roc/platform is clear about the implementation and trade-offs, then that shouldn't be an issue.
If the implementation/performance would change with time, that's an issue, but it always is, right?

Regarding ordering: that's problematic. Paraphrasing a wise quote: "order always matters on real hardware". Floating point precision, overflowing, crashing, crashing, dbg, etc. I don't think there is a way around that, but, again, if we're clear about it, it would be worth it.

Last updated: Jul 23 2026 at 13:15 UTC