I have been working this weekend on what could be an idea for a design for parallelism in Roc. I have a compiler fork that implements a proof of concept (almost), and a text explaining what it is about.
I was going to just paste the text here but it is too long for Zulip apparently :sweat_smile: so I am linking it instead:
https://github.com/asielorz/roc/blob/main/presentation.md
An idea for how Roc could offer parallelism in the language and standard library without having to involve tasks and in a platform agnostic manner, and links to a fork of the compiler implementing the idea.
Intriguing
I like the area of exploration for sure
Definitely something I have mostly written off as best left to the platform fully, but cool to explore it as a language builtin.
Also, probably par would fit well with record builder syntax
Instead of a totally custom systacx, but would need to think through that more
Nice work! This isn't something I can provide useful commentary on sorry, but I really enjoy reading these discussions :+1: . I love it when we challenge and share ideas like this to make things better.
As a side note, I think the complexities will come around reductions and expanding the functionality past just parallel mapping.
Cause in reductions, order often matters even though it shouldn't, so something like this could lead to different results that the sequential code in some cases (mostly due to floats, but there may be other cases). So it may not just be safe and guaranteed the same results, which does lead to some complexities and user concerns.
This of course would happen with a true threading API that uses tasks, but it feels more ok for it to happen in that case then it does for it to happen with a language builtin.
Another part of the complexity to really add something like this is that it has no guarantees or introspection about what it is running on. (Multiple os threads, multi green threads, an async executor, just sequential). This is often very important for performance. For example, on an single threaded async executor this will strictly be a slow down for work that isn't waiting on a task anyway. For os threads, you might want a large chunk size. For green threads, is it a thread per task or thread pools?
Note, a platform threading implementation will hit the exact same issues. The big difference is that when building for a specific platform, you will hopefully think about the specific platform implementation and tailor to that.
Lastly, I am not sure enabling something like a par keyword in arbitrary libraries is a good idea. As an end user of a library, I have less idea of its performance characteristics. I also might be running on a platform that runs one roc process per thread. So all of these extra thread spawning could be really problematic cause there are no CPU cores to run all the extra threads on.
None of this to say we shouldn't add a feature like this to roc, just kinda brainstorming my top potential concerns that would affect the design.
A thought in terms of a par keyword: could a platform provide a function that takes a similar signature to roc_parallel_context_register_task, and returns a handle (just an integer):
That has the benefit of centralizing the orchestration of running tasks in Roc rather than in the platform.
It also allows for bounded parallelism. For example, imagine a platform chooses to allocate 16 threads for use by Roc (in this case assume Roc is the only meaningful code doing application work). Perhaps it's a web server.
par, the platform checks if the counter is positive, and if so, decrements the counter and allocates another thread.par was called, the platform returns a zero handle, and the Roc code just invokes that task sequentially.Of course a platform with no parallelization support will just implement the function as always returning a zero handle.
Ideally the Roc compiler would estimate the overall cost to run each task (if a task also internally uses par, it'd be an estimate to run the task sequentially, i.e. estimate of total work, _not_ estimate of wall-clock-duration). This could be just a big-O determination (i.e. linear time is more expensive than constant time), but could evolve to be more sophisticated.
It would then request parallel evaluation of the most expensive-seeming expressions first, and fall back to sequentially ruining the least expensive-seeming functions first.
For example, given:
(a, b, c) = par (midExpensive, leastExpensive, mostExpensive)
The Roc compiler would rearrange the order:
mostExpensive.midExpensive.leastExpensive (without platform assistance).Conversely, consider a resource-starved platform at the time this code is run:
mostExpensive, and the platform would return zero (not presently available to run a parallel function).leastExpensive (without platform assistance).leastExpensive, the parent code would again ask the platform to run mostExpensive, which, in this scenario would return zero again.midExpensive and mostExpensive in either order.If at step 3, the platform had provided a non-zero handle for the parallel run of mostExpensive, then mostExpensive would have been running on a separate thread (or whatever) at the same time that the parent code was synchronously running midExpensive.
The benefit would be more pronounced when the parent code has more tasks to run in parallel, but overall this strategy gives the a good opportunity for the shortest wall-clock runtime while still being able to make progress when compute resources are contended.
The reason for running the cheapest task synchronously when a parallel run is [perhaps temporarily] unavailable, is that minimizes the time waited before attempting parallelization again, while spending that time doing actual work, rather than blocking/spin-waiting just to have another thread do real work (we already have a thread, so we might as well use it).
I'm not as convinced about the par-via-tuples approach, either syntactically or in that it is limited to a constant number of parallel tasks (a list would be more flexible even if no other design change were made). The above notes would be applicable regardless of syntax.
I think all of these complexities are what Richard was talking about when he commented:
how this feature would make the Roc optimizer a very unpredictable and unstable target, and writing efficient code for it would be a matter of tweaking the implementation to trigger the optimizer's internal heuristics, and that would be a bad position to be in because these heuristics could change from a version of the compiler to the next, effectively invalidating all this effort.
Though the above could be really cool when it works right, it is a ton of complexity and complications that could make it exceptionally hard to understand performance of a roc app. Also, likely would be very hard to accurately estimate the cost of a function call.
Impressive work @Asier Elorz (he/him)!
This proposal made me think of parallel collections in Scala which were very convenient to use.
Being able to avoid Task is a nice feature!
I would prefer to avoid taking on the complexity that comes with parallelization until we've improved the reliability of Roc. In general, I believe we should hold off on adding features that increase complexity until we can go through something like advent of code with only a few bugs reported. We want people to use the language, which means it must be reliable.
I would prefer to avoid taking on the complexity that comes with parallelization until we've improved the reliability of Roc.
This makes a lot of sense and is a very good idea.
ok, I finally got around to digging into this :sweat_smile:
this is really cool @Asier Elorz (he/him)! :smiley:
I think the discussion here raises some good points around tradeoffs and timing, but it's awesome to be able to discuss these in the context of a concrete idea
and of course it's always risky to make an implementation before talking through the design, but it seems like you had the right attitude about it - e.g.
The main point of this exercise is not to adopt my changes into Roc. I am not making a merge request with these. I am trying to open a conversation about the topic and suggest a direction. [...] The design could also be completely disregarded and that is fine too. It was a fun weekend project.
I really appreciate that you came at this with such a great mindset! :smiley:
separately, I am really impressed that you managed to get this working with no prior experience in the roc code base :big_smile:
As I said before, please do not merge my code. This is my first time working on the Roc codebase. I don't know what I am doing.
I wouldn't sell yourself short there - it seems like you figured it how to be productive in the code base impressively quickly! :rocket:
I like the idea! Even though parallelism shouldn't be a priority right now, I think it's still valuable to talk about it sooner rather than later.
In my opinion, even if the builtin solution worked good for 80% of cases, decent for 10% and bad for 10%, it's still worth it. Performance always should be measured, so I'd expect the developer to measure the sequential solution, then the easy builtin one (s/List.map/List.parallelMap/) and if it's not good enough look for other solutions. If Roc/platform is clear about the implementation and trade-offs, then that shouldn't be an issue.
If the implementation/performance would change with time, that's an issue, but it always is, right?
Regarding ordering: that's problematic. Paraphrasing a wise quote: "order always matters on real hardware". Floating point precision, overflowing, crashing, crashing, dbg, etc. I don't think there is a way around that, but, again, if we're clear about it, it would be worth it.
Last updated: Jun 16 2026 at 16:19 UTC