here's a proposal for how SIMD could work in Roc - https://docs.google.com/document/d/1R3ESkzSyOAA8DnasxF1WHDeoe9LgFE1uvmIr8s-aNro/edit?usp=sharing
any feedback welcome!
Very cool!
Do an extra SIMD operation on the leftovers, but first fill the unused lanes with a constant instead of letting them be garbage memory.
Instead of a constant you could repeat some of the real data. So if you have 3 leftover items at the end then you could repeat the last one to get 4. Then it won't throw a divide-by-zero unless it would have done so anyway.
interesting! Algorithms where each chunk references the others could still cause problems there, but it seems like that would be a super rare use case
hmm will this work when e.g. counting zeroes/ones?
I don't think any of the "fill in leftovers automatically" designs can work well with that :big_smile:
Screen-Shot-2022-04-23-at-5.00.38-PM.png
great suggestion, autocorrect! :laughing:
User-provided fill value may be necessary. The value may need to be different depending on the operation. Adds a bit of complexity for the initial implementer but most people would probably just use functions from a library of functions that hide this detail. One downside is it may require duplicating the array if the array has more than one reference and the specified fill value doesn't match what's already there?
Might want a version where user can declare it doesn't matter what the fill value is, the operation doesn't care
Don't you just not use simd on the last iteration of the loop if it isn't a multiple of the lanes?
Awesome proposal! The addition of tuples, should this proposal pass, sounds great! On the handling of overflow, giving the user choice between garbage fill and defined fill seems preferable. just wondering, the last bullet on #3 under proposals shows Simd having kind * -> * but doesn't it have kind * -> * -> * for defining the number of lanes? Not being too knowledgeable about simd beyond basics, is it not possible to supply a function a -> a and automatically fit that onto all lanes so that it can be used for both the simd operations and the leftovers?
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/coding-for-neon---part-2-dealing-with-leftovers mentions using a regular loop on leftovers produces more code with less performance. Different architecture but may still apply?
The overlapping idea from that link is a good one!
Algorithms where each chunk references the others could still cause problems there
Does that still count as "single instruction multiple data"? (Sounds to me like it shouldn't but maybe CPU makers see it differently?)
maybe not technically
but you can count the number of bits in those wide registers, for instance
so it fits in the "works on 256/512-bit registers" category, which we typically call simd I guess but yeah that is not entirely accurate
shows Simd having kind
* -> *but doesn't it have kind* -> * -> *for defining the number of lanes?
yep, my mistake!
is it not possible to supply a function a -> a and automatically fit that onto all lanes so that it can be used for both the simd operations and the leftovers?
this would be equivalent to having map use SIMD under the hood; the problem is that SIMD instructions only sometimes have equivalents in non-SIMD instructions, and vice versa
so sometimes they can be automated - e.g. we can offer a List.mul that silently uses SIMD under the hood, and handles leftovers in this way, and sometimes LLVM can automatically SIMDify certain loops (this is known as "autovectorization") - but it can only do it for pretty basic examples.
The point of having a function like this would be to give you the power to do operations that are complex enough (could be as simple as doing one instruction or another based on a conditional) that autovectorization wouldn't be able to figure out how to correctly translate them from non-SIMD to SIMD instructions.
Brian Carroll said:
Algorithms where each chunk references the others could still cause problems there
Does that still count as "single instruction multiple data"? (Sounds to me like it shouldn't but maybe CPU makers see it differently?)
yeah they have things like swizzle which reorders the lanes in a vector, and also things like masks, "give me the min/max of all the numbers in the lanes" etc.
I don't think any of those operations would make sense to use in a map (as opposed to a walk that accumulates state) but there's nothing stopping someone from using those operations if they decide to :big_smile:
Brian Carroll said:
The overlapping idea from that link is a good one!
I agree! Great link @Jared Cone - I think the overlapping strategy is the most promising one so far
Is the introducing tuples necessary for this? Doesn't using T2, T3, T4 tag unions work equally well? For example, instead of
matrix = Matrix4x4.new
((5, 4, 7, 8),
(9, 0, 1, 2),
(3, 4, 5, 6),
(7, 8, 9, 0))
you'd write
matrix = Matrix4x4.new
(T4 5 4 7 8)
(T4 9 0 1 2)
(T4 3 4 5 6)
(T4 7 8 9 0)
how do you get the 3rd column from the 2nd row of that though?
if you can call (toTuple matrix).1.2 it's very straightforward!
Wouldn't matrix4x4 define getter functions? Matrix4x4.row1column2 matrix or something short like Matrix4x4.v12 matrix, though I kind of like the first one because I always get row and column mixed up when working with matrices.
To me tuples seem like a natural addition to the language, and I like how you define them on top of records which will definitely reduce implementation complexity.
Wouldn't matrix4x4 define getter functions?
it could, but then you also need a separate set of them for Matrix4x4 and Matrix3x3 and Matrix4x3 and Matrix3x4, all of which come up in games - so a total of 49 getter functions
also 7 more for vectors of length 3 and 4, so 56 total
edit: also you'd want setter functions for all those positions too, so it's either tuples or +112 functions for known use cases games have
that's totally a plausible API, but I was on the fence about tuples before, and this pushes me over to "yeah they're worth it"
Forcing the API author to write getters/setters for the matrices nudges them towards giving useful names that say whether the matrix is row major or column major. (toTuple matrix).1.2 doesn't tell the end user if row or column is accessed first. To me the time saved by one person not needing to write the getters and setters will be lost later by people needing to slow down and carefully think through this (or not realizing it and potentially introducing a bug)
Edit: To clarify why I think row/column are easily mixed up. I know that matrices tend to be row then column but anyone new to matrices might think xy coordinates in which case column would come first.
API authors could still write those getters and setters even if underlying is a tuple. Though I think adding a language feature should always be approached with caution. What would the ergonomics be with using tags instead of tuples? Is it inconvenient to access those elements?
Tags are quite inconvenient for accessing a specific element in general. You have to pattern matching the tag to get the element. Just ends up being quite verbose
Something like this:
(Matrix4x4 _ (T4 _ _ elem _) _ _) = m
is equivalent to:
elem = m.1.2
Martin Stewart said:
Is the introducing tuples necessary for this? Doesn't using T2, T3, T4 tag unions work equally well? For example, instead of
matrix = Matrix4x4.new (5, 4, 7, 8) (9, 0, 1, 2) (3, 4, 5, 6) (7, 8, 9, 0)you'd write
matrix = Matrix4x4.new (T4 5 4 7 8) (T4 9 0 1 2) (T4 3 4 5 6) (T4 7 8 9 0)
I certainly don't think tuples are necessary for this, but I also certainly think they're nicer; I definitely prefer the version without the T tags! :big_smile:
to be fair, if not for the fact that roc format would jumble it up, you could also theoretically have new take 16 arguments:
matrix = Matrix4x4.new
5 4 7 8
9 0 1 2
3 4 5 6
7 8 9 0
I also think it's reasonable to want a Matrix API that exposes 100+ functions for getters and setters, to make it harder to mix up row and column, but I also think it's reasonable to prefer polymorphic getters and setters and module documentation that doesn't primarily consist of 100+ getter and setter functions.
as an aside, I've run into wanting tuples in both application design and API design often enough that I'm convinced they will end up in the language eventually
if it's not going to be because of matrices and vectors, I expect they'd end up being worth it for other reasons anyway
For leftover handling, what about each Simd function accepting a tag, i.e.
List.mapSimd List a, Simd.Policy (Simd a lanes -> Simd b lanes) -> List b
Where Simd.Policy is:
[
Panic,
CopyAny # likely equivalent to CopyLast
CopyLast # repeat last element as needed
CopyFirst
FillConst a
FillFunc (Nat -> a) # based on destination index (past end of list)
]
Last updated: Jun 16 2026 at 16:19 UTC