Stream: ideas

Topic: SIMD API


view this post on Zulip Richard Feldman (Apr 23 2022 at 16:22):

here's a proposal for how SIMD could work in Roc - https://docs.google.com/document/d/1R3ESkzSyOAA8DnasxF1WHDeoe9LgFE1uvmIr8s-aNro/edit?usp=sharing

any feedback welcome!

view this post on Zulip Brian Carroll (Apr 23 2022 at 20:11):

Very cool!

view this post on Zulip Brian Carroll (Apr 23 2022 at 20:12):

Do an extra SIMD operation on the leftovers, but first fill the unused lanes with a constant instead of letting them be garbage memory.

Instead of a constant you could repeat some of the real data. So if you have 3 leftover items at the end then you could repeat the last one to get 4. Then it won't throw a divide-by-zero unless it would have done so anyway.

view this post on Zulip Richard Feldman (Apr 23 2022 at 20:52):

interesting! Algorithms where each chunk references the others could still cause problems there, but it seems like that would be a super rare use case

view this post on Zulip Folkert de Vries (Apr 23 2022 at 20:55):

hmm will this work when e.g. counting zeroes/ones?

view this post on Zulip Richard Feldman (Apr 23 2022 at 20:58):

I don't think any of the "fill in leftovers automatically" designs can work well with that :big_smile:

view this post on Zulip Richard Feldman (Apr 23 2022 at 21:00):

Screen-Shot-2022-04-23-at-5.00.38-PM.png

great suggestion, autocorrect! :laughing:

view this post on Zulip Jared Cone (Apr 23 2022 at 22:33):

User-provided fill value may be necessary. The value may need to be different depending on the operation. Adds a bit of complexity for the initial implementer but most people would probably just use functions from a library of functions that hide this detail. One downside is it may require duplicating the array if the array has more than one reference and the specified fill value doesn't match what's already there?

view this post on Zulip Jared Cone (Apr 23 2022 at 22:39):

Might want a version where user can declare it doesn't matter what the fill value is, the operation doesn't care

view this post on Zulip Brendan Hansknecht (Apr 23 2022 at 23:13):

Don't you just not use simd on the last iteration of the loop if it isn't a multiple of the lanes?

view this post on Zulip Thomas Dwyer (Apr 23 2022 at 23:19):

Awesome proposal! The addition of tuples, should this proposal pass, sounds great! On the handling of overflow, giving the user choice between garbage fill and defined fill seems preferable. just wondering, the last bullet on #3 under proposals shows Simd having kind * -> * but doesn't it have kind * -> * -> * for defining the number of lanes? Not being too knowledgeable about simd beyond basics, is it not possible to supply a function a -> a and automatically fit that onto all lanes so that it can be used for both the simd operations and the leftovers?

view this post on Zulip Jared Cone (Apr 23 2022 at 23:57):

https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/coding-for-neon---part-2-dealing-with-leftovers mentions using a regular loop on leftovers produces more code with less performance. Different architecture but may still apply?

view this post on Zulip Brian Carroll (Apr 24 2022 at 00:21):

The overlapping idea from that link is a good one!

view this post on Zulip Brian Carroll (Apr 24 2022 at 00:26):

Algorithms where each chunk references the others could still cause problems there

Does that still count as "single instruction multiple data"? (Sounds to me like it shouldn't but maybe CPU makers see it differently?)

view this post on Zulip Folkert de Vries (Apr 24 2022 at 00:41):

maybe not technically

view this post on Zulip Folkert de Vries (Apr 24 2022 at 00:41):

but you can count the number of bits in those wide registers, for instance

view this post on Zulip Folkert de Vries (Apr 24 2022 at 00:42):

so it fits in the "works on 256/512-bit registers" category, which we typically call simd I guess but yeah that is not entirely accurate

view this post on Zulip Richard Feldman (Apr 24 2022 at 01:04):

shows Simd having kind * -> * but doesn't it have kind * -> * -> * for defining the number of lanes?

yep, my mistake!

view this post on Zulip Richard Feldman (Apr 24 2022 at 01:09):

is it not possible to supply a function a -> a and automatically fit that onto all lanes so that it can be used for both the simd operations and the leftovers?

this would be equivalent to having map use SIMD under the hood; the problem is that SIMD instructions only sometimes have equivalents in non-SIMD instructions, and vice versa

view this post on Zulip Richard Feldman (Apr 24 2022 at 01:12):

so sometimes they can be automated - e.g. we can offer a List.mul that silently uses SIMD under the hood, and handles leftovers in this way, and sometimes LLVM can automatically SIMDify certain loops (this is known as "autovectorization") - but it can only do it for pretty basic examples.

The point of having a function like this would be to give you the power to do operations that are complex enough (could be as simple as doing one instruction or another based on a conditional) that autovectorization wouldn't be able to figure out how to correctly translate them from non-SIMD to SIMD instructions.

view this post on Zulip Richard Feldman (Apr 24 2022 at 01:32):

Brian Carroll said:

Algorithms where each chunk references the others could still cause problems there

Does that still count as "single instruction multiple data"? (Sounds to me like it shouldn't but maybe CPU makers see it differently?)

yeah they have things like swizzle which reorders the lanes in a vector, and also things like masks, "give me the min/max of all the numbers in the lanes" etc.

I don't think any of those operations would make sense to use in a map (as opposed to a walk that accumulates state) but there's nothing stopping someone from using those operations if they decide to :big_smile:

view this post on Zulip Richard Feldman (Apr 24 2022 at 01:32):

Brian Carroll said:

The overlapping idea from that link is a good one!

I agree! Great link @Jared Cone - I think the overlapping strategy is the most promising one so far

view this post on Zulip Martin Stewart (Apr 24 2022 at 15:27):

Is the introducing tuples necessary for this? Doesn't using T2, T3, T4 tag unions work equally well? For example, instead of

matrix = Matrix4x4.new
    ((5, 4, 7, 8),
     (9, 0, 1, 2),
     (3, 4, 5, 6),
     (7, 8, 9, 0))

you'd write

matrix = Matrix4x4.new
    (T4 5 4 7 8)
    (T4 9 0 1 2)
    (T4 3 4 5 6)
    (T4 7 8 9 0)

view this post on Zulip Richard Feldman (Apr 24 2022 at 15:34):

how do you get the 3rd column from the 2nd row of that though?

view this post on Zulip Richard Feldman (Apr 24 2022 at 15:34):

if you can call (toTuple matrix).1.2 it's very straightforward!

view this post on Zulip Martin Stewart (Apr 24 2022 at 15:43):

Wouldn't matrix4x4 define getter functions? Matrix4x4.row1column2 matrix or something short like Matrix4x4.v12 matrix, though I kind of like the first one because I always get row and column mixed up when working with matrices.

view this post on Zulip Nikita Mounier (Apr 24 2022 at 16:37):

To me tuples seem like a natural addition to the language, and I like how you define them on top of records which will definitely reduce implementation complexity.

view this post on Zulip Richard Feldman (Apr 24 2022 at 16:38):

Wouldn't matrix4x4 define getter functions?

it could, but then you also need a separate set of them for Matrix4x4 and Matrix3x3 and Matrix4x3 and Matrix3x4, all of which come up in games - so a total of 49 getter functions

view this post on Zulip Richard Feldman (Apr 24 2022 at 16:39):

also 7 more for vectors of length 3 and 4, so 56 total

edit: also you'd want setter functions for all those positions too, so it's either tuples or +112 functions for known use cases games have

view this post on Zulip Richard Feldman (Apr 24 2022 at 16:39):

that's totally a plausible API, but I was on the fence about tuples before, and this pushes me over to "yeah they're worth it"

view this post on Zulip Martin Stewart (Apr 24 2022 at 16:55):

Forcing the API author to write getters/setters for the matrices nudges them towards giving useful names that say whether the matrix is row major or column major. (toTuple matrix).1.2 doesn't tell the end user if row or column is accessed first. To me the time saved by one person not needing to write the getters and setters will be lost later by people needing to slow down and carefully think through this (or not realizing it and potentially introducing a bug)

Edit: To clarify why I think row/column are easily mixed up. I know that matrices tend to be row then column but anyone new to matrices might think xy coordinates in which case column would come first.

view this post on Zulip Jared Cone (Apr 24 2022 at 17:16):

API authors could still write those getters and setters even if underlying is a tuple. Though I think adding a language feature should always be approached with caution. What would the ergonomics be with using tags instead of tuples? Is it inconvenient to access those elements?

view this post on Zulip Brendan Hansknecht (Apr 24 2022 at 17:23):

Tags are quite inconvenient for accessing a specific element in general. You have to pattern matching the tag to get the element. Just ends up being quite verbose

view this post on Zulip Brendan Hansknecht (Apr 24 2022 at 17:25):

Something like this:
(Matrix4x4 _ (T4 _ _ elem _) _ _) = m
is equivalent to:
elem = m.1.2

view this post on Zulip Richard Feldman (Apr 24 2022 at 17:45):

Martin Stewart said:

Is the introducing tuples necessary for this? Doesn't using T2, T3, T4 tag unions work equally well? For example, instead of

matrix = Matrix4x4.new
    (5, 4, 7, 8)
    (9, 0, 1, 2)
    (3, 4, 5, 6)
    (7, 8, 9, 0)

you'd write

matrix = Matrix4x4.new
    (T4 5 4 7 8)
    (T4 9 0 1 2)
    (T4 3 4 5 6)
    (T4 7 8 9 0)

I certainly don't think tuples are necessary for this, but I also certainly think they're nicer; I definitely prefer the version without the T tags! :big_smile:

view this post on Zulip Richard Feldman (Apr 24 2022 at 17:47):

to be fair, if not for the fact that roc format would jumble it up, you could also theoretically have new take 16 arguments:

matrix = Matrix4x4.new
     5 4 7 8
     9 0 1 2
     3 4 5 6
     7 8 9 0

view this post on Zulip Richard Feldman (Apr 24 2022 at 18:06):

I also think it's reasonable to want a Matrix API that exposes 100+ functions for getters and setters, to make it harder to mix up row and column, but I also think it's reasonable to prefer polymorphic getters and setters and module documentation that doesn't primarily consist of 100+ getter and setter functions.

view this post on Zulip Richard Feldman (Apr 24 2022 at 18:14):

as an aside, I've run into wanting tuples in both application design and API design often enough that I'm convinced they will end up in the language eventually

view this post on Zulip Richard Feldman (Apr 24 2022 at 18:16):

if it's not going to be because of matrices and vectors, I expect they'd end up being worth it for other reasons anyway

view this post on Zulip Kevin Gillette (Apr 25 2022 at 14:40):

For leftover handling, what about each Simd function accepting a tag, i.e.

List.mapSimd List a, Simd.Policy (Simd a lanes -> Simd b lanes) -> List b

Where Simd.Policy is:

[
  Panic,
  CopyAny # likely equivalent to CopyLast
  CopyLast # repeat last element as needed
  CopyFirst
  FillConst a
  FillFunc (Nat -> a) # based on destination index (past end of list)
]

Last updated: Jun 16 2026 at 16:19 UTC