Stream: compiler development

Topic: SIMD


view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 18:18):

The "Great SIMD ergonomics" section of "Mojo vs. Rust: is Mojo 🔥 faster than Rust 🦀?" is quite interesting.

This is a very interesting idea

Mojo's primitives are natively designed to be SIMD-first: UInt8 is actually a SIMD[DType.uint8, 1] which is a SIMD of 1 element. There is no performance overhead to represent it this way, but it allows the programmer to easily use it for SIMD optimizations.

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:04):

unrelated: I really hope they give up on this "Mojo:fire:" "Rust:crab:" style of writing, it bothers me every time I see it :sweat_smile:

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:07):

Brendan Hansknecht said:

This is a very interesting idea

Mojo's primitives are natively designed to be SIMD-first: UInt8 is actually a SIMD[DType.uint8, 1] which is a SIMD of 1 element. There is no performance overhead to represent it this way, but it allows the programmer to easily use it for SIMD optimizations.

I think I'm missing something here, because there certainly is performance overhead if you put some of these into a struct, because there will be way more alignment padding :thinking:

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:07):

Yeah... No idea:light_bulb: why they tripled:three: down:point_down: on emojis :stuck_out_tongue_wink:

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:09):

exclusively :ring: redundant :women_with_bunny_ears: ones :one::one: too :two:

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:13):

because there will be way more alignment padding

I think what they are really saying is that the type will implicitly promote to SIMD if wanted, not that it is stored with SIMD alignment.

SIMD[DType.uint8, 1] -> size 1 byte, alignment 1 byte. Since it is a simd type though, it can be used with other simd types.
So you can do 8 * SIMD[DType.uint8, 8](2, 4, 6, 8, 16, 32, 64, 128).
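The promotion above can be sketched in Rust (this is an illustrative stand-in, not Mojo's actual implementation): the scalar is "splatted" to the vector's width and then combined element-wise, which is what `8 * SIMD[DType.uint8, 8](...)` does implicitly.

```rust
// Hedged sketch of scalar-to-vector promotion: the scalar 8 is
// broadcast ("splatted") to 8 lanes, then multiplied lane-wise.
// `splat_mul` is a hypothetical helper name, not a real API.
fn splat_mul(scalar: u8, vec: [u8; 8]) -> [u8; 8] {
    let splatted = [scalar; 8]; // scalar promoted to an 8-lane value
    let mut out = [0u8; 8];
    for i in 0..8 {
        // u8 multiplication wraps modulo 256, like fixed-width SIMD lanes
        out[i] = splatted[i].wrapping_mul(vec[i]);
    }
    out
}
```

Note that the high lanes wrap: `8 * 32`, `8 * 64`, and `8 * 128` all overflow a `u8` to 0.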

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:14):

I also think that the simd numbers automatically figure out how to map to the widest simd hardware that they match

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:14):

So SIMD[DType.uint8, 1] won't actually use SIMD hardware if used alone. I think it only will if combined with wider SIMD values.

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:15):

Also, I have seen some kernel code that has helpers that deal with the simd alignment mapping while also dealing with any extra elements that don't align.

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:16):

It can all be automatic and run the same code: for most of the input it can run with SIMD[DType.uint8, 8], and for the remainder elements it can automatically switch to running with SIMD[DType.uint8, 1].
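That bulk-plus-remainder pattern can be sketched in Rust (a hedged analogue, not Mojo): process the slice in fixed-width chunks that the compiler can auto-vectorize, then fall back to scalar code for the tail. `double_all` is a hypothetical example operation.

```rust
// Sketch of the pattern described above: a fixed-size main loop
// (analogous to SIMD[DType.uint8, 8]) plus a scalar remainder loop
// (analogous to SIMD[DType.uint8, 1]).
fn double_all(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let chunks = input.chunks_exact(8);
    let remainder = chunks.remainder();
    for chunk in chunks {
        // Fixed-size body: a good candidate for SIMD codegen.
        for &x in chunk {
            out.push(x.wrapping_mul(2));
        }
    }
    for &x in remainder {
        // Scalar tail for the elements that don't fill a full chunk.
        out.push(x.wrapping_mul(2));
    }
    out
}
```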

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:16):

So I think it all boils down to smarter duck typing (at least it feels like duck typing). Though technically all the types are known at compile time.

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:20):

ahh gotcha!

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:24):

yeah I've thought about how, for example, if you have a List Foo in Roc, and Foo happens to be 32 bits in size, then we can automatically implement certain List operations in terms of SIMD (e.g. equals or contains, or if we had an indexOf, it could use it too)
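A hedged sketch of that idea in Rust: when the element type is exactly 32 bits and uses default (bitwise) equality, `contains` can scan the raw 32-bit words, a loop compilers readily auto-vectorize. `Color` and `contains_bitwise` are hypothetical names for illustration.

```rust
// Hypothetical 32-bit element type with default, bitwise equality.
#[derive(Clone, Copy)]
#[repr(transparent)]
struct Color(u32);

fn contains_bitwise(list: &[Color], needle: Color) -> bool {
    // Reinterpreting as u32 words is sound here because Color is
    // #[repr(transparent)] over u32, so it has no padding.
    let words: &[u32] =
        unsafe { std::slice::from_raw_parts(list.as_ptr() as *const u32, list.len()) };
    // A simple linear scan over machine words; LLVM can vectorize this.
    words.iter().any(|&w| w == needle.0)
}
```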

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:25):

equals or contains

I don't think this is true anymore cause we have custom Eq. That said, it could be an optimization for types that don't have custom eq.

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:26):

oh sure, I mean for if you're just using the default

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:26):

or like if Foo is a type alias

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:34):

Yeah. Definitely sounds like it could be useful. Of course, with the minor concern of AVX-512 being intermixed with other code. Though probably we just need to follow whatever glibc memcpy does; it is highly optimized and uses some SIMD.

view this post on Zulip Notification Bot (Feb 15 2024 at 20:39):

17 messages were moved here from #compiler development > Casual Conversation by Richard Feldman.

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:39):

regarding AVX512, I kinda assume we'll end up on the lowest common denominator of 16B SIMD but maybe not?

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:40):

Many libraries do a dynamic dispatch on the supported simd.

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:41):

Either at first load or at first run, update some function pointers based on the SIMD that is available on the machine.
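A minimal sketch of that first-use dispatch pattern in Rust: on the first call, "detect" CPU capabilities and cache a function pointer, so later calls jump straight to the chosen implementation. The wide/scalar split and `supports_wide_simd` are illustrative stand-ins, not real feature detection.

```rust
use std::sync::OnceLock;

type SumFn = fn(&[u8]) -> u64;

fn sum_scalar(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

fn sum_wide(data: &[u8]) -> u64 {
    // Stand-in for a SIMD implementation chosen when the CPU
    // supports a wider vector width; same result, chunked loop.
    data.chunks(8)
        .map(|c| c.iter().map(|&b| b as u64).sum::<u64>())
        .sum()
}

fn supports_wide_simd() -> bool {
    // Real code would query CPU features at runtime, e.g. with
    // std::arch::is_x86_feature_detected!("avx2") on x86_64.
    cfg!(target_arch = "x86_64")
}

// Cached function pointer: detection runs once, on the first call.
static SUM_IMPL: OnceLock<SumFn> = OnceLock::new();

fn sum(data: &[u8]) -> u64 {
    let f = SUM_IMPL.get_or_init(|| {
        if supports_wide_simd() { sum_wide } else { sum_scalar }
    });
    f(data)
}
```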

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 20:41):

Not actually sure what memcpy does for glibc. They may just use the lowest common denominator.

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:44):

Brendan Hansknecht said:

Either at first load or at first run, update some pointers based on the simd that is available on the pc..

right, I just wonder about the costs of this in practice (hopefully minimal, but maybe not?) and the "processor dials back clock speed because avx512 runs hot" concern we talked about in some other thread

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:45):

like 16B SIMD seems like an obvious win compared to doing things 8B (or smaller) at a time, and it's widely supported, but beyond that it's not as obvious that it would consistently be a win

view this post on Zulip Richard Feldman (Feb 15 2024 at 20:45):

maybe so though!

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 22:25):

true

view this post on Zulip Brendan Hansknecht (Feb 15 2024 at 22:28):

May be a case of: 16B is low-hanging fruit, 32B is probably good, and 64B is a not-yet-but-maybe-one-day.


Last updated: Jul 06 2025 at 12:14 UTC