The "Great SIMD ergonomics" section of "Mojo vs. Rust: is Mojo 🔥 faster than Rust 🦀?" is quite interesting.
This is a very interesting idea:

Mojo's primitives are natively designed to be SIMD-first: UInt8 is actually a SIMD[DType.uint8, 1], which is a SIMD of 1 element. There is no performance overhead to represent it this way, but it allows the programmer to easily use it for SIMD optimizations.
unrelated: I really hope they give up on this "Mojo:fire:" "Rust:crab:" style of writing, it bothers me every time I see it :sweat_smile:
Brendan Hansknecht said:
This is a very interesting idea:

Mojo's primitives are natively designed to be SIMD-first: UInt8 is actually a SIMD[DType.uint8, 1], which is a SIMD of 1 element. There is no performance overhead to represent it this way, but it allows the programmer to easily use it for SIMD optimizations.
I think I'm missing something here, because there certainly is performance overhead if you put some of these into a struct, because there will be way more alignment padding :thinking:
Yeah... No idea:light_bulb: why they tripled:three: down:point_down: on emojis :stuck_out_tongue_wink:
exclusively :ring: redundant :women_with_bunny_ears: ones :one::one: too :two:
because there will be way more alignment padding
I think what they're really saying is that the type will implicitly promote to SIMD if wanted, not that it is stored with SIMD alignment.
SIMD[DType.uint8, 1] -> size 1 byte, alignment 1 byte. Since it is a SIMD type, though, it can be used with other SIMD types. So you can do 8 * SIMD[DType.uint8, 8](2, 4, 6, 8, 16, 32, 64, 128).
I also think that the SIMD numbers automatically figure out how to map to the widest SIMD hardware they match. So SIMD[DType.uint8, 1] won't actually use SIMD hardware if used alone; I think only if used with other SIMD values.
Also, I have seen some kernel code that has helpers that deal with the SIMD alignment mapping while also dealing with any extra elements that don't align. It can all be automatic and run the same code, because for most of the function it can run with SIMD[DType.uint8, 8], but for the remainder elements it can automatically switch to running with SIMD[DType.uint8, 1].
So I think it all boils down to smarter duck typing (at least it feels like duck typing). Though technically all the types are known at compile time.
ahh gotcha!
yeah I've thought about how, for example, if you have a List Foo in Roc, and Foo happens to be 32 bits in size, then we can automatically implement certain List operations in terms of SIMD (e.g. equals or contains, or if we had an indexOf, it could use it too)
equals or contains
I don't think this is true anymore, because we have custom Eq. That said, it could be an optimization for types that don't have a custom Eq.
oh sure, I mean if you're just using the default
or like if Foo is a type alias
Yeah. Definitely sounds like it could be useful. Of course with the minor concern of AVX-512 intermixed with other code. Though probably we just need to follow whatever glibc's memcpy does; it is highly optimized and uses some SIMD.
17 messages were moved here from #compiler development > Casual Conversation by Richard Feldman.
regarding AVX512, I kinda assume we'll end up on the lowest common denominator of 16B SIMD but maybe not?
Many libraries do dynamic dispatch on the supported SIMD.
Either at first load or at first run, they update some function pointers based on the SIMD that is available on the machine.
Not actually sure what glibc's memcpy does. They may just use the lowest common denominator.
Brendan Hansknecht said:
Either at first load or at first run, they update some function pointers based on the SIMD that is available on the machine.
right, I just wonder about the costs of this in practice (hopefully minimal, but maybe not?) and the "processor dials back clock speed because avx512 runs hot" concern we talked about in some other thread
like 16B SIMD seems like an obvious win compared to doing things 8B (or smaller) at a time, and it's widely supported, but beyond that it's not as obvious that it would consistently be a win
maybe so though!
true
Maybe it's a case of: 16B is low-hanging fruit, 32B is probably good, and 64B is a "not yet, but maybe one day."
Last updated: Jul 06 2025 at 12:14 UTC