ww520 6 days ago

Zig actually has a very nice abstraction for SIMD in the form of vector programming. The size of the vector is agnostic to the underlying CPU architecture. The compiler or LLVM will generate code using 128-, 256-, or 512-bit SIMD registers, and you are just programming straight vectors.

  • pcwalton 5 days ago

    Rust has that too, with nalgebra if you want arbitrary-sized tensors as scientific computing wants, or with glam and similar crates if your needs are more modest as in graphics. In all cases they're SIMD-accelerated.

  • hansvm 5 days ago

    I do generally like their approach. It's especially well suited given how easily comptime allows metaprogramming against the target register size.

    I wish it had a few more builtins for commonly supported operations without me having to write inline assembly (e.g., runtime LUTs are basically untenable for implementing something like bolt [0] without inline asm), but otherwise the abstraction level is about where I'd like it to be. I usually prefer it to gcc intrinsics, fully inline asm, and other such shenanigans.

    [0] https://arxiv.org/abs/1706.10283

  • ladyanita22 5 days ago

    Isn't that what std::simd is for in Rust?

  • ladyanita22 5 days ago

    But Zig lacks intrinsics support, and not every single SIMD extension is exposed through the abstraction.

  • jvanderbot 5 days ago

    Yeah, the article overlooked library support for SIMD. nalgebra had a decent writeup on their ability to squeeze out autovectorization for their vector and matrix types.

nbrempel 6 days ago

Thanks for reading everyone. I’ve gotten some feedback over on Reddit as well that the example is not effectively showing the benefits of SIMD. I plan on revising this.

One of my goals of writing these articles is to learn so feedback is more than welcome!

  • dzaima 5 days ago

    What's fun is that, as the use of SIMD in your example is useless, LLVM correctly completely removes it, and makes your "neon" and "fallback" versions exactly the same - without any SIMD (compiler explorer: https://godbolt.org/z/YWoMGoaxT).

    As an additional note, aarch64 always has NEON (similar to how x86-64 always has SSE2; extensions useful to dispatch would be SVE on aarch64 and AVX/AVX2/AVX-512 on x86-64), so no point dynamically checking for it.
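    For illustration (my sketch, not from the thread): baseline features such as SSE2 on x86-64 and NEON on aarch64 are known at compile time, so stable Rust can branch on them with `cfg!` instead of a runtime check.

```rust
// Sketch (function name is mine): baseline SIMD features can be checked
// at compile time with cfg!, so no runtime dispatch is needed for them.
pub fn baseline_simd_feature() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) {
        "sse2" // part of the x86-64 baseline ABI
    } else if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        "neon" // always present on aarch64
    } else {
        "unknown"
    }
}

fn main() {
    println!("baseline: {}", baseline_simd_feature());
}
```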

  • KineticLensman 6 days ago

    Great read!

    > One of my goals of writing these articles is to learn so feedback is more than welcome!

    When I went into the Rust playground to see the assembly output for the Cumulative Sum example, I could only get it to show the compiler warnings, not the actual assembly. I'm probably doing something wrong, but for me this was a barrier that detracted from the article. I'd suggest incorporating the assembly directly into the article, although keeping the playground link for people who are more dedicated / competent than I am.

    • the8472 6 days ago

      The function has to be made pub so it doesn't get optimized out as an unused private function.

      Godbolt is a better choice for looking at asm anyway. https://rust.godbolt.org/z/3Y9ovsoz9

      • hayley-patton 5 days ago

        Narrator: "The code did not, in fact, auto-vectorise."

        (There's only addsd/movsd instructions, which are add/move scalar-double; we want addpd/movpd which are add/move packed-double in vectorised code.)
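        A minimal stable-Rust sketch of the point (function names are mine, not from the article): float addition is not associative, so LLVM keeps a naive serial sum scalar, while independent accumulator lanes give it packed adds it may vectorize.

```rust
// Sketch (not the article's code): float addition is not associative,
// so LLVM must keep the serial accumulation chain in `sum_naive` scalar
// (addsd). `sum_lanes` exposes four independent accumulator chains,
// which the optimizer is free to map onto packed (addpd-style) adds.
pub fn sum_naive(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

pub fn sum_lanes(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        for i in 0..4 {
            acc[i] += c[i]; // four independent dependency chains
        }
    }
    let tail: f64 = chunks.remainder().iter().sum();
    acc.iter().sum::<f64>() + tail
}

fn main() {
    let data: Vec<f64> = (1..=8).map(|i| i as f64).collect();
    assert_eq!(sum_naive(&data), 36.0);
    assert_eq!(sum_lanes(&data), 36.0);
}
```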

      • KineticLensman 6 days ago

        Ah, that worked, thanks!

        Although I can now see why he didn't include the output directly.

  • devit 6 days ago

    Are you really writing them?

    Seems written by an LLM for the most part.

eachro 6 days ago

This is cool that SIMD primitives exist in the std lib of Rust. I've wanted to mess around a bit more with SIMD in Python but I don't think that native support exists. Or you have to go down to C/C++ bindings to actually mess around with it (last I checked at least, please correct me if I'm wrong).

  • bardak 6 days ago

    I feel like most languages could use SIMD in the standard library. We have all this power in the vector units of our CPUs that compilers struggle to use, yet we don't make it easy to use manually.

    • neonsunset 5 days ago

      C# is the language that is doing this exact thing, with the next two close options being Swift and, from my understanding, Mojo.

      Without easy to use SIMD abstraction, many* of .NET's CoreLib functions would have been significantly slower.

      * UTF-8 validation, text encoding/decoding, conversion to/from hex bytes, copying data, zeroing, various checksum and hash functions, text/element counting, searching, advanced text search with multiple algorithms under SearchValues type used by Regex engine, etc.

      • pjmlp 5 days ago

        D as well.

  • Calavar 6 days ago

    What would native SIMD support entail in a language without first party JIT or AOT compilation?

    • runevault 6 days ago

      At some point bytecode still turns into CPU instructions, so if you added syntax or special functions that dispatched to SIMD-backed parts of the interpreter, you could certainly add it to a purely interpreted language.

      • Calavar 6 days ago

        If we're talking low level SIMD, like opcode level, I'm really struggling to see the use case for interpreted bytecode. The cost of type checking operands to dynamically dispatch down a SIMD path would almost certainly outweigh the savings of the SIMD path itself.

        JIT is different because in function-level JIT, you can check types just once at the start of the function, then you stay on the SIMD happy path for the rest of the function. And in AOT, you may be able to elide the checks entirely.

        There is certainly a space for higher level SIMD functionality. Numpy is one example.

anonymousDan 6 days ago

The interesting question for me is whether Rust makes it easier for the compiler to extract SIMD parallelism automatically given the restrictions imposed by its type system.

  • pcwalton 5 days ago

    The main thing I can think of that would help here is the fact that Rust has stricter alignment requirements than C++ does. Any live reference can more or less be assumed to point to validly-aligned memory at all times, which isn't true in C++.

    As to whether LLVM actually takes advantage of this effectively, I don't know. I know that we do supply the necessary attributes to LLVM in most cases, but I haven't looked at the individual transform and optimization passes to see whether they take advantage of this (e.g. emitting movdqa vs. falling back to movdqu).
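    As an illustrative sketch (my example, not pcwalton's code): Rust lets you state alignment in the type system, and every safe reference to such a type is guaranteed to honor it.

```rust
// Sketch (illustrative): alignment is part of the type in Rust, and
// safe code cannot produce a misaligned reference.
use std::mem::align_of;

#[repr(align(32))]
struct Aligned32([f32; 8]);

fn main() {
    let v = Aligned32([1.0; 8]);
    // Every &Aligned32 is guaranteed 32-byte aligned, so a backend may
    // choose aligned loads (movdqa-style) without inserting checks.
    assert_eq!(align_of::<Aligned32>(), 32);
    assert_eq!(&v as *const Aligned32 as usize % 32, 0);
}
```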

  • PoignardAzur 5 days ago

    Aside from aliasing restrictions, you can use chunked iterators which IIRC make it easier for the compiler to auto-vectorize your loop. The actual code changes very little.
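    A hedged sketch of the chunked-iterator pattern (function name is mine): `chunks_exact` hands the optimizer fixed-size blocks with no bounds checks in the hot loop, and `&`/`&mut` slice parameters cannot alias in safe Rust.

```rust
// Sketch of the chunked-iterator pattern (function name is mine):
// `chunks_exact` gives the optimizer fixed-size, bounds-check-free
// blocks, and the &/&mut slice parameters cannot alias in safe Rust.
pub fn axpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    assert_eq!(xs.len(), ys.len());
    for (x, y) in xs.chunks_exact(8).zip(ys.chunks_exact_mut(8)) {
        for i in 0..8 {
            y[i] = a * x[i] + y[i]; // straight-line body, easy to vectorize
        }
    }
    // Scalar tail for lengths not divisible by 8.
    let n = xs.len() - xs.len() % 8;
    for (x, y) in xs[n..].iter().zip(&mut ys[n..]) {
        *y = a * *x + *y;
    }
}

fn main() {
    let xs = vec![1.0f32; 10];
    let mut ys = vec![2.0f32; 10];
    axpy(3.0, &xs, &mut ys);
    assert!(ys.iter().all(|&y| y == 5.0));
}
```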

IshKebab 6 days ago

Minor nit: RISC-V Vector isn't SIMD. It's actually more like ARM's Scalable Vector Extension. Unlike traditional SIMD, the code is agnostic to the register width, and different hardware can run the same code with different widths.

There is also a traditional SIMD extension (P I think?) but it isn't finished. Most focus has been on the vector extension.

I am wondering how and if Rust will support these vector processing extensions.

  • camel-cdr 6 days ago

    > RISC-V Vector isn't SIMD

    Isn't SIMD a subset of vector processors?

    For that matter, can anybody here provide a proper and useful distinction between the two, that is, SIMD and vector ISAs?

    You imply it's because it's vector-length agnostic, but you could take e.g. the SSE encoding and, apart from a few instructions, make it operate on SIMD registers of any length. Wouldn't that also be vector-length agnostic, as long as software can query the vector length? I think most people wouldn't call this a vector ISA, and how is this substantially different from dispatching to different implementations for SSE, AVX, and AVX-512?

    I've also seen people say it's about the predication, which would make AVX-512 a vector ISA.

    I've seen others say it's about resource usage and vector chaining, but that is just an implementation detail and can be used or not on traditional SIMD ISAs to the same extent as on vector ISAs.

    • janwas 5 days ago

      I agree that SIMD and vector are basically interchangeable at a certain level.

      There is still a difference in the binutils, because SSE4 and AVX2 and AVX-512 have different instruction encodings per length.

      But yes, it is possible to write VL-agnostic code for both SIMD and vector, and indeed the same user code written with Highway works on both SIMD and RISC-V.

  • Findecanor 6 days ago

    RISC-V's vector extension guarantees at least 128-bit vector registers in application processors, so I think you could set VLEN=128 and just use SIMD algorithms.

    The P extension is intended more for embedded microcontrollers for which the V extension would be too expensive. It reuses the GPRs at whatever width they are at (32 or 64 bits).

    • camel-cdr 6 days ago

      That or you can detect the vector length and specialize for it, just like it's already done on x86 with VLEN 128, 256, and 512 for SSE, AVX, and AVX-512.

brundolf 5 days ago

std::simd is a delight. I'd never done SIMD before in any language, and it was very easy and natural (and safe!) to introduce to my code, and just automatically works cross-platform. Can't recommend it enough

neonsunset 6 days ago

If you like SIMD and would like to dabble in it, I can strongly recommend trying it out in C# via its platform-agnostic SIMD abstraction. It is very accessible especially if you already know a little bit of C or C++, and compiles to very competent codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD and, in .NET 9, SVE1/2:

https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

Here's an example of "checked" sum over a span of integers that uses platform-specific vector width:

https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Other examples:

CRC64 https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Hamming distance https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

Default syntax is a bit ugly in my opinion, but it can be significantly improved with helper methods like here where the code is a port of simdutf's UTF-8 code point counting: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

There are more advanced scenarios. Bepuphysics2 engine heavily leverages SIMD to perform as fast as PhysX's CPU back-end: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...

Note that practically none of these need to reach out to platform-specific intrinsics (except for replacing movemask emulation with efficient ARM64 alternative) and use the same path for all platforms, varied by vector width rather than specific ISA.

  • runevault 6 days ago

    Funny you mention C#. I started to look at this and made the mistake of wanting to do string comparison via SIMD, except you can't do it externally because it relies on private internals (note: the built-in comparison for C# already does SIMD, you just can't easily reimplement it against the built-in string type).

    • neonsunset 6 days ago

      What kind of private internals do you have in mind? You absolutely can hand-roll your own comparison routine, just hard to beat existing implementation esp. once you start considering culture-sensitive comparison (which may defer to e.g. ICU).

      There are no private SIMD APIs save for the sequence comparison intrinsic for unrolling against known lengths, which JIT/ILC does for spans and strings.

      • runevault 6 days ago

        IIRC (it's been a month or so since I looked into it) I couldn't access the underlying array in a way SIMD liked, I think? If you look at how they did it inside the actual string class, it uses private properties of the string that are only available internally to guarantee you don't change the string data, if memory serves.

        • neonsunset 6 days ago

          String can provide you a `ReadOnlySpan<char>`, out of which you can either take `ref readonly char` "byref" pointer, which all vectors work with, or you can use the unsafe variant and make this byref mutable (just don't write to it) with `Unsafe.AsRef`.

          Because pretty much every type that has linear memory can be represented as a span, every span is amenable to pointer (byref) arithmetic, which you then use to write a SIMD routine. e.g.:

              var text = "Hello, World! Hello, World!";
              var span = MemoryMarshal.Cast<char, ushort>(text);
              ref readonly var ptr = ref span[0];
          
              var chunk = Vector128.LoadUnsafe(in ptr);
              var needle = Vector128.Create((ushort)',');
              var comparison = Vector128.Equals(chunk, needle);
              var offset = uint.TrailingZeroCount(comparison.ExtractMostSignificantBits());
          
              Console.WriteLine(text[..(int)offset]);
          
          If you have doubts regarding codegen quality, take a look at https://godbolt.org/z/b97zjfTP7; the above vector API calls are lowered to lines 17-22.

          • runevault 6 days ago

            Oh interesting, I'll have to give that a try then. My concern was avoiding a reallocation by doing it another way, but if the readonly span works I can see how it would get you there. I need to see if I still have that project to test it out, appreciate the heads up. SIMD is something I really want to get better with.

            • neonsunset 6 days ago

              If you go through the guide at the first link, it will pretty much set you up with the basics to work on vectorization, and once done, you can look at what CoreLib does as a reference (just keep in mind it tries to squeeze all the performance for short lengths too, so the tail/head scalar handlers and dispatch can be high-effort, more so than you may care about). The point behind the way .NET does it is to have the same API exposed to external consumers as the one CoreLib uses itself, which is why I was surprised by your initial statement.

              No offense taken, just clarifying, SIMD can seem daunting especially if you look at intrinsics in C/C++, and I hope the approach in C# will popularize it. Good luck with your experiments!

              • runevault 6 days ago

                I appreciate you taking the time to talk me through this, SIMD has been an interest of mine for a while. I ran into issues and then when I went and looked at how the actual string class did it I stopped since they were doing tricks that required said access to the internal data. But this gives me a path to explore. I was already planning on looking at the links you supplied.

                Thank you again.

  • zvrba 5 days ago

    I implemented a sorting network in C# with AVX2 intrinsics. https://github.com/zvrba/SortingNetworks

    • neonsunset 5 days ago

      It's a nice piece of work! If you're interested, .NET's compiler has improved significantly since 3.1, in particular around structs and pre-existing intrinsics (which no longer need to be used directly in most situations - pretty much all code prefers plain methods on VectorXXX<T> whenever possible).

      Also note the use of the AggressiveOptimization attribute, which disables tiered compilation and forces the static initialization checks your readme refers to - removing AO allows the compiler to bake statics directly into codegen through tiered compilation, as upon reaching Tier 1 the value of such readonly statics will be known. For trivially constructed values, it is better not to store them in fields but rather construct them in place via e.g. expression-bodied properties like `Vector128<byte> MASK => Vector128.Create((byte)0x80)`. I don't remember exactly whether this was introduced in Core 3.1 or 5, but today the use of `AggressiveOptimization` is discouraged unless you do need to bypass DynamicPGO.

      You also noted the lack of ability to express numeric properties of T within generic context. This was indeed true, and this limitation was eventually addressed by generic math feature. There are INumber<T>, IBinaryInteger<T> and others to constrain the T on, which bring the comparison operators you were looking for.

      In general, the knowledge around vectorized code has substantially improved within the community, and it is used quite more liberally nowadays by those who are aware of it.