> I think a legitimate criticism is that it is unclear who std::simd is for. I t...

jandrewrogers · 2026-05-17T04:28:29 1778992109

You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.

The design of the intrinsics libraries does them no favors, and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address: it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.

An argument I would make, though, is that the lowest-common-denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today, but you can see a future where std::simd is essentially vestigial: auto-vectorization subsumes what it can do, yet std::simd can't be leveled up to express more than what auto-vectorization can see, because of the limitations imposed by portability requirements.

The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
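
One small, familiar instance of the "different data structure" point is array-of-structures versus structure-of-arrays layout. The sketch below is only an illustration under assumed names (`ParticleAoS`, `ParticlesSoA`, `ToSoA`, and the `TotalMass*` functions are all made up here), and it is a much smaller change than the macro-level swaps described above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: all fields of one particle sit next to each other.
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure-of-arrays: each field is stored contiguously across particles.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Hypothetical helper that converts one layout into the other.
ParticlesSoA ToSoA(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.mass.push_back(p.mass);
    }
    return soa;
}

// Reads every fourth float: a strided access pattern for vector loads.
float TotalMassAoS(const std::vector<ParticleAoS>& aos) {
    float total = 0.0f;
    for (const ParticleAoS& p : aos) total += p.mass;
    return total;
}

// Reads one contiguous array: unit-stride loads map directly onto lanes.
float TotalMassSoA(const ParticlesSoA& soa) {
    assert(soa.mass.size() == soa.x.size());  // all field arrays stay in sync
    float total = 0.0f;
    for (float m : soa.mass) total += m;
    return total;
}
```

Both functions compute the same result; which layout is faster depends on the access pattern and the target, which is the point: the choice is made above the level where intrinsics or auto-vectorization operate.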

SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.

That said, I love that silicon has become so much more expressive.

camel-cdr · 2026-05-17T06:02:49 1778997769

IMO what's needed is ISPC-like guided autovec with a lot of hinting support to control codegen (e.g. a hint for generating an unrolled version only, or both an unrolled and a non-unrolled version).

Basically something like #pragma omp simd, but actually designed for the SIMD model, not the parallel one, that errors when vectorization isn't possible.

Ideally it would support things like reductions, scans, references to elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather/scatter, early break, conditional execution control (masking, or also a fast path when there are no active elements), latency- vs. throughput-sensitive codegen (don't unroll, or unroll to the max without spilling), data-dependent termination (fault-only-first load or page-aligned access for things like strlen), ...
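
For reference, here is a minimal sketch of what the existing `#pragma omp simd` can already express for two items on that list, a reduction and a loop that reads neighboring elements. The function names `sum_simd` and `neighbor_sum` are made up for illustration. Note that the pragma is only a hint: it is ignored when OpenMP support is off, and it does not produce an error when vectorization fails, which is exactly the missing piece described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum with an explicit SIMD reduction clause.
float sum_simd(const std::vector<float>& in) {
    const float* src = in.data();
    const std::size_t n = in.size();
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += src[i];
    }
    return sum;
}

// Hypothetical helper: out[i] = in[i-1] + in[i+1] for interior elements.
// `in` and `out` must be different vectors, since the pragma asserts that
// iterations are independent.
void neighbor_sum(const std::vector<float>& in, std::vector<float>& out) {
    assert(&in != &out);
    assert(in.size() == out.size());
    const std::size_t n = in.size();
    if (n < 3) return;  // no interior elements to write
    const float* src = in.data();
    float* dst = out.data();
    #pragma omp simd
    for (std::size_t i = 1; i < n - 1; ++i) {
        dst[i] = src[i - 1] + src[i + 1];
    }
}
```

With GCC or Clang, `-fopenmp-simd` enables these pragmas without linking the OpenMP runtime.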

duped · 2026-05-17T04:40:07 1778992807

> it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions

This is one complaint I toss back at Intel and AMD.

If an instruction/intrinsic is universally worse than another set of intrinsics for the P90/P95/P99 use case where it's going to be used, then it shouldn't exist. Stop wasting the die space and instruction decode on it, to say nothing of the developer time wasted finding out that your dot product instruction is useless.

There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.

janwas · 2026-05-17T09:30:12 1779010212

:) I agree a tutorial would be helpful. We are working on one with Fastcode.

duped · 2026-05-17T14:06:36 1779026796

A manual is not a tutorial, and having AI anywhere near this task is actively harmful. Please do not build this.

janwas · 2026-05-17T16:13:36 1779034416

?? Where did you see mention of AI?

duped · 2026-05-17T17:02:28 1779037348

I searched the name "fastcode" and the only results were AI

janwas · 2026-05-18T06:23:37 1779085417

Oh, interesting :) I meant Fastcode.org.

janwas · 2026-05-17T09:21:25 1779009685

Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.

mpyne · 2026-05-17T14:40:19 1779028819

I hadn't but it would make sense for doing my own personal programming challenges.

Given the ongoing disasters around software supply chains, I've been fighting the creeping NPM-ism that people are trying to introduce to C++, where you just FetchContent 20 different libraries to build your own app upon.

I do use gtest, fmt and a few others though, so something as broadly used as Highway would probably be fine by that standard as well. But I'd still like it better if there was a Good Enough solution that was part of C++ stdlib to reduce the number of external integrations that are deemed required for a modern C++ program.

janwas · 2026-05-17T16:06:55 1779034015

Fair point. If it helps, our security team has called Highway critical infrastructure and helped to harden the repo. The flip side of standardization is that it would be much harder and slower to add ops as the need arises, which we do regularly.

fweimer · 2026-05-17T10:05:40 1779012340

Does it have fallback paths for everything, though? Scalar if necessary?

Projects that depend on Highway drop support for CPUs not listed in the Highway documentation, saying that they can't support these CPUs because they are incompatible with Highway: https://google.github.io/highway/en/master/README.html#curre...

Are these projects somehow mistaken?

janwas · 2026-05-17T10:48:39 1779014919

Yes, the EMU128 target is scalar only, with for loops. This is a fun way to see how well autovectorization works, with the same source code. That works on any CPU. Curious which projects have such concerns, any link?
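
To illustrate the general idea of a scalar-only target (this is not Highway's actual code; `Vec4f`, `GetLane`, `Add`, and `MulAdd` are hypothetical names), an emulated 128-bit vector can be a plain array of four floats, with every op written as a loop over the lanes that the compiler is then free to auto-vectorize:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an emulated 128-bit vector of four floats.
struct Vec4f {
    float lane[4];
};

// Bounds-checked lane read.
inline float GetLane(const Vec4f& v, std::size_t i) {
    assert(i < 4);
    return v.lane[i];
}

// Lane-wise a + b, written as an ordinary for loop.
inline Vec4f Add(const Vec4f& a, const Vec4f& b) {
    Vec4f r{};
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Lane-wise a * b + c, again just a loop over the four lanes.
inline Vec4f MulAdd(const Vec4f& a, const Vec4f& b, const Vec4f& c) {
    Vec4f r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    }
    return r;
}
```

Code written against ops like these runs on any CPU; whether it also gets vector instructions is then entirely up to the compiler.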

fweimer · 2026-05-17T12:53:04 1779022384

People reported challenges building V8 (whether upstream or the Node.js variant) on s390x with z13 support. I don't know if it was discussed on the porters mailing list because it's not public: https://groups.google.com/g/v8-s390-ports

Elsewhere, some people interpreted https://github.com/google/highway/issues/1895 as meaning that Highway code does not work on z13 at all.

janwas · 2026-05-17T16:12:37 1779034357

Thanks for sharing. The first link seems non public indeed. I can imagine there is some compile issue we could reasonably fix, with the help of someone who has Z13 access. Please encourage them to raise an issue. I will be back on May 26. After that, it should at least be able to use the scalar fallback. The issue with Z14 is that it lacks fp32 support. Would their usage be integer only?

fweimer · 2026-05-20T10:04:34 1779271474

I'll bring it up with some folks. It probably won't change much because the z13 transition has finished by now. It's still good to know because RISC-V is in the same boat regarding Highway support today: we need scalar fallback in Highway until we get RVA23 hardware deployed.

janwas · 2026-05-18T06:24:35 1779085475

Correction (typo): Z13 lacks fp32.

nnevatie · 2026-05-18T09:02:01 1779094921

Runtime dispatch is still a PITA in Highway, especially compared to ISPC. A simple algorithm or kernel becomes a multi-file macro-hell, basically.

janwas · 2026-05-20T09:16:20 1779268580

Any suggestions for improvement? We went through >5 iterations of the dispatching and I am fairly confident this is about as good as it gets in current C++. I suppose "macro hell" is a matter of taste. Objectively, we have six dispatch related macros in the example: https://gcc.godbolt.org/z/KM3ben7E The ~two dozen lines of boilerplate are generally copied from an example. But why multi-file?
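
For comparison, a hand-rolled version of runtime dispatch without any library might look roughly like the sketch below. It is not Highway's mechanism; `AddScalar`, `AddAvx2`, `ChooseAdd`, `AddArrays`, and the `SKETCH_HAVE_AVX2_PATH` macro are names invented for this example, and the AVX2 path assumes GCC or Clang on x86-64 (other setups fall back to the scalar loop).

```cpp
#include <cassert>
#include <cstddef>

// Baseline kernel: plain scalar loop, valid on any CPU.
static void AddScalar(const float* a, const float* b, float* out,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define SKETCH_HAVE_AVX2_PATH 1
// Same loop, but the compiler may emit AVX2 instructions for this function
// only, even if the rest of the file targets baseline x86-64.
__attribute__((target("avx2")))
static void AddAvx2(const float* a, const float* b, float* out,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
#endif

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Queries the CPU and returns the best kernel that is safe to run on it.
static AddFn ChooseAdd() {
#ifdef SKETCH_HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) return &AddAvx2;
#endif
    return &AddScalar;
}

// Public entry point: the choice is made on the first call and then cached.
void AddArrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    static const AddFn fn = ChooseAdd();
    fn(a, b, out, n);
}
```

Every additional target means another copy of the kernel plus another branch in the chooser, which is the kind of repetition that dispatch macros are meant to generate.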

nnevatie · 2026-05-21T13:07:46 1779368866

The link doesn't work - could you link it again? I might be basing my above comment on an older version of Highway.

janwas · 2026-05-25T14:26:03 1779719163

Oops, the final T got cut off somehow, sorry about that.

https://gcc.godbolt.org/z/KM3ben7ET