Not necessarily a myth anymore, strangely. If you apt/rpm install something nowa...

jcranmer · on April 12, 2020

> -march=native is gonna have a much bigger effect now than stage 1 Gentoo installs did in the mid-2000s.

Actually, probably quite the contrary. All x86-64 chips are required to support SSE2, which lets you use SSE for floating-point instead of x87 floating-point, which is a big speed win. But the newer extensions are specialized SSE instructions, which generally have to be used by hand; specialized crypto instructions, which definitely have to be used by hand; and AVX instructions, which double the width of the vectors you can use. The first two are not going to be improved by -march; the code that uses them is almost certainly going to be compiled in a way that lets it dynamically use these instructions when available.
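To make "dynamically use these instructions when available" concrete, here is a minimal sketch of runtime dispatch in C. It assumes GCC or Clang on x86 (for `__builtin_cpu_supports` and the `target` attribute); the function names are made up for illustration, and real libraries use more elaborate schemes.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Portable baseline: runs on any CPU. (Hypothetical name.) */
static uint32_t sum_u32_scalar(const uint32_t *p, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
/* AVX2 variant. The target attribute lets this one function use AVX2
   even though the rest of the file is built for baseline x86-64,
   i.e. without -march=native. (Hypothetical name.) */
__attribute__((target("avx2")))
static uint32_t sum_u32_avx2(const uint32_t *p, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)   /* 8 x 32-bit lanes per step */
        acc = _mm256_add_epi32(
            acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint32_t s = 0;
    for (int k = 0; k < 8; k++) s += lanes[k];
    for (; i < n; i++) s += p[i];   /* leftover tail elements */
    return s;
}
#endif

/* Public entry point: checks the CPU at run time and picks a variant. */
uint32_t sum_u32(const uint32_t *p, size_t n) {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return sum_u32_avx2(p, n);
#endif
    return sum_u32_scalar(p, n);
}
```

A binary built this way with plain `-O2` still takes the AVX2 path on machines that have it, which is why rebuilding it with -march=native changes little for code like this.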

As for vectorized code that could use AVX instead, it's dubious how much of an effect it would have. The biggest improvement from vectorization already comes with the 128-bit vectors; 256-bit vectors offer at most a 2x speedup in the vectorized code. That effect is reduced by some code only being 128-bit-vectorizable (and so receiving no speedup), and also by Amdahl's Law, which limits the overall benefit of further speedups to the share of runtime that code accounts for. Furthermore, vectorization tends to be much less relevant in the "integer" code that is typical of most consumer software, outside of a few hot loops that are already handled manually as described above.
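The Amdahl's Law point can be put into numbers with a small sketch. The function name and the 30% figure used below are illustrative assumptions, not measurements.

```c
/* Amdahl's Law: overall speedup when a fraction f of the runtime
   (0 <= f <= 1) is made s times faster and the rest is unchanged.
   (Hypothetical helper for illustration.) */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, if a hypothetical 30% of runtime sat in loops that got the full 2x from 256-bit vectors, `amdahl_speedup(0.3, 2.0)` comes out to roughly 1.18x overall, and only a program that is 100% such loops would reach 2x.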

brobinson · on April 13, 2020

Thanks for the more detailed information. I was under the impression that compilers would automatically use SSE3+ and AVX when appropriate.

Avamander · on April 12, 2020

But what about other SIMD extensions? Are all packages really distributed with the expectation that, for example, SSE5 and similar are supported?

jcranmer · on April 13, 2020

Most of the SSE3, SSSE3, SSE4.1, and SSE4.2 instructions (there is no SSE5 in any released processor) are not particularly amenable to automatic vectorization, being mostly horizontal vector operations or some oddball instructions that are pretty task-specific (hi, PCMPESTRI). You might see them come up in SLP vectorization, but my last experience with LLVM's SLP vectorizer is that it does a poor job of taking advantage of these kinds of instructions anyways.
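To illustrate the vertical/horizontal distinction, here is a rough sketch assuming GCC or Clang on x86; the function names are hypothetical. The first loop is the lane-by-lane shape auto-vectorizers typically handle with baseline SSE2, while the second uses the SSE3 horizontal add explicitly through an intrinsic.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_SSE3_PATH 1
#endif

/* Vertical: c[i] = a[i] + b[i]. Lane i only ever combines with lane i. */
void add_arrays(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i];
}

#ifdef HAVE_X86_SSE3_PATH
/* Horizontal: add neighbouring lanes *within* one vector using the
   SSE3 HADDPS instruction, written out by hand via its intrinsic. */
__attribute__((target("sse3")))
static float sum4_sse3(const float v[4]) {
    __m128 x = _mm_loadu_ps(v);
    x = _mm_hadd_ps(x, x);   /* (v0+v1, v2+v3, v0+v1, v2+v3) */
    x = _mm_hadd_ps(x, x);   /* every lane = (v0+v1) + (v2+v3) */
    return _mm_cvtss_f32(x); /* extract lane 0 */
}
#endif

/* Sum of four floats; falls back to scalar code without SSE3. */
float sum4(const float v[4]) {
#ifdef HAVE_X86_SSE3_PATH
    if (__builtin_cpu_supports("sse3"))
        return sum4_sse3(v);
#endif
    return (v[0] + v[1]) + (v[2] + v[3]);
}
```

Both paths of `sum4` add in the same order, so they return the same value; the point is only that the horizontal form had to be requested explicitly here.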

For hot kernels (say, memcpy), it is definitely the case that many projects ship several different implementations and use the version best suited to your current architecture. See https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86... for the different variants of common functions in glibc.