Not necessarily a myth anymore, strangely. If you apt/rpm install something nowadays, it probably wasn't built with support for newer CPU instructions (AVX, AES-NI, sometimes even SSE4). -march=native is gonna have a much bigger effect now than stage 1 Gentoo installs did in the mid-2000s.
(and, yes, I do remember waiting a full day for KDE to compile)
> -march=native is gonna have a much bigger effect now than stage 1 Gentoo installs did in the mid-2000s.
Actually, probably quite the contrary. All x86-64 chips are required to support SSE2, which lets you use SSE for floating-point instead of x87 floating-point, which is a big speed win. But the newer extensions fall into three groups: specialized SSE instructions, which generally have to be used by hand (intrinsics or assembly); specialized crypto instructions, which definitely have to be used by hand; and AVX instructions, which double the width of the vectors you can use. The first two are not going to be improved by -march; the code that uses them is almost certainly going to be compiled in a way that lets it dynamically use these instructions when available.
As for vectorized code that could use AVX instead, it's dubious how much of an effect it would have. The biggest improvement from vectorization already comes with 128-bit vectors; 256-bit vectors offer at most a 2x speedup in the vectorized code. That gain is reduced further by some code only being 128-bit-vectorizable (and so receiving no speedup), and by Amdahl's Law, which limits the overall benefit of speeding up just that code. Furthermore, vectorization tends to be much less relevant in the "integer" code that is typical of most consumer software, outside of a few hot loops that are already handled manually as above.
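To put a rough number on the Amdahl's Law point, here is a minimal sketch in C. The function name `amdahl_speedup` and the 20% figure in the note below are illustrative assumptions, not measurements of any real program.

```c
#include <assert.h>

/* Overall speedup when a `fraction` of total runtime (0..1) is made
 * `local_speedup` times faster and the rest is left unchanged.
 * This is the standard Amdahl's Law formula. */
double amdahl_speedup(double fraction, double local_speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

If, hypothetically, 20% of a program's runtime were in loops that get the full 2x from 256-bit vectors, `amdahl_speedup(0.2, 2.0)` comes out to roughly 1.11, i.e. about an 11% overall gain in the best case.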
Most of the SSE3, SSSE3, SSE4.1, and SSE4.2 instructions (there is no SSE5 in any released processor) are not particularly amenable to automatic vectorization, being mostly horizontal vector operations or oddball instructions that are pretty task-specific (hi, PCMPESTRI). You might see them come up in SLP vectorization, but my last experience with LLVM's SLP vectorizer is that it does a poor job of taking advantage of these kinds of instructions anyway.
For hot kernels (say, memcpy), it is definitely the case that many projects ship several implementations of each and pick, at runtime, the one best suited to the CPU they are running on. See https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86... for the different variants of common functions in glibc.
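For anyone who hasn't seen what that kind of dispatch looks like, here is a rough sketch in C. It is not how glibc does it (glibc has its own selection machinery); it just uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` plus the `target` function attribute. The names `sum_u32`, `sum_u32_baseline`, `sum_u32_avx2`, and `pick_sum_u32` are made up for illustration, and the lazy initialization is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline variant: built for the default target (SSE2 on x86-64). */
static uint32_t sum_u32_baseline(const uint32_t *v, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same source, but the compiler may use AVX2 inside this one function,
 * even though the rest of the file is built without -march=native. */
__attribute__((target("avx2")))
static uint32_t sum_u32_avx2(const uint32_t *v, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}
#define HAVE_AVX2_VARIANT 1
#endif

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Ask the CPU what it supports and pick a variant accordingly. */
static sum_fn pick_sum_u32(void) {
#ifdef HAVE_AVX2_VARIANT
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_u32_avx2;
#endif
    return sum_u32_baseline;
}

/* Public entry point: resolves the variant on first call, then reuses it. */
uint32_t sum_u32(const uint32_t *v, size_t n) {
    static sum_fn impl;
    if (!impl) impl = pick_sum_u32();
    assert(impl != NULL);
    return impl(v, n);
}
```

Callers just use `sum_u32(v, n)`; a binary built this way with generic flags still takes the AVX2 path on CPUs that have it, which is why -march=native adds little for code already written like this.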
> (and, yes, I do remember waiting a full day for KDE to compile)
https://web.archive.org/web/20080704112619/http://funroll-lo...