Hacker News

> Never ever build in '-O3 -march=native' by default. This is always a red flag and a sign of immaturity.

Perhaps you can see how there are some assumptions baked into that statement.



What assumptions would those be?

Shipping anything built with -march=native is a horrible idea. Even on homogeneous targets like one of the clouds, you never know if they'll e.g. switch CPU vendors.

The correct thing to do is use microarch levels (e.g. x86-64-v2) or build fully generic if the target architecture doesn't have MA levels.
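For anyone unfamiliar with microarch levels, here is a minimal Python sketch of how a set of CPU feature flags maps onto the x86-64 levels. The flag spellings assume Linux's /proc/cpuinfo (for instance `pni` for SSE3 and `abm` for LZCNT), and `microarch_level` and `host_flags` are illustrative names, not part of any existing tool.

```python
# Feature flags each x86-64 microarchitecture level adds on top of the previous one.
# Spellings assume Linux /proc/cpuinfo (an assumption; other OSes name them differently).
LEVELS = [
    ("x86-64", {"lm", "cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse2"}),
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def microarch_level(flags):
    """Return the highest level whose flags, and those of every lower level, are present."""
    flags = set(flags)
    best = None
    for name, required in LEVELS:
        if not required <= flags:
            break  # levels are cumulative, so stop at the first gap
        best = name
    return best


def host_flags(path="/proc/cpuinfo"):
    """Read the feature flags of the first CPU listed in /proc/cpuinfo (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux host, `microarch_level(host_flags())` gives the highest level the build machine supports. Targeting the lowest level across the whole fleet, rather than whatever the build machine happens to have, is the point of the approach.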


I build on the exact hardware I intend to deploy my software to and ship it to another machine with the same specs as the one it was built on.

I am willing to hear arguments for other approaches.


Not the OP, but: -march says the compiler can assume that the features of that particular CPU architecture family, which is broken out by generation, can be relied upon. In the worst case the compiler could in theory generate code that does not run on older CPUs of the same family or from different vendors.

-mtune says "generate code that is optimised for this architecture" but it doesn't trigger arch specific features.

Whether these are right or not depends on what you are doing. If you are building Gentoo on your laptop you should absolutely use -mtune=native and -march=native. That's the whole point: you get the most optimised code you can for your hardware.

If you are shipping code for a wide variety of architectures and crucially the method of shipping is binary form then you want to think more about what you might want to support. You could do either: if you're shipping standard software pick a reasonable baseline (check what your distribution uses in its cflags). If however you're shipping compute-intensive software perhaps you load a shared object per CPU family or build your engine in place for best performance. The Intel compiler quite famously optimised per family, included all the copies in the output and selected the worst one on AMD ;) (https://medium.com/codex/fixing-intel-compilers-unfair-cpu-d...)
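The "load a shared object per CPU family" idea might look roughly like the following Python sketch. The library file names and the `VARIANTS` table are hypothetical, the flag names assume Linux /proc/cpuinfo spellings, and the flag sets are deliberately abbreviated rather than complete requirements for any real build.

```python
import ctypes

# Hypothetical build variants, most specialised first. Each entry lists a few CPU
# flags that build relies on. File names and flag sets are illustrative assumptions.
VARIANTS = [
    ("libengine_avx512.so", {"avx512f", "avx512bw", "avx512vl"}),
    ("libengine_avx2.so", {"avx2", "fma", "bmi2"}),
    ("libengine_generic.so", set()),  # baseline build with no extra requirements
]


def pick_variant(cpu_flags, variants=VARIANTS):
    """Return the first variant whose required flags are all present on this CPU."""
    cpu_flags = set(cpu_flags)
    for filename, required in variants:
        if required <= cpu_flags:
            return filename
    raise RuntimeError("no compatible build variant")


def load_engine(cpu_flags):
    """Load the chosen shared object; raises OSError if the file cannot be loaded."""
    return ctypes.CDLL(pick_variant(cpu_flags))
```

Note that the selection here keys on feature flags rather than on the vendor string, so a CPU from any vendor gets the best variant it can actually run.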


> Not the OP, but: -march says the compiler can assume that the features of that particular CPU architecture family, which is broken out by generation, can be relied upon. In the worst case the compiler could in theory generate code that does not run on older CPUs of the same family or from different vendors.

Or on newer CPUs of the same vendor (e.g. AMD dropped some instructions in Zen that Intel didn't pick up) or even in different CPUs of the same generation (Intel market segmenting shenanigans with AVX512).


Just popping in here because people seem to be surprised by

> I build on the exact hardware I intend to deploy my software to and ship it to another machine with the same specs as the one it was built on.

This is exactly the use case in HPC. We always build -march=native and go to some trouble to enable all the appropriate vectorization flags (e.g., for PowerPC) that don't come along automatically with the -march=native setting.

Every HPC machine is a special snowflake, often with its own proprietary network stack, so you can forget about binaries being portable. Even on your own machine you'll be recompiling your binaries every time the machine goes down for a major maintenance.


If you get enough of them they can start to look like cattle.

Still, they are all the same breed.


I'm willing to hear arguments for your approach?

It certainly has scale issues when you need to support larger deployments.

[P.S.: the way I understand the words, "shipping" means "passing it off to someone else, likely across org boundaries" whereas what you're doing I'd call "deploying"]


So, do you see now the assumptions baked in your argument?

> when you need to support larger deployments

> shipping

> passing it off to someone else


On every project I've worked on, the PC I've had has been much better than the minimum PC required. Just because I'm writing code that will run nicely enough on a slow PC, that doesn't mean I need to use that same slow PC to build it!

And then, the binary that the end user receives will actually have been built on one of the CI systems. I bet they don't all have quite the same spec. And the above argument applies anyway.


So I take it you don't do cloud, embedded, game consoles, or mobile devices.

Quite hard to build on the exact hardware for those scenarios.


What?! seriously?!

I’ve never heard of anyone doing that.

If you use a cloud provider and use a remote development environment (VS Code remote/JetBrains Gateway) then you’re wrong: cloud providers swap out the CPUs without telling you and can sell newer CPUs at older prices if there’s less demand for the newer CPUs; you can’t rely on that.

To take an old naming convention, even an E3-Xeon CPU is not equivalent to an E5 of the same generation. I’m willing to bet it mostly works but your claim “I build on the exact hardware I ship on” is much more strict.

The majority of people I know use either laptops or workstations with Xeon workstation or Threadripper CPUs— but when deployed it will be a Xeon scalable datacenter CPU or an Epyc.

Hell, I work in gamedev and we cross compile basically everything for consoles.


… not everyone uses the cloud?

Some people, gasp, run physical hardware, that they bought.


So you buy the exact same generation of Intel and AMD chips for your developers as for your servers and your customers? And encode this requirement into your development process for the future?


No? That would be ridiculous. You’re inventing dumb scenarios to make your argument work.

It’s more like: some organizations buy many of the same model of server, make one or two of them their build machines, and use the rest as production. So it’d be totally fine to use -march=native there.

You just wouldn’t use those binaries anywhere else. Devs would simply do their own build locally (why does everyone act like this is impossible?) and use that. And obviously you don’t ship these binaries to customers… but, why are we suddenly talking about client software here? There’s a whole universe of software that exists to be a service and not a distributed binary, we’re clearly talking about that. Said software is typically distributed as source, if it’s distributed at all.

There’s a thousand different use cases for compiling software. Running locally, shipping binaries to users, HPC clusters, SaaS running on your own hardware… hell, maybe you’re running an HFT system and you need every microsecond of latency you can get. Do you really think there are no situations ever where -march=native is appropriate? That’s the claim we’re debunking, the idea that "-march=native is always always a mistake". It’s ridiculous.


We use physical hardware at work, but it's still not the way you build/deploy unless it's for a workstation/laptop type thing.

If you're deploying the binary to more than one machine, you quickly run into issues where the CPUs are different and you would need to rebuild for each of them. This is feasible if you have a couple of machines that you generally upgrade together, but quickly falls apart at just slightly more than 2 machines.


And all your deployed and dev machines run the same spec, with exactly the same CPU?

And you use them for remote development?

I think this is highly unusual.


Lots of organizations buy many of a single server spec. In fact that should be the default plan unless you have a good reason to buy heterogeneous hardware. With the way hardware depreciation works they tend to move to new server models “in bulk” as well, replacing entire clusters/etc at once. I’m not sure why this seems so foreign to folks…

Nobody is saying dev machines are building code that ships to their servers though… quite the opposite, a dev machine builds software for local use… a server builds software for running on other servers. And yes, often build machines are the same spec as the production ones, because they were all bought together. It’s not really rare. (Well, not using the cloud in general is “rare” but, that’s what we’re discussing.)


There is a large subset of devs who have worked their entire career on abstracted hardware which is fine I guess, just different domains.

The size of your L1/L2/L3 cache or the number of TLB misses doesn't matter too much if your python web service is just waiting for packets.


The only time I used -march=native was for a university assignment which was built and evaluated on the same server, and it allowed juicing an extra bit of performance. Using it basically means locking the program to the current CPU only.

However I'm not sure about -O3. I know it can make the binary larger, not sure about other downsides.


> The only time I used -march=native

It is completely fine to use -march=native, just do not make it the default for someone building your project.

That should always be something to opt-in.

The main reason is that software is a composite of (many) components. It quickly becomes a maintainability pain in the ass if any tiny library somewhere tries to sneak in '-march=native', which will make the final binary randomly crash with an illegal instruction error if executed on any CPU that is not exactly the same as the host.

When you design a build system configuration, think of others first (the users of your software), and yourself second.
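As a minimal sketch of what "opt-in" could mean in a build script, assuming a Python-driven build: the environment variable `MYPROJECT_NATIVE` and the function `build_flags` are made-up names for illustration, not a convention of any real build system.

```python
import os


def build_flags(env=None):
    """Return compiler flags: portable by default, host-specific only on explicit opt-in."""
    env = os.environ if env is None else env
    flags = ["-O2"]
    # MYPROJECT_NATIVE is a hypothetical opt-in switch, not a standard variable.
    if env.get("MYPROJECT_NATIVE") == "1":
        flags.append("-march=native")
    return flags
```

Someone who knows the binary will never leave their machine can set the switch themselves. Everyone else gets a build that does not depend on the host CPU.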


-O3 also makes build times longer (sometimes significantly), and occasionally the resulting program is actually slightly slower than -O2.

IME -O3 should only be used if you have benchmarks that show -O3 actually produces a speedup for your specific codebase.
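A rough Python sketch of such a benchmark check, under some assumptions: the two builds exist as separate executables (the `./app_O2` and `./app_O3` paths in the usage note are hypothetical), wall-clock time of a whole run is a meaningful metric for the workload, and the 5% margin is an arbitrary choice.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run cmd (a list of argv strings) several times and return wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def speedup(baseline, candidate):
    """Ratio of median times; a value above 1.0 means the candidate build is faster."""
    return statistics.median(baseline) / statistics.median(candidate)


def worth_enabling(baseline, candidate, threshold=1.05):
    """Prefer the candidate only if it wins by a margin (5% here, an arbitrary choice)."""
    return speedup(baseline, candidate) >= threshold
```

Usage would be something like `worth_enabling(time_command(["./app_O2"]), time_command(["./app_O3"]))`. Medians are used so that a single noisy run does not decide the outcome.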


This varies a lot between compilers. Clang, for example, treats O3 perf regressions as bugs (in many cases at least) and is a bit more reasonable with O3 on. GCC goes full mad max and you don't know what it's going to do.


If you have a lot of "data plane" code or other looping over data, you can see a big gain from -O3 because of more aggressive unrolling and vectorization (HPC people use -O3 quite a lot). CRUD-like applications and other things that are branchy and heavy on control flow will often see a mild performance regression from use of -O3 compared to -O2 because of more frequent clock-frequency drops caused by AVX instructions and larger binary size.


I made a program with some inline assembly and tried O3 with clang once. Because the assembly was in a loop, the compiler probably didn't have enough information on the actual code and decided to fully unroll all 16 iterations, making performance drop by 25% because the cache locality was completely destroyed. What I'm trying to say is that loop unrolling does not guarantee faster code in exchange for a larger binary.


Large blocks of inline assembly also destroy -O3. The compiler treats the asm statement as being essentially empty and makes decisions around it. Most inline asm is 1 instruction, so this is usually safe.


Not assumptions, experience.

I fully concur with that whole post as someone who also maintained a C++ codebase used in production.



