> people don't need to work out which version of cuda/which gpu settings/libraries/etc. to use
This is not true in my case. The default PyTorch build does not work on my system. I had to install a build specific to my system from the PyTorch website using --index-url.
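For anyone who has not run into this, here is a rough sketch of what "using --index-url" amounts to. It assumes PyTorch's per-backend wheel indexes live under https://download.pytorch.org/whl/ with suffixes such as cpu, cu118 or cu121; the exact set of suffixes changes between releases, so treat the backend names as placeholders and check the PyTorch install page. The helper name pip_install_command is made up for illustration.

```python
import sys

# Assumption: backend-specific wheel indexes are published under this base URL.
# The suffix ("cpu", "cu118", "cu121", ...) varies by PyTorch release.
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def pip_install_command(backend: str, package: str = "torch") -> list[str]:
    """Build (but do not run) a pip command for a backend-specific index.

    `backend` is a hypothetical parameter naming the index suffix.
    """
    if not backend or "/" in backend:
        raise ValueError(f"unexpected backend name: {backend!r}")
    index_url = f"{PYTORCH_WHEEL_BASE}/{backend}"
    # Use the current interpreter so the wheel lands in the active environment.
    return [sys.executable, "-m", "pip", "install", package, "--index-url", index_url]


if __name__ == "__main__":
    # Print the command instead of executing it; pass it to subprocess.run to install.
    print(" ".join(pip_install_command("cpu")))
```

The point is that the user, not the package manager, has to know which suffix matches their driver and GPU.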
> packages bundle all possible options into a single artifact
Cross-platform Java apps do it too. For example, see https://github.com/xerial/sqlite-jdbc. But it does not become a clusterfuck like it does with Python. After repeatedly downloading gigabytes and gigabytes of dependencies, the Python tool you are trying to run will still refuse to start for seemingly random reasons.
You cannot serve end-users a shit-sandwich of this kind.
The Python ecosystem is a big mess and, outside of a few projects like uv, I don't see anyone trying to build a sensible solution that improves both speed/performance and packaging/distribution.
That's a PyTorch issue. The solution is, as always, to build from source. You will understand how the system is assembled, and then you can build a minimal version that meets your specific needs (which, given that wheels are a well-defined format, you can then store on a server for reuse).
Cross-OS (especially with a VM like Java or JS) is relatively easy compared to needing specific versions for every single sub-architecture of a CPU and GPU system (and that's ignoring all the other bespoke hardware that's out there).
Cross-platform Java doesn't have the issue because the JVM is handling all of that for you. But if you want native extensions written in C, you get back to the same problem pretty quickly.
The SQLite project I linked to is a JDBC driver that uses the C version of the library appropriate to each OS. LWJGL (https://repo1.maven.org/maven2/org/lwjgl/lwjgl/3.3.6/) is another project that relies heavily on native code. But distributing these, or using them as dependencies, does not result in hair-pulling like it does with Python.
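To make the "bundle every native variant, pick one at load time" approach concrete, here is a minimal sketch in Python. The directory layout (native/<OS>/<arch>/...), the library file names, and the function bundled_library_path are all hypothetical, chosen only to illustrate the idea; this is not how sqlite-jdbc or LWJGL are actually implemented.

```python
import platform

# Hypothetical file names for one bundled native library per OS.
_LIB_NAMES = {
    "Linux": "libexample.so",
    "Darwin": "libexample.dylib",
    "Windows": "example.dll",
}

# Different OSes report the same CPU architecture under different names.
_ARCH_ALIASES = {
    "amd64": "x86_64",
    "x86_64": "x86_64",
    "arm64": "aarch64",
    "aarch64": "aarch64",
}


def bundled_library_path(system=None, machine=None) -> str:
    """Return the path (inside the package) of the native library to load.

    Defaults to the current host; arguments exist so the lookup can be tested.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    try:
        arch = _ARCH_ALIASES[machine]
        lib = _LIB_NAMES[system]
    except KeyError:
        raise RuntimeError(
            f"no bundled native library for {system}/{machine}"
        ) from None
    return f"native/{system}/{arch}/{lib}"


if __name__ == "__main__":
    print(bundled_library_path())
```

This works when the only axes are OS and CPU architecture, which is a handful of combinations.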
There's native code like SQLite which, assuming a sensible file system and C compiler, is quite portable, and then there's native code which cares about exact compiler versions, driver versions, and the exact model of your CPU, GPU and NIC. My suggestion is to go look at how to program a GPU using naive Vulkan/Metal, and then look for the dark magic that is used to make GPUs run fast. It's the latter you're encountering with the Python ML projects.