That problem is very much not pip (pip is only the installer). The issues are:
* We have a conflict between being easy to use (people don't need to work out which version of cuda/which gpu settings/libraries/etc. to use) and install size (it's basically the x86 vs arm issue, except at least 10-fold larger). Rather than making it the end user's problem, packages bundle all possible options into a single artifact (macOS does the same, but see the 10-fold larger issue).
* There are almost fundamental assumptions that Python packaging makes (and the newer Python packaging tools, including uv, very much rely on these) that are inherently about how a system should act (basically frozen, with no detection available apart from the "OS"), and these do not align with having hardware/software that must be detected. One could do this via sdists, but Windows plus the issues around dynamic metadata make this a non-starter. Hence tools like conda, spack and others from the more scientific side of the ecosystem have been created. Notably, on the more webby side, these problems are solved either by vendoring non-Python libraries or by making it someone/something else's problem, hence docker or the cloud for databases or other companion services.
* Frankly, more and more developers have no idea how systems are built (and this isn't just a Python issue). Docker lets people hide their sins, with magical invocations that just work (and static linking in many cases sadly does the same). There are tools out of the hyperscalers which are designed to solve these problems, but they solve them by giving experts a way to wrangle many systems, and hence imply you have a team which can do the wrangling.
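To make the "frozen, no detection apart from the OS" point concrete, here is a minimal sketch using only the standard library. It prints the kind of static facts (interpreter, Python version, OS/architecture) that a wheel's compatibility tag is built from. The helper name `static_platform_facts` is made up for illustration, and this is not how pip or uv actually compute tags. Note that nothing in it describes a GPU, a driver, or a CUDA version.

```python
import platform
import sys
import sysconfig


def static_platform_facts() -> dict:
    """Collect the static facts that wheel-style tags are derived from.

    Hypothetical helper for illustration only; real installers use
    more elaborate logic to build the full list of compatible tags.
    """
    return {
        # Which interpreter is running (e.g. "cpython").
        "implementation": sys.implementation.name,
        # Major.minor version of the interpreter.
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        # OS and architecture as seen by the build configuration.
        "platform": sysconfig.get_platform(),
        # CPU architecture as reported by the OS.
        "machine": platform.machine(),
    }


facts = static_platform_facts()
for key, value in facts.items():
    print(f"{key}: {value}")
```

Anything beyond these facts (which GPU is present, which driver is installed) has to be worked out by something other than the tag matching.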
Can this be solved? Maybe, but not by a new tool (on its own). It would require a lot of devs who may not see much improvement to their own workflow to change it for the benefit of others (moving to newer workflows which remove the assumptions built into the current ones), plus a bunch of work by key stakeholders (and maybe even the open sourcing of some crown jewels), and I don't see that happening.
> people don't need to work out which version of cuda/which gpu settings/libraries/etc. to use
This is not true in my case. The regular PyTorch package does not work on my system. I had to download a version specific to my system from the PyTorch website using --index-url.
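For anyone who hasn't run into this, the invocation looks roughly like the sketch below. The helper `pip_install_command` is made up for illustration, and the index URL is only an example of the pattern the PyTorch site hands out; the right one depends on your hardware and driver, so take it from the selector on the website rather than from here. The code only builds and prints the command, it does not run it.

```python
import sys


def pip_install_command(package: str, index_url: str) -> list[str]:
    """Build (but do not run) a pip command that uses an alternative index."""
    return [
        sys.executable, "-m", "pip", "install",
        package,
        "--index-url", index_url,
    ]


# Example URL only (an assumption); the correct one is system-specific.
cmd = pip_install_command("torch", "https://download.pytorch.org/whl/cu121")
print(" ".join(cmd))
```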
> packages bundle all possible options into a single artifact
Cross-platform Java apps do it too. For example, see https://github.com/xerial/sqlite-jdbc. But it does not become a clusterfuck like it does with Python. After downloading gigabytes and gigabytes of dependencies repeatedly, the Python tool you are trying to run will refuse to do so for random reasons.
You cannot serve end-users a shit-sandwich of this kind.
The Python ecosystem is a big mess and, outside of a few projects like uv, I don't see anyone trying to build a sensible solution that improves both speed/performance and packaging/distribution.
That's a PyTorch issue. The solution is, as always, to build from source. You will understand how the system is assembled, and then you can build a minimal version meeting your specific needs (which, given that wheels are a well-defined thing, you can then store on a server for reuse).
Cross-OS (especially with a VM like Java or JS) is relatively easy compared to needing specific versions for every single sub-architecture of a CPU and GPU system (and that's ignoring all the other bespoke hardware that's out there).
Cross-platform Java doesn't have the issue because the JVM is handling all of that for you. But if you want native extensions written in C, you get back to the same problem pretty quickly.
The SQLite project I linked to is a JDBC driver that makes use of the C version of the library appropriate to each OS. LWJGL (https://repo1.maven.org/maven2/org/lwjgl/lwjgl/3.3.6/) is another project which relies heavily on native code. But distributing these, or using them as dependencies, does not result in hair-pulling like it does with Python.
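The general pattern of "ship every native build in one artifact, pick one at load time" can be sketched as below. This is an illustration with made-up paths and names (`BUNDLED`, `pick_native_library`, `libexample`), not the actual code of sqlite-jdbc or LWJGL. The lookup key is just OS plus CPU architecture.

```python
import platform
from typing import Optional

# Hypothetical layout of native libraries bundled inside a single artifact.
BUNDLED = {
    ("Linux", "x86_64"): "native/linux-x86_64/libexample.so",
    ("Linux", "aarch64"): "native/linux-aarch64/libexample.so",
    ("Darwin", "x86_64"): "native/macos-x86_64/libexample.dylib",
    ("Darwin", "arm64"): "native/macos-arm64/libexample.dylib",
    ("Windows", "AMD64"): "native/windows-x86_64/example.dll",
}


def pick_native_library(system: Optional[str] = None,
                        machine: Optional[str] = None) -> str:
    """Return the bundled library path for the given (or current) platform."""
    key = (system or platform.system(), machine or platform.machine())
    try:
        return BUNDLED[key]
    except KeyError:
        raise RuntimeError(f"no bundled native library for {key}") from None


print(pick_native_library("Linux", "x86_64"))
print(pick_native_library("Darwin", "arm64"))
```

Calling `pick_native_library()` with no arguments uses whatever `platform.system()` and `platform.machine()` report for the current machine.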
There's native code like SQLite which, assuming a sensible file system and C compiler, is quite portable, and then there's native code which cares about exact compiler versions, driver versions, and the exact model of your CPU, GPU and NIC. My suggestion is to go look at how to program a GPU using naive Vulkan/Metal, and then look for the dark magic that is used to make GPUs run fast. It's the latter you're encountering with the ML Python projects.