Compute. The NVIDIA cards have way more tflops than the Mac SoC. A simple RTX 4090 has 16,384 CUDA cores. The M2's GPU has... 38 cores, apparently? The M2 GPU clocks in at 3.6tflops, compared to ~100tflops for the RTX 4090.
And this is just for a consumer GPU, I haven't even touched on the datacenter-grade stuff.
tl;dr the M2 is an underpowered GPU with a lot of RAM close by, while NVIDIA cards are multiple orders of magnitude more powerful but most of the RAM's a bit farther away.
seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.
• The GPU in the base M2 has 10 GPU cores and 3.6 tflops
• The GPU in the M2 Ultra has 76 GPU cores and corresponds to 2x M2 Max, which has 13.6 tflops; so 27.2 tflops [1,2]
• RTX 4090 has 82.58 tflops [3] (overclocked can reach 100 tflops [4])
More powerful, yes, but NVIDIA cards are not "multiple orders of magnitude" more powerful. Rather, it seems a 4090 is around 3-4x faster than an M2 Ultra GPU.
Keep in mind that the Apple Silicon chips also have low-precision Neural Engine circuits for neural net inference. For the M2 Ultra they claim 31.6 TOPS [1].
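To make the ratio concrete, here is a quick back-of-the-envelope sketch using only the tflops figures quoted in the list above. The variable names are mine, and peak FP32 numbers are of course only a rough proxy for real-world speed.

```python
import math

# Peak FP32 throughput in TFLOPS, as quoted in the list above.
M2_BASE = 3.6
M2_ULTRA = 27.2          # 2 x M2 Max at 13.6 each
RTX_4090_STOCK = 82.58
RTX_4090_OC = 100.0      # overclocked figure

stock_vs_ultra = RTX_4090_STOCK / M2_ULTRA
oc_vs_ultra = RTX_4090_OC / M2_ULTRA
stock_vs_base = RTX_4090_STOCK / M2_BASE

print(f"4090 stock vs M2 Ultra: {stock_vs_ultra:.2f}x")
print(f"4090 OC    vs M2 Ultra: {oc_vs_ultra:.2f}x")
print(f"4090 stock vs base M2:  {stock_vs_base:.1f}x")

# "Orders of magnitude" means powers of ten, so take log10 of the ratio.
print(f"orders of magnitude vs M2 Ultra: {math.log10(stock_vs_ultra):.2f}")
print(f"orders of magnitude vs base M2:  {math.log10(stock_vs_base):.2f}")
```

Even against the base M2 the gap is a bit over one order of magnitude, and against the M2 Ultra it is about half of one.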
Probably true that performance doesn't scale linearly (and I do remember the Max Tech analysis of the M1), but I don't have the tflops number for the M2 Ultra.
For the M1 the stated values are:
M1 Ultra [1]: 21 TFLOPS
M1 Max [2]: 10.6 TFLOPS
Actual performance obviously depends on the type of workload etc. E.g. in Geekbench the Ultra is only about 40% faster [3]:
M1 Ultra Geekbench 6: 150260
M1 Max Geekbench 6: 108198
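For anyone who wants to check the scaling, here is a small sketch that compares the on-paper ratio with the measured one, using only the numbers quoted above. "Scaling efficiency" is just my label for measured ratio divided by paper ratio.

```python
# Stated peak FP32 throughput (TFLOPS) and Geekbench 6 scores from above.
tflops = {"M1 Max": 10.6, "M1 Ultra": 21.0}
geekbench = {"M1 Max": 108198, "M1 Ultra": 150260}

paper_ratio = tflops["M1 Ultra"] / tflops["M1 Max"]
measured_ratio = geekbench["M1 Ultra"] / geekbench["M1 Max"]

# How much of the on-paper doubling actually shows up in the benchmark.
scaling_efficiency = measured_ratio / paper_ratio

print(f"paper ratio:        {paper_ratio:.2f}x")
print(f"measured ratio:     {measured_ratio:.2f}x")
print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

So on paper the Ultra is roughly 2x the Max, but in this benchmark it lands closer to 1.4x.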
The GPU is $4500. That gives you $2000 to build a machine (and you can build a top-of-the-line consumer PC for less than that; I just did recently with 128GB of DDR5 and a Zen 4 7950X3D).
You're not wrong that the RTX 4090 will be much more powerful than an M2 Ultra, but you don't have any idea what you're talking about when you're comparing specs.
You might be able to compare the number of CUDA cores to the ALU count of the Apple GPUs. I don't know what that is for M2 Ultra yet, but for the 64 core M1 Ultra each core had 16 execution units and each of those had 8 ALUs, for a total of 8,192 ALUs. The M1 Ultra's FP32 performance was in the ballpark of 21tflops - assuming a ~30% improvement in the M2 Ultra that takes us to ~27tflops. Google suggests that for the RTX 4090 it's 83tflops.
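The arithmetic above can be sketched like this. Two things here are my assumptions, not from the thread: each ALU is counted as 2 FLOPs per cycle (one fused multiply-add), and the RTX 4090 boost clock is taken as 2.52 GHz. The M1 Ultra clock is not assumed; it is backed out from the 21 tflops figure.

```python
def alu_count(cores, eus_per_core, alus_per_eu):
    """Total ALUs for a GPU laid out as cores -> execution units -> ALUs."""
    return cores * eus_per_core * alus_per_eu

def peak_tflops(alus, clock_ghz, flops_per_alu_per_cycle=2):
    """Peak FP32 TFLOPS; the default of 2 assumes one FMA per ALU per cycle."""
    return alus * flops_per_alu_per_cycle * clock_ghz / 1000

# M1 Ultra: 64 cores x 16 execution units x 8 ALUs.
m1_ultra_alus = alu_count(64, 16, 8)

# Clock implied by the ~21 tflops figure (derived, not an official spec).
m1_ultra_implied_ghz = 21.0 * 1000 / (m1_ultra_alus * 2)

# M2 Ultra guess: ~30% on top of the M1 Ultra.
m2_ultra_estimate = 21.0 * 1.30

# RTX 4090: 16,384 CUDA cores; 2.52 GHz boost clock is an assumption here.
rtx_4090 = peak_tflops(16384, 2.52)

print(f"M1 Ultra ALUs:          {m1_ultra_alus}")
print(f"M1 Ultra implied clock: {m1_ultra_implied_ghz:.2f} GHz")
print(f"M2 Ultra estimate:      {m2_ultra_estimate:.1f} TFLOPS")
print(f"RTX 4090 peak:          {rtx_4090:.2f} TFLOPS")
```

The point is that the 4090 has twice the ALUs running at roughly twice the clock, which is where the ~4x gap over the M1 Ultra comes from.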
There is a video [1] benchmarking last year's M1 Ultra against a 3080 Ti for one ML use case, and it was 3x slower. Apple's M-series chips do have Neural Engines, which are used, but I'm not sure how they compare to CUDA cores.
In that video they run the Linux experiments in a virtual machine on Windows. And I didn't see the model, but I bet you I could train a model on a 4090 only 2x faster than on an old 1050 (because I can choose a model whose bottleneck is the data transfer, not the actual computation).
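Here is a toy model of that bottleneck argument. All the millisecond figures and the 40x compute advantage are made up purely for illustration, and the model assumes data transfer and compute overlap perfectly, so each training step takes as long as the slower of the two.

```python
def step_time_ms(data_ms, compute_ms):
    """Time per training step if transfer and compute fully overlap."""
    return max(data_ms, compute_ms)

# Hypothetical numbers, not measurements.
DATA_MS = 10.0               # per-batch data loading/transfer, same on both
SLOW_GPU_COMPUTE_MS = 20.0   # old card is compute-bound
FAST_GPU_COMPUTE_MS = 0.5    # new card is 40x faster at the math

slow_step = step_time_ms(DATA_MS, SLOW_GPU_COMPUTE_MS)
fast_step = step_time_ms(DATA_MS, FAST_GPU_COMPUTE_MS)

raw_compute_speedup = SLOW_GPU_COMPUTE_MS / FAST_GPU_COMPUTE_MS
observed_speedup = slow_step / fast_step

print(f"raw compute speedup: {raw_compute_speedup:.0f}x")
print(f"observed speedup:    {observed_speedup:.0f}x")
```

With these numbers the faster card spends most of each step waiting on data, so a 40x compute advantage shows up as only 2x end to end.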