Compute. The NVIDIA cards have way more tflops than the Mac SoC. A simple RTX 40...

joakleaf · on June 6, 2023

• The GPU in the base M2 has 10 GPU cores and 3.6 tflops

• The GPU in the M2 Ultra has 76 GPU cores and corresponds to 2x M2 Max, which has 13.6 tflops; so 27.2 tflops [1,2]

• RTX 4090 has 82.58 tflops [3] (overclocked can reach 100 tflops [4])

While more powerful, NVIDIA cards are not "multiple orders of magnitude more". Rather, it seems a 4090 is around 3-4x faster than an M2 Ultra GPU.
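As a quick sanity check, the ratio can be worked out from the peak FP32 figures quoted in the bullets. This is only a sketch on paper specs: it assumes the M2 Ultra really is 2x the M2 Max, and the variable names are made up for illustration.

```python
# Peak FP32 figures as quoted in the bullets above (TFLOPS).
M2_MAX_TFLOPS = 13.6
M2_ULTRA_TFLOPS = 2 * M2_MAX_TFLOPS   # assumes perfect 2x scaling from Max to Ultra
RTX_4090_TFLOPS = 82.58
RTX_4090_OC_TFLOPS = 100.0            # overclocked figure from [4]

stock_ratio = RTX_4090_TFLOPS / M2_ULTRA_TFLOPS
oc_ratio = RTX_4090_OC_TFLOPS / M2_ULTRA_TFLOPS

print(f"stock: {stock_ratio:.2f}x, overclocked: {oc_ratio:.2f}x")
# prints "stock: 3.04x, overclocked: 3.68x"
```

Peak tflops say nothing about memory bandwidth or software support, so this only bounds the gap on paper.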

Keep in mind that the Apple Silicon chips also have low-precision Neural Engine circuits for neural net inference. For the M2 Ultra they claim 31.6 tops [1].

[1] https://www.apple.com/newsroom/2023/06/apple-introduces-m2-u...

[2] https://www.notebookcheck.net/Apple-unveils-M2-Pro-and-M2-Ma...

[3] https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

[4] https://videocardz.com/newz/overclocked-nvidia-rtx-4090-gpu-...

lostmsu · on June 6, 2023

Since we are talking about AI, you should use Tensor Core TOPS instead of regular flops. That number for the 4090 is 330, so more than 10x that of the M2 Ultra.

mosshammer · on June 6, 2023

The RTX 4090 has 330.3 TOPS at INT8 precision, so for inference workloads it is still an order of magnitude faster.
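For what it's worth, here is the same kind of check for the TOPS figures, using the 31.6 TOPS Neural Engine number quoted earlier in the thread. It assumes the two numbers are comparable at all, which the thread does not establish (the precision behind Apple's figure is not stated here).

```python
import math

RTX_4090_INT8_TOPS = 330.3   # figure quoted in this comment
M2_ULTRA_ANE_TOPS = 31.6     # Apple's claimed Neural Engine figure from upthread

tops_ratio = RTX_4090_INT8_TOPS / M2_ULTRA_ANE_TOPS
orders_of_magnitude = math.log10(tops_ratio)

print(f"ratio: {tops_ratio:.1f}x, log10: {orders_of_magnitude:.2f}")
# prints "ratio: 10.5x, log10: 1.02"
```

So "about one order of magnitude" holds on these numbers, but only just.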

fyzix · on June 6, 2023

Max to Ultra doesn't scale linearly. Max Tech on YouTube covered this extensively.

joakleaf · on June 6, 2023

Probably true that performance doesn't scale linearly (and I do remember the Max Tech analysis of the M1), but I don't have the tflops number for the M2 Ultra.

For the M1 the stated values are:

  M1 Ultra [1]: 21 TFLOPS
  M1 Max [2]: 10.6 TFLOPS

Actual performance obviously depends on the type of workload etc. E.g. in Geekbench the Ultra is only about 40% faster [3]:

  M1 Ultra Geekbench 6: 150260 
  M1 Max Geekbench 6: 108198

[1] https://videocardz.com/newz/despite-apples-claims-m1-ultra-g...

[2] https://wccftech.com/m1-max-teraflops-performance-higher-tha...

[3] https://browser.geekbench.com/metal-benchmarks
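To put a number on "doesn't scale linearly", here is a rough sketch comparing the on-paper scaling with the measured one, using only the M1 figures above. The "efficiency" measure is just something defined here for illustration, not an established metric.

```python
# Figures quoted above for the M1 generation.
m1_max_tflops, m1_ultra_tflops = 10.6, 21.0
m1_max_gb6, m1_ultra_gb6 = 108198, 150260

spec_scaling = m1_ultra_tflops / m1_max_tflops   # what the spec sheet implies
measured_scaling = m1_ultra_gb6 / m1_max_gb6     # what Geekbench 6 Metal shows
efficiency = measured_scaling / spec_scaling     # share of the paper gain realized

print(f"spec: {spec_scaling:.2f}x, measured: {measured_scaling:.2f}x, "
      f"efficiency: {efficiency:.2f}")
# prints "spec: 1.98x, measured: 1.39x, efficiency: 0.70"
```

One benchmark is one workload, so the 0.70 should not be read as a general scaling factor for the M2 Ultra.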

senko · on June 6, 2023

> seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.

Yes, but, if you need 48GB to run inference on a model, and you only have 24GB available, you don't get to enjoy the tflops difference.

If nVidia released somewhat lower-performance GPUs but with more VRAM, we could talk. But they're not stupid :)
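A back-of-the-envelope version of that point, counting weights only and ignoring activations, KV cache and framework overhead (so the real requirement is higher). The 24B-parameter fp16 model is purely hypothetical, picked so that the weights come to 48GB.

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bytes_per_param


def fits(params_billions: float, bytes_per_param: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weights_gb(params_billions, bytes_per_param) <= memory_gb


# Hypothetical 24B-parameter model stored in fp16 (2 bytes per parameter).
needed = weights_gb(24, 2)
print(f"needs {needed} GB; fits in 24 GB: {fits(24, 2, 24)}; "
      f"fits in 48 GB: {fits(24, 2, 48)}")
```

If the check fails, the tflops number never comes into play unless you offload or shard the model.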

nightski · on June 6, 2023

You can get an A6000 for cheaper than an M2 Ultra. That said it's still only 48GB.

senko · on June 6, 2023

I'm not sure if it's cheaper when you compare the whole machine price, not just GPU, since you can't buy piecewise with the Mac.

nightski · on June 6, 2023

The GPU is $4500. That leaves $2000 to build the rest of the machine (you can build a top-of-the-line consumer PC for less than that; I recently did, with 128GB DDR5 and a Zen 4 7950X3D).

senko · on June 6, 2023

Cool if you can find it at that price. On my side of the pond, it's around €6,000.

sterlind · on June 8, 2023

yes you do. you can use DeepSpeed. it's only like 2x slower.

argsnd · on June 6, 2023

You're not wrong that the RTX 4090 will be much more powerful than an M2 Ultra, but you don't have any idea what you're talking about when you're comparing specs.

You might be able to compare the number of CUDA cores to the ALU count of the Apple GPUs. I don't know what that is for the M2 Ultra yet, but for the 64-core M1 Ultra each core had 16 execution units and each of those had 8 ALUs, for a total of 8,192 ALUs. The M1 Ultra's FP32 performance was in the ballpark of 21 tflops; assuming a ~30% improvement in the M2 Ultra, that takes us to ~27 tflops. Google suggests that for the RTX 4090 it's 83 tflops.
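The ALU arithmetic, and how it roughly lines up with the tflops figure, can be sketched as below. The clock speed is an assumption (about 1.3 GHz, not stated in this thread), as is counting a fused multiply-add as 2 FLOPs per ALU per cycle, which is the usual convention for peak numbers.

```python
# M1 Ultra GPU layout as described above.
cores, eus_per_core, alus_per_eu = 64, 16, 8
alus = cores * eus_per_core * alus_per_eu

ASSUMED_CLOCK_GHZ = 1.3   # assumption, not from the thread
FLOPS_PER_ALU_CYCLE = 2   # assumption: one FMA counted as 2 FLOPs

# ALUs * FLOPs/cycle * GHz gives GFLOPS; divide by 1000 for TFLOPS.
m1_ultra_tflops = alus * FLOPS_PER_ALU_CYCLE * ASSUMED_CLOCK_GHZ / 1000

# The ~30% uplift for the M2 Ultra is the guess made above, applied to 21 tflops.
m2_ultra_estimate = 21 * 1.30

print(alus, round(m1_ultra_tflops, 1), round(m2_ultra_estimate, 1))
# prints "8192 21.3 27.3"
```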

cypress66 · on June 6, 2023

The cuda cores and m2 cores aren't comparable.

A more appropriate comparison is the fp16 performance.

It seems to be 27 tflops for the 38 core M2, and 330 for the 4090.

The more useful figure for training, fp16 with fp32 accumulate, is 165 for the 4090; I don't know about the Apple one.

threeseed · on June 6, 2023

There is a video [1] benchmarking last year's M1 Ultra against a 3080 Ti for one ML use case, and it was 3x slower. Apple's M-series chips do have Neural Engines, which are used, but I'm not sure how they compare to CUDA cores.

Either way quite a bit better than 30x slower.

[1] https://www.youtube.com/watch?v=k_rmHRKc0JM

jorgemf · on June 6, 2023

In that video they run the Linux experiments in a virtual machine on Windows. And I didn't see the model, but I bet I can train a model on a 4090 only 2x faster than on an old 1050 (because I can choose a model whose bottleneck is the data transfer, not the actual computation).