
Compute. The NVIDIA cards have way more tflops than the Mac SoC. A simple RTX 4090 has 16,384 CUDA cores. The M2's GPU has... 38 cores, apparently? The M2 GPU clocks in at 3.6tflops, compared to ~100tflops for the RTX 4090.

And this is just for a consumer GPU, I haven't even touched on the datacenter-grade stuff.

tl;dr the M2 is an underpowered GPU with a lot of RAM close by, while NVIDIA cards are multiple orders of magnitude more powerful but most of the RAM's a bit farther away.

seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.



• The GPU in the base M2 has 10 GPU cores and 3.6 tflops

• The GPU in the M2 Ultra has 76 GPU cores and corresponds to 2x M2 Max, which has 13.6 tflops; so 27.2 tflops [1,2]

• RTX 4090 has 82.58 tflops [3] (overclocked can reach 100 tflops [4])

While more powerful, NVIDIA cards are not "multiple orders of magnitude more" powerful. Rather, it seems a 4090 is around 3-4x faster than an M2 Ultra GPU.
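
As a rough check on that ratio, here is a minimal Python sketch using only the peak figures quoted in the bullets above. The names are made up for illustration, the M2 Ultra figure assumes perfect 2x scaling from the M2 Max, and peak tflops say nothing about real-world speed.

```python
# Peak FP32 throughput in tflops, as quoted in this thread (not measured).
M2_MAX_TFLOPS = 13.6
M2_ULTRA_TFLOPS = 2 * M2_MAX_TFLOPS   # assumption: Ultra = exactly 2x Max
RTX_4090_TFLOPS = 82.58
RTX_4090_OC_TFLOPS = 100.0            # overclocked figure from the thread


def speedup(fast: float, slow: float) -> float:
    """Ratio of two peak-throughput figures."""
    return fast / slow


stock_ratio = speedup(RTX_4090_TFLOPS, M2_ULTRA_TFLOPS)
oc_ratio = speedup(RTX_4090_OC_TFLOPS, M2_ULTRA_TFLOPS)

print(f"stock: {stock_ratio:.1f}x, overclocked: {oc_ratio:.1f}x")
# prints: stock: 3.0x, overclocked: 3.7x
```

Both ratios are well under 10x, so on these numbers the gap is less than one order of magnitude.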

Keep in mind that the Apple Silicon chips also have low-precision Neural Engine circuits for neural-net inference. For the M2 Ultra they claim 31.6 TOPS [1].

[1] https://www.apple.com/newsroom/2023/06/apple-introduces-m2-u...

[2] https://www.notebookcheck.net/Apple-unveils-M2-Pro-and-M2-Ma...

[3] https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

[4] https://videocardz.com/newz/overclocked-nvidia-rtx-4090-gpu-...


Since we are talking about AI, you should use Tensor Core TOPS instead of regular flops. That number for the 4090 is 330, so more than 10x the M2 Ultra.


The RTX 4090 has 330.3 TOPS at INT8 precision, so for inference workloads it is still an order of magnitude faster.


Max to Ultra doesn't scale linearly. Max Tech on YouTube covered this extensively.


It's probably true that performance doesn't scale linearly (and I do remember the Max Tech analysis of the M1), but I don't have the measured tflops number for the M2 Ultra.

For the M1 the stated values are:

  M1 Ultra [1]: 21 TFLOPS
  M1 Max [2]: 10.6 TFLOPS

Actual performance obviously depends on the type of workload etc. E.g. in Geekbench the Ultra is only about 40% faster [3]:

  M1 Ultra Geekbench 6: 150260 
  M1 Max Geekbench 6: 108198

[1] https://videocardz.com/newz/despite-apples-claims-m1-ultra-g...

[2] https://wccftech.com/m1-max-teraflops-performance-higher-tha...

[3] https://browser.geekbench.com/metal-benchmarks
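
To make the scaling gap concrete, here is a small Python sketch that compares the on-paper Max-to-Ultra ratio with the benchmark ratio, using only the M1 figures quoted above. The variable names are illustrative.

```python
# Stated peak throughput (tflops) and Geekbench 6 Metal scores from this comment.
m1_max_tflops, m1_ultra_tflops = 10.6, 21.0
m1_max_gb6, m1_ultra_gb6 = 108198, 150260

spec_scaling = m1_ultra_tflops / m1_max_tflops   # on-paper scaling
bench_scaling = m1_ultra_gb6 / m1_max_gb6        # scaling in this one benchmark

# Fraction of the ideal 2x (two Max dies) that the benchmark actually shows.
scaling_efficiency = bench_scaling / 2.0

print(f"spec: {spec_scaling:.2f}x, benchmark: {bench_scaling:.2f}x")
# prints: spec: 1.98x, benchmark: 1.39x
```

This is a single benchmark, so it only illustrates that the stated tflops and measured results can diverge; other workloads may scale better or worse.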


> seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.

Yes, but, if you need 48GB to run inference on a model, and you only have 24GB available, you don't get to enjoy the tflops difference.

If nVidia released somewhat lower-performance GPUs but with more VRAM, we could talk. But they're not stupid :)
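
The capacity argument can be sketched as a back-of-the-envelope check. This counts weights only (no activations, KV cache, or framework overhead), and the 24-billion-parameter model is a hypothetical chosen so that fp16 weights come to the 48GB mentioned above.

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits(n_params: float, bytes_per_param: float, vram_gb: float) -> bool:
    """True if the weights alone fit in the given amount of memory."""
    return weight_gb(n_params, bytes_per_param) <= vram_gb


n_params = 24e9          # hypothetical model size
FP16, INT8 = 2, 1        # bytes per parameter

for vram_gb in (24, 48):
    print(f"{vram_gb}GB: fp16 fits={fits(n_params, FP16, vram_gb)}, "
          f"int8 fits={fits(n_params, INT8, vram_gb)}")
```

If the weights do not fit, the faster card has to fall back to offloading or a smaller or more heavily quantized model, which is where the raw tflops advantage stops applying directly.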


You can get an A6000 for cheaper than an M2 Ultra. That said it's still only 48GB.


I'm not sure if it's cheaper when you compare the whole machine price, not just GPU, since you can't buy piecewise with the Mac.


The GPU is $4500. That leaves you $2000 to build the rest of the machine, and you can build a top-of-the-line consumer PC for less than that (I just did recently, with 128GB DDR5 and a Zen 4 7950X3D).


Cool if you can find it at that price. On my side of the pond, it's around €6,000.


yes you do. you can use DeepSpeed. it's only like 2x slower.


You're not wrong that the RTX 4090 will be much more powerful than an M2 Ultra, but you don't have any idea what you're talking about when you're comparing specs.

You might be able to compare the number of CUDA cores to the ALU count of the Apple GPUs. I don't know what that is for M2 Ultra yet, but for the 64 core M1 Ultra each core had 16 execution units and each of those had 8 ALUs, for a total of 8,192 ALUs. The M1 Ultra's FP32 performance was in the ballpark of 21tflops - assuming a ~30% improvement in the M2 Ultra that takes us to ~27tflops. Google suggests that for the RTX 4090 it's 83tflops.
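
The arithmetic in this comment can be checked with a short Python sketch. It assumes each ALU (or CUDA core) retires one fused multiply-add, i.e. 2 flops, per cycle, and back-solves the clock that the quoted tflops would imply; the function name is made up for illustration and the clocks are derived values, not measured ones.

```python
# M1 Ultra ALU count as described above: cores x execution units x ALUs.
m1_ultra_alus = 64 * 16 * 8
rtx_4090_cuda_cores = 16384

FLOPS_PER_ALU_PER_CYCLE = 2  # assumption: one FMA per ALU per cycle


def implied_clock_ghz(tflops: float, alus: int) -> float:
    """Clock (GHz) at which `alus` units would reach `tflops` peak."""
    return tflops * 1e12 / (alus * FLOPS_PER_ALU_PER_CYCLE) / 1e9


# The comment's guess for the M2 Ultra: ~30% over the M1 Ultra's 21 tflops.
m2_ultra_estimate = 21.0 * 1.3

print(m1_ultra_alus)                                            # 8192
print(f"{implied_clock_ghz(21.0, m1_ultra_alus):.2f} GHz")       # 1.28 GHz
print(f"{implied_clock_ghz(82.58, rtx_4090_cuda_cores):.2f} GHz")  # 2.52 GHz
print(f"{m2_ultra_estimate:.1f} tflops")                        # 27.3 tflops
```

On these assumptions the 4090 has twice the ALUs running at roughly twice the clock, which accounts for the roughly 4x gap to the M1 Ultra in peak FP32.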


CUDA cores and M2 GPU cores aren't comparable.

A more appropriate comparison is the fp16 performance.

It seems to be 27 tflops for the 38 core M2, and 330 for the 4090.

The more useful figure for training, fp16 with fp32 accumulate, is 165 for the 4090. I don't know the equivalent for the Apple chip.


There is a video [1] benchmarking last year's M1 Ultra against a 3080 Ti for one ML use case, and it was 3x slower. Apple MX does have Neural Engines, which are used, but I'm not sure how they compare to CUDA cores.

Either way quite a bit better than 30x slower.

[1] https://www.youtube.com/watch?v=k_rmHRKc0JM


In that video they run the Linux experiments in a virtual machine on Windows. I didn't see which model they used, but I bet I could find a model that trains only 2x faster on a 4090 than on an old 1050, because I can choose a model whose bottleneck is data transfer rather than the actual computation.



