NVIDIA vs. AMD and workstation vs. consumer: Who has the edge in the AI graphics card benchmarks?

23. May 2024 05:30

Today we’re doing something completely different: for once it’s not about gaming, which is getting boring by now, but about the new golden calf, namely AI. NVIDIA’s record quarterly revenue of 26.04 billion dollars, announced yesterday, represents a year-over-year increase of 262 percent, so it was simply time for a test. I am testing a total of 12 graphics cards, 6 from AMD and 6 from NVIDIA. What makes the selection special is that the three fastest workstation cards and the three fastest consumer cards from each manufacturer compete against each other and, in NVIDIA’s case, with and without the use of Tensor cores.

The UL Procyon AI Computer Vision Benchmark I am using today offers exactly the detailed insight we need into the performance of AI inference engines on this hardware in a Windows environment. It includes multiple AI inference engines from different vendors and evaluates the performance of on-device inference operations.

AI workloads and tasks

The AI workloads include common machine vision tasks such as image classification, image segmentation, object detection and super-resolution. These tasks are performed using a set of popular state-of-the-art neural networks running on the device’s CPU, GPU or a dedicated AI accelerator to benchmark hardware performance. Various SDKs are used to measure AI inference performance, including:

Microsoft® Windows ML
Qualcomm® SNPE
Intel® OpenVINO™
NVIDIA® TensorRT™
Apple® Core ML™
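
To illustrate what "measuring inference performance" means in practice, here is a minimal sketch of a latency measurement with warm-up runs and a median over several timed runs. It is not how Procyon works internally; `measure_latency_ms` and `dummy_inference` are hypothetical names, and the dummy workload merely stands in for a real call into one of the SDKs listed above.

```python
import statistics
import time


def measure_latency_ms(run_inference, warmup=3, runs=20):
    """Return the median wall-clock time of run_inference() in milliseconds."""
    # Warm-up runs are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        run_inference()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples)


def dummy_inference():
    # Hypothetical stand-in workload; a real test would call an inference SDK here.
    return sum(i * i for i in range(50_000))


latency = measure_latency_ms(dummy_inference)
print(f"median latency: {latency:.3f} ms")
```

Swapping `dummy_inference` for a function that runs one model on one engine would give comparable numbers per engine, which is the basic idea behind any such benchmark.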

The benchmark uses various neural network models, including:

MobileNet V3: Optimized for visual recognition on mobile devices.
Inception V4: An accurate model for image classification tasks.
YOLO V3: For object detection, i.e. recognizing and localizing objects in images.
DeepLab V3: For semantic image segmentation.
Real-ESRGAN: For super-resolution to upscale images to a higher resolution.
ResNet 50: A 50-layer image classification network whose residual (skip) connections make it possible to train much deeper neural networks.

The benchmark includes both float (FP32, FP16) and integer-optimized versions of each model, which run sequentially on all compatible hardware components of the device. I explain each of these individual benchmarks in detail on the respective page, because I can’t assume that everyone knows exactly what I’m testing. I am sure, however, that the topic is (a) interesting and (b) future-oriented, so that (c) readers will care about it too.

The results provide detailed insights into AI inference performance, including the comparability of float and integer-optimized models, as well as performance measurements across the GPU and specialized AI accelerators. The benchmark is designed primarily for engineering teams and professional users who need independent, standardized tools to evaluate the overall AI performance of inference engine implementations and dedicated hardware. It helps hardware manufacturers, companies and the press make informed decisions and verify the quality of AI inference. And at “press”, I simply felt addressed.

In the world of artificial intelligence and machine learning, the FP32, FP16 and integer data types play a crucial role in the performance and efficiency of computations on GPUs. Each of these data types has specific advantages and disadvantages that can vary depending on the use case and hardware architecture. This is one of the reasons why I show all results separately and have also run all the cards with each data type individually. With quite interesting results, by the way.

FP32 (32-bit floating point)

Advantages:

Precision: FP32 offers high accuracy and is therefore ideal for applications that require high numerical precision, such as scientific calculations and complex models.
Compatibility: Many existing neural networks and frameworks are optimized for FP32 and deliver the best results here.

Disadvantages:

Power consumption: FP32 calculations are more demanding and need more memory per value, which results in higher power consumption and lower efficiency.
Speed: FP32 calculations are slower than FP16 and integer calculations on hardware that accelerates the lower-precision formats, which reduces throughput.

FP16 (16-bit floating point)

Advantages:

Performance: FP16 calculations are faster and require less energy than FP32, which increases efficiency and throughput rate.
Memory requirement: The memory requirement is lower, which means that more data can be processed and stored simultaneously.

Disadvantages:

Accuracy: The lower accuracy of FP16 can lead to rounding errors, which can be problematic in certain applications.
Adaptation effort: It may require additional effort to optimize and adapt existing models and algorithms to FP16.
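
The rounding behaviour described above can be made visible with a few lines of Python. This is only a sketch using the standard `struct` module, whose `e` and `f` format characters pack IEEE 754 half and single precision; the helper names `round_to`, `to_fp16` and `to_fp32` are made up for this illustration, and a GPU does this conversion in hardware, not like this.

```python
import struct


def round_to(fmt, value):
    """Round a Python float (FP64) to the given struct format and back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp32(value):
    return round_to("<f", value)  # IEEE 754 single precision


def to_fp16(value):
    return round_to("<e", value)  # IEEE 754 half precision


x = 0.1
print(f"FP32: {to_fp32(x)!r}  error {abs(to_fp32(x) - x):.2e}")
print(f"FP16: {to_fp16(x)!r}  error {abs(to_fp16(x) - x):.2e}")

# FP16 has an 11-bit significand, so above 2048 not every integer is
# representable any more and 2049 gets rounded to a neighbouring value.
print(f"2049 in FP16: {to_fp16(2049.0)}")
```

The error for 0.1 is several orders of magnitude larger in FP16 than in FP32, which is exactly the kind of rounding error that can add up over many layers.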

Integer (INT8 and INT16)

Advantages:

Efficiency: Integer computations are extremely efficient and consume significantly less energy than FP32 and FP16, making them ideal for mobile and embedded systems.
Speed: On hardware with native integer support they are typically faster than floating-point calculations, which increases inference speed and reduces latency.

Disadvantages:

Accuracy: Integer formats offer the lowest precision, which can lead to greater errors and inaccuracies, especially with complex models.
Complexity: Quantizing models to make them suitable for integer calculations can be complex and time-consuming.
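
To give an idea of what quantization involves, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one simple scheme among several and not necessarily what the benchmark models or any particular SDK use; the function names and the sample weights are invented for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    # The scale maps the largest magnitude onto 127; guard against all zeros.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.5, 0.731, -0.004]  # hypothetical sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]

print("quantized:", q)
print(f"scale: {scale:.6f}, largest error: {max(errors):.6f}")
```

Small values such as -0.004 collapse to zero here, which shows why quantized models lose precision and why real toolchains spend effort on calibration.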

Architectures and their optimization

Different GPU architectures are optimized differently for these data types:

NVIDIA GPUs: These offer dedicated Tensor cores that are optimized for FP16 and INT8 computations, making them particularly efficient in AI computation.
AMD GPUs: AMD is also focusing on improved support for FP16 and is working on improving efficiency with lower precision.
Intel GPUs: With the OpenVINO toolkit, Intel is optimizing for broad support of different data types, including INT8, to enable high performance with lower power consumption.

The bottom line is that the choice of data type and architecture depends on the specific requirements of the application. For high accuracy and compatibility, FP32 is suitable, while FP16 and integer are preferred for efficiency and speed in inference applications.
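
The memory side of this trade-off is simple arithmetic, sketched below. The parameter count of roughly 25.6 million for ResNet 50 is a commonly cited approximate figure and not taken from the benchmark itself; the function name is hypothetical, and only the weights are counted, not activations or runtime overhead.

```python
# Storage size of one value per data type, in bytes.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_footprint_mib(num_parameters, dtype):
    """Memory needed to store the weights alone, in MiB."""
    return num_parameters * BYTES_PER_VALUE[dtype] / (1024 ** 2)


resnet50_parameters = 25_600_000  # assumption: approximate, commonly cited value

for dtype in BYTES_PER_VALUE:
    size = weight_footprint_mib(resnet50_parameters, dtype)
    print(f"{dtype}: {size:.1f} MiB")
```

Halving the width of each value halves the weight storage, and with it the amount of data that has to move through the memory system for every inference.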

Test system

35 replies

echolot

Old-timer

1,059 comments 810 likes

#1 May 23, 2024

That was very comprehensive. So a 4070 Ti Super already serves you well, and I still regret not having added Nvidia to my portfolio back in 2015. This company knows no limits right now.
Addendum: (image, see the forum thread)

2 likes

letauch

Member

12 comments 9 likes

#2 May 23, 2024

Ahoy,

as always on the stock market: hindsight is 20/20.

Regards
letauch

1 like

eastcoast_pete

Old-timer

1,635 comments 953 likes

#3 May 23, 2024

Yes, Nvidia is dominant here at the moment, no question. Now that the much-advertised NPUs/AI ASICs are also making their way into notebooks (the Snapdragon X chips with Windows on ARM are everywhere right now), it would also be exciting to send these SoCs with their (according to Microsoft) powerful, dedicated NPU cores through some of the test courses here, also to be able to put the AI performance of these SoCs into perspective (in my opinion this applies equally to Phoenix/Hawks). And, at least in theory, applications that depend on fast communication between the CPU and the GPU or NPU cores should benefit especially here.

2 likes

RazielNoir

Veteran

428 comments 194 likes

#4 May 23, 2024

The RTX 4000 Ada SFF with TensorRT is pretty much the most efficient model, if I'm reading the overall score correctly. On the level of a 4070 Ti or 7900 XT at 70 W!

8j0ern

Old-timer

2,672 comments 837 likes

#5 May 23, 2024

Link: UL Procyon AI Computer Vision Benchmark. Test and compare the NNAPI performance of Android devices with the UL Procyon AI Computer Vision Benchmark (benchmarks.ul.com)

Very interesting. How is that supposed to work independently when the Tensor code is supported by only one hardware manufacturer?

Asked the other way around: why should I, as an independent coder, go for Tensor cores?

That aside, generating images at 1024p?

Better to wait for the NPUs ;)

Igor Wallossek

10,398 comments 19,365 likes

#6 May 23, 2024

Why do you think I also measured the NV cards with alternative code? For image generation, Intel's OpenVINO was then the alternative for NV. I know of no benchmark that supports more APIs and, above all, can be scripted by the tester in the Pro version. In that respect, your objection misses the point somewhat. Today was only about graphics cards, not NPUs. That is a different topic again and has long been in preparation. It's just that not even AMD offers any comparable approach.

You're working locally, not on a server farm. And there are many of them, not just one. :D

They will fall well behind even the smallest NV card with Tensor cores at first. But it will be enough for simple LMs. I'm currently trying to get hold of suitable hardware, but almost everyone is still stonewalling.

1 like

8j0ern

Old-timer

2,672 comments 837 likes

#7 May 23, 2024

I didn't mean your comparison here, which is why I also linked the benchmark's homepage.

In case there are still Nvidia-independent coders out there: https://www.amd.com/en/products/sof... including open frameworks, models, and tools.

echolot

Old-timer

1,059 comments 810 likes

#8 May 23, 2024

Tensor cores and frame generation. Wasn't there something? As long as AMD can't catch up there, Nvidia will keep pulling away.

1 like

Igor Wallossek

10,398 comments 19,365 likes

#9 May 23, 2024

ROCm... Well, more has to come there. AMD's software offers a number of optimizations for AI workloads, but that's about it.

Currently, Microsoft's Windows ML, Qualcomm's SNPE, Intel's OpenVINO, Apple's Core ML and, of course, NVIDIA's TensorRT are the measure of all things.

Yumiko

Old-timer

513 comments 222 likes

#10 May 23, 2024

Is that so?
For the price of a 4090, for example, you get 3x 7900 XT, which together are significantly faster according to the benchmarks above (AI applications are massively parallel).
In terms of power consumption (depending on electricity costs), that can of course flip at some point.

Igor Wallossek

10,398 comments 19,365 likes

#11 May 23, 2024

Depends on the application. If TensorRT can be used, AMD with RDNA3 is almost completely helpless. Not everything can be parallelized across many devices, and then you still need a performant API. That's where I mostly see a software problem at AMD, at least in the desktop segment.

3 likes

echolot

Old-timer

1,059 comments 810 likes

#12 May 23, 2024

And a powerful power supply for 3x 7900 XT

1 like

RazielNoir

Veteran

428 comments 194 likes

#13 May 23, 2024

Or the right platform

8j0ern

Old-timer

2,672 comments 837 likes

#14 May 23, 2024

More is coming there too, but not based on TensorRT ;)

https://www.amd.com/en/developer/resources/ryzen-ai-software.html

https://www.amd.com/en/technologies/xdna.html

ipat66

Old-timer

1,390 comments 1,397 likes

#15 May 23, 2024

As of today, you can get a 4090 for 1,730 euros.
A 7900 XTX starts at 950 euros...
So that's more like just under two 7900 XTXs for the price of a 4090.
Edit: With the 7900 XT at 700 euros, three of them would come to 2,100 euros.

Moreover, in productive AI work you only need a 4070 Ti Super to achieve the same or in some cases much better performance compared to a 7900 XTX.
A 4070 Ti Super starts at 850 euros...
So: 100 euros saved, with lower power consumption by comparison.

At least that's what I gather from Igor's charts... :)

2 likes

8j0ern

Old-timer

2,672 comments 837 likes

#16 May 23, 2024

In terms of the benchmark, that's true.
The question is, what relevance does a mobile benchmark, for example, have on a 4070 Ti?

Is this "Honey, I Shrunk the Kids" all over again?

echolot

Old-timer

1,059 comments 810 likes

#17 May 23, 2024

That's my line of thinking too. AMD has to step it up a notch or two with the next generation.

8j0ern

Old-timer

2,672 comments 837 likes

#18 May 23, 2024

Then I'd like to see you add a wallet or two on top. ;)

echolot

Old-timer

1,059 comments 810 likes

#19 May 23, 2024

The market sets the price. See Nvidia. There aren't that many 4090 owners.

1 like

About the author

Igor Wallossek

Editor-in-chief and name-giver of igor'sLAB as the content successor of Tom's Hardware Germany, whose license was returned in June 2019 in order to better meet the qualitative demands of web content and challenges of new media such as YouTube with its own channel.

Computer nerd since 1983, audio freak since 1979 and pretty much open to anything with a plug or battery for over 50 years.