Nvidia unveils DGX GH200 supercomputer: Grace Hopper superchips go into production

At Computex 2023 in Taipei, Taiwan, Jensen Huang, CEO of Nvidia, announced that the company’s Grace hopper superchips are now in full production. The Grace platform has already successfully won six supercomputer competitions. As part of this exciting development, Huang unveiled the new DGX GH200 AI supercomputing platform at Computex 2023.

This platform, designed specifically for massive generative AI workloads, is available now with 256 Grace Hopper superchips. These chips work together as a powerful supercomputing system with 144 TB of shared memory to handle even the most demanding generative AI training tasks. Well-known companies such as Google, Meta and Microsoft have already expressed interest in these advanced systems from Nvidia.

Nvidia also announced the launch of its new MGX reference architectures. These architectures will help OEMs develop new AI supercomputers even faster. More than 100 systems are available in the process. The company also unveiled the Spectrum-X Ethernet networking platform, designed and optimized specifically for AI servers and supercomputing clusters. With this platform, Nvidia offers a state-of-the-art solution for networking AI systems and supercomputers. Dive into the future of technology!

Grace hopper superchips now in production at Nvidia

There has already been extensive coverage of the Grace and Grace Hopper superchips that have been introduced in the past. These chips are central to the new systems announced by Nvidia today. The Grace chip acts as Nvidia’s own arm CPU processor, while the Grace Hopper superchip combines the Grace CPU with a Hopper GPU, 72 cores, 96GB of HBM3 and 512GB of LPDDR5X in one package. With a total of 200 billion transistors, this superchip is a technical heavyweight. The combination of Grace CPU and Hopper GPU enables an impressive data bandwidth of up to 1 TB/s between both components. This high throughput proves to be a huge advantage, especially for memory-intensive workloads.

Given that the Grace Hopper superchips are now in production, we can expect systems to be available from various Nvidia partners including Asus, Gigabyte, ASRock Rack and Pegatron. Of particular note, Nvidia will also introduce its own systems based on the new chips. In addition, the company will release reference design architectures for OxMs and hyperscalers, which we will discuss below.

Nvidia DGX GH200 supercomputer

Nvidia’s DGX systems are the leading choice for high-performance AI and HPC workloads. However, current DGX A100 systems are limited to eight A100 GPUs working together as a single unit. Given the rapid growth in generative AI, there is a great need among Nvidia’s customers for more powerful systems with significantly more capacity. That’s why the DGX H200 is designed to provide tremendous scalability and ultimate throughput for the most demanding workloads such as generative AI training, large language models, recommendation systems, and data analytics. It does so by circumventing the limitations of traditional cluster interconnect options such as InfiniBand and Ethernet through the use of Nvidia’s custom NVLink switch silicon.

Although the exact specifications of the new DGX GH200 AI supercomputer are not yet known, we do know that Nvidia is using a new NVLink switch system. This system consists of 36 third-generation NVLink switches to connect 256 GH200 Grace Hopper chips and 144TB of shared memory into a unified structure. This creates a massive computing unit that resembles a giant graphics processor both visually and in terms of performance.

The DGX GH200 is equipped with a total of 256 Grace Hopper CPU+GPUs, significantly outperforming Nvidia’s largest NVLink-connected DGX system to date with eight GPUs. Compared to the DGX A100 systems, which offer only 320GB of shared memory between eight A100 GPUs, the DGX GH200 has an impressive 144TB of shared memory, a 500x increase. Unlike the DGX A100 system’s expansion to clusters with more than eight GPUs, which uses InfiniBand as a link between systems and results in performance degradation, the DGX GH200 is the first time Nvidia has built a full supercomputing cluster around the NVLink switch topology. This cluster offers up to 10 times the GPU-to-GPU bandwidth and seven times the CPU-to-GPU bandwidth compared to the previous generation, according to Nvidia. In addition, it offers 5 times the interconnect efficiency of competing interconnects and up to 128 TB/s of bisectional bandwidth.

Although the system requires 150 miles of fiber optic cable and weighs 40,000 pounds, it is still considered a single GPU. Nvidia claims that the 256 Grace Hopper superchips give the DGX GH200 “AI performance” in the range of exaflops, a value measured with smaller data types that are more relevant to AI workloads than the FP64 measurements used in HPC and supercomputing. This impressive performance is achieved through GPU-to-GPU bandwidth of 900 GB/s. The scalability of the system is particularly noteworthy, as Grace Hopper achieves a maximum throughput of 1 TB/s with the Grace CPU when connected directly on the same board via the NVLink C2C chip interconnect.

Nvidia has published benchmarks of the DGX GH200 with the NVLink Switch system, with a DGX H100 cluster connected over InfiniBand. Different numbers of GPUs were used for the calculations, from 32 to 256, but each system used the same number of GPUs for each test. This resulted in an expected 2.2 to 6.3 times increase in interconnect performance. By the end of 2023, Nvidia will provide DGX GH200 reference blueprints to its leading customers Google, Meta and Microsoft. The system will also be offered as a reference architecture design for cloud service providers and hyperscalers.In addition, Nvidia will use the new Nvidia Helios supercomputer for internal research and development work. The supercomputer consists of four DGX GH200 systems containing a total of 1,024 Grace Hopper superchips. These systems will be connected via Nvidia’s Quantum-2 InfiniBand 400 Gb/s network.

Nvidia MGX Systems – Reference Architectures

Although DGX systems are designed for high-end applications and HGX systems are designed for hyperscalers, the new MGX systems represent something of an intermediate solution. DGX and HGX will continue to exist alongside the new MGX systems.

Nvidia’s OxM partners are facing new challenges in developing and deploying AI-centric server designs, causing delays. To accelerate this process, Nvidia has developed the MGX reference architectures, which include more than 100 reference designs. The MGX systems offer modular designs that span Nvidia’s entire portfolio of CPUs, GPUs, DPUs and networking systems. In addition, designs based on the popular x86 and Arm-based processors used in today’s servers are also available. Nvidia also offers options for air-cooled and liquid-cooled designs, giving OxM partners multiple design options for a variety of applications.

A significant factor is that Nvidia is using Ethernet as a connectivity option in high-performance AI platforms instead of the InfiniBand links often used in such systems. The Spectrum-X design leverages Nvidia’s 51 TB/s Spectrum 4-400 GbE Ethernet switches and Nvidia’s Bluefield 3 DPUs, providing developers with software and SDKs to customize systems according to the specific needs of AI workloads. Compared to other Ethernet-based systems, Nvidia emphasizes that Spectrum-X is lossless, providing improved QoS and latency. Additionally, it features adaptive routing technology, which is especially beneficial in multi-client environments.

The Spectrum-X networking platform is a key component of Nvidia’s portfolio as it introduces powerful AI clustering capabilities to Ethernet-based networks, providing new opportunities for the extensive use of AI in hyperscale infrastructures. The Spectrum-X platform is fully interoperable with existing Ethernet-based stacks and offers impressive scalability with up to 256 ports at 200 Gb/s on a single switch or 16,000 ports in a two-tier leaf spine topology.

The Nvidia Spectrum-X platform and related components, including 400G LinkX optics, are available now.

Nvidia Grace and Grace Hopper Superchip Supercomputing Wins

Nvidia’s first Grace CPUs are already in production and have generated significant interest through their use in three recent supercomputer wins. One of these notable supercomputers is the Taiwania 4, which is being developed by ASUS on behalf of the Taiwan National Center for High-Performance Computing and was recently announced. With a total of 44 Grace CPU nodes, this supercomputer will be one of the most energy-efficient in Asia once it becomes operational, according to Nvidia. Its purpose is to assist in modeling climate change issues.

Nvidia also announced more details about its new Taipei 1 supercomputer, which will be based in Taiwan. This system includes 64 DGX H100 AI supercomputers as well as 64 Nvidia OVX systems connected by the company’s proprietary networking kit. When completed later this year, the supercomputer will be used for local research and development tasks without being specified.

Source: TomsHardware, NVIDIA

4 Antworten

Zeige alle Kommentare an

Kommentar

Lade neue Kommentare

eastcoast_pete

Urgestein

1,732 Kommentare 1,063 Likes

#1 May 30, 2023

Na, bei den Specs kann man die alte Frage "But, can it run Crysis?" wohl mit "Ja" oder "Yes" beantworten 😁

Antwort 1 Like

Gregor Kacknoob

Urgestein

543 Kommentare 450 Likes

#2 May 30, 2023

Aber für Gollum wirds nicht reichen :D

Antwort Gefällt mir

LurkingInShadows

Urgestein

1,387 Kommentare 584 Likes

#3 May 30, 2023

1 GPU mit:

150 Meilen Glasfaserkabel
2.112 60mm Lüftern
40.000 Pfund

Da fragt sich eher:
1) Kann sie fliegen?
2) Ausblick auf die RTX 9090 TI ?

Antwort Gefällt mir

8j0ern

Urgestein

2,794 Kommentare 882 Likes

#4 May 31, 2023

Hahaha, so viel Mut kann nur Jensen vorweisen !

FP64 Heraus zu Fordern Grenzt schon an Wahnsinn !

:geek:

Antwort Gefällt mir

Alle Kommentare lesen unter igor´sLAB Community →

Danke für die Spende

Du fandest, der Beitrag war interessant und möchtest uns unterstützen? Klasse!

Hier erfährst Du, wie: Hier spenden.

Hier kannst Du per PayPal spenden.

Computex 2023 – Day one with many videos and faces

AMD Ryzen Threadripper 7000 “Storm Peak” – first traces also spotted in CPU-Z

About the author

View All Posts

Igor Wallossek

Editor-in-chief and name-giver of igor'sLAB as the content successor of Tom's Hardware Germany, whose license was returned in June 2019 in order to better meet the qualitative demands of web content and challenges of new media such as YouTube with its own channel.

Computer nerd since 1983, audio freak since 1979 and pretty much open to anything with a plug or battery for over 50 years.

Follow Igor:
YouTube Facebook Instagram Twitter