Turing and possible performance improvements in existing games
However, quite a few skeptics, going by various leaks, had already voiced concerns that Turing-based cards offer neither dramatically higher CUDA core counts than their predecessors nor particularly high boost clock rates. At first glance, they weren't entirely wrong.
Unfortunately, Nvidia was rather tight-lipped about this at the kick-off event at Gamescom in Cologne, and during the presentation it did not go into the generational improvements as they relate to current games. Yet the company has clearly invested effort in designing Turing for better performance per core. This style of communication, however, does little to take the wind out of the sails of pointless speculation.
So let's look at the details ourselves. First, Turing borrows heavily from the Volta playbook in supporting simultaneous execution of FP32 and INT32 operations. If the Turing cores really do achieve better performance than Pascal at a given clock frequency, then this capability above all explains a large part of why.
But what exactly is at stake? In previous generations, a single math data path meant that different instruction types could not execute at the same time: the floating-point pipeline stalled whenever, for example, non-FP operations were required in a shader program. With Volta, Nvidia finally set out to change this by creating separate pipelines.
Although Nvidia eliminated the second dispatch unit associated with each warp scheduler, throughput in the once-problematic instruction mixes improved. Turing now takes a similar approach, providing one warp scheduler and one dispatch unit per quad (four per SM) while issuing instructions to the INT32 and FP32 pipelines at the same time.
According to Nvidia, the potential gains are significant. In a game like Battlefield 1, there are roughly 50 non-FP instructions in the shader code for every 100 floating-point instructions; other titles lean even more heavily on floating-point math. Across games, Nvidia cites an average of 36 integer-pipeline instructions per 100 FP instructions that would previously have blocked the floating-point pipeline. These are now offloaded onto the INT32 cores. At least in theory.
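The theoretical benefit of that offloading can be sketched with a simple back-of-the-envelope model. This is our own idealized arithmetic, not Nvidia's methodology: it assumes one instruction issues per cycle per pipeline and ignores dependencies and scheduling overhead.

```python
# Idealized issue-slot model of serialized vs. concurrent FP32/INT32 pipelines.
# Assumption (not from Nvidia): one instruction per pipeline per cycle,
# no dependency stalls.

def serial_cycles(fp_ops, int_ops):
    """Single shared math pipeline: FP and INT instructions serialize."""
    return fp_ops + int_ops

def dual_issue_cycles(fp_ops, int_ops):
    """Separate FP32/INT32 paths: the longer instruction stream dominates."""
    return max(fp_ops, int_ops)

fp, integer = 100, 36  # Nvidia's quoted average mix per 100 FP instructions
speedup = serial_cycles(fp, integer) / dual_issue_cycles(fp, integer)
print(f"Idealized speedup: {speedup:.2f}x")  # 136 / 100 = 1.36x
```

Real shaders won't hit this ceiling, since the integer work rarely overlaps perfectly with the floating-point stream, but it illustrates why the instruction mix matters so much to Nvidia's per-core claims.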
Despite the separation of the FP32 and INT32 paths in the block diagrams, Nvidia's technical documentation states, to keep things simple, that each Turing SM contains 64 CUDA cores. The Turing SM also includes 16 load/store units, 16 special function units, 256KB of register file storage, 96KB of shared memory and L1 data cache, four texture units, eight Tensor cores, and one RT core.
On paper, an SM of the Pascal-based predecessor GP102 appears more complex, offering twice as many CUDA cores, load/store units, SFUs, and texture units, just as much register file capacity, and even more cache. But it is also important to note that the TU102 packs up to 72 SMs, while the GP102 makes do with 30. The result is a Turing-based flagship with 21% more CUDA cores and texture units than the GeForce GTX 1080 Ti, but also considerably more SRAM for registers, shared memory, and L1 cache, not to mention 6MB of L2 cache, double the GP102's 3MB.
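The 21% figure follows from the per-SM counts above. As a worked check (the 68-SM RTX 2080 Ti and 28-active-SM GTX 1080 Ti configurations are the shipping cut-down parts, not stated in the text):

```python
# Die-level CUDA core arithmetic from the per-SM figures in the text.
CORES_PER_SM_TURING = 64
CORES_PER_SM_PASCAL = 128

tu102_full  = 72 * CORES_PER_SM_TURING   # 4608 cores on the full die
rtx_2080_ti = 68 * CORES_PER_SM_TURING   # 4352 cores (shipping flagship)
gtx_1080_ti = 28 * CORES_PER_SM_PASCAL   # 3584 cores (GP102, 28 of 30 SMs)

gain = rtx_2080_ti / gtx_1080_ti - 1
print(f"CUDA core increase: {gain:.1%}")  # ~21.4%, the quoted 21%
```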
This increase in on-die memory, along with its hierarchical organization, plays another very critical role in improving performance. As with the GP102 and GP104, the TU102's streaming multiprocessors are divided into four blocks. But while the Pascal-based GPUs share a 24KB L1 data/texture cache between each pair of blocks and 96KB of shared memory across the SM, the TU102 unifies these units into one flexible 96KB structure.
The advantage of unification is that the on-chip memory gets used whether a workload is optimized for L1 or for shared memory, instead of partly sitting idle as before. Moving the L1 functionality into this structure has the added benefit of placing it on a wider bus, doubling L1 cache bandwidth (while shared memory bandwidth stays unchanged).
Compared TPC to TPC (i.e., with the same number of CUDA cores), Pascal supports 64B/clock of cache hit bandwidth per TPC, while Turing supports 128B/clock, so here, too, throughput is 2x higher. And since these 96KB can be freely configured as 64KB of L1 and 32KB of shared memory (or vice versa), L1 capacity per SM can also come out up to 50% higher. Incidentally, Turing's cache structure looks very similar at first glance to Kepler's, which also had a configurable 64KB shared memory/L1 cache.
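The figures above can be restated in a tiny sketch (all numbers from the text; the variable names are ours):

```python
# Per-TPC cache-hit bandwidth and the flexible 96KB L1/shared split.
BYTES_PER_CLOCK_PASCAL = 64    # cache hit bandwidth per TPC, Pascal
BYTES_PER_CLOCK_TURING = 128   # cache hit bandwidth per TPC, Turing
bandwidth_gain = BYTES_PER_CLOCK_TURING / BYTES_PER_CLOCK_PASCAL  # 2.0x

# Per SM, Turing's unified 96KB pool can be partitioned either way:
CONFIGS = [(64, 32), (32, 64)]  # (L1 cache KB, shared memory KB)
totals = [l1 + shared for l1, shared in CONFIGS]  # always sums to 96KB
print(bandwidth_gain, totals)
```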
To explain: there are three different data stores, namely the texture cache for textures, the L1 cache for generic LD/ST data, and shared memory for compute. In the Kepler generation, the texture cache was separate (the read-only data cache), while L1 and shared memory were combined. Maxwell and Pascal likewise used two separate structures, only slightly modified. Now all three are combined into one common, configurable storage pool.
In summary, Nvidia suggests that the combined effect of the redesigned math pipelines and memory architecture allows a 50% increase in performance per CUDA core. To feed these data-hungry cores more effectively, Nvidia paired the TU102 with GDDR6 memory and developed technologies to reduce data traffic (such as delta color compression).
Comparing the 11Gb/s GDDR5X modules of the GeForce GTX 1080 Ti with the 14Gb/s GDDR6 memory of the GeForce RTX 2080 Ti, both attached to an aggregate 352-bit bus, data rate and peak bandwidth across the whole card increase by 27%. Depending on the game, and especially where the GeForce RTX 2080 Ti can reduce the data it sends over the bus, effective throughput rises by double-digit percentages beyond that.
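That 27% falls straight out of the standard peak-bandwidth formula, per-pin data rate times bus width divided by eight bits per byte:

```python
# Peak memory bandwidth from the figures in the text.
def peak_bandwidth_gbs(data_rate_gbps_per_pin, bus_width_bits):
    """Peak bandwidth in GB/s: each bus line carries the per-pin rate."""
    return data_rate_gbps_per_pin * bus_width_bits / 8

gtx_1080_ti_bw = peak_bandwidth_gbs(11, 352)  # GDDR5X: 484 GB/s
rtx_2080_ti_bw = peak_bandwidth_gbs(14, 352)  # GDDR6:  616 GB/s

gain = rtx_2080_ti_bw / gtx_1080_ti_bw - 1
print(f"Bandwidth increase: {gain:.0%}")      # the quoted 27%
```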
- 1 - Introduction and Presentation
- 2 - TU102 + GeForce RTX 2080 Ti
- 3 - TU104 + GeForce RTX 2080
- 4 - TU106 + GeForce RTX 2070
- 5 - Performance Gains for Existing Applications
- 6 - Tensor Cores and DLSS
- 7 - Real-Time Ray Tracing
- 8 - NVLink: A Bridge to Where?
- 9 - RTX-OPS: We Do the Math
- 10 - Shading Improvements
- 11 - Connectors and Video
- 12 - One-Click Overclocking
- 13 - Goodbye, Blower Fan!
- 14 - Summary and Conclusion