The latest memory chips, such as Micron’s GDDR6X modules on the GeForce RTX 3080, allow the chip temperature Tjunction, which is used for internal protection mechanisms (e.g. clocking down), to be read out with suitable software. That would be a nice addition in itself, if only everyone could do it. But knowing what this value really means, and how high the readings then turn out to be, could plunge many a contemporary into fear and distress. This is exactly why the value is not available in the normal sensor loop. The Founders Edition cooler is not bad per se, but the fact that NVIDIA only clocks the memory at 19 Gbps is certainly due to thermal reasons.
As a reminder: in the launch article I wrote that the memory could hardly be run stably above 20 Gbps and even suddenly slowed down again near the limit. Some colleagues have observed this behaviour as well, so I wanted to get to the bottom of it. As a little food for thought, here is once again the infrared image of the back of the board with the 84 °C hotspot, located exactly where a memory module sits. And yes, the cause is not the memory itself, which stays cool enough elsewhere, but the additional heating from the six NVDD voltage converters placed much too close to it.
Thermal resistances and temperatures at different points
Interestingly, Micron is completely silent on GDDR6(X): annoyingly, even the “Device Thermal Information” included with the GDDR6 documentation still ends at GDDR5. The manufacturer specifies a maximum Tjunction of 100 °C for its GDDR5 modules, which seems quite plausible and is consistent with the specified maximum “Operation Temperature” of 95 °C. But this is exactly where the uncertainties begin about what gets how warm, where, and why.
When I asked colleagues, for example from R&D departments, they unanimously agreed that the maximum temperature Ttot before possible destruction of the chip begins should be 120 °C, and that the maximum Tjunction is probably specified at 105 °C, or for GDDR6X even at 110 °C. But let us first look at the thermal scheme of such a GDDR6X module. First of all there is PT, i.e. the maximum power Ptot, which is supplied as electrical energy and is almost completely released again as heat (see red arrow).
This should be around 2.5 to 3 watts per module, which sounds a bit low at first, but given the small structure width and the resulting heat density it is definitely a respectable figure, especially when the board underneath is already quite hot. Because even though the memory module may look quite big as a package, the chip itself is rather tiny. You simply need a lot of space for all the connections, and you also want to remain backward compatible.
TJ, i.e. Tjunction, comes into play at the same point. Maximum chip temperature and maximum power dissipation are therefore directly related here. This is also exactly the value that AMD, for example, outputs in the sensor loop as the memory temperature. I asked AMD myself months ago and found out that this is not an average over all modules but the absolute peak value, i.e. the Tjunction of the hottest module on a board. Also important are the values marked with the other two red arrows: PB, i.e. Pboard, the power dissipated via the board, and PC, i.e. Pcase, the heat dissipated via the top of the case (package).
In addition, there are the thermal resistances of the individual layers and of combinations of related layers, taken as a guide value for the path upwards and for the path downwards through the board, as well as the ambient (air) temperatures TA or Tair on the top and bottom. The two can differ considerably if a water block comes into play on top. But more on that in a moment.
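The scheme described above can be read as a simple two-path thermal network: Ptot leaves the junction either upwards (via the case, Pcase) or downwards (via the board, Pboard). The following is only a minimal sketch under that assumption. The function name, the lumped resistances `r_up` and `r_down`, and all numbers in the usage line are hypothetical placeholders, not values from Micron or from my measurements.

```python
def junction_temperature(p_tot, r_up, r_down, t_top, t_bottom):
    """Steady-state Tjunction of a two-path thermal network (illustrative model).

    p_tot    : dissipated power Ptot in W
    r_up     : lumped thermal resistance junction -> case -> air on top, in K/W
    r_down   : lumped thermal resistance junction -> board -> air below, in K/W
    t_top    : ambient temperature Tair above the package in deg C
    t_bottom : ambient temperature Tair below the board in deg C

    Returns (t_junction, p_case, p_board).
    """
    g_up = 1.0 / r_up      # thermal conductance of the upward path
    g_down = 1.0 / r_down  # thermal conductance of the downward path

    # Energy balance: p_tot = (Tj - t_top) * g_up + (Tj - t_bottom) * g_down
    t_j = (p_tot + g_up * t_top + g_down * t_bottom) / (g_up + g_down)

    p_case = (t_j - t_top) * g_up        # heat leaving via the package top
    p_board = (t_j - t_bottom) * g_down  # heat leaving via the board
    return t_j, p_case, p_board


# Hypothetical example values, chosen only to show how the function is called.
t_j, p_case, p_board = junction_temperature(
    p_tot=3.0, r_up=20.0, r_down=40.0, t_top=40.0, t_bottom=60.0
)
print(f"Tjunction={t_j:.1f} C, Pcase={p_case:.2f} W, Pboard={p_board:.2f} W")
```

The model makes one point from the text tangible: the hotter the air or board on one side, the less heat that path can carry away, and Tjunction rises even though Ptot stays the same.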
The crux for a tester like me is the very limited (public) availability of the specifications and the lack of any (official) way to measure inside a module. But wait! In the meantime I can also read out the temperatures of the GDDR6X, more precisely the temperature of the hottest module. For certain reasons I will not go into detail here, especially since the software in question is intended only for internal use by engineers. Even though I am not subject to any NDA in this regard, I will act as if I were and will not publicly offer anything for download or redistribute it. This is simply a question of honor and of source protection for signed software, so there is no point in asking.
Test system and setup
As always, I “tropicalized” the back of the board, i.e. I coated it with a transparent varnish that is used in industry to protect against environmental factors such as high humidity and whose emissivity of approx. 0.95 is therefore known. If a factor of 1 were applied here, the measured temperature would be significantly lower. The wafer-thin special foil attached to the bench table has a transmission factor of approx. 0.97, which I also include in the measurement. This allows a sound temperature analysis of the relevant areas with the Optris PI640, because the bolometer’s resolution of 640 x 480 real measuring points is sufficiently high.
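Why emissivity and transmission matter can be shown with a strongly simplified grey-body model. This is only a sketch: it assumes total radiation proportional to T^4 (Stefan-Boltzmann), whereas a real camera such as the PI640 works with a band-limited, calibrated curve. The function name and the assumption that reflected surroundings and foil are both at room temperature are mine.

```python
def apparent_to_true(t_apparent_c, emissivity=0.95, transmission=0.97,
                     t_reflected_c=22.0, t_window_c=22.0):
    """Convert an apparent (emissivity = 1, no foil) reading into object temperature.

    Simplified model, all terms proportional to T^4 in kelvin:
        W_meas = tau * (eps * W_obj + (1 - eps) * W_refl) + (1 - tau) * W_window
    The Stefan-Boltzmann constant cancels, so it is omitted.
    """
    def to_kelvin(t_c):
        return t_c + 273.15

    w_meas = to_kelvin(t_apparent_c) ** 4
    w_refl = to_kelvin(t_reflected_c) ** 4
    w_window = to_kelvin(t_window_c) ** 4

    # Undo the foil first, then the surface emissivity.
    w_behind_foil = (w_meas - (1.0 - transmission) * w_window) / transmission
    w_obj = (w_behind_foil - (1.0 - emissivity) * w_refl) / emissivity

    return w_obj ** 0.25 - 273.15


# An uncorrected reading of 80 deg C maps to a higher object temperature.
print(f"{apparent_to_true(80.0):.1f} C")
```

For any object warmer than its surroundings, the corrected value is above the uncorrected one, which is why a factor of 1 would understate the real surface temperature.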
I have already mentioned the tropicalization, so I would classify the measured values as reliable. In spite of the good equipment, I would still reckon with a tolerance of approx. 0.5 to 1 degree, but no more. I use Witcher 3 in UHD, which I let run for 30 minutes before the final measurement. The room temperature is 22 °C, and the setup is, as an exception, open because I need a constant ambient temperature. You could already see the difference between the memory modules on the board (picture at the top). Now please remember the 84 °C from above and the launch article.
Measurement of Tjunction in the memory
The graph shows the development of the memory temperature, which I was able to read out with the internal software. After heating up, everything remains constant from minute 8 onwards, so this can be taken as the final temperature, which does not change even after 30 minutes. The hottest module on the IR image is located in the immediate vicinity of the voltage converters and has an internal Tjunction of 104 °C. This results in a delta of approx. 20 degrees between the chip and the bottom of the board.
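How such a plateau and the resulting delta can be determined from a temperature log is sketched below. The log is a made-up placeholder shaped like the curve described above (only the 104 °C plateau from minute 8 and the 84 °C board hotspot are taken from the text); the function name and the 1 degree tolerance are assumptions as well.

```python
def steady_state_minute(samples, tolerance=1.0):
    """Return the first minute from which all readings stay within
    `tolerance` degrees of the final reading, or None for an empty log.

    samples: list of (minute, temperature_c) tuples in chronological order.
    """
    if not samples:
        return None
    final_temp = samples[-1][1]
    start = None
    for minute, temp in samples:
        if abs(temp - final_temp) <= tolerance:
            if start is None:
                start = minute  # candidate start of the plateau
        else:
            start = None        # left the band again, reset the candidate
    return start


# Placeholder log (minute, deg C), NOT the measured data behind the graph.
log = [(0, 50.0), (2, 80.0), (4, 95.0), (6, 101.0),
       (8, 104.0), (10, 104.0), (20, 104.0), (30, 104.0)]

t_junction = log[-1][1]  # final Tjunction of the hottest module
t_board_ir = 84.0        # IR hotspot on the back of the board
print(steady_state_minute(log), t_junction - t_board_ir)
```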
Earlier experiments with a water block and many backplate pad placement variants have revealed further interesting influences. If you cool only the memory on the back side with a good pad between the backplate and the board, Tboard drops by up to 4 degrees, which also lowers Tjunction by 1 to 2 degrees. With water cooling, Tcase is also significantly lower than Tboard, which could make rear cooling all the more interesting. With the RTX 3080 FE, however, it is advisable to cool rather only the clearly hotter voltage converters, because they are in direct proximity. By the way, I can reassure anyone who objects that I took off the backplate: even when fully assembled, the hottest module still sits at 104 °C internally.
Summary and conclusion
It’s no secret that memory modules can become significantly hotter inside than the outer surface on the top of the package or the bottom of the board would suggest. If you now assume a maximum Tjunction of 110 °C for GDDR6X, the remaining 6 degrees until the suspected throttling are really not a big cushion. But even such a high value is no reason for hasty panic once you understand how all the temperatures are interrelated.
Unfortunately, NVIDIA and its board partners are very tight-lipped about exactly how this value is used for performance control (throttling) or safety features like shutdown, but they will hardly have gone to the trouble for nothing. For my part, I will also read out the memory temperatures of the GDDR6(X) on the new Ampere cards in all upcoming tests.