The possible reason for crashes and instabilities of the NVIDIA GeForce RTX 3080 and RTX 3090 | Investigative

It was not only editors and testers who were surprised by the sudden instabilities of the new GeForce RTX 3080 and RTX 3090, but also the first customers who were able to get board partner cards from the first wave. An interesting pattern emerged: the problems did not affect all cards or manufacturers, and they only occurred at certain boost clock rates around or just above 2 GHz. To make matters worse, NVIDIA’s secrecy has apparently also slightly undermined the quality management of the board partners (AIC); unintentionally, of course, but with plausible consequences. A chain of adverse circumstances? This could well be the case, because it would explain the somewhat diffuse error pattern reported across a wide range of forums.

Start of production without real function control?

Let’s start with the latter, before I get lost in the technical analyses. You probably remember when I wrote that the board partners did not yet have working drivers and could only work with a very limited driver and NVPunish. Since the driver problem lasted until shortly before the launch, but the first wave of cards already had to be produced, the functional testing of the first models was obviously limited to power-on and thermal stability. Runs or doesn’t run. However, this says little about the chip quality and the maximum frequencies that the respective chip can safely handle.

Thus, it is at least plausible that cards were sold as OC cards that would not have passed a real quality test at the manufacturer with the delivered settings. Real binning? None. Selecting chips afterwards for the more heavily overclocked cards? Practically impossible. And so it is by no means impossible that the occasional “potato” chip ended up on such an OC card. We know the consequences from the buyers’ posts in the relevant forums.

Wrong component selection? Plausible!

Now let’s turn to the fact that even good chips have failed now and then. That they are good can be seen, for example, from the boost clock and the temperatures, so this is quite easy to check with a selected card. This brings us to a point that at first was only subconsciously nagging at the back of my mind, and which then solidified into a realization when comparing the boards of different models. So let’s go directly to the “reference board” PG132, which can also be understood as a so-called Base Design. The back side, and especially the area below the BGA, is interesting. What is interesting about such drawings and the so-called BoM (Bill of Materials) is that they offer different placement alternatives.

I will (have to) simplify the following for better understanding. Below the BGA we see the six NECESSARY capacitors for filtering high frequencies on the voltage rails, i.e. NVVDD and MSVDD. Apart from the high-frequency “garbage” that still comes from the voltage converters, it is mainly the GPU load, including all the jumps caused by boost, that leads to very broadband frequency mixtures, which become more extreme the higher the boost clock goes. The BoM and the drawing from June leave it open whether large-area POSCAPs (Conductive Polymer Tantalum Solid Capacitors, marked in red) are used, or rather the somewhat more expensive MLCCs (Multilayer Ceramic Chip Capacitors). The latter are smaller and have to be grouped for a higher capacitance.

According to NVIDIA’s list and specifications, both are possible. In terms of quality, however, good MLCCs are better at filtering the very high-frequency components in particular. In the end, this is simple practical knowledge, which all too often collides with the world view of a financial controller. If one searches the forums, it seems that the Zotac Trinity is particularly affected by instabilities starting at boost clock rates of around 2010 MHz. No wonder, because Zotac relies on a total of six cheaper POSCAPs.
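To make the filtering argument a little more tangible, here is a minimal sketch that models a capacitor as a series circuit of capacitance, ESR and ESL and compares one large polymer capacitor with a group of ten small MLCCs. All component values are hypothetical placeholders chosen for illustration, not data from the BoM or any datasheet, and the parallel group is idealized (no trace inductance, identical parts).

```python
import math

def impedance_magnitude(freq_hz, cap_f, esr_ohm, esl_h):
    """|Z| of a capacitor modelled as C, ESR and ESL in series."""
    omega = 2 * math.pi * freq_hz
    reactance = omega * esl_h - 1 / (omega * cap_f)
    return math.hypot(esr_ohm, reactance)

def self_resonant_frequency(cap_f, esl_h):
    """Frequency above which the part behaves inductively."""
    return 1 / (2 * math.pi * math.sqrt(cap_f * esl_h))

def parallel_group(part, count):
    """Idealized group of identical parts in parallel."""
    return {
        "cap_f": part["cap_f"] * count,
        "esr_ohm": part["esr_ohm"] / count,
        "esl_h": part["esl_h"] / count,
    }

# Hypothetical example values, NOT taken from any datasheet or BoM.
POLYMER = {"cap_f": 330e-6, "esr_ohm": 6e-3, "esl_h": 1.5e-9}
MLCC = {"cap_f": 47e-6, "esr_ohm": 3e-3, "esl_h": 0.5e-9}
MLCC_GROUP = parallel_group(MLCC, 10)

if __name__ == "__main__":
    for name, part in (("polymer", POLYMER), ("MLCC x10", MLCC_GROUP)):
        f0 = self_resonant_frequency(part["cap_f"], part["esl_h"])
        print(f"{name}: self-resonance at about {f0 / 1e6:.2f} MHz")
        for freq in (1e5, 1e6, 1e7, 1e8):
            z = impedance_magnitude(freq, **part)
            print(f"  {freq / 1e6:7.1f} MHz -> {z * 1e3:8.3f} mOhm")
```

Under these assumed values, the lower parasitic inductance of the grouped MLCCs keeps the impedance low well beyond the point where the single large part has already turned inductive. With real parts the numbers will differ, so this is only a way to see the trend.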

And what does NVIDIA do with its own Founders Editions? Obviously better, because I could not reproduce these stability problems with any FE, even well beyond 2 GHz (fan at 100%). If something went wrong, it was almost certainly a driver problem. If we take a look at the FE, we see only four SP-CAPs (red) and, in the middle, two MLCC groups of 10 individual capacitors each (green). This is definitely the better solution and the optimal compromise, because the middle areas in particular should be provided with suitable filters (effectively a short circuit for the high-frequency mixtures).
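The two layouts described so far (six large polymer capacitors versus four polymer capacitors plus two MLCC groups of ten) can be compared with a rough sketch that puts all parts in parallel and looks at the resulting impedance seen by the rail. The per-part values are again hypothetical assumptions, and the model ignores board traces, vias and the voltage converters entirely.

```python
import math

def part_impedance(freq_hz, cap_f, esr_ohm, esl_h):
    """Complex impedance of one capacitor (series C, ESR, ESL model)."""
    omega = 2 * math.pi * freq_hz
    return complex(esr_ohm, omega * esl_h - 1 / (omega * cap_f))

def bank_impedance(freq_hz, parts):
    """|Z| of several capacitor types in parallel.

    parts is a list of (count, (cap_f, esr_ohm, esl_h)) tuples.
    """
    admittance = sum(
        count / part_impedance(freq_hz, *part) for count, part in parts
    )
    return abs(1 / admittance)

# Hypothetical per-part values (capacitance, ESR, ESL), for illustration only.
POLYMER = (330e-6, 6e-3, 1.5e-9)
MLCC = (47e-6, 3e-3, 0.5e-9)

LAYOUTS = {
    "six polymer capacitors": [(6, POLYMER)],
    "four polymer + two MLCC groups of 10": [(4, POLYMER), (20, MLCC)],
}

if __name__ == "__main__":
    for freq in (1e5, 1e6, 1e7, 1e8):
        print(f"{freq / 1e6:7.1f} MHz")
        for name, parts in LAYOUTS.items():
            z = bank_impedance(freq, parts)
            print(f"  {name:38s} {z * 1e3:8.3f} mOhm")
```

With these assumed values the mixed layout shows the lower impedance at high frequencies, which fits the observation above. Whether a real board behaves the same way depends on the actual parts and layout and would have to be measured or properly simulated.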

If it is only about NVVDD, a single MLCC block may be sufficient to solve the most serious problems. MSI, for example, uses only one on the Gaming X Trio, which is theoretically enough, but could have been solved better if, for example, 2.1 GHz is to be reached under water cooling. Whether this is still enough would of course have to be tested. PC Partner, Zotac’s parent company, seems to have recognised this and is apparently changing its cards. By the way, the following example is from a soldering experiment that was NOT made by Zotac, but which confirmed the effectiveness of MLCC smoothing very impressively. One can almost be envious of these soldering skills.

By the way, one company deserves praise here for recognizing the whole issue from the start and avoiding it entirely: the Asus TUF RTX 3080 Gaming consistently did without POSCAPs and used only MLCC groups. My compliments, well done!

Interestingly, all board partners are silent on this issue, no matter whom you ask. No answer is also an answer, because this behaviour is the absolute exception and almost resembles a gag order. Normally, components are discussed freely once the launch has taken place. But here there is nothing but telling silence. This also applies to the question of whether the BoM was subsequently changed again to completely rule out the exclusive use of POSCAPs/SP-CAPs.

Sometimes things are so obvious that you really have to look several times to see them. But once you have understood it, many things suddenly go from nebulous to plausible. NVIDIA, by the way, cannot be blamed directly, because the fact that MLCCs work better than POSCAPs is something that any board designer who hasn’t chosen the wrong profession knows. Such a thing can even be simulated if necessary. I will of course stay on it, because my interest has naturally been aroused.

Please also read the latest follow-up to this story: