Amazon launches the beta phase of a new game and almost simultaneously users all over the internet report that their Nvidia RTX 3090 GPUs are dying. The reasons have not yet been officially confirmed. In EVGA’s case, it’s supposedly due to a faulty fan controller, though other sources say cards from other vendors or even other GPU manufacturers are also affected. And Amazon has since spoken out, cheerfully giving the all-clear. That’s in a nutshell what’s wrong with the hardware industry and gaming community right now, and it blew a fuse for me personally as well. That is why I would now like to present my view of things in this editorial.
First of all, of course, when GPUs are dying en masse like this, users need to be warned as soon as possible so that at least a few cards can perhaps be saved. So far, the only thing that is clear is that the story is related to Amazon’s “New World” game, which recently became playable in closed beta. So it is only right to draw attention to the issue on all websites and forums, in articles and threads, with as much reach to as many players and GPU owners as possible. It’s better to pass the word once too often than once too little! 😉
Internet Information Clusterf***
Igor had already highlighted the topic yesterday with exclusive information from EVGA. According to internal information, this manufacturer suspects the cause lies in the fan controller chip, which for unknown reasons tends to die in exactly this game. This would mean that only cards from this manufacturer are actually affected.
I've now had the following GPU owners express they have had shut downs and failures with New World…
So once again, the issue definitely is with SOMETHING in the way the game New World is rendering. This ISNT a 3090'exclusive issue! PERIOD!!
— JayzTwoCents (@JayzTwoCents) July 22, 2021
However, this is not the case, if one is to believe the various reports on the internet. The reason seems to be a second possible cause of death that affects all vendors and, if JayzTwoCents’ tweet is to be believed, other GPU types as well. Therefore, it should be clearly stated that there is no general all-clear for non-EVGA cards, at least at the present time!
Of course, you always have to ask yourself how reliable such reports and statements are, or whether there aren’t one or two freeloaders trying to smuggle a graphics card that is defective for other reasons into the RMA process. But even if one does not assume bad intentions, one must of course examine how well each such report holds up.
Check first, and only then sort out if necessary. What’s not a good idea is to blindly believe a single source’s statement while ignoring everything else or labeling it as false. Yes, on the internet it is difficult to keep track of everything, we humans are often overwhelmed by it and yes, this is nothing new. But it is in a case like this that human fallibility once again comes to the surface in the way information is processed.
Ideally, a central organization should collect all such reports, ask the users for descriptions, verification, etc., and then evaluate the credibility or reliability of each report. This is similar to what the Paul-Ehrlich-Institut (PEI) in Germany does with adverse reactions to Covid-19 vaccinations. The comparison lends itself, and surprisingly few people know about it, just as a side note. In this case, Nvidia, for example, would be a good choice as such a central information-collecting organization. But the fact that each reviewer cooks his own soup of information and doesn’t make it fully public isn’t really helpful either.
Once you have a solid data basis, you can work with the data, sort out false positives and find common denominators, like the very high FPS in the New World menu (sometimes in the range of several thousand), the popcorn-like noises that accompany the cards dying, or the PCB designs of the different vendors, some of which seem to be more vulnerable than others.
Only then can statements be made about the cause or sweeping all-clear statements be given. Before that, at least in my opinion, it’s just plain negligent. Now unfortunately we don’t live in a perfect world, and a data collection of this nature, even if it existed, would probably not be made public by Nvidia. But one should at least wait for an official statement from Nvidia before pointing a way forward in good conscience.
1000 and 1 FPS, and it went boom!
What the actual cause of the GPU mass death is remains unclear at this point. The most plausible explanation, at least to me, seems to be the following: Due to the extremely high FPS in the game’s menu, the GPU goes through extremely short render cycles (< 1 ms per frame). These render cycles cause the old familiar spikes in current and power consumption. In addition to Igor’s launch review, you can see this very nicely in the update to the miracle driver. In the above diagram from this article, the spikes in power consumption are shown with a resolution of 10 µs. Mind you, this is “only” a smaller and somewhat more frugal RTX 3080.
On all modern graphics cards, dedicated monitoring circuitry constantly watches for these spikes and throttles the card or, in extreme cases, shuts it down completely. But in the case of New World, the spikes are probably so incredibly short that the cards’ internal monitoring is simply too slow to detect them. There seems to be a limit at a frame time of about 1 ms, i.e. around 1000 FPS: above that frame rate, the card just doesn’t notice when it’s drinking itself to death on power. This would also fit with the alleged info that other high-end cards, which can reach such dangerously high FPS, are also affected.
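To make the arithmetic behind this theory a bit more tangible, here is a minimal Python sketch. It converts FPS to frame time and then checks whether a monitor that only samples at fixed intervals would ever "see" a short spike. The function names, the spike timing and both sampling periods are purely hypothetical values chosen for illustration. They are not the actual monitoring intervals of any Nvidia card, which are not public.

```python
def frame_time_ms(fps: float) -> float:
    """Frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def monitor_sees_spike(spike_start_us: int, spike_len_us: int,
                       sample_period_us: int, horizon_us: int) -> bool:
    """Return True if any sampling instant (0, period, 2*period, ...)
    up to horizon_us falls inside the spike window.

    Assumption: the monitor takes instantaneous samples and does not
    average or integrate between them. Real circuitry is more complex.
    """
    t = 0
    while t <= horizon_us:
        if spike_start_us <= t < spike_start_us + spike_len_us:
            return True
        t += sample_period_us
    return False


# 1000 FPS corresponds to a frame time of 1 ms, 60 FPS to roughly 16.7 ms.
print(frame_time_ms(1000.0))
print(round(frame_time_ms(60.0), 2))

# Hypothetical 100 µs spike starting 250 µs into the observation window.
# A monitor sampling every 1000 µs misses it, one sampling every 10 µs does not.
print(monitor_sees_spike(250, 100, 1000, 5000))
print(monitor_sees_spike(250, 100, 10, 5000))
```

The point is only qualitative: the shorter the render cycles and the spikes that come with them, the more likely it is that a comparatively slow monitor samples right past them.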
Some manufacturers like EVGA or MSI install additional slow-blow fuses on their PCBs to protect the comparatively expensive components of the voltage regulation from overloads. In those cases, the fuses are triggered by the sum of all the ultra-short spikes or transients. The card ends up defective here as well, but it is easier to repair later.
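How a slow fuse can be triggered by the sum of many short transients, even though no single one would blow it, can be sketched with a very simplified model. The Python example below adds up the I²t value (current squared times duration) of every spike above the rated current and reports when the total reaches a melting threshold. All numbers are invented for illustration, the function name is hypothetical, and the model deliberately ignores that a real fuse cools down between spikes, so treat it as a thought experiment and not as a description of the fuses actually used on these cards.

```python
from typing import List, Optional, Tuple


def fuse_trips(spikes: List[Tuple[float, float]],
               rated_current_a: float,
               melt_i2t: float) -> Optional[int]:
    """Return the index of the spike at which the fuse blows, or None.

    spikes: list of (current in A, duration in s).
    Assumption: heat from over-current spikes simply accumulates as I^2 * t
    with no cooling in between (a worst-case simplification).
    """
    accumulated = 0.0
    for index, (current_a, duration_s) in enumerate(spikes):
        if current_a > rated_current_a:
            accumulated += current_a ** 2 * duration_s
        if accumulated >= melt_i2t:
            return index
    return None


# Hypothetical values: 60 A spikes of 100 µs each on a fuse rated for 40 A
# with a melting threshold of 3.0 A²s. Each spike contributes about 0.36 A²s.
single_spike = [(60.0, 1e-4)]
many_spikes = [(60.0, 1e-4)] * 20

print(fuse_trips(single_spike, 40.0, 3.0))  # one spike alone is not enough
print(fuse_trips(many_spikes, 40.0, 3.0))   # the accumulated sum is
```

In this toy model a single spike stays far below the threshold, while a rapid series of identical spikes crosses it after a handful of repetitions.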
But it should also be noted here once again that all board partners must submit each of their PCB designs to Nvidia for approval prior to production (so-called Greenlight Program). So the responsibility for a faulty or suboptimal PCB design falls back on Nvidia after all. But as I said, this explanation with the spikes is only one theory of many and nothing is officially confirmed yet. So please enjoy this with an appropriately sized grain of salt or speculoos.