
The death cocktail for Nvidia's RTX cards: a whole mix of different problems is said to be responsible for the failures | Nvidia GeForce RTX Graphics Cards Are Dying

Just another piece of Nvidia news? It's not quite that simple. If you ask around long enough, the information becomes more and more condensed, and in the end it all comes together in a coherent picture. It turns out that many people are right, even if only indirectly. The failures cannot be pinned on a single cause; rather, they show very nicely what can happen in sum when one does not take enough time. Time pressure has never been a good colleague. But one thing at a time…

 

1. Memory issues

They were real, and there is no need to gloss over that, even if one or another colleague thought it was nonsense. The real question was what caused the death of the card in question. The memory problem itself splits into two different scenarios. On the one hand, there was the Micron memory as such, which under certain conditions was obviously not quite as robust as described in the specs. On the other hand, there was the matter of cold solder joints, which I will discuss in point 2.

The fact is that thermal problems must have occurred, as I was also able to observe in my test with infrared measurements. I also found it interesting to get feedback from several manufacturers, who pointed out two possible consequences. The first concerned exactly these Micron modules, whose quality variance is said to have been quite high. So not every Micron module is bad per se, but a single faulty module is enough to put a card completely or partially out of action.

From a purely statistical point of view, the probable failure rate of a GeForce RTX 2080 Ti with its 11 memory modules is of course also significantly higher than that of an RTX 2080 or 2070 with only 8 modules. However, since these "volume models" are considered more important by the manufacturers, Nvidia has, according to several sources, now completely switched the GPU and memory bundles delivered to the board partners to Samsung modules. The GeForce RTX 2080 Ti is due to follow soon.

Incidentally, the memory problem affects not only Nvidia's Founders Edition but all manufacturers to the same extent, which also explains why (a few) board partner cards with their own designs caused problems as well. But since there have not been that many of them, the provisional RMA rate is not excessively high, and the memory is not the main cause but only a side issue.
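The statistical point about 11 versus 8 memory modules can be made concrete with a small sketch. It assumes that every module fails independently and with the same probability, which is a simplification, and the per-module value `p` below is purely hypothetical and not a measured figure. The function name is likewise only chosen for illustration.

```python
def card_failure_probability(module_failure_prob: float, module_count: int) -> float:
    """Probability that at least one of `module_count` modules fails.

    Assumes independent modules with an identical failure probability.
    """
    if not 0.0 <= module_failure_prob <= 1.0:
        raise ValueError("module_failure_prob must be between 0 and 1")
    if module_count < 0:
        raise ValueError("module_count must not be negative")
    # The card survives only if every single module survives.
    return 1.0 - (1.0 - module_failure_prob) ** module_count


if __name__ == "__main__":
    p = 0.01  # hypothetical per-module failure probability, NOT a measured value
    for name, modules in (("RTX 2080 Ti", 11), ("RTX 2080 / 2070", 8)):
        print(f"{name} ({modules} modules): {card_failure_probability(p, modules):.2%}")
```

With this assumed 1 percent per module, the card with 11 modules ends up at roughly 10.5 percent and the card with 8 modules at roughly 7.7 percent. The absolute numbers mean nothing here; only the ratio illustrates why more modules mean more risk.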

 

2. Cold solder joints

In addition, I was told that the modules were sometimes very difficult to solder and that reflow soldering could have resulted in large variations in quality. The long-term thermal stress comes into play here as well: even if the module does not die directly from overheating, solder joints can "break" or come loose due to constant heating and cooling.

Normally, such things can be found with longer environmental and shock tests, but unfortunately these take a lot of time up front, which not everyone can or wants to invest. A long time ago I wrote an article about the development and testing of electronic devices, and this really necessary procedure applies across all industries.

However, it is not only the memory that can be affected by these soldering problems. The large LBGA package of the RTX 2080 Ti in particular is said to have been affected in some cases. Specifically, there was talk of possibly cold or faulty solder joints from the SMT process, in which the BGA chip is connected to the circuit board (PCB) by reflow soldering.

The following graphic shows once again the phenomenon of faulty contacts between BGA and PCB, possibly caused by a faulty SMT stencil or incorrect temperatures.

Only complete breaks or missing solder balls can be detected directly during quality control by a simple functional test on site. Anything that only shows up slowly, after several thermal cycles of heating and cooling, can only be caught by the manufacturer's own quality management, and that is said not to have worked perfectly at Foxconn, the contract manufacturer for Nvidia's Founders Edition.

Apparently they either relied too much on their own routine or simply did not have enough time for longer tests (see above). Or maybe even both, who knows? I don't want to get lost in technical details here, but the forum thread contains many interesting hints from our forum users who have relevant experience in production (up to and including underfill). Interested readers should take a look, because it is definitely a nice addition to one's own knowledge.

 

3. Bending, Die, Reference Cooler

The bending of the multi-layer board is another matter that must not be viewed in isolation from the problem above! All these things, such as bad or weak solder joints, naturally become more important in interplay with the differing, thermally induced physical expansion. And now it may become even more complex.

One has to ask why Foxconn, when mounting the Founders Edition cooler, uses flexible spring screws with a fixed torque for the outer fasteners, which could compensate for possible differences in the contact of the vapor chamber on the die. This applies in particular to manufacturing tolerances of the die and the package (height), as well as unevenness in the base of the vapor chamber. On the other hand, the four inner, ordinary screws of the cooling and mounting frame are tightened as if there were no tomorrow. Spot the contradiction!

The consequences can be quite serious if such firmly connected components expand to different degrees, or even if the pressure at the individual corners differs. Mechanical faults and broken solder joints are then almost pre-programmed. You can play the game with the eight screws, but then you need a better cooler base and less torque when tightening.
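To get a rough feeling for what "expanding to different degrees" means, here is a minimal sketch of the free, unconstrained expansion mismatch between two joined materials. All inputs are illustrative assumptions and not measurements from an RTX card: the expansion coefficients are merely typical textbook orders of magnitude for a circuit board laminate and for silicon, and the span and temperature swing are made up. The function name is hypothetical as well.

```python
def differential_expansion_um(
    length_mm: float,
    cte_a_ppm_per_k: float,
    cte_b_ppm_per_k: float,
    delta_t_k: float,
) -> float:
    """Difference in free linear thermal expansion of two materials, in micrometres.

    Uses the simple linear model dL = alpha * L * dT for each material and
    ignores any mechanical constraint, warping or clamping force.
    """
    mismatch_per_k = abs(cte_a_ppm_per_k - cte_b_ppm_per_k) * 1e-6
    length_um = length_mm * 1000.0
    return length_um * mismatch_per_k * delta_t_k


if __name__ == "__main__":
    # Assumed example values only: 40 mm span, ~16 ppm/K (board laminate),
    # ~3 ppm/K (silicon), 50 K temperature swing between idle and load.
    mismatch = differential_expansion_um(40.0, 16.0, 3.0, 50.0)
    print(f"Expansion mismatch: {mismatch:.1f} micrometres")
```

Whatever the real values are, the solder joints have to absorb this difference on every heating and cooling cycle, and uneven screw pressure comes on top of it.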

By the way, the differing contact pressure also affects the memory, where the modules (see picture above) get caught "between the fronts". If the pressure at the edges differs (which the thick pad also encourages), this is dangerous for solder joints of inferior quality. The entire cooler layout and mounting is actually problematic, but the main thing was to glue on the top cover… There was apparently enough time to come up with something like that.

 

4. RMA Rate

Caution is required here too, because it is hard to tell in whose interest or on whose behalf one report or another was published on the Internet. The fact is that the Founders Edition produced by Foxconn is predominantly affected, so one can actually disregard the statements of retailers who only sell custom models. They merely confirm the higher incidence among FE cards, but nothing more.

All the more so since, with so few units sold so far, one should not really speak of an RMA rate but, in fairness, of a provisional RMA trend. So far, too few cards have been sold at retail to stick one's neck out with truly reliable figures.
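Why a small number of sold units only supports a "trend" and not a reliable rate can be sketched with a confidence interval for a proportion. The example below uses the Wilson score interval, which is my choice for illustration and not something the retailers or manufacturers are known to use. The unit and return counts are invented placeholders, not real sales or RMA data.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, units: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed failure proportion.

    z = 1.96 corresponds to a confidence level of roughly 95 percent.
    """
    if units <= 0:
        raise ValueError("units must be positive")
    if not 0 <= failures <= units:
        raise ValueError("failures must be between 0 and units")
    p = failures / units
    denom = 1.0 + z * z / units
    centre = (p + z * z / (2.0 * units)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / units + z * z / (4.0 * units * units)) / denom
    # Clamp to the valid probability range to absorb rounding noise.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical counts: the same 2 percent observed rate at two sample sizes.
    for failures, units in ((2, 100), (200, 10_000)):
        low, high = wilson_interval(failures, units)
        print(f"{failures}/{units} returned: plausible rate {low:.2%} to {high:.2%}")
```

The same observed percentage yields a much wider plausible range when only a few cards have been sold, which is exactly why a single retailer's early numbers should be read with caution.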

 

Finally, I have a small request to my colleagues. It is only fair and collegial to adopt information, lines of reasoning, conclusions or speculation in such a way that the source remains visible. Reproducing tests or interpreting what you have read is not forbidden but expressly welcome, but it should be done fairly. And if you add your own interpretation, you should make that distinction recognizable. Thank you.


About the author

Igor Wallossek

Editor-in-chief and namesake of igor'sLAB, the content successor to Tom's Hardware Germany, whose license was returned in June 2019 in order to better meet the quality demands of web content and the challenges of new media such as YouTube with a channel of its own.

Computer nerd since 1983, audio freak since 1979 and pretty much open to anything with a plug or battery for over 50 years.
