Some things fall into your lap almost as if from heaven and, at the same time, land a little on the sender's feet. Basic rule number one: as an attentive observer and tester, you always remain cooperative. You do not grind through your workload indifferently and then rejoice in the next supposed mistake that you can gleefully spread in public.
Sure, that makes you look like a big shot every time, but the causes of some problems lie much deeper than the mere analysis of measurement results might suggest at first. Given some basic technical knowledge, there are sometimes even interesting new insights to be gained, because the whole subject of graphics cards is more complex than one might think.
Although many like to criticize me for my supposed proximity to the industry, being exclusively on someone's payroll and cooperating with them are two very different things. The former certainly brings a little more money into the till; the latter brings friends, information, and preferred sampling. That is why I want to use a very recent example today to show what can go wrong, without anyone having to be starved in the Asian company dungeon, punished without a lunch box.
Starting point: hot spot, cause unclear at first
Let's take a quick look at the bone of contention. The almost 107°C in the closed case (FurMark) is really nothing you want to see, even if it is "only" an 8-layer board. Above 95°C, FR4 as a board material is not infinitely resilient, at least not in the long run. The flash point is not the issue here at all. The superficial viewer would now simply conclude that the cooling of the voltage converters is undersized:
Counter-question: if you arrange 5 phases with a total of 10 asymmetric dual MOSFETs vertically in a row, one below the other, you would actually expect the heat to develop a little more evenly and to be better distributed across the surface. Sure, current implementations load individual phases differently depending on the overall load, or even leave some phases idle at low loads, but even then a thermal image like this should not actually emerge. At least not one this extreme.
Current monitoring and root cause analysis
Now, of course, I don't want to get bogged down in technical details that would probably bore most readers anyway, but we do have to dive in a little for a better understanding. Don't worry, it stays understandable. So let us move directly to the voltage converters. No matter how many phases need to be controlled and perhaps intelligently balanced, a PWM controller needs one value as feedback from each individual control loop (each phase): the instantaneous current flow.
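To make the role of this feedback value concrete, here is a minimal Python sketch of a balancing loop. It is a toy model and not the algorithm of any real PWM controller: the "strength" per phase, the trim values, the gain, and all function names are assumptions made up for illustration. Each phase takes a share of the load proportional to its strength, and the loop nudges a per-phase trim until the measured phase currents match the average.

```python
from typing import List


def phase_currents(total_current: float, strengths: List[float],
                   trims: List[float]) -> List[float]:
    """Toy plant (assumption): each phase carries a share of the load
    proportional to strength * (1 + trim)."""
    weights = [s * (1.0 + t) for s, t in zip(strengths, trims)]
    total_weight = sum(weights)
    return [total_current * w / total_weight for w in weights]


def balance_step(currents: List[float], trims: List[float],
                 gain: float = 0.5) -> List[float]:
    """One proportional correction: phases above the average current get
    their trim reduced, phases below the average get it raised."""
    mean = sum(currents) / len(currents)
    return [t - gain * (i - mean) / mean for i, t in zip(currents, trims)]


def run_balancing(total_current: float, strengths: List[float],
                  steps: int = 40, gain: float = 0.5) -> List[float]:
    """Repeat measure-and-correct, then return the final phase currents."""
    trims = [0.0] * len(strengths)
    for _ in range(steps):
        currents = phase_currents(total_current, strengths, trims)
        trims = balance_step(currents, trims, gain)
    return phase_currents(total_current, strengths, trims)


if __name__ == "__main__":
    # Hypothetical numbers: 150 A total, the third of five phases is "stronger".
    strengths = [1.0, 1.0, 1.25, 1.0, 1.0]
    print("unbalanced:", phase_currents(150.0, strengths, [0.0] * 5))
    print("balanced:  ", run_balancing(150.0, strengths))
```

The point of the sketch is only that the loop can do no better than the current values it is fed: if the per-phase measurement is off, the loop will happily converge on the wrong distribution.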
I have already teased one keyword, balancing; here comes the second: DCR (Direct Current Resistance). In the end, each component has very specific characteristics in this respect. To cut it short: DCR is the basis for calculating temperatures and currents. But how exactly does the controller learn which currents flow in which control loop? Monitoring can be done in different ways because, unsurprisingly, there are different methods for it.
In my article "Nvidia GeForce RTX 2080 Ti – Internal details about power supply, different components and where the spikes stayed" I praised Nvidia's reference design for the power supply, and rightly so. There you could also read something about the Smart Power Stages (SPS) and IMON (current), but at the time I spared myself certain details. I'll add them now, because IMON is exactly what the so-called MOSFET DCR (DrMOS) provides!
The picture above shows the typical layout with the intelligent SPS, which use IMON to provide the current value for each individual control loop. This value is urgently needed for perfect balancing, i.e. the balance between the phases. How do the SPS determine this value? The drain currents of the MOSFETs are measured in real time, and these values are also extremely accurate (in the example above, a 5 µA/A signal).
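As a rough illustration of what such an IMON signal means numerically, the following Python sketch converts between phase current and monitor signal. It assumes the gain in the example is meant as 5 µA of IMON signal per ampere of phase current, and that the controller reads this signal as a voltage across a sense resistor; the 1 kΩ value and the function names are hypothetical and not taken from any data sheet.

```python
import math

# Assumption: 5 µA of monitor current per ampere of phase current.
IMON_GAIN_A_PER_A = 5e-6


def imon_signal_current(phase_current_a: float,
                        gain: float = IMON_GAIN_A_PER_A) -> float:
    """Monitor current (in A) that a power stage would report for a
    given phase current."""
    return phase_current_a * gain


def phase_current_from_sense_voltage(v_sense_v: float, r_sense_ohm: float,
                                     gain: float = IMON_GAIN_A_PER_A) -> float:
    """Recover the phase current from the voltage that the monitor
    current develops across a (hypothetical) sense resistor."""
    if r_sense_ohm <= 0 or gain <= 0:
        raise ValueError("sense resistance and gain must be positive")
    return v_sense_v / r_sense_ohm / gain


if __name__ == "__main__":
    # Hypothetical operating point: 30 A in one phase, 1 kOhm sense resistor.
    i_mon = imon_signal_current(30.0)
    v_sense = i_mon * 1000.0
    print(f"IMON current: {i_mon * 1e6:.1f} uA, sense voltage: {v_sense * 1e3:.1f} mV")
    print(f"recovered phase current: {phase_current_from_sense_voltage(v_sense, 1000.0):.2f} A")
```

Because the gain is a fixed property of the power stage itself, the result does not depend on the tolerances of the coils, which is the main attraction of this method.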
This very cost-intensive solution replaces the significantly cheaper Inductor DCR, i.e. a current measurement via the DC resistance of the respective filter coils in the output stage. Nvidia, for example, uses such a solution for less expensive cards (illustrative image below), where the demands on current delivery are more relaxed. However, the accuracy of this solution is significantly lower and is additionally strongly influenced by variations in component quality. Tolerances that are too large can quickly upset the entire balance!
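Why coil tolerances matter so much can be shown with a small Python sketch. With Inductor DCR sensing, the controller effectively divides a measured voltage by the DCR it assumes for the coil, so any deviation of the real DCR goes straight into the current estimate. The DCR values below, the function names, and the idealized assumption that the controller perfectly equalizes its estimates are all hypothetical; the copper temperature coefficient of roughly 0.39 % per kelvin is a textbook approximation.

```python
from typing import List

# Approximate temperature coefficient of copper resistance near 20 °C.
COPPER_TEMPCO_PER_K = 0.0039


def dcr_at_temperature(dcr_20c_ohm: float, temp_c: float,
                       alpha: float = COPPER_TEMPCO_PER_K) -> float:
    """Linear approximation of how a copper winding's DCR rises with temperature."""
    return dcr_20c_ohm * (1.0 + alpha * (temp_c - 20.0))


def estimated_current(true_current_a: float, dcr_actual_ohm: float,
                      dcr_assumed_ohm: float) -> float:
    """The controller sees V = I * DCR_actual but divides by the DCR it assumes."""
    return true_current_a * dcr_actual_ohm / dcr_assumed_ohm


def true_currents_when_estimates_match(total_current_a: float,
                                       dcr_actual_ohm: List[float]) -> List[float]:
    """If the controller equalizes its *estimates*, the real currents end up
    inversely proportional to each coil's actual DCR."""
    conductances = [1.0 / r for r in dcr_actual_ohm]
    total = sum(conductances)
    return [total_current_a * g / total for g in conductances]


if __name__ == "__main__":
    # Hypothetical: nominal 0.30 mOhm coils, one coil 15 % below nominal.
    dcrs = [0.30e-3, 0.30e-3, 0.30e-3, 0.30e-3, 0.255e-3]
    for i, (amps, r) in enumerate(zip(true_currents_when_estimates_match(150.0, dcrs), dcrs)):
        # Copper loss in the coil itself scales with the square of the current.
        print(f"phase {i}: {amps:.2f} A, coil loss {amps**2 * r:.3f} W")
```

In this toy case the controller believes all five phases are perfectly balanced, while the phase with the low-DCR coil actually carries several amperes more than its neighbours. This is one plausible mechanism for a localized hot spot, not a confirmed diagnosis of the card discussed here.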
The balance is right again
The quality of the coils is always a story in itself, and it also explains why the manufacturer could not immediately recognize the problem itself. It can be assumed that the coils on the boards in the EVT (Engineering Validation and Testing), DVT (Design Validation and Testing), or PVT (Production Validation and Testing) phase had tighter tolerances, or that coils with the same inductance according to the data sheet were installed whose values nevertheless deviated in reality. Theory and paper design, small series, and the practice of mass production are often enough very different siblings.
What the board partners very often deliver to editors, namely custom designs hastily stitched together under time pressure from the chip manufacturers (Nvidia and AMD), is only in the rarest of cases real mass production, let alone retail goods. They are also not "Golden Samples", as people like to suspect, but simply cards made in small series or products from the MVT (Manufacturing Verification Testing) phase. Often enough, there are worlds between these graphics cards and what arrives in stores later.
But this step can also be a real opportunity to improve the product once more! After all, many eyes see more than a single product engineer or a small team who have to finish something under time pressure by Day X. I measured a few things myself and worked out with the manufacturer how to solve the problem. A new BIOS will now be used in actual mass production which, among other things, implements balancing better after targeted fine-tuning.
This has already reduced temperatures at the spot in question by 5 degrees without any further changes. Because if you think about it logically: the best cooling is the cooling you don't need! But I'm not really happy with that yet either. The next page will show what Nvidia's new fan control makes possible if you just put the little grey cells to work. I can already give this much away: a lot has happened! So please turn the page…