AMD’s latest processor revision guide for the EPYC 7002 “Rome” server chips reveals an interesting erratum that can cause a core to hang. After about 1,044 days (~2.93 years) of uptime, a core can get stuck, and the server must be rebooted before the chip works properly again. AMD has stated that it will not fix the problem. AMD’s description of the erratum, which affects the second generation of EPYC processors (the latest fourth-generation chips, Genoa, are not affected), is concise, but there is still plenty to analyze.
The underlying problem is that the core fails to exit the CC6 sleep state. AMD states that the exact timing of the failure depends on spread spectrum and the REFCLK frequency; the latter is the reference clock that helps the chip track time. A Reddit user by the name of acid_migrain has come up with a plausible theory about the exact timing of the hang: “Despite what they say, the problem actually occurs after about 1042 days and 12 hours. The TSC clock is 2800 MHz, and 2800 * 10**6 * 1042.5 days equals almost 0x380000000000000, which contains too many zeros to be a coincidence.”
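The arithmetic behind that theory is easy to check. The following is a minimal sketch, not AMD's explanation: the 2800 MHz TSC rate comes from the quoted comment, and the suspected threshold is taken to be 0x380000000000000 (0x38 followed by thirteen hex zeros, i.e. 56 × 2^52), which is the value that matches the quoted multiplication. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400

# Assumption from the quoted Reddit comment, not confirmed by AMD.
TSC_HZ = 2_800 * 10**6

# Suspected counter value: 0x38 followed by 13 hex zeros (56 * 2**52).
SUSPECTED_THRESHOLD = 0x380000000000000


def ticks_after(days: float, hz: int = TSC_HZ) -> int:
    """Number of TSC ticks accumulated after `days` of uptime at `hz`."""
    return int(hz * days * SECONDS_PER_DAY)


def days_to_reach(ticks: int, hz: int = TSC_HZ) -> float:
    """Uptime in days needed for a counter running at `hz` to reach `ticks`."""
    return ticks / hz / SECONDS_PER_DAY


ticks = ticks_after(1042.5)
days = days_to_reach(SUSPECTED_THRESHOLD)

print(f"ticks after 1042.5 days: {hex(ticks)}")
print(f"suspected threshold:     {hex(SUSPECTED_THRESHOLD)}")
print(f"days to reach threshold: {days:.4f}")
```

Under these assumptions the two values differ by less than one part in a million, and the threshold is reached a few seconds before the 1042.5-day mark. That is consistent with the theory but does not prove it, since AMD has not published the mechanism.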
There are two simple workarounds: either reboot the server before 1,044 days of uptime have elapsed, which resets the timer, or disable the CC6 sleep state. A bug that takes 2.93 years to surface is interesting, but the question is how relevant it really is. It does matter, yet security updates and maintenance should normally force a reboot at much shorter intervals. In practice, however, Linux live patching makes it possible to apply updates without rebooting, which can lead to uptimes long enough to trigger the erratum. Servers used for business-critical applications in particular often run for very long periods without a restart.
Interesting as it is, this erratum does not affect most users, and errors in chips are by no means uncommon. Modern CPUs are among the most complex devices ever designed by humans, and it is common to encounter errors either during development or after the chips have shipped, in which case they may be addressed in a later stepping.
Chip errata are common, but not necessarily bad
With so many transistors on a chip, it is inevitable that problems will occur. It is common for a chip to contain a thousand or more errors, most of which are fixed in newer revisions of the chip or through firmware changes prior to release. These bugs can be of various kinds, from security vulnerabilities to faulty flags and cache tags. Chip manufacturers try to fix them before launch, but some persist even in chips that have already shipped. For example, more than 150 errata are still listed for Intel’s 8th generation chips, even though these chips were launched in 2017. We don’t know exactly how many bugs AMD’s Rome chips had in total, as AMD has removed the list of fixed bugs. However, 39 unfixed errata are known to remain, which doesn’t seem so bad compared to Intel.
Some bugs are not fixed if they do not cause any harm. Aside from critical bugs that could be potential security vulnerabilities, some feature-related bugs are simply never patched. The chip manufacturer weighs factors such as the severity of the bug, the feasibility of fixing it, and whether the number of bugs is large enough to warrant further action. This is not an easy decision. Why didn’t AMD notice this sooner? Well, 2.93 years is longer than any qualification cycle. The AMD EPYC Rome chips were launched in August 2019, so some AMD customers with systems that have been running continuously since then may have already encountered the problem.
Rome will not be joining the Uptime Club
And then there are those who simply want to join the Uptime Club and set a record. Their goal is to outdo the computer aboard the Voyager 2 spacecraft. Yes, the probe that was the second to enter interstellar space. That computer has been running for 16,735 days (nearly 46 years) and is still working fine. When it comes to terrestrial records, 6,014 days (about 16 years) seems to be the all-time high for a server, but there is much discussion about other candidates for that title. The small Reddit community /r/uptimeporn/ shows many examples of very long uptimes.
Be that as it may, it will be impossible to break this record with the EPYC Rome chips: AMD has made clear that the erratum will not be fixed, so there is no guarantee that every core will make it past the 1,044-day mark. AMD may have decided that the problem is too expensive to fix in silicon, or that a workaround via microcode or firmware would cost too much performance. It could also be that too few customers are affected to make a fix worthwhile. Either way, to sleep better at night, it is advisable to disable the server’s CC6 sleep state or simply reboot every 1,000 days or so.
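For the reboot-based workaround, a small monitoring check can flag machines that are approaching the limit. The sketch below assumes a Linux host, where `/proc/uptime` reports the seconds since boot as its first field. The 44-day safety margin and all function names are hypothetical choices made for illustration, not a recommendation from AMD.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
ERRATUM_DAYS = 1044       # approximate figure from AMD's revision guide
SAFETY_MARGIN_DAYS = 44   # hypothetical margin: reboot by roughly day 1000


def parse_uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into uptime in days."""
    # The file holds two numbers: seconds since boot and aggregate idle seconds.
    uptime_seconds = float(proc_uptime_text.split()[0])
    return uptime_seconds / SECONDS_PER_DAY


def read_uptime_days(path: str = "/proc/uptime") -> float:
    """Read the current uptime in days (Linux only)."""
    return parse_uptime_days(Path(path).read_text())


def days_until_reboot_deadline(uptime_days: float) -> float:
    """Days left before the self-imposed reboot deadline; negative if overdue."""
    return (ERRATUM_DAYS - SAFETY_MARGIN_DAYS) - uptime_days
```

On an affected server, calling `days_until_reboot_deadline(read_uptime_days())` from a cron job or a monitoring agent would return a negative number once the machine is overdue for a reboot. Disabling CC6 instead removes the need for such a check, at the cost of higher idle power.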
Source: TomsHardware