AMD’s EPYC Rome chips fail after 1,044 days of operation

5. June 2023 05:55

AMD’s latest processor revision guide for the EPYC 7002 “Rome” server chips documents an interesting erratum that can cause a core to hang. After about 1,044 days (~2.86 years) of uptime, a core can get stuck, and the server has to be rebooted to get the chip working properly again. AMD has stated that it will not fix this problem. Although AMD’s description of the problem, which affects the second generation of EPYC processors (the latest fourth-generation chips, Genoa, are not affected by this erratum), is concise, there are still many aspects worth analyzing.

The main problem is that the affected core fails to exit the CC6 sleep state. AMD states that the exact timing of the failure depends on spread spectrum and the REFCLK frequency; the latter is the reference clock that helps the chip keep track of time. A Reddit user by the name of acid_migrain has come up with a plausible theory about the exact timing of the hang: “Despite what they say, the problem actually occurs after about 1042 days and 12 hours. The TSC clock is 2800 MHz, and 2800 * 10**6 * 1042.5 days equals almost 0x380000000000000, which contains too many zeros to be a coincidence.”
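The arithmetic behind this theory is easy to check. The following Python sketch assumes a TSC that ticks at exactly 2800 MHz, as in the quote, and ignores spread spectrum; the function names are made up for illustration. It compares the tick count after 1042.5 days with 0x380000000000000, which is 0x38 followed by 13 hex zeros, or 56 * 2**52.

```python
SECONDS_PER_DAY = 86_400
TSC_HZ = 2_800 * 10**6  # 2800 MHz, taken from the quote (assumed exact)

# The "suspiciously round" counter value: 0x38 followed by 13 hex zeros.
SUSPECT_TICKS = 0x380000000000000


def ticks_after(days: float, hz: int = TSC_HZ) -> int:
    """Number of TSC ticks elapsed after the given number of days."""
    return round(hz * SECONDS_PER_DAY * days)


def days_until(ticks: int, hz: int = TSC_HZ) -> float:
    """Number of days it takes the TSC to reach the given tick count."""
    return ticks / hz / SECONDS_PER_DAY


if __name__ == "__main__":
    print("ticks after 1042.5 days:", hex(ticks_after(1042.5)))
    print("suspect value:          ", hex(SUSPECT_TICKS))
    print("days until suspect value:", days_until(SUSPECT_TICKS))
```

Under these assumptions the two values differ by less than ten seconds’ worth of ticks, which fits the theory but does not prove it.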

There are two simple workarounds: either reboot the server before it reaches 1,044 days of uptime, which resets the timer, or disable the CC6 sleep state.

The 2.86-year core hang is interesting, but the question is how relevant it really is. Security updates and maintenance should be performed at much shorter intervals, and these often involve a reboot. On the other hand, the live patching feature of Linux makes it possible to apply updates without rebooting, which can lead to uptimes long enough to trigger the erratum. Servers used for business-critical applications in particular often run for a long time without a restart. Still, this erratum does not affect most users, and errors in chips are by no means uncommon. Modern CPUs are among the most complex devices ever designed by humans, and it is common to find errors either during development or after the chips have shipped, in which case a fix may arrive with a later revision (stepping).
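The first workaround amounts to keeping an eye on uptime. A minimal Python sketch of such a check follows; it assumes a Linux host, where the first field of /proc/uptime is the uptime in seconds. The 1,044-day limit is the approximate figure reported above, and the 44-day safety margin and the function names are purely hypothetical choices.

```python
SECONDS_PER_DAY = 86_400
LIMIT_DAYS = 1044          # approximate figure reported for the erratum
SAFETY_MARGIN_DAYS = 44    # hypothetical margin, pick your own


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the system uptime in seconds (Linux only)."""
    with open(path) as f:
        return float(f.read().split()[0])


def days_left(uptime_seconds: float,
              limit_days: float = LIMIT_DAYS,
              margin_days: float = SAFETY_MARGIN_DAYS) -> float:
    """Days remaining before a reboot should be scheduled.

    A value of zero or less means the reboot is due or overdue.
    """
    return limit_days - margin_days - uptime_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    remaining = days_left(read_uptime_seconds())
    if remaining <= 0:
        print("Reboot overdue: uptime is within the safety margin of the limit.")
    else:
        print(f"{remaining:.1f} days left before a reboot should be scheduled.")
```

A check like this could run from a monitoring job; how far ahead to warn depends on how maintenance windows are planned.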

Chip errata (errors in chips) are common, and not necessarily bad

With such a large number of transistors, it is inevitable that problems will occur. It is common for a chip to contain a thousand or more errors, which are fixed in newer versions of the chip or through firmware changes prior to release. These bugs can include various types of errors, such as security vulnerabilities or faulty flags and cache tags. Chip manufacturers make an effort to fix these bugs before launch, but some persist even in chips that have already shipped. For example, more than 150 bugs are still listed for Intel’s 8th generation chips, even though these chips were launched in 2017. We don’t know exactly how many bugs AMD’s Rome chips have had in total, as AMD has removed the list of fixed bugs. What is known is that 39 bugs remain, which doesn’t seem so bad compared to Intel.

Some bugs are never fixed because they do not cause any real harm. Aside from critical bugs that could be potential security vulnerabilities, some feature-related bugs are simply never patched. The chip manufacturer evaluates factors such as the severity of the bug, the feasibility of fixing it, and whether the number of bugs is large enough to warrant further action. This is not an easy decision. Why didn’t AMD notice this sooner? Well, 2.86 years is longer than any qualification cycle. AMD announced the EPYC Rome chips in late 2018 and launched them in August 2019, so some AMD customers may have already encountered the problem.

Rome will not be joining the Uptime Club

And then there are those who simply want to join the Uptime Club and set a record. Their goal is to outdo the computer aboard the Voyager 2 space probe, the second spacecraft to enter interstellar space. That computer has been running for 16,735 days (over 45 years) and is still working fine. When it comes to terrestrial records, 6,014 days (16 years) seems to be the all-time high for a server, but there is much discussion about other candidates for that title. The small Reddit community of /r/uptimeporn/ shows many examples of long uptimes.

Be that as it may, the EPYC Rome chips will struggle to break this record: the bug will not be fixed, so not all cores can exceed the 1,044-day limit under all circumstances. AMD has clarified that the problem will not be fixed. It may be that AMD has decided that a fix in silicon is too expensive, or that a workaround via microcode/firmware would cost too much performance. It could also be that too few customers are affected to make a fix worthwhile. Either way, to sleep better at night, it is advisable to disable the server’s CC6 sleep state or simply reboot every 1,000 days or so.
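Before disabling anything, it helps to see which idle states the operating system is actually allowed to use. The following Python sketch assumes a Linux host with the standard cpuidle sysfs layout (/sys/devices/system/cpu/cpuN/cpuidle/stateM, each with a name and a disable file). Which entry corresponds to CC6 on a Rome system is not stated in the article and has to be checked against the platform documentation; the function name is hypothetical.

```python
from pathlib import Path


def list_idle_states(cpu_dir):
    """Return a sorted list of (index, name, disabled) for one CPU.

    cpu_dir is a directory such as /sys/devices/system/cpu/cpu0.
    An empty list is returned if no cpuidle states are exposed there.
    """
    states = []
    for state_dir in Path(cpu_dir).glob("cpuidle/state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((index, name, disabled))
    return sorted(states)


if __name__ == "__main__":
    for index, name, disabled in list_idle_states("/sys/devices/system/cpu/cpu0"):
        print(f"state{index}: {name} ({'disabled' if disabled else 'enabled'})")
```

The sketch only reads. Writing 1 to a state’s disable file as root asks the kernel not to use that state, and the setting does not survive a reboot, so it would have to be reapplied at every boot.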

Source: Tom’s Hardware

12 replies

NatokWa

#1 Jun 05, 2023

Quote: "Be that as it may, with the EPYC Rome chips it will be impossible to break this record – the bug is not fixed, so not all cores will be able to exceed the 1,044 day limit under all circumstances."

The article itself already shows this to be wrong ... it is enough to switch off C6 and the processor is "in the race".

ChaosKopp

#2 Jun 05, 2023

Nothing comes close to the MicroVAX we discovered during the Year 2000 migration anyway. It had been running as a time server since the 80s, so reliably that everyone had forgotten about it. Now that was an uptime...

Derfnam

#3 Jun 05, 2023

Was it epic^^?

amd64

#4 Jun 05, 2023

If the server needs to be able to use C6 (deep power down), then it can also be rebooted once in a while, because its workload is probably not that critical; otherwise you switch off C6 and sidestep the problem. It is still a very interesting detail 🧐

Aragornius

#5 Jun 06, 2023

@Samir Bashir Correction please: "Voyager 2" is a space probe, not a spaceship; otherwise it confuses readers like me.

8j0ern

#6 Jun 06, 2023

Seriously?

That is practically 8x 8 Ryzen cores under one IHS.
Rome has been around since 2018: https://wccftech.com/amd-epyc-rome-...-launch-64-core-128-thread-128-pcie-gen4/amp/

Derfnam

#7 Jun 06, 2023

Oh, come on...
Does it say Epyc there? Or could it be that I was jokingly referring to #3?

ChaosKopp

#8 Jun 06, 2023

A predecessor, terrific we called it.

ChaosKopp

#9 Jun 06, 2023

Puns come naturally to you, just not to some readers.

8j0ern

#10 Jun 06, 2023

Rome
Milan
Milan-X
Genoa
Genoa-X

X=3DCache

ChaosKopp

#11 Jun 06, 2023

It was VAXinated, 3100 times...

LurkingInShadows

#12 Jun 06, 2023

What MS can do, AMD can do too.

About the author

Igor Wallossek

Editor-in-chief and namesake of igor'sLAB, the content successor of Tom's Hardware Germany, whose license was returned in June 2019 in order to better meet the qualitative demands of web content and the challenges of new media such as YouTube with its own channel.

Computer nerd since 1983, audio freak since 1979 and pretty much open to anything with a plug or battery for over 50 years.
