Question Weird behaviour of RX5700XT (XTX)

SKART1

Newbie
Member since
Dec 8, 2022
Posts
1
Reaction score
0
Points
2
Hello!
First of all, thank you for the incredible work on such amazing tools as RedBIOSEditor and MorePowerTool and, more importantly, for the guides and explanations of how they work.

As far as I can see, there is real expertise here in working with AMD-based GPUs, so I would like to ask for help. I think my case is also quite interesting to investigate.

_____________________________________________________________________________________
I have six RX5700XT (XTX) GPUs, some Chinese variant.

They were running under HiveOS, mining Ravencoin with teamredminer, with an undervolting profile applied through the integrated HiveOS tools.
What may be important: one old Sapphire RX580 Nitro+ was also attached to the same PC.

After a couple of hours, the miner detected that one of the RX5700 GPUs had hung and started a reboot.
After the reboot, HiveOS began hanging at the "Loading AMD drivers, N GPU" stage.
Sometimes this message was followed by one of the following:
  • amdgpu_ib_ring_tests *ERROR* IB test failed on jpeg_dec
  • Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
  • Or the system just hung shortly after trying to load the driver
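
To make it easier to compare boots, the messages above could be pulled out of a saved log automatically. This is only a minimal sketch, assuming the dmesg output has been saved as plain text; the function name `scan_dmesg` and the category labels are my own, and the patterns cover just the two messages quoted above.

```python
import re

# Patterns derived from the two messages quoted above.
# The category names are illustrative labels, not kernel terminology.
PATTERNS = {
    "ib_test_failed": re.compile(r"amdgpu.*IB test failed on (\S+)"),
    "kernel_panic": re.compile(r"Kernel panic - not syncing: (.*)"),
}


def scan_dmesg(text):
    """Return a list of (line_number, category, detail) for matching lines."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # group(1) is the failing ring name or the panic reason
                hits.append((lineno, category, match.group(1).strip()))
    return hits
```

For example, after saving the log with `dmesg > dmesg.txt`, the file contents can be passed to `scan_dmesg(open("dmesg.txt").read())` to see which ring failed and on which line.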

I have tried other risers, another PSU, another PC, and even another operating system. The only things that worked were booting Windows 10 in safe mode and booting HiveOS without loading the drivers.

That would have been the end of the story, had I not suspected that the attached RX580 had somehow affected the overclocking profile, so I tried to reset it: first to safe values and then to defaults. It didn't help. After that I started changing the BIOS settings, desperately trying to revive the card, and somehow I succeeded.
I then removed the RX580 from the set of GPUs attached to the PC, and after a couple of hours another GPU stopped working with exactly the same symptoms. I revived it as well (I still do not understand which exact action revives a card). Then the same thing happened with yet another GPU.

I tried to analyze the dmesg logs (and will dig further) and the BIOS (luckily I can still dump the BIOS even without the drivers loaded), but have not found anything interesting.
I found similar cases on some mining forums, but no solutions.
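
For the BIOS dumps, one way to check whether a failing card's image differs from a working card's is a byte-level comparison. The following is a rough sketch, assuming both dumps have already been read into memory as bytes; the function name `diff_dumps` and the result keys are hypothetical, and it says nothing about what the differing regions mean.

```python
import hashlib


def diff_dumps(good, bad, max_ranges=20):
    """Compare two BIOS dumps (bytes) and report hashes and differing ranges."""
    ranges = []
    start = None
    common = min(len(good), len(bad))
    for offset in range(common):
        if good[offset] != bad[offset]:
            if start is None:
                start = offset  # a run of differing bytes begins here
        elif start is not None:
            ranges.append((start, offset - 1))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common - 1))  # run reaches the end of the shared length
    return {
        "sha256_good": hashlib.sha256(good).hexdigest(),
        "sha256_bad": hashlib.sha256(bad).hexdigest(),
        "same_length": len(good) == len(bad),
        "diff_ranges": ranges[:max_ranges],  # inclusive (first, last) offsets
    }
```

If the two hashes match, the dumps are byte-identical, and the difference between a working and a hanging card would have to be looked for somewhere other than the dumped image.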

So the questions are:
  1. Has anyone come across a situation like this?
  2. Can overclocking settings still affect a GPU after a power-off/power-on cycle?
  3. What could cause such strange behaviour?
 