Author |
Message |
s|mon Apprentice
Joined: 04 Jul 2004 Posts: 219 Location: Bayern [de]
|
Posted: Wed Oct 14, 2020 7:55 am Post subject: resume fails after hibernate amdgpu (black screen - rx 5700) |
|
|
I'm facing issues using hibernate with my Radeon RX 5700.
The system goes into hibernation (on a 5.8 kernel the screen goes black, then comes back shortly before the actual hibernation), but when resume is triggered from the power button it ends with a black screen and an unresponsive system once image loading has completed (the screen is visible until then). If my interpretation of the dmesg output is correct, it seems to be related to amdgpu (Radeon RX 5700, Navi 10).
SSH to the machine is still possible.
Suspend to memory has worked so far, as have the test levels freezer, devices, platform, processors and core written to /sys/power/pm_test with the disk state, according to:
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
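For readers who want to repeat those test levels, here is a minimal Python sketch of the sequence of sysfs writes. It only prints the equivalent shell commands (a dry run); the function names are hypothetical, and actually performing the writes requires root and will trigger the test hibernation cycle each time.

```python
from typing import Callable, List, Sequence, Tuple

# Test levels named in the post, in the order they were listed.
PM_TEST_LEVELS = ("freezer", "devices", "platform", "processors", "core")


def plan_hibernate_tests(levels: Sequence[str] = PM_TEST_LEVELS) -> List[Tuple[str, str]]:
    """Return the (path, value) writes for one test hibernation per level."""
    writes: List[Tuple[str, str]] = []
    for level in levels:
        writes.append(("/sys/power/pm_test", level))  # select the test level
        writes.append(("/sys/power/state", "disk"))   # start the hibernation test
    writes.append(("/sys/power/pm_test", "none"))     # back to normal operation
    return writes


def run(writes: Sequence[Tuple[str, str]], write: Callable[[str, str], None]) -> None:
    """Apply each write with the given callback (dry run or real)."""
    for path, value in writes:
        write(path, value)


def print_write(path: str, value: str) -> None:
    """Dry-run callback: show the shell command instead of writing."""
    print(f"echo {value} > {path}")


if __name__ == "__main__":
    run(plan_hibernate_tests(), print_write)
```

Passing a callback that really opens the path and writes the value would turn the dry run into the actual test; keeping the two apart makes it easy to review the plan first.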
I updated to a newer BIOS, without any change in behavior.
Hardware is an MSI X570 board, Ryzen 3700X, 32G and said Radeon RX 5700.
Further details in dmesg and lspci.
Quote: | [ 452.843888] amdgpu 0000:2f:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 452.849889] amdgpu: SMU is resuming...
[ 455.353688] amdgpu: failed send message: EnableAllSmuFeatures (6) param: 0x00000000 response 0xffffffc2
[ 455.353693] [drm:amdgpu_device_ip_resume_phase2] *ERROR* resume of IP block <smu> failed -62
[ 455.353695] [drm:amdgpu_device_resume] *ERROR* amdgpu_device_ip_resume failed (-62).
[ 455.353699] PM: dpm_run_callback(): pci_pm_restore+0x0/0xb0 returns -62
[ 455.353730] PM: Device 0000:2f:00.0 failed to restore: error -62
...
[ 472.029420] [drm:amdgpu_job_timedout] *ERROR* ring sdma0 timeout, signaled seq=5251, emitted seq=5253
[ 472.029423] [drm:amdgpu_job_timedout] *ERROR* Process information: process pid 0 thread pid 0
[ 472.029428] amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
[ 476.029446] amdgpu 0000:2f:00.0: amdgpu: failed to suspend display audio
[ 476.029454] BUG: unable to handle page fault for address: fffff26600000017
[ 476.029457] #PF: supervisor read access in kernel mode
[ 476.029458] #PF: error_code(0x0000) - not-present page
[ 476.029459] PGD 0 P4D 0
[ 476.029462] Oops: 0000 [#1] SMP NOPTI
[ 476.029465] CPU: 11 PID: 254 Comm: kworker/11:1 Tainted: G W T 5.8.11-gentoo #1
[ 476.029466] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 ACE (MS-7C35), BIOS 1.C3 09/29/2020
[ 476.029470] Workqueue: events drm_sched_job_timedout
[ 476.029473] RIP: 0010:free_mqd_hiq_sdma+0x4/0x20
[ 476.029475] Code: 01 d1 48 89 48 18 48 8b 8e 08 02 00 00 48 01 d1 48 89 48 08 48 03 96 10 02 00 00 48 89 50 10 5b 5d c3 0f 1f 40 00 55 48 89 d5 <48> 83 7d 18 00 74 09 48 89 ef 5d e9 1c 1c 90 ff 0f 0b 48 89 ef 5d
[ 476.029476] RSP: 0018:ffffc90000807d08 EFLAGS: 00010202
[ 476.029478] RAX: ffff8887e9553600 RBX: 0000000000000001 RCX: 0000000080800075
[ 476.029479] RDX: fffff265ffffffff RSI: ffffffff00000000 RDI: ffff8888084f2c80
[ 476.029480] RBP: fffff265ffffffff R08: 0000000000000001 R09: ffffffff819e3500
[ 476.029481] R10: ffff8887c74b2020 R11: 0000000000000001 R12: ffff88880863c528
[ 476.029482] R13: ffff888808de0000 R14: 0000000009336700 R15: ffff88880aad40b0
[ 476.029484] FS: 0000000000000000(0000) GS:ffff88880eec0000(0000) knlGS:0000000000000000
[ 476.029485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 476.029486] CR2: fffff26600000017 CR3: 00000007fc26c000 CR4: 0000000000340ea0
[ 476.029487] Call Trace:
[ 476.029490] kernel_queue_uninit+0x2b/0xd4
[ 476.029492] stop_cpsch+0xa0/0xd0
[ 476.029495] kgd2kfd_suspend.part.0+0x34/0x50
[ 476.029497] kgd2kfd_pre_reset+0x2d/0x40
[ 476.029500] amdgpu_device_gpu_recover.cold+0x21c/0xfb5
[ 476.029502] amdgpu_job_timedout+0x122/0x140
|
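The -62 in the log lines above and the SMU response 0xffffffc2 appear to be the same value: read as a signed 32-bit integer, 0xffffffc2 is -62, and on Linux errno 62 is ETIME ("Timer expired"). A minimal Python sketch of that decoding follows; treating the response field as a negated errno is an interpretation, not something the log itself states, and the helper names are hypothetical.

```python
import errno
import os


def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def describe_errno(code: int) -> str:
    """Format a (possibly negated) errno value with its symbolic name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{code}: {name} ({os.strerror(number)})"


response = to_signed32(0xFFFFFFC2)
print(response)  # -62
print(describe_errno(response))
```

If that reading is right, the SMU did not answer the EnableAllSmuFeatures message in time, and everything after it in the log is fallout from that timeout.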
I would appreciate any tips to narrow it down further. I created a bug entry for the kernel here - i hope this is the correct place or is there a dedicated amd-driver site? Bug 209535
Full dmesg after resume: dmesg
lspci output for device information: lspci |
|
Juippisi Developer
Joined: 30 Sep 2005 Posts: 761 Location: /home
|
Posted: Wed Oct 14, 2020 1:06 pm Post subject: |
|
|
Maybe you're running a buggy kernel or firmware? Try upgrading or downgrading either or both.
I've heard that amdgpu has been buggy on 5.8, especially with the 5700 series. I also heard that the two latest firmware packages have been buggy for these cards.
Try a 5.7 or 5.9 kernel first and, if they fail, try older firmware versions. And if you build the firmware into your kernel, remember to rebuild the kernel after emerging other firmware versions. |
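Whether the rebuild reminder applies can be read from the kernel configuration: firmware that is built into the image is listed in CONFIG_EXTRA_FIRMWARE. Below is a minimal Python sketch that extracts that list from config text. The sample fragment and its file list are illustrative assumptions, not taken from this thread; on a real system the text would come from the kernel's .config (or /proc/config.gz if enabled).

```python
import re
from typing import List


def extra_firmware(config_text: str) -> List[str]:
    """Return the firmware files listed in CONFIG_EXTRA_FIRMWARE, if any."""
    match = re.search(r'^CONFIG_EXTRA_FIRMWARE="([^"]*)"', config_text, re.MULTILINE)
    return match.group(1).split() if match else []


# Illustrative config fragment; the file list is an assumption for the example.
sample_config = '''
CONFIG_DRM_AMDGPU=y
CONFIG_EXTRA_FIRMWARE="amdgpu/navi10_smc.bin amdgpu/navi10_sdma.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
'''

built_in = extra_firmware(sample_config)
if built_in:
    print("Firmware built into the kernel image; rebuild after changing linux-firmware:")
    for name in built_in:
        print(" ", name)
else:
    print("No CONFIG_EXTRA_FIRMWARE entries found.")
```

An empty list means a firmware downgrade takes effect without rebuilding the kernel, although an initramfs that carries the firmware files may still need regenerating.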
|
halcon l33t
Joined: 15 Dec 2019 Posts: 649
|
Posted: Wed Oct 14, 2020 1:59 pm Post subject: |
|
|
s|mon wrote: | I created a bug entry for the kernel here - i hope this is the correct place or is there a dedicated amd-driver site? Bug 209535 |
There is a special place for amd drm issues:
https://gitlab.freedesktop.org/drm/amd |
|
s|mon Apprentice
Joined: 04 Jul 2004 Posts: 219 Location: Bayern [de]
|
Posted: Wed Oct 14, 2020 2:19 pm Post subject: |
|
|
Good hint on the firmware (right now it is 20200918). I already tried 5.9-rc8 (vanilla) as well, but of course with the same firmware. Will try later today with 5.7 and older firmware.
And thanks for the link to the AMD-specific tracker; I was not aware that this was separated from the kernel (will search their bug tracker as well).
Would that be covering issues within kernel code or only the firmware parts? |
|
halcon l33t
Joined: 15 Dec 2019 Posts: 649
|
Posted: Wed Oct 14, 2020 3:14 pm Post subject: |
|
|
s|mon wrote: | Would that be covering issues within kernel code or only the firmware parts? |
It covers xorg-xf86-video-amdgpu and does not cover the kernel. But from a user's point of view, it's not always obvious where the problem is:
https://gitlab.freedesktop.org/drm/amd/-/issues/882#note_315176 |
|
s|mon Apprentice
Joined: 04 Jul 2004 Posts: 219 Location: Bayern [de]
|
Posted: Wed Oct 14, 2020 3:55 pm Post subject: |
|
|
I now tried with
5.6 and linux-firmware-20200721
5.7.8 and linux-firmware-20200721
5.7.8 and linux-firmware-20200817
5.8.11 and linux-firmware-20200817
5.8.11 and linux-firmware-20200918
5.9-rc8 and linux-firmware-20200918
All of them fail with hibernation.
On your last link, halcon, was this meant to be a specific related issue? It is about a throttling problem, which I don't see here. |
|
halcon l33t
Joined: 15 Dec 2019 Posts: 649
|
Posted: Wed Oct 14, 2020 4:09 pm Post subject: |
|
|
s|mon wrote: | On your last link, halcon, was this meant to be a specific related issue? It is about a throttling problem, which I don't see here. |
The throttling problem is, sure, unrelated. I just meant that it's not always obvious where to report the issue. There is the kernel part of the code and there is the xorg part of the code, and they interact. Maybe your issue has nothing to do with xorg, but I saw many issues about suspending there at freedesktop. |
|
Goverp Advocate
Joined: 07 Mar 2007 Posts: 2202
|
Posted: Thu Oct 15, 2020 9:41 am Post subject: |
|
|
With different hardware (oldish HP laptop with HD3400 graphics), I've never had hibernate working in the 3 years I've had it. _________________ Greybeard |
|
s|mon Apprentice
Joined: 04 Jul 2004 Posts: 219 Location: Bayern [de]
|
Posted: Thu Oct 15, 2020 11:41 am Post subject: |
|
|
Never working would not be nice. Especially since suspend seems to work quite fine, I'd assume it should not be that complex, but who knows.
FYI: I created an entry there as well: 1335 |
|
Goverp Advocate
Joined: 07 Mar 2007 Posts: 2202
|
Posted: Fri Oct 16, 2020 9:05 am Post subject: |
|
|
A couple of years ago I tried to nail down my hibernate problem. Searching on Google doesn't hack it for my hardware - there are loads of reports, but in essence they boil down to "the only way to hibernate is to use one of the basic screen drivers instead of amdgpu". Diagnosing the problem requires building a kernel with some extra debugging, trying to hibernate, restarting after the failure, digging an address out of the extra debug info, and mapping that to a module - tedious. The result was to blame amdgpu, which wasn't really a surprise.
Unfortunately it doesn't tell you any more, so it would be a problem for upstream. Now upstream would probably want a kernel bisect (or the equivalent for their driver) to find out which change broke hibernate. But that's only possible if it ever worked.
More to the point, AMD wrote somewhere that they don't do QA on hibernate 'cos (a) there are so many hardware configurations, and hibernate is very hardware dependent, and (b) nobody uses it anyway. The latter is, of course, putting the cart before the horse; nobody uses hibernate because it doesn't work.
I did note that the current AMD drivers complain that I should have checkpoint restart support built into my kernel; I tried adding that, but it didn't cure the problem, and I presume (as it triggered a complete kernel rebuild) it adds lots of code, so I've left it out and stuck with suspend. _________________ Greybeard |
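The diagnosis procedure described in this post resembles the kernel's PM trace facility (CONFIG_PM_TRACE_RTC, enabled at runtime through /sys/power/pm_trace), although the post does not name it, so that is an assumption. With that facility, the kernel log after the next boot contains "hash matches" lines pointing at the last trace point or device reached; note that it overwrites the RTC, so the system clock is wrong afterwards. A minimal Python sketch for pulling those lines out of dmesg output follows; the sample lines are made up for illustration and are not from this thread.

```python
from typing import Iterable, List


def find_hash_matches(dmesg_lines: Iterable[str]) -> List[str]:
    """Return the dmesg lines that report a PM trace hash match."""
    return [line.strip() for line in dmesg_lines if "hash matches" in line]


# Made-up sample input for illustration only.
sample = [
    "[    0.399457] PM:   Magic number: 0:152:163",
    "[    0.399541] PM:   hash matches drivers/base/power/main.c:1188",
    "[    0.400036] pci 0000:00:01.0: hash matches",
    "[    0.412345] some unrelated line",
]

for hit in find_hash_matches(sample):
    print(hit)
```

On a real system the input would be the output of dmesg after rebooting from the failed resume, for example read from a saved text file.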
|
s|mon Apprentice
Joined: 04 Jul 2004 Posts: 219 Location: Bayern [de]
|
Posted: Fri Oct 16, 2020 11:17 am Post subject: |
|
|
Nice background information. It doesn't raise too much hope though, as what you describe seems to go beyond my skill/time-to-spend ratio as well.
Let's hope they pick that up nevertheless. Suspend is nice but limiting. With hibernate I could e.g. keep ffmpeg conversions running, hibernate and boot another OS, come back and continue as if I hadn't left. A very special scenario, but it would work otherwise.
Anyhow, thanks for sharing those details. |
|
|