Gentoo Forums
resume fails after hibernate amdgpu (black screen - rx 5700)
s|mon
Apprentice

Joined: 04 Jul 2004
Posts: 219
Location: Bayern [de]

Posted: Wed Oct 14, 2020 7:55 am    Post subject: resume fails after hibernate amdgpu (black screen - rx 5700)

I'm having trouble with hibernation on my Radeon RX 5700.

The system goes into hibernation (on a 5.8 kernel the screen goes black, then comes back shortly before the actual hibernation). When resume is triggered with the power button, the screen stays visible while the image is loading, but once loading completes I get a black screen and an unresponsive system. If my interpretation of the dmesg output is correct, the problem seems to be related to amdgpu (Radeon RX 5700, Navi 10).

SSH to the machine is still possible.
Suspend to memory has worked so far, as have the test levels freezer, devices, platform, processors and core, written to /sys/power/pm_test and run with the disk state, according to:
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
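
The test cycles mentioned above can be sketched as a small dry-run script. This is only an illustration of the procedure from the linked document, not the exact commands used in this thread: the helper names `pm_test_plan` and `apply_plan` are made up for the example, and it assumes a kernel built with CONFIG_PM_DEBUG so that /sys/power/pm_test exists.

```python
"""Dry-run sketch of the pm_test cycles from basic-pm-debugging.txt."""

# Test levels named in the post, ordered from the shallowest test
# (freezing processes only) to the deepest one.
PM_TEST_LEVELS = ["freezer", "devices", "platform", "processors", "core"]


def pm_test_plan(level):
    """Return the sysfs writes for one test cycle as (path, value) pairs."""
    if level not in PM_TEST_LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    return [
        ("/sys/power/pm_test", level),   # select how far the cycle goes
        ("/sys/power/state", "disk"),    # start the (test) hibernation cycle
        ("/sys/power/pm_test", "none"),  # restore normal behaviour afterwards
    ]


def apply_plan(plan):
    """Perform the writes for real; this needs root privileges."""
    for path, value in plan:
        with open(path, "w") as handle:
            handle.write(value)


if __name__ == "__main__":
    # Dry run only: print the equivalent shell commands, write nothing.
    for level in PM_TEST_LEVELS:
        for path, value in pm_test_plan(level):
            print(f"echo {value} > {path}")
```

If every level passes, as reported here, the failure most likely sits in the part that pm_test does not exercise: restoring the devices from a real image after a full power-off.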

I updated to a newer BIOS, which did not change the behavior.

Hardware is an MSI X570 board, Ryzen 3700X, 32 GB RAM and said Radeon RX 5700.
Further details are in dmesg and lspci.

Quote:
[ 452.843888] amdgpu 0000:2f:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 452.849889] amdgpu: SMU is resuming...
[ 455.353688] amdgpu: failed send message: EnableAllSmuFeatures (6) param: 0x00000000 response 0xffffffc2
[ 455.353693] [drm:amdgpu_device_ip_resume_phase2] *ERROR* resume of IP block <smu> failed -62
[ 455.353695] [drm:amdgpu_device_resume] *ERROR* amdgpu_device_ip_resume failed (-62).
[ 455.353699] PM: dpm_run_callback(): pci_pm_restore+0x0/0xb0 returns -62
[ 455.353730] PM: Device 0000:2f:00.0 failed to restore: error -62
...
[ 472.029420] [drm:amdgpu_job_timedout] *ERROR* ring sdma0 timeout, signaled seq=5251, emitted seq=5253
[ 472.029423] [drm:amdgpu_job_timedout] *ERROR* Process information: process pid 0 thread pid 0
[ 472.029428] amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
[ 476.029446] amdgpu 0000:2f:00.0: amdgpu: failed to suspend display audio
[ 476.029454] BUG: unable to handle page fault for address: fffff26600000017
[ 476.029457] #PF: supervisor read access in kernel mode
[ 476.029458] #PF: error_code(0x0000) - not-present page
[ 476.029459] PGD 0 P4D 0
[ 476.029462] Oops: 0000 [#1] SMP NOPTI
[ 476.029465] CPU: 11 PID: 254 Comm: kworker/11:1 Tainted: G W T 5.8.11-gentoo #1
[ 476.029466] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 ACE (MS-7C35), BIOS 1.C3 09/29/2020
[ 476.029470] Workqueue: events drm_sched_job_timedout
[ 476.029473] RIP: 0010:free_mqd_hiq_sdma+0x4/0x20
[ 476.029475] Code: 01 d1 48 89 48 18 48 8b 8e 08 02 00 00 48 01 d1 48 89 48 08 48 03 96 10 02 00 00 48 89 50 10 5b 5d c3 0f 1f 40 00 55 48 89 d5 <48> 83 7d 18 00 74 09 48 89 ef 5d e9 1c 1c 90 ff 0f 0b 48 89 ef 5d
[ 476.029476] RSP: 0018:ffffc90000807d08 EFLAGS: 00010202
[ 476.029478] RAX: ffff8887e9553600 RBX: 0000000000000001 RCX: 0000000080800075
[ 476.029479] RDX: fffff265ffffffff RSI: ffffffff00000000 RDI: ffff8888084f2c80
[ 476.029480] RBP: fffff265ffffffff R08: 0000000000000001 R09: ffffffff819e3500
[ 476.029481] R10: ffff8887c74b2020 R11: 0000000000000001 R12: ffff88880863c528
[ 476.029482] R13: ffff888808de0000 R14: 0000000009336700 R15: ffff88880aad40b0
[ 476.029484] FS: 0000000000000000(0000) GS:ffff88880eec0000(0000) knlGS:0000000000000000
[ 476.029485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 476.029486] CR2: fffff26600000017 CR3: 00000007fc26c000 CR4: 0000000000340ea0
[ 476.029487] Call Trace:
[ 476.029490] kernel_queue_uninit+0x2b/0xd4
[ 476.029492] stop_cpsch+0xa0/0xd0
[ 476.029495] kgd2kfd_suspend.part.0+0x34/0x50
[ 476.029497] kgd2kfd_pre_reset+0x2d/0x40
[ 476.029500] amdgpu_device_gpu_recover.cold+0x21c/0xfb5
[ 476.029502] amdgpu_job_timedout+0x122/0x140
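
As a reading aid for the log above: the recurring -62 is a negative errno value, and on Linux errno 62 is ETIME (timer expired). The SMU response 0xffffffc2, read as a signed 32-bit number, is the same -62, which suggests the SMU message simply timed out. A minimal sketch of that decoding follows; the function names and the regular expression are assumptions made for this example, not part of any amdgpu tooling.

```python
import errno
import re

# Matches the tails seen in the log: "returns -62", "error -62",
# "failed -62" and "failed (-62)".
ERRNO_RE = re.compile(r"(?:returns|error|failed) \(?(-\d+)\)?")


def to_signed32(value):
    """Interpret an unsigned 32-bit value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def find_errno(line):
    """Return the negative return code in a dmesg line, or None."""
    match = ERRNO_RE.search(line)
    return int(match.group(1)) if match else None


def errno_name(code):
    """Symbolic name of a negative return code on the running platform."""
    return errno.errorcode.get(-code, "unknown")


if __name__ == "__main__":
    line = "PM: Device 0000:2f:00.0 failed to restore: error -62"
    code = find_errno(line)
    print(code, errno_name(code))
    print(to_signed32(0xFFFFFFC2))
```

The symbolic name is looked up on the machine running the script, so it only matches the kernel's meaning when that machine is also running Linux.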


I would appreciate any tips for narrowing this down further. I created a bug entry for the kernel here; I hope this is the correct place, or is there a dedicated AMD driver site? Bug 209535

Full dmesg after resume: dmesg
lspci output for device information: lspci

Juippisi
Developer

Joined: 30 Sep 2005
Posts: 761
Location: /home

Posted: Wed Oct 14, 2020 1:06 pm

Maybe you're running a buggy kernel or firmware? Try upgrading or downgrading either or both.

I've heard that amdgpu has been buggy on 5.8, especially with the 5700 series. I've also heard that the two latest firmware packages have been buggy for these cards.
Try a 5.7 or 5.9 kernel first and, if they fail, try older firmware versions. And if you build the firmware into your kernel, remember to rebuild the kernel after emerging other firmware versions.
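
Whether a rebuild is needed can be checked from the kernel configuration: firmware built into the kernel image is listed in CONFIG_EXTRA_FIRMWARE. A minimal sketch, assuming the configuration is available as text (for example from /usr/src/linux/.config); the function names and the blob names in the sample string are illustrative only.

```python
import re

# Illustrative sample only; a real .config will list different blobs.
SAMPLE_CONFIG = '''
CONFIG_FW_LOADER=y
CONFIG_EXTRA_FIRMWARE="amdgpu/navi10_smc.bin amdgpu/navi10_sdma.bin regulatory.db"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
'''


def builtin_firmware(config_text):
    """List the blobs named in CONFIG_EXTRA_FIRMWARE (empty list if unset)."""
    match = re.search(
        r'^CONFIG_EXTRA_FIRMWARE="([^"]*)"', config_text, re.MULTILINE
    )
    return match.group(1).split() if match else []


def amdgpu_builtin(config_text):
    """Return only the amdgpu blobs that are compiled into the kernel."""
    return [b for b in builtin_firmware(config_text) if b.startswith("amdgpu/")]


if __name__ == "__main__":
    blobs = amdgpu_builtin(SAMPLE_CONFIG)
    if blobs:
        print("rebuild the kernel after changing linux-firmware:", blobs)
    else:
        print("amdgpu firmware is loaded from /lib/firmware at runtime")
```

If the list is empty, the firmware is read from /lib/firmware when the driver loads, and a new linux-firmware version takes effect without a kernel rebuild (an initramfs may still need regenerating).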

halcon
l33t

Joined: 15 Dec 2019
Posts: 649

Posted: Wed Oct 14, 2020 1:59 pm

s|mon wrote:
I created a bug entry for the kernel here; I hope this is the correct place, or is there a dedicated AMD driver site? Bug 209535

There is a dedicated place for AMD DRM issues:

https://gitlab.freedesktop.org/drm/amd

s|mon
Apprentice

Joined: 04 Jul 2004
Posts: 219
Location: Bayern [de]

Posted: Wed Oct 14, 2020 2:19 pm

Good hint about the firmware (right now it is 20200918). I already tried 5.9-rc8 (vanilla) as well, but of course with the same firmware. I will try 5.7 and older firmware later today.
And thanks for the link to the AMD-specific tracker; I was not aware that this was separated from the kernel (I will search their bug tracker as well).
Would that cover issues within kernel code, or only the firmware parts?

halcon
l33t

Joined: 15 Dec 2019
Posts: 649

Posted: Wed Oct 14, 2020 3:14 pm

s|mon wrote:
Would that cover issues within kernel code, or only the firmware parts?

It covers xorg-xf86-video-amdgpu and does not cover the kernel. But from a user's point of view, it's not always obvious where the problem is:
https://gitlab.freedesktop.org/drm/amd/-/issues/882#note_315176

s|mon
Apprentice

Joined: 04 Jul 2004
Posts: 219
Location: Bayern [de]

Posted: Wed Oct 14, 2020 3:55 pm

I have now tried:
5.6 and linux-firmware-20200721
5.7.8 and linux-firmware-20200721
5.7.8 and linux-firmware-20200817
5.8.11 and linux-firmware-20200817
5.8.11 and linux-firmware-20200918
5.9-rc8 and linux-firmware-20200918
All of them fail at hibernation.

About your last link, halcon: was it meant to point to a specific related issue? It is about a throttling problem, which I don't see here.

halcon
l33t

Joined: 15 Dec 2019
Posts: 649

Posted: Wed Oct 14, 2020 4:09 pm

s|mon wrote:
About your last link, halcon: was it meant to point to a specific related issue? It is about a throttling problem, which I don't see here.

The throttling problem is certainly unrelated. I just meant that it's not always obvious where to report an issue. There is the kernel part of the code and there is the Xorg part, and they interact. Maybe your issue has nothing to do with Xorg. But I have seen many issues about suspend over at freedesktop.

Goverp
Advocate

Joined: 07 Mar 2007
Posts: 2202

Posted: Thu Oct 15, 2020 9:41 am

With different hardware (an oldish HP laptop with HD3400 graphics), I've never had hibernate working in the 3 years I've had it.
_________________
Greybeard

s|mon
Apprentice

Joined: 04 Jul 2004
Posts: 219
Location: Bayern [de]

Posted: Thu Oct 15, 2020 11:41 am

Never working would not be nice. Since suspend seems to work quite well, I'd assume hibernate should not be that complex, but who knows.
FYI: I created an entry there as well: 1335

Goverp
Advocate

Joined: 07 Mar 2007
Posts: 2202

Posted: Fri Oct 16, 2020 9:05 am

A couple of years ago I tried to nail down my hibernate problem. Searching on Google doesn't hack it for my hardware: there are loads of reports, but in essence they boil down to "the only way to hibernate is to use one of the basic screen drivers instead of amdgpu". Diagnosing the problem requires building a kernel with some extra debugging, trying to hibernate, restarting after the failure, digging an address out of the extra debug info, and mapping that to a module, which is tedious. What it did was blame amdgpu, which wasn't really a surprise.

Unfortunately it doesn't tell you any more than that, so it would be a problem for upstream. Upstream would probably want a kernel bisect (or the equivalent for their driver) to find out which change broke hibernate. But that's only possible if it ever worked :-(
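
Why a bisect needs a version that once worked can be shown with a small sketch of the search itself. This illustrates the idea behind `git bisect`, not its implementation; `first_bad` and the `is_bad` callback are names made up for the example, with `is_bad` standing in for "build this version, hibernate, and see whether resume fails".

```python
def first_bad(versions, is_bad):
    """Binary-search an ordered history for the first failing version.

    The search only works when the oldest entry is known to be good and
    the newest entry is known to be bad.
    """
    if is_bad(versions[0]):
        raise ValueError("no known-good starting point, nothing to bisect")
    if not is_bad(versions[-1]):
        raise ValueError("newest version is not bad, nothing to find")
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid    # the breaking change is at mid or earlier
        else:
            good = mid   # the breaking change is after mid
    return versions[bad]


if __name__ == "__main__":
    # The kernels tried in this thread all fail, so there is no good end.
    tested = ["5.6", "5.7.8", "5.8.11", "5.9-rc8"]
    try:
        first_bad(tested, lambda version: True)
    except ValueError as error:
        print(error)
```

With the kernels tested so far in this thread, every version is bad, so the search has no starting point until an older kernel that resumes correctly on this hardware is found.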

More to the point, AMD wrote somewhere that they don't do QA on hibernate 'cos (a) there are so many hardware configurations, and hibernate is very hardware dependent, and (b) nobody uses it anyway. The latter is, of course, putting the cart before the horse; nobody uses hibernate because it doesn't work.

I did note that the current AMD drivers complain that I should have checkpoint restart support built into my kernel. I tried adding that, but it didn't cure the problem, and I presume (as it triggered a complete kernel rebuild) that it adds lots of code, so I've left it out and stuck with suspend.
_________________
Greybeard

s|mon
Apprentice

Joined: 04 Jul 2004
Posts: 219
Location: Bayern [de]

Posted: Fri Oct 16, 2020 11:17 am

Nice background information. It doesn't raise much hope, though, as what you describe seems to go beyond my skill/time-to-spend ratio as well.
Let's hope they pick it up nevertheless. Suspend is nice but limiting. With hibernate I could, for example, keep ffmpeg conversions running, hibernate and boot another OS, then come back and continue as if I had never left. A very special scenario, but one that would work otherwise.
Anyhow, thanks for sharing those details.