Strange lockup [amd74xx ide resume SOLVED]

pgolik · Last edited by pgolik on Fri Nov 10, 2006 8:41 pm; edited 4 times in total

I finally got suspend to ram working on my desktop system. It's an amd64 with nForce3 MB, nvidia GF5200 with the binary driver, kernel 2.6.16-r9 (stable gentoo sources) and Xorg 7.0. I use the hibernate script with a pretty default configuration for suspending to RAM (only commented out nvidia in blacklisted-modules).
It suspends perfectly and wakes up correctly most of the time, with video, X and all that, but sometimes (not every time) it locks up hard about 3 minutes after resuming. Immediately after resuming it works normally, lets me open programs etc., and then, after a few minutes, it locks up. The display freezes (but isn't garbled in any way), the CapsLock and NumLock LEDs start flashing, the keyboard and mouse of course become totally unresponsive (no magic SysRq) and the HDD LED stays on (but I can't hear the disk spin). I can't ssh into the machine and it doesn't even answer pings, so it looks like a kernel-level lockup. It usually happens after an overnight suspend, not when I wake it up after only a few minutes. It may coincide with cron starting some disk activity.
The only unusual thing I found in the system logs (but only once) was this line:

Raffi · l33t Joined: 17 Mar 2003 Posts: 731 Location: Moscow, Id.

Unless I misread what you wrote, I think you answered your own question. The nvidia module was probably blacklisted for a good reason.

Is it possible for you to try a different supported card and see if it still locks up?

pgolik · Posted: Tue Jul 04, 2006 5:08 pm

Many users report that the newest binary drivers from nvidia support hibernation (I'm using the 8762 release). They were blacklisted as earlier driver releases didn't work. Meanwhile I enabled SMART checking on all my drives, but they report perfect health.
Perhaps it's the new modular Xorg that locks up like that, but it's still strange that it happens not immediately after wakeup, but after a few minutes of seemingly normal functioning.

Raffi

You might try the following experiment. Reboot your machine so that X will not come up (this will also keep the nvidia stuff from loading even once). Try to hibernate and restore. Try everything except X stuff and see if you stay up. If so, start X and see if you stay up.

This way you should be able to either eliminate X as the problem or prove that it is probably X related.

pgolik · Posted: Tue Jul 04, 2006 10:20 pm

It restores correctly, X and all, in about 9 out of 10 tries. Another "interesting" development: this time it resumed, albeit slowly, and spewed lots of complaints about not being able to write files. All the filesystems were read-only after resuming! So I'm inclined to think that it's some problem with the disk subsystem. Most of the time it wakes up correctly, sometimes it brings back all filesystems as read-only, and sometimes it locks up hard.
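For anyone who wants to check for the read-only condition right after a resume, here is a minimal Python sketch. It assumes the standard `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options); the function names `read_only_mounts` and `check_system` are made up for illustration and are not part of the hibernate script.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Options are comma-separated; match 'ro' exactly, not 'errors=remount-ro'.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


def check_system(path="/proc/mounts"):
    """Hypothetical helper: read the live mount table and report read-only mounts."""
    with open(path) as f:
        return read_only_mounts(f.read())
```

Calling `check_system()` from a post-resume hook would return a list such as the root mount point if it came back read-only. Some pseudo-filesystems may legitimately be mounted read-only, so the result needs to be compared against the normal state of the machine.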
I also found this in the logs:

Raffi

I agree with your assessment, but don't know what to suggest.

pgolik · Posted: Wed Jul 05, 2006 10:09 am

Today it happened hours after resuming, so probably it's not even related to suspend. It seems to happen when cron starts some disk-intensive task (like updatedb). And I got another

Raffi

I agree you don't want to disable dma.

Are you running smartd? The drive could be failing.

pgolik · Posted: Wed Jul 05, 2006 1:10 pm

pgolik · Posted: Sat Jul 08, 2006 9:11 pm

I tried restarting hdparm upon resume; it didn't help. I added the DisableWriteCacheOn option for all my drives. Perhaps the lockups are less frequent, but they still happen. Here is a log from the last one:

devsk · Posted: Fri Jul 14, 2006 9:17 pm

Please make sure you report whether that works. I am seeing similar issues with both ATA and SATA drivers after resume.

devsk · Posted: Fri Jul 14, 2006 9:19 pm

pgolik · Posted: Fri Jul 14, 2006 9:58 pm

It works - sort of. I'm not getting the hard lockups as before, but often after resuming I can't use my DVDR drive - k3b just sits doing nothing and I get a message about DMA timeout on device in the logs. Restarting the hdparm service (which sets dma again) works, but not all the time (sometimes it locks up hard trying to re-set DMA mode).
Now I'm trying the libata_resume patch together with the ide.c patch attached to Bug 2039 in the kernel. Will report after a couple of tries - due to the somewhat random character of the lockup I cannot fully confirm that the patch worked until I've used it for at least a couple of days.
So far it seems that the libata_resume patch helped with the problem with my SATA drive (which is my root device), but the problem with the ATA devices (secondary HD and DVDR) remains.
I can confirm that neither patch had any negative effect on my system, so why not try them yourself - two testcases are better than one.
I've also heard that there is ongoing work in the kernel community to address power management issues with (S)ATA in the 2.6.18 release.
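To see whether a drive still has DMA enabled after a resume, without re-setting it blindly, something like the following Python sketch could be used. It assumes that `hdparm -d <device>` prints a line of the form `using_dma = 1 (on)`; the function names `dma_enabled` and `query_dma` and the device path are placeholders chosen for illustration, not anything taken from the hibernate script.

```python
import re
import subprocess


def dma_enabled(hdparm_output):
    """Parse `hdparm -d` output.

    Returns True or False for the using_dma flag, or None if the
    expected 'using_dma = N' line is not present (assumed format).
    """
    match = re.search(r"using_dma\s*=\s*(\d)", hdparm_output)
    if match is None:
        return None
    return match.group(1) == "1"


def query_dma(device):
    """Hypothetical helper: run `hdparm -d` on a device (usually needs root)."""
    proc = subprocess.run(
        ["hdparm", "-d", device],
        capture_output=True,
        text=True,
    )
    return dma_enabled(proc.stdout)
```

For example, `query_dma("/dev/hdc")` (device name assumed) would return `False` if the DVDR came back from suspend with DMA switched off. Note that this only reads the flag; as described above, a drive can report DMA as on and still hit timeouts.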

pgolik · Posted: Sat Jul 15, 2006 10:16 am

Update: the SATA drive (which is the root device) resumes fine. All the problems I've reported are related to PATA (IDE) devices (secondary HD and DVDR), and these are not solved by any of the patches or solutions I've tried. The lockups are caused by cron starting updatedb on the secondary HD (I've disabled it for the time being). Restarting hdparm does reset the DMA flag on the DVDR, but it does not prevent the timeout errors. Here's a relevant log; the message about the ATAPI reset appeared after I reset DMA with hdparm.

pgolik · Posted: Fri Nov 10, 2006 8:44 pm

Finally there is a fix. This patch fixes the IDE resume problem on my hardware, and I now have a fully functional suspend to RAM. So far I've seen no negative side effects. The patch was made for 2.6.18, but it applies to gentoo-sources-2.6.17 without any problems.