Gentoo Forums
Strange lockup [amd74xx ide resume SOLVED]
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Tue Jul 04, 2006 10:22 am    Post subject: Strange lockup [amd74xx ide resume SOLVED]

I finally got suspend to ram working on my desktop system. It's an amd64 with nForce3 MB, nvidia GF5200 with the binary driver, kernel 2.6.16-r9 (stable gentoo sources) and Xorg 7.0. I use the hibernate script with a pretty default configuration for suspending to RAM (only commented out nvidia in blacklisted-modules).
It suspends perfectly and wakes up correctly most of the time, with video, X and all that, but sometimes (not every time) it locks up hard about 3 minutes after resuming. Immediately after resuming it works normally and lets me open programs etc., and then, after a few minutes, it locks up. The display freezes (but isn't garbled in any way), the CapsLock and NumLock LEDs start flashing, the keyboard and mouse become totally unresponsive (no magic SysRq) and the HDD LED stays on (but I can't hear the disk spin). I can't ssh into the machine and it doesn't even answer pings, so it looks like a kernel-level lockup. It usually happens after an overnight suspend, not when I wake it up after only a few minutes. It may coincide with cron starting some disk activity.
The only unusual thing I found in the system logs (but only once) was this line:
Code:
Jul  4 11:40:25 [kernel] hda: dma_timer_expiry: dma status == 0x21

My primary Linux drive is on SATA (/dev/sda), hda is a PATA drive I use for windows and some secondary storage on FAT32.
I realize this isn't enough information, but I don't even know where to look. Any ideas?


Last edited by pgolik on Fri Nov 10, 2006 8:41 pm; edited 4 times in total
Raffi
l33t


Joined: 17 Mar 2003
Posts: 731
Location: Moscow, Id.

Posted: Tue Jul 04, 2006 4:53 pm

Unless I misread what you wrote, I think you answered your own question. The nvidia module was probably blacklisted for a good reason.

Is it possible for you to try a different supported card and see if it still locks up?
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Tue Jul 04, 2006 5:08 pm

Many users report that the newest binary drivers from nvidia support hibernation (I'm using the 8762 release). They were blacklisted as earlier driver releases didn't work. Meanwhile I enabled SMART checking on all my drives, but they report perfect health.
Perhaps it's the new modular Xorg that locks up like that, but it's still strange that it happens not immediately after wakeup but after a few minutes of seemingly normal operation.
Raffi
l33t


Joined: 17 Mar 2003
Posts: 731
Location: Moscow, Id.

Posted: Tue Jul 04, 2006 5:13 pm

You might try the following experiment. Reboot your machine so that X will not come up (this will also keep the nvidia stuff from loading even once). Try to hibernate and restore. Try everything except X stuff and see if you stay up. If so, start X and see if you stay up.

This way you should be able to either eliminate X as the problem or prove that it is probably X related.
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Tue Jul 04, 2006 10:20 pm

It restores correctly, X and all, in about 9 out of 10 tries. Another "interesting" development: this time it resumed, albeit slowly, and spewed lots of complaints about not being able to write files. All the filesystems were read-only after resuming! So I'm inclined to think that it's some problem with the disk subsystem. Most of the time it wakes up correctly, sometimes it brings back all filesystems as read-only, and sometimes it locks up hard.
I also found this in the logs:
Code:
Jul  4 18:42:29 [kernel] ata1: error=0x04 { DriveStatusError }
Jul  4 18:42:29 [kernel] ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Jul  4 18:42:29 [kernel] ata1: status=0x51 { DriveReady SeekComplete Error }
Jul  4 18:42:29 [kernel] ata1: error=0x04 { DriveStatusError }

These lines are repeated in the log. I'm beginning to think that SATA and suspend don't mix well on my hardware.
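For readers unfamiliar with these messages: the text in braces is just the kernel's decoding of the ATA status byte, one name per bit. A minimal Python sketch of that decoding follows. The bit names are the ones the Linux IDE/libata code prints; the table and the function name `decode_status` are illustrative, not taken from the kernel source.

```python
# ATA status register bits, named the way the kernel log prints them.
# Insertion order runs from the highest bit to the lowest.
ATA_STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}


def decode_status(value):
    """Return the names of all bits set in an ATA status byte."""
    return [name for bit, name in ATA_STATUS_BITS.items() if value & bit]


if __name__ == "__main__":
    # 0x51 = 0x40 | 0x10 | 0x01
    print(decode_status(0x51))  # ['DriveReady', 'SeekComplete', 'Error']
```

So status=0x51 means the drive answered and flagged an error, as opposed to a drive that never responds at all.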

Pawel
Raffi
l33t


Joined: 17 Mar 2003
Posts: 731
Location: Moscow, Id.

Posted: Tue Jul 04, 2006 11:16 pm

I agree with your assessment, but don't know what to suggest.
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Wed Jul 05, 2006 10:09 am

Today it happened hours after resuming, so probably it's not even related to suspend. It seems to happen when cron starts some disk-intensive task (like updatedb). And I got another
Code:
[kernel] hda: dma_timer_expiry: dma status == 0x21
message in the log seconds before the lockup. The strange thing is that I haven't changed anything in the hardware recently (disk, chipset, cables) and it's been stable for a year. I'll try recompiling the kernel. Any suggestions as to where to look? (No, I don't want to disable DMA.)
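A note on reading that message: the value after "dma status ==" is the controller's bus-master IDE status register. The sketch below decodes it in Python, assuming the standard bus-master IDE register layout; the descriptive bit names and the function name `decode_dma_status` are my own wording, not kernel identifiers.

```python
# Bus-master IDE status register bits (bits 3 and 4 are reserved).
BM_STATUS_BITS = {
    0x80: "simplex only",
    0x40: "drive 1 DMA capable",
    0x20: "drive 0 DMA capable",
    0x04: "interrupt raised",
    0x02: "DMA error",
    0x01: "DMA engine active",
}


def decode_dma_status(value):
    """Return a description of every bit set in the bus-master status byte."""
    return [name for bit, name in BM_STATUS_BITS.items() if value & bit]


if __name__ == "__main__":
    # 0x21 = 0x20 | 0x01
    print(decode_dma_status(0x21))  # ['drive 0 DMA capable', 'DMA engine active']
```

Read this way, 0x21 says the transfer was still marked active when the timer expired, with neither the error bit nor the interrupt bit set. That would fit a transfer that never completed, though it does not by itself say whether the drive or the controller is at fault.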
Raffi
l33t


Joined: 17 Mar 2003
Posts: 731
Location: Moscow, Id.

Posted: Wed Jul 05, 2006 12:25 pm

I agree you don't want to disable dma.

Are you running smartd? The drive could be failing.
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Wed Jul 05, 2006 1:10 pm

Raffi wrote:
Are you running smartd?

I am. No problems reported. Someone on the suspend2 list suggested re-applying the hdparm settings on each resume, so I added RestartServices hdparm to the hibernate script configuration. We'll see if it helps.
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Sat Jul 08, 2006 9:11 pm

I tried restarting hdparm upon resume, but it didn't help. I added the DisableWriteCacheOn option for all my drives. Perhaps the lockups are less frequent, but they still happen. Here is the log from the last one:
Code:

Jul  8 22:55:33 [kernel] ATA: abnormal status 0x80 on port 0x9F7
                - Last output repeated twice -
Jul  8 22:55:33 [kernel] Restarting tasks... done
Jul  8 22:55:33 [kernel] input: PS2++ Logitech MX Mouse as /class/input/input3
Jul  8 22:55:33 [kernel] ata1: command 0x35 timeout, stat 0x80 host_stat 0x1
Jul  8 22:55:33 [kernel] ata1: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Jul  8 22:55:33 [kernel] ata1: status=0x80 { Busy }
Jul  8 22:55:33 [kernel] sd 0:0:0:0: SCSI error: return code = 0x8000002
Jul  8 22:55:33 [kernel] sda: Current: sense key=0xb
Jul  8 22:55:33 [kernel]     ASC=0x47 ASCQ=0x0
Jul  8 22:55:33 [kernel] end_request: I/O error, dev sda, sector 498143
Jul  8 22:55:33 [kernel] Buffer I/O error on device sda2, logical block 16
Jul  8 22:55:33 [kernel] lost page write due to I/O error on sda2
Jul  8 22:55:33 [kernel] ATA: abnormal status 0x80 on port 0x9F7
                - Last output repeated 2 times -
Jul  8 22:55:33 [kernel] nv_sata: Primary device added
Jul  8 22:55:33 [kernel] nv_sata: Primary device removed
Jul  8 22:55:33 [kernel] nv_sata: Secondary device removed

So it appears that both SATA and PATA sometimes fail to resume correctly. I have no idea what it depends on, though. They resume correctly most of the time. I saw some posts on the LKML describing a similar problem, so at least I know I'm not alone.
[Edit]
It appears the problem is known to the kernel hackers on the LKML and is still present as of the 2.6.17 kernel release: if something tries to access the disk immediately upon resume, the access may time out and produce errors like the ones I observed. Several patches exist, but none is considered ready for integration into the kernel yet. One person reported that a patch from Andrew Morton's -mm kernel is supposed to solve it. I found a libata_resume_fix patch, and it applies cleanly to the current stable gentoo-sources 2.6.16-r12. I will try it for a couple of days and keep y'all updated.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 3003
Location: Bay Area, CA

Posted: Fri Jul 14, 2006 9:17 pm

Please make sure you report whether that works. I am seeing similar issues with both ATA and SATA drivers after resume.
devsk
Advocate


Joined: 24 Oct 2003
Posts: 3003
Location: Bay Area, CA

Posted: Fri Jul 14, 2006 9:19 pm

pgolik wrote:
Today it happened hours after resuming, so probably it's not even related to suspend. It seems to happen when cron starts some disk-intensive task (like updatedb). And I got another
Code:
[kernel] hda: dma_timer_expiry: dma status == 0x21
message in the log seconds before the lockup. The strange thing is that I haven't changed anything in the hardware recently (disk, chipset, cables) and it's been stable for a year. I'll try recompiling the kernel. Any suggestions as to where to look? (No, I don't want to disable DMA.)

This can be worked around by setting the correct UDMA mode just after resuming and just before mounting the device. If this is your root device, you are out of luck.
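A rough sketch of what "setting the correct UDMA mode" could look like from a resume hook, written in Python around hdparm. The device name /dev/hda and UDMA mode 5 are assumptions for illustration only, as are the helper names; check what the drive actually supports with hdparm -i first. hdparm's -X option takes 64 plus the UDMA mode number, and its man page warns that forcing a transfer mode can be dangerous, so treat this as something to test carefully rather than a known-good fix.

```python
import subprocess


def hdparm_udma_args(device, udma_mode):
    """Build an hdparm command line that enables DMA and selects a UDMA mode.

    hdparm encodes UltraDMA mode n as the -X value 64 + n.
    """
    if not 0 <= udma_mode <= 6:
        raise ValueError("UDMA mode must be between 0 and 6")
    return ["hdparm", "-d1", "-X%d" % (64 + udma_mode), device]


def reapply_udma(device, udma_mode):
    """Run hdparm (needs root); raises CalledProcessError if hdparm fails."""
    subprocess.run(hdparm_udma_args(device, udma_mode), check=True)


if __name__ == "__main__":
    # Hypothetical values: /dev/hda at UDMA mode 5. Only prints the command.
    print(" ".join(hdparm_udma_args("/dev/hda", 5)))  # hdparm -d1 -X69 /dev/hda
```

The call to reapply_udma would have to run after resume but before the filesystem on that device is mounted or touched, which is why this cannot help for a root device.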
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Fri Jul 14, 2006 9:58 pm

It works - sort of. I'm not getting the hard lockups as before, but often after resuming I can't use my DVDR drive - k3b just sits doing nothing and I get a message about DMA timeout on device in the logs. Restarting the hdparm service (which sets dma again) works, but not all the time (sometimes it locks up hard trying to re-set DMA mode).
Now I'm trying the libata_resume patch together with the ide.c patch attached to Bug 2039 in the kernel. Will report after a couple of tries - due to the somewhat random character of the lockup I cannot fully confirm that the patch worked until I've used it for at least a couple of days.
So far it seems that the libata_resume patch helped with the problem with my SATA drive (which is my root device), but the problem with the ATA devices (secondary HD and DVDR) remains.
I can confirm that neither patch had any negative effect on my system, so why not try them yourself? Two test cases are better than one.
I've also heard that there is ongoing work in the kernel community to address power management issues with (S)ATA in the 2.6.18 release.
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Sat Jul 15, 2006 10:16 am

Update: the SATA drive (which is the root device) resumes fine. All the problems I've reported are related to PATA (IDE) devices (the secondary HD and the DVDR), and these are not solved by any of the patches or solutions I've tried. The lockups are caused by cron starting updatedb on the secondary HD (I've disabled it for the time being). Restarting hdparm does reset the DMA flag on the DVDR, but it does not prevent the timeout errors. Here's the relevant log; the message about the ATAPI reset appeared after I reset DMA with hdparm.
Code:

Jul 15 00:40:17 [kernel] hdc: DMA disabled
Jul 15 00:40:17 [kernel] hdc: ide_intr: huh? expected NULL handler on exit
Jul 15 00:40:17 [kernel] hdc: ATAPI reset complete
Jul 15 00:41:53 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:41:53 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:41:53 [kernel] ide: failed opcode was: unknown
Jul 15 00:41:59 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:41:59 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:41:59 [kernel] ide: failed opcode was: unknown
Jul 15 00:42:06 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:42:06 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:42:06 [kernel] ide: failed opcode was: unknown
Jul 15 00:42:12 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:42:12 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:42:12 [kernel] ide: failed opcode was: unknown
Jul 15 00:42:12 [kernel] hdc: DMA disabled
Jul 15 00:42:12 [kernel] hdc: ide_intr: huh? expected NULL handler on exit
Jul 15 00:42:12 [kernel] hdc: ATAPI reset complete
Jul 15 00:42:12 [kernel] ISO 9660 Extensions: Microsoft Joliet Level 3
Jul 15 00:42:12 [kernel] ISO 9660 Extensions: RRIP_1991A
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8


To summarize: IDE (but not SATA) devices do not resume correctly, and none of the solutions I've found works. As my root and boot partitions are on SATA, I'll try to build the whole IDE subsystem as modules and unload it before hibernating, but that's not a solution, just a workaround (if it works at all).
pgolik
Tux's lil' helper


Joined: 24 Nov 2004
Posts: 125
Location: Warsaw, Poland

Posted: Fri Nov 10, 2006 8:44 pm

Finally there is a fix. This patch fixes the IDE resume problem on my hardware and I have a fully functional suspend to RAM. So far no negative side effects of the patch. The patch was made for 2.6.18 but it patches gentoo-sources-2.6.17 without any problems.
All times are GMT