Author |
Message |
theJackalnz n00b
Joined: 05 Oct 2005 Posts: 32
|
Posted: Sat Sep 30, 2006 10:15 am Post subject: Persistent Unexplainable hardware problem with suspicions. |
|
|
I have had enough. For the last 6 months I have had a problem I cannot make go away.
Code: |
Sep 30 21:43:10 isengard ata2: command 0x35 timeout, stat 0x50 host_stat 0x1
Sep 30 21:43:10 isengard ata2: status=0x50 { DriveReady SeekComplete }
Sep 30 21:43:10 isengard ata2: error=0x01 { AddrMarkNotFound }
Sep 30 21:43:10 isengard sdb: Current: sense key: No Sense
|
This error and friends.
You can be working away, and suddenly your box freezes solid, and you hear your hard drive clicking and spinning up/down. Generally the FS Drivers using that drive go play hide and go fork themselves, and generally the only way to continue using your box is a reboot.
Initially, I thought it was just the hard drive (a Seagate), which had screeds of error counts in its SMART tables.
So I bought _2_ replacement drives (Hitachi) AND a replacement controller JUST to be sure.
I initially had only a VIA controller and 2 SATA drives.
Code: |
via -> bootDrive [hitachi1] sda
via -> seagate sdb
|
Upon upgrade, this became my layout:
Code: |
via -> bootDrive[hitachi1] sda
via -> hitachi2 sdb
SIL -> hitachi3 sdc
|
You can imagine my horror when I discovered that even after replacing what I thought was a dodgy drive, the problem persisted. Still the same old stubborn failure of sdb.
So, determined to kill this sucker, I moved as many drives as I could to the new Silicon Image card. I cannot move the boot drive, as I can't get my BIOS (ASUS A7V8X) to boot off the add-on controller card.
Code: |
via -> bootDrive[hitachi1] sda
SIL -> hitachi2 sdb
SIL -> hitachi3 sdc
|
Now, after a while, the problem is back. Yes, again on sdb. So I swapped the order they're in to:
Code: |
via -> bootDrive[hitachi1] sda
SIL -> hitachi3 sdb
SIL -> hitachi2 sdc
|
And yet again, failures happen on sdb, and ONLY on sdb.
This sequence of events leads me to believe that it's _NOT_ a hardware issue as I previously thought. I have become CONVINCED it is indeed a software problem, but the only places I can see potential for problems are the SATA controller software and libata itself.
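The elimination reasoning above can be written down as data. This is only a minimal sketch, assuming the four layouts listed in this post are the complete history and that sdb was the failing device in each of them; the helper name `common_factors` is made up for illustration.

```python
# Each layout from the post: device name -> (controller, drive).
# Assumption: sdb was the device that failed in every one of these layouts.
layouts = [
    {"sda": ("via", "hitachi1"), "sdb": ("via", "seagate")},
    {"sda": ("via", "hitachi1"), "sdb": ("via", "hitachi2"), "sdc": ("SIL", "hitachi3")},
    {"sda": ("via", "hitachi1"), "sdb": ("SIL", "hitachi2"), "sdc": ("SIL", "hitachi3")},
    {"sda": ("via", "hitachi1"), "sdb": ("SIL", "hitachi3"), "sdc": ("SIL", "hitachi2")},
]
FAILING = "sdb"


def common_factors(layouts, failing):
    """Return the controllers and drives that sat behind `failing` in every layout."""
    controllers = set.intersection(*({layout[failing][0]} for layout in layouts))
    drives = set.intersection(*({layout[failing][1]} for layout in layouts))
    return controllers, drives


controllers, drives = common_factors(layouts, FAILING)
print("controllers common to all failures:", sorted(controllers))
print("drives common to all failures:", sorted(drives))
```

Both sets come out empty, so no single drive or controller was present for every failure. That does not prove a software fault on its own: things this table does not model (cables, power, the motherboard) stay constant across all four layouts.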
If anybody out there knows more about Linux internals and hardware, knows what the hell is going on, and knows how I can fix it, I would be most appreciative.
Thanks. |
shanew n00b
Joined: 16 Sep 2006 Posts: 34 Location: Austin, TX
|
Posted: Fri Oct 06, 2006 2:14 pm Post subject: |
|
|
I know this isn't the most helpful response, but have you tried replacing and/or swapping the SCSI cable that runs to sdb?
Also, if you think it's a software problem, it would be useful to know what kernel version you're running, what drivers are supporting the disks and controller, and whether they are compiled into the kernel or loaded as modules. |
widan Veteran
Joined: 07 Jun 2005 Posts: 1512 Location: Paris, France
|
Posted: Fri Oct 06, 2006 3:36 pm Post subject: Re: Persistent Unexplainable hardware problem with suspicion |
|
|
Code: | Sep 30 21:43:10 isengard ata2: status=0x50 { DriveReady SeekComplete }
Sep 30 21:43:10 isengard ata2: error=0x01 { AddrMarkNotFound } |
This particular error means the system asked the drive for a sector it can't find (either the sector is past the end of the disk, or the drive can't find it because of a hardware problem).
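For reference, the names in braces are just the set bits of the status and error bytes. Here is a rough sketch of that decoding. The labels for bits that do not appear in the log above are assumptions based on the classic ATA register layout, so check them against your kernel source.

```python
# Bit tables for the ATA status and error registers.
# Assumption: only DriveReady, SeekComplete and AddrMarkNotFound are taken
# from the log in this thread; the other labels follow the classic ATA layout.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]
ERROR_BITS = [
    (0x80, "BadCRC/BadSector"),
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x04, "DriveStatusError"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the labels of every bit that is set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x50, STATUS_BITS))  # status=0x50 from the log
print(decode(0x01, ERROR_BITS))   # error=0x01 from the log
```

Note that status 0x50 does not have the Error bit (0x01) set. For a command that timed out, the error byte may therefore not be meaningful. That is a hedge, not a diagnosis.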
theJackalnz wrote: | You can be working away, and suddenly your box freezes solid, and you hear your hard drive clicking and spinning up/down. |
You've confused the controller (the microcontroller on the disk, not the SATA controller). Does it do it for all the drives (when they are at /dev/sdb)?
theJackalnz wrote: | This sequence of events leads me to believe that it's _NOT_ a hardware issue as I previously thought. I have become CONVINCED it is indeed a software problem, but the only places I can see potential for problems are the SATA controller software and libata itself. |
When there are errors, is a sector number mentioned (look for LBA in the kernel messages)? If there is one, is the failing sector number greater than the number of sectors on your drive (the sector count is shown in dmesg when it detects the drives)? |
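The range check described above could be scripted along these lines. This is a sketch only: the two message shapes are assumed to resemble what 2.6-era kernels print, the sample lines use made-up numbers, and the function name `out_of_range_errors` is hypothetical. Adjust the patterns to whatever your own dmesg shows.

```python
import re

# Assumed message shapes (not taken from this thread); tweak to match your logs.
SIZE_RE = re.compile(r"SCSI device (\w+): (\d+) 512-byte hdwr sectors")
ERR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def out_of_range_errors(lines):
    """Return (device, sector) pairs whose sector is past the end of the disk."""
    sizes = {}
    bad = []
    for line in lines:
        m = SIZE_RE.search(line)
        if m:
            sizes[m.group(1)] = int(m.group(2))
            continue
        m = ERR_RE.search(line)
        if m:
            dev, sector = m.group(1), int(m.group(2))
            # Valid sectors run from 0 to count - 1.
            if dev in sizes and sector >= sizes[dev]:
                bad.append((dev, sector))
    return bad


# Made-up sample lines, only to show how the check behaves.
sample = [
    "SCSI device sdb: 1000 512-byte hdwr sectors (0 MB)",
    "end_request: I/O error, dev sdb, sector 999",
    "end_request: I/O error, dev sdb, sector 1000",
]
print(out_of_range_errors(sample))
```

An empty result would point away from the "sector past the end of the disk" explanation and toward the drive, or something in front of it, failing to find sectors that do exist.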
theJackalnz n00b
Joined: 05 Oct 2005 Posts: 32
|
Posted: Sat Oct 07, 2006 3:50 pm Post subject: |
|
|
shanew wrote: | I know this isn't the most helpful response, but have you tried replacing and/or swapping the SCSI cable that runs to sdb?
Also, if you think it's a software problem, it would be useful to know what kernel version you're running, what drivers are supporting the disks and controller, and whether they are compiled into the kernel or loaded as modules. |
Thanks for your reply. I'm pretty sure I've swapped the cables enough times that sdb has been on a different cable (sometimes I replugged at the controller card, sometimes at the drive connector, so it's probably fair to say the cables have been shuffled a bit).
I'm currently running kernel 2.6.17-gentoo-r7, but I have had this problem for more than 6 months, so I may have had it with a prior kernel too.
I'm using modules, not built-in drivers. I currently have sata_via and sata_sil controlling the works. I have other SATA drivers loaded for some reason (part of the genkernel-built kernel's initialization stuff, but it's kinda hard to boot without that).
Code: |
lsmod | grep ata
ata_piix 11332 0
sata_via 8132 4
sata_svw 7492 0
sata_sil24 11076 0
sata_sil 9352 13
sata_promise 11204 0
libata 62860 6 ata_piix,sata_via,sata_svw,sata_sil24,sata_sil,sata_promise
|
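To separate the drivers that are actually holding something from the ones genkernel merely loaded, the use-count column can be read programmatically. This is a small sketch using the output pasted above; `parse_lsmod` is a made-up helper name, and a zero use count only suggests, rather than proves, that a driver is idle.

```python
# The lsmod output from this post, copied verbatim.
LSMOD_OUTPUT = """\
ata_piix 11332 0
sata_via 8132 4
sata_svw 7492 0
sata_sil24 11076 0
sata_sil 9352 13
sata_promise 11204 0
libata 62860 6 ata_piix,sata_via,sata_svw,sata_sil24,sata_sil,sata_promise
"""


def parse_lsmod(text):
    """Map module name -> (size, use count, list of modules that depend on it)."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip a header line or anything unexpected
        used_by = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = (int(fields[1]), int(fields[2]), used_by)
    return modules


modules = parse_lsmod(LSMOD_OUTPUT)
drivers = modules["libata"][2]  # every driver that sits on top of libata
busy = [name for name in drivers if modules[name][1] > 0]
idle = [name for name in drivers if modules[name][1] == 0]
print("in use:", busy)
print("idle:", idle)
```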
|
theJackalnz n00b
Joined: 05 Oct 2005 Posts: 32
|
Posted: Sat Oct 07, 2006 4:06 pm Post subject: Re: Persistent Unexplainable hardware problem with suspicion |
|
|
widan wrote: |
This particular error means the system asked the drive for a sector it can't find (either the sector is past the end of the disk, or the drive can't find it because of a hardware problem).
You've confused the controller (the microcontroller on the disk, not the SATA controller). Does it do it for all the drives (when they are at /dev/sdb)?
|
Yeah, every last one of them.
widan wrote: |
When there are errors, is a sector number mentioned (look for LBA in the kernel messages)? If there is one, is the failing sector number greater than the number of sectors on your drive (the sector count is shown in dmesg when it detects the drives)? |
The sector always seems to be in range for my drive. Before the upgrade I collated all the faulting sectors for one drive (at one stage it was faulting every hour), and after sorting them they seemed to cluster a bit, but with no real order to it. (Well, apart from every group of faulting sectors containing only even sector numbers, except for one group which was all odd.)
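The collation described above amounts to sorting the sector numbers, splitting them into groups wherever there is a large gap, and checking each group's parity. Here is a sketch of that. The gap threshold of 1024 sectors is an arbitrary assumption, the helper names are made up, and the sample numbers are invented rather than taken from the real logs.

```python
def cluster_sectors(sectors, max_gap=1024):
    """Group sorted sector numbers; a gap larger than max_gap starts a new group."""
    groups = []
    for sector in sorted(set(sectors)):
        if groups and sector - groups[-1][-1] <= max_gap:
            groups[-1].append(sector)
        else:
            groups.append([sector])
    return groups


def parity(group):
    """Return 'even', 'odd', or 'mixed' for one group of sector numbers."""
    kinds = {sector % 2 for sector in group}
    if kinds == {0}:
        return "even"
    if kinds == {1}:
        return "odd"
    return "mixed"


# Invented sector numbers, only to show the shape of the output.
sample = [100, 102, 108, 50001, 50003, 900000, 900001]
for group in cluster_sectors(sample):
    print(group[0], group[-1], len(group), parity(group))
```

Each output line gives the first sector, last sector, size, and parity of one group, which makes a pattern like "all even except one all-odd group" easy to spot.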
On that same disk I dd'd it through pipebench to /dev/null, and at about 80% (with the drive unmounted) the aforementioned blight showed up again. The dd just hard-locked and I had to reboot to get it free again. Yet when I booted from a SpinRite CD, the problem did not happen: I did a full disk test with no problem (well, other than SpinRite being too blonde to realise SATA supports SMART).
Since then I have managed to reduce how often the problem pops up by passing the "irqpoll" option to my kernel. It is no longer so prolific, but it is still killing my uptimes. The option has also had the side effect of spamming my logs with messages about my CD-ROM drives getting confused, but they don't seem to be a problem. |
nklb n00b
Joined: 27 Jun 2002 Posts: 42
|
Posted: Sat Oct 21, 2006 12:35 am Post subject: |
|
|
I hope a solution is found for this. I am now having the same problem on a similar motherboard, an ASUS M2N-SLI Deluxe. |
theJackalnz n00b
Joined: 05 Oct 2005 Posts: 32
|
Posted: Sat Oct 21, 2006 1:04 am Post subject: |
|
|
I feel there's still a remote chance that it's indeed something wrong with the motherboard, possibly the DMA controller :/
I've had problems with it before: I had to disable APIC support so XComposite wouldn't hard-lock my box :/ |