Author |
Message |
VoidMage Watchman
Joined: 14 Oct 2006 Posts: 6196
|
Posted: Mon May 29, 2023 4:03 am Post subject: Help me diagnose/fix(?) a problem with a hdd |
|
|
Well, it's been a while... (and I'm likely not back for long)
Anyway...
Relevant strings:
Code: | 12:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
31:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 61)
WD40EFPX
|
The disk mentioned above keeps disconnecting after a random period of time, with little to no disk activity in the meantime (that is, it's not a disconnection under heavy workload).
It *usually* reconnects shortly after, but the system treats it as a partial unmount (partial because, to remount it, I need to find and terminate any programs that were using files on it).
What's worse, sometimes the controller gets stuck in command recovery (/sys/class/scsi_host/host1/state reads recovery).
/sys/class/scsi_host/host1/link_power_management_policy is max_performance.
Also, turning the machine off and on is not sufficient: sometimes that recovery loop persists to the point of affecting reboot/UEFI.
On one hand, both of the controllers seem affected; on the other, both are onboard (it's a B450-A PRO).
Also, I've already checked the SATA data cable, and that made no difference.
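For anyone wanting to log the two sysfs attributes mentioned above while waiting for the next dropout, here is a minimal Python sketch. The function name `read_host_status` and the `sysfs_root` parameter are made up for illustration; only the attribute names `state` and `link_power_management_policy` come from the post.

```python
from pathlib import Path


def read_host_status(host, sysfs_root="/sys/class/scsi_host"):
    """Return the 'state' and LPM policy of one SCSI host as a dict.

    `host` is a name such as "host1". Missing or unreadable attributes
    are reported as None instead of raising.
    """
    base = Path(sysfs_root) / host
    status = {}
    for attr in ("state", "link_power_management_policy"):
        try:
            status[attr] = (base / attr).read_text().strip()
        except OSError:
            # Attribute absent (e.g. not an AHCI host) or not readable.
            status[attr] = None
    return status


if __name__ == "__main__":
    # "host1" is the host named in the post; adjust for your system.
    print(read_host_status("host1"))
```

Calling it periodically (from cron or a loop) would show whether the host is sitting in `recovery` at the moment the filesystem errors appear.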
A dmesg example of such a failure:
Code: | [28may 18:46] EXT4-fs warning (device sdd2): htree_dirblock_to_tree:1072: inode #2: lblock 0: comm mc: error -5 reading directory block
[ +0,000021] EXT4-fs warning (device sdd2): htree_dirblock_to_tree:1072: inode #2: lblock 0: comm mc: error -5 reading directory block
[ +2,512337] EXT4-fs (sdd2): unmounting filesystem.
[ +1,541277] ata2.00: exception Emask 0x50 SAct 0x4 SErr 0x4070802 action 0xe frozen
[ +0,000009] ata2.00: irq_stat 0x00400000, PHY RDY changed
[ +0,000003] ata2: SError: { RecovComm HostInt PHYRdyChg PHYInt CommWake DevExch }
[ +0,000007] ata2.00: failed command: READ FPDMA QUEUED
[ +0,000002] ata2.00: cmd 60/08:10:00:28:06/00:00:00:00:00/40 tag 2 ncq dma 4096 in
res 40/00:10:00:28:06/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
[ +0,000011] ata2.00: status: { DRDY }
[ +0,000008] ata2: hard resetting link
[ +1,483941] ata2: SATA link down (SStatus 0 SControl 310)
|
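As a cross-check on the log above, the `SErr` hex value can be decoded by hand. The sketch below is a rough Python illustration; the bit-to-name table is reproduced from memory of libata's SError flags, so treat it as an assumption and verify it against the kernel source before relying on it. `SERR_BITS` and `decode_serror` are hypothetical names.

```python
# Bit position -> short flag name, as printed by the kernel in
# "ataN: SError: { ... }" lines (assumed mapping, see note above).
SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value):
    """Return the names of all flags set in an SError register value."""
    return [name for bit, name in sorted(SERR_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # Value taken from the dmesg excerpt in this post.
    print(decode_serror(0x4070802))
```

For `0x4070802` this gives the same six flags the kernel printed (RecovComm, HostInt, PHYRdyChg, PHYInt, CommWake, DevExch), so the table at least agrees with this log. All of the set flags are PHY/link-level events, and none of them is a CRC or data error.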
I'm not sure what could be the problem: the disk is only a few months old, and the other disks (1 HDD, 1 SSD) work fine...
What do you think: disk, controller, PSU...?
(Disabling NCQ doesn't help; tested by setting /sys/block/sdd/device/queue_depth to 1.) |
|
jpsollie Guru
Joined: 17 Aug 2013 Posts: 323
|
Posted: Mon May 29, 2023 5:01 am Post subject: |
|
|
What do short smartctl tests say about it?
Are there any HDD firmware updates available?
If not, try playing around with the features enabled in hdparm; disabling things like APM often helps. _________________ The power of Gentoo optimization (not overclocked): [img]https://www.passmark.com/baselines/V10/images/503714802842.png[/img] |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54850 Location: 56N 3W
|
Posted: Mon May 29, 2023 7:53 am Post subject: |
|
|
VoidMage,
Run smartctl -x on the drive and save the result.
Run the long test with smartctl. That's the disk doing a surface scan, all on its own.
Run smartctl -x again after the test completes. It takes some hours.
Post both smartctl -x outputs. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
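The before/after SMART procedure described in this post could be scripted roughly as follows. This is only a sketch: `smart_commands`, `save_snapshot`, the output file names and the `/dev/sdd` device path are illustrative assumptions (the device name is guessed from the dmesg output earlier in the thread). The only smartctl options used are `-x` and `-t long`, as named in the thread.

```python
import subprocess


def smart_commands(device):
    """Build the smartctl invocations for the before/after procedure."""
    return {
        "snapshot": ["smartctl", "-x", device],           # full SMART report
        "long_test": ["smartctl", "-t", "long", device],  # start extended self-test
    }


def save_snapshot(device, out_path, runner=subprocess.run):
    """Write the output of `smartctl -x` for `device` to `out_path`."""
    result = runner(smart_commands(device)["snapshot"],
                    capture_output=True, text=True)
    # smartctl's exit status is a bit mask, so a non-zero code does not
    # always mean the command failed to run; keep the output regardless.
    with open(out_path, "w") as fh:
        fh.write(result.stdout)


# Intended order (as root; "/dev/sdd" is an assumption, see above):
#   save_snapshot("/dev/sdd", "smart-before.txt")
#   subprocess.run(smart_commands("/dev/sdd")["long_test"])
#   ... wait some hours until the self-test log shows completion ...
#   save_snapshot("/dev/sdd", "smart-after.txt")
```

The long test runs inside the drive, so the script does not wait for it; the second snapshot has to be taken once the self-test log in `smartctl -x` shows the test as finished. If the link drops mid-test, the test may be reported as interrupted.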
|
VoidMage Watchman
Joined: 14 Oct 2006 Posts: 6196
|
Posted: Mon May 29, 2023 4:56 pm Post subject: |
|
|
OK, this is before and after (unless I've messed it up... or those hardware resets interrupted the test). (That was with '-t long'; if that was meant to be '-t offline', well, that wasn't clear.)
Not sure if that's a decent pastebin, but it seems to work.
...On a somewhat related note, is there a simple way to have the controller hard reset the command queue when it enters such a recovery loop?
More specifically, just a single scsi_host, so that the reset doesn't affect other disks...
...OK, I'm likely not using the proper terms in the line above; what I want is to issue a command that would make the controller stop trying to reconnect and instead act as if it had been turned off and on again (but still just for that single SCSI host, not all).
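One per-host mechanism that may be worth trying is the generic SCSI sysfs interface: drop the disk via its `delete` attribute, then ask only that host to rescan via its `scan` attribute. The Python sketch below assumes the disk is `sdd` on `host1`, as in the paths quoted earlier in the thread; `detach_and_rescan` is a hypothetical helper name. Whether this actually breaks a libata recovery loop is untested here, so take it as something to experiment with rather than a known fix.

```python
from pathlib import Path


def detach_and_rescan(disk, host, sysfs_root="/sys"):
    """Remove one disk from the SCSI layer, then rescan only its host.

    Must run as root, and the filesystem on the disk should be
    unmounted first. `disk` is e.g. "sdd", `host` is e.g. "host1".
    """
    root = Path(sysfs_root)
    # Same effect as: echo 1 > /sys/block/<disk>/device/delete
    (root / "block" / disk / "device" / "delete").write_text("1")
    # Same effect as: echo "- - -" > /sys/class/scsi_host/<host>/scan
    # The three dashes are wildcards for channel, target and LUN.
    (root / "class" / "scsi_host" / host / "scan").write_text("- - -")
```

Because both writes are scoped to a single disk and a single host, disks on other hosts are left alone. The device may come back under a different name (for example, not `sdd`) after the rescan.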
Edit: so, it gets more dicey. As I've said, if the recovery loop is entered, it persists through a reboot/power cycle (and that includes the power switch on the PSU). Switching to a different SATA port doesn't reset the cycle. Yet, if it gets turned off for long enough, it's (for the time being) still able to *eventually* recover. Pretty much the only thing I didn't change is the PSU cable (simply because all the others are in use).
If that's the disk after all, I'd say that's odd...
Could it be that what's failing is the disk's electronics, not the mechanical components? |
|
|
|
|
|