Gentoo Forums
Help me diagnose/fix(?) a problem with a hdd
VoidMage
Watchman


Joined: 14 Oct 2006
Posts: 6196

Posted: Mon May 29, 2023 4:03 am    Post subject: Help me diagnose/fix(?) a problem with a hdd

Well, it's been a while... (and I'm likely not back for long)

Anyway...

Relevant strings:

Code:
12:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
31:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 61)
WD40EFPX


The disk mentioned above keeps disconnecting after random periods of time, with little to no disk activity in the meantime (that is, it's not disconnecting under heavy workload).
It *usually* reconnects shortly after, but the system treats it as a partial unmount (partial because, to remount it, I need to find and terminate any programs that were using files on it).
What's worse, sometimes the controller gets stuck in command recovery (/sys/class/scsi_host/host1/state reads "recovery").
/sys/class/scsi_host/host1/link_power_management_policy is max_performance.
Also, turning the machine off and on is not sufficient: sometimes the recovery loop persists to the point of affecting reboot/UEFI.
On one hand, both controllers seem affected; on the other, both are onboard (it's a B450-A PRO).
Also, I've already checked the SATA data cable; it made no difference.
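
For reference, the two sysfs attributes mentioned above can be read for every SCSI host in one pass, which makes it easier to see whether only host1 is stuck. This is a minimal Python sketch, not a tested tool: the function name `scsi_host_status` and its `sys_root` parameter are made up for illustration, and hosts that are not AHCI ports may simply lack the `link_power_management_policy` attribute.

```python
from pathlib import Path


def scsi_host_status(sys_root="/sys/class/scsi_host"):
    """Return {host_name: {"state": ..., "lpm": ...}} for every SCSI host found."""
    report = {}
    for host in sorted(Path(sys_root).glob("host*")):
        entry = {}
        for key, attr in (("state", "state"),
                          ("lpm", "link_power_management_policy")):
            try:
                entry[key] = (host / attr).read_text().strip()
            except OSError:
                # Attribute is absent or unreadable on this host.
                entry[key] = None
        report[host.name] = entry
    return report


# Example use:
#   for name, info in scsi_host_status().items():
#       print(name, info["state"], info["lpm"])
```

A host stuck in error handling should show up with a state of "recovery", as described above for host1.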

A dmesg example of such failure:

Code:
[28may 18:46] EXT4-fs warning (device sdd2): htree_dirblock_to_tree:1072: inode #2: lblock 0: comm mc: error -5 reading directory block
[  +0,000021] EXT4-fs warning (device sdd2): htree_dirblock_to_tree:1072: inode #2: lblock 0: comm mc: error -5 reading directory block
[  +2,512337] EXT4-fs (sdd2): unmounting filesystem.
[  +1,541277] ata2.00: exception Emask 0x50 SAct 0x4 SErr 0x4070802 action 0xe frozen
[  +0,000009] ata2.00: irq_stat 0x00400000, PHY RDY changed
[  +0,000003] ata2: SError: { RecovComm HostInt PHYRdyChg PHYInt CommWake DevExch }
[  +0,000007] ata2.00: failed command: READ FPDMA QUEUED
[  +0,000002] ata2.00: cmd 60/08:10:00:28:06/00:00:00:00:00/40 tag 2 ncq dma 4096 in
                       res 40/00:10:00:28:06/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
[  +0,000011] ata2.00: status: { DRDY }
[  +0,000008] ata2: hard resetting link
[  +1,483941] ata2: SATA link down (SStatus 0 SControl 310)
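
The SErr value in the log is a bit mask, and the names the kernel prints on the following line are just its set bits. A small Python sketch of that decoding follows. Note the assumption: the bit-to-name table is written from memory of the SATA SError register layout and the short names libata prints, so check it against the kernel source before relying on it. `decode_serror` is a made-up helper name.

```python
# Assumed mapping of SError bit positions to the short names libata prints.
SERROR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known bits set in an SError value, low bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


# The value from the dmesg excerpt above:
#   decode_serror(0x4070802)
#   -> ['RecovComm', 'HostInt', 'PHYRdyChg', 'PHYInt', 'CommWake', 'DevExch']
```

For 0x4070802 this gives the same set the kernel printed. All of those bits concern the link/PHY layer; none of the CRC or disparity bits are set in this sample.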


I'm not sure what the problem could be: the disk is only a few months old, and the other disks (1 HDD, 1 SSD) work fine...
What do you think: disk, controller, PSU...?
(Disabling NCQ doesn't help (tested by setting /sys/block/sdd/device/queue_depth to 1).)
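
The queue_depth test mentioned above can be scripted so the previous value is kept for restoring later. A minimal Python sketch, assuming root privileges and a disk name such as sdd; `set_queue_depth` and the `sys_block` parameter are hypothetical names used only for illustration.

```python
from pathlib import Path


def set_queue_depth(disk, depth, sys_block="/sys/block"):
    """Write a new queue depth for the disk and return the previous one."""
    attr = Path(sys_block) / disk / "device" / "queue_depth"
    previous = int(attr.read_text())
    attr.write_text(f"{depth}\n")
    return previous


# Example use (as root):
#   old = set_queue_depth("sdd", 1)    # depth 1 effectively disables NCQ
#   set_queue_depth("sdd", old)        # restore afterwards
```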
jpsollie
Guru


Joined: 17 Aug 2013
Posts: 323

Posted: Mon May 29, 2023 5:01 am

What do short smartctl tests say about it?
Are there any HDD firmware updates available?
If not, try playing around with the features enabled in hdparm; disabling things like APM often helps.
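
As a sketch of the APM suggestion: `hdparm -B DEVICE` queries the Advanced Power Management level and `hdparm -B 255 DEVICE` asks the drive to disable it. The Python helper below only builds those command lines; `apm_command` is a made-up name, /dev/sdd is assumed from the dmesg output earlier in the thread, and whether this particular drive supports APM at all is not known here.

```python
import subprocess


def apm_command(device, level=None):
    """Build an hdparm command: query APM if level is None, otherwise set it."""
    if level is None:
        return ["hdparm", "-B", device]
    if not 1 <= level <= 255:
        raise ValueError("APM level must be in 1..255 (255 disables APM)")
    return ["hdparm", "-B", str(level), device]


def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


# Example use (as root):
#   print(run(apm_command("/dev/sdd")))        # show current APM level
#   print(run(apm_command("/dev/sdd", 255)))   # try disabling APM
```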
_________________
The power of Gentoo optimization (not overclocked): [img]https://www.passmark.com/baselines/V10/images/503714802842.png[/img]
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54850
Location: 56N 3W

Posted: Mon May 29, 2023 7:53 am

VoidMage,

Run smartctl -x on the drive and save the result.
Run the long test with smartctl. That's the disk doing a surface scan, all on its own.
Run smartctl -x again after the test completes. It takes some hours.

Post both smartctl -x outputs.
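
The steps above could be scripted roughly as follows. This is a sketch under assumptions: the helper names, the output file names and /dev/sdd are illustrative, and waiting for the test to finish is left to the user (the drive reports the estimated duration when the test starts).

```python
import subprocess
from pathlib import Path


def smart_snapshot(device, out_file, runner=subprocess.run):
    """Save the output of `smartctl -x DEVICE` to out_file and return its path."""
    result = runner(["smartctl", "-x", device], capture_output=True, text=True)
    path = Path(out_file)
    path.write_text(result.stdout)
    return path


def start_long_test(device, runner=subprocess.run):
    """Start the drive's extended (long) self-test; return smartctl's exit status."""
    result = runner(["smartctl", "-t", "long", device],
                    capture_output=True, text=True)
    return result.returncode


# Example use (as root):
#   smart_snapshot("/dev/sdd", "smart-before.txt")
#   start_long_test("/dev/sdd")
#   ... some hours later, once the self-test has finished ...
#   smart_snapshot("/dev/sdd", "smart-after.txt")
```

smartctl's exit status is a bit mask, so a non-zero value does not by itself mean the command failed. If the link resets while the test is running, the self-test log in the second snapshot should show whether the test completed or was interrupted.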
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
VoidMage
Watchman


Joined: 14 Oct 2006
Posts: 6196

Posted: Mon May 29, 2023 4:56 pm

OK, this is before and after (unless I've messed it up...or those hardware resets interrupted the test). (that was with '-t long'; if that was meant to be '-t offline', well, that wasn't clear)

Not sure if that's a decent pastebin, but it seems to work.

...on a somewhat related note, is there a simple way to have the controller hard reset the command queue when it enters such a recovery loop?
More specifically, just a single scsi_host, so that the reset doesn't affect other disks...

...OK, I'm likely not using the proper terms in the line above; what I want is to issue a command that would make the controller stop trying to reconnect and instead act as if it had been turned off and on again (but still just for that single scsi host, not all).
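
One generic SCSI-layer mechanism that acts on a single host is to delete the disk's device node through sysfs and then ask that host alone to rescan. The Python sketch below shows the two writes. It is only an illustration with made-up helper names, it needs root, and nothing in this thread confirms that it breaks the recovery loop described above; while the host is still in error handling the writes may simply block.

```python
from pathlib import Path


def detach_disk(disk, sys_block="/sys/block"):
    """Ask the kernel to remove the SCSI device behind the given disk (e.g. 'sdd')."""
    (Path(sys_block) / disk / "device" / "delete").write_text("1\n")


def rescan_host(host, sys_scsi_host="/sys/class/scsi_host"):
    """Ask one SCSI host (e.g. 'host1') to rescan all channels, targets and LUNs."""
    (Path(sys_scsi_host) / host / "scan").write_text("- - -\n")


# Example use (as root), touching only host1 and its disk:
#   detach_disk("sdd")
#   rescan_host("host1")
```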

Edit: so, it gets more dicey; as I've said, if the recovery loop is entered, it persists through a reboot/power cycle (and that includes the power switch on the PSU). Switching to a different SATA port doesn't reset the cycle. Yet, if it gets turned off for long enough, it's (for the time being) still able to *eventually* recover. Pretty much the only thing I didn't change is the PSU cable (simply because all the others are in use).

If that's the disk after all, I'd say that's odd...

Could it be that what's failing is the disk's electronics, not its mechanical components?
All times are GMT