Help me diagnose/fix(?) a problem with an HDD

VoidMage · Watchman Joined: 14 Oct 2006 Posts: 6196

Well, it's been a while... (and I'm likely not back for long)

Anyway...

Relevant strings:

jpsollie · Guru Joined: 17 Aug 2013 Posts: 323

What do short smartctl tests say about it?
Are there any HDD firmware updates available?
If not, try playing around with the enabled features in hdparm; disabling things like APM often helps.
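
For reference, the suggestions above could be scripted roughly as follows. This is only a sketch: `/dev/sdX` is a placeholder device name, the helper names are made up for illustration, and not every drive accepts `hdparm -B 255` (the value hdparm documents for turning APM off). Nothing is executed unless `dry_run=False` is passed, and the real commands need root.

```python
import subprocess


def short_test_cmd(device):
    """Command that starts a SMART short self-test on the drive."""
    return ["smartctl", "-t", "short", device]


def apm_query_cmd(device):
    """Command that reports the current APM setting."""
    return ["hdparm", "-B", device]


def apm_disable_cmd(device):
    """Command that asks the drive to disable APM (255 = off, if supported)."""
    return ["hdparm", "-B", "255", device]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    if dry_run:
        print(" ".join(cmd))
        return None
    # check=False: smartctl and hdparm may exit non-zero without a hard failure
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Calling `run(apm_query_cmd("/dev/sdX"))` just prints the command line, which is a safe way to check it before running it for real.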

NeddySeagoon · Posted: Mon May 29, 2023 7:53 am

VoidMage,

Run smartctl -x on the drive and save the result.
Run the long test with smartctl. That's the disk doing a surface scan, all on its own.
Run smartctl -x again after the test completes. It takes some hours.

Post both smartctl -x outputs.
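
The steps above could be scripted along these lines. It is a sketch, not a definitive procedure: `/dev/sdX` and the file names are placeholders, the helper names are invented for illustration, and the wait for the long test to finish is left to the user. `diff_reports` simply shows which lines changed between the two saved outputs.

```python
import difflib
import subprocess


def xall_cmd(device):
    """Command that prints all SMART and device information."""
    return ["smartctl", "-x", device]


def long_test_cmd(device):
    """Command that starts the extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def save_report(device, out_path):
    """Run smartctl -x and write its output to out_path (needs root)."""
    # smartctl's exit status is a bitmask, so a non-zero value is not fatal here
    result = subprocess.run(xall_cmd(device), capture_output=True,
                            text=True, check=False)
    with open(out_path, "w") as fh:
        fh.write(result.stdout)
    return result.returncode


def diff_reports(before_text, after_text):
    """Return unified-diff lines between two saved smartctl -x outputs."""
    return list(difflib.unified_diff(
        before_text.splitlines(), after_text.splitlines(),
        fromfile="before", tofile="after", lineterm=""))
```

A typical sequence would be `save_report("/dev/sdX", "before.txt")`, then running `long_test_cmd("/dev/sdX")`, waiting until the drive reports the test as finished, and finally `save_report("/dev/sdX", "after.txt")`.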
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

VoidMage · Watchman Joined: 14 Oct 2006 Posts: 6196

OK, this is before and after (unless I've messed it up, or those hardware resets interrupted the test). That was with '-t long'; if it was meant to be '-t offline', well, that wasn't clear.

Not sure if that's a decent pastebin, but it seems to work.

...on a somewhat related note, is there a simple way to have the controller hard-reset the command queue when it enters such a recovery loop?
More specifically, for just a single scsi_host, so that the reset doesn't affect other disks...

...OK, I'm likely not using the proper terms above. What I want is to issue a command that makes the controller stop trying to reconnect and instead act as if it had been turned off and on again (but still just for that single SCSI host, not all of them).
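
One thing that might come close, as far as I know, is the Linux sysfs interface for detaching a single device and rescanning only its SCSI host. The sketch below assumes that approach; the helper names and the disk name `sdX` are illustrative, it needs root, and whether it actually breaks the recovery loop described here is not something this thread confirms.

```python
import re
from pathlib import Path


def scsi_host_of(resolved_device_path):
    """Extract the SCSI host number from a resolved sysfs device path."""
    match = re.search(r"/host(\d+)/", resolved_device_path)
    return int(match.group(1)) if match else None


def delete_path(disk):
    """sysfs file that detaches the disk when '1' is written to it."""
    return Path("/sys/block") / disk / "device" / "delete"


def scan_path(host):
    """sysfs file that rescans one SCSI host when '- - -' is written."""
    return Path("/sys/class/scsi_host") / f"host{host}" / "scan"


def drop_and_rescan(disk, dry_run=True):
    """Detach one disk, then rescan only the host it was attached to."""
    # Look up the host first: the device link is gone after the delete.
    resolved = str((Path("/sys/block") / disk / "device").resolve())
    host = scsi_host_of(resolved)
    if host is None:
        raise ValueError(f"no SCSI host found for {disk}")
    steps = [(delete_path(disk), "1"), (scan_path(host), "- - -")]
    for path, value in steps:
        if dry_run:
            print(f"would write {value!r} to {path}")
        else:
            path.write_text(value)
    return host
```

Other hosts are left alone because only one `hostN/scan` file is written. With the default `dry_run=True` the function only prints what it would do.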

Edit: so, it gets more dicey. As I've said, once the recovery loop is entered, it persists through a reboot/power cycle (and that includes the power switch on the PSU). Switching to a different SATA port doesn't reset the cycle. Yet if it is left turned off for long enough, it's (for the time being) still able to *eventually* recover. Pretty much the only thing I didn't change is the PSU cable (simply because all the others are in use).

If that's the disk after all, I'd say that's odd...

Could it be that what's failing is the disk's electronics, not the mechanical components?