jpsollie Guru
Joined: 17 Aug 2013 Posts: 323
Posted: Sat Aug 22, 2020 8:28 am Post subject: [Solved] SAS smart resettable?
I have a SAS Constellation ES.3 drive where I ran into a weird situation:
I performed a firmware upgrade, but apparently something went wrong while writing the data, and the drive became unstable.
After a number of IO writes, I started seeing read errors occurring:
Code:
smartctl -a -d aacraid,0,0,37 /dev/sdk
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.8.1+] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3000NM0023
Revision: E007
Compliance: SPC-4
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500625d7fc7
Serial number: Z1Y27B210000W5073NRQ
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Aug 22 10:10:40 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=30]
Current Drive Temperature: 56 C
Drive Trip Temperature: 60 C
Manufactured in week 36 of year 2014
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 68450
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 70268
Elements in grown defect list: 13
Vendor (Seagate Cache) information
Blocks sent to initiator = 3700963977
Blocks received from initiator = 560400404
Blocks read from cache and sent to initiator = 884081808
Number of read and write commands whose size <= segment size = 19713241
Number of read and write commands whose size > segment size = 59670
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 32885.77
number of minutes until next internal SMART test = 55
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    357898204        2         0  357898206         61      45875.359        60
write:           0        0         5          5          6      11813.337         1
verify:      24387        0         0      24387          2          0.000         0
Non-medium error count: 3867
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->        -     32884        5864966 [0x3 0x16 0x0]
Long (extended) Self-test duration: 26000 seconds [433.3 minutes]
But, in a desperate attempt to bring it back to life, I rewrote the SAS firmware on the drive. After that, the short SMART self-test completed successfully:
Code:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.8.1+] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3000NM0023
Revision: E007
Compliance: SPC-4
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500625d7fc7
Serial number: Z1Y27B210000W5073NRQ
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Aug 22 10:14:43 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=30]
Current Drive Temperature: 56 C
Drive Trip Temperature: 60 C
Manufactured in week 36 of year 2014
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 68450
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 70268
Elements in grown defect list: 13
Vendor (Seagate Cache) information
Blocks sent to initiator = 3700966026
Blocks received from initiator = 560400404
Blocks read from cache and sent to initiator = 884081809
Number of read and write commands whose size <= segment size = 19713242
Number of read and write commands whose size > segment size = 59670
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 32885.83
number of minutes until next internal SMART test = 51
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    357900253        2         0  357900255         61      45875.360        60
write:           0        0         5          5          6      11813.337         1
verify:      24387        0         0      24387          2          0.000         0
Non-medium error count: 3867
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                    -     32885              - [-   -    -]
# 2  Background short  Failed in segment -->        -     32884        5864966 [0x3 0x16 0x0]
Long (extended) Self-test duration: 26000 seconds [433.3 minutes]
So, it seems like this worked! But the SMART health status is still "in error". I am currently low-level reformatting the disc (sg_format --format /dev/sdk), but is it possible to set the SMART status back to healthy, followed by a long self-test to make sure it works again? _________________ The power of Gentoo optimization (not overclocked): [img]https://www.passmark.com/baselines/V10/images/503714802842.png[/img]
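For reference, here is roughly the sequence I'm running (just a sketch; the aacraid,0,0,37 addressing matches my controller and slot, adjust for your own setup):
Code:
# let the drive reformat itself end to end (this is the step currently running)
sg_format --format /dev/sdk

# afterwards, kick off a long (extended) background self-test through the controller
smartctl -d aacraid,0,0,37 -t long /dev/sdk

# roughly 7 hours later (26000 s), check the self-test log and the overall health status
smartctl -d aacraid,0,0,37 -l selftest /dev/sdk
smartctl -d aacraid,0,0,37 -H /dev/sdk

# optional: persist the log/error counters across power cycles, as the GLTSD note above suggests
smartctl -d aacraid,0,0,37 -S on /dev/sdk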
Last edited by jpsollie on Tue Aug 25, 2020 4:42 pm; edited 1 time in total
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
Posted: Sat Aug 22, 2020 10:41 am Post subject:
Nope. Resetting SMART data is supposed to be impossible, for the same reason you can't reset a car odometer.
jpsollie Guru
Joined: 17 Aug 2013 Posts: 323
Posted: Sun Aug 23, 2020 6:10 am Post subject:
Why?
I know a SMART reset should not be possible, because it would be a loophole for selling broken drives, but why wouldn't it be OK to revert the drive's status back to normal after fixing the problem? The sg_format has completed, the number of grown defects has actually DECREASED from 13 to 2, and the long background self-test has completed successfully, so why would it not be trustworthy any longer? _________________ The power of Gentoo optimization (not overclocked): [img]https://www.passmark.com/baselines/V10/images/503714802842.png[/img]
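In case it's useful, this is roughly how I checked those numbers (a sketch; sg_reassign comes from sg3_utils, and I'm assuming the drive is still visible both directly as /dev/sdk and through the controller):
Code:
# number of elements in the grown defect list (GLIST)
sg_reassign --grown /dev/sdk

# same figure plus the self-test results, queried through the aacraid controller
smartctl -d aacraid,0,0,37 -a /dev/sdk | grep -i 'grown defect'
smartctl -d aacraid,0,0,37 -l selftest /dev/sdk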
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54733 Location: 56N 3W
Posted: Sun Aug 23, 2020 8:02 am Post subject:
jpsollie,
Code:
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 68450

It's already done almost 7x its life-rated start/stop count.
That's a server drive, not intended to be power cycled very often.
The grown defect list doesn't really shrink. Those 11 blocks that vanished from it are just as marginal as they always were.
When they next fail, they might not respond to rereads and you lose your data.
Maybe that's one block of a film. Maybe it's a block of your top-level directory, so all the files on the disc are gone.
One thing sg_format does not do is a low-level format of the drive. The drive uses a 'voice-coil servo' for head positioning, which requires the format information to find tracks.
Early 'voice-coil servo' drives could be destroyed by doing a low-level format. That was fixed by accepting the command but not actually writing any format information, or, more to the point, not erasing what was already there.
Regard that drive as write-only. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
krinn Watchman
Joined: 02 May 2003 Posts: 7470
Posted: Sun Aug 23, 2020 12:07 pm Post subject:
NeddySeagoon wrote:
That's a server drive, not intended to be power cycled very often
And with a lifetime of 32884 hours, that works out to an average of under half an hour of power-on time per start/stop cycle, which must be a record in awfulness.
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
Posted: Sun Aug 23, 2020 10:06 pm Post subject:
That's head load/unload cycles, which the drive already shows as a 300k limit. The number that's gone over is start/stop cycles, i.e. how many times the motor spun up.
jpsollie Guru
Joined: 17 Aug 2013 Posts: 323
Posted: Mon Aug 24, 2020 8:23 am Post subject:
I see,
But as long as it functions correctly, I have no problem adding it to a raid6 array ^^
Also, the device has read about 46 TB and written about 12 TB, roughly 58 TB processed in total. Divided by 70k load/unload cycles, that means less than 1 GB per cycle.
Guess I have to check my power management settings ... _________________ The power of Gentoo optimization (not overclocked): [img]https://www.passmark.com/baselines/V10/images/503714802842.png[/img]
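Something like this is what I'll be looking at: a sketch using sdparm to inspect the SAS Power Condition mode page (the field names below, like IDLE, vary between drives and sdparm versions, so treat them only as examples):
Code:
# dump the Power Condition mode page (idle/standby enable bits and timers)
sdparm --page=po --long /dev/sdk

# example only: disable the idle condition if it turns out to be enabled
# (check the actual field names reported by the --long output first)
sdparm --set=IDLE=0 /dev/sdk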
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54733 Location: 56N 3W
Posted: Mon Aug 24, 2020 8:54 am Post subject:
jpsollie,
The start/stop cycles are exceeded. The spin bearings will be worn.
They are 'air bearings' which only actually wear during spin up and spin down, before the drive gets up to speed. There is no contact at normal operating speeds.
Next time it stops, it might not start :)
The more usual failure mechanism is misalignment due to wear, leading to unrecoverable read errors that rapidly accelerate.
Hence the data recovery trick of running the failed drive at all sorts of odd angles to coax out one more read.
If you put that drive into a RAID 6 set, you only have RAID 5. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
joanandk Apprentice
Joined: 12 Feb 2017 Posts: 169
Posted: Mon Aug 24, 2020 12:49 pm Post subject:
jpsollie wrote:
But as long as it functions correctly, I have no problem adding it to a raid6 array ^^
Hey,
The drive is around 50 bucks. Are you willing to risk 3 TB of data for that (or, in your case of RAID 6, 12 TB)?
BR
szatox Advocate
Joined: 27 Aug 2013 Posts: 3477
Posted: Tue Aug 25, 2020 6:03 pm Post subject:
joanandk wrote:
jpsollie wrote:
But as long as it functions correctly, I have no problem adding it to a raid6 array ^^
Hey,
The drive is around 50 bucks. Are you willing to risk 3 TB of data for that (or, in your case of RAID 6, 12 TB)?
BR
Guess what... Drives fail.
That's why we use RAID in the first place.
What we really don't want is multiple drives failing at the same time. So don't create a new array with a bunch of brand-new disks from a single manufacturer and model, and (the horror) with consecutive serial numbers.
For backups I'm still using a decade-old drive that developed a few bad blocks within its first year. I get it, it's not a perfect solution, but I don't expect it to fail right after I've deleted something important everywhere else.
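On that note, a minimal sketch of keeping an eye on a marginal member in an md RAID set (assuming the array is /dev/md0, which is only an example name):
Code:
# trigger a scrub so every sector gets read while redundancy still exists
echo check > /sys/block/md0/md/sync_action

# watch progress and the mismatch counter
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt

# per-member state of the array
mdadm --detail /dev/md0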