Gentoo Forums
[Solved] SAS smart resettable?

jpsollie
Guru
Joined: 17 Aug 2013
Posts: 323

Posted: Sat Aug 22, 2020 8:28 am    Post subject: [Solved] SAS smart resettable?

I have a SAS Constellation ES.3 drive where I ran into a weird situation:

I performed a firmware upgrade, but apparently something went wrong while writing the data, and the drive became unstable.

After a number of I/O writes, I saw read errors occurring:
Code:

smartctl -a -d aacraid,0,0,37 /dev/sdk     
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.8.1+] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3000NM0023
Revision:             E007
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500625d7fc7
Serial number:        Z1Y27B210000W5073NRQ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sat Aug 22 10:10:40 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=30]

Current Drive Temperature:     56 C
Drive Trip Temperature:        60 C

Manufactured in week 36 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  68450
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  70268
Elements in grown defect list: 13

Vendor (Seagate Cache) information
  Blocks sent to initiator = 3700963977
  Blocks received from initiator = 560400404
  Blocks read from cache and sent to initiator = 884081808
  Number of read and write commands whose size <= segment size = 19713241
  Number of read and write commands whose size > segment size = 59670

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 32885.77
  number of minutes until next internal SMART test = 55

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   357898204        2         0  357898206         61      45875.359          60
write:         0        0         5         5          6      11813.337           1
verify:    24387        0         0     24387          2          0.000           0

Non-medium error count:     3867


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->       -   32884           5864966 [0x3 0x16 0x0]

Long (extended) Self-test duration: 26000 seconds [433.3 minutes]



But, in a desperate attempt to bring it back to life, I rewrote the SAS firmware on the drive. After that, the SMART short self-test completed successfully:
Code:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.8.1+] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3000NM0023
Revision:             E007
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500625d7fc7
Serial number:        Z1Y27B210000W5073NRQ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sat Aug 22 10:14:43 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=30]

Current Drive Temperature:     56 C
Drive Trip Temperature:        60 C

Manufactured in week 36 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  68450
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  70268
Elements in grown defect list: 13

Vendor (Seagate Cache) information
  Blocks sent to initiator = 3700966026
  Blocks received from initiator = 560400404
  Blocks read from cache and sent to initiator = 884081809
  Number of read and write commands whose size <= segment size = 19713242
  Number of read and write commands whose size > segment size = 59670

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 32885.83
  number of minutes until next internal SMART test = 51

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   357900253        2         0  357900255         61      45875.360          60
write:         0        0         5         5          6      11813.337           1
verify:    24387        0         0     24387          2          0.000           0

Non-medium error count:     3867


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   32885                 - [-   -    -]
# 2  Background short  Failed in segment -->       -   32884           5864966 [0x3 0x16 0x0]

Long (extended) Self-test duration: 26000 seconds [433.3 minutes]


So it seems this worked! But the SMART health status is still in error. I am currently low-level reformatting the disk (sg_format --format /dev/sdk), but is it possible to set the SMART status back to healthy, followed by a long self-test to make sure it works again?
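
For reference, this is roughly how I intend to run that long test and check the result afterwards, through the same aacraid pass-through as in the output above (the 0,0,37 addressing is specific to my controller):
Code:

# start a long (extended) background self-test
smartctl -t long -d aacraid,0,0,37 /dev/sdk

# ...wait the ~433 minutes quoted above, then check the self-test log and overall health
smartctl -l selftest -d aacraid,0,0,37 /dev/sdk
smartctl -H -d aacraid,0,0,37 /dev/sdk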

Last edited by jpsollie on Tue Aug 25, 2020 4:42 pm; edited 1 time in total

Ant P.
Watchman
Joined: 18 Apr 2009
Posts: 6920

Posted: Sat Aug 22, 2020 10:41 am

Nope. Resetting SMART data is supposed to be impossible, for the same reason you can't reset a car odometer.

Banana
Moderator
Joined: 21 May 2004
Posts: 1842
Location: Germany

Posted: Sat Aug 22, 2020 4:17 pm

It depends on the data you put there, but I would not trust this drive anymore.

jpsollie
Guru
Joined: 17 Aug 2013
Posts: 323

Posted: Sun Aug 23, 2020 6:10 am

Why?

I know a SMART reset shouldn't be possible, because it would be a loophole for selling broken drives, but why wouldn't it be OK to revert the drive status back to normal after fixing the problem? The sg_format has completed, the number of grown defects has actually DECREASED from 13 to 2, and the long background self-test has completed successfully, so why would it no longer be trustworthy?
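
For completeness: the reformat was just the sg_format call quoted above, and one way to re-read the grown defect list count afterwards is sg_reassign from sg3_utils (it only reports the number of entries, it doesn't reassign anything):
Code:

# SCSI FORMAT UNIT on the whole drive (takes hours)
sg_format --format /dev/sdk

# afterwards, report the number of elements in the grown defect list
sg_reassign --grown /dev/sdk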

NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54736
Location: 56N 3W

Posted: Sun Aug 23, 2020 8:02 am

jpsollie,

Code:
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  68450

It's already done almost 7x its rated lifetime start-stop count.
That's a server drive, not intended to be power cycled very often.

The grown defect list doesn't really shrink. Those 11 blocks that dropped off the list are just as marginal as they always were.
When they fail next, they might not respond to rereads and you lose your data.
Maybe that's one block of a film. Maybe it's a block of your top-level directory, so all the files on the disc are gone.
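
(If you want to keep an eye on those counters, you can pull them straight out of the smartctl output, e.g. something along these lines with the aacraid addressing from your first post:)
Code:

smartctl -a -d aacraid,0,0,37 /dev/sdk | grep -E 'cycle|load-unload|defect'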

One thing sg_format does not do is a low-level format of the drive. The drive uses a voice-coil servo for head positioning, which requires the format information to find tracks.
Early voice-coil servo drives could be destroyed by doing a low-level format. That was fixed by accepting the command but not actually writing any format information, or, more to the point, not erasing what was already there.

Regard that drive as write only.

krinn
Watchman
Joined: 02 May 2003
Posts: 7470

Posted: Sun Aug 23, 2020 12:07 pm

NeddySeagoon wrote:
That's a server drive, not intended to be power cycled very often

and with a lifetime of 32884 powered-up hours over 68450 start-stop cycles, that's an average of roughly half an hour per power-up, which must be a record in awfulness :)

jpsollie
Guru
Joined: 17 Aug 2013
Posts: 323

Posted: Sun Aug 23, 2020 2:51 pm

I'm not worried about that:

Seagate's rated cycle count for Constellation ES.3 drives (according to the manual here: https://www.seagate.com/www-content/product-content/constellation-fam/constellation-es/constellation-es-3/en-us/docs/100671510b.pdf, page 18) is more or less 600k, which means the drive hasn't even reached 20% of that. I guess this is a bug in the smartctl software, as they all say "10k", while I also have a few ES.2 drives for backup purposes (rated at 300k) and WD SAS drives as well ...
Maybe this should be passed on to the smartctl developers; 10k is a low value, isn't it?

Ant P.
Watchman
Joined: 18 Apr 2009
Posts: 6920

Posted: Sun Aug 23, 2020 10:06 pm

That's head load/unload cycles, which the drive already shows as a 300k limit. The number that's gone over is start/stop cycles, i.e. how many times the motor spun up.

jpsollie
Guru
Joined: 17 Aug 2013
Posts: 323

Posted: Mon Aug 24, 2020 8:23 am

I see.

But as long as it functions correctly, I have no problem adding it to a raid6 array ^^
Also, the device has about 46 TB read and 11 TB written, which would mean roughly 57 TB of data processed. Divided by ~70k load/unload cycles, that works out to less than 1 GB per cycle.
Guess I have to check my power management settings ...
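
Quick sanity check on that GB-per-cycle figure, taken straight from the "Gigabytes processed" columns in the smartctl output above (bc is just used as a calculator here):
Code:

# (read GB + written GB) / load-unload cycles  ≈  0.82 GB per cycle
echo "scale=2; (45875.359 + 11813.337) / 70268" | bc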

NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54736
Location: 56N 3W

Posted: Mon Aug 24, 2020 8:54 am

jpsollie,

The start/stop cycle rating is exceeded, so the spindle bearings will be worn.
They are 'air bearings' which only actually wear during spin-up and spin-down, before the drive gets up to speed; there is no contact at normal operating speeds.

Next time it stops, it might not start :)
The more usual failure mechanism is misalignment due to wear, leading to unrecoverable read errors that rapidly accelerate.
Hence the data-recovery trick of running a failed drive at all sorts of odd angles to coax out one more read.

If you put that drive into a raid6 set, you only have raid5.

joanandk
Apprentice
Joined: 12 Feb 2017
Posts: 169

Posted: Mon Aug 24, 2020 12:49 pm

jpsollie wrote:
But as long as it functions correctly, I have no problem adding it to a raid6 array ^^


Hey,

The drive is around 50 bucks. Are you willing to risk 3 TB of data for that (or, in your case of raid6, 12 TB)?

BR

jpsollie
Guru
Joined: 17 Aug 2013
Posts: 323

Posted: Tue Aug 25, 2020 4:41 pm

All right, I'll buy another one.
Thx for the advice!

szatox
Advocate
Joined: 27 Aug 2013
Posts: 3477

Posted: Tue Aug 25, 2020 6:03 pm

joanandk wrote:
jpsollie wrote:
But as long as it functions correctly, I have no problem adding it to a raid6 array ^^


Hey,

The drive is around 50 bucks. Are you willing to risk 3 TB of data for that (or, in your case of raid6, 12 TB)?

BR

Guess what... Drives fail.
That's why we use RAID in the first place.
What we really don't want is multiple drives failing at the same time. So don't build a new array from a bunch of brand-new disks of a single manufacturer and brand, and (the horror) with consecutive serial numbers.
For backups I'm still using a decade-old drive which developed a few bad blocks within its first year. I get it, it's not a perfect solution, but I don't expect it to fail right after I've deleted something important everywhere else.
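
(A quick way to eyeball that before building an array is to dump the identity info of each candidate drive; the device names here are only an example:)
Code:

# print vendor/model/serial for a handful of candidate drives
for d in /dev/sd{b..e}; do
    echo "== $d =="
    smartctl -i "$d" | grep -E 'Vendor|Product|Model|Serial'
done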