Zucca Moderator
Joined: 14 Jun 2007 Posts: 3920 Location: Rasi, Finland
|
Posted: Mon Mar 01, 2021 3:41 pm Post subject: Advices/opinions needed - hard drive failing? |
|
|
I ran SMART extended/long tests and one of my drives started to look worrying...
Code: | # /usr/sbin/skdump /dev/sda
Device: sat12:/dev/sda
Type: 12 Byte SCSI ATA SAT Passthru
Size: 1907729 MiB
Model: [WDC WD20EARX-008FB0]
Serial: [WD-WCAZAK774726]
Firmware: [51.0AB51]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 30060 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 324 min
Conveyance Self-Test Polling Time: 5 min
Bad Sectors: 3 sectors
Powered On: 7.1 years
Power Cycles: 173
Average Powered On Per Power Cycle: 15.0 days
Temperature: 41.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 17 0x110000000000 prefail online yes yes
3 spin-up-time 183 182 21 5.8 s 0xda1600000000 prefail online yes yes
4 start-stop-count 100 100 0 189 0xbd0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 3 sectors 0x030000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 15 15 0 7.1 years 0xd9f200000000 old-age online n/a n/a
10 spin-retry-count 100 100 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 100 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 173 0xad0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 68 0x440000000000 old-age online n/a n/a
193 load-cycle-count 200 200 0 124 0x7c0000000000 old-age online n/a n/a
194 temperature-celsius-2 109 96 0 41.0 C 0x290000000000 old-age online n/a n/a
196 reallocated-event-count 197 197 0 3 0x030000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 200 200 0 0 sectors 0x000000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 200 0 0 0x000000000000 old-age online n/a n/a
200 multi-zone-error-rate 200 200 0 0 0x000000000000 old-age offline n/a n/a | ID 5 looks worrying...
What is strange is Code: | # /usr/sbin/skdump --status /dev/sda
GOOD | Maybe libatasmart is too optimistic? :P _________________ ..: Zucca :..
My gentoo installs: | init=/sbin/openrc-init
-systemd -logind -elogind seatd |
Quote: | I am NaN! I am a man! |
|
|
Anon-E-moose Watchman
Joined: 23 May 2008 Posts: 6215 Location: Dallas area
|
Posted: Mon Mar 01, 2021 4:04 pm Post subject: |
|
|
All reallocated-sector-count tells you is that the drive has reallocated some number of sectors.
What would be a problem is if 197 and possibly 198 start having something more than zero. _________________ UM780, 6.12 zen kernel, gcc 13, openrc, wayland |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Mon Mar 01, 2021 4:43 pm Post subject: |
|
|
Zucca,
Code: | 197 current-pending-sector | being non zero is a bad thing.
It means that the drive has bad sectors that it would like to remap but it can't as they are unreadable.
That is, the drive cannot read its own writing.
A non-zero Code: | 5 reallocated-sector-count | is expected on a drive that's been in use for 7.1 years.
The drive is supposed to remap sectors that are difficult to read before reads fail. It's normal drive operation.
A sudden jump in the reallocated-sector-count should be an alarm signal.
Taken together with Code: | 196 reallocated-event-count | the drive has had three remapping events of one sector each.
That drive was OK at the time you asked it. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
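A minimal Python sketch of the rule of thumb from the replies above: a non-zero count on 197 or 198 is the worrying case, while a small non-zero count on 5 is only something to watch. The regular expression assumes the skdump column layout posted in this thread; the function names `parse_rows` and `triage` and the wording of the verdicts are made up for illustration.

```python
import re

# Matches one attribute row of the skdump table as posted in this thread.
# The column layout is an assumption taken from that output; other tools differ.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<value>\d+)\s+(?P<worst>\d+)\s+"
    r"(?P<thres>\d+)\s+(?P<pretty>.+?)\s+(?P<raw>0x[0-9a-fA-F]{12})\s"
)

def parse_rows(text):
    """Return {attribute id: leading integer of the Pretty column}."""
    counts = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        num = re.match(r"\d+", m.group("pretty"))
        if num:  # rows whose Pretty column is "n/a" are skipped
            counts[int(m.group("id"))] = int(num.group())
    return counts

def triage(counts):
    """Apply the thread's rule of thumb to the sector-related attributes."""
    if counts.get(197, 0) or counts.get(198, 0):
        return "unreadable sectors: check backups, plan replacement"
    if counts.get(5, 0):
        return "remapped sectors only: watch for a sudden jump"
    return "no sector problems reported"

# Rows copied from the first post in this thread.
ZUCCA = """\
5 reallocated-sector-count 200 200 140 3 sectors 0x030000000000 prefail online yes yes
196 reallocated-event-count 197 197 0 3 0x030000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 200 200 0 0 sectors 0x000000000000 old-age offline n/a n/a
"""

print(triage(parse_rows(ZUCCA)))
```

With the rows from the first post this lands in the "remapped sectors only" branch, which matches the verdict that the drive was OK at the time it was asked.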
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3920 Location: Rasi, Finland
|
Posted: Mon Mar 01, 2021 6:06 pm Post subject: |
|
|
Thanks guys.
It's one of five drives in my btrfs-raid1 (essentially raid5) drive pool/stack.
I'll see if I can convert the set into one that can tolerate two drive failures.
I guess I'll ready my order for one spare... just in case. Besides, I need a few bigger drives if I'm going to have more redundancy.
But first I'll make sure my backup disks are working. _________________ ..: Zucca :..
My gentoo installs: | init=/sbin/openrc-init
-systemd -logind -elogind seatd |
Quote: | I am NaN! I am a man! |
|
|
AJM Apprentice
Joined: 25 Sep 2002 Posts: 195 Location: Aberdeen, Scotland
|
Posted: Mon Mar 01, 2021 6:14 pm Post subject: |
|
|
Maybe I'm paranoid / wasteful, but I've always gone by the rule that if a drive has even one bad sector pending it goes in the bin. My experience has always been that if one sector has gone bad, others will almost inevitably follow and usually before too long... |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Mon Mar 01, 2021 6:29 pm Post subject: |
|
|
AJM,
Pending means the drive is dead. That's grounds for a warranty return.
The relocation mechanism working properly is normal operation.
Indeed, new drives have relocated sectors but the counts are set to zero as part of final test.
That's why they always appear to be perfect.
Remapped sectors can be detected by a dip in the continuous read speed as the remapping causes extra seek time. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Mon Mar 01, 2021 6:31 pm Post subject: |
|
|
Please don't forget that the meaning of the SMART values varies between vendors and even between product series of the same vendor. SMART data is NOT standardized. |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Mon Mar 01, 2021 6:48 pm Post subject: |
|
|
mike155,
Correct but it is normalised for display.
When any parameter's Value or Worst is equal to or lower than Thresh, that parameter has failed.
In the case of the reallocated-event-count, that logic is broken.
That is, SMART will report that a drive is OK, even when it has an unreadable sector that might prevent the system booting. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
Buffoon Veteran
Joined: 17 Jun 2015 Posts: 1369 Location: EU or US
|
Posted: Tue Mar 02, 2021 12:04 am Post subject: |
|
|
Is this drive OK? Just logged into Wife's puter and that's what she has. I've got that weird feeling ...
Code: | Device: sat16:/dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 152627 MiB
Model: [FUJITSU MJA2160BH G2]
Serial: [K96PTA125BWU]
Firmware: [0084001C]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 508 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 72 min
Conveyance Self-Test Polling Time: 2 min
Bad Sectors: 589834 sectors
Powered On: 5.8 years
Power Cycles: 982
Average Powered On Per Power Cycle: 2.2 days
Temperature: 36.0 C
Attribute Parsing Verification: Bad
Overall Status: BAD_SECTOR
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 100 100 46 44021 0xf5ab00000000 prefail online yes yes
2 throughput-performance 100 100 30 n/a 0x000077010000 prefail offline yes yes
3 spin-up-time 100 100 25 n/a 0x000000000000 prefail online yes yes
4 start-stop-count 99 99 0 2177 0x810800000000 old-age online n/a n/a
5 reallocated-sector-count 100 100 24 589834 sectors 0x0a000900c707 prefail online yes yes
7 seek-error-rate 100 54 47 988 0xdc0300000000 prefail online yes yes
8 seek-time-performance 100 100 19 n/a 0x000000000000 prefail offline yes yes
9 power-on-hours 1 1 0 5.8 years 0xa0c600000000 old-age online n/a n/a
10 spin-retry-count 100 100 20 0 0x000000000000 prefail online yes yes
12 power-cycle-count 100 100 0 982 0xd60300000000 old-age online n/a n/a
192 power-off-retract-count 99 99 0 379 0x7b0100000000 old-age online n/a n/a
193 load-cycle-count 74 74 0 532917 0xb52108000000 old-age online n/a n/a
194 temperature-celsius-2 100 15 0 36.0 C 0x240012004d00 old-age online n/a n/a
195 hardware-ecc-recovered 100 100 0 220 0xdc0000000000 old-age online n/a n/a
196 reallocated-event-count 100 100 0 26798194698 0x0a004c3d0600 old-age online n/a n/a
197 current-pending-sector 100 95 0 0 sectors 0x000000000000 old-age online n/a n/a
198 offline-uncorrectable 96 96 0 9 sectors 0x090000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 253 0 1 0x010000000000 old-age online n/a n/a
200 multi-zone-error-rate 100 100 60 19071 0x7f4a00000000 prefail online yes yes
203 run-out-cancel 100 99 0 n/a 0x9507e1f56401 old-age online n/a n/a
240 head-flying-hours 200 200 0 n/a 0x000000000000 old-age online n/a n/a |
_________________ Life is a tragedy for those who feel and a comedy for those who think. |
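The large "Pretty" numbers in that output can be reproduced from the Raw column with plain arithmetic, which may help explain the "Attribute Parsing Verification: Bad" line. The Python sketch below assumes the six raw bytes are printed in storage order and read little-endian; that assumption is consistent with the figures posted in this thread. Whether the low 16 bits are the real count on this Fujitsu model is a guess that nothing in the thread confirms. The helper names `raw_bytes` and `le` are made up for illustration.

```python
def raw_bytes(raw_hex):
    """Turn skdump's '0x' + 12 hex digits into the six raw bytes, in printed order."""
    return bytes.fromhex(raw_hex[2:])

def le(data):
    """Read a byte string as a little-endian unsigned integer."""
    return int.from_bytes(data, "little")

raw5 = raw_bytes("0x0a000900c707")    # 5 reallocated-sector-count
raw196 = raw_bytes("0x0a004c3d0600")  # 196 reallocated-event-count

# Low 32 bits of attribute 5 reproduce the posted Pretty value.
print(le(raw5[:4]))                   # 589834

# All 48 bits of attribute 196 reproduce the posted Pretty value.
print(le(raw196))                     # 26798194698

# Low 16 bits of each field, in case the vendor packs several counters
# into one raw value (an assumption, not confirmed anywhere in the thread).
print(le(raw5[:2]), le(raw196[:2]))   # 10 10

# Sanity check against the first post: power-on-hours 0xd9f200000000.
print(le(raw_bytes("0xd9f200000000")))  # 62169 hours, about 7.1 years
```

If the packing guess holds, the drive would have about 10 reallocated sectors rather than 589834, but the 9 offline-uncorrectable sectors are still worth acting on.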
|
figueroa Advocate
Joined: 14 Aug 2005 Posts: 3007 Location: Edge of marsh USA
|
Posted: Tue Mar 02, 2021 3:44 am Post subject: |
|
|
Buffoon wrote: | Is this drive OK? Just logged into Wife's puter and that's what she has. I've got that weird feeling ... |
As a minimum, I'd be watching this one closely, double checking state of backups (beware of perfect backups of corrupted data), and laying in a spare to have on-hand. _________________ Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi |
|
figueroa Advocate
Joined: 14 Aug 2005 Posts: 3007 Location: Edge of marsh USA
|
Posted: Tue Mar 02, 2021 3:54 am Post subject: |
|
|
The drive here is the backup drive of a remote server, mainly NFS for other machines on the network to hold their backups. No reallocated sectors, but one pending and uncorrectable. Given the 10.1 year power-on hours, I'm planning to replace it.
Code: | $ sudo /usr/sbin/skdump /dev/sdb
Password:
Device: sat16:/dev/sdb
Type: 16 Byte SCSI ATA SAT Passthru
Size: 476940 MiB
Model: [WDC WD5000AAKS-00UU3A0]
Serial: [WD-WCAYU0555963]
Firmware: [01.03B01]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 7980 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 95 min
Conveyance Self-Test Polling Time: 5 min
Bad Sectors: 1 sectors
Powered On: 10.1 years
Power Cycles: 170
Average Powered On Per Power Cycle: 21.6 days
Temperature: 45.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name Value Worst Thres Pretty Raw Type Updates Good Good/Past
1 raw-read-error-rate 200 200 51 522 0x0a0200000000 prefail online yes yes
3 spin-up-time 156 138 21 3.2 s 0x560c00000000 prefail online yes yes
4 start-stop-count 100 100 0 172 0xac0000000000 old-age online n/a n/a
5 reallocated-sector-count 200 200 140 0 sectors 0x000000000000 prefail online yes yes
7 seek-error-rate 200 200 0 0 0x000000000000 old-age online n/a n/a
9 power-on-hours 1 1 0 10.1 years 0x4b5801000000 old-age online n/a n/a
10 spin-retry-count 100 100 0 0 0x000000000000 old-age online n/a n/a
11 calibration-retry-count 100 100 0 0 0x000000000000 old-age online n/a n/a
12 power-cycle-count 100 100 0 170 0xaa0000000000 old-age online n/a n/a
192 power-off-retract-count 200 200 0 141 0x8d0000000000 old-age online n/a n/a
193 load-cycle-count 200 200 0 30 0x1e0000000000 old-age online n/a n/a
194 temperature-celsius-2 98 77 0 45.0 C 0x2d0000000000 old-age online n/a n/a
196 reallocated-event-count 200 200 0 0 0x000000000000 old-age online n/a n/a
197 current-pending-sector 200 200 0 1 sectors 0x010000000000 old-age online n/a n/a
198 offline-uncorrectable 200 200 0 1 sectors 0x010000000000 old-age offline n/a n/a
199 udma-crc-error-count 200 200 0 14 0x0e0000000000 old-age online n/a n/a
200 multi-zone-error-rate 200 200 0 10 0x0a0000000000 old-age offline n/a n/a
|
_________________ Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi |
|
lord_khelben n00b
Joined: 22 Jan 2017 Posts: 7
|
Posted: Tue Mar 02, 2021 10:59 am Post subject: |
|
|
NeddySeagoon wrote: | Zucca,
Code: | 197 current-pending-sector | being non zero is a bad thing.
It means that the drive has bad sectors that it would like to remap but it can't as they are unreadable.
That is, the drive cannot read its own writing.
|
NeddySeagoon wrote: | AJM,
Pending means the drive is dead. That's grounds for a warranty return.
|
Is Current Pending Sector that bad ?
I have seen quite a few cases of Current Pending Sector being non-zero because of a flaky cable or power loss during writing. For example the power loss occurred after writing the sector but before updating the internal checksum of the drive (or vice versa). This resulted in wrong checksum and the sector being reported as "not being able to be read" while there was no real problem in the disk surface. After using dd to write zeroes to that particular sector, CPS was lowered to zero again and everything worked correctly. |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Tue Mar 02, 2021 5:12 pm Post subject: |
|
|
lord_khelben,
The SMART data only tells about what is happening inside the drive.
Only Code: | 199 udma-crc-error-count | can be a data cable problem.
When the 197 current-pending-sector count includes LBA 0, the partition table is unreadable and the kernel cannot find any filesystems on the drive.
The drive only finds out about unreadable sectors when it actually goes to read them.
That means that they always have your data in them.
When a sector fails on write, the write is remapped immediately, while that data is still in the write cache. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Tue Mar 02, 2021 5:20 pm Post subject: |
|
|
figueroa
Code: | 197 current-pending-sector 200 200 0 1 sectors |
Some of your data is already gone. ddrescue may coax one more read.
What you have lost depends on what the sector holds.
A data block for a file. That's the best case.
A directory block. That's pretty bad. Your data is still there, but it can't be found by traversing the directory structure.
Maybe it's a directory block that contains other directories. You just lost access to a lot more data.
Is it a file system meta data block?
That's worse again.
It isn't going to get better. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9886 Location: almost Mile High in the USA
|
Posted: Tue Mar 02, 2021 5:45 pm Post subject: |
|
|
Is this drive still good?
Code: | === START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes. |
Code: | 5 Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW 30
9 Power_On_Hours 0x0012 096 096 000 Old_age Always - 28864
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
|
Because I've been using this drive for several months like this now (Hint: I'm not using this drive for anything valuable!!!) _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Last edited by eccerr0r on Tue Mar 02, 2021 5:52 pm; edited 1 time in total |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Tue Mar 02, 2021 5:51 pm Post subject: |
|
|
eccerr0r,
Code: | 5 Reallocated_Sector_Ct 0x0033 001 001 005 |
That's Value Worst Thresh, the parameter has failed as both Value and Worst are <= Thresh
197 Current_Pending_Sector is still zero, so you don't have any lost data yet. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
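The Value / Worst / Thresh comparison described above can be written down as a short Python sketch. Treating a threshold of 0 as "n/a" is an interpretation based on the Good columns of the skdump tables earlier in the thread, not something taken from a specification. The function name `attribute_status` and the status strings are made up for illustration.

```python
def attribute_status(value, worst, thresh):
    """Normalised-value check as described in the thread.

    A threshold of 0 is reported as 'n/a', mirroring the Good columns in the
    skdump output posted above (an interpretation, not a specification).
    """
    if thresh == 0:
        return "n/a"
    if value <= thresh:
        return "failing now"
    if worst <= thresh:
        return "failed in the past"
    return "ok"

# eccerr0r's attribute 5: 001 001 005
print(attribute_status(1, 1, 5))        # failing now

# Zucca's attribute 5: 200 200 140
print(attribute_status(200, 200, 140))  # ok

# figueroa's attribute 197: 200 200 0, with a raw count of 1 pending sector
print(attribute_status(200, 200, 0))    # n/a
```

The last case is the gap mentioned earlier in the thread: with a threshold of 0 the normalised check never trips, so the raw pending-sector count has to be read separately.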
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9886 Location: almost Mile High in the USA
|
Posted: Tue Mar 02, 2021 5:55 pm Post subject: |
|
|
Actually the pending sector list was >0 for a while but I was able to coax the drive to report a value of 0 and pass its selftest finally. Prior to the coaxing, it was reporting bad sectors and had problems reading sectors left and right.
But yeah already knew this drive has pretty much expired. Even BIOS with SMART support will hang on boot requiring manual interaction to allow boot with this drive attached - had to disable it so I can still squeeze the last few moments of life out of this disk...
... for the past few months...
----
Well that hard drive died, or is choking on bad sectors now. Wasn't unexpected of course, so no big deal.
Now this is the kicker for another hard drive. Actually, though it's hard, it's quite solid...
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours_and_Msec 0x0032 000 000 000 Old_age Always - 935604h+53m+50.010s |
Surely this drive hasn't really been turned on for this many hours, otherwise it'd give the Centennial Bulb a run for its money... unfortunately this corrupted value probably kills the resale value of this drive despite still having 100% on its media wearout indicator (MLC)... _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
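A quick arithmetic check of that power-on figure, as a small Python sketch. The year length of 365.25 days is an assumption chosen for a rough estimate, and `hours_to_years` is a made-up helper name.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, assumed for a rough estimate

def hours_to_years(hours):
    """Convert a power-on-hours count to years."""
    return hours / HOURS_PER_YEAR

# Raw value reported by the second drive: 935604h
print(round(hours_to_years(935604), 1))  # 106.7

# Power_On_Hours of the failing drive shown earlier: 28864
print(round(hours_to_years(28864), 1))   # 3.3
```

More than a century of power-on time is not plausible for this drive, which supports reading the value as corrupted.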
|