Gentoo Forums
Advices/opinions needed - hard drive failing?

Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3920
Location: Rasi, Finland

Posted: Mon Mar 01, 2021 3:41 pm    Post subject: Advices/opinions needed - hard drive failing?

I ran SMART extended/long tests and one of my drives started to look worrying...

Code:
# /usr/sbin/skdump /dev/sda
Device: sat12:/dev/sda
Type: 12 Byte SCSI ATA SAT Passthru
Size: 1907729 MiB
Model: [WDC WD20EARX-008FB0]
Serial: [WD-WCAZAK774726]
Firmware: [51.0AB51]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 30060 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 324 min
Conveyance Self-Test Polling Time: 5 min
Bad Sectors: 3 sectors
Powered On: 7.1 years
Power Cycles: 173
Average Powered On Per Power Cycle: 15.0 days
Temperature: 41.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         200   200    51   17          0x110000000000 prefail online  yes  yes
  3 spin-up-time                183   182    21   5.8 s       0xda1600000000 prefail online  yes  yes
  4 start-stop-count            100   100     0   189         0xbd0000000000 old-age online  n/a  n/a
  5 reallocated-sector-count    200   200   140   3 sectors   0x030000000000 prefail online  yes  yes
  7 seek-error-rate             200   200     0   0           0x000000000000 old-age online  n/a  n/a
  9 power-on-hours               15    15     0   7.1 years   0xd9f200000000 old-age online  n/a  n/a
 10 spin-retry-count            100   100     0   0           0x000000000000 old-age online  n/a  n/a
 11 calibration-retry-count     100   100     0   0           0x000000000000 old-age online  n/a  n/a
 12 power-cycle-count           100   100     0   173         0xad0000000000 old-age online  n/a  n/a
192 power-off-retract-count     200   200     0   68          0x440000000000 old-age online  n/a  n/a
193 load-cycle-count            200   200     0   124         0x7c0000000000 old-age online  n/a  n/a
194 temperature-celsius-2       109    96     0   41.0 C      0x290000000000 old-age online  n/a  n/a
196 reallocated-event-count     197   197     0   3           0x030000000000 old-age online  n/a  n/a
197 current-pending-sector      200   200     0   0 sectors   0x000000000000 old-age online  n/a  n/a
198 offline-uncorrectable       200   200     0   0 sectors   0x000000000000 old-age offline n/a  n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a  n/a
200 multi-zone-error-rate       200   200     0   0           0x000000000000 old-age offline n/a  n/a
ID 5 looks worrying...
What is strange is
Code:
# /usr/sbin/skdump --status /dev/sda
GOOD
Maybe libatasmart is too optimistic? :P
_________________
..: Zucca :..

My gentoo installs:
init=/sbin/openrc-init
-systemd -logind -elogind seatd

Quote:
I am NaN! I am a man!

Anon-E-moose
Watchman

Joined: 23 May 2008
Posts: 6215
Location: Dallas area

Posted: Mon Mar 01, 2021 4:04 pm

All reallocated-sector-count tells you is that the drive has reallocated some number of sectors.

What would be a problem is if 197 (current-pending-sector) and possibly 198 (offline-uncorrectable) start showing something more than zero.
_________________
UM780, 6.12 zen kernel, gcc 13, openrc, wayland

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54813
Location: 56N 3W

Posted: Mon Mar 01, 2021 4:43 pm

Zucca,

Code:
197 current-pending-sector
being non zero is a bad thing.
It means that the drive has bad sectors that it would like to remap but it can't as they are unreadable.
That is, the drive cannot read its own writing.

A non zero
Code:
   5 reallocated-sector-count
is expected on a drive that's been in use for 7.1 years.
The drive is supposed to remap sectors that are difficult to read before reads fail. It's normal drive operation.

A sudden jump in the reallocated-sector-count should be an alarm signal.

Taken together with
Code:
196 reallocated-event-count
the drive has had three remapping events of one sector each.

That drive was OK at the time you asked it.
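
A rough Python sketch of that reading of attributes 5, 196 and 197. It assumes the table layout skdump printed in the first post; the function names and the wording of the verdicts are made up for illustration and are not part of libatasmart.

```python
import re

# One row of the skdump attribute table, e.g.
#   5 reallocated-sector-count    200   200   140   3 sectors   0x030000000000 prefail online  yes  yes
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<value>\d+)\s+(?P<worst>\d+)\s+"
    r"(?P<thres>\d+)\s+(?P<pretty>.+?)\s+(?P<raw>0x[0-9a-fA-F]{12})\s"
)

def parse_attributes(text):
    """Return {attribute id: leading integer of the Pretty column}."""
    counts = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        lead = re.match(r"\d+", m.group("pretty"))
        if lead:  # rows whose Pretty column is "n/a" are skipped
            counts[int(m.group("id"))] = int(lead.group())
    return counts

def triage(counts):
    """Apply the reasoning from this post to attributes 5, 196 and 197."""
    pending = counts.get(197, 0)
    realloc = counts.get(5, 0)
    events = counts.get(196, 0)
    if pending > 0:
        return f"{pending} pending sector(s): the drive cannot read them"
    if realloc > 0:
        return f"{realloc} sector(s) remapped in {events} event(s): watch for a sudden jump"
    return "no remapped or pending sectors reported"

sample = """\
  5 reallocated-sector-count    200   200   140   3 sectors   0x030000000000 prefail online  yes  yes
196 reallocated-event-count     197   197     0   3           0x030000000000 old-age online  n/a  n/a
197 current-pending-sector      200   200     0   0 sectors   0x000000000000 old-age online  n/a  n/a
"""
print(triage(parse_attributes(sample)))
```

Run against the three rows from Zucca's dump, this reports three sectors remapped in three events and nothing pending.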
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3920
Location: Rasi, Finland

Posted: Mon Mar 01, 2021 6:06 pm

Thanks guys.
It's one of five drives in my btrfs-raid1 (essentially raid5) drive pool/stack.
I'll see if I can convert the set into one that can tolerate two drive failures.

I guess I'll ready my order for one spare... just in case. Besides, I need a few bigger drives if I'm going to have more redundancy.

But first I'll make sure my backup disks are working.

AJM
Apprentice

Joined: 25 Sep 2002
Posts: 195
Location: Aberdeen, Scotland

Posted: Mon Mar 01, 2021 6:14 pm

Maybe I'm paranoid / wasteful, but I've always gone by the rule that if a drive has even one bad sector pending it goes in the bin. My experience has always been that if one sector has gone bad, others will almost inevitably follow and usually before too long...

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54813
Location: 56N 3W

Posted: Mon Mar 01, 2021 6:29 pm

AJM,

Pending means the drive is dead. That's grounds for a warranty return.
The relocation mechanism working properly is normal operation.

Indeed, new drives have relocated sectors, but the counts are set to zero as part of final test.
That's why they always appear to be perfect.

Remapped sectors can be detected by a dip in the continuous read speed as the remapping causes extra seek time.

mike155
Advocate

Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Mon Mar 01, 2021 6:31 pm

Please don't forget that the meaning of the SMART values varies between vendors and even between product series of the same vendor. SMART data is NOT standardized.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54813
Location: 56N 3W

Posted: Mon Mar 01, 2021 6:48 pm

mike155,

Correct but it is normalised for display.

When any parameter's Value or Worst is equal to or lower than Thresh, that parameter has failed.
In the case of the reallocated-event-count, that logic is broken.

That is, SMART will report that a drive is OK, even when it has an unreadable sector that might prevent the system booting.
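
The Value/Worst/Thresh rule fits in a few lines of Python. This is only a sketch of the rule as stated in this post: the function name is invented, and treating a threshold of 0 as "no threshold" is an assumption, based on the rows with Thres 0 in the dumps in this thread showing n/a in the Good column.

```python
def attribute_state(value, worst, thresh):
    """Classify one SMART attribute from its normalised numbers."""
    if thresh == 0:
        # Assumption: a threshold of 0 means the attribute can never fail.
        return "no threshold"
    if value <= thresh:
        return "failing now"
    if worst <= thresh:
        return "failed in the past"
    return "ok"

# reallocated-sector-count from the first post: 200 200 140
print(attribute_state(200, 200, 140))   # ok
# current-pending-sector has Thres 0, so no raw count can ever trip it
print(attribute_state(200, 200, 0))     # no threshold
```

The second call is the awkward case: with a threshold of 0, the normalised check has nothing to compare against, so the raw count has to be read directly.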

Buffoon
Veteran

Joined: 17 Jun 2015
Posts: 1369
Location: EU or US

Posted: Tue Mar 02, 2021 12:04 am

Is this drive OK? Just logged into Wife's puter and that's what she has. I've got that weird feeling ...

Code:
Device: sat16:/dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 152627 MiB
Model: [FUJITSU MJA2160BH G2]
Serial: [K96PTA125BWU]
Firmware: [0084001C]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
Total Time To Complete Off-Line Data Collection: 508 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 72 min
Conveyance Self-Test Polling Time: 2 min
Bad Sectors: 589834 sectors
Powered On: 5.8 years
Power Cycles: 982
Average Powered On Per Power Cycle: 2.2 days
Temperature: 36.0 C
Attribute Parsing Verification: Bad
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         100   100    46   44021       0xf5ab00000000 prefail online  yes  yes
  2 throughput-performance      100   100    30   n/a         0x000077010000 prefail offline yes  yes
  3 spin-up-time                100   100    25   n/a         0x000000000000 prefail online  yes  yes
  4 start-stop-count             99    99     0   2177        0x810800000000 old-age online  n/a  n/a
  5 reallocated-sector-count    100   100    24   589834 sectors 0x0a000900c707 prefail online  yes  yes
  7 seek-error-rate             100    54    47   988         0xdc0300000000 prefail online  yes  yes
  8 seek-time-performance       100   100    19   n/a         0x000000000000 prefail offline yes  yes
  9 power-on-hours                1     1     0   5.8 years   0xa0c600000000 old-age online  n/a  n/a
 10 spin-retry-count            100   100    20   0           0x000000000000 prefail online  yes  yes
 12 power-cycle-count           100   100     0   982         0xd60300000000 old-age online  n/a  n/a
192 power-off-retract-count      99    99     0   379         0x7b0100000000 old-age online  n/a  n/a
193 load-cycle-count             74    74     0   532917      0xb52108000000 old-age online  n/a  n/a
194 temperature-celsius-2       100    15     0   36.0 C      0x240012004d00 old-age online  n/a  n/a
195 hardware-ecc-recovered      100   100     0   220         0xdc0000000000 old-age online  n/a  n/a
196 reallocated-event-count     100   100     0   26798194698 0x0a004c3d0600 old-age online  n/a  n/a
197 current-pending-sector      100    95     0   0 sectors   0x000000000000 old-age online  n/a  n/a
198 offline-uncorrectable        96    96     0   9 sectors   0x090000000000 old-age offline n/a  n/a
199 udma-crc-error-count        200   253     0   1           0x010000000000 old-age online  n/a  n/a
200 multi-zone-error-rate       100   100    60   19071       0x7f4a00000000 prefail online  yes  yes
203 run-out-cancel              100    99     0   n/a         0x9507e1f56401 old-age online  n/a  n/a
240 head-flying-hours           200   200     0   n/a         0x000000000000 old-age online  n/a n/a

_________________
Life is a tragedy for those who feel and a comedy for those who think.

figueroa
Advocate

Joined: 14 Aug 2005
Posts: 3007
Location: Edge of marsh USA

Posted: Tue Mar 02, 2021 3:44 am

Buffoon wrote:
Is this drive OK? Just logged into Wife's puter and that's what she has. I've got that weird feeling ...

As a minimum, I'd be watching this one closely, double checking state of backups (beware of perfect backups of corrupted data), and laying in a spare to have on-hand.
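
One thing that may explain the odd numbers in that dump: the Raw column is six bytes printed in storage order, and the Pretty figures match those bytes read as a little-endian integer (the first four bytes for attribute 5, all six for attribute 196). A small Python sketch to check that against the dump; the function names are invented, and splitting the raw value into three 16-bit fields is only a guess at how this Fujitsu firmware packs attribute 5, which the "Attribute Parsing Verification: Bad" line hints at.

```python
def raw_bytes(raw):
    """'0x0a000900c707' -> the six raw bytes in the order skdump prints them."""
    return bytes.fromhex(raw[2:])

def as_int(raw, nbytes=6):
    """Read the first nbytes of the raw value as one little-endian integer."""
    return int.from_bytes(raw_bytes(raw)[:nbytes], "little")

def as_words(raw):
    """Read the raw value as three little-endian 16-bit fields (a guess)."""
    b = raw_bytes(raw)
    return [int.from_bytes(b[i:i + 2], "little") for i in (0, 2, 4)]

print(as_int("0xa0c600000000"))      # 50848 hours, about 5.8 years
print(as_int("0x0a000900c707", 4))   # 589834, the "Bad Sectors" figure
print(as_int("0x0a004c3d0600"))      # 26798194698, the reallocated-event-count figure
print(as_words("0x0a000900c707"))    # [10, 9, 1991]
```

If that guess is right, the real reallocated count may be far smaller than 589834. The 9 offline-uncorrectable sectors are a plain count either way.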
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi

figueroa
Advocate

Joined: 14 Aug 2005
Posts: 3007
Location: Edge of marsh USA

Posted: Tue Mar 02, 2021 3:54 am

The drive here is the backup drive of a remote server, mainly serving NFS so that other machines on the network can hold their backups on it. No reallocated sectors, but one pending and uncorrectable. Given its 10.1 years of power-on time, I'm planning to replace it.
Code:
$ sudo /usr/sbin/skdump /dev/sdb
Password:
Device: sat16:/dev/sdb
Type: 16 Byte SCSI ATA SAT Passthru
Size: 476940 MiB
Model: [WDC WD5000AAKS-00UU3A0]
Serial: [WD-WCAYU0555963]
Firmware: [01.03B01]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 7980 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 95 min
Conveyance Self-Test Polling Time: 5 min
Bad Sectors: 1 sectors
Powered On: 10.1 years
Power Cycles: 170
Average Powered On Per Power Cycle: 21.6 days
Temperature: 45.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         200   200    51   522         0x0a0200000000 prefail online  yes  yes
  3 spin-up-time                156   138    21   3.2 s       0x560c00000000 prefail online  yes  yes
  4 start-stop-count            100   100     0   172         0xac0000000000 old-age online  n/a  n/a
  5 reallocated-sector-count    200   200   140   0 sectors   0x000000000000 prefail online  yes  yes
  7 seek-error-rate             200   200     0   0           0x000000000000 old-age online  n/a  n/a
  9 power-on-hours                1     1     0   10.1 years  0x4b5801000000 old-age online  n/a  n/a
 10 spin-retry-count            100   100     0   0           0x000000000000 old-age online  n/a  n/a
 11 calibration-retry-count     100   100     0   0           0x000000000000 old-age online  n/a  n/a
 12 power-cycle-count           100   100     0   170         0xaa0000000000 old-age online  n/a  n/a
192 power-off-retract-count     200   200     0   141         0x8d0000000000 old-age online  n/a  n/a
193 load-cycle-count            200   200     0   30          0x1e0000000000 old-age online  n/a  n/a
194 temperature-celsius-2        98    77     0   45.0 C      0x2d0000000000 old-age online  n/a  n/a
196 reallocated-event-count     200   200     0   0           0x000000000000 old-age online  n/a  n/a
197 current-pending-sector      200   200     0   1 sectors   0x010000000000 old-age online  n/a  n/a
198 offline-uncorrectable       200   200     0   1 sectors   0x010000000000 old-age offline n/a  n/a
199 udma-crc-error-count        200   200     0   14          0x0e0000000000 old-age online  n/a  n/a
200 multi-zone-error-rate       200   200     0   10          0x0a0000000000 old-age offline n/a  n/a


lord_khelben
n00b

Joined: 22 Jan 2017
Posts: 7

Posted: Tue Mar 02, 2021 10:59 am

NeddySeagoon wrote:
Zucca,

Code:
197 current-pending-sector
being non zero is a bad thing.
It means that the drive has bad sectors that it would like to remap but it can't as they are unreadable.
That is, the drive cannot read its own writing.


NeddySeagoon wrote:
AJM,

Pending means the drive is dead. That's grounds for a warranty return.


Is Current Pending Sector that bad?

I have seen quite a few cases of Current Pending Sector being non-zero because of a flaky cable or power loss during writing. For example the power loss occurred after writing the sector but before updating the internal checksum of the drive (or vice versa). This resulted in wrong checksum and the sector being reported as "not being able to be read" while there was no real problem in the disk surface. After using dd to write zeroes to that particular sector, CPS was lowered to zero again and everything worked correctly.
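
For reference, overwriting a single sector with dd looks roughly like the sketch below. It is written as a small Python helper that only builds the command line and never runs it; the helper name, the device path and the LBA are placeholders, and a 512-byte logical sector size is assumed. Writing to the wrong device or sector destroys data, so double-check both before running anything like this.

```python
import shlex

def zero_sector_command(device, lba, sector_size=512):
    """Build (but do not run) a dd command that overwrites one sector with zeroes."""
    if lba < 0:
        raise ValueError("LBA must be non-negative")
    return [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={sector_size}",   # one block = one logical sector (assumed 512 bytes)
        "count=1",             # write exactly one block
        f"seek={lba}",         # skip this many blocks on the output device
        "oflag=direct",        # bypass the page cache so the write reaches the drive
    ]

# Hypothetical device and sector number, for illustration only.
print(shlex.join(zero_sector_command("/dev/sdX", 123456)))
```

On drives with 4096-byte physical sectors, the whole physical sector probably has to be rewritten in one go, so the block size and seek value would need adjusting.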

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54813
Location: 56N 3W

Posted: Tue Mar 02, 2021 5:12 pm

lord_khelben,

The SMART data only tells you about what is happening inside the drive.

Only
Code:
199 udma-crc-error-count
can be a data cable problem.

When the 197 current-pending-sector count includes LBA 0, the partition table is unreadable and the kernel cannot find any filesystems on the drive.

The drive only finds out about unreadable sectors when it actually goes to read them.
That means that they always have your data in them.
When a sector fails on write, the write is remapped immediately, while that data is still in the write cache.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54813
Location: 56N 3W

Posted: Tue Mar 02, 2021 5:20 pm

figueroa

Code:
197 current-pending-sector      200   200     0   1 sectors


Some of your data is already gone. ddrescue may coax one more read.
What you have lost depends on what the sector holds.

A data block for a file. That's the best case.
A directory block. That's pretty bad. Your data is still there, but it can't be found by traversing the directory structure.
Maybe it's a directory block that contains other directories. You just lost access to a lot more data.
Is it a file system meta data block?
That's worse again.

It isn't going to get better.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9886
Location: almost Mile High in the USA

Posted: Tue Mar 02, 2021 5:45 pm

Is this drive still good?
Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

Code:
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 30
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       28864
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

Because I've been using this drive for several months like this now ;-) (Hint: I'm not using this drive for anything valuable!!!)
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?


Last edited by eccerr0r on Tue Mar 02, 2021 5:52 pm; edited 1 time in total

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54813
Location: 56N 3W

Posted: Tue Mar 02, 2021 5:51 pm

eccerr0r,

Code:
 5 Reallocated_Sector_Ct   0x0033   001   001   005

That's Value, Worst and Thresh. The parameter has failed, as both Value and Worst are <= Thresh.

197 Current_Pending_Sector is still zero, so you don't have any lost data yet.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9886
Location: almost Mile High in the USA

Posted: Tue Mar 02, 2021 5:55 pm

Actually the pending sector list was >0 for a while but I was able to coax the drive to report a value of 0 and pass its selftest finally. Prior to the coaxing, it was reporting bad sectors and had problems reading sectors left and right.

But yeah already knew this drive has pretty much expired. Even BIOS with SMART support will hang on boot requiring manual interaction to allow boot with this drive attached - had to disable it so I can still squeeze the last few moments of life out of this disk...

... for the past few months...

----

Well that hard drive died, or is choking on bad sectors now. Wasn't unexpected of course, so no big deal.

Now this is the kicker for another hard drive. Actually, though it's hard, it's quite solid...
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       935604h+53m+50.010s

Surely this drive hasn't really been turned on for this many hours, otherwise it'd give the Centennial Bulb a run for its money... unfortunately this corrupted value probably kills the resale value of this drive despite still having 100% on its media wearout indicator (MLC)...