Is my drive dying?

m27315

Hi,

Whenever I run rsnapshot at night, the hard-drive begins to act up. rsnapshot and other cron jobs report errors like so:

NeddySeagoon · Posted: Sun Jan 11, 2009 11:46 am Post subject:

m27315,

RAID arrays are easy to set up, and swapping out disks is easy too. I recommend you set up the array in degraded mode, then add the last drive, so you get to try out the process. raid0 does not have any redundancy, so the loss of one drive means total data loss. That's why you have backups.

I can't tell from your post whether you have a drive problem; it could also be a motherboard or data cable problem.
You should find more information about your drive in dmesg and in smartmontools after the failure events.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

energyman76b · Advocate Joined: 26 Mar 2003 Posts: 2048 Location: Germany

I would replace the cabling - and if that doesn't help, replace the harddisk.

And I wouldn't touch Raid0 with a ten-foot pole. Raid0 is only for people who don't care about their data. Harddisks do fail. The more harddisks, the more likely that one fails. One fails, everything is gone.
If you care about data: Raid1. It is 'striping' read accesses across the mirrors, so reading is sped up. Or Raid 5, which is the best combination of speed and data redundancy.
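
To put rough numbers on the "more harddisks, more likely that one fails" point, here is a minimal Python sketch. It assumes every disk fails independently with the same probability p over some period; the 5% figure is purely illustrative and not a measured failure rate, and the function names are made up for this example. It ignores rebuild windows and correlated failures, so treat it as arithmetic rather than a reliability model.

```python
def raid0_loss_probability(p: float, n: int) -> float:
    """Chance a striped (RAID0) array of n disks loses data.

    Any single disk failing is enough, so this is 1 minus the
    chance that all n disks survive.
    """
    return 1.0 - (1.0 - p) ** n


def raid1_loss_probability(p: float, n: int) -> float:
    """Chance an n-way mirror (RAID1) loses data: every disk must fail."""
    return p ** n


if __name__ == "__main__":
    p = 0.05  # hypothetical per-disk failure probability, for illustration only
    for n in (1, 2, 4):
        print(n,
              round(raid0_loss_probability(p, n), 4),
              round(raid1_loss_probability(p, n), 6))
```

With p = 0.05, two striped disks come out at roughly 9.75% against 0.25% for a two-disk mirror, under the independence assumption above.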
_________________
Study finds stunning lack of racial, gender, and economic diversity among middle-class white males

I identify as a dirty penismensch.

m27315 · Posted: Sun Jan 11, 2009 8:07 pm Post subject: thanks

Thanks for the help guys! (BTW, sorry to hijack the thread. I intended to start a new thread, but clearly I posted a reply instead of starting a new topic. Maybe a moderator can split the appropriate posts off, if that's best?)

Yes, I will definitely try RAID1. I want robustness nowadays. I am too old to stress over the other. :-)

I will try swapping the cables next time this happens. Incidentally, I have disabled rsnapshot in cron, and this still happens within 24 hours with all my cron jobs disabled. rsnapshot simply triggers the problem immediately, but whatever the root problem is, it causes the system to go bananas in less than 24 hours.

If I try to list the contents of the root directory, I get:

energyman76b · Advocate Joined: 26 Mar 2003 Posts: 2048 Location: Germany

5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042

That is pretty high. The error count too. Just try a different cable ASAP (and don't even try fsck with a possibly broken cable, it can make everything much worse). If the harddisk still misbehaves, I would put money on a bad disk.

energyman76b · Advocate Joined: 26 Mar 2003 Posts: 2048 Location: Germany

example of a good harddisk:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0007   085   085   011    Pre-fail  Always   -           5440
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           466
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always   -           0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline  -           9776
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always   -           3347
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           459
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always   -           0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   078   067   000    Old_age   Always   -           22 (Lifetime Min/Max 14/22)
194 Temperature_Celsius     0x0022   078   061   000    Old_age   Always   -           22 (Lifetime Min/Max 14/24)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -           223953
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always   -           0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always   -           0
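
For comparing rows like these against the bad drive's Reallocated_Sector_Ct line, a small Python sketch along the following lines might help. It assumes the usual smartctl attribute column order (ID, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw value). The `WATCHED` set and the "any non-zero raw count is worth a look" rule are my own rough heuristic, not something smartmontools defines, and the function names are hypothetical.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int      # normalized VALUE column
    worst: int
    thresh: int
    attr_type: str  # "Pre-fail" or "Old_age"
    raw: str        # RAW_VALUE column, may carry trailing text


def parse_attr(line: str) -> SmartAttr:
    """Split one attribute row into its columns (assumed smartctl layout)."""
    f = line.split()
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     f[6], " ".join(f[9:]))


def raw_count(attr: SmartAttr) -> int:
    """Leading integer of the raw value, e.g. 22 from '22 (Lifetime ...)'."""
    return int(attr.raw.split()[0])


# Assumption: attributes whose raw value is a plain count of bad sectors.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable"}


def find_warnings(lines: List[str]) -> List[str]:
    """Flag rows at or below threshold, or watched counters above zero."""
    out = []
    for line in lines:
        a = parse_attr(line)
        if a.value <= a.thresh:
            out.append(f"{a.name}: normalized value {a.value} "
                       f"at or below threshold {a.thresh}")
        elif a.name in WATCHED and raw_count(a) > 0:
            out.append(f"{a.name}: raw count {raw_count(a)}")
    return out


if __name__ == "__main__":
    bad = "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042"
    good = "5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0"
    print(find_warnings([bad, good]))
```

Note that the bad drive's row still shows a normalized value of 100 against a threshold of 036, so a threshold-only check would pass it; that is why the sketch also looks at the raw count.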

m27315 · Posted: Sun Jan 11, 2009 9:16 pm Post subject: died again

Well, it just died again - second time today! Things are going downhill quick, it seems. I swapped out the cable this time. We'll see what happens.

You know, I have always had questions about the SMART data. I always thought the "VALUE" column was the real, meaningful value that should be examined, whereas the "RAW_VALUE" column referred to the actual contents of the register, which could be inverted, bit-shifted, offset, etc, and which was basically useless without knowledge of how to interpret the data. Maybe now is a good time for me to research that ... :roll:

For example, in this row:

energyman76b · Advocate Joined: 26 Mar 2003 Posts: 2048 Location: Germany

http://www.t13.org/Documents/UploadedDocuments/docs2005/e05148r0-ACS-SMARTAttributesAnnex.pdf

http://smartmontools.sourceforge.net/faq.html

NeddySeagoon · Posted: Sun Jan 11, 2009 9:30 pm Post subject:

m27315,

The drive is dying:-

Sysa · Apprentice Joined: 16 Mar 2005 Posts: 161 Location: Europe

Let me correct you a little:

m27315 · Posted: Mon Jan 12, 2009 2:34 pm Post subject: you were right

The HDD died last night!

Thanks for the help and explanations! Now I will better understand the warning signs next time.

tnt · Veteran Joined: 27 Feb 2004 Posts: 1227