m27315 Apprentice
Joined: 10 Dec 2004 Posts: 253 Location: 2 workstations down
|
Posted: Sun Jan 11, 2009 2:31 am Post subject: buggy rsnapshot, configuration error, or dying HDD? |
|
|
Hi,
Whenever I run rsnapshot at night, the hard-drive begins to act up. rsnapshot and other cron jobs report errors like so:
Code: | rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)
rsync: write failed on "/var/www/.snapshots/daily.0/localhost/var/www/mywebsite.org/htdocs/audio/mymovie.mkv": Read-only file system (30)
rsync error: error in file IO (code 11) at receiver.c(298) [receiver=3.0.4]
rsync: connection unexpectedly closed (163226 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(632) [sender=3.0.4]
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
----------------------------------------------------------------------------
ERROR: /usr/bin/rsync returned 12 while processing /var/www/
Could not open logfile /var/log/rsnapshot for writing
Do you have write permission for this file?
/root/bin/backup_mysql.sh: line 4: /root/mydatabase.sql: Read-only file system
chmod: cannot access `/root/mydatabase.sql': No such file or directory
mv: cannot stat `/root/mydatabase.sql': No such file or directory
chown: changing ownership of `/home/user1/my.sql': Read-only file system
bzip2: Can't create output file /home/user1/my.sql.bz2: Read-only file system.
mv: cannot stat `/home/user1/my.sql.bz2': No such file or directory |
Afterward, I am able to SSH into the box and look around. But many commands fail to execute (like shutdown), and listing the contents of random directories reports an I/O error. Rebooting seems to be the only way to recover.
The machine seems to behave normally otherwise. I have also let the box run e2fsck at boot, which does find and correct errors.
I am guessing HDD pre-fail. Any other thoughts?
(BTW, for some reason SMART was not activated on the drive. I have activated it, and smartmontools is reporting PASS, although it does have a few errors logged.)
As a side thought, I have never used RAID0 on a Linux box before now. Is it worth it? How much of a pain is it to set up and to swap out disks? Would you recommend LVM, hardware RAID, or some other method?
Thanks!!! |
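The "Read-only file system" errors above are consistent with the kernel switching a filesystem to read-only after I/O errors, which ext3 commonly does. A minimal sketch for spotting that state, assuming text in the /proc/mounts format; the function name and the sample lines are hypothetical and not taken from this machine:

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if "ro" in options.split(","):
            hits.append((device, mount_point))
    return hits


# Hypothetical sample in /proc/mounts format (assumption, for illustration only).
SAMPLE = """\
/dev/sda3 / ext3 ro,errors=remount-ro 0 0
/dev/sda4 /home ext3 rw,data=ordered 0 0
proc /proc proc rw,nosuid 0 0
"""

for device, mount_point in read_only_mounts(SAMPLE):
    print(f"{device} on {mount_point} is mounted read-only")
```

On a live system the same function could be fed the contents of /proc/mounts. A filesystem that shows up here without having been mounted read-only on purpose is a hint to look at dmesg for the underlying I/O error.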
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54838 Location: 56N 3W
|
Posted: Sun Jan 11, 2009 11:46 am Post subject: |
|
|
m27315,
RAID arrays are easy to set up, and swapping out disks is easy too. I recommend you set up a RAID in degraded mode, then add the last drive so you get to try out the process. RAID0 does not have any redundancy, so losing one drive means losing all the data. That's why you have backups.
I can't tell from your post whether you have a drive problem; it could also be a motherboard or data cable problem.
You should find more information about your drive in dmesg and in smartmontools after the failure events. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
energyman76b Advocate
Joined: 26 Mar 2003 Posts: 2048 Location: Germany
|
Posted: Sun Jan 11, 2009 12:53 pm Post subject: |
|
|
I would replace the cabling, and if that doesn't help, replace the hard disk.
And I wouldn't touch RAID0 with a ten-foot pole. RAID0 is only for people who don't care about their data. Hard disks do fail. The more hard disks, the more likely that one fails. One fails, everything is gone.
If you care about data: RAID1. It spreads read accesses across the mirrored disks, so reading is sped up. Or RAID5, which in my view is the best combination of speed and data redundancy. _________________ Study finds stunning lack of racial, gender, and economic diversity among middle-class white males
I identify as a dirty penismensch. |
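The point about RAID0 ("the more hard disks, the more likely that one fails") can be put into numbers. This is a rough sketch, assuming disks fail independently and with the same probability over some period, and ignoring rebuilds; the 5% figure is a made-up example value, not a measured failure rate:

```python
def p_raid0_loss(p_disk, n_disks):
    """Striping: the array is lost as soon as any one member fails."""
    return 1.0 - (1.0 - p_disk) ** n_disks


def p_raid1_loss(p_disk, n_disks):
    """Mirroring: the array is lost only if every member fails."""
    return p_disk ** n_disks


P_DISK = 0.05  # hypothetical per-disk failure probability (assumption)

for n in (1, 2, 4):
    print(
        f"{n} disk(s): RAID0 loss {p_raid0_loss(P_DISK, n):.4f}, "
        f"RAID1 loss {p_raid1_loss(P_DISK, n):.6f}"
    )
```

Under these assumptions, two striped disks are almost twice as likely to lose the data as a single disk, while two mirrored disks are far less likely to. Real failures are often correlated (same batch, same power supply), so treat the mirror figure as optimistic.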
|
m27315 Apprentice
Joined: 10 Dec 2004 Posts: 253 Location: 2 workstations down
|
Posted: Sun Jan 11, 2009 8:07 pm Post subject: thanks |
|
|
Thanks for the help guys! (BTW, sorry to hijack the thread. I intended to start a new thread, but clearly I posted a reply instead of starting a new topic. Maybe a moderator can split the appropriate posts off, if that's best?)
Yes, I will definitely try RAID1. I want robustness nowadays. I am too old to stress over the other.
I will try swapping the cables next time this happens. Incidentally, I have disabled rsnapshot in cron, and this still happens within 24 hours with all my cron jobs disabled. rsnapshot simply triggers the problem immediately, but whatever the root cause is, it makes the system go bananas in less than 24 hours.
If I try to list the contents of the root directory, I get:
Code: | ls -latr /
ls: cannot access /lib32: Input/output error
ls: cannot access /mnt: Input/output error
total 68
d????????? ? ? ? ? ? mnt
d????????? ? ? ? ? ? lib32
drwx------ 2 root root 16384 2007-01-05 01:14 lost+found
drwxr-xr-x 13 root root 4096 2008-06-17 13:51 var
drwxr-xr-x 14 root root 4096 2009-01-03 19:19 usr
drwxr-xr-x 3 root root 4096 2009-01-04 00:21 opt
drwxr-xr-x 2 root root 4096 2009-01-04 01:27 boot
lrwxrwxrwx 1 root root 5 2009-01-04 17:31 lib -> lib64
drwxr-xr-x 9 root root 4096 2009-01-04 17:55 lib64
drwxr-xr-x 2 root root 4096 2009-01-05 20:45 bin
drwxr-xr-x 4 root root 4096 2009-01-08 11:41 home
dr-xr-xr-x 87 root root 0 2009-01-10 15:25 proc
drwxr-xr-x 12 root root 0 2009-01-10 15:25 sys
drwxr-xr-x 2 root root 4096 2009-01-10 20:14 sbin
drwxr-xr-x 13 root root 3480 2009-01-10 21:25 dev
drwxr-xr-x 45 root root 4096 2009-01-10 21:34 etc
drwxr-xr-x 19 root root 4096 2009-01-10 21:34 ..
drwxr-xr-x 19 root root 4096 2009-01-10 21:34 .
drwx------ 7 root root 4096 2009-01-10 22:13 root
drwxrwxrwt 4 root root 4096 2009-01-11 05:20 tmp |
See anything weird?
Here's the SMART data:
Code: | smartctl -a /dev/sda
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11
Device Model: ST3500320AS
Serial Number: 5QM01WWW
Firmware Version: SD04
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Jan 11 13:55:45 2009 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 25) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 642) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 106) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003b) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 112 088 006 Pre-fail Always - 218496749
3 Spin_Up_Time 0x0003 094 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 475
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 4356327589
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4134
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 1
12 Power_Cycle_Count 0x0032 100 037 020 Old_age Always - 486
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 1087
188 Unknown_Attribute 0x0032 098 098 000 Old_age Always - 4295032849
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 070 050 045 Old_age Always - 30 (Lifetime Min/Max 30/30)
194 Temperature_Celsius 0x0022 030 050 000 Old_age Always - 30 (0 16 0 0)
195 Hardware_ECC_Recovered 0x001a 047 024 000 Old_age Always - 218496749
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 5
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 5
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 1128 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1128 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0 Device Fault; Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
a1 00 00 00 00 00 a0 02 07:57:48.684 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 02 07:57:48.661 IDENTIFY DEVICE
00 00 00 00 00 00 00 06 07:57:48.501 NOP [Abort queued commands]
a1 00 00 00 00 00 a0 02 07:57:43.194 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 02 07:57:43.171 IDENTIFY DEVICE
Error 1127 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 a0 02 07:57:48.661 IDENTIFY DEVICE
00 00 00 00 00 00 00 06 07:57:48.501 NOP [Abort queued commands]
a1 00 00 00 00 00 a0 02 07:57:43.194 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 02 07:57:43.171 IDENTIFY DEVICE
00 00 00 00 00 00 00 06 07:57:43.014 NOP [Abort queued commands]
...
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 4126 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay. |
Here are the references to /dev/sda in dmesg:
Code: | [ 3.270520] scsi 0:0:0:0: Direct-Access ATA ST3500320AS SD04 PQ: 0 ANSI: 5
[ 3.271020] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
[ 3.271184] sd 0:0:0:0: [sda] Write Protect is off
[ 3.271334] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 3.271357] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 3.271698] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
[ 3.271861] sd 0:0:0:0: [sda] Write Protect is off
[ 3.272011] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 3.272032] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 3.272310] sda: sda1 sda2 sda3 sda4
[ 3.277548] sd 0:0:0:0: [sda] Attached SCSI disk
...
[ 8.749840] kjournald starting. Commit interval 5 seconds
[ 8.750002] EXT3-fs: sda3: orphan cleanup on readonly fs
[ 8.750155] ext3_orphan_cleanup: deleting unreferenced inode 833960
[ 8.750181] ext3_orphan_cleanup: deleting unreferenced inode 833959
[ 8.750188] ext3_orphan_cleanup: deleting unreferenced inode 833958
[ 8.750194] ext3_orphan_cleanup: deleting unreferenced inode 833957
[ 8.750200] ext3_orphan_cleanup: deleting unreferenced inode 833956
[ 8.750205] EXT3-fs: sda3: 5 orphan inodes deleted
[ 8.750355] EXT3-fs: recovery complete.
[ 8.755033] EXT3-fs: mounted filesystem with ordered data mode.
[ 8.755200] VFS: Mounted root (ext3 filesystem) readonly.
[ 8.755366] Freeing unused kernel memory: 480k freed
[ 8.755713] Write protecting the kernel read-only data: 6040k
[ 9.169288] khelper used greatest stack depth: 5216 bytes left
[ 9.695106] stty used greatest stack depth: 4592 bytes left
[ 10.095949] udevadm used greatest stack depth: 4520 bytes left
[ 10.291137] usb usb2: uevent
[ 10.291164] usb 2-0:1.0: uevent
[ 10.291270] usb usb1: uevent
[ 10.291289] usb 1-0:1.0: uevent
[ 11.219301] EXT3 FS on sda3, internal journal
[ 11.289742] sort used greatest stack depth: 4504 bytes left
[ 12.542691] kjournald starting. Commit interval 5 seconds
[ 12.545857] EXT3 FS on sda4, internal journal
[ 12.545861] EXT3-fs: mounted filesystem with ordered data mode.
[ 12.643508] Adding 4008208k swap on /dev/sda2. Priority:-1 extents:1 across:4008208k
[ 15.339397] ps used greatest stack depth: 4448 bytes left
... |
hdparm included for completeness:
Code: | hdparm /dev/sda
/dev/sda:
IO_support = 0 (default)
readonly = 0 (off)
readahead = 256 (on)
geometry = 60801/255/63, sectors = 976773168, start = 0 |
Any more thoughts? If I swap out the cables and the failure persists, would you guess HDD or MB? Unfortunately, I don't have a spare to try in an experiment. ... I would guess HDD, because I had used it as my main drive in another computer for a year or so, with no problems, and it's relatively new, but that doesn't mean much.
Do any of the above errors clearly suggest MB or HDD to you?
Thanks!!! |
|
energyman76b Advocate
Joined: 26 Mar 2003 Posts: 2048 Location: Germany
|
Posted: Sun Jan 11, 2009 8:31 pm Post subject: |
|
|
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042
That is pretty high, and so is the error count. Try a different cable ASAP (and don't even try fsck with a possibly broken cable; it can make everything much worse). If the hard disk still misbehaves after that, I would put money on a bad disk. |
|
energyman76b Advocate
Joined: 26 Mar 2003 Posts: 2048 Location: Germany
|
Posted: Sun Jan 11, 2009 8:44 pm Post subject: |
|
|
example of a good harddisk:
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 085 085 011 Pre-fail Always - 5440
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 466
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 9776
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 3347
10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 459
13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0
183 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
184 Unknown_Attribute 0x0033 100 100 099 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 078 067 000 Old_age Always - 22 (Lifetime Min/Max 14/22)
194 Temperature_Celsius 0x0022 078 061 000 Old_age Always - 22 (Lifetime Min/Max 14/24)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 223953
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0 |
|
m27315 Apprentice
Joined: 10 Dec 2004 Posts: 253 Location: 2 workstations down
|
Posted: Sun Jan 11, 2009 9:16 pm Post subject: died again |
|
|
Well, it just died again, for the second time today! Things are going downhill quickly, it seems. I swapped out the cable this time. We'll see what happens.
You know, I have always had questions about the SMART data. I always thought the "VALUE" column was the real, meaningful value that should be examined, whereas the "RAW_VALUE" column referred to the actual contents of the register, which could be inverted, bit-shifted, offset, etc, and which was basically useless without knowledge of how to interpret the data. Maybe now is a good time for me to research that ... For example, in this row:
Code: | SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042 |
Does this mean that I have 100 sectors available for reallocation (good), and that when I get down to 36 (bad) the drive has run out of spare sectors? Or does it mean that the drive has already reallocated 2042 sectors (very bad)? I tend toward the first one, but I need to research this... I have never really known how to interpret this data.
Thanks again for the help and the comparison data - that is very helpful. |
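On the VALUE versus RAW_VALUE question: smartctl treats an attribute as failed when its normalized VALUE is less than or equal to THRESH, while RAW_VALUE is vendor-specific. For Reallocated_Sector_Ct the raw number is commonly a plain count of remapped sectors, so both readings carry information. A minimal sketch that splits attribute rows like the ones posted above and applies the threshold rule; the class and function names are hypothetical:

```python
from typing import NamedTuple, Optional


class Attribute(NamedTuple):
    ident: int
    name: str
    value: int   # normalized, vendor-scaled; higher is usually better
    worst: int
    thresh: int
    kind: str    # "Pre-fail" or "Old_age"
    raw: str     # vendor-specific; kept as text because it may contain notes


def parse_attribute(line: str) -> Optional[Attribute]:
    """Parse one row of the smartctl attribute table, or return None."""
    parts = line.split(None, 9)  # the 10th field keeps any embedded spaces
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    return Attribute(int(parts[0]), parts[1], int(parts[3]), int(parts[4]),
                     int(parts[5]), parts[6], parts[9].strip())


def below_threshold(attr: Attribute) -> bool:
    """The rule smartctl applies: failed when VALUE <= THRESH."""
    return attr.value <= attr.thresh


# Rows copied from the smartctl output posted in this thread.
ROWS = [
    "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042",
    "187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 1087",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 5",
]

for row in ROWS:
    attr = parse_attribute(row)
    state = "FAILED" if below_threshold(attr) else "ok"
    print(f"{attr.name}: VALUE {attr.value} vs THRESH {attr.thresh} "
          f"-> {state}, RAW {attr.raw}")
```

None of these rows crosses its threshold, which is why the overall self-assessment still says PASSED even though the raw counters are far from zero. The normalized scale is set by the vendor, so a VALUE of 100 should not be read as "100 spare sectors left".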
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54838 Location: 56N 3W
|
Posted: Sun Jan 11, 2009 9:30 pm Post subject: |
|
|
m27315,
The drive is dying:-
Code: | 1 Raw_Read_Error_Rate 0x000f 112 088 006 Pre-fail Always - 218496749
195 Hardware_ECC_Recovered 0x001a 047 024 000 Old_age Always - 218496749 | It's generating a lot of errors and working hard to correct them. Those two numbers are both very high (bad) and identical (good), so it's winning at the moment.
Code: | Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 4356327589 | is a bad sign, and your
Code: | Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2042 | is high.
There have been no errors on the interface: Code: | 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 | so that's good.
The drive is working hard to provide correct data by using redundant data and retries. It will soon fail. |
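The reasoning above (no CRC errors on the interface, but plenty of reallocated and uncorrectable sectors) can be written down as a rule of thumb. This is only a sketch: treating any count above zero as suspicious is an assumption made for illustration, not a vendor-defined limit, and the function name is hypothetical.

```python
# Raw counters that point at the disk surface itself.
MEDIA_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)
# Raw counter that points at the link between disk and controller.
LINK_COUNTER = "UDMA_CRC_Error_Count"


def triage(raw_counts):
    """Return a sorted list of suspects ("cable", "disk") for a dict of raw counts.

    Assumption: any non-zero count is worth suspicion. Missing attributes
    are treated as zero.
    """
    suspects = set()
    if raw_counts.get(LINK_COUNTER, 0) > 0:
        suspects.add("cable")
    if any(raw_counts.get(name, 0) > 0 for name in MEDIA_COUNTERS):
        suspects.add("disk")
    return sorted(suspects)


# Raw values copied from the smartctl output posted earlier in the thread.
THREAD_DRIVE = {
    "Reallocated_Sector_Ct": 2042,
    "Reported_Uncorrect": 1087,
    "Current_Pending_Sector": 5,
    "Offline_Uncorrectable": 5,
    "UDMA_CRC_Error_Count": 0,
}

print(triage(THREAD_DRIVE))  # prints ['disk']
```

A clean CRC counter does not prove the cable or motherboard is fine, so swapping the cable is still a cheap test, but with these numbers the disk is the obvious first suspect.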
|
Sysa Apprentice
Joined: 16 Mar 2005 Posts: 161 Location: Europe
|
Posted: Mon Jan 12, 2009 1:22 pm Post subject: |
|
|
Let me correct you a little:
NeddySeagoon wrote: | m27315,
The drive is dying:-
Code: | 1 Raw_Read_Error_Rate 0x000f 112 088 006 Pre-fail Always - 218496749
195 Hardware_ECC_Recovered 0x001a 047 024 000 Old_age Always - 218496749 | its generating a lot of errors and working hard to correct them - those two numbers are both very high (bad) and identical (good), so its winning at the moment |
FYI: That's OK (as long as Raw_Read_Error_Rate == Hardware_ECC_Recovered); some vendors (e.g. Seagate) simply report these counts, while others hide this information.
The worst thing is
Code: | 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 5
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 5
|
It means the drive currently has 5 bad sectors that it could not read or correct and that are still waiting to be relocated...
Anyway, your conclusion (diagnosis) is correct:
Quote: | ... It will soon fail. |
_________________ RedHat -> SuSE -> Debian -> Gentoo |
|
m27315 Apprentice
Joined: 10 Dec 2004 Posts: 253 Location: 2 workstations down
|
Posted: Mon Jan 12, 2009 2:34 pm Post subject: you were right |
|
|
The HDD died last night!
Thanks for the help and explanations! Now I will better understand the warning signs next time. |
|