dobbs Tux's lil' helper
Joined: 20 Aug 2005 Posts: 105 Location: Wenatchee, WA
Posted: Fri Mar 30, 2012 9:41 pm Post subject: random read errors with mdraid??? [SOLVED] |
Alright, I need a second opinion from a kernel guru.
I dd'ed my windows partition (115GiB) into a file, and then this happened:
Code: | dobbs@bender ~ $ sudo !-1
sudo cmp -l /dev/sda4 /mnt/storage/tempstore/windows.part
51796599 40 0
16039693943 274 234
29991661943 201 241
66805234167 164 124
69818277623 115 155
73482455671 202 242
94468409719 377 337
95529264119 260 220
96286320375 17 57
103653245047 6 46
103809902583 173 133
105303325815 40 0
106056683383 163 123
107211539063 112 152
109386836727 215 255
109386836855 104 144
117111876599 210 250
120390354551 312 352
121743028727 365 325
dobbs@bender ~ $ sudo cmp -l /dev/sda4 /mnt/storage/tempstore/windows.part
Password:
390982263 144 104
9640181367 54 14
9640181623 262 222
29991661943 201 241
31463156343 256 216
37555086327 346 306
56837503223 51 11
69818277623 115 155
73482455671 202 242
80509345527 175 135
80509345655 162 122
94666073719 343 303
101261748087 151 111
103393197431 344 304
103454269047 251 211
103454269175 56 16
103653245047 6 46
105992555639 150 110
107211539063 112 152
109386836727 215 255
109386836855 104 144
109549263351 56 16
109549263479 363 323
110002149239 52 12
114666473079 167 127
114671000439 171 131
117111876599 210 250
117340243959 376 336
117340244215 276 236
120390354551 312 352
dobbs@bender ~ $ |
What's worrisome is that some of the errors repeat (write errors?), but some don't (read errors?). Worse, none of the operations printed any error messages to dmesg, /var/log/messages, or stderr. The dd operation reported success and no errors. The cmp operations did not report any read errors. Shouldn't the block layer find checksum mismatches in this case?
The destination, /mnt/storage/, is a reiser3 partition on a RAID 5 md array. smartctl doesn't show any hardware errors on the underlying devices, and /proc/mdstat shows a good array. fsck says the filesystem is fine, though single byte errors wouldn't make much sense in that case. Kernel is gentoo-sources-3.2.1-r2.
I'm not worried about /dev/sda -- it has yet to exhibit any other symptoms, it's a young drive, I'm not writing anything to it, and it's the less complex of the two setups. That leaves the RAID-5 array. It's an old array I set up years ago, but the underlying devices don't report any errors.
So is mdraid just not reliable? It looks like I'm getting both read AND write errors from that layer. The lack of error detection seems absurd. And I just realized every one of those errors is a difference of octal 40...
Did I just hit a bug in the raid456 module or what?!
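A quick way to confirm the single-flipped-bit pattern (a sketch of mine, not from the original post; the value pairs are copied from the cmp output above -- cmp -l prints the differing byte values in octal):

```shell
# XOR each pair of octal byte values from cmp -l to see which bits differ
for pair in "40 0" "274 234" "201 241" "115 155" "377 337"; do
  set -- $pair
  xor=$(( 8#$1 ^ 8#$2 ))          # 8#N interprets N as octal in bash
  echo "$1 vs $2 -> xor = $xor"
done
# every pair differs only in the 0x20 bit (octal 40, decimal 32)
```

Running the same XOR over every pair in the output above gives 32 each time, i.e. a single flipped bit in every mismatch.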
Last edited by dobbs on Fri Apr 06, 2012 6:43 am; edited 1 time in total |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Sat Mar 31, 2012 10:35 am Post subject: |
dobbs,
You can't usefully dd anything from a mounted filesystem because you will have open files. If that's what you did, throw away the image and start again.
With read errors on a single drive in a raid5 array, you won't notice. Any n-1 from n drives works.
If you suspect the raid array do Code: | echo "check" > /sys/block/mdX/md/sync_action | where X is the md node you want to check.
A real totally failed read error will put Code: | [231200.568383] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[231200.568389] ata6.00: irq_stat 0x40000001
[231200.568402] ata6.00: cmd 25/00:08:e8:99:04/00:00:c0:00:00/e0 tag 0 dma 4096 in
[231200.568405] res 51/40:08:e8:99:04/00:00:c0:00:00/e0 Emask 0x9 (media error)
[231200.575646] ata6.00: configured for UDMA/133
[231200.575666] ata6: EH complete | or something like it in dmesg as the kernel resets the interface. If the drive has several goes at the read, you may get something like Code: | SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 140 140 051 Pre-fail Always - 18654
3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 1166
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 104
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 6409
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 103
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 44
193 Load_Cycle_Count 0x0032 102 102 000 Old_age Always - 295050
194 Temperature_Celsius 0x0022 126 110 000 Old_age Always - 24
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 263
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 63
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 188 166 000 Old_age Offline - 3355
| The meaning of the RAW numbers varies from vendor to vendor. Check yours. The important numbers here are Code: | 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 263 | so the drive has not reallocated any sectors yet, but it's considering reallocating 263. The above dmesg and smartctl -a output are real, from a dead drive I'm about to get RMAed. I'm having ddrescue work hard on it first.
Write errors would cause an immediate Reallocated_Event, unless the drive had no spare sectors left, in which case you would get an I/O error. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
dobbs Tux's lil' helper
Joined: 20 Aug 2005 Posts: 105 Location: Wenatchee, WA
Posted: Sat Mar 31, 2012 5:56 pm Post subject: |
NeddySeagoon wrote: | You can't usefully dd anything from a mounted filesystem because you will have open files. If that's what you did, throw away the image and start again. |
Right. I should have explicitly stated that both /dev/sda4 and the windows.part file were never mounted during this debacle. Sorry about that. It's what I meant when I said I wasn't writing anything to /dev/sda (which is blatantly false anyway; I'm just not writing to /dev/sda4).
NeddySeagoon wrote: | With read errors on a single drive in a raid5 array, you won't notice. Any n-1 from n drives works. |
Which is why I'm confused and frightened. The system obviously isn't detecting any "errors"; bit 6 (octal 40) just happens to get flipped occasionally. Given that the partition and the file should be inert, this is an impossible[1] situation. Specifically, this is a situation I hoped to avoid by constructing the RAID 5 array, and now it looks (to me) like the raid layer is introducing these errors.
1. This event exceeds my improbability threshold.
As for smartctl, one drive has one reallocated sector, but it's had it for over a year (I've been keeping an eye on that for a while). Zero pending reallocations across all drives. I don't believe the underlying drives are the source of corruption.
What brand was your drive there?
Addendum: The array check found zero mismatches. I will re-copy the partition yet again to reproduce the problem. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Sat Mar 31, 2012 8:37 pm Post subject: |
dobbs,
My drive is a WD20EARS. That's a green 2TB drive. I have five in raid5 and two have died over the last few weeks.
The first one was obvious - mega iowaits. When I replaced that, the resync failed because another drive (the one I showed above) has 6 bad blocks.
Bit flipping sounds like dud RAM. Data read from the HDD into its RAM is CRC protected. Across the raid set, it's 'parity protected'.
If your resync did not produce any errors, your data is self-consistent in the raid set. That does not mean it's correct, just that all the members of the raid agree on what it is. Those two things taken together rule out any bit flipping on the drives.
If your drives are SATA, the data interface is serial; that only one bit gets flipped during data transmission over a serial link is well beyond my incredibility threshold. That only leaves the motherboard and its component parts.
Time to boot into memtest86+ and run a few cycles.
Errors found in memtest86 do not always point to RAM. It's only likely to be RAM if you get the same error at the same address every time. |
dobbs Tux's lil' helper
Joined: 20 Aug 2005 Posts: 105 Location: Wenatchee, WA
Posted: Sat Mar 31, 2012 9:55 pm Post subject: |
Sorry to hear about losing the drives, Neddy. I've been wary of drive reliability since we passed the 500GB mark. I think that's when "perpendicular recording" became common. Possibly just me being paranoid, though. I do need to replace these three drives for various reasons: they're only 320GB, two of them are PATA, more than 45,000 hours operating... Like I said, this array is old. :) Unfortunately, I don't know what to purchase anymore.
Quote: | Bit flipping sounds like dud RAM. Data read from the HDD into its RAM is CRC protected. Across the raid set, it's 'parity protected'.
If your resync did not produce any errors, your data is self-consistent in the raid set. That does not mean it's correct, just that all the members of the raid agree on what it is. Those two things taken together rule out any bit flipping. |
Yeah, that's why I was considering an mdraid software bug. I was grasping at straws. A possible RAM issue didn't occur to me... I would expect other system stability issues. I'm guessing the faulty region of RAM lies outside the kernel memory, and the data buffer runs into it due to the heavy load. Does that make sense, or am I way off?
I did eliminate mdraid as the culprit, though. Freed up another drive and duplicated the partition:
Code: | dobbs@bender ~ $ sudo fdisk -l /dev/sd[ab] | grep -E "sda4|sdb1"
/dev/sda4 * 238774095 477173440 119199673 7 HPFS/NTFS/exFAT
/dev/sdb1 2048 238401393 119199673 7 HPFS/NTFS/exFAT
dobbs@bender ~ $ sudo dd if=/dev/sda4 of=/dev/sdb1 bs=32M
3637+1 records in
3637+1 records out
122060465152 bytes (122 GB) copied, 2242.62 s, 54.4 MB/s
dobbs@bender ~ $ sudo cmp -l /dev/sda4 /dev/sdb1
Password:
253594999 377 337
302277623 47 7
388563063 40 0
457392375 252 212
617962103 165 125
710643831 156 116
781120759 253 213
823862263 243 203
866853367 154 114
1238579191 141 101
1238581623 40 0
1312984567 242 202
1313322999 270 230
1482857335 170 130
1977688311 40 0
2081347575 376 336
2120394615 40 0
2161162231 43 3
2212050039 173 133
2263106423 42 2
2501622135 277 237
2534076919 355 315
2565879927 375 335
2747989111 40 0
2837622903 41 1
3005169271 40 0
3063135095 370 330
3083515127 163 123
...and lots more
|
sda and sdb are both SATA; my RAID 5 array spans sd[def]. Same issue, same bit, getting worse... I don't know the significance, but the byte offset mod 128 is always 119.
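That mod-128 observation is easy to check mechanically. A sketch (the sample offsets are the first three from the cmp output above; piping the full cmp -l output through the same awk one-liner would show whether the pattern holds everywhere):

```shell
# collect the distinct values of (byte offset mod 128) from cmp -l offsets
mods=$(printf '253594999\n302277623\n388563063\n' \
  | awk '{ print $1 % 128 }' | sort -u)
echo "$mods"
# a single value of 119 means every sampled offset lands on the same
# position within each 128-byte stride
```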
I'm trying to read one of these errors with hdparm, but neither the left nor the right byte value reported by cmp appeared at the indicated byte offset. It's possible my math is wrong, but I've checked it three times now.
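One possible source of off-by-one trouble here: GNU cmp -l numbers bytes starting at 1, not 0. A sketch of the offset-to-LBA arithmetic (the variable names are mine; assumes 512-byte logical sectors and the sda4 start sector from the fdisk output above):

```shell
# translate a cmp -l byte number inside /dev/sda4 into an absolute LBA,
# e.g. for hdparm --read-sector
part_start=238774095              # start sector of sda4 (fdisk output above)
cmp_byte=253594999                # byte number reported by cmp -l (1-based!)
offset=$(( cmp_byte - 1 ))        # convert to a 0-based byte offset
lba=$(( part_start + offset / 512 ))
byte_in_sector=$(( offset % 512 ))
echo "LBA $lba, byte $byte_in_sector"
```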
I will memtest the system while I leave town for the weekend. Thanks for the insight, Neddy!
Update: After 18 completed passes, memtest (memtest86+ 4.2) showed zero errors. I'm back to not knowing where the issue lies. Regardless, I do have more RAM on order. We'll see if replacing the RAM solves it. |
dobbs Tux's lil' helper
Joined: 20 Aug 2005 Posts: 105 Location: Wenatchee, WA
Posted: Fri Apr 06, 2012 6:43 am Post subject: |
Yep. Replacing the RAM resolved the issue. Marking solved. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Fri Apr 06, 2012 5:28 pm Post subject: |
dobbs,
I bet putting your old RAM back in would work too. That's called 'wiping the contacts'. It reduces the contact resistance between the plugged-in parts and is usually good for 12 to 18 months.
Oh, I lost 3 DVDs at most, as I have two one-block errors and a four-block error, all in the area where my DVD rips are stored.
The raid5 is back and WD replaced two nine-month-old drives under warranty. |
kimmie Guru
Joined: 08 Sep 2004 Posts: 531 Location: Australia
Posted: Sat Apr 07, 2012 1:40 pm Post subject: |
Neddy,
That load cycle count in your smartctl output looks a little high. Do you know about the nasty head-unloading behaviour of the WD20EARS under Linux, and how to cure it with WDIDLE.exe? I have some of these drives in RAID5 too... they needed to be spanked before they kept their heads in the right place.
Anyway if you can't find this utility and you need it drop me a PM. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Sat Apr 07, 2012 4:24 pm Post subject: |
kimmie,
I'm aware of the head-unloading every eight seconds issue now. I wasn't when I set up the raid.
I understand that WDIDLE.exe needs to be run under Windows, and Windows (or even getting those drives near a box with a GUI) is out of the question.
I'm using Code: | hdparm -S 252 /dev/... | which sets the idle timeout to an hour, but I don't think it's the same thing.
hdparm has an option to set the idle3 timeout, but it's not widely tested, so I have not used it. |
kimmie Guru
Joined: 08 Sep 2004 Posts: 531 Location: Australia
Posted: Sat Apr 07, 2012 8:45 pm Post subject: |
Just needs DOS... I had to make a FreeDOS boot floppy and boot that. I'm guessing you could convince FreeDOS to redirect console to serial if you cared enough. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
Posted: Sat Apr 07, 2012 8:50 pm Post subject: |
kimmie,
The drives are in a HP Microserver. There is no floppy interface and no PATA interface.
It's USB or (e)SATA.
Hmm - I wonder if I could remaster a SystemRescueCD image to put on a USB pen drive, so WDIDLE.exe (and FreeDOS) was one of its image tools.
I can at least test that the floppy boots on another box before I make the ISO. |
dobbs Tux's lil' helper
Joined: 20 Aug 2005 Posts: 105 Location: Wenatchee, WA
Posted: Thu Apr 12, 2012 10:08 pm Post subject: |
NeddySeagoon wrote: | I bet putting your old RAM back in would work too. That's called 'wiping the contacts'. It reduces the contact resistance between the plugged-in parts and is usually good for 12 to 18 months. |
I got around to trying that. While the problem isn't as severe, it's still there:
Code: | ubuntu@ubuntu:/mnt$ sudo cmp -l storage/tempstore/windows.part /dev/sdc4
55485640375 370 330
58497711927 120 160
93697501719 116 156
ubuntu@ubuntu:/mnt$ |
Still in the sixth bit, but the offsets mod 128 are now 55 and 23 instead of always 119. Offset mod 256 is 23 for all three, but the sample set is too small. Different kernel (LiveUSB in this case), memory capacity and physical arrangement, so I'm not going to explore that.
The obvious explanation for fewer discrepancies is that the system has twice the RAM, so the bad bit(s?) isn't used as frequently. Also, the whole RAM subsystem is operating slightly slower. The "bad" RAM can run at 5ns latency (CAS 4 at 800MHz), while the new RAM needs at least 5.5ns latency (CAS 6 at 1067MHz). My motherboard actually runs the RAM at 800MHz and CAS 6 (7.5ns) when both sets are installed, so they're not really operating at their peak. Might help, might not; that's all conjecture to me.
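For reference, the latency figures quoted above work out as follows (a quick sketch of mine, using the same cycles-over-rated-clock convention as in the paragraph; the helper name is made up):

```shell
# CAS latency in ns, computed as cycles / rated transfer clock (MHz) * 1000
cl_ns() { awk -v cl="$1" -v mhz="$2" 'BEGIN { printf "%.2f", cl * 1000 / mhz }'; }
echo "CL4 @ 800 MHz  = $(cl_ns 4 800) ns"
echo "CL6 @ 1067 MHz = $(cl_ns 6 1067) ns"
echo "CL6 @ 800 MHz  = $(cl_ns 6 800) ns"
```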
On a sadder note, the original boot disk died abruptly shortly after configuring the boot array. I don't know how or why it died; the SMART status was always clean while I investigated the RAM problem. Now the system won't POST with the drive connected (tried different SATA cables, ports, basic debug procedure). Unfortunately, I was absent when it happened. Coincidentally, it's a WD3200KS with a manufacture date of "01 APR 2006", and it died the night of 01 APR 2012. I kinda want to call Western Digital and ask them if it's just a prank... |