Gentoo Forums
random read errors with mdraid??? [SOLVED]
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 105
Location: Wenatchee, WA

Posted: Fri Mar 30, 2012 9:41 pm    Post subject: random read errors with mdraid??? [SOLVED]

Alright, I need a second opinion from a kernel guru.

I dd'ed my Windows partition (115GiB) into a file, and then this happened:
Code:
dobbs@bender ~ $ sudo !-1
sudo cmp -l /dev/sda4 /mnt/storage/tempstore/windows.part
    51796599  40   0
 16039693943 274 234
 29991661943 201 241
 66805234167 164 124
 69818277623 115 155
 73482455671 202 242
 94468409719 377 337
 95529264119 260 220
 96286320375  17  57
103653245047   6  46
103809902583 173 133
105303325815  40   0
106056683383 163 123
107211539063 112 152
109386836727 215 255
109386836855 104 144
117111876599 210 250
120390354551 312 352
121743028727 365 325
dobbs@bender ~ $ sudo cmp -l /dev/sda4 /mnt/storage/tempstore/windows.part
Password:
   390982263 144 104
  9640181367  54  14
  9640181623 262 222
 29991661943 201 241
 31463156343 256 216
 37555086327 346 306
 56837503223  51  11
 69818277623 115 155
 73482455671 202 242
 80509345527 175 135
 80509345655 162 122
 94666073719 343 303
101261748087 151 111
103393197431 344 304
103454269047 251 211
103454269175  56  16
103653245047   6  46
105992555639 150 110
107211539063 112 152
109386836727 215 255
109386836855 104 144
109549263351  56  16
109549263479 363 323
110002149239  52  12
114666473079 167 127
114671000439 171 131
117111876599 210 250
117340243959 376 336
117340244215 276 236
120390354551 312 352
dobbs@bender ~ $


What's worrisome is that some of the errors repeat (write errors?), but some don't (read errors?). Worse, none of the operations printed any error messages to dmesg, /var/log/messages or the terminal. The dd operation reported success and no errors. The cmp operations did not report any read errors. Shouldn't the block layer find checksum mismatches in this case?

The destination, /mnt/storage/, is a reiser3 partition on a RAID 5 md array. smartctl doesn't show any hardware errors on the underlying devices, and /proc/mdstat shows a good array. fsck says the filesystem is fine, though single byte errors wouldn't make much sense in that case. Kernel is gentoo-sources-3.2.1-r2.

I'm not worried about /dev/sda -- it has yet to exhibit any other symptoms, it's a young drive, I'm not writing anything to it, and it's the less complex of the two setups. That leaves the RAID-5 array. It's an old array I set up years ago, but the underlying devices don't report any errors.

So is mdraid just not reliable? It looks like I'm getting both read AND write errors from that layer. The lack of error detection seems absurd. And I just realized every one of those errors is a difference of 40 (octal)...

Did I just hit a bug in the raid456 module or what?!
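
The "difference of 40" can be checked mechanically. cmp -l prints the 1-based byte offset followed by the two differing byte values in octal, so XOR-ing the two values shows exactly which bits differ. A minimal sketch in Python, assuming the cmp output has been saved as text; the name flipped_bits is only illustrative:

```python
def flipped_bits(cmp_output):
    """Parse `cmp -l` output and return (offset, xor_mask) per differing byte.

    Each data line holds a 1-based decimal byte offset followed by the
    two differing byte values in octal.
    """
    results = []
    for line in cmp_output.splitlines():
        fields = line.split()
        if len(fields) != 3 or not all(f.isdigit() for f in fields):
            continue  # skip prompts, blank lines, and other noise
        offset = int(fields[0])
        left, right = int(fields[1], 8), int(fields[2], 8)
        results.append((offset, left ^ right))
    return results


sample = """\
    51796599  40   0
 16039693943 274 234
 29991661943 201 241
"""

for offset, mask in flipped_bits(sample):
    # Show the XOR mask in octal (as in the post) and in binary.
    print(f"{offset:>12}  mask={mask:o} (octal)  {mask:08b}")
```

For these three sample lines the mask is octal 40 (binary 00100000) every time: a single bit, flipped in either direction.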


Last edited by dobbs on Fri Apr 06, 2012 6:43 am; edited 1 time in total
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54815
Location: 56N 3W

Posted: Sat Mar 31, 2012 10:35 am

dobbs,

You can't usefully dd anything from a mounted filesystem because you will have open files. If that's what you did, throw away the image and start again.
With read errors on a single drive in a raid5 array, you won't notice. Any n-1 from n drives works.

If you suspect the raid array, do
Code:
echo "check" > /sys/block/mdX/md/sync_action
where X is the md node you want to check.
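
Once the check has finished, the result can be read back from the same sysfs directory. A minimal sketch in Python, assuming the standard md sysfs files sync_action and mismatch_cnt; the helper name md_check_status is just a placeholder:

```python
from pathlib import Path


def md_check_status(md_dir):
    """Return (sync_action, mismatch_cnt) read from an md sysfs directory.

    md_dir is something like /sys/block/mdX/md. sync_action reads "idle"
    once a check has finished; mismatch_cnt is the kernel's count of
    mismatched sectors found by the last check.
    """
    md_dir = Path(md_dir)
    action = (md_dir / "sync_action").read_text().strip()
    mismatches = int((md_dir / "mismatch_cnt").read_text().strip())
    return action, mismatches
```

Called with the md directory of the array in question, a result of ("idle", 0) would mean the check is done and the members agree with each other.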

A real totally failed read error will put
Code:
[231200.568383] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[231200.568389] ata6.00: irq_stat 0x40000001
[231200.568402] ata6.00: cmd 25/00:08:e8:99:04/00:00:c0:00:00/e0 tag 0 dma 4096 in
[231200.568405]          res 51/40:08:e8:99:04/00:00:c0:00:00/e0 Emask 0x9 (media error)
[231200.575646] ata6.00: configured for UDMA/133
[231200.575666] ata6: EH complete
or something like it in dmesg as the kernel resets the interface. If the drive has several goes at the read, you may get something like
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   140   140   051    Pre-fail  Always       -       18654
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1166
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       104
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6409
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       103
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   102   102   000    Old_age   Always       -       295050
194 Temperature_Celsius     0x0022   126   110   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       263
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       63
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   188   166   000    Old_age   Offline      -       3355
The meaning of the RAW numbers varies from vendor to vendor. Check yours. The important numbers here are
Code:
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       263
so the drive has not reallocated any sectors yet but it's considering reallocating 263. The above dmesg and smartctl -a output are real, from a dead drive I'm about to get RMAed. I'm having ddrescue work hard on it first.
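
To watch those counters over time rather than eyeballing the whole table, the attribute rows can be parsed. A minimal sketch in Python, assuming the table layout shown above (ten whitespace-separated columns with RAW_VALUE last); raw_values is an illustrative name:

```python
def raw_values(smart_table, wanted=(5, 196, 197, 198)):
    """Extract RAW_VALUE for selected attribute IDs from smartctl table text."""
    found = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in wanted:
            # Keep the raw value as text; its meaning is vendor specific.
            found[attr_id] = (fields[1], fields[9])
    return found


sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       263
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       63
"""

for attr_id, (name, raw) in sorted(raw_values(sample).items()):
    print(attr_id, name, raw)
```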

Write errors would cause an immediate Reallocated_Event, unless the drive had no spare sectors left, in which case you would get an I/O error.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 105
Location: Wenatchee, WA

Posted: Sat Mar 31, 2012 5:56 pm

NeddySeagoon wrote:
You can't usefully dd anything from a mounted filesystem because you will have open files. If that's what you did, throw away the image and start again.


Right. I should have explicitly stated that both /dev/sda4 and the windows.part file were never mounted during this debacle. Sorry about that. It's what I meant when I said I wasn't writing anything to /dev/sda (which is blatantly false anyway; I'm just not writing to /dev/sda4).

NeddySeagoon wrote:
With read errors on a single drive in a raid5 array, you won't notice. Any n-1 from n drives works.


Which is why I'm confused and frightened. The system obviously isn't detecting any "errors"; the sixth bit (octal 40) just happens to get flipped occasionally. Given that the partition and the file should be inert, this is an impossible[1] situation. Specifically, this is a situation I hoped to avoid by constructing the RAID 5 array, and now it looks (to me) like the raid layer is introducing these errors.

1. This event exceeds my improbability threshold.

As for smartctl, one drive has one reallocated sector, but it's had it for over a year (I've been keeping an eye on that for a while). Zero pending reallocations across all drives. I don't believe the underlying drives are the source of corruption.

What brand was your drive there?

Addendum: The array check found zero mismatches. I will re-copy the partition yet again to reproduce the problem.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54815
Location: 56N 3W

Posted: Sat Mar 31, 2012 8:37 pm

dobbs,

My drive is a WD20EARS. That's a Green 2TB drive. I have five in raid5 and two have died over the last few weeks.
The first one was obvious - mega iowaits. When I replaced that, the resync failed as another drive (the one I showed above) has 6 bad blocks.

Bit flipping sounds like dud RAM. Data read from the HDD into its RAM is CRC protected. Across the raid set, it's 'parity protected'.
If your resync did not produce any errors, your data is self-consistent in the raid set. That does not mean it's correct, just that all the members of the raid agree on what it is. Those two things taken together rule out any bit flipping.
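
Why a clean check does not prove the data is correct: RAID 5 parity is the XOR of the data chunks in a stripe, and it is computed from whatever bytes the kernel is handed. A minimal sketch in Python of a bit that flips in RAM before the write (a toy three-disk stripe, not md's actual code; all names are illustrative):

```python
def xor_parity(chunks):
    """Byte-wise XOR of equally sized chunks, as used for RAID 5 parity."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


def stripe_is_consistent(data_chunks, parity):
    """What a 'check' pass verifies: stored parity matches stored data."""
    return xor_parity(data_chunks) == parity


intended = [b"GOOD", b"DATA"]

# Flip bit 0o40 of the first byte while the data is still in RAM,
# i.e. before parity is computed and before anything reaches the disks.
corrupted = [bytes([intended[0][0] ^ 0o40]) + intended[0][1:], intended[1]]
parity = xor_parity(corrupted)

print(stripe_is_consistent(corrupted, parity))  # True: the array agrees with itself
print(corrupted == intended)                    # False: but it is not what was meant
```

The parity is calculated over the already-damaged bytes, so every member of the array agrees and the check stays silent.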

If your drives are SATA, the data interface is serial. For only one bit to get flipped during data transmission over a serial link is well beyond my incredulity threshold. That only leaves the motherboard and its component parts.

Time to boot into memtest86+ and run a few cycles.
Errors found in memtest86 do not always point to RAM. It's only likely to be RAM if you get the same error at the same address every time.
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 105
Location: Wenatchee, WA

Posted: Sat Mar 31, 2012 9:55 pm

Sorry to hear about losing the drives, Neddy. I've been wary of drive reliability since we passed the 500GB mark. I think that's when "perpendicular recording" became common. Possibly just me being paranoid, though. I do need to replace these three drives for various reasons: they're only 320GB, two of them are PATA, more than 45,000 hours operating... Like I said, this array is old. :) Unfortunately, I don't know what to purchase anymore.

Quote:
Bit flipping sounds like dud RAM. Data read from the HDD into its RAM is CRC protected. Across the raid set, it's 'parity protected'.
If your resync did not produce any errors, your data is self-consistent in the raid set. That does not mean it's correct, just that all the members of the raid agree on what it is. Those two things taken together rule out any bit flipping.


Yeah, that's why I was considering an mdraid software bug. I was grasping at straws. A possible RAM issue didn't occur to me... I would expect other system stability issues. I'm guessing the faulty region of RAM lies outside the kernel memory, and the data buffer runs into it due to the heavy load. Does that make sense, or am I way off?

I did eliminate mdraid as the culprit, though. Freed up another drive and duplicated the partition:
Code:
dobbs@bender ~ $ sudo fdisk -l /dev/sd[ab] | grep -E "sda4|sdb1"
/dev/sda4   *   238774095   477173440   119199673    7  HPFS/NTFS/exFAT
/dev/sdb1            2048   238401393   119199673    7  HPFS/NTFS/exFAT
dobbs@bender ~ $ sudo dd if=/dev/sda4 of=/dev/sdb1 bs=32M
3637+1 records in
3637+1 records out
122060465152 bytes (122 GB) copied, 2242.62 s, 54.4 MB/s
dobbs@bender ~ $ sudo cmp -l /dev/sda4 /dev/sdb1
Password:
          253594999 377 337
          302277623  47   7
          388563063  40   0
          457392375 252 212
          617962103 165 125
          710643831 156 116
          781120759 253 213
          823862263 243 203
          866853367 154 114
         1238579191 141 101
         1238581623  40   0
         1312984567 242 202
         1313322999 270 230
         1482857335 170 130
         1977688311  40   0
         2081347575 376 336
         2120394615  40   0
         2161162231  43   3
         2212050039 173 133
         2263106423  42   2
         2501622135 277 237
         2534076919 355 315
         2565879927 375 335
         2747989111  40   0
         2837622903  41   1
         3005169271  40   0
         3063135095 370 330
         3083515127 163 123
...and lots more

sda and sdb are both SATA; my RAID 5 array spans sd[def]. Same issue, same bit, getting worse... I don't know the significance, but the byte offset mod 128 is always 119.
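
The mod-128 pattern is quick to confirm. A minimal sketch in Python using the first few offsets from the cmp -l output above (variable names are illustrative). Note that cmp counts bytes from 1, so the zero-based position within each 128-byte block is one less:

```python
offsets = [253594999, 302277623, 388563063, 457392375, 617962103]

# cmp -l reports 1-based byte offsets.
one_based = {off % 128 for off in offsets}
zero_based = {(off - 1) % 128 for off in offsets}

print("offset mod 128 (as printed by cmp):", one_based)
print("zero-based position in 128-byte block:", zero_based)
```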

I'm trying to read one of these errors with hdparm, but neither the left nor right byte values as reported by cmp appeared at the indicated byte offset. It's possible my math is wrong, but I've checked it three times now.
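
For cross-checking the arithmetic: cmp offsets are 1-based bytes from the start of the partition, while hdparm --read-sector takes an absolute sector number on the whole disk. A minimal sketch in Python, assuming 512-byte logical sectors and that the fdisk start column above is in sectors; locate is an illustrative name:

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def locate(cmp_offset, partition_start_sector):
    """Map a 1-based cmp byte offset to (absolute LBA, byte index in sector)."""
    zero_based = cmp_offset - 1
    lba = partition_start_sector + zero_based // SECTOR_SIZE
    return lba, zero_based % SECTOR_SIZE


# First mismatch from the cmp run above; /dev/sda4 starts at sector 238774095.
lba, index = locate(253594999, 238774095)
print(f"hdparm --read-sector {lba} /dev/sda   # byte {index} of the sector")
```

Forgetting either the partition start or the off-by-one from cmp's 1-based counting would land on the wrong byte. As far as I know hdparm dumps the sector as 16-bit hex words, so the byte order within each word may also need checking; treat that part as an assumption.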

I will memtest the system while I leave town for the weekend. Thanks for the insight, Neddy!

Update: After 18 completed passes, memtest (memtest86+ 4.2) showed zero errors. I'm back to not knowing where the issue lies. Regardless, I do have more RAM on order. We'll see if replacing the RAM solves it.
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 105
Location: Wenatchee, WA

Posted: Fri Apr 06, 2012 6:43 am

Yep. Replacing the RAM resolved the issue. Marking solved.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54815
Location: 56N 3W

Posted: Fri Apr 06, 2012 5:28 pm

dobbs,

I bet putting your old RAM back in would work too. That's called 'wiping the contacts'. It reduces the contact resistance between the plugged in parts and is usually good for 12 to 18 months.

Oh, I lost 3 DVDs tops as I have two one-block errors and a four-block error, all in the area where my DVD rips are stored.
The raid5 is back and WD replaced two nine-month-old drives under warranty.
kimmie
Guru


Joined: 08 Sep 2004
Posts: 531
Location: Australia

Posted: Sat Apr 07, 2012 1:40 pm

Neddy,

That load cycle count in your smartctl output looks a little high. Do you know about the nasty head-unloading behaviour of WD20EARS under Linux, and how to cure it with WDIDLE.exe? I have some of these drives in RAID5 too... they needed to be spanked before they kept their heads in the right place.

Anyway, if you can't find this utility and you need it, drop me a PM.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54815
Location: 56N 3W

Posted: Sat Apr 07, 2012 4:24 pm

kimmie,

I'm aware of the head-unloading every eight seconds issue now. I wasn't when I set up the raid.

I understand that WDIDLE.exe needs to be run under Windows, and Windows, or even getting those drives near a box with a GUI, is out of the question.

I'm using
Code:
hdparm -S 252 /dev/...
which sets the standby (spindown) timeout (21 minutes for a value of 252 going by the hdparm man page, not an hour; that would be 242) but I don't think it's the same thing.
hdparm has an option to set the idle3 timeout but it's not widely tested, so I have not used it.
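
For reference, the -S value is an encoded timeout rather than a plain number of seconds. A minimal sketch in Python of the encoding as described in the hdparm man page (the function name standby_timeout is illustrative). By that table 252 means 21 minutes, and an hour would be 242:

```python
def standby_timeout(value):
    """Decode an `hdparm -S` value into seconds, per the hdparm man page.

    Returns None where there is no fixed duration (0 = disabled,
    253 = vendor-defined, 254 = reserved).
    """
    if not 0 <= value <= 255:
        raise ValueError("hdparm -S takes a value from 0 to 255")
    if value == 0:
        return None                      # timeout disabled
    if value <= 240:
        return value * 5                 # 5 s steps, up to 20 minutes
    if value <= 251:
        return (value - 240) * 30 * 60   # 30 min steps, up to 5.5 hours
    if value == 252:
        return 21 * 60                   # 21 minutes
    if value == 255:
        return 21 * 60 + 15              # 21 minutes 15 seconds
    return None                          # 253 vendor-defined, 254 reserved


for v in (242, 252):
    print(v, "->", standby_timeout(v) // 60, "minutes")
```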
kimmie
Guru


Joined: 08 Sep 2004
Posts: 531
Location: Australia

Posted: Sat Apr 07, 2012 8:45 pm

Just needs DOS... I had to make a FreeDOS boot floppy and boot that. I'm guessing you could convince FreeDOS to redirect console to serial if you cared enough.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54815
Location: 56N 3W

Posted: Sat Apr 07, 2012 8:50 pm

kimmie,

The drives are in a HP Microserver. There is no floppy interface and no PATA interface.
It's USB or (e)SATA.

Hmm - I wonder if I could remaster a SystemRescueCD image to put on a USB pen drive, so WDIDLE.exe (and FreeDOS) was one of its image tools.
I can at least test that the floppy boots on another box before I make the ISO
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 105
Location: Wenatchee, WA

Posted: Thu Apr 12, 2012 10:08 pm

NeddySeagoon wrote:
I bet putting your old RAM back in would work too. That's called 'wiping the contacts'. It reduces the contact resistance between the plugged in parts and is usually good for 12 to 18 months.


I got around to trying that. While the problem isn't as severe, it's still there:
Code:
ubuntu@ubuntu:/mnt$ sudo cmp -l storage/tempstore/windows.part /dev/sdc4
 55485640375 370 330
 58497711927 120 160
 93697501719 116 156
ubuntu@ubuntu:/mnt$


Still in the sixth bit, but the offsets mod 128 are now 55 and 23 instead of always 119. Offset mod 32 is 23 for all three (119 mod 32 is 23 as well), but the sample set is too small. Different kernel (LiveUSB in this case), memory capacity and physical arrangement, so I'm not going to explore that.

The obvious explanation for fewer discrepancies is that the system has twice the RAM, so the bad bit(s?) isn't used as frequently. Also, the whole RAM subsystem is operating slightly slower. The "bad" RAM can run at 5ns latency (CAS 4 at 800MHz), while the new RAM needs at least 5.5ns latency (CAS 6 at 1067MHz). My motherboard actually runs the RAM at 800MHz and CAS 6 (7.5ns) when both sets are installed, so they're not really operating at their peak. Might help, might not; that's all conjecture to me.
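
The latency figures come from dividing the CAS cycle count by the clock frequency. A minimal sketch in Python of that arithmetic (names are illustrative). One caveat, and it is an assumption about this hardware: if the quoted 800 and 1067 figures are DDR transfer rates rather than the actual clock, the clock is half that and the absolute latencies double, though the ordering stays the same:

```python
def cas_latency_ns(cas_cycles, freq_mhz, freq_is_transfer_rate=False):
    """CAS latency in nanoseconds = cycles / clock frequency.

    With freq_is_transfer_rate=True the frequency is treated as a DDR
    transfer rate (two transfers per clock), so the clock is halved.
    """
    clock_mhz = freq_mhz / 2 if freq_is_transfer_rate else freq_mhz
    return cas_cycles / clock_mhz * 1000


configs = [("old RAM", 4, 800), ("new RAM", 6, 1067), ("both sets", 6, 800)]
for label, cas, mhz in configs:
    as_quoted = cas_latency_ns(cas, mhz)
    as_ddr = cas_latency_ns(cas, mhz, freq_is_transfer_rate=True)
    print(f"{label:10} CAS {cas} @ {mhz}: {as_quoted:.2f} ns "
          f"(or {as_ddr:.2f} ns if that is a DDR rate)")
```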

On a sadder note, the original boot disk died abruptly shortly after configuring the boot array. I don't know how or why it died; the SMART status was always clean while I investigated the RAM problem. Now the system won't POST with the drive connected (tried different SATA cables, ports, basic debug procedure). Unfortunately, I was absent when it happened. Coincidentally, it's a WD3200KS with a manufacture date of "01 APR 2006", and it died the night of 01 APR 2012. I kinda want to call Western Digital and ask them if it's just a prank...
All times are GMT