davecorder n00b
Joined: 03 Sep 2004 Posts: 10
|
Posted: Thu Dec 23, 2004 1:21 pm Post subject: Software RAID-5 problem |
|
|
The problem: I've got a personal file server set up with a RAID-5 configuration. It's been running very smoothly for the last few months, serving files to Mac and PC clients via Samba. Late yesterday, the machine froze while I was copying files off it over the network, and when I rebooted, the OS hung on "Mounting local filesystems...". Ever since then I can't get the array to work under Gentoo (it does, however, appear to work just fine under Knoppix).
The hardware:
AMD Athlon XP 1500+
384 MB PC133 SDRAM
MSI K7T-Turbo2 MB
GeForce2 MX (NV11 DDR)
RealTek 8139C-based 10/100 Ethernet
Two Promise SATA150 TX4 4-port SATA host controllers
Two Hauppauge PVR-250 TV Tuners
IBM 20 GB ATA/100 (boot drive /dev/hda)
Generic 16X DVD-ROM (/dev/hdc)
8 160GB Maxtor PATA drives with HighPoint RocketHead PATA/SATA adapters (/dev/sda through /dev/sdh)
The software:
Gentoo Linux (installed from 2004.2, up to date as of about 2 weeks ago)
Kernel is 2.6.9-gentoo-r9
Using sata_promise module (from libata) for the SATA controllers
RAID array is RAID-5, built from the 8 Maxtor drives (/dev/md0)
Using XFS for the filesystem on the array (1.1 TB total formatted space)
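As a quick sanity check on the 1.1 TB figure: RAID-5 spends one drive's worth of space on parity, so usable capacity is (n - 1) times the drive size. A minimal Python sketch (the function name is just illustrative, and filesystem overhead is ignored):

```python
def raid5_usable_gb(num_drives: int, drive_gb: float) -> float:
    """Usable capacity of a RAID-5 array: one drive's worth of space holds parity."""
    if num_drives < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    return (num_drives - 1) * drive_gb

# 8 drives of 160 GB each, as in the array described above
usable = raid5_usable_gb(8, 160)
print(f"{usable:.0f} GB usable, about {usable / 1000:.2f} TB (decimal)")
```

That gives 1120 GB, which lines up with the roughly 1.1 TB of formatted space.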
Like I said, this system was working fine up until yesterday's crash. I'm not sure if the crash caused the RAID failure, or if a RAID problem caused the crash. In either case, I can't find any info on the crash in any log file.
After the crash, I noticed something odd: if I soft-rebooted the machine with the reset button on the front after a hang, the BIOS on the SATA cards would not detect any drives (I let its progress bar fill one row across the screen before killing power). BUT: if I power down the machine and then boot it, the drives are detected just fine.
I was able to boot the machine with a Knoppix 3.6 CD. I was then able to copy the /etc/raidtab file from my boot partition and start the RAID while in Knoppix. The array was out of sync then, and it took about 5 or 6 hours to restore it to good condition. While it was rebuilding, I mounted it and cleared up some inconsistencies in the XFS journal (I tried to run xfs_check, but apparently my CD has some bad sectors and reading xfs_db from it failed). I was able to copy some random files off the array while it was recovering, so it seems that the RAID-5 has served its purpose and protected my data. /proc/mdstat said all the drives in the array were good (no failures).
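For anyone who wants to check /proc/mdstat from a script rather than by eye, here is a rough Python sketch. The sample text is hypothetical (written in the usual mdstat style, not copied from my machine), and the function name is made up for illustration. It reads the `[n/m]` counter (n members expected, m currently in use) and the `[UUUU...]` flags, where `_` marks a missing member:

```python
import re

# Hypothetical sample in the style of /proc/mdstat; NOT output from the actual machine.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      1094017536 blocks level 5, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]

unused devices: <none>
"""

STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

def array_status(mdstat_text):
    """Map each md device to (expected members, active members, per-slot flags)."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        found = STATUS_RE.search(line)
        if found and current:
            status[current] = (int(found.group(1)), int(found.group(2)), found.group(3))
    return status

for name, (expected, active, flags) in array_status(SAMPLE_MDSTAT).items():
    healthy = active == expected and "_" not in flags
    state = "all members up" if healthy else "DEGRADED"
    print(f"{name}: {active}/{expected} active [{flags}] -> {state}")
```

On a live system you would pass in the contents of /proc/mdstat instead of the sample string. Keep in mind this only reports what the md driver has noticed; a marginal drive can still show up as `U`.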
Now comes the fun part. I rebooted back into Gentoo and the system stuck right where it had before: on "Mounting local filesystems."
This comes very shortly after the "Starting RAID devices" step in the boot sequence.
One thing that occurred to me is that perhaps it is actually just waiting for the RAID rebuild process to finish before mounting the filesystem. But that shouldn't be the case, since that process happens entirely in the background and the array is still usable while it's being rebuilt. So that's not what is going on.
After a bit of tweaking (disabling autoloading of the sata_promise module and moving /etc/raidtab so no RAID devices are started on boot), I was able to get Gentoo back up and running without the array. I started the array manually. So far so good. Then I mounted the filesystem. About 5 seconds later, the system froze again. I don't know whether it would have frozen if I had just left the array active and hadn't mounted the filesystem.
So, rebuilding it under the Knoppix CD didn't help.
At the moment, I'm thinking I've got some sort of drive failure, even though I'm not seeing any error messages in the log files and /proc/mdstat reports that all the drives are good. I'm currently in the process of running Maxtor's PowerMax diagnostic utility on the drives (which, despite what the readme says, does detect the drives connected to my third-party SATA controller), so hopefully that'll reveal something.
The version of libata in my kernel does not yet have SMART support (I plan to patch to libata-dev to get that ASAP), so I can't use that at the moment to determine if a drive is going bad.
On the plus side, if it is a drive failure, I have a cold spare ready to be inserted (it's 200 GB rather than 160 GB, but that's not a big deal).
Any thoughts as to what I should be looking at if Maxtor's diagnostic software reports that all drives are good?
TIA
Dave |
|
|
|
fvant Guru
Joined: 08 Jun 2003 Posts: 328 Location: Leiden, The Netherlands
|
Posted: Thu Dec 23, 2004 4:13 pm Post subject: |
|
|
If things run smoothly with Knoppix but not with your homemade kernel, I'd have to conclude the problem lies with your kernel and drivers.
If you compare the dmesg and lsmod output between Knoppix and your kernel, what are the differences? |
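One way to do the lsmod half of that comparison is to save the output on each system and diff the module names. A small Python sketch, assuming you have captured both outputs as text; the two samples below are hypothetical and only there to make the example run:

```python
# Hypothetical lsmod captures for illustration only;
# the real lists on either system will be much longer.
KNOPPIX_LSMOD = """\
Module                  Size  Used by
xfs                   500000  1
raid5                  20000  1
xor                    15000  1 raid5
sata_promise           10000  8
libata                 40000  1 sata_promise
"""

GENTOO_LSMOD = """\
Module                  Size  Used by
ivtv                  150000  0
sata_promise           10000  8
libata                 40000  1 sata_promise
"""

def module_names(lsmod_text):
    """Return the set of module names from lsmod output (header line skipped)."""
    lines = lsmod_text.strip().splitlines()
    return {line.split()[0] for line in lines[1:] if line.strip()}

knoppix = module_names(KNOPPIX_LSMOD)
gentoo = module_names(GENTOO_LSMOD)

print("Only under Knoppix:", sorted(knoppix - gentoo))
print("Only under Gentoo: ", sorted(gentoo - knoppix))
```

A module that is missing from one list is not necessarily missing from that kernel, since drivers compiled in statically never appear in lsmod. That is why the dmesg output is worth comparing as well.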
|
|
|
davecorder n00b
Joined: 03 Sep 2004 Posts: 10
|
Posted: Thu Dec 23, 2004 4:32 pm Post subject: |
|
|
fvant: I was just about to reach that conclusion myself, but I continued to test each drive with Maxtor's diagnostic software.
As it turns out, I have a failing drive. It didn't show up as failed to the OS until after several reboots and much mucking around with cables, drives, and diagnostic software. But now Maxtor's software consistently reports it as defective and cannot effectively repair it (though it tries). Even my Knoppix 3.7 CD now shows the drive as failed.
Off to get a replacement 160 GB if I can, otherwise I'll toss in the 200 GB and call it good.
Next step: get SMART monitoring working, preferably with an email or even SMS alert to me when a drive starts failing. |
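A rough idea of what that alerting could look like in Python, once SMART works through the kernel. This is only a sketch under assumptions: it relies on smartmontools' `smartctl -H` printing an "overall-health self-assessment test result" line (as it does for ATA drives), the function names and addresses are made up, and the demo at the bottom uses sample text instead of touching real drives:

```python
import subprocess
from email.message import EmailMessage

def smart_health_ok(smartctl_output):
    """True if the overall-health line of `smartctl -H` output reports PASSED."""
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
    return False  # no health line at all: treat it as a problem

def check_drive(device):
    """Run `smartctl -H` on one device (needs smartmontools and root)."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return smart_health_ok(result.stdout)

def build_alert(device, sender, recipient):
    """Build (but do not send) an alert mail for a drive that failed the check."""
    msg = EmailMessage()
    msg["Subject"] = f"SMART health check failed on {device}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"smartctl -H did not report PASSED for {device}.")
    return msg

# Demo on sample text only; the addresses are placeholders.
sample = "SMART overall-health self-assessment test result: FAILED!\n"
if not smart_health_ok(sample):
    alert = build_alert("/dev/sdc", "raid@example.com", "dave@example.com")
    print(alert["Subject"])
    # To actually send it, something like:
    #   import smtplib
    #   with smtplib.SMTP("localhost") as smtp:
    #       smtp.send_message(alert)
```

In a real setup `check_drive` would be called for each of /dev/sda through /dev/sdh from cron. The smartd daemon that ships with smartmontools can also monitor drives and send mail on its own, which may be the simpler route.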
|