Dell CERC SATA 1.5/6ch RAID controller, reiserfs bug [SOLVED].

r3tude · n00b Joined: 12 Jan 2005 Posts: 18

Hi all

Some history: we bought this server last year, with a 794GB RAID array, to take over the fileserver role from our PDC. The configuration was easy enough and gave us a good, reliable fileserver for a week, then it all went wrong and Dell came out. Since then we have had 3 drive failures, one of them a dual failure that lost lots of data, and Dell replaced the drives. Now I am getting drive failures on what, as far as I can see, are working drives. They're Maxtor drives, which may explain it, but I run Maxtor's utility burn-in test on them and they work fine. In this server, though, one drive fails and the hot spare takes over, then the hot spare fails and the previously failed drive takes over, and it keeps going round in that loop.

The main thing is that every time a drive fails in this RAID 5 array, the whole server stops responding and goes down. It's usually fine after a hard reboot and carries on rebuilding the array, but today I've had to run reiserfsck with --rebuild-tree to fix it.

So the main question is: has anyone had experience with this RAID card under Gentoo, and have you had any problems? I am thinking it's more to do with it being Dell and Maxtor hardware, which I would not class as good by any standards, but I need to cover all avenues.

HackingM2 · Posted: Wed May 17, 2006 9:30 am

I have had similar problems on a server I own. It used to do pretty much what you described. It wasn't a Dell though.

What fixed the problems in the end was a new PSU. It turned out that the old one had an under-voltage 12V rail, which was making the drives behave as if they had failed. I swapped it out for a better model (650W triple redundant) and all has been well since - the server sounds like an aircraft taking off, but it works.

You may want to invest in a half-decent digital multimeter and see what it says about the 12V rail. I know what Dell support techs are like and I can't imagine them checking it. Try loading the system with lots of CPU and disk activity while you test. Mine used to drop to about 10.2V. :roll:
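If you want to log a few readings and see which ones are out of range, here is a minimal sketch of that check. It assumes the usual ATX figure of +/-5% on the +12V rail (roughly 11.4V to 12.6V); a Dell server PSU may be specified differently, so treat the tolerance as an assumption and check the PSU's own documentation. The function names and the sample readings are made up for illustration; only the 10.2V figure comes from the post above.

```python
# Nominal voltage and tolerance for the +12 V rail.
# ASSUMPTION: +/-5% is the usual ATX tolerance; your PSU's spec may differ.
NOMINAL_V = 12.0
TOLERANCE = 0.05


def rail_limits(nominal=NOMINAL_V, tolerance=TOLERANCE):
    """Return the (low, high) acceptable voltages for a rail."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def check_readings(readings, nominal=NOMINAL_V, tolerance=TOLERANCE):
    """Return the readings that fall outside the acceptable window."""
    low, high = rail_limits(nominal, tolerance)
    return [v for v in readings if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical multimeter readings taken while the box is under load.
    readings = [12.1, 11.8, 10.2]
    low, high = rail_limits()
    print(f"acceptable window: {low:.2f} V to {high:.2f} V")
    for v in check_readings(readings):
        print(f"out of tolerance: {v:.2f} V")
```

Take the readings while the system is busy with CPU and disk activity, since a weak rail may look fine at idle.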

r3tude · n00b Joined: 12 Jan 2005 Posts: 18

Thanks for the info, I'll dig out my multimeter now.

I never thought of the PSU, it makes sense.

r3tude · n00b Joined: 12 Jan 2005 Posts: 18

Sorted. There was a bug in reiserfs affecting large filesystems; it was documented on a 1.4TB filesystem here: http://www.mail-archive.com/reiserfs-list@namesys.com/msg20923.html

It was causing massive server load and read/write failures. I've upgraded my system and done a kernel upgrade and it seems fine now. I am going to keep an eye on the RAID array just in case this was a secondary problem.

HackingM2 · Posted: Fri May 19, 2006 3:21 pm

Interesting. I shall have to watch out for that as I have some filesystems approaching that size.

I don't want to put a dampener on things, but I have to say that if the controller is deciding that the drive has failed (and switching to a hot spare), I would be very surprised indeed if it was a software issue. In my experience these things always turn out to be hardware - usually PSU or RAM.

Still... Glad to hear it is working now. I hope it continues to do so.