RAID 6 lost 3/7 drives, but 2 drives have no HW errors. rec?

fLares · n00b Joined: 05 May 2005 Posts: 15

I used to have a RAID 6 (created with mdadm) consisting of 7 HDDs (5+2).
One disk (#7) died and I removed it.
I figured that there are still (5+1) disks left, so one spare still - and continued working with it until I can buy a new HDD. But then disaster struck and took out another drive (#6), /proc/mdstat told me there are now only 5 drives, so I had no reserve left.

I checked the disk that gave up last (#6) with SMART etc and it came out ok, so I figured it was maybe a software hickup and added it back in the RAID as a new drive (mdadm --add) and it started syncing. At about 2% the PC froze, upon reboot only 3 drives where in the RAID (#1,#2,#3).
I checked the missing drives (#4,#5) with mdadm -E and they where ok, superblocks and all. So I figured they where not in the RAID due to the crash before and I have to manually add them to the RAID for some reason.

Now the big mistake was to use --add for the first device (#5) I tried to get back in the RAID and looking at the RAID info, it was added as "spare".
Then strange things happened again with the PC and I checked for Hardware errors. Found that 2 IDE controllers where not behaving well any longer, probably causing all the trouble.

Now to get the data back I tried to copy every HD that was at one time part of the RAID to image files (with dd if=/dev/hdx of=/mnt/backup/hdx), so I would only have to use the on-board controller. So now I have 6 files which are images of the 6 HDDs that where formerly installed, 4 of which where pretty much untouched (#1,#2,#3,#4), one was added as a new drive that started syncing while the RAID was still active (#6) and one was added as spare while the RAID was inactive (#5).

NeddySeagoon · Posted: Sat Nov 17, 2007 12:52 pm Post subject:

fLares,

The bad news is that you may not get any data back but to help you work with the disk images we need to know.

1) how the drives in the raid set were partitioned, if at all.
2) how the images were made. (The exact commands)
3) the command you used to attach the images to /dev/loopX
There are a lot of pitfalls in those steps
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

fLares · n00b Joined: 05 May 2005 Posts: 15

Hi.

1) The drives where originally partitioned only to have 1 large partition each. They were all the same size, 160GB. So I had /dev/hda1 /dev/hdc1, etc...
2) The images where made with dd if=/dev/hda1 of=/backup/hda etc
3) I used losetup /backup/hda /dev/loop1 etc

I still have the original hard disks though, just in case. (Can't mount them all though, since I have lost 2 IDE Controller cards to hardware failures. Most likely it is the mainboard that causes the problems though, so the controllers may work in a different system)

Many Thanks
Aurora

NeddySeagoon · Posted: Tue Nov 27, 2007 7:52 pm Post subject:

fLares,

I know what to try but I don't know where the information is you need to change.
You need to hack the metadata to make one of the failed (or spare) images appear good, even if its not.

The find out what you need to do to attempt that, you need to read the mdadm, or possibly the kernel raid code.

Lets suppose for a moment that this data is held within the raid section of your raid set. You raid is broken and cannot be read (as raid) therefore it follows that it must be in an area on each drive outside the raided area. Further, if mdadm can tell you the status of your raid, you can get at it to change it.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

fLares · n00b Joined: 05 May 2005 Posts: 15

I found a tool named mddump. Supossedly it can change the superblocks of a md-RAID drive. I tried to use it, can red the superblocks, but cant write them back when changed (the checksum does not work). Can this tool be used to help me? Or any other suggestions? Looking into the kernel source code is out of the question since I myself don't understand it well enough and a friend who could do it would need quite some time to get into it, which he is only willing to to if it would be extremely important, which is not neccecarily the case here. The lost data is not critical, it's mostly old letters, images, some mp3s and copies of websites I once made that are offline (and now lost).

However, I need those extra disk space soon for backups

- so I give it about 2 Weeks until I need to have a solution for this... better loos the old data than risk the current data by not performing backups.

Greetings
fLares