joelparthemore n00b
Joined: 15 Nov 2014 Posts: 29
Posted: Fri Sep 15, 2023 9:24 pm Post subject: problem with software-based RAID-5 with IMSM metadata (SOLVED)
My home directory is on a RAID-5 array that, for whatever reason (it seemed like a good idea at the time?), I built using the hooks in the UEFI BIOS -- or at least that's my understanding of what I did. That is to say, it's a "real" software-based RAID array in Linux that's built on top of a "fake" RAID array in the UEFI BIOS.
All was well for some number of years until a few days ago. After I installed the latest KDE updates, the RAID array would lock up entirely when I tried to log in to a new KDE Wayland session. It all came down to one process that refused to die, running startplasma-wayland. Because the process refused to die, the RAID array could not be stopped cleanly and rebooting the computer therefore caused the RAID array to go out of sync. After that, any attempt whatsoever to access the RAID array would cause the RAID array to lock up again.
The first few times this happened, I was able to start the computer without starting the RAID array, reassemble the array with mdadm --assemble --run --force /dev/md126 /dev/sda /dev/sde /dev/sdc, and have it working fine -- I could fix any filestore problems with e2fsck, mount /home, log in to my home directory and do pretty much whatever I wanted -- until I tried logging in to a new KDE Wayland session again. This happened several times while I was trying to troubleshoot the problem with startplasma-wayland.
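For completeness, the full sequence that worked those first few times was roughly the following (device names as they appear on my machine, and assuming, as I believe is the case here, that the filesystem sits directly on /dev/md126):
Code:
# reassemble the out-of-sync array and force it to start
mdadm --assemble --run --force /dev/md126 /dev/sda /dev/sde /dev/sdc
# fix any filestore problems, then mount the home directory
e2fsck /dev/md126
mount /home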
Unfortunately, one time this didn't work. I was still able to start the computer without starting the RAID array, reassemble it and reboot with the array seemingly okay (according to mdadm -D), BUT this time any attempt to access the array, or even just to stop it (mdadm --stop /dev/md126) once it was started, would cause it to lock up.
I'm guessing that the contents of the filestore on the RAID array are probably still there. Does anyone have suggestions on getting the RAID array working properly again and accessing them? I have avoided doing anything further myself because, of course, if the contents of the filestore are still there, I don't want to do anything to jeopardize them.
I'm happy to send whatever output might assist in answering that question. Here is the output of mdadm -D /dev/md126:
Code:
/dev/md126:
Container : /dev/md/imsm0, member 0
Raid Level : raid5
Array Size : 1953513472 (1863.02 GiB 2000.40 GB)
Used Dev Size : 976756736 (931.51 GiB 1000.20 GB)
Raid Devices : 3
Total Devices : 3
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Layout : left-asymmetric
Chunk Size : 128K
Consistency Policy : resync
UUID : aa989aae:6526b858:7b1edb5f:4f4b2686
Number Major Minor RaidDevice State
2 8 0 0 active sync /dev/sda
1 8 64 1 active sync /dev/sde
0 8 32 2 active sync /dev/sdc
And here is the output of mdadm -D /dev/md127:
Code:
/dev/md127:
Version : imsm
Raid Level : container
Total Devices : 3
Working Devices : 3
UUID : 01ce9128:9a9d46c6:efc9650f:28fe9662
Member Arrays : /dev/md/vol0_0
Number Major Minor RaidDevice
- 8 64 - /dev/sde
- 8 32 - /dev/sdc
- 8 0 - /dev/sda
Here's what the three drives in the array look like according to mdadm --examine --verbose:
Code:
/dev/sda:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.2.02
Orig Family : 75658571
Family : 75658571
Generation : 0023575d
Creation Time : Unknown
Attributes : All supported
UUID : 01ce9128:9a9d46c6:efc9650f:28fe9662
Checksum : a180c335 correct
MPB Sectors : 2
Disks : 3
RAID Devices : 1
Disk00 Serial : WD-WCC6Y3LCXY73
State : active
Id : 00000000
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
[vol0]:
Subarray : 0
UUID : aa989aae:6526b858:7b1edb5f:4f4b2686
RAID Level : 5 <-- 5
Members : 3 <-- 3
Slots : [UUU] <-- [UUU]
Failed disk : none
This Slot : 0
Sector Size : 512
Array Size : 3907026944 (1863.02 GiB 2000.40 GB)
Per Dev Size : 1953515520 (931.51 GiB 1000.20 GB)
Sector Offset : 0
Num Stripes : 7630912
Chunk Size : 128 KiB <-- 128 KiB
Reserved : 0
Migrate State : repair
Map State : normal <-- normal
Checkpoint : 0 (768)
Dirty State : clean
RWH Policy : off
Volume ID : 1
Disk01 Serial : WD-WCC6Y5KP9A25
State : active
Id : 00000005
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
Disk02 Serial : WD-WCC6Y3NF1PD2
State : active
Id : 00000003
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
/dev/sde:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.2.02
Orig Family : 75658571
Family : 75658571
Generation : 0023575d
Creation Time : Unknown
Attributes : All supported
UUID : 01ce9128:9a9d46c6:efc9650f:28fe9662
Checksum : a180c335 correct
MPB Sectors : 2
Disks : 3
RAID Devices : 1
Disk01 Serial : WD-WCC6Y5KP9A25
State : active
Id : 00000005
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
[vol0]:
Subarray : 0
UUID : aa989aae:6526b858:7b1edb5f:4f4b2686
RAID Level : 5 <-- 5
Members : 3 <-- 3
Slots : [UUU] <-- [UUU]
Failed disk : none
This Slot : 1
Sector Size : 512
Array Size : 3907026944 (1863.02 GiB 2000.40 GB)
Per Dev Size : 1953515520 (931.51 GiB 1000.20 GB)
Sector Offset : 0
Num Stripes : 7630912
Chunk Size : 128 KiB <-- 128 KiB
Reserved : 0
Migrate State : repair
Map State : normal <-- normal
Checkpoint : 0 (768)
Dirty State : clean
RWH Policy : off
Volume ID : 1
Disk00 Serial : WD-WCC6Y3LCXY73
State : active
Id : 00000000
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
Disk02 Serial : WD-WCC6Y3NF1PD2
State : active
Id : 00000003
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
/dev/sdc:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.2.02
Orig Family : 75658571
Family : 75658571
Generation : 0023575d
Creation Time : Unknown
Attributes : All supported
UUID : 01ce9128:9a9d46c6:efc9650f:28fe9662
Checksum : a180c335 correct
MPB Sectors : 2
Disks : 3
RAID Devices : 1
Disk02 Serial : WD-WCC6Y3NF1PD2
State : active
Id : 00000003
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
[vol0]:
Subarray : 0
UUID : aa989aae:6526b858:7b1edb5f:4f4b2686
RAID Level : 5 <-- 5
Members : 3 <-- 3
Slots : [UUU] <-- [UUU]
Failed disk : none
This Slot : 2
Sector Size : 512
Array Size : 3907026944 (1863.02 GiB 2000.40 GB)
Per Dev Size : 1953515520 (931.51 GiB 1000.20 GB)
Sector Offset : 0
Num Stripes : 7630912
Chunk Size : 128 KiB <-- 128 KiB
Reserved : 0
Migrate State : repair
Map State : normal <-- normal
Checkpoint : 0 (768)
Dirty State : clean
RWH Policy : off
Volume ID : 1
Disk00 Serial : WD-WCC6Y3LCXY73
State : active
Id : 00000000
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
Disk01 Serial : WD-WCC6Y5KP9A25
State : active
Id : 00000005
Usable Size : 1953514766 (931.51 GiB 1000.20 GB)
Last edited by joelparthemore on Sat Sep 23, 2023 7:00 pm; edited 1 time in total
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54821 Location: 56N 3W
Posted: Sat Sep 16, 2023 11:08 am
joelparthemore,
I'll have a nibble. Treat it with caution, as I don't know the on-disk layout of Intel Raid ISM.
It may not even be published. As you say, it's fake raid.
When you assemble the raid by hand, mdadm has a read-only option. If you start the raid read-only, you may be able to copy the data out, if you don't have a backup already.
Be warned that making the raid set read-only is not the same as mounting the filesystems it contains read-only.
When the raid set is read-only, it means what it says: no writes at all.
Mounting a filesystem read-only does not prevent journal replays to make the filesystem self-consistent again.
Attempting to mount a dirty filesystem on a read-only raid set will therefore fail, as the journal replay cannot write to the raid set.
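For instance, something like this -- going from memory, so check the man page; device names as in your post:
Code:
# assemble the raid set read only: no writes at all to the member devices
mdadm --assemble --readonly /dev/md126 /dev/sda /dev/sde /dev/sdc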
Never use fsck until you have an image of the damaged filesystem. Normally, it will only do a journal replay anyway, as that makes the filesystem clean again.
When a repair needs more than that, fsck has to guess and often makes a bad situation worse.
fsck makes the filesystem metadata self consistent. It says nothing about any user data that might have been on the filesystem.
The image taken before fsck is your undo for when fsck gets it wrong ... and it does.
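To take the image, plain dd to somewhere with enough spare space is all it needs; the paths here are only examples:
Code:
# image the whole raid device before letting fsck anywhere near it
dd if=/dev/md126 of=/path/to/spare/space/md126.img bs=64M status=progress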
Raid is not a backup substitute.
Plan to make a backup and validate the backup if you don't have one.
_________________
Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
joelparthemore n00b
Joined: 15 Nov 2014 Posts: 29
Posted: Sat Sep 23, 2023 7:14 pm
NeddySeagoon wrote:
joelparthemore,
I'll have a nibble. Treat it with caution, as I don't know the on-disk layout of Intel Raid ISM.
It may not even be published. As you say, it's fake raid.
Yeah. I think that, when I rebuild the RAID array, I'm skipping the UEFI BIOS hooks and just doing it all on the Linux side.
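Something along these lines, I expect, once everything is safely copied off (device names as before; it goes without saying that --create wipes whatever is currently on the disks):
Code:
# recreate as a plain Linux md raid5 with native 1.2 metadata instead of IMSM
mdadm --create /dev/md0 --level=5 --raid-devices=3 --metadata=1.2 \
      /dev/sda /dev/sde /dev/sdc
mkfs.ext4 /dev/md0    # or whatever filesystem you prefer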
Quote:
When you assemble the raid by hand, mdadm has a read-only option. If you start the raid read-only, you may be able to copy the data out, if you don't have a backup already.
I did this and was confused at first by the results, as the RAID-5 array (/dev/md126) and its container (/dev/md127) somehow got switched around, so that the RAID array looked empty. Then I got nervous and waited till I had a chance to email the kernel RAID mailing list, which I did this morning. I was going to assemble the RAID array read-only by hand, but the kernel assembled it auto-read-only first. (I'd forgotten to add the necessary keywords to the kernel command line in GRUB to keep the kernel from doing the assembly.) The disk imaging with dd took a long time and, for much of it, looked like it was doing absolutely nothing, but it finished maybe a couple of hours ago.
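For anyone else hitting this, the auto-read-only state should show up in either of these (device name as on my machine):
Code:
cat /proc/mdstat                          # the array shows as active (auto-read-only)
mdadm --detail /dev/md126 | grep -i state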
Quote:
Never use fsck until you have an image of the damaged filesystem.
Once I had an image, e2fsck recovered the journal and that was that! Nothing, I think, was lost.
Quote:
When a repair needs more than that, fsck has to guess and often makes a bad situation worse.
Well, it will if you run it in automatic mode -- which, I'll admit, I've been lazy enough to do in the past.
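By automatic mode I mean the non-interactive flags rather than answering each question yourself, something like:
Code:
e2fsck -p /dev/md126   # "preen": automatically fix only what it considers safe to fix
e2fsck -y /dev/md126   # assume "yes" to every question -- the lazy, riskier option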
Quote:
Raid is not a backup substitute.
Well, quite. My problem is that, although I'd mostly been careful about keeping all my important files elsewhere, I'd left all my accounting in the home directory. Oops. There are a few other files I'll be glad to have back.
A question: is there anything useful I might discover by keeping the RAID array around a bit longer in its present form? I'm still wondering if there isn't a way to get it back to where it should be without having to re-create it (even though I will re-create it in the end, to get around any dodginess with the IMSM metadata).