[SOLVED] mdadm RAID1 - replacing a failed drive

cami · n00b Joined: 15 Jan 2005 Posts: 36

So just one day after I had my RAID setup up (see [SOLVED] How to properly boot a custom initramfs?), a disk failed permanently. I've installed an identical replacement, but I cannot figure out how to make it being used. The idea behind using RAID was to make this easy, but I tried really hard and I only find more and more questions instead of answers.

I initially created a full-disk RAID-1 on two identical disks using intel storage manager (X58 chipset).

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

Is there a windows on this machine? For Linux it's best to stick to the native format, not using any intel storage manager.

You don't have to remove failed. Just ignore it.

My guess is that you need `mdadm /dev/md_d127 --add /dev/sdb` but I could be wrong because I don't use imsm format.

cami · n00b Joined: 15 Jan 2005 Posts: 36

Well I already issued the suggested command without achieving the desired result (see OP for details).

I already noticed imsm might not have been the best choice but now I'm kind of stuck with it.

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

You're not really showing that in your post... and you only talk of adding to md127 not md_dangnabbit127

If that doesn't work, could you show output for file -s and parted print for each disk?

cami · n00b Joined: 15 Jan 2005 Posts: 36

Oh sorry, I overlooked that bit. It is not possible to --add the array directy, mdadm says i shall add to the container instead.

I will post the output of the requested commands tomorrow. Note however that it's full-disk raid. I also included the outputs of mdadm --examine of the two disks in the OP, maybe that helps for the time being.

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

cami · n00b Joined: 15 Jan 2005 Posts: 36

Thanks for your advice. I already noticed the setup choices might not have been the best.

For completeness, I did not do anything fancy with the disks, I only swapped the failed drive. The RAID is still working, only degraded. So this is basically the standard situation RAID is designed for.

I strongly doubt it has anything to do with partitions however, and that I would have the exact same problem if it were sda1 and sdb1 instead.

So I'm still looking for a proper solution without starting over. If the only solution would be starting over, RAID 1 would be pointless, and a standard backup would be more efficient. So could we assume it's unrelated to partitioning and pretend the raid is on an individual partition?

cami · n00b Joined: 15 Jan 2005 Posts: 36

Update. I was able to solve the problem today, although I still don't understand what happened. Here's what I did:

Booted the system using a Gentoo LiveCD
Noticed that the LIVECD found two containers (imsm0 and imsm1) and one volume (Volume0_0)

Code:

$ ls /dev/md
Volume0_0 imsm0 imsm1
Found that Volume0_0 was using container imsm1

Code:

mdadm --detail /dev/md/Volume0_0
Checked metadata on /dev/sda and /dev/sdb (see OP for outputs)

Code:

$ mdadm --examine /dev/sda
...
$ mdadm --examine /dev/sdb
...
Observed that /dev/sda contained the Intel Storage Manager metadata for my RAID, with the first disk missing and the second disk being /dev/sda itself. (see also OP)
Observed that /dev/sdb contained Intel Storage Manager metadata for a spare without any assigned volume (see also OP)
Assumed that container imsm0 consisted of the spare /dev/sdb
Stopped container /dev/md/imsm0

Code:

mdadm --manage /dev/md/imsm0 --stop
Added /dev/sdb to container /dev/md/imsm1

Code:

mdadm --manage /dev/md/imsm1 --add /dev/sdb
I could hear that this started a rebuild.

I don't know why this didn't work while the system was running, somehow mdadm must have added /dev/sdb to a new container instead of the one I specified. The LiveCD and my system use different versions of mdadm, maybe a bug?

Code:

# mdadm --version # this is the potentially buggy version
mdadm - v3.3.1 - 5th June 2014

Checked what was going on

Code:

# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md125 : active raid1 sdb[1] sda[0]
1953511424 blocks super external:/md126/0 [2/1] [_U]
[==>..................] recovery = 10.8% (212502848/1953511556) finish=223.3min speed=129900K/sec

md126 : inactive sda[1](S) sdb[0](S)
6056 blocks super external:imsm

unused devices: <none>

Checked disk metadata

Code:

cami ~ # mdadm --examine /dev/sda
/dev/sda:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : 1d385601
Family : 5a6ea771
Generation : 00005e95
Attributes : All supported
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Checksum : 60a930ae correct
MPB Sectors : 2
Disks : 2
RAID Devices : 1

Disk01 Serial : PK1134P6JWDGUW
State : active
Id : 00010000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
UUID : 083b0d35:d926f293:ef50839b:4f023f76
RAID Level : 1 <-- 1
Members : 2 <-- 2
Slots : [UU] <-- [_U]
Failed disk : 0
This Slot : 1
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB <-- 64 KiB
Reserved : 0
Migrate State : rebuild
Map State : normal <-- degraded
Checkpoint : 787874 (512)
Dirty State : dirty

Disk00 Serial : PK1134P6JVNVHW
State : active
Id : 00030000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
cami ~ # mdadm --examine /dev/sdb
/dev/sdb:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : 1d385601
Family : 5a6ea771
Generation : 00005e95
Attributes : All supported
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Checksum : 60a930ae correct
MPB Sectors : 2
Disks : 2
RAID Devices : 1

Disk00 Serial : PK1134P6JVNVHW
State : active
Id : 00030000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
UUID : 083b0d35:d926f293:ef50839b:4f023f76
RAID Level : 1 <-- 1
Members : 2 <-- 2
Slots : [UU] <-- [_U]
Failed disk : 0
This Slot : 0 (out-of-sync)
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB <-- 64 KiB
Reserved : 0
Migrate State : rebuild
Map State : normal <-- degraded
Checkpoint : 787874 (512)
Dirty State : dirty

Disk01 Serial : PK1134P6JWDGUW
State : active
Id : 00010000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

Tested the array by mounting the partitions and accessing some files and directories.

Code:

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1,8T 0 disk
└─md125 254:8128 0 1,8T 0 raid1
├─md125p1 254:8129 0 1023M 0 md
├─md125p2 254:8130 0 31G 0 md [SWAP]
└─md125p3 254:8131 0 1,8T 0 md /
sdb 8:16 0 1,8T 0 disk
└─md125 254:8128 0 1,8T 0 raid1
├─md125p1 254:8129 0 1023M 0 md
├─md125p2 254:8130 0 31G 0 md [SWAP]
└─md125p3 254:8131 0 1,8T 0 md /
# mount /dev/md/Volume0_0p3 /mnt/gentoo # /dev/md/Volume0_0p3 symlinks to /dev/md125p3
# ...
# umount /mnt/gentoo

I didn't want to wait for recovery to finish, so I stopped the array, checked everything was offline, checked metadata again, and rebooted.

Code:

mdadm --manage /dev/md/Volume0_0 --stop
mdadm --manage /dev/md/imsm1 --stop
cat /proc/mdstat # this should say "no devices"
mdadm --examine /dev/sda
mdadm --examine /dev/sdb
reboot
During boot, Intel Storage Manager showed the RAID with both disks attached and state "Rebuild" (i.e. recovery), and the system came up normally.