cami (n00b)
Joined: 15 Jan 2005  Posts: 36

Posted: Tue Jul 26, 2016 4:38 pm    Post subject: [SOLVED] mdadm RAID1 - replacing a failed drive
So just one day after I had my RAID set up (see [SOLVED] How to properly boot a custom initramfs?), a disk failed permanently. I've installed an identical replacement, but I cannot figure out how to get it into use. The idea behind using RAID was to make this easy, but the harder I try, the more questions I find instead of answers.
I initially created a full-disk RAID 1 on two identical disks using Intel Storage Manager (X58 chipset).
Code: | mdadm --examine /dev/sda
/dev/sda:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : 1d385601
Family : 83eb12c3
Generation : 00005e60
Attributes : All supported
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Checksum : d65c6715 correct
MPB Sectors : 2
Disks : 2
RAID Devices : 1
Disk01 Serial : PK1134P6JWDGUW
State : active
Id : 00010000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
[Volume0]:
UUID : 083b0d35:d926f293:ef50839b:4f023f76
RAID Level : 1 <-- 1
Members : 2 <-- 2
Slots : [UU] <-- [__]
Failed disk : 1
This Slot : 1 (out-of-sync)
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB <-- 64 KiB
Reserved : 0
Migrate State : rebuild
Map State : degraded <-- degraded
Checkpoint : 0 (512)
Dirty State : dirty
Disk00 Serial : 134P6JVNVHW:0:0
State : active failed
Id : ffffffff
Usable Size : 3907022936 (1863.01 GiB 2000.40 GB) |
The last lines describe the failed disk, which is no longer physically present. The other disk (Disk01 Serial : PK1134P6JWDGUW) is attached as /dev/sda. The new drive is attached as /dev/sdb, but is not used in any way yet.
Code: | NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1,8T 0 disk
└─md_d127 254:8128 0 1,8T 0 raid1
├─md_d127p1 254:8129 0 1023M 0 md
├─md_d127p2 254:8130 0 31G 0 md [SWAP]
└─md_d127p3 254:8131 0 1,8T 0 md /
sdb 8:16 0 1,8T 0 disk |
The Intel Storage Manager UI only lets me create or delete arrays, not replace drives. So I have to do this with mdadm somehow. I already ran
Code: | mdadm --manage /dev/md127 --remove failed |
It exited without printing anything, so I'm not sure whether it did anything.
The first thing I do not understand is why there are both a separate "container" (md127) and an "array" (md_d127), what each of these is, and when to use which. Most sources on the net have just one "md0". The documentation on containers is very brief.
The second thing I do not understand is the output of /proc/mdstat, mdadm --detail and mdadm --examine. The documentation doesn't explain very well what the differences are and how to interpret the output. As far as I understand, --examine reads the metadata block from the physical drives, but I couldn't figure out what --detail and /proc/mdstat report.
Code: | cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md_d127 : active raid1 sda[0]
1953511424 blocks super external:/md127/0 [2/1] [U_]
md127 : inactive sda[0](S)
3028 blocks super external:imsm
unused devices: <none> |
Code: | mdadm --detail /dev/md127
/dev/md127:
Version : imsm
Raid Level : container
Total Devices : 1
Working Devices : 1
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Member Arrays : /dev/md/Volume0_0
Number Major Minor RaidDevice
0 8 0 - /dev/sda |
Code: | mdadm --detail /dev/md_d127
/dev/md_d127:
Container : /dev/md/imsm0, member 0
Raid Level : raid1
Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 1
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
UUID : 083b0d35:d926f293:ef50839b:4f023f76
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
2 0 0 2 removed |
I can add /dev/sdb to and remove it from the container /dev/md127, but that doesn't seem to affect the actual array.
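For reference, the add and remove were essentially these commands:
Code: | mdadm --manage /dev/md127 --add /dev/sdb
mdadm --manage /dev/md127 --remove /dev/sdb |
This is the state with /dev/sdb added back to the container: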
Code: | cami ~ # cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md_d127 : active raid1 sda[0]
1953511424 blocks super external:/md127/0 [2/1] [U_]
md127 : inactive sdb[1](S) sda[0](S)
6056 blocks super external:imsm
unused devices: <none>
cami ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1,8T 0 disk
└─md_d127 254:8128 0 1,8T 0 raid1
├─md_d127p1 254:8129 0 1023M 0 md
├─md_d127p2 254:8130 0 31G 0 md [SWAP]
└─md_d127p3 254:8131 0 1,8T 0 md /
sdb 8:16 0 1,8T 0 disk
cami ~ # mdadm --detail /dev/md127
/dev/md127:
Version : imsm
Raid Level : container
Total Devices : 2
Working Devices : 2
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Member Arrays : /dev/md/Volume0_0
Number Major Minor RaidDevice
0 8 0 - /dev/sda
1 8 16 - /dev/sdb
cami ~ # mdadm --detail /dev/md_d127
/dev/md_d127:
Container : /dev/md/imsm0, member 0
Raid Level : raid1
Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 1
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
UUID : 083b0d35:d926f293:ef50839b:4f023f76
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
2 0 0 2 removed
cami ~ # mdadm --examine /dev/sdb
/dev/sdb:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.0.00
Orig Family : 00000000
Family : e3724720
Generation : 00000001
Attributes : All supported
UUID : 00000000:00000000:00000000:00000000
Checksum : 01a96b92 correct
MPB Sectors : 1
Disks : 1
RAID Devices : 0
Disk00 Serial : PK1134P6JVNVHW
State : spare
Id : 03000000
Usable Size : 3907026958 (1863.02 GiB 2000.40 GB)
Disk Serial : PK1134P6JVNVHW
State : spare
Id : 03000000
Usable Size : 3907026958 (1863.02 GiB 2000.40 GB)
|
The RAID contains the root filesystem, so it's not easy to stop and reassemble the array, although it is possible (using a boot CD). I had hoped for an easy solution; easy replacement was the idea behind the setup, after all. But so far I haven't found any solution at all that doesn't require recreating the array and losing the data.
Last edited by cami on Wed Jul 27, 2016 12:05 pm; edited 2 times in total
frostschutz (Advocate)
Joined: 22 Feb 2005  Posts: 2977  Location: Germany

Posted: Tue Jul 26, 2016 4:49 pm
Is there a Windows install on this machine? For Linux it's best to stick to the native metadata format and not use Intel Storage Manager at all.
You don't have to remove failed. Just ignore it.
My guess is that you need `mdadm /dev/md_d127 --add /dev/sdb`, but I could be wrong because I don't use the imsm format.
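If that gets refused, the imsm way is (as far as I can tell) to add the spare to the container and let mdmon take care of the member array:
Code: | mdadm --manage /dev/md127 --add /dev/sdb |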
cami (n00b)
Joined: 15 Jan 2005  Posts: 36

Posted: Tue Jul 26, 2016 7:46 pm
Well, I already issued the suggested command without achieving the desired result (see the OP for details).
I already noticed that imsm might not have been the best choice, but now I'm kind of stuck with it.
frostschutz (Advocate)
Joined: 22 Feb 2005  Posts: 2977  Location: Germany

Posted: Tue Jul 26, 2016 8:20 pm
You're not really showing that in your post... and you only talk of adding to md127, not md_dangnabbit127.
If that doesn't work, could you show the output of file -s and parted print for each disk?
Code: |
# print filesystem/metadata signatures and the full partition table (in sectors) for each disk and md device
for disk in /dev/sda* /dev/sdb* /dev/md* /dev/md*/*
do
    file -sL "$disk"
    parted "$disk" unit s print free
done
|
cami (n00b)
Joined: 15 Jan 2005  Posts: 36

Posted: Tue Jul 26, 2016 8:56 pm
Oh sorry, I overlooked that bit. It is not possible to --add to the array directly; mdadm says I should add to the container instead.
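That is, the command suggested above gets rejected, with mdadm pointing me at the container:
Code: | mdadm /dev/md_d127 --add /dev/sdb |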
I will post the output of the requested commands tomorrow. Note, however, that it's a full-disk RAID. I also included the output of mdadm --examine for both disks in the OP; maybe that helps for the time being.
frostschutz (Advocate)
Joined: 22 Feb 2005  Posts: 2977  Location: Germany

Posted: Tue Jul 26, 2016 10:05 pm
cami wrote: | Note however that it's full-disk raid. |
There are currently two threads on the linux-raid mailing list from people who destroyed their RAID because it was a full-disk RAID (http://www.spinics.net/lists/raid/msg53033.html, http://www.spinics.net/lists/raid/msg53046.html).
Their mistake: they partitioned their full-disk RAID with GPT, then ran a partitioner on ... the full disk.
The partitioner sees GPT data at either the start or the end of the disk (GPT keeps a backup at the end) and restores the "missing" GPT at the other end of the disk, and there goes your RAID metadata, bye-bye.
I never do full-disk RAID, or full-disk anything for that matter; there are just too many ways for it to go wrong unexpectedly. Always use a partition table.
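If you want to see which signatures are sitting where on a disk, wipefs without -a only lists them and doesn't wipe anything:
Code: | wipefs /dev/sda |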
My suggestion is to bite the bullet and do it over. If your current RAID is still working, you can use sdb to build a new structure from scratch, this time with a traditional disk -> partitions -> md -> filesystem layout.
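Roughly along these lines for the new disk (just a sketch with example names and sizes; --create of course destroys whatever is on sdb, and you would repeat the partition/md pair for boot, swap and root as needed):
Code: |
# example layout: one GPT partition carrying one native-metadata RAID1
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 1MiB 100%
parted -s /dev/sdb set 1 raid on

# create the mirror degraded ("missing"); the old disk's partition gets added later with --add
mdadm --create /dev/md0 --metadata=1.2 --level=1 --raid-devices=2 missing /dev/sdb1

mkfs.ext4 /dev/md0
|
Then copy the data over, boot from the new structure, wipe and repartition the old disk the same way, and mdadm --add its partition(s).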
cami (n00b)
Joined: 15 Jan 2005  Posts: 36

Posted: Wed Jul 27, 2016 8:55 am
Thanks for your advice. I already noticed the setup choices might not have been the best.
For completeness: I did not do anything fancy with the disks; I only swapped the failed drive. The RAID is still working, just degraded. So this is basically the standard situation RAID is designed for.
I strongly doubt it has anything to do with partitioning, however; I suspect I would have the exact same problem if the RAID were on sda1 and sdb1 instead.
So I'm still looking for a proper solution that doesn't involve starting over. If the only solution were to start over, RAID 1 would be pointless and a plain backup would be more efficient. So could we assume it's unrelated to partitioning and pretend the RAID is on an individual partition?
cami (n00b)
Joined: 15 Jan 2005  Posts: 36

Posted: Wed Jul 27, 2016 12:01 pm
Update. I was able to solve the problem today, although I still don't understand what happened. Here's what I did:
- Booted the system using a Gentoo LiveCD
- Noticed that the LiveCD found two containers (imsm0 and imsm1) and one volume (Volume0_0)
Code: | $ ls /dev/md
Volume0_0 imsm0 imsm1 |
- Found that Volume0_0 was using container imsm1
Code: | mdadm --detail /dev/md/Volume0_0 |
- Checked the metadata on /dev/sda and /dev/sdb (see OP for the outputs)
Code: | $ mdadm --examine /dev/sda
...
$ mdadm --examine /dev/sdb
... |
- Observed that /dev/sda contained the Intel Storage Manager metadata for my RAID, with the first disk missing and the second disk being /dev/sda itself (see also OP)
- Observed that /dev/sdb contained Intel Storage Manager metadata for a spare without any assigned volume (see also OP)
- Assumed that container imsm0 consisted of the spare /dev/sdb
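(In hindsight I could have verified that assumption before stopping it, e.g. with:)
Code: | mdadm --detail /dev/md/imsm0 |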
- Stopped container /dev/md/imsm0
Code: | mdadm --manage /dev/md/imsm0 --stop |
- Added /dev/sdb to container /dev/md/imsm1
Code: | mdadm --manage /dev/md/imsm1 --add /dev/sdb |
I could hear that this started a rebuild.
I don't know why this didn't work while the system was running; somehow mdadm must have added /dev/sdb to a new container instead of the one I specified. The LiveCD and my system use different versions of mdadm, so maybe it's a bug?
Code: | # mdadm --version # this is the potentially buggy version
mdadm - v3.3.1 - 5th June 2014 |
- Checked what was going on
Code: | # cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md125 : active raid1 sdb[1] sda[0]
1953511424 blocks super external:/md126/0 [2/1] [_U]
[==>..................] recovery = 10.8% (212502848/1953511556) finish=223.3min speed=129900K/sec
md126 : inactive sda[1](S) sdb[0](S)
6056 blocks super external:imsm
unused devices: <none> |
- Checked the disk metadata
Code: | cami ~ # mdadm --examine /dev/sda
/dev/sda:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : 1d385601
Family : 5a6ea771
Generation : 00005e95
Attributes : All supported
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Checksum : 60a930ae correct
MPB Sectors : 2
Disks : 2
RAID Devices : 1
Disk01 Serial : PK1134P6JWDGUW
State : active
Id : 00010000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
[Volume0]:
UUID : 083b0d35:d926f293:ef50839b:4f023f76
RAID Level : 1 <-- 1
Members : 2 <-- 2
Slots : [UU] <-- [_U]
Failed disk : 0
This Slot : 1
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB <-- 64 KiB
Reserved : 0
Migrate State : rebuild
Map State : normal <-- degraded
Checkpoint : 787874 (512)
Dirty State : dirty
Disk00 Serial : PK1134P6JVNVHW
State : active
Id : 00030000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
cami ~ # mdadm --examine /dev/sdb
/dev/sdb:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : 1d385601
Family : 5a6ea771
Generation : 00005e95
Attributes : All supported
UUID : eb43e025:cff7929e:9af766e7:f2d60015
Checksum : 60a930ae correct
MPB Sectors : 2
Disks : 2
RAID Devices : 1
Disk00 Serial : PK1134P6JVNVHW
State : active
Id : 00030000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
[Volume0]:
UUID : 083b0d35:d926f293:ef50839b:4f023f76
RAID Level : 1 <-- 1
Members : 2 <-- 2
Slots : [UU] <-- [_U]
Failed disk : 0
This Slot : 0 (out-of-sync)
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB <-- 64 KiB
Reserved : 0
Migrate State : rebuild
Map State : normal <-- degraded
Checkpoint : 787874 (512)
Dirty State : dirty
Disk01 Serial : PK1134P6JWDGUW
State : active
Id : 00010000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB) |
- Tested the array by mounting the partitions and accessing some files and directories.
Code: | # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1,8T 0 disk
└─md125 254:8128 0 1,8T 0 raid1
├─md125p1 254:8129 0 1023M 0 md
├─md125p2 254:8130 0 31G 0 md [SWAP]
└─md125p3 254:8131 0 1,8T 0 md /
sdb 8:16 0 1,8T 0 disk
└─md125 254:8128 0 1,8T 0 raid1
├─md125p1 254:8129 0 1023M 0 md
├─md125p2 254:8130 0 31G 0 md [SWAP]
└─md125p3 254:8131 0 1,8T 0 md /
# mount /dev/md/Volume0_0p3 /mnt/gentoo # /dev/md/Volume0_0p3 symlinks to /dev/md125p3
# ...
# umount /mnt/gentoo |
- Since I didn't want to wait for the recovery to finish, I stopped the array and its container, checked that everything was offline, checked the metadata again, and rebooted.
Code: | mdadm --manage /dev/md/Volume0_0 --stop
mdadm --manage /dev/md/imsm1 --stop
cat /proc/mdstat # this should no longer list any md devices
mdadm --examine /dev/sda
mdadm --examine /dev/sdb
reboot |
During boot, Intel Storage Manager showed the RAID with both disks attached and state "Rebuild" (i.e. recovery), and the system came up normally.
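The rest of the recovery could then be followed from the running system, for example with:
Code: | watch -n 60 cat /proc/mdstat |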