remix
Posted: Sat Apr 19, 2014 1:09 pm    Post subject: Help with failing disk
I have a four-disk RAID5 array. One disk has died completely, and before I found time to replace it (while running with 3 of 4 disks), one of my other drives started to fail: first one partition, then another, while the other partitions on that same disk continue to work.
I really need to recover some files from one of the partitions on the newly failing disk. I just bought a couple of new hard drives and have replaced the completely dead disk.
My question is: is there any way to recover one of the partitions on that 'fourth' disk so that I can assemble and mount the array using 3 of 4 partitions?
Code: | mdadm --assemble --force /dev/md5 /dev/sda6 /dev/sdc6 /dev/sdd6
mdadm: cannot open device /dev/sdd6: No such file or directory
mdadm: /dev/sdd6 has no superblock - assembly aborted |
/dev/sdd is the failing drive.
/dev/sdb is the completely failed drive that has been replaced.
Here are some snippets from smartctl -a /dev/sdd:
Code: | === START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

Error 16129 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 5f bb 7e 00  Error: UNC 1 sectors at LBA = 0x007ebb5f = 8305503
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 02 5e bb 7e e0 0a   3d+16:18:43.635  READ DMA EXT
  25 00 08 76 ba 7e e0 0a   3d+16:18:43.275  READ DMA EXT
  ca 00 08 3f 00 00 e0 0a   3d+16:18:43.141  WRITE DMA
  ca 00 08 6f 30 00 e0 0a   3d+16:18:43.091  WRITE DMA
  ca 00 08 67 30 00 e0 0a   3d+16:18:43.038  WRITE DMA

Error 16128 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 73 bb 7e 00  Error: UNC 5 sectors at LBA = 0x007ebb73 = 8305523
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:28:37.227  READ DMA EXT
  27 00 00 00 00 00 e0 0a   3d+14:28:37.225  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a   3d+14:28:37.102  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a   3d+14:28:36.982  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a   3d+14:28:36.981  READ NATIVE MAX ADDRESS EXT

Error 16127 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 77 bb 7e 00  Error: UNC 1 sectors at LBA = 0x007ebb77 = 8305527
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:27:52.480  READ DMA EXT
  27 00 00 00 00 00 e0 0a   3d+14:27:52.479  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a   3d+14:27:52.356  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a   3d+14:27:52.236  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a   3d+14:27:52.234  READ NATIVE MAX ADDRESS EXT

Error 16126 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 73 bb 7e 00  Error: UNC 5 sectors at LBA = 0x007ebb73 = 8305523
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:27:00.958  READ DMA EXT
  27 00 00 00 00 00 e0 0a   3d+14:27:00.956  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a   3d+14:27:00.834  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a   3d+14:27:00.713  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a   3d+14:27:00.712  READ NATIVE MAX ADDRESS EXT

Error 16125 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 70 bb 7e 00  Error: UNC 8 sectors at LBA = 0x007ebb70 = 8305520
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:26:40.956  READ DMA EXT
  25 00 08 30 02 8a e0 0a   3d+14:26:40.854  READ DMA EXT
  c8 00 40 20 00 00 e0 0a   3d+14:26:40.853  READ DMA
  25 00 08 a8 6d 70 e0 0a   3d+14:26:40.725  READ DMA EXT
  c8 00 20 00 00 00 e0 0a   3d+14:26:40.633  READ DMA |
I'm hoping someone who understands this stuff can point me in the right direction of possibly preserving even a little of the unreadable partition.
NeddySeagoon
Posted: Sat Apr 19, 2014 3:47 pm
remix,
Install ddrescue and use it to make an image of the most recently failed drive onto the new drive.
Make sure to put the ddrescue log on a third drive.
Code: | # Rescue Logfile. Created by GNU ddrescue version 1.15
# Command line: ddrescue -b 4096 -r 8 -f /dev/sde3 /dev/null /root/rescue_log.txt
# current_pos current_status
0x18D786D0000 ?
# pos size status
0x00000000 0x16E4BE9E000 +
0x16E4BE9E000 0x00002000 *
0x16E4BEA0000 0xFD4F9D000 +
0x17E20E3D000 0x00003000 *
0x17E20E40000 0x8FBF8000 +
0x17EB0A38000 0x00008000 *
0x17EB0A40000 0x358CC2000 +
0x18209702000 0x0000E000 *
0x18209710000 0x2DE00000 +
0x18237510000 0x00010000 *
0x18237520000 0x01AC0000 +
0x18238FE0000 0x00010000 *
0x18238FF0000 0x012C1000 +
0x1823A2B1000 0x0000F000 *
0x1823A2C0000 0x2D752000 +
0x18267A12000 0x0000E000 *
0x18267A20000 0x11EDD4000 +
0x183867F4000 0x0000C000 *
0x18386800000 0x9F1ED0000 +
0x18D786D0000 0x4260530000 ? | is one I did earlier.
DON'T DO THIS YET. Notice that the output file here is /dev/null ... all I was trying to do was get the drive to do one last read and relocate the data so I could grab it later.
You need the best image you can get first.
You need -b 4096 for Advanced Format drives. There is no point in trying to recover 512 bytes at a time if the drive has a 4k block size.
-r 8 (eight retries) is a good place to start.
The log allows ddrescue to resume recovery, even with a different command. It will only work on areas not yet recovered.
ddrescue can work much harder, and you can help it too, but that's for another post.
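As a sketch of that disc-to-disc pass (sdX and sdY are placeholders for the failing and the new drive, and the log lives on a third, healthy filesystem; add the -b and -r values discussed above to match your drives):
Code: | # image the whole failing drive onto the new one; -f is needed because the output is a block device
ddrescue -f -r 8 /dev/sdX /dev/sdY /mnt/thirddrive/rescue.log
# rerunning the same command with the same log resumes the copy and only touches not-yet-recovered areas
ddrescue -f -r 8 /dev/sdX /dev/sdY /mnt/thirddrive/rescue.log |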
remix
Posted: Sun Apr 20, 2014 4:29 am
I got confused when you said "Don't do this yet".
If I am understanding you correctly, I should not perform a disk-to-disk ddrescue yet; first, I should just copy to /dev/null and output the errors to rescue_log.txt.
My setup:
/dev/sda good
/dev/sdb newly installed, partitioned to match the raid (-b 4096)
/dev/sdc good
/dev/sdd failing (-b 512)
and another brand-new drive (-b 4096) that is waiting to replace /dev/sdd once I can recover whatever I can from my fifth raid partition.
/dev/sdd1 working
/dev/sdd2 working
/dev/sdd3 starting to fail
/dev/sdd4 extended
/dev/sdd5 failed (most important)
/dev/sdd6 failed (don't really care)
Code: | ddrescue -b 512 -r 8 -f /dev/sdd5 /dev/null /root/rescue_log.txt |
Then I'll inspect /root/rescue_log.txt, knowing nothing of what those hex addresses mean, and then actually perform the copy:
Code: | ddrescue -b 4096 -f -n /dev/sdd5 /dev/sdb5 /root/rescue_copy_log.txt |
Or should I be copying over the entire disk?
Code: | ddrescue -b 4096 -f -n /dev/sdd /dev/sdb /root/rescue_copy_log.txt |
NeddySeagoon
Posted: Sun Apr 20, 2014 8:31 am
remix,
The disk may fail completely at any time.
Do the disc-to-disc rescue first. The disk-to-/dev/null pass is a final, desperate attempt to get more data recovered.
You may as well copy the entire disk. If you only copy a partition, how will you recover your raid sets? You will need to get the good partitions onto the new drive at some point anyway.
Of course, if you are still using the raid sets in degraded mode, the data will change, and whatever ddrescue copies now from the degraded arrays will be useless.
You should not use ddrescue at all until you know what it is telling you. Read its man and/or info pages.
The hex numbers in the log are byte positions and sizes. The symbol at the end of each line tells you the status of that region.
A log showing perfect data recovery will have exactly one line of data.
Being able to do arithmetic in hex is useful, since you can work out where, and how many, blocks you have lost or still have to recover.
With a bit more poking about in the filesystem, you can get a rough idea of what's there and determine its importance.
That enables an informed decision about giving up or trying harder.
Only use -b 4096 on drives with 4k physical sectors. Use -b 512 on other drives.
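As an aside on the hex arithmetic: the shell can do the conversions for you, so you don't have to work them out by hand. A quick sketch (the values are only illustrative, in the same format the log uses):
Code: | # size of one bad region: 0x3000 bytes is 12288 bytes, i.e. 24 sectors of 512 bytes
echo $(( 0x3000 ))
echo $(( 0x3000 / 0x200 ))
# starting offset of a region, expressed as a 512-byte sector number
echo $(( 0xA2FBE8F000 / 0x200 )) |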
remix
Posted: Sun Apr 20, 2014 10:47 pm
Good point.
Sounds like it would be safer to boot into a live DVD and not mount any of the degraded raid partitions.
I'll install the new blank disk and perform the ddrescue:
Code: | ddrescue -b 4096 -r 8 -n /dev/sdd /dev/sdb /sshfs_mounted_volume/rescue_copy_log.txt |
I've read this guide: http://wiki.gentoo.org/wiki/Ddrescue
I'll check out the man page as well.
Thanks!
remix
Posted: Sun Apr 20, 2014 11:59 pm
I just read the man page, and -b is the block size of the input device, which in my case is 512 (the output device is 4096).
So it is:
Code: | ddrescue -b 512 -r 8 -n /dev/sdd /dev/sdb /sshfs_mounted_volume/rescue_copy_log.txt |
NeddySeagoon
Posted: Mon Apr 21, 2014 5:46 pm
remix,
Give it a go. Post the log when it stops.
You will find that gravity can assist the data recovery: rerun the same command, using the same log file, a total of six times.
Make a copy of the log each time the command completes, so you can look at the differences later.
Between each invocation of the command, move the drive so you try it with each edge and both faces 'down'.
If you have a bearing failure, gravity and the odd orientations can get you one last read, and that's all you need.
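One way to keep the per-orientation runs straight (the log path is the one used earlier in this thread; the .run1/.run2 names are just an example):
Code: | # snapshot the log after each run, before moving the drive to the next orientation
cp /sshfs_mounted_volume/rescue_copy_log.txt /sshfs_mounted_volume/rescue_copy_log.run1.txt
# later, compare two snapshots to see which orientation recovered the most
diff /sshfs_mounted_volume/rescue_copy_log.run1.txt /sshfs_mounted_volume/rescue_copy_log.run2.txt |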
remix
Posted: Tue Apr 22, 2014 8:34 am
Awesome tip! Should I be flipping it during the retries (without stopping or rebooting)?
I just finished the first pass; I did set it to retry 8 times.
Code: | GNU ddrescue 1.16
Press Ctrl-C to interrupt
rescued: 1000 GB, errsize: 543 kB, current rate: 0 B/s
ipos: 717936 MB, errors: 40, average rate: 8546 kB/s
opos: 717936 MB, time since last successful read: 4.3 m
Retrying bad sectors... Retry 1 |
NeddySeagoon
Posted: Wed Apr 23, 2014 5:57 pm
remix,
That looks fairly good so far.
Code: | rescued: 1000 GB, errsize: 543 kB, errors: 40 |
You have 543 kB still to recover, spread over 40 regions of the drive.
It's time to tell ddrescue to try harder, now that most of the data has been read.
By using the same input device, output device and log file, ddrescue will try to fill in the holes in your image and ignore data already recovered.
Look back at the copies of the logs and determine which drive orientation produced the best results.
You will still run all four edges and both faces, but treat each drive spin-up as if it were the last, so start with that orientation.
--retries= can be increased. I tend to try 8, 16, 32, 64 and 128.
--direct may be useful; it has no effect on some operating systems.
--try-again can help when you have a group of contiguous blocks that can't be read.
--retrim will help too.
After you have tried the above on all six orientations (just with --retries=8), it's time to look at what is still missing; see the sketch below.
If it's unallocated space, it doesn't matter.
If it's a file or two, they are gone - you need to decide whether you need those files and how much time you want to spend on data recovery.
If it's a directory, then the files in that directory and its child directories cannot be accessed normally, but they may be perfectly recoverable.
If it's filesystem metadata, it depends which aspects of the metadata are damaged.
Please post the log - like my sample above - next time, and we can begin to take into account what's damaged.
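If you want a quick total of what is still missing, something along these lines works (a sketch that assumes GNU awk for the hex conversion; substitute your own log file name):
Code: | # sum the sizes of all '-' (still unreadable) regions in the ddrescue log
awk '$3 == "-" { bad += strtonum($2); n++ }
     END { printf "%d bad regions, %d bytes, %d sectors of 512 bytes\n", n, bad, bad/512 }' rescue_copy_log.txt |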
remix
Posted: Thu Apr 24, 2014 8:00 am
I'm OK with some of the files being completely inaccessible - well, let's see how many that would be.
Code: | # Rescue Logfile. Created by GNU ddrescue version 1.16
# Command line: ddrescue -b 512 -r 8 -f /dev/sdd /dev/sdb ddrescue.log
# current_pos current_status
0xA728A5FE00 +
# pos size status
0x00000000 0xA2FBE8F000 +
0xA2FBE8F000 0x00003000 -
0xA2FBE92000 0x008E0000 +
0xA2FC772000 0x00006000 -
0xA2FC778000 0x0001A000 +
0xA2FC792000 0x00001000 -
0xA2FC793000 0x00199000 +
0xA2FC92C000 0x00003000 -
0xA2FC92F000 0x000C6000 +
0xA2FC9F5000 0x00001000 -
0xA2FC9F6000 0x00488000 +
0xA2FCE7E000 0x00001000 -
0xA2FCE7F000 0x00008000 +
0xA2FCE87000 0x00001000 -
0xA2FCE88000 0x000E6000 +
0xA2FCF6E000 0x00005000 -
0xA2FCF73000 0x0060A000 +
0xA2FD57D000 0x00004000 -
0xA2FD581000 0x0000B000 +
0xA2FD58C000 0x00001000 -
0xA2FD58D000 0x00002000 +
0xA2FD58F000 0x0000F000 -
0xA2FD59E000 0x000D0000 +
0xA2FD66E000 0x00002000 -
0xA2FD670000 0x00042000 +
0xA2FD6B2000 0x00006000 -
0xA2FD6B8000 0x000A1000 +
0xA2FD759000 0x00001000 -
0xA2FD75A000 0x00016000 +
0xA2FD770000 0x00002000 -
0xA2FD772000 0x00ADB000 +
0xA2FE24D000 0x00002000 -
0xA2FE24F000 0x00005000 +
0xA2FE254000 0x00001000 -
0xA2FE255000 0x00001000 +
0xA2FE256000 0x0000B000 -
0xA2FE261000 0x00172000 +
0xA2FE3D3000 0x00005000 -
0xA2FE3D8000 0x002C8000 +
0xA2FE6A0000 0x00006000 -
0xA2FE6A6000 0x00006000 +
0xA2FE6AC000 0x00001000 -
0xA2FE6AD000 0x00002000 +
0xA2FE6AF000 0x00001000 -
0xA2FE6B0000 0x1C121000 +
0xA31A7D1000 0x00002000 -
0xA31A7D3000 0x00137000 +
0xA31A90A000 0x00001000 -
0xA31A90B000 0x2AD8C000 +
0xA345697000 0x00002000 -
0xA345699000 0x001BE000 +
0xA345857000 0x00002000 -
0xA345859000 0x3E2D43000 +
0xA72859C000 0x00001000 -
0xA72859D000 0x00005000 +
0xA7285A2000 0x00007000 -
0xA7285A9000 0x003E4000 +
0xA72898D000 0x00010000 -
0xA72899D000 0x000B4000 +
0xA728A51000 0x0000F000 -
0xA728A60000 0x41B8356000 + |
Would it be safe then to just replace that 'fourth' drive with this new copied drive? Will it function normally except for those few files or directories?
NeddySeagoon
Posted: Thu Apr 24, 2014 9:58 pm
remix,
Code: | # pos size status
0x00000000 0xA2FBE8F000 + |
says that from the start of the drive up to offset 0xA2FBE8F000, all the data has been recovered.
At Code: | # pos size status
0xA2FBE8F000 0x00003000 - | is the first bad area. A 512-byte block is 0x200 bytes, so 0x1000 is eight blocks (it's hex); this area of 0x3000 bytes is therefore 24 (decimal) blocks.
Each status + is recovered data. Each status - is data yet to be recovered.
You can do better than Code: | ddrescue -b 512 -r 8 -f /dev/sdd /dev/sdb ddrescue.log |
Code: | ddrescue -b 512 -r 16 --direct --try-again --retrim -f /dev/sdd /dev/sdb ddrescue.log | may get back more data.
Don't forget to do all six faces/edges.
The idea is to restart the raid with this drive in place of the failing drive - but not yet.
What metadata version is the raid set? You need to know that the metadata has been recovered.
mdadm -E /dev/.... will tell you.
What filesystem is on the raid set?
Once you add the recovered drive into the raid set, you have decided that further data recovery is not worthwhile.
The raid can't tell that the data is corrupt because of the unrecovered areas. It will just operate in degraded mode and assume all is well.
When you add another drive, it will regenerate the redundant data based on whatever is on the other drives at that time.
remix
Posted: Mon Apr 28, 2014 5:24 am
Thanks for the info, it makes sense to me.
The filesystem on the raid partitions is reiserfs.
The new log is long, so I pastied it here: http://pastie.org/9118746
remix
Posted: Mon Apr 28, 2014 6:14 am
I don't think I recovered enough; not sure what I did wrong (other than not having backups).
Code: | OptimusPrime / # mdadm --assemble /dev/md4 --scan --force
mdadm: /dev/sdd5 has no superblock - assembly aborted
OptimusPrime / # mdadm --assemble /dev/md5 --scan --force
mdadm: /dev/sdb6 has no superblock - assembly aborted
OptimusPrime / # mdadm --assemble /dev/md6 --scan --force
mdadm: /dev/sdb7 has no superblock - assembly aborted
|
I have two raid sets that seem to be OK:
Code: | OptimusPrime / # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md4 : inactive sda5[0](S) sdd5[3](S) sdc5[2](S)
527373312 blocks
md1 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]
102558528 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1]
776397120 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md3 : inactive sda3[0](S) sdb3[4](S) sdd3[3](S) sdc3[2](S)
859412736 blocks
unused devices: <none> |
Should I just forfeit all the data in those three raid sets and reformat?
NeddySeagoon
Posted: Mon Apr 28, 2014 9:47 pm
remix,
We are not done yet.
Can you remember the parameters to --create that you used when you created md3 and md4?
What does Code: | mdadm -E /dev/sd[abcd]3 | show?
Also Code: | mdadm -E /dev/sd[abcd]5 |
How many elements are in each raid set?
How many have at least some data on them now?
Read and understand RAID Recovery. In a nutshell, you can recreate the metadata.
You need to do it on a degraded array. It's best if you don't get the raid metadata version wrong: metadata version 0.90 lives at the end of each partition and the filesystem starts in the normal place, as if raid were not in use. With metadata version >= 1, the metadata is at the start of the volume, where the filesystem superblock would otherwise be.
You can recover from getting it wrong, but it's best that you don't need to.
The basic idea is to run mdadm --create to rewrite the raid metadata in exactly the way you did when you first made the raid set, but in degraded mode and with the known-clean option, so the raid is not resynced. That leaves your original data in place.
Were you able to recover any more data?
remix
Posted: Thu May 08, 2014 10:57 pm
I created the md devices using:
Code: | mdadm --create --verbose --level=5 --raid-devices=4 /dev/md4 /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5
mdadm --create --verbose --level=5 --raid-devices=4 /dev/md5 /dev/sdb6 /dev/sdc6 /dev/sdd6 /dev/sde6
mdadm --create --verbose --level=5 --raid-devices=4 /dev/md6 /dev/sdb7 /dev/sdc7 /dev/sdd7 /dev/sde7
... |
The output of mdadm -E /dev/sd[abcd]5 looks like I'll need to restore the superblock on /dev/sdb.
/dev/sdb is the drive that I restored the old failed drive onto.
When you wrote:
Quote: | The basic idea is to run mdadm --create to rewrite the raid metadata in exactly the way you did when you first made the raid set, but in degraded mode and with the known-clean option, so the raid is not resynced. That leaves your original data in place. |
do you mean 'in degraded mode' by adding only 3 of the 4 disks?
I read the RAID Recovery guide, and I didn't get how to perform what you asked.
NeddySeagoon
Posted: Fri May 09, 2014 11:44 am
remix,
There is no need to use --create on the raid sets that are now working. There is nothing to do to them if /proc/mdstat shows that they are up to strength.
If /dev/md4 is the problem, I need the output of Code: | mdadm -E /dev/sd[abcd]5 | and I need to know what you think is on each /dev/sd?5.
Your command Code: | mdadm --create --verbose --level=5 --raid-devices=4 /dev/md4 /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5 | makes use of a few mdadm defaults, like:
--chunk= ... it's now 512k; it used to be 64k
--metadata= ... it's now 1.2; it used to be 0.90
The raid metadata that you need to create is a data structure that points at your data. Getting --chunk= incorrect is harmless - you can have as many goes as you want - but it must be correct to allow the kernel to read the filesystem on the raid. It says how big the individual data elements on the drive are, so reading 512k at a time when it's actually 64k doesn't work.
The --metadata= setting is rather more important. It says where the raid metadata sits on the underlying block device. If it's wrong, the new superblock will overwrite either the end of your filesystem or its start (the filesystem, not the raid).
I was considering --create in degraded mode, possibly with --assume-clean, depending on what Code: | mdadm -E /dev/sd[abcd]5 | shows and what you believe is on each partition, and with the explicit --chunk= and --metadata= values that Code: | mdadm -E /dev/sd[abcd]5 | will show.
It's also important to choose the 'best' 3 of the four elements from the raid set.
Code: | $ sudo mdadm -E /dev/sda5
Password:
/dev/sda5:
Magic : a92b4efc
Version : 0.90.00 <----
UUID : 5e3cadd4:cfd2665d:96901ac7:6d8f5a5d
Creation Time : Sat Apr 11 20:30:16 2009
Raid Level : raid5
Used Dev Size : 5253120 (5.01 GiB 5.38 GB)
Array Size : 15759360 (15.03 GiB 16.14 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 126
Update Time : Sun Mar 16 11:02:16 2014
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Checksum : 78b17729 - correct
Events : 77
Layout : left-symmetric
Chunk Size : 64K <-----
|
I've highlighted my --chunk= and --metadata= values above.
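If you just want to pull those two values out of each member, a small sketch (device names as used in this thread; the grep is only a convenience):
Code: | for d in /dev/sd[abcd]5; do
  echo "== $d"
  mdadm -E "$d" | grep -E 'Version|Chunk Size'
done |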
remix
Posted: Sun Jun 15, 2014 6:48 am
Chunk Size : 64K
Metadata : 0.90.00
Code: |
# mdadm -E /dev/sd[abcd]5
/dev/sda5:
Magic : a92b4efc
Version : 0.90.00
UUID : bfb60f39:66b601a2:fbf2ea5a:12bfd232 (local to host OptimusPrime)
Creation Time : Mon Mar 8 22:09:01 2010
Raid Level : raid5
Used Dev Size : 175791104 (167.65 GiB 180.01 GB)
Array Size : 527373312 (502.94 GiB 540.03 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 4
Update Time : Fri Apr 18 18:23:26 2014
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0
Checksum : 87adfde0 - correct
Events : 17379
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 0 8 5 0 active sync /dev/sda5
0 0 8 5 0 active sync /dev/sda5
1 1 0 0 1 faulty removed
2 2 8 37 2 active sync /dev/sdc5
3 3 0 0 3 faulty removed
mdadm: No md superblock detected on /dev/sdb5.
/dev/sdc5:
Magic : a92b4efc
Version : 0.90.00
UUID : bfb60f39:66b601a2:fbf2ea5a:12bfd232 (local to host OptimusPrime)
Creation Time : Mon Mar 8 22:09:01 2010
Raid Level : raid5
Used Dev Size : 175791104 (167.65 GiB 180.01 GB)
Array Size : 527373312 (502.94 GiB 540.03 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 4
Update Time : Fri Apr 18 18:23:26 2014
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0
Checksum : 87adfe04 - correct
Events : 17379
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 2 8 37 2 active sync /dev/sdc5
0 0 8 5 0 active sync /dev/sda5
1 1 0 0 1 faulty removed
2 2 8 37 2 active sync /dev/sdc5
3 3 0 0 3 faulty removed
/dev/sdd5:
Magic : a92b4efc
Version : 0.90.00
UUID : bfb60f39:66b601a2:fbf2ea5a:12bfd232 (local to host OptimusPrime)
Creation Time : Mon Mar 8 22:09:01 2010
Raid Level : raid5
Used Dev Size : 175791104 (167.65 GiB 180.01 GB)
Array Size : 527373312 (502.94 GiB 540.03 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 4
Update Time : Fri Apr 18 18:22:24 2014
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
Checksum : 87adb9e3 - correct
Events : 17375
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 8 53 3 active sync /dev/sdd5
0 0 8 5 0 active sync /dev/sda5
1 1 0 0 1 faulty removed
2 2 8 37 2 active sync /dev/sdc5
3 3 8 53 3 active sync /dev/sdd5 |
NeddySeagoon
Posted: Sun Jun 15, 2014 12:02 pm
remix,
First of all, understand that something will be corrupt, but we have no idea what.
As it stands, that raid set should assemble and run with the --force option; if not, we need to rewrite the raid metadata, which does nothing to the user data on the raid.
It's just like rewriting a partition table.
We must be sure to pass the chunk size and metadata version to mdadm --create, as 64k and 0.90 are no longer the defaults, and we want to recreate the raid metadata as it was, so your data reappears.
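For that first attempt - assembling with --force - a sketch using the member names from your mdadm -E output (leave out /dev/sdb5, which has no superblock):
Code: | mdadm --assemble --force /dev/md4 /dev/sda5 /dev/sdc5 /dev/sdd5
cat /proc/mdstat |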
Code: | /dev/sda5:
Update Time : Fri Apr 18 18:23:26 2014
Events : 17379
/dev/sdc5:
Update Time : Fri Apr 18 18:23:26 2014
Events : 17379
/dev/sdd5:
Update Time : Fri Apr 18 18:22:24 2014
Events : 17375 |
Notice the update times and event counts: /dev/sdd5 is a few writes behind. They may be anything. Also, /dev/sdb5 is missing.
Before you go any further, understand that assembling the raid and mounting any filesystem it may contain are separate operations.
Getting the raid assembled is a prerequisite to reading the filesystem, but depending on what's damaged, there may be further steps to get at your data.
If all else fails ...
Code: | mdadm --create /dev/md4 --metadata=0.90 --raid-devices=4 --chunk=64 --level=raid5 --assume-clean /dev/sda5 missing /dev/sdc5 /dev/sdd5 |
Before you do that, make sure you understand what it is trying to do.
After it completes, Code: | mdadm -E /dev/sd[abcd]5 | should show
Code: | Number Major Minor RaidDevice State
0 0 8 5 0 active sync /dev/sda5
1 1 0 0 1 missing
2 2 8 37 2 active sync /dev/sdc5
3 3 8 53 3 active sync /dev/sdd5 |
It's important that the partitions are in the same slots.
The event counts will all be zero, and the raid should be assembled and running. Look in /proc/mdstat.
So far, so good. The next step is to try to mount /dev/md4 read-only and look around:
Code: | mount -o ro /dev/md4 /mnt/someplace |
There are lots of reasons that can fail, and a few things to try to fix it.
Do not be tempted to run fsck. It makes guesses about what to do and often does the wrong thing.
You are not ready to allow any writes to the filesystem yet, even if it mounts.