Gentoo Forums
Software RAID5 problems [Closed]

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Wed Oct 17, 2007 2:37 am    Post subject: Software RAID5 problems [Closed]

I'm not sure if one of my drives failed or the array just stopped.

Quote:

# mdadm --verbose --detail --scan
ARRAY /dev/md0 level=raid5 num-devices=4 spares=1 UUID=caf566b6:70af8e3e:1d8c8a89:78bbe66e
devices=/dev/sdb1,/dev/sdc1,/dev/sdd1


Quote:

# mdadm --verbose --examine --scan
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=caf566b6:70af8e3e:1d8c8a89:78bbe66e
spares=2 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1,/dev/sdd1


Quote:

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : inactive sdb1[1] sdd1[4](S) sdc1[2]
1220963776 blocks

unused devices: <none>


In my limited experience, it looks like /dev/sda1 dropped out, but I don't know how to test whether it's really bad, or how to add it back in if it's not. Every Assemble or Create command I try says that /dev/md0 is busy.

Thanks in advance. Please feel free to speak slowly and don't assume I know what any of this means. :D


Last edited by ctav01 on Thu Nov 08, 2007 7:10 pm; edited 1 time in total

Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Wed Oct 17, 2007 5:27 pm

Can you do
Quote:
mdadm --verbose --detail /dev/md0
and post what it says please?

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Wed Oct 17, 2007 8:26 pm

Quote:

# mdadm --verbose --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sat Mar 17 19:19:27 2007
Raid Level : raid5
Used Dev Size : 195358336 (186.31 GiB 200.05 GB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Tue Oct 16 06:36:21 2007
State : active, degraded, Not Started
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1

Layout : left-symmetric
Chunk Size : 64K

UUID : caf566b6:70af8e3e:1d8c8a89:78bbe66e
Events : 0.7246

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       0        0        3      removed

       4       8       49        -      spare   /dev/sdd1

Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Wed Oct 17, 2007 9:05 pm

Erk... that doesn't look good...

It seems one of the drives has stopped being detected and the other has, for some reason, been relegated to 'spare'.
A 4-device RAID5 array can't (normally) run with only 2 active devices, which is why the array has stopped itself (it, too, is probably thinking WTF?! ;))

Check the drives, esp. that the connectors are all in firmly (one thing I hate about SATA vs IDE: SATA plug/socket design and build quality is, by and large, utter utter crap, and very vulnerable to 'chip creep'. Should be plug creep, I guess :P)

We need at least three drives to run a degraded 4-device RAID5, so hopefully the spare or the missing drive can be coaxed into being re-integrated into the array; then it won't matter whether the 4th one can be re-integrated or has to be rebuilt...
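
As a rough illustration of the "at least three out of four" rule, here is a minimal sketch that reads `mdadm --detail` output and compares the "Active Devices" count with "Raid Devices" minus one. The function name `raid5_can_start` is made up for this example, and it assumes the field labels look like the output posted above.

```shell
# Hypothetical helper: reads `mdadm --detail` output on stdin and reports
# whether a RAID5 array has enough in-sync members to run.
# RAID5 tolerates one missing member, so it needs (Raid Devices - 1).
# Spares do not count until they have been rebuilt into the array.
raid5_can_start() {
    awk -F' *: *' '
        /Raid Devices/   { total = $2 + 0;  have_total = 1 }
        /Active Devices/ { active = $2 + 0; have_active = 1 }
        END {
            if (!have_total || !have_active) {
                print "unknown"
                exit
            }
            if (active >= total - 1) {
                print "can start (" active " of " total " in sync)"
            } else {
                print "cannot start (" active " of " total " in sync)"
            }
        }'
}
```

It would be used as `mdadm --detail /dev/md0 | raid5_can_start` (as root). With the numbers posted above (4 raid devices, 2 active) it reports that the array cannot start.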

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Mon Oct 22, 2007 4:51 am

Sorry, it took me a while to get the machine on a bench. So I double-checked the SATA connections and:

Quote:

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : inactive sdb1[1](S) sdd1[4](S) sdc1[2](S) sda1[0](S)
1416322112 blocks

unused devices: <none>


Quote:

# mdadm --verbose --examine --scan
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=caf566b6:70af8e3e:1d8c8a89:78bbe66e
spares=2 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1,/dev/sdd1


Quote:

mdadm --verbose --detail --scan
mdadm: md device /dev/md0 does not appear to be active.


Quote:

mdadm --verbose --detail /dev/md0
mdadm: md device /dev/md0 does not appear to be active.



So I tried:
Quote:

# mdadm -C /dev/md0 --verbose -l 5 -n 4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 64K
mdadm: Cannot open /dev/sda1: Device or resource busy
mdadm: Cannot open /dev/sdb1: Device or resource busy
mdadm: Cannot open /dev/sdc1: Device or resource busy
mdadm: Cannot open /dev/sdd1: Device or resource busy
mdadm: create aborted


Any suggestions?
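
A likely reason for the "Device or resource busy" messages is that the inactive md0 shown in /proc/mdstat is still holding all four partitions, so nothing else can open them until that array is stopped. The sketch below only lists arrays that /proc/mdstat reports as inactive; the function name `inactive_arrays` is invented for this example, and it assumes the mdstat layout shown in this thread.

```shell
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints
# the name of every array that is listed as inactive, one per line.
# An inactive array still claims its member devices, which is one way
# to end up with "Device or resource busy" from mdadm.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}
```

Usage would be `inactive_arrays < /proc/mdstat`, followed by `mdadm --stop /dev/md0` for each array it names before retrying an assemble. Retrying `--assemble` is the gentler option here, since `--create` writes new superblocks to the members.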

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Tue Oct 23, 2007 11:17 pm

bump

heschne
n00b

Joined: 25 Oct 2007
Posts: 1
Location: Bavaria

Posted: Thu Oct 25, 2007 8:13 pm

Still interested in this?

I got the following (after adding another controller, rearranging the cables, which changed the sd? ordering, rebooting, changing my mind and rearranging again):
Code:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc[0](S) sdd[4](S) sda[3](S) sdb[1](S)
      1250284544 blocks
       
unused devices: <none>


Checking the individual drives gave a messy picture (common header omitted):
Code:

# mdadm -E /dev/sda
...
      Number   Major   Minor   RaidDevice State
this     3       8        0        3      active sync   /dev/sda
   0     0       8       32        0      active sync   /dev/sdc
   1     1       8       16        1      active sync   /dev/sdb
   2     2       0        0        2      faulty removed
   3     3       8        0        3      active sync   /dev/sda
   4     4       8       48        4      spare   /dev/sdd

 # mdadm -E /dev/sdb
...
      Number   Major   Minor   RaidDevice State
this     1       8       16        1      active sync   /dev/sdb
   0     0       8       32        0      active sync   /dev/sdc
   1     1       8       16        1      active sync   /dev/sdb
   2     2       0        0        2      faulty removed
   3     3       8        0        3      active sync   /dev/sda
   4     4       8       48        4      spare   /dev/sdd

# mdadm -E /dev/sdc
....
      Number   Major   Minor   RaidDevice State
this     0       8       32        0      active sync   /dev/sdc
   0     0       8       32        0      active sync   /dev/sdc
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       8       48        4      spare   /dev/sdd

# mdadm -E /dev/sdd
...
      Number   Major   Minor   RaidDevice State
this     4       8       48        4      spare   /dev/sdd
   0     0       8       32        0      active sync   /dev/sdc
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       8       48        4      spare   /dev/sdd


and of course:
Code:

# mdadm --detail /dev/md0
mdadm: md device /dev/md0 does not appear to be active.

as well as
Code:

# mdadm -A /dev/md0
mdadm: /dev/md0 assembled from 1 drive and 1 spare - not enough to start the array.


WHAT HELPED....

Stop the device (in case you have messed with it some more):
Code:

# mdadm -S /dev/md0
mdadm: stopped /dev/md0


and start it again, manually forcing the devices into the raid
Code:

# mdadm -Af /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
mdadm: forcing event count in /dev/sdb(1) from 34689 upto 34700
mdadm: forcing event count in /dev/sda(3) from 34689 upto 34700
mdadm: clearing FAULTY flag for device 1 in /dev/md0 for /dev/sdb
mdadm: clearing FAULTY flag for device 0 in /dev/md0 for /dev/sda
mdadm: /dev/md0 has been started with 3 drives (out of 4) and 1 spare.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc[0] sdd[4] sda[3] sdb[1]
      937713408 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
      [>....................]  recovery =  0.0% (304632/312571136) finish=341.6min speed=15231K/sec
     
unused devices: <none>




Now this is the state. It looks good (I had it at that level before), so I am confident it will rebuild.
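
The "forcing event count" lines above show what the force option does: members whose event counter lags behind get bumped up to the newest value. Before forcing, it can be useful to see how far apart the counters are. A minimal sketch, assuming `mdadm --examine` prints an "Events" line as in this thread; the function names `examine_events` and `list_events` are invented for illustration.

```shell
# Hypothetical helper: prints the event counter found in
# `mdadm --examine` output read on stdin (first "Events" line only).
examine_events() {
    awk -F' *: *' '/Events/ { print $2; exit }'
}

# Hypothetical helper: prints "device events" for each device given.
# It needs root and real array members, so it is only defined here.
list_events() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" | examine_events)"
    done
}
```

For example, `list_events /dev/sda /dev/sdb /dev/sdc /dev/sdd` would print one line per member. A member that is only a few events behind the others is usually a reasonable candidate for a forced assemble; one that is far behind holds correspondingly older data.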

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Fri Oct 26, 2007 6:05 am

Definitely still interested.

Quote:

# mdadm -S /dev/md0
mdadm: stopped /dev/md0


Quote:

# mdadm -Af /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
mdadm: no recogniseable superblock on /dev/sda
mdadm: /dev/sda has no superblock - assembly aborted


*sigh*

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Mon Oct 29, 2007 1:31 pm

bump

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Wed Oct 31, 2007 12:19 am

Still need help with this. I tried assembling the array and this is what I got. Not sure what the slot 3 thing means.

Code:

# mdadm -A /dev/md0 --verbose /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 4.
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: added /dev/sdc1 to /dev/md0 as 2
mdadm: no uptodate device for slot 3 of /dev/md0
mdadm: added /dev/sdd1 to /dev/md0 as 4
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: /dev/md0 assembled from 2 drives and 1 spare - not enough to start the array.


And then I reread an earlier post and tried forcing it.

Code:

# mdadm -Af /dev/md0 --verbose /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 4.
mdadm: forcing event count in /dev/sda1(0) from 7228 upto 7246
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: added /dev/sdc1 to /dev/md0 as 2
mdadm: no uptodate device for slot 3 of /dev/md0
mdadm: added /dev/sdd1 to /dev/md0 as 4
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 3 drives (out of 4) and 1 spare.


Now I'm not sure how to test it or what to replace.

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Thu Nov 01, 2007 1:58 am

Well, I was able to get it assembled but it looks like my data is all gone. And for a while there, it was showing a failed drive but now it all seems fine so I'm not sure what to do now. Any comments please?

Code:

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      586075008 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/187 pages [0KB], 512KB chunk

unused devices: <none>


Code:

# mdadm --verbose --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Mar 17 19:19:27 2007
     Raid Level : raid5
     Array Size : 586075008 (558.92 GiB 600.14 GB)
  Used Dev Size : 195358336 (186.31 GiB 200.05 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Oct 31 12:10:03 2007
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : caf566b6:70af8e3e:1d8c8a89:78bbe66e
         Events : 0.7260

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1


Code:

# mdadm --verbose --examine --scan
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=caf566b6:70af8e3e:1d8c8a89:78bbe66e
   devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1,/dev/sdd1


I guess what I'm asking is now that it appears to all be back up and running, how do I test it enough to trust it again? I know RAID5 is no substitute for backing up but one day it was up and the next day it was gone with no warning. There's got to be something I'm missing.

Thanks in advance.
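
One way to build some trust in a reassembled array is to have the md driver read every stripe and verify the parity. The sketch below assumes a kernel whose md driver exposes `sync_action` and `mismatch_cnt` under /sys/block/mdX/md; the function names and the optional sysfs-root argument are made up for this example (the extra argument only exists so the functions can be tried against a fake directory tree).

```shell
# Hypothetical helper: starts a read-only consistency check ("scrub").
# $1 = array name, e.g. md0
# $2 = sysfs root (illustrative parameter, defaults to /sys)
md_start_check() {
    printf 'check\n' > "${2:-/sys}/block/$1/md/sync_action"
}

# Hypothetical helper: prints the mismatch count left by the last check.
# A value of 0 means no inconsistent stripes were found.
md_mismatches() {
    cat "${2:-/sys}/block/$1/md/mismatch_cnt"
}
```

As root, `md_start_check md0` starts the check, `cat /proc/mdstat` shows its progress, and `md_mismatches md0` can be read once it has finished. A clean check says the stripes are consistent and readable; it does not prove the files themselves are intact, so comparing against a backup is still worthwhile.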

Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Fri Nov 09, 2007 7:53 pm

Got your PM; Sorry, I haven't added anything because TBH I don't have much to add!! :(

It's very odd that the array would have just gone; I mean, you can check that the data and power cable connectors are secure at both ends, that the drives are getting enough power and all of that, but to get a double-HD failure is quite worrying!

The data loss is likely due to you essentially reconstructing the array, but if two of the disks had really failed then that data would have been gone anyway...

All I can think of is to run the manufacturer's HD utils on all the drives (I have an old version of Maxtor's PowerMAX which I boot with GRUB off a syslinux floppy emulator into FreeDOS) to make sure there are no impending problems, and also memtest the heck out of your RAM just in case.

Bad RAM is one of the deadliest things for data corruption, which is why server people are willing to pay the superlatively extortionate price for ECC memory.

This is all standard checking stuff 'tho - I really have no clue as to what might have caused your problem in the first place :(

(And now I'm terrified of the possibility of it happening to me; I haven't got enough money to back up my own array, which is of a similar configuration!! :shock:)

ctav01
Tux's lil' helper

Joined: 11 Feb 2004
Posts: 81
Location: Pleasanton, CA

Posted: Sat Nov 10, 2007 6:22 am

Thanks for the reply.

Actually, I got very lucky. I have no idea why the array went down but the forced assemble seemed to restore it and I can't find any missing data (I'm backing everything up as quick as I can though).

I'm not exactly sure how SMART works but I've got it looking at all the drives and one is showing 994 errors and a second one 17 errors so I think I've found the one to replace.

Thanks again for your help.

Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Sun Nov 11, 2007 12:10 am

Glad it had a happy ending :)

One thing - take SMART readings with a pinch of salt; my Seagate drives report hundreds of millions of errors while my WD reports only a dozen, but neither of them actually has anything wrong with it.
If the 994 and 17 error drives are the exact same model 'tho, then yeah, the 994 may warrant further checking ;)
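
Since raw error totals are hard to compare between vendors, it may be more telling to look at individual attributes in the `smartctl -A` table, for instance the reallocated and pending sector counts. A small sketch, assuming the usual ten-column attribute table with the raw value in the last column; the function name `smart_raw` is invented for this example.

```shell
# Hypothetical helper: prints the RAW_VALUE column for one attribute
# from `smartctl -A` output read on stdin.
# $1 = attribute name, e.g. Reallocated_Sector_Ct
# Only the first word of the raw value is printed.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example, `smartctl -A /dev/sda | smart_raw Reallocated_Sector_Ct` and the same with `Current_Pending_Sector`. Running `smartctl -t long /dev/sda` and later reading `smartctl -l selftest /dev/sda` gives a full surface read test on top of that.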