15 March 2010

RAIDers of the Lost Disk

A few years ago I decided it was a good idea to have a dedicated file server in my home. After a bit of looking around, I settled on a Maxtor Shared Storage II - 1TB. It has two 500GB drives inside and can be set up as a RAID-0 or RAID-1 device. It is configured via a simple web interface.
I bought one and configured it as a RAID-1 device. After a short while, I also decided to update the firmware with a version based on OpenMSS.

Shortly after the warranty expired, one of the drives failed badly. The clicking coming out of it was pretty loud but, in a twisted way, also quite pleasant, somehow clicking along with Bob Marley's "Redemption Song". Anyway, I managed to replace the faulty drive and rebuild the array, and my file server has been living happily ever since... until yesterday.

It was either a power failure or a loose PSU connector, or both. As a result, the power light started flashing alternately green (once) and amber (once). I went to the diagnostics page only to find that my device was functioning "within normal parameters". Hmmm... that can't be right.

~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue May  5 11:18:29 2009
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.515034

    Number   Major   Minor   RaidDevice State
       0       8       22        0      active sync   /dev/sdb6
       1       0        0        -      removed

What??? Removed??? How???
~ # mdadm --examine /dev/sda6
mdadm: cannot open /dev/sda6: No such file or directory
mdadm: cannot find device size for /dev/sda6: No such file or directory

Hmmm...
~ # ls /dev/sd*
/dev/sda   /dev/sda3  /dev/sda6  /dev/sdb1  /dev/sdb4  /dev/sdb7
/dev/sda1  /dev/sda4  /dev/sda7  /dev/sdb2  /dev/sdb5
/dev/sda2  /dev/sda5  /dev/sdb   /dev/sdb3  /dev/sdb6

~ # cat /proc/partitions
major minor  #blocks  name

   8    16  488386584 sdb
   8    17     257008 sdb1
   8    18     257040 sdb2
   8    19     257040 sdb3
   8    20          1 sdb4
   8    21     506016 sdb5
   8    22  487106833 sdb6
   8     0  488386584 sdc
   8     1     257008 sdc1
   8     2     257040 sdc2
   8     3     257040 sdc3
   8     4          1 sdc4
   8     5     506016 sdc5
   8     6  487106833 sdc6
  31     0        256 mtdblock0
   9     0  487106752 md0

How exactly did my sda partitions become sdc? Reboot? Yes, reboot!
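The odd part of the two listings above is that /dev still shows sda nodes while /proc/partitions only knows about sdb and sdc. Here is a minimal sketch for spotting device nodes that have no matching kernel entry, assuming a POSIX shell with awk and grep available. The function names kernel_partitions and stale_nodes are made up for illustration.

```shell
#!/bin/sh
# Print the partition names the kernel knows about.
# Reads text in /proc/partitions format on stdin and keeps only
# the data rows ("major minor #blocks name").
kernel_partitions() {
    awk '$1 ~ /^[0-9]+$/ && NF == 4 { print $4 }'
}

# Print every device node given as an argument that has no entry
# in the /proc/partitions text supplied on stdin.
stale_nodes() {
    known=$(kernel_partitions)
    for node in "$@"; do
        name=${node#/dev/}
        if ! printf '%s\n' "$known" | grep -qxF "$name"; then
            echo "$node"
        fi
    done
}
```

On the box itself this would be called as `stale_nodes /dev/sd* < /proc/partitions`. With the listings above it should report the sda nodes, since the kernel only lists sdb and sdc at that point.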
... [reboot] ...
~ # cat /proc/partitions
major minor  #blocks  name

   8     0  488386584 sda
   8     1     257008 sda1
   8     2     257040 sda2
   8     3     257040 sda3
   8     4          1 sda4
   8     5     506016 sda5
   8     6  487106833 sda6
   8    16  488386584 sdb
   8    17     257008 sdb1
   8    18     257040 sdb2
   8    19     257040 sdb3
   8    20          1 sdb4
   8    21     506016 sdb5
   8    22  487106833 sdb6
  31     0        256 mtdblock0
   9     0  487106752 md0

That's better, but how... ??? Anyway, let's check sda6.
~ # mdadm --query /dev/sda6
/dev/sda6: is not an md array
/dev/sda6: device 1 in 2 device mismatch raid1 md0.  Use mdadm --examine for more detail.

~ # mdadm --examine /dev/sda6
/dev/sda6:
          Magic : a92b4efc
        Version : 00.90.01
           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Fri May  1 20:10:03 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 34c79134 - correct
         Events : 0.513042


      Number   Major   Minor   RaidDevice State
this     1       8        6        1      active sync   /dev/sda6

   0     0       8       22        0      active sync   /dev/sdb6
   1     1       8        6        1      active sync   /dev/sda6

Mismatched, as I would expect, but it's clean. Good.
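One concrete sign that sda6 fell behind is its event counter: 0.513042 in the --examine output, against 0.515034 on the running array. A member with a lower count has missed updates and needs a resync. Here is a rough sketch of that comparison, assuming the "hi.lo" Events format printed in this post (other mdadm versions may print a single number). The helper names events_of and events_older are hypothetical.

```shell
#!/bin/sh
# Pull the Events value out of `mdadm --examine` or `mdadm --detail`
# output supplied on stdin.
events_of() {
    awk -F' : ' '$1 ~ /Events$/ { gsub(/[ \t]+$/, "", $2); print $2; exit }'
}

# Succeed (exit 0) if event count $1 is older than event count $2.
# Compares the part before the dot first, then the part after it.
events_older() {
    a_hi=${1%%.*}; a_lo=${1#*.}
    b_hi=${2%%.*}; b_lo=${2#*.}
    [ "$a_hi" -lt "$b_hi" ] && return 0
    [ "$a_hi" -gt "$b_hi" ] && return 1
    [ "$a_lo" -lt "$b_lo" ]
}

member=0.513042   # Events from --examine /dev/sda6 above
array=0.515034    # Events from --detail /dev/md0 above
if events_older "$member" "$array"; then
    echo "member is stale: needs a resync"
fi
```

On a live system the two values could come from `mdadm --examine /dev/sda6 | events_of` and `mdadm --detail /dev/md0 | events_of`.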
~ # mdadm --manage --add /dev/md0 /dev/sda6
mdadm: hot added /dev/sda6

~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue May  5 11:22:02 2009
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 0% complete

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.515210

    Number   Major   Minor   RaidDevice State
       0       8       22        0      active sync   /dev/sdb6
       1       0        0        -      removed

       2       8        6        1      spare rebuilding   /dev/sda6

Rebuilding. Good sign, but why do I still have device 1 - removed - in the list?
~ # cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sda6[2] sdb6[0]
      487106752 blocks [2/1] [U_]
      [=>...................]  recovery =  9.8% (47870464/487106752) finish=114.8min speed=63713K/sec
unused devices: <none>

Under 2 hours to sync up. Time for coffee.
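That estimate can be sanity-checked from the numbers on the recovery line: blocks remaining divided by the reported speed. Here is a small sketch using integer shell arithmetic, assuming the block counts are in 1K units so they match the K/sec speed. The function name resync_minutes_left is made up for this example.

```shell
#!/bin/sh
# Estimate the remaining resync time in whole minutes.
# Arguments: blocks done, blocks total, speed in K/sec.
resync_minutes_left() {
    done_blocks=$1
    total_blocks=$2
    speed_k=$3
    # Remaining blocks / speed gives seconds; divide by 60 for minutes.
    echo $(( (total_blocks - done_blocks) / speed_k / 60 ))
}

# Values from the recovery line above; prints 114
resync_minutes_left 47870464 487106752 63713
```

Integer division rounds down, so this gives 114 where the kernel showed finish=114.8min. That is close enough to plan a coffee break.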
... [coffee] ...
~ # cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sda6[1] sdb6[0]
      487106752 blocks [2/2] [UU]
unused devices: <none>

~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue May  5 14:05:25 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.515939

    Number   Major   Minor   RaidDevice State
       0       8       22        0      active sync   /dev/sdb6
       1       8        6        1      active sync   /dev/sda6

One last reboot and we're back on track.
Sorted.
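Since the diagnostics page claimed everything was "within normal parameters" while the array was degraded, a direct look at /proc/mdstat is the more trustworthy check. Here is a minimal sketch that flags any array whose status brackets contain an underscore (as in [U_] above), assuming awk is available on the box. The name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Print the name of every md array with at least one missing member.
# Reads /proc/mdstat text on stdin and looks for "_" inside the
# [UU]-style status brackets on the "blocks" line.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name }
    '
}
```

Run as `degraded_arrays < /proc/mdstat`, it prints nothing for a healthy [UU] array and the array name (md0 here) when a slot is missing.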
