02 April 2010

RAIDers of the Lost Disk (again)

Here we go again. The NAS is not happy.
Just out of curiosity, I check whether there is a hacked firmware based on a more recent image, and I find one based on version 3.4.90, with SSH of course.
So, let's check the RAID device...

~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri May 22 15:20:30 2009
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.525126

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8        6        1      active sync   /dev/sda6
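
The same check can be scripted. This is only a minimal sketch: the helper names md_state and md_is_degraded are made up for illustration, and it assumes the "State :" line looks like the one in the output above.

```shell
#!/bin/sh
# Print the "State" field from `mdadm --detail` output read on stdin.
# The device table header ends in "State" too, but it does not start
# with "State :", so it is not matched.
md_state() {
    sed -n 's/^ *State : *//p'
}

# Succeed (exit status 0) when the state read on stdin mentions "degraded".
md_is_degraded() {
    md_state | grep -q 'degraded'
}

# Sample input: the State line copied from the output above.
sample='          State : clean, degraded'
if printf '%s\n' "$sample" | md_is_degraded; then
    echo "array is degraded"
fi
```

On the NAS itself this would be fed from the real command, e.g. mdadm --detail /dev/md0 | md_is_degraded.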

Again, only one drive out of two.
Let's see what happened to /dev/sdb.

~ # /usr/sbin/smartctl -l selftest /dev/sdb
smartctl version 5.1-14 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended off-line   Completed                     00%      7474         -
# 2  Off-line            Interrupted (host reset)      50%      7466         -
# 3  Off-line            Interrupted (host reset)      50%      7379         -
# 4  Short off-line      Completed: read failure       50%      7334         0x00032141
# 5  Off-line            Interrupted (host reset)      00%      7334         -
# 6  Short off-line      Completed                     00%      7330         -
# 7  Off-line            Interrupted (host reset)      00%      7330         -
# 8  Off-line            Interrupted (host reset)      00%      5973         -
# 9  Off-line            Interrupted (host reset)      00%      5396         -
#10  Off-line            Interrupted (host reset)      00%      5393         -
#11  Off-line            Interrupted (host reset)      00%      5376         -
#12  Short off-line      Completed                     00%      4687         -
#13  Off-line            Interrupted (host reset)      00%      4687         -
#14  Off-line            Interrupted (host reset)      00%      4003         -
#15  Off-line            Interrupted (host reset)      00%      3819         -
#16  Short off-line      Completed                     00%      3659         -
#17  Short off-line      Completed                     00%      3659         -
#18  Short off-line      Completed                     00%      3655         -
#19  Off-line            Interrupted (host reset)      70%      3652         -
#20  Short off-line      Aborted by host               70%      3652         -
#21  Off-line            Interrupted (host reset)      00%      3651         -

Ouch! LBA_of_first_error = 0x32141 (= 205121 in base 10)
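
The hex-to-decimal conversion can be done right in the shell. A small sketch follows; hex_to_dec and lba_to_bytes are hypothetical helper names, and the byte offset assumes 512-byte sectors, which the SMART log does not confirm.

```shell
#!/bin/sh
# Convert a hexadecimal LBA such as 0x00032141 to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

# Byte offset of an LBA, ASSUMING 512-byte sectors.
lba_to_bytes() {
    echo $(( $1 * 512 ))
}

hex_to_dec 0x00032141     # prints 205121
lba_to_bytes 0x00032141   # prints 105021952
```
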
Let's also check the SMART attributes.

~ # /usr/sbin/smartctl -A /dev/sdb
smartctl version 5.1-14 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 32
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   168   162   063    Pre-fail  Always       -       18676
  4 Start_Stop_Count        0x0032   210   210   000    Old_age   Always       -       20884
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   247   243   187    Pre-fail  Always       -       41160
  9 Power_On_Hours          0x0032   232   232   000    Old_age   Always       -       7559
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       76
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   056   039   000    Old_age   Always       -       959119404
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   046   253   000    Old_age   Always       -       44
195 Hardware_ECC_Recovered  0x000a   252   210   000    Old_age   Always       -       37129
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       0
202 Unknown_Attribute       0x000a   253   252   000    Old_age   Always       -       0
203 Unknown_Attribute       0x000b   253   252   180    Pre-fail  Always       -       11
204 Unknown_Attribute       0x000a   253   252   000    Old_age   Always       -       0
205 Unknown_Attribute       0x000a   253   252   000    Old_age   Always       -       0
207 Unknown_Attribute       0x002a   253   252   000    Old_age   Always       -       0
208 Unknown_Attribute       0x002a   253   252   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
211 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
212 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0

Not too bad after all, since
Current_Pending_Sector = 0
Offline_Uncorrectable = 0
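
Picking those values out of the table by eye gets old quickly, so here is a rough sketch of how it could be scripted. The helper name smart_raw is invented for this example, and it assumes the attribute name is in the second column and the raw value in the last one, as in the table above.

```shell
#!/bin/sh
# Print the RAW_VALUE (last column) of the attribute named $1,
# reading `smartctl -A` output on stdin.
smart_raw() {
    awk -v name="$1" '$2 == name { print $NF }'
}

# Sample input: three rows copied from the table above.
sample='  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0'

for attr in Reallocated_Sector_Ct Current_Pending_Sector Offline_Uncorrectable; do
    printf '%s = %s\n' "$attr" "$(printf '%s\n' "$sample" | smart_raw "$attr")"
done
```

On the real box the input would come from /usr/sbin/smartctl -A /dev/sdb instead of the sample text.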

Now let's find which partition has the problem.

~ # fsck.ext3 -nv /dev/sdb1
e2fsck 1.38 (30-Jun-2005)
/dev/sdb1: clean, 3045/64256 files, 24511/64252 blocks

~ # fsck.ext3 -nv /dev/sdb2
e2fsck 1.38 (30-Jun-2005)
Warning!  /dev/sdb2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sdb2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? no

Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (39452, counted=39449).
Fix? no

Free inodes count wrong (61113, counted=61112).
Fix? no


/dev/sdb2: ********** WARNING: Filesystem still has errors **********


    2887 inodes used (4%)
      13 non-contiguous inodes (0.5%)
         # of inodes with ind/dind/tind blocks: 156/0/0
   24548 blocks used (38%)
       0 bad blocks
       1 large file

    2309 regular files
     175 directories
      47 character device files
      40 block device files
       0 fifos
       8 links
     308 symbolic links (308 fast symbolic links)
       0 sockets
--------
    2887 files


/dev/sdb3 is a swap partition, so we can skip that.
/dev/sdb4 is an extended partition, so we can skip that too.

~ # fsck.ext3 -nv /dev/sdb5
e2fsck 1.38 (30-Jun-2005)
Warning!  /dev/sdb5 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sdb5 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? no

Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (118269, counted=118286).
Fix? no

Free inodes count wrong (126523, counted=126537).
Fix? no


/dev/sdb5: ********** WARNING: Filesystem still has errors **********


      69 inodes used (0%)
       4 non-contiguous inodes (5.8%)
         # of inodes with ind/dind/tind blocks: 0/0/0
    8235 blocks used (6%)
       0 bad blocks
       1 large file

      32 regular files
      14 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
      46 files

~ # fsck.ext3 -nv /dev/sdb6
e2fsck 1.38 (30-Jun-2005)
/dev/sdb6: clean, 93575/60899328 files, 56875141/121776688 blocks
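
When looping over several partitions it is easier to look at the exit status of e2fsck than to read every report. The status is a bit mask documented in the e2fsck man page. The sketch below decodes it; fsck_describe is a made-up helper name, and I have not verified that version 1.38 on this firmware returns exactly these codes.

```shell
#!/bin/sh
# Describe an e2fsck exit status, which is a sum of the bit values
# listed in the e2fsck man page.
fsck_describe() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    if [ $((code & 1)) -ne 0 ]; then out="$out, errors corrected"; fi
    if [ $((code & 2)) -ne 0 ]; then out="$out, reboot recommended"; fi
    if [ $((code & 4)) -ne 0 ]; then out="$out, errors left uncorrected"; fi
    if [ $((code & 8)) -ne 0 ]; then out="$out, operational error"; fi
    if [ $((code & 16)) -ne 0 ]; then out="$out, usage or syntax error"; fi
    if [ $((code & 32)) -ne 0 ]; then out="$out, cancelled by user"; fi
    if [ $((code & 128)) -ne 0 ]; then out="$out, shared library error"; fi
    echo "${out#, }"
}

fsck_describe 0
fsck_describe 4
```

A read-only run such as fsck.ext3 -n /dev/sdb2 followed by fsck_describe $? would be expected to report "errors left uncorrected" for the two broken partitions.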

So the errors are on /dev/sdb2 and /dev/sdb5.
Let's see where they mount to.

~ # mount | grep /sdb
/dev/sdb1 on /mnt/__mxo_sdb1 type ext3 (rw)

~ # cat /proc/mounts | grep /sdb
/dev/sdb5 /tmp ext3 rw 0 0

~ # cat /proc/cmdline
console=ttyS0,115200 root=/dev/sdb2 rw

Are we booting from /dev/sdb2?
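
The root device can be pulled out of the kernel command line with a few lines of shell. This is only a sketch; root_from_cmdline is a hypothetical helper name, and it assumes a single space-separated root= argument as in the output above.

```shell
#!/bin/sh
# Print the value of the root= argument from a kernel command line
# passed as $1.
root_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^root=//p' | head -n 1
}

root_from_cmdline "console=ttyS0,115200 root=/dev/sdb2 rw"   # prints /dev/sdb2
```

On a live system the argument would be "$(cat /proc/cmdline)".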

~ # mxoparam -h

Maxtor mxoparam version 1.0
-a         show all maxtor params
-b         get wait for button status
-c [0-1]        set wait for button 0 = Off 1 = On
-d         show max number of drives
-e         enable watchdog in uboot
-f         disable watchdog in uboot
-g         set led solid green
-h         show help
-k         kick watchdog
-p         get boot partition
-q [part]  set boot partition
           0 = drive 0 partition 1
           1 = drive 0 partition 2
           2 = drive 1 partition 1
           3 = drive 1 partition 2
-r         reset partion fail count
-s         get serial number
-t [sn]    set serial number
-v         show version
-x         disable watchdog now
-w         enable watchdog now
-y         set led solid yellow

~ # mxoparam -p
Boot partition is 3

Looks like the system is booting from the second disk, second partition (/dev/sdb2).
This means we can't unmount it, and it has to be unmounted before fsck can repair it.
Therefore, we need to make the system boot from /dev/sda2; otherwise we won't be able to fix /dev/sdb*.
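
For reference, the index table from the mxoparam help text can be written as a small lookup. This is a sketch with an invented helper name (boot_index_to_dev), and it assumes "drive 0" is /dev/sda and "drive 1" is /dev/sdb, which matches index 3 showing up as root=/dev/sdb2 above.

```shell
#!/bin/sh
# Map an mxoparam boot partition index to a device node.
# ASSUMPTION: drive 0 = /dev/sda, drive 1 = /dev/sdb.
boot_index_to_dev() {
    case $1 in
        0) echo /dev/sda1 ;;
        1) echo /dev/sda2 ;;
        2) echo /dev/sdb1 ;;
        3) echo /dev/sdb2 ;;
        *) echo "unknown boot index: $1" >&2; return 1 ;;
    esac
}

boot_index_to_dev 3   # prints /dev/sdb2
boot_index_to_dev 1   # prints /dev/sda2
```
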

First of all, let's make sure /dev/sda2 is exactly the same as /dev/sdb2.

~ # dd if=/dev/sdb2 of=/dev/sda2

~ # mount -n /dev/sda2 /mnt/__mxo_sda2 -t ext3

~ # cp -a /mnt/__mxo_sdb2 /mnt/__mxo_sda2
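
To actually confirm that two partitions are byte-for-byte identical, cmp can be used. The sketch below is not what I ran at the time; same_content is a made-up helper name, and the demo works on temporary files so it can be tried anywhere.

```shell
#!/bin/sh
# Succeed (exit status 0) when the two files or block devices given
# as $1 and $2 have identical contents.
same_content() {
    cmp -s "$1" "$2"
}

# Demo on two temporary files standing in for the partitions.
a=$(mktemp)
b=$(mktemp)
printf 'rootfs image\n' > "$a"
cp "$a" "$b"
if same_content "$a" "$b"; then echo "identical"; fi
printf 'x' >> "$b"
if ! same_content "$a" "$b"; then echo "different"; fi
rm -f "$a" "$b"
```

On the NAS this would be same_content /dev/sdb2 /dev/sda2. Keep in mind that a partition mounted read-write keeps changing, so a comparison against the running root can report a difference even right after the copy.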


Now let's set the new boot partition

~ # mxoparam -q 1

REBOOT!
...
Let's check it's booting up from the right place now.

~ # cat /proc/cmdline
console=ttyS0,115200 root=/dev/sda2 rw

Right! We're ready to fix /dev/sdb2 and /dev/sdb5 now!

~ # fsck -v /dev/sdb2
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/dev/sdb2 has gone 384 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb2: ***** FILE SYSTEM WAS MODIFIED *****

    3157 inodes used (4%)
      14 non-contiguous inodes (0.4%)
         # of inodes with ind/dind/tind blocks: 159/0/0
   24622 blocks used (38%)
       0 bad blocks
       1 large file

    2317 regular files
     179 directories
     246 character device files
      84 block device files
       0 fifos
       7 links
     321 symbolic links (321 fast symbolic links)
       0 sockets
--------
    3154 files


~ # fsck -v /dev/sdb5
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/dev/sdb5: recovering journal
/dev/sdb5 has been mounted 35 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb5: ***** FILE SYSTEM WAS MODIFIED *****

    2939 inodes used (2%)
      28 non-contiguous inodes (1.0%)
         # of inodes with ind/dind/tind blocks: 159/0/0
   26690 blocks used (21%)
       0 bad blocks
       1 large file

    2345 regular files
     189 directories
      47 character device files
      40 block device files
       0 fifos
       8 links
     308 symbolic links (308 fast symbolic links)
       0 sockets
--------
    2937 files

Now we can rebuild the RAID array.

~ # mdadm --manage --add /dev/md0 /dev/sdb6
mdadm: hot added /dev/sdb6

NAS:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 23 14:39:50 2009
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 0% complete

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.541445

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8        6        1      active sync   /dev/sda6

       2       8       22        0      spare rebuilding   /dev/sdb6

~ # cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sdb6[2] sda6[1]
      487106752 blocks [2/1] [_U]
      [>....................]  recovery =  1.5% (7396736/487106752) finish=145.1min speed=55092K/sec
unused devices: [none]
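
The numbers on the recovery line can be cross-checked with a little arithmetic: remaining blocks divided by speed gives the time left. A sketch, with invented helper names (rebuild_percent, rebuild_eta_min) and assuming the block counts are in KiB and the speed is in KiB per second:

```shell
#!/bin/sh
# Percentage done. $1 = blocks done, $2 = total blocks.
rebuild_percent() {
    awk -v cur="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", 100 * cur / total }'
}

# Estimated minutes left. $1 = blocks done, $2 = total blocks,
# $3 = speed in K/sec (as printed in /proc/mdstat).
rebuild_eta_min() {
    awk -v cur="$1" -v total="$2" -v speed="$3" \
        'BEGIN { printf "%.1f\n", (total - cur) / speed / 60 }'
}

rebuild_percent 7396736 487106752          # prints 1.5
rebuild_eta_min 7396736 487106752 55092    # prints 145.1
```

Both results agree with the "recovery = 1.5%" and "finish=145.1min" figures reported by the kernel.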

Good... 2.5 hours later...

~ # cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sdb6[0] sda6[1]
      487106752 blocks [2/2] [UU]

unused devices: [none]
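
Instead of eyeballing the [UU] marker, a script can wait for it. A minimal sketch, with a hypothetical helper name (md_all_up) and assuming the member status always appears as a bracketed run of U and _ characters, as in the outputs above:

```shell
#!/bin/sh
# Read /proc/mdstat text on stdin. Succeed only if at least one member
# status such as [UU] is present and none of them contains "_".
md_all_up() {
    slots=$(sed -n 's/.*\(\[[U_][U_]*\]\).*/\1/p')
    [ -n "$slots" ] || return 1
    case $slots in
        *_*) return 1 ;;
        *) return 0 ;;
    esac
}

# Sample input: the two status lines seen during and after the rebuild.
if ! printf '      487106752 blocks [2/1] [_U]\n' | md_all_up; then
    echo "still degraded"
fi
if printf '      487106752 blocks [2/2] [UU]\n' | md_all_up; then
    echo "all members up"
fi
```

On the NAS the input would be /proc/mdstat itself, e.g. md_all_up < /proc/mdstat.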

Reboot again and we're done!
