02 April 2010

RAIDers of the Lost Disk (again)

Here we go again. The NAS is not happy.
Just out of curiosity, I check whether there is a hacked firmware based on a more recent image, and I find one based on version 3.4.90, with SSH of course.
So here we go. Let's check the RAID device...

~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri May 22 15:20:30 2009
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.525126

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8        6        1      active sync   /dev/sda6

Again only one drive out of two.
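The same check can be scripted instead of read by eye. This is only a sketch: it assumes the "State :" line keeps the format shown above, and the function names md_state and md_is_degraded are made up for illustration.

```shell
# Print the value of the "State :" line from `mdadm --detail` output on stdin.
md_state() {
    awk -F' : ' '/^ *State :/ { print $2 }'
}

# Succeed (exit status 0) when the state read from stdin includes "degraded".
md_is_degraded() {
    case "$(md_state)" in
        *degraded*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Intended use on the NAS (not run here):
#   mdadm --detail /dev/md0 | md_is_degraded && echo "array is degraded"
# A line copied from the output above stands in for the real command.
if printf '          State : clean, degraded\n' | md_is_degraded; then
    echo "array is degraded"
fi
```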
Let's see what happened to /dev/sdb.

~ # /usr/sbin/smartctl -l selftest /dev/sdb
smartctl version 5.1-14 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended off-line   Completed                     00%      7474         -
# 2  Off-line            Interrupted (host reset)      50%      7466         -
# 3  Off-line            Interrupted (host reset)      50%      7379         -
# 4  Short off-line      Completed: read failure       50%      7334         0x00032141
# 5  Off-line            Interrupted (host reset)      00%      7334         -
# 6  Short off-line      Completed                     00%      7330         -
# 7  Off-line            Interrupted (host reset)      00%      7330         -
# 8  Off-line            Interrupted (host reset)      00%      5973         -
# 9  Off-line            Interrupted (host reset)      00%      5396         -
#10  Off-line            Interrupted (host reset)      00%      5393         -
#11  Off-line            Interrupted (host reset)      00%      5376         -
#12  Short off-line      Completed                     00%      4687         -
#13  Off-line            Interrupted (host reset)      00%      4687         -
#14  Off-line            Interrupted (host reset)      00%      4003         -
#15  Off-line            Interrupted (host reset)      00%      3819         -
#16  Short off-line      Completed                     00%      3659         -
#17  Short off-line      Completed                     00%      3659         -
#18  Short off-line      Completed                     00%      3655         -
#19  Off-line            Interrupted (host reset)      70%      3652         -
#20  Short off-line      Aborted by host               70%      3652         -
#21  Off-line            Interrupted (host reset)      00%      3651         -

Ouch! LBA_of_first_error = 0x32141 (= 205121 in base 10)
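The conversion is easy to redo in the shell. A small sketch follows; the function names are invented here, and the byte offset assumes 512-byte sectors, which the output above does not actually confirm.

```shell
# Convert a hexadecimal LBA (leading zeros are fine) to decimal.
lba_to_dec() {
    printf '%d\n' "$1"
}

# Byte offset of an LBA on the disk, ASSUMING 512-byte sectors.
lba_to_bytes() {
    echo $(( $(lba_to_dec "$1") * 512 ))
}

lba_to_dec 0x00032141     # prints 205121
lba_to_bytes 0x00032141   # prints 105021952
```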
Let's also check the SMART attributes.

~ # /usr/sbin/smartctl -A /dev/sdb
smartctl version 5.1-14 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 32
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   168   162   063    Pre-fail  Always       -       18676
  4 Start_Stop_Count        0x0032   210   210   000    Old_age   Always       -       20884
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   247   243   187    Pre-fail  Always       -       41160
  9 Power_On_Hours          0x0032   232   232   000    Old_age   Always       -       7559
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       76
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   056   039   000    Old_age   Always       -       959119404
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   046   253   000    Old_age   Always       -       44
195 Hardware_ECC_Recovered  0x000a   252   210   000    Old_age   Always       -       37129
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       0
202 Unknown_Attribute       0x000a   253   252   000    Old_age   Always       -       0
203 Unknown_Attribute       0x000b   253   252   180    Pre-fail  Always       -       11
204 Unknown_Attribute       0x000a   253   252   000    Old_age   Always       -       0
205 Unknown_Attribute       0x000a   253   252   000    Old_age   Always       -       0
207 Unknown_Attribute       0x002a   253   252   000    Old_age   Always       -       0
208 Unknown_Attribute       0x002a   253   252   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
211 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
212 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0

Not too bad after all, since
Current_Pending_Sector = 0
Offline_Uncorrectable = 0
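To avoid scanning the whole table every time, the raw values can be pulled out with awk. A rough sketch, assuming RAW_VALUE stays in the 10th column as in the listing above; smart_raw and sample_attrs are names made up for this example, and the sample rows are copied from that listing.

```shell
# Print the RAW_VALUE (10th column) of one attribute ID,
# reading `smartctl -A` output from stdin.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# A few rows copied from the output above, standing in for the real command.
sample_attrs() {
    cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x000a   252   210   000    Old_age   Always       -       37129
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
EOF
}

# Intended use on the NAS (not run here):
#   /usr/sbin/smartctl -A /dev/sdb | smart_raw 197
for id in 5 197 198; do
    echo "attribute $id raw value: $(sample_attrs | smart_raw "$id")"
done
```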

Now let's find which partition has the problem.

~ # fsck.ext3 -nv /dev/sdb1
e2fsck 1.38 (30-Jun-2005)
/dev/sdb1: clean, 3045/64256 files, 24511/64252 blocks

~ # fsck.ext3 -nv /dev/sdb2
e2fsck 1.38 (30-Jun-2005)
Warning!  /dev/sdb2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sdb2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? no

Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (39452, counted=39449).
Fix? no

Free inodes count wrong (61113, counted=61112).
Fix? no


/dev/sdb2: ********** WARNING: Filesystem still has errors **********


    2887 inodes used (4%)
      13 non-contiguous inodes (0.5%)
         # of inodes with ind/dind/tind blocks: 156/0/0
   24548 blocks used (38%)
       0 bad blocks
       1 large file

    2309 regular files
     175 directories
      47 character device files
      40 block device files
       0 fifos
       8 links
     308 symbolic links (308 fast symbolic links)
       0 sockets
--------
    2887 files


/dev/sdb3 is a swap partition, so we can skip that.
/dev/sdb4 is an extended partition, so we can skip that too.

~ # fsck.ext3 -nv /dev/sdb5
e2fsck 1.38 (30-Jun-2005)
Warning!  /dev/sdb5 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sdb5 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? no

Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (118269, counted=118286).
Fix? no

Free inodes count wrong (126523, counted=126537).
Fix? no


/dev/sdb5: ********** WARNING: Filesystem still has errors **********


      69 inodes used (0%)
       4 non-contiguous inodes (5.8%)
         # of inodes with ind/dind/tind blocks: 0/0/0
    8235 blocks used (6%)
       0 bad blocks
       1 large file

      32 regular files
      14 directories
       0 character device files
       0 block device files
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       0 sockets
--------
      46 files

~ # fsck.ext3 -nv /dev/sdb6
e2fsck 1.38 (30-Jun-2005)
/dev/sdb6: clean, 93575/60899328 files, 56875141/121776688 blocks
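Rather than reading each report, a script can look at the exit status of e2fsck, which is a bit mask (1 = errors corrected, 2 = corrected and reboot advised, 4 = errors left uncorrected, 8 = operational error). The sketch below only decodes that status; the function name describe_fsck_status is invented for this example.

```shell
# Describe an e2fsck exit status. The status is a bit mask, so more than
# one line may be printed.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    if [ $(( status & 1 )) -ne 0 ]; then echo "errors corrected"; fi
    if [ $(( status & 2 )) -ne 0 ]; then echo "errors corrected, reboot advised"; fi
    if [ $(( status & 4 )) -ne 0 ]; then echo "errors left uncorrected"; fi
    if [ $(( status & 8 )) -ne 0 ]; then echo "operational error"; fi
}

# Intended use on the NAS (not run here):
#   for part in 1 2 5 6; do
#       fsck.ext3 -n /dev/sdb$part >/dev/null 2>&1
#       echo "/dev/sdb$part: $(describe_fsck_status $?)"
#   done
describe_fsck_status 4   # prints "errors left uncorrected"
```

With -n nothing is repaired, so a filesystem with problems should come back as status 4.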

So the errors are on /dev/sdb2 and /dev/sdb5, both of which were mounted during the check.
Let's see where they mount to.

~ # mount | grep /sdb
/dev/sdb1 on /mnt/__mxo_sdb1 type ext3 (rw)

~ # cat /proc/mounts | grep /sdb
/dev/sdb5 /tmp ext3 rw 0 0
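The two listings disagree, presumably because mount reports what is recorded in /etc/mtab while /proc/mounts is the kernel's own list (that explanation is my assumption, not something checked on this box). A small filter for /proc/mounts-style input is sketched below; mounts_for is a made-up name, and the /dev/md0 line in the sample is invented purely to show the filtering.

```shell
# Print "device mountpoint" for every entry whose device name starts with
# the given prefix, reading /proc/mounts-style lines from stdin.
mounts_for() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { print $1, $2 }'
}

# Intended use on the NAS (not run here):
#   mounts_for /dev/sdb < /proc/mounts
# The first sample line is from the output above; the second is hypothetical.
printf '%s\n' \
    '/dev/sdb5 /tmp ext3 rw 0 0' \
    '/dev/md0 /hypothetical ext3 rw 0 0' | mounts_for /dev/sdb   # prints /dev/sdb5 /tmp
```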

~ # cat /proc/cmdline
console=ttyS0,115200 root=/dev/sdb2 rw

Are we booting from /dev/sdb2?
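The root device can also be extracted from the kernel command line directly. A minimal sketch, assuming the root= parameter appears as a single space-separated word as it does above; root_from_cmdline is a name made up here.

```shell
# Print the value of the root= parameter from a kernel command line on stdin.
root_from_cmdline() {
    tr ' ' '\n' | sed -n 's/^root=//p'
}

# Intended use on the NAS (not run here):
#   root_from_cmdline < /proc/cmdline
echo 'console=ttyS0,115200 root=/dev/sdb2 rw' | root_from_cmdline   # prints /dev/sdb2
```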

~ # mxoparam -h

Maxtor mxoparam version 1.0
-a         show all maxtor params
-b         get wait for button status
-c [0-1]        set wait for button 0 = Off 1 = On
-d         show max number of drives
-e         enable watchdog in uboot
-f         disable watchdog in uboot
-g         set led solid green
-h         show help
-k         kick watchdog
-p         get boot partition
-q [part]  set boot partition
           0 = drive 0 partition 1
           1 = drive 0 partition 2
           2 = drive 1 partition 1
           3 = drive 1 partition 2
-r         reset partion fail count
-s         get serial number
-t [sn]    set serial number
-v         show version
-x         disable watchdog now
-w         enable watchdog now
-y         set led solid yellow

~ # mxoparam -p
Boot partition is 3

Looks like the system is booting from the second disk, second partition (/dev/sdb2).
That means we can't unmount it, and it has to be unmounted before it can be fixed.
So we need to make the system boot from /dev/sda2, otherwise we won't be able to fix /dev/sdb*
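The boot partition numbers from the mxoparam help map onto device names in a regular way. The sketch below spells that out, assuming drive 0 is /dev/sda and drive 1 is /dev/sdb (which matches what /proc/cmdline shows for index 3); boot_index_to_dev is a made-up helper, not part of mxoparam.

```shell
# Translate an mxoparam boot partition index (0..3) into a device name,
# ASSUMING drive 0 = /dev/sda and drive 1 = /dev/sdb.
boot_index_to_dev() {
    idx=$1
    case "$idx" in
        0|1|2|3) ;;
        *) echo "invalid boot index: $idx" >&2; return 1 ;;
    esac
    drive=$(( idx / 2 ))        # 0 or 1
    part=$(( idx % 2 + 1 ))     # 1 or 2
    if [ "$drive" -eq 0 ]; then letter=a; else letter=b; fi
    echo "/dev/sd${letter}${part}"
}

boot_index_to_dev 3   # prints /dev/sdb2
boot_index_to_dev 1   # prints /dev/sda2
```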

First of all, let's make sure /dev/sda2 is exactly the same as /dev/sdb2

~ # dd if=/dev/sdb2 of=/dev/sda2

~ # mount -n /dev/sda2 /mnt/__mxo_sda2 -t ext3

~ # cp -a /mnt/__mxo_sdb2 /mnt/__mxo_sda2
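A byte-for-byte comparison is one way to confirm a raw copy, ideally right after the dd and before the target is mounted. This is a sketch only: same_content is a made-up wrapper around cmp, it is demonstrated on temporary files rather than on the partitions, and a source that is mounted read-write can change underneath the comparison, so a mismatch on a live root filesystem does not necessarily mean the copy went wrong.

```shell
# Succeed (exit status 0) when the two files or devices hold identical bytes.
same_content() {
    cmp -s "$1" "$2"
}

# Intended use on the NAS (not run here):
#   same_content /dev/sdb2 /dev/sda2 && echo "identical"
# Demonstration on two temporary files:
a=$(mktemp) b=$(mktemp)
printf 'same bytes\n' > "$a"
printf 'same bytes\n' > "$b"
same_content "$a" "$b" && echo "identical"
printf 'changed\n' > "$b"
same_content "$a" "$b" || echo "different"
rm -f "$a" "$b"
```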


Now let's set the new boot partition.

~ # mxoparam -q 1

REBOOT!
...
Let's check it's booting up from the right place now.

~ # cat /proc/cmdline
console=ttyS0,115200 root=/dev/sda2 rw

Right! We're ready to fix /dev/sdb2 and /dev/sdb5 now!

~ # fsck -v /dev/sdb2
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/dev/sdb2 has gone 384 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb2: ***** FILE SYSTEM WAS MODIFIED *****

    3157 inodes used (4%)
      14 non-contiguous inodes (0.4%)
         # of inodes with ind/dind/tind blocks: 159/0/0
   24622 blocks used (38%)
       0 bad blocks
       1 large file

    2317 regular files
     179 directories
     246 character device files
      84 block device files
       0 fifos
       7 links
     321 symbolic links (321 fast symbolic links)
       0 sockets
--------
    3154 files


~ # fsck -v /dev/sdb5
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/dev/sdb5: recovering journal
/dev/sdb5 has been mounted 35 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb5: ***** FILE SYSTEM WAS MODIFIED *****

    2939 inodes used (2%)
      28 non-contiguous inodes (1.0%)
         # of inodes with ind/dind/tind blocks: 159/0/0
   26690 blocks used (21%)
       0 bad blocks
       1 large file

    2345 regular files
     189 directories
      47 character device files
      40 block device files
       0 fifos
       8 links
     308 symbolic links (308 fast symbolic links)
       0 sockets
--------
    2937 files

Now we can rebuild the RAID array.

~ # mdadm --manage --add /dev/md0 /dev/sdb6
mdadm: hot added /dev/sdb6

NAS:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Sat May  5 06:30:50 2007
     Raid Level : raid1
     Array Size : 487106752 (464.54 GiB 498.80 GB)
    Device Size : 487106752 (464.54 GiB 498.80 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 23 14:39:50 2009
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 0% complete

           UUID : 7ff7415e:4719112d:d63dd33d:40ff685f
         Events : 0.541445

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8        6        1      active sync   /dev/sda6

       2       8       22        0      spare rebuilding   /dev/sdb6

~ # cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sdb6[2] sda6[1]
      487106752 blocks [2/1] [_U]
      [>....................]  recovery =  1.5% (7396736/487106752) finish=145.1min speed=55092K/sec
unused devices: [none]
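The finish estimate can be checked by hand: remaining blocks divided by the current speed. A quick sketch, assuming the block counts in /proc/mdstat are 1 KiB units so they divide directly by the K/sec figure; eta_minutes is a made-up name.

```shell
# Estimated whole minutes left for a rebuild.
# Arguments: blocks done, total blocks, speed in K/sec
# (ASSUMING mdstat blocks are 1 KiB each).
eta_minutes() {
    done_blocks=$1 total_blocks=$2 speed_k_per_sec=$3
    echo $(( (total_blocks - done_blocks) / speed_k_per_sec / 60 ))
}

# Numbers taken from the recovery line above.
eta_minutes 7396736 487106752 55092   # prints 145
```

That agrees with the finish=145.1min reported by the kernel.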

Good... 2.5 hours later...

~ # cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sdb6[0] sda6[1]
      487106752 blocks [2/2] [UU]

unused devices: [none]
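For next time, the status brackets in /proc/mdstat are enough to tell whether a member is missing: every U is a working device and every _ is a missing one. A minimal sketch, with md_has_missing_member as a made-up name:

```shell
# Succeed (exit status 0) when mdstat-style input on stdin shows at least
# one missing member, i.e. an underscore inside a [UU]-style status field.
md_has_missing_member() {
    grep -q '\[[U_]*_[U_]*\]'
}

# Intended use on the NAS (not run here):
#   md_has_missing_member < /proc/mdstat && echo "degraded"
# Lines copied from the two listings above stand in for the real file.
printf '      487106752 blocks [2/1] [_U]\n' | md_has_missing_member && echo "degraded"
printf '      487106752 blocks [2/2] [UU]\n' | md_has_missing_member || echo "all members present"
```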

Reboot again and we're done!

Maxtor Shared Storage II - LED Codes

I'm writing this down for future reference: the relevant page on the Seagate support website has already moved a couple of times, and I don't want to waste time hunting for it the next time it does.

Back Panel LED Codes

Power LED (located in center of the power button)
   Definition: Power Switch
   Status:
   - Illuminated: Power On
   - Not Illuminated: Power Off

Ethernet LED (located on bottom side of Ethernet connector)
   Definition: Shows if the drive is connected through a 10/100 or a 1 Gb Ethernet connection.
   Status:
   - Left Green: 10/100 Mbps Ethernet connectivity
   - Left Amber: 1 Gbps Ethernet connectivity
   - Illuminated: Power On
   - Blinking: Network communication is occurring
   - Not Illuminated: Power Off

Activity LED (located on top side of Ethernet connector)
   Definition: A flashing Activity LED indicates that the network connection is functional and that packets are being transmitted or received.

Front Panel LED Codes

Top LED: Power Activity
   - Illuminated: Power On
   - Blinking: Drive is either powering up or shutting down
   - Not Illuminated: Power Off

Center LED: Hard Disk Activity
   - Illuminated: Power On
   - Blinking: Data is being transferred to/from the drive
   - Not Illuminated: Power Off

Bottom LED: Network Activity
   - Illuminated: Power On
   - Blinking: Network communication is occurring
   - Not Illuminated: Power Off

Front Panel LED Error Codes

Green LED blinks   Amber LED blinks   Status
1                  4                  /share file system error
1                  3                  Boot Error - Attempting to boot from disk 0
2                  3                  Boot Error - Attempting to boot from disk 1
1                  2                  HDD S.M.A.R.T. Error - Attempting to boot from disk 0
2                  2                  HDD S.M.A.R.T. Error - Attempting to boot from disk 1
1                  1                  RAID Error
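
Since I will probably be counting blinks over SSH notes rather than this page, here is the error table as a small lookup function. It is just the table above transcribed into a case statement; led_error is a made-up name.

```shell
# Look up a front panel error code.
# Arguments: number of green blinks, number of amber blinks.
led_error() {
    case "$1/$2" in
        1/4) echo "/share file system error" ;;
        1/3) echo "Boot Error - Attempting to boot from disk 0" ;;
        2/3) echo "Boot Error - Attempting to boot from disk 1" ;;
        1/2) echo "HDD S.M.A.R.T. Error - Attempting to boot from disk 0" ;;
        2/2) echo "HDD S.M.A.R.T. Error - Attempting to boot from disk 1" ;;
        1/1) echo "RAID Error" ;;
        *)   echo "unknown code: green=$1 amber=$2" >&2; return 1 ;;
    esac
}

led_error 1 1   # prints RAID Error
led_error 2 3   # prints Boot Error - Attempting to boot from disk 1
```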