* degraded raid array with bad blocks
@ 2015-07-16 18:14 Fabian Fischer
  2015-07-17  2:09 ` Roman Mamedov
  2015-07-21 22:48 ` NeilBrown
  0 siblings, 2 replies; 3+ messages in thread
From: Fabian Fischer @ 2015-07-16 18:14 UTC
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2517 bytes --]

Hi,
today I had some problems with my mdadm raid5 array (4 disks). First I
will try to explain what happened and what the result is:

One disk in my array has some bad blocks. After some hardware changes,
one of the intact disks was thrown out of the array due to a faulty
SATA cable.
I shut down the server and replaced the cable.
After booting, the removed disk wasn't re-added to the array (maybe
because of a different event count). --re-add didn't work, so I used
--add.
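
Roughly the commands I used (/dev/sde being the re-added disk, as shown
in the output below):

    # refused, presumably because of the event-count mismatch
    mdadm /dev/md127 --re-add /dev/sde
    # so I added it as a new device instead, which triggers a full rebuild
    mdadm /dev/md127 --add /dev/sde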

Because of the bad blocks on one of the remaining disks, the rebuild
stops when it reaches the first bad block. The re-added disk is then
listed as a spare, 2 disks are active, and the disk with bad blocks is
marked faulty.

/dev/md127:
        Version : 1.2
  Creation Time : Tue Apr 19 08:51:32 2011
     Raid Level : raid5
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 1953512960 (1863.02 GiB 2000.40 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Jul 16 19:02:09 2015
          State : clean, FAILED
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : FiFa-Server:0
           UUID : 839fb405:d0b1f13a:5a55ee42:fc8a2061
         Events : 107223

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       80        1      active sync   /dev/sdf
       5       8       32        2      active sync   /dev/sdc
       6       0        0        6      removed

       4       8       96        -      faulty   /dev/sdg
       6       8       64        -      spare   /dev/sde


In my opinion there are 3 possibilities to get the array working again.
I am not sure whether all of them really exist, or which one is the
most promising:
	- Using the 'spare' disk as an active disk. The data on that disk
	  should still be there.
	- Ignoring the bad blocks and losing the information stored in
	  those blocks.
	- Force-starting the array without the 'spare' disk and copying
	  the data to backup storage. Or will the bad blocks cause the
	  array to fail when one of them is reached?

In the attachment you can find the output of --examine.
I cannot explain why 3 disks have a Bad Block Log; according to the
SMART values, only sdg has Reallocated_Sector_Ct > 0.
Another thing I can't explain is why sdg (which is the disk with known
bad blocks) has a lower event count.
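
For reference, the SMART values can be read with smartctl (from
smartmontools), e.g.:

    # print the vendor attribute table, incl. Reallocated_Sector_Ct
    smartctl -A /dev/sdg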


I hope I can get some good ideas on how to fix my array.

Fabian


[-- Attachment #2: examine.txt --]
[-- Type: text/plain, Size: 3741 bytes --]

/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 839fb405:d0b1f13a:5a55ee42:fc8a2061
           Name : FiFa-Server:0
  Creation Time : Tue Apr 19 08:51:32 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=1200 sectors
          State : clean
    Device UUID : 97ccd551:8820c0e3:4ab3d67d:908a2fd9

    Update Time : Thu Jul 16 19:02:09 2015
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 9d018eea - correct
         Events : 107223

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : .AA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x8
     Array UUID : 839fb405:d0b1f13a:5a55ee42:fc8a2061
           Name : FiFa-Server:0
  Creation Time : Tue Apr 19 08:51:32 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=1200 sectors
          State : clean
    Device UUID : 6db7566d:3c709370:12634f66:f6bfd4f6

    Update Time : Thu Jul 16 19:02:09 2015
  Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
       Checksum : 4b73aeab - correct
         Events : 107223

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : .AA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 839fb405:d0b1f13a:5a55ee42:fc8a2061
           Name : FiFa-Server:0
  Creation Time : Tue Apr 19 08:51:32 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=1200 sectors
          State : clean
    Device UUID : 27df429f:c9661838:3a8024e4:d55055e8

    Update Time : Thu Jul 16 19:02:09 2015
       Checksum : b4327aa4 - correct
         Events : 107223

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .AA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x8
     Array UUID : 839fb405:d0b1f13a:5a55ee42:fc8a2061
           Name : FiFa-Server:0
  Creation Time : Tue Apr 19 08:51:32 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=1200 sectors
          State : clean
    Device UUID : b6786dff:c6be0236:10d8f95f:63afad93

    Update Time : Thu Jul 16 19:02:03 2015
  Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
       Checksum : 397b22c8 - correct
         Events : 107192

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)


* Re: degraded raid array with bad blocks
  2015-07-16 18:14 degraded raid array with bad blocks Fabian Fischer
@ 2015-07-17  2:09 ` Roman Mamedov
  2015-07-21 22:48 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: Roman Mamedov @ 2015-07-17  2:09 UTC
  To: Fabian Fischer; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1208 bytes --]

On Thu, 16 Jul 2015 20:14:21 +0200
Fabian Fischer <raid@fabianfischer.org> wrote:

> After booting, the removed disk wasn't re-added to the array (maybe
> because of a different event count). --re-add didn't work, so I used
> --add.

As to why --re-add didn't work: I *just* had the same situation; maybe
you needed to do 'mdadm --remove /dev/md127 faulty' first.

> Because of the bad blocks on one of the remaining disks, the rebuild
> stops when it reaches the first bad block. The re-added disk is then
> listed as a spare, 2 disks are active, and the disk with bad blocks is
> marked faulty.

One course of action is to use dd_rescue to clone the disk with bad
blocks to a new, clean disk (skipping the bad blocks as you go; you
will lose some data), then assemble the array with the new disk in
place of the cloned one and proceed with the rebuild. This time it will
not have bad blocks, just zeroes at those locations, so the rebuild
should complete successfully. After the rebuild completes you should
fsck the filesystem and check file checksums (if you saved them) to
figure out where the damage actually landed, and restore those files
from backup.
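
A minimal sketch of the cloning step using GNU ddrescue (a related
tool; /dev/sdX stands in for the new disk, so double-check the device
names before running anything like this):

    # first pass: grab everything easily readable, skip scraping (-n)
    ddrescue -f -n /dev/sdg /dev/sdX rescue.map
    # second pass: retry the remaining bad areas up to 3 times
    ddrescue -f -r3 /dev/sdg /dev/sdX rescue.map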

-- 
With respect,
Roman


* Re: degraded raid array with bad blocks
  2015-07-16 18:14 degraded raid array with bad blocks Fabian Fischer
  2015-07-17  2:09 ` Roman Mamedov
@ 2015-07-21 22:48 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: NeilBrown @ 2015-07-21 22:48 UTC
  To: Fabian Fischer; +Cc: linux-raid

On Thu, 16 Jul 2015 20:14:21 +0200 Fabian Fischer
<raid@fabianfischer.org> wrote:

> Hi,
> today I had some problems with my mdadm raid5 array (4 disks). First I
> will try to explain what happened and what the result is:
> 
> One disk in my array has some bad blocks. After some hardware changes,
> one of the intact disks was thrown out of the array due to a faulty
> SATA cable.
> I shut down the server and replaced the cable.
> After booting, the removed disk wasn't re-added to the array (maybe
> because of a different event count). --re-add didn't work, so I used
> --add.

You need a bitmap configured for --re-add to be useful.
(It is generally a good idea anyway).
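
For reference, an internal write-intent bitmap can be added to a
running array along these lines:

    # add an internal write-intent bitmap to the array
    mdadm --grow --bitmap=internal /dev/md127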

> 
> Because of the bad blocks on one of the remaining disks, the rebuild
> stops when it reaches the first bad block. The re-added disk is then
> listed as a spare, 2 disks are active, and the disk with bad blocks is
> marked faulty.

That shouldn't happen.
Your devices all have a bad block log present, so when rebuilding hits
a bad block it should just record that as a bad block on the recovering
disk and keep going.  That is the whole point of the bad block log.

What kernel are you running?  And what version of mdadm?
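
A quick way to check both:

    uname -r          # kernel version
    mdadm --version   # mdadm version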


> 
> /dev/md127:
>         Version : 1.2
>   Creation Time : Tue Apr 19 08:51:32 2011
>      Raid Level : raid5
>      Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
>   Used Dev Size : 1953512960 (1863.02 GiB 2000.40 GB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
> 
>     Update Time : Thu Jul 16 19:02:09 2015
>           State : clean, FAILED
>  Active Devices : 2
> Working Devices : 3
>  Failed Devices : 1
>   Spare Devices : 1
> 
>          Layout : left-symmetric
>      Chunk Size : 512K
> 
>            Name : FiFa-Server:0
>            UUID : 839fb405:d0b1f13a:5a55ee42:fc8a2061
>          Events : 107223
> 
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       80        1      active sync   /dev/sdf
>        5       8       32        2      active sync   /dev/sdc
>        6       0        0        6      removed
> 
>        4       8       96        -      faulty   /dev/sdg
>        6       8       64        -      spare   /dev/sde
> 
> 
> In my opinion there are 3 possibilities to get the array working
> again. I am not sure whether all of them really exist, or which one is
> the most promising:
> 	- Using the 'spare' disk as an active disk. The data on that disk
> 	  should still be there.
> 	- Ignoring the bad blocks and losing the information stored in
> 	  those blocks.
> 	- Force-starting the array without the 'spare' disk and copying
> 	  the data to backup storage. Or will the bad blocks cause the
> 	  array to fail when one of them is reached?

If you have somewhere to store the backed-up data, and if you can
assemble the array with "--assemble --force", then taking that approach
and copying all the data somewhere else is the safest option.

To do anything else would require a clear understanding of the history
of the array.  Maybe re-creating the array using the 3 "best" devices
would help, but you would want to be really sure what you were doing.

The data in the recorded bad blocks is probably lost already anyway;
hopefully there is nothing critical there.

Given that the update times on the superblocks are very close, the
array is probably quite consistent.

I think "--assemble --force" listing the three devices was "Device
Role: Active device ..." should work and give you a degraded array.
Then 'fsck' that and copy data off.
Then maybe recreate the array from scratch.
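
A sketch of that sequence, with the device names taken from the
--examine output above (verify they still match, and that the
filesystem sits directly on /dev/md127, before running this):

    mdadm --stop /dev/md127
    # sdf, sdc and sdg are the devices whose role is 'Active device ...'
    mdadm --assemble --force /dev/md127 /dev/sdf /dev/sdc /dev/sdg
    # read-only filesystem check first, then mount and copy the data off
    fsck -n /dev/md127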

NeilBrown



> 
> In the attachment you can find the output of --examine.
> I cannot explain why 3 disks have a Bad Block Log; according to the
> SMART values, only sdg has Reallocated_Sector_Ct > 0.
> Another thing I can't explain is why sdg (which is the disk with known
> bad blocks) has a lower event count.
> 
> 
> I hope I can get some good ideas on how to fix my array.
> 
> Fabian
> 

