* XFS corrupt after RAID failure and resync
@ 2015-01-06  5:39 David Raffelt
  2015-01-06 12:36 ` Stefan Ring
  2015-01-06 12:41 ` Brian Foster
  0 siblings, 2 replies; 12+ messages in thread
From: David Raffelt @ 2015-01-06  5:39 UTC (permalink / raw
  To: xfs



Hi All,

I have 7 drives in a RAID6 configuration with an XFS partition (running
Arch Linux). Recently two drives dropped out simultaneously, and a hot
spare immediately synced successfully, so I now have 6/7 drives up in the
array.

After a reboot (to replace the faulty drives) the XFS file system would not
mount. Note that I had to perform a hard reboot since the server hung on
shutdown. When I try to mount I get the following error:
mount: mount /dev/md0 on /export/data failed: Structure needs cleaning

I have tried to perform: xfs_repair /dev/md0
And I get the following output:

Phase 1 - find and verify superblock...
couldn't verify primary superblock - bad magic number !!!
attempting to find secondary superblock...
..............................................................................
..............................................................................
                          [many lines like this]
..............................................................................
..............................................................................
found candidate secondary superblock...unable to verify superblock,
continuing
...
...........................................................................

Note that it has been scanning for many hours and has located several
secondary superblocks with the same error. It is still scanning; however,
based on other posts I'm guessing it will not be successful.

To investigate the superblock info I used xfs_db and the magic number looks
ok:
sudo xfs_db /dev/md0
xfs_db> sb
xfs_db> p

magicnum = 0x58465342
blocksize = 4096
dblocks = 3662666880
rblocks = 0
rextents = 0
uuid = e74e5814-3e0f-4cd1-9a68-65d9df8a373f
logstart = 2147483655
rootino = 1024
rbmino = 1025
rsumino = 1026
rextsize = 1
agblocks = 114458368
agcount = 32
rbmblocks = 0
logblocks = 521728
versionnum = 0xbdb4
sectsize = 4096
inodesize = 512
inopblock = 8
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 12
inodelog = 9
inopblog = 3
agblklog = 27
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 4629568
ifree = 34177
fdblocks = 362013500
frextents = 0
uquotino = 0
gquotino = null
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 128
width = 640
dirblklog = 0
logsectlog = 12
logsectsize = 4096
logsunit = 4096
features2 = 0xa
bad_features2 = 0xa
features_compat = 0
features_ro_compat = 0
features_incompat = 0
features_log_incompat = 0
crc = 0 (unchecked)
pquotino = 0
lsn = 0


Any help or suggestions at this point would be much appreciated!  Is my
only option to try a repair -L?

Thanks in advance,
Dave


* XFS corrupt after RAID failure and resync
@ 2015-01-06  6:12 David Raffelt
  2015-01-06 12:47 ` Brian Foster
       [not found] ` <44b127de199c445fa12c3b832a05f108@000s-ex-hub-qs1.unimelb.edu.au>
  0 siblings, 2 replies; 12+ messages in thread
From: David Raffelt @ 2015-01-06  6:12 UTC (permalink / raw
  To: xfs



Hi again,
Some more information: the kernel log shows the following errors
occurring after the RAID recovery, but before I reset the server.

Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
xfs_trans_read_buf() returned error 117.


Thanks,
Dave


* Re: XFS corrupt after RAID failure and resync
  2015-01-06  5:39 David Raffelt
@ 2015-01-06 12:36 ` Stefan Ring
  2015-01-06 12:41 ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Stefan Ring @ 2015-01-06 12:36 UTC (permalink / raw
  To: David Raffelt; +Cc: Linux fs XFS

On Tue, Jan 6, 2015 at 6:39 AM, David Raffelt
<david.raffelt@florey.edu.au> wrote:
> Hi All,
>
> I have 7 drives in a RAID6 configuration with a XFS partition (running Arch
> linux). Recently two drives dropped out simultaneously, and a hot spare
> immediately synced successfully so that I now have 6/7 drives up in the
> array.
>
> After a reboot (to replace the faulty drives) the XFS file system would not
> mount. Note that I had to perform a hard reboot since the server hung on
> shutdown. When I try to mount I get the following error:
> mount: mount /dev/md0 on /export/data failed: Structure needs cleaning
>
> I have tried to perform: xfs_repair /dev/md0
> And I get the following output:
>
> Phase 1 - find and verify superblock...
> couldn't verify primary superblock - bad magic number !!!
> attempting to find secondary superblock...

This is certainly not what you want to hear, but I guess it will not
matter much what you do. If you want to, you can continue trying
everything for your own amusement, but I would say farewell to the
data on that RAID. A rebuild should not have any effect on the data
contained by the array. If it does, then something must have gone
wrong. And since something seems to have gone wrong, it has likely
gone wrong in a big way. What are the chances of just the tiny little
amount of data that XFS needs for mounting getting corrupted, while
everything else magically got away unscathed?


* Re: XFS corrupt after RAID failure and resync
  2015-01-06  5:39 David Raffelt
  2015-01-06 12:36 ` Stefan Ring
@ 2015-01-06 12:41 ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2015-01-06 12:41 UTC (permalink / raw
  To: David Raffelt; +Cc: xfs

On Tue, Jan 06, 2015 at 04:39:19PM +1100, David Raffelt wrote:
> Hi All,
> 
> I have 7 drives in a RAID6 configuration with a XFS partition (running Arch
> linux). Recently two drives dropped out simultaneously, and a hot spare
> immediately synced successfully so that I now have 6/7 drives up in the
> array.
> 

So at this point the fs was verified to be coherent/accessible?

> After a reboot (to replace the faulty drives) the XFS file system would not
> mount. Note that I had to perform a hard reboot since the server hung on
> shutdown. When I try to mount I get the following error:
> mount: mount /dev/md0 on /export/data failed: Structure needs cleaning

Hmm, seems like something associated with the array went wrong here. Do
you recall any errors before you hard reset? What happened when the
array was put back together after the reset?

Do you have anything more descriptive in the log or dmesg on the failed
mount attempt after the box rebooted?
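
For example, something along these lines should pull the relevant kernel
messages out of a systemd journal (assuming md0 as above):

# kernel messages from the current boot that mention the md0 filesystem
journalctl -k -b | grep 'XFS (md0)'
dmesg | grep -i xfs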

> 
> I have tried to perform: xfs_repair /dev/md0
> And I get the following output:
> 
> Phase 1 - find and verify superblock...
> couldn't verify primary superblock - bad magic number !!!
> attempting to find secondary superblock...
> ..............................................................................
> ..............................................................................
>                           [many lines like this]
> ..............................................................................
> ..............................................................................
> found candidate secondary superblock...unable to verify superblock,
> continuing
> ...
> ...........................................................................
> 
> Note that it has been scanning for many hours and has located several
> secondary superblocks with the same error. It is till scanning however
> based on other posts I'm guessing it will not be successful.
> 

This is behavior I would expect if the array was borked (e.g., drives
misordered or something of that nature). Is the array in a sane state
when you attempted this (e.g., what does mdadm show for the state of the
various drives)? It is strange that it complains about the magic number
given the output below.
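
Something like the following would show the current array and member
state (device names are placeholders):

# overall md state and per-member status
cat /proc/mdstat
mdadm --detail /dev/md0
mdadm -E /dev/sd[a-i]1 | grep -E 'Events|State'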

> To investigate the superblock info I used xfs_db and the magic number looks
> ok:
> sudo xfs_db /dev/md0
> xfs_db> sb
> xfs_db> p
> 
> magicnum = 0x58465342
> blocksize = 4096
> dblocks = 3662666880
> rblocks = 0
> rextents = 0
> uuid = e74e5814-3e0f-4cd1-9a68-65d9df8a373f
> logstart = 2147483655
> rootino = 1024
> rbmino = 1025
> rsumino = 1026
> rextsize = 1
> agblocks = 114458368
> agcount = 32
> rbmblocks = 0
> logblocks = 521728
> versionnum = 0xbdb4
> sectsize = 4096
> inodesize = 512
> inopblock = 8
> fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
> blocklog = 12
> sectlog = 12
> inodelog = 9
> inopblog = 3
> agblklog = 27
> rextslog = 0
> inprogress = 0
> imax_pct = 5
> icount = 4629568
> ifree = 34177
> fdblocks = 362013500
> frextents = 0
> uquotino = 0
> gquotino = null
> qflags = 0
> flags = 0
> shared_vn = 0
> inoalignmt = 2
> unit = 128
> width = 640
> dirblklog = 0
> logsectlog = 12
> logsectsize = 4096
> logsunit = 4096
> features2 = 0xa
> bad_features2 = 0xa
> features_compat = 0
> features_ro_compat = 0
> features_incompat = 0
> features_log_incompat = 0
> crc = 0 (unchecked)
> pquotino = 0
> lsn = 0
> 
> 
> Any help or suggestions at this point would be much appreciated!  Is my
> only option to try a repair -L?
> 

Repair probably would have complained about a dirty log above if -L was
necessary. Did you run with '-n'? Are you aware of whether the fs was
cleanly unmounted before you performed the hard reset?
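
For reference, a read-only pass would be something like this (assuming
/dev/md0, and that the array itself is assembled sanely):

# no-modify mode: reports problems without writing anything
xfs_repair -n /dev/md0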

Brian

> Thanks in advance,
> Dave

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: XFS corrupt after RAID failure and resync
  2015-01-06  6:12 David Raffelt
@ 2015-01-06 12:47 ` Brian Foster
       [not found] ` <44b127de199c445fa12c3b832a05f108@000s-ex-hub-qs1.unimelb.edu.au>
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2015-01-06 12:47 UTC (permalink / raw
  To: David Raffelt; +Cc: xfs

On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> Hi again,
> Some more information.... the kernel log show the following errors were
> occurring after the RAID recovery, but before I reset the server.
> 

By after the raid recovery, you mean after the two drives had failed out
and 1 hot spare was activated and resync completed? It certainly seems
like something went wrong in this process. The output below looks like
it's failing to read in some inodes. Is there any stack trace output
that accompanies these error messages to confirm?

I suppose I would try to verify that the array configuration looks sane,
but after the hot spare resync and then one or two other drive
replacements (was the hot spare ultimately replaced?), it's hard to say
whether it might be recoverable.

Brian

> Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
> 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
> xfs_trans_read_buf() returned error 117.
> 
> 
> Thanks,
> Dave

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: XFS corrupt after RAID failure and resync
       [not found] ` <44b127de199c445fa12c3b832a05f108@000s-ex-hub-qs1.unimelb.edu.au>
@ 2015-01-06 20:34   ` David Raffelt
  2015-01-06 23:16     ` Brian Foster
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Raffelt @ 2015-01-06 20:34 UTC (permalink / raw
  To: Brian Foster, stefanrin; +Cc: xfs@oss.sgi.com



Hi Brian and Stefan,
Thanks for your reply.  I checked the status of the array after the rebuild
(and before the reset).

md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
      14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]
[UUUUUU_]

However, given that I've never had any problems with mdadm rebuilds
before, I did not think to check the data before rebooting. Note that the
array is still in this state. Before the reboot I tried to run a smartctl
check on the failed drives, and it could not read them. When I rebooted I
did not actually replace any drives; I just power cycled to see if I
could re-access the drives that were thrown out of the array. According
to smartctl they are completely fine.

I guess there is no way I can re-add the old drives and remove the newly
synced drive?  Even though I immediately kicked all users off the system
when I got the mdadm alert, it's possible a small amount of data was
written to the array during the resync.

It looks like the filesystem was not unmounted properly before reboot:
Jan 06 09:11:54 server systemd[1]: Failed unmounting /export/data.
Jan 06 09:11:54 server systemd[1]: Shutting down.

Here are the mount errors from the log after rebooting:
Jan 06 09:15:17 server kernel: XFS (md0): Mounting Filesystem
Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: block 0x400
("xfs_trans_read_buf_map") error 117 numblks 16
Jan 06 09:15:17 server kernel: XFS (md0): xfs_imap_to_bp:
xfs_trans_read_buf() returned error 117.
Jan 06 09:15:17 server kernel: XFS (md0): failed to read root inode

xfs_repair -n -L also complains about a bad magic number.

Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
volume. It was only ever meant to be a scratch drive for intermediate
scientific results, however inevitably most users used it to store lots of
data. Oh well.

Thanks again,
Dave

On 6 January 2015 at 23:47, Brian Foster <bfoster@redhat.com> wrote:

> On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> > Hi again,
> > Some more information.... the kernel log show the following errors were
> > occurring after the RAID recovery, but before I reset the server.
> >
>
> By after the raid recovery, you mean after the two drives had failed out
> and 1 hot spare was activated and resync completed? It certainly seems
> like something went wrong in this process. The output below looks like
> it's failing to read in some inodes. Is there any stack trace output
> that accompanies these error messages to confirm?
>
> I suppose I would try to verify that the array configuration looks sane,
> but after the hot spare resync and then one or two other drive
> replacements (was the hot spare ultimately replaced?), it's hard to say
> whether it might be recoverable.
>
> Brian
>
> > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount
> and
> > run xfs_repair
> > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount
> and
> > run xfs_repair
> > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount
> and
> > run xfs_repair
> > Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
> > 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> > Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
> > xfs_trans_read_buf() returned error 117.
> >
> >
> > Thanks,
> > Dave
>
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
>
>


-- 
*David Raffelt (PhD)*
Postdoctoral Fellow

The Florey Institute of Neuroscience and Mental Health
Melbourne Brain Centre - Austin Campus
245 Burgundy Street
Heidelberg Vic 3084
Ph: +61 3 9035 7024
www.florey.edu.au


* Re: XFS corrupt after RAID failure and resync
  2015-01-06 20:34   ` David Raffelt
@ 2015-01-06 23:16     ` Brian Foster
       [not found]     ` <8cc9a649ec2240faa4e38fd742437546@000S-EX-HUB-NP2.unimelb.edu.au>
  2015-01-07  2:35     ` Chris Murphy
  2 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2015-01-06 23:16 UTC (permalink / raw
  To: David Raffelt; +Cc: stefanrin, xfs@oss.sgi.com

On Wed, Jan 07, 2015 at 07:34:37AM +1100, David Raffelt wrote:
> Hi Brian and Stefan,
> Thanks for your reply.  I checked the status of the array after the rebuild
> (and before the reset).
> 
> md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
>       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]
> [UUUUUU_]
> 
> However given that I've never had any problems before with mdadm rebuilds I
> did not think to check the data before rebooting.  Note that the array is
> still in this state. Before the reboot I tried to run a smartctl check on
> the failed drives and it could not read them. When I rebooted I did not
> actually replace any drives, I just power cycled to see if I could
> re-access the drives that were thrown out of the array. According to
> smartctl they are completely fine.
> 
> I guess there is no way I can re-add the old drives and remove the newly
> synced drive?  Even though I immediately kicked all users off the system
> when I got the mdadm alert, it's possible a small amount of data was
> written to the array during the resync.
> 
> It looks like the filesystem was not unmounted properly before reboot:
> Jan 06 09:11:54 server systemd[1]: Failed unmounting /export/data.
> Jan 06 09:11:54 server systemd[1]: Shutting down.
> 
> Here is the mount errors in the log after rebooting:
> Jan 06 09:15:17 server kernel: XFS (md0): Mounting Filesystem
> Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: block 0x400
> ("xfs_trans_read_buf_map") error 117 numblks 16
> Jan 06 09:15:17 server kernel: XFS (md0): xfs_imap_to_bp:
> xfs_trans_read_buf() returned error 117.
> Jan 06 09:15:17 server kernel: XFS (md0): failed to read root inode
> 

So it fails to read the root inode. You could also try to read said
inode via xfs_db (e.g., 'sb,' 'p rootino,' 'inode <ino#>,' 'p') and see
what it shows.
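
Non-interactively that would look something like this (the inode number
1024 is taken from your earlier sb output):

# print the root inode number, then dump that inode
xfs_db -c 'sb 0' -c 'p rootino' /dev/md0
xfs_db -c 'inode 1024' -c 'p' /dev/md0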

Are you able to run xfs_metadump against the fs? If so and you're
willing/able to make the dump available somewhere (compressed), I'd be
interested to take a look to see what might be causing the difference in
behavior between repair and xfs_db.
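
A rough sketch of producing and compressing the dump (the target file
name is just an example):

# metadata-only image; -g shows progress, file names are obfuscated by default
xfs_metadump -g /dev/md0 md0.metadump
gzip md0.metadump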

Brian

> xfs_repair -n -L also complains about a bad magic number.
> 
> Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> volume. It was only ever meant to be a scratch drive for intermediate
> scientific results, however inevitably most users used it to store lots of
> data. Oh well.
> 
> Thanks again,
> Dave
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 6 January 2015 at 23:47, Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> > > Hi again,
> > > Some more information.... the kernel log show the following errors were
> > > occurring after the RAID recovery, but before I reset the server.
> > >
> >
> > By after the raid recovery, you mean after the two drives had failed out
> > and 1 hot spare was activated and resync completed? It certainly seems
> > like something went wrong in this process. The output below looks like
> > it's failing to read in some inodes. Is there any stack trace output
> > that accompanies these error messages to confirm?
> >
> > I suppose I would try to verify that the array configuration looks sane,
> > but after the hot spare resync and then one or two other drive
> > replacements (was the hot spare ultimately replaced?), it's hard to say
> > whether it might be recoverable.
> >
> > Brian
> >
> > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount
> > and
> > > run xfs_repair
> > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount
> > and
> > > run xfs_repair
> > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount
> > and
> > > run xfs_repair
> > > Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
> > > 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> > > Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
> > > xfs_trans_read_buf() returned error 117.
> > >
> > >
> > > Thanks,
> > > Dave
> >
> > > _______________________________________________
> > > xfs mailing list
> > > xfs@oss.sgi.com
> > > http://oss.sgi.com/mailman/listinfo/xfs
> >
> >
> 
> 
> -- 
> *David Raffelt (PhD)*
> Postdoctoral Fellow
> 
> The Florey Institute of Neuroscience and Mental Health
> Melbourne Brain Centre - Austin Campus
> 245 Burgundy Street
> Heidelberg Vic 3084
> Ph: +61 3 9035 7024
> www.florey.edu.au

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: XFS corrupt after RAID failure and resync
       [not found]     ` <8cc9a649ec2240faa4e38fd742437546@000S-EX-HUB-NP2.unimelb.edu.au>
@ 2015-01-06 23:47       ` David Raffelt
  2015-01-07  0:27         ` Dave Chinner
  2015-01-07 16:16         ` Brian Foster
  0 siblings, 2 replies; 12+ messages in thread
From: David Raffelt @ 2015-01-06 23:47 UTC (permalink / raw
  To: Brian Foster; +Cc: stefanrin@gmail.com, xfs@oss.sgi.com



Hi Brian,
Below is the root inode data. I'm currently running xfs_metadump and will
send you a link to the file.
Cheers!
David




xfs_db> sb
xfs_db> p rootino
rootino = 1024
xfs_db> inode 1024
xfs_db> p
core.magic = 0
core.mode = 0
core.version = 0
core.format = 0 (dev)
core.uid = 0
core.gid = 0
core.flushiter = 0
core.atime.sec = Thu Jan  1 10:00:00 1970
core.atime.nsec = 000000000
core.mtime.sec = Thu Jan  1 10:00:00 1970
core.mtime.nsec = 000000000
core.ctime.sec = Thu Jan  1 10:00:00 1970
core.ctime.nsec = 000000000
core.size = 0
core.nblocks = 0
core.extsize = 0
core.nextents = 0
core.naextents = 0
core.forkoff = 0
core.aformat = 0 (dev)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.filestream = 0
core.gen = 0
next_unlinked = 0
u.dev = 0


On 7 January 2015 at 10:16, Brian Foster <bfoster@redhat.com> wrote:

> On Wed, Jan 07, 2015 at 07:34:37AM +1100, David Raffelt wrote:
> > Hi Brian and Stefan,
> > Thanks for your reply.  I checked the status of the array after the
> rebuild
> > (and before the reset).
> >
> > md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
> >       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]
> > [UUUUUU_]
> >
> > However given that I've never had any problems before with mdadm
> rebuilds I
> > did not think to check the data before rebooting.  Note that the array is
> > still in this state. Before the reboot I tried to run a smartctl check on
> > the failed drives and it could not read them. When I rebooted I did not
> > actually replace any drives, I just power cycled to see if I could
> > re-access the drives that were thrown out of the array. According to
> > smartctl they are completely fine.
> >
> > I guess there is no way I can re-add the old drives and remove the newly
> > synced drive?  Even though I immediately kicked all users off the system
> > when I got the mdadm alert, it's possible a small amount of data was
> > written to the array during the resync.
> >
> > It looks like the filesystem was not unmounted properly before reboot:
> > Jan 06 09:11:54 server systemd[1]: Failed unmounting /export/data.
> > Jan 06 09:11:54 server systemd[1]: Shutting down.
> >
> > Here is the mount errors in the log after rebooting:
> > Jan 06 09:15:17 server kernel: XFS (md0): Mounting Filesystem
> > Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount
> and
> > run xfs_repair
> > Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount
> and
> > run xfs_repair
> > Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount
> and
> > run xfs_repair
> > Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: block 0x400
> > ("xfs_trans_read_buf_map") error 117 numblks 16
> > Jan 06 09:15:17 server kernel: XFS (md0): xfs_imap_to_bp:
> > xfs_trans_read_buf() returned error 117.
> > Jan 06 09:15:17 server kernel: XFS (md0): failed to read root inode
> >
>
> So it fails to read the root inode. You could also try to read said
> inode via xfs_db (e.g., 'sb,' 'p rootino,' 'inode <ino#>,' 'p') and see
> what it shows.
>
> Are you able to run xfs_metadump against the fs? If so and you're
> willing/able to make the dump available somewhere (compressed), I'd be
> interested to take a look to see what might be causing the difference in
> behavior between repair and xfs_db.
>
> Brian
>
> > xfs_repair -n -L also complains about a bad magic number.
> >
> > Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> > volume. It was only ever meant to be a scratch drive for intermediate
> > scientific results, however inevitably most users used it to store lots
> of
> > data. Oh well.
> >
> > Thanks again,
> > Dave
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On 6 January 2015 at 23:47, Brian Foster <bfoster@redhat.com> wrote:
> >
> > > On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> > > > Hi again,
> > > > Some more information.... the kernel log show the following errors
> were
> > > > occurring after the RAID recovery, but before I reset the server.
> > > >
> > >
> > > By after the raid recovery, you mean after the two drives had failed
> out
> > > and 1 hot spare was activated and resync completed? It certainly seems
> > > like something went wrong in this process. The output below looks like
> > > it's failing to read in some inodes. Is there any stack trace output
> > > that accompanies these error messages to confirm?
> > >
> > > I suppose I would try to verify that the array configuration looks
> sane,
> > > but after the hot spare resync and then one or two other drive
> > > replacements (was the hot spare ultimately replaced?), it's hard to say
> > > whether it might be recoverable.
> > >
> > > Brian
> > >
> > > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected.
> Unmount
> > > and
> > > > run xfs_repair
> > > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected.
> Unmount
> > > and
> > > > run xfs_repair
> > > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected.
> Unmount
> > > and
> > > > run xfs_repair
> > > > Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
> > > > 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> > > > Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
> > > > xfs_trans_read_buf() returned error 117.
> > > >
> > > >
> > > > Thanks,
> > > > Dave
> > >
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs
> > >
> > >
> >
> >
> > --
> > *David Raffelt (PhD)*
> > Postdoctoral Fellow
> >
> > The Florey Institute of Neuroscience and Mental Health
> > Melbourne Brain Centre - Austin Campus
> > 245 Burgundy Street
> > Heidelberg Vic 3084
> > Ph: +61 3 9035 7024
> > www.florey.edu.au
>
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
>
>


-- 
*David Raffelt (PhD)*
Postdoctoral Fellow

The Florey Institute of Neuroscience and Mental Health
Melbourne Brain Centre - Austin Campus
245 Burgundy Street
Heidelberg Vic 3084
Ph: +61 3 9035 7024
www.florey.edu.au


* Re: XFS corrupt after RAID failure and resync
  2015-01-06 23:47       ` David Raffelt
@ 2015-01-07  0:27         ` Dave Chinner
  2015-01-07 16:16         ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2015-01-07  0:27 UTC (permalink / raw
  To: David Raffelt; +Cc: stefanrin@gmail.com, Brian Foster, xfs@oss.sgi.com

On Wed, Jan 07, 2015 at 10:47:00AM +1100, David Raffelt wrote:
> Hi Brain,
> Below is the root inode data. I'm currently running xfs_metadump and will
> send you a link to the file.
> Cheers!
> David
> 
> 
> 
> 
> xfs_db> sb
> xfs_db> p rootino
> rootino = 1024
> xfs_db> inode 1024
> xfs_db> p
> core.magic = 0
> core.mode = 0
> core.version = 0
> core.format = 0 (dev)

It's a zero'd block. Even an unallocated inode should have the
magic number, inode version and data fork format stamped in it.
This is something we typically see when RAID rebuilds have not
worked correctly, even though they say they were "successful"....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS corrupt after RAID failure and resync
  2015-01-06 20:34   ` David Raffelt
  2015-01-06 23:16     ` Brian Foster
       [not found]     ` <8cc9a649ec2240faa4e38fd742437546@000S-EX-HUB-NP2.unimelb.edu.au>
@ 2015-01-07  2:35     ` Chris Murphy
  2 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2015-01-07  2:35 UTC (permalink / raw
  To: xfs@oss.sgi.com

On Tue, Jan 6, 2015 at 1:34 PM, David Raffelt
<david.raffelt@florey.edu.au> wrote:
> Hi Brian and Stefan,
> Thanks for your reply.  I checked the status of the array after the rebuild
> (and before the reset).
>
> md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
>       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]
> [UUUUUU_]
>
> However given that I've never had any problems before with mdadm rebuilds I
> did not think to check the data before rebooting.  Note that the array is
> still in this state. Before the reboot I tried to run a smartctl check on
> the failed drives and it could not read them. When I rebooted I did not
> actually replace any drives, I just power cycled to see if I could re-access
> the drives that were thrown out of the array. According to smartctl they are
> completely fine.
>
> I guess there is no way I can re-add the old drives and remove the newly
> synced drive?  Even though I immediately kicked all users off the system
> when I got the mdadm alert, it's possible a small amount of data was written
> to the array during the resync.

Well it sounds like there's more than one possibility here. If I
follow correctly, you definitely had a working degraded 5/7 drive
array, correct? In which case at least it should be possible to get
that back, but I don't know what was happening at the time the system
hung up on poweroff.

It's not rare for SMART to miss certain failure vectors, so it might say
the drive is fine when it isn't. But what you should do next
is

mdadm -Evv /dev/sd[abcdefg]1   ##use actual drive letters

Are you able to get information on all seven drives? Or do you
definitely have at least one drive failed?

If the event counter from the above examine is the same for at least 5
drives, you should be able to assemble the array with this command:

mdadm --assemble --verbose /dev/mdX /dev/sd[bcdef]1

You have to feed in the right drive letters for the drives with the same
event counter. If that's 5 drives, use that. If it's 6 drives, use that.
If the event counters are all off, then it's a matter of how far off they
are, so you may just post the event counters so we can see this. This
isn't going to write anything to the array since the fs isn't mounted, so
if it fails, nothing is worse off. If it works, then you can run
xfs_repair -n and see if you get a sane result. If that works, you can
mount it in this degraded state and maybe extract some of the more
important data before proceeding to the next step.
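
A quick way to compare the event counters across members might be
(drive letters are placeholders):

# print the Events line for each member alongside its device name
for d in /dev/sd[abcdei]1; do echo -n "$d: "; mdadm -E $d | grep Events; done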

In the meantime I'm also curious about:

smartctl -l scterc /dev/sdX

This has to be issued per drive, no shortcut available by specifying
all letters at once in brackets. And then lastly this one:

cat /sys/block/sd[abcdefg]/device/timeout

Again plug in the correct letters.
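
A small loop covers both checks per drive (again, the letters are
placeholders):

for d in a b c d e f i; do
  echo "=== /dev/sd$d ==="
  smartctl -l scterc /dev/sd$d            # drive error recovery limit, if supported
  cat /sys/block/sd$d/device/timeout      # kernel SCSI command timer in seconds
done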



> Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> volume. It was only ever meant to be a scratch drive for intermediate
> scientific results, however inevitably most users used it to store lots of
> data. Oh well.

Right, well, it's not for sure toast yet. Also, one of the things
gluster is intended to mitigate is the loss of an entire brick, which
is what happened, but you need another 15TB of space to do
distributed-replicated on your scratch space. If you can tolerate
upwards of 48-hour single-disk rebuild times, there are now 8TB HGST
Helium drives :-P


-- 
Chris Murphy


* Re: XFS corrupt after RAID failure and resync
  2015-01-06 23:47       ` David Raffelt
  2015-01-07  0:27         ` Dave Chinner
@ 2015-01-07 16:16         ` Brian Foster
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2015-01-07 16:16 UTC (permalink / raw
  To: David Raffelt; +Cc: stefanrin@gmail.com, xfs@oss.sgi.com

On Wed, Jan 07, 2015 at 10:47:00AM +1100, David Raffelt wrote:
> Hi Brain,
> Below is the root inode data. I'm currently running xfs_metadump and will
> send you a link to the file.
> Cheers!
> David
> 
> 

Thanks for the metadump. It appears that repair complains about the sb
magic number and goes off scanning for secondary superblocks due to a
bug in verify_set_primary_sb(). This function scans through all the
superblocks and tries to find a consistent value across the set by
tracking which valid sb value occurs most frequently. The bug is that
even if enough valid superblocks are found, we return the validity of
the last sb we happened to look at.

In your case, 10 or so of the 32 superblocks are corrupted and the last
one scanned ('sb 31,' 'p') is one of those. If I work around that issue,
the repair continues on a bit. It generates a _ton_ of noise and
eventually falls over somewhere else (and if I work around that, the
cycle repeats yet again somewhere else). Anyways, I'm adding to my todo
list to take a closer look at this code and perhaps put together a test
case if we don't have enough coverage as is, but that problem doesn't
appear to be gating the ability to recover this particular fs.
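
For anyone curious, the bad secondaries are easy to spot from xfs_db,
e.g. (the AG number here is just an example):

# a valid superblock prints magicnum = 0x58465342; a corrupted one won't
xfs_db -c 'sb 31' -c 'p magicnum' /dev/md0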

Given that, and as Dave pointed out from the output below, the root
inode is clearly completely zeroed out, it does appear that this array
ended up pretty scrambled one way or another. The best I can recommend
is to try and see if the array can be put back together in some manner
that repair can cope with or restore from whatever backups might be
available (others more familiar with md are better help here).

It also might be a good idea to audit the array recovery process
involved in this scenario for future occurrences, because clearly
something went horribly wrong. E.g., were the hot spare activation and
whatever other array modifications done via script or manually? Do the
storage servers that
run these arrays have sane shutdown/startup sequences in the event of
degraded/syncing/busy arrays? etc. It might be worthwhile to try and
reproduce some of these array failure conditions on a test box to
identify any problems with the recovery process before it has to be run
again on one of your other glusterfs servers.

Brian

> 
> 
> xfs_db> sb
> xfs_db> p rootino
> rootino = 1024
> xfs_db> inode 1024
> xfs_db> p
> core.magic = 0
> core.mode = 0
> core.version = 0
> core.format = 0 (dev)
> core.uid = 0
> core.gid = 0
> core.flushiter = 0
> core.atime.sec = Thu Jan  1 10:00:00 1970
> core.atime.nsec = 000000000
> core.mtime.sec = Thu Jan  1 10:00:00 1970
> core.mtime.nsec = 000000000
> core.ctime.sec = Thu Jan  1 10:00:00 1970
> core.ctime.nsec = 000000000
> core.size = 0
> core.nblocks = 0
> core.extsize = 0
> core.nextents = 0
> core.naextents = 0
> core.forkoff = 0
> core.aformat = 0 (dev)
> core.dmevmask = 0
> core.dmstate = 0
> core.newrtbm = 0
> core.prealloc = 0
> core.realtime = 0
> core.immutable = 0
> core.append = 0
> core.sync = 0
> core.noatime = 0
> core.nodump = 0
> core.rtinherit = 0
> core.projinherit = 0
> core.nosymlinks = 0
> core.extsz = 0
> core.extszinherit = 0
> core.nodefrag = 0
> core.filestream = 0
> core.gen = 0
> next_unlinked = 0
> u.dev = 0
> 
> 
> On 7 January 2015 at 10:16, Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Wed, Jan 07, 2015 at 07:34:37AM +1100, David Raffelt wrote:
> > > Hi Brian and Stefan,
> > > Thanks for your reply.  I checked the status of the array after the
> > rebuild
> > > (and before the reset).
> > >
> > > md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
> > >       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]
> > > [UUUUUU_]
> > >
> > > However given that I've never had any problems before with mdadm
> > rebuilds I
> > > did not think to check the data before rebooting.  Note that the array is
> > > still in this state. Before the reboot I tried to run a smartctl check on
> > > the failed drives and it could not read them. When I rebooted I did not
> > > actually replace any drives, I just power cycled to see if I could
> > > re-access the drives that were thrown out of the array. According to
> > > smartctl they are completely fine.
> > >
> > > I guess there is no way I can re-add the old drives and remove the newly
> > > synced drive?  Even though I immediately kicked all users off the system
> > > when I got the mdadm alert, it's possible a small amount of data was
> > > written to the array during the resync.
> > >
> > > It looks like the filesystem was not unmounted properly before reboot:
> > > Jan 06 09:11:54 server systemd[1]: Failed unmounting /export/data.
> > > Jan 06 09:11:54 server systemd[1]: Shutting down.
> > >
> > > Here is the mount errors in the log after rebooting:
> > > Jan 06 09:15:17 server kernel: XFS (md0): Mounting Filesystem
> > > Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount
> > and
> > > run xfs_repair
> > > Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount
> > and
> > > run xfs_repair
> > > Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount
> > and
> > > run xfs_repair
> > > Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: block 0x400
> > > ("xfs_trans_read_buf_map") error 117 numblks 16
> > > Jan 06 09:15:17 server kernel: XFS (md0): xfs_imap_to_bp:
> > > xfs_trans_read_buf() returned error 117.
> > > Jan 06 09:15:17 server kernel: XFS (md0): failed to read root inode
> > >
> >
> > So it fails to read the root inode. You could also try to read said
> > inode via xfs_db (e.g., 'sb,' 'p rootino,' 'inode <ino#>,' 'p') and see
> > what it shows.
> >
> > Are you able to run xfs_metadump against the fs? If so and you're
> > willing/able to make the dump available somewhere (compressed), I'd be
> > interested to take a look to see what might be causing the difference in
> > behavior between repair and xfs_db.
> >
> > Brian
> >
> > > xfs_repair -n -L also complains about a bad magic number.
> > >
> > > Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> > > volume. It was only ever meant to be a scratch drive for intermediate
> > > scientific results, however inevitably most users used it to store lots
> > of
> > > data. Oh well.
> > >
> > > Thanks again,
> > > Dave
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 6 January 2015 at 23:47, Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > > On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> > > > > Hi again,
> > > > > Some more information.... the kernel log show the following errors
> > were
> > > > > occurring after the RAID recovery, but before I reset the server.
> > > > >
> > > >
> > > > By after the raid recovery, you mean after the two drives had failed
> > out
> > > > and 1 hot spare was activated and resync completed? It certainly seems
> > > > like something went wrong in this process. The output below looks like
> > > > it's failing to read in some inodes. Is there any stack trace output
> > > > that accompanies these error messages to confirm?
> > > >
> > > > I suppose I would try to verify that the array configuration looks
> > sane,
> > > > but after the hot spare resync and then one or two other drive
> > > > replacements (was the hot spare ultimately replaced?), it's hard to say
> > > > whether it might be recoverable.
> > > >
> > > > Brian
> > > >
> > > > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected.
> > Unmount
> > > > and
> > > > > run xfs_repair
> > > > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected.
> > Unmount
> > > > and
> > > > > run xfs_repair
> > > > > Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected.
> > Unmount
> > > > and
> > > > > run xfs_repair
> > > > > Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
> > > > > 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> > > > > Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
> > > > > xfs_trans_read_buf() returned error 117.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Dave
> > > >
> > > > > _______________________________________________
> > > > > xfs mailing list
> > > > > xfs@oss.sgi.com
> > > > > http://oss.sgi.com/mailman/listinfo/xfs
> > > >
> > > >
> > >
> > >
> > > --
> > > *David Raffelt (PhD)*
> > > Postdoctoral Fellow
> > >
> > > The Florey Institute of Neuroscience and Mental Health
> > > Melbourne Brain Centre - Austin Campus
> > > 245 Burgundy Street
> > > Heidelberg Vic 3084
> > > Ph: +61 3 9035 7024
> > > www.florey.edu.au
> >
> > > _______________________________________________
> > > xfs mailing list
> > > xfs@oss.sgi.com
> > > http://oss.sgi.com/mailman/listinfo/xfs
> >
> >
> 
> 
> -- 
> *David Raffelt (PhD)*
> Postdoctoral Fellow
> 
> The Florey Institute of Neuroscience and Mental Health
> Melbourne Brain Centre - Austin Campus
> 245 Burgundy Street
> Heidelberg Vic 3084
> Ph: +61 3 9035 7024
> www.florey.edu.au

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: XFS corrupt after RAID failure and resync
@ 2015-01-08  8:09 Chris Murphy
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2015-01-08  8:09 UTC (permalink / raw
  To: David Raffelt; +Cc: Chris Murphy, xfs@oss.sgi.com

On Wed, Jan 7, 2015 at 12:05 AM, David Raffelt
<david.raffelt@florey.edu.au> wrote:

> Yes, after the 2 disks were dropped I definitely had a working degraded
> drive with 5/7 . I only see XFS errors in the kernel log soon AFTER the hot
> spare finished syncing.

I suggest moving this to the linux-raid@ list and including the following:

a brief description: e.g., a 7-drive raid6 array, 2 drives got booted at
some point due to errors, a hot spare starts rebuilding and finishes,
then XFS errors appear in the log, and xfs_repair -n results suggest a
bad RAID assembly

kernel version
mdadm version
drive model numbers as well as their SCT ERC values
mdadm -E for all drives
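
Something like this would collect most of that in one go (device names
and the output file are just examples):

uname -r                                  # kernel version
mdadm --version
mdadm -E /dev/sd[abcdei]1 > mdadm-examine.txt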

The list can take all of this. I'm not sure if it'll also take a large
journal but I'd try it first before using a URL.

For the journal, two things. First, it's not going back far enough: the
problems had already begun, and it'd be good to have a lot more context,
so I'd dig back and find the first indication of a problem. You can use
journalctl --since for this; it can take the form:

journalctl --since "24 hours ago" or "2015-01-04 12:15:00"


Also use the option -o short-monotonic, which uses monotonic time; it
could come in handy and is more like dmesg output.
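
Combined, a plausible invocation would be something like (-k limits the
output to kernel messages):

journalctl --since "2015-01-04 12:15:00" -o short-monotonic -k > raid-incident.log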

>> smarctl -l scterc /dev/sdX
>
>
> I'm ashamed to say that this command only works on 1 of the 8 drives since
> this is the only enterprise class drive (we are funded by small science
> grants). We have been gradually replacing the desktop class drives as they
> fail.

The errors in your logs are a lot more extensive than what I'm used to
seeing in cases of misconfiguration with desktop drives that lack
configurable SCT ERC, but the failure is consistent with that common
misconfiguration. The problem with desktop drives is the combination
of long error recoveries for bad sectors along with a short kernel
SCSI command timer. So what happens is the kernel thinks the drive has
hung up and does a link reset. In reality the drive is probably in a
so-called "deep recovery" but doesn't get a chance to report an
explicit read error. An explicit read error includes the affected
sector LBA, which the md kernel code can then use to rebuild the data
from parity and overwrite the bad sector, fixing the problem.

However...


>> This has to be issued per drive, no shortcut available by specifying
>> all letters at once in brackets. And then lastly this one:
>>
>> cat /sys/block/sd[abcdefg]/device/timeout
>>
>> Again plug in the correct letters.
>
>
> All devices are set to 30 seconds.

This effectively prevents consumer drives from reporting marginally
bad blocks. If they're clearly bad, drive ECC reports read errors
fairly quickly. If they're fuzzy, then the ECC does a bunch of retries,
potentially well beyond 30 seconds. I've heard times of 2-3 minutes,
which seems crazy, but that's apparently how long it can be before the
drive gives up and reports a read error. And that read error is
necessary for RAID to work correctly.
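
For the one drive that does support it, you can instead cap the drive's
recovery time; 7 seconds for reads and writes is a common choice, and
note the setting is usually lost on a power cycle:

smartctl -l scterc,70,70 /dev/sdX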

So what you need to do for all drives that do not have configurable SCT
ERC is:

echo 180 > /sys/block/sdX/device/timeout

That way the kernel will wait up to 3 minutes. The drive will almost
certainly report an explicit read error in less than that, and then md
can fix the problem by writing over that bad sector. To force this
correction actively rather than passively you should schedule a scrub
of all arrays:

echo check > /sys/block/mdX/md/sync_action

You can do this on complete arrays in normal operation. I wouldn't do
this on the degraded array though. Consult linux-raid@ and do what's
suggested there.
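
Putting those two pieces together on a healthy (non-degraded) array, the
sequence might look like this (device names are placeholders):

for d in sda sdb sdc sdd sde sdf sdi; do
  echo 180 > /sys/block/$d/device/timeout
done
echo check > /sys/block/mdX/md/sync_action
cat /proc/mdstat                          # watch the check progress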




>> Right well it's not fore sure toast yet. Also, one of the things
>> gluster is intended to mitigate is the loss of an entire brick, which
>> is what happened, but you need another 15TB of space to do
>> distributed-replicated on your scratch space. If you can tolerate
>> upwards of 48 hour single disk rebuild times, there are now 8TB HGST
>> Helium drives :-P
>
>
> Just to confirm, we have 3x15TB bricks in a 45TB volume. Don't we need
> complete duplication in a distributed-replicated Gluster volume, or can we
> get away with only 1 more brick?

If you want all the data to be replicated you need double the storage.
But you can have more than one volume, such that one has replication
and the other doesn't. The bricks used for replication volumes don't
both have to be raid6. It could be one raid6 and one raid5, or one
raid6 and one raid0. It's a risk assessment.


-- 
Chris Murphy

