* mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Edward Kuns @ 2015-07-12  6:02 UTC
  To: linux-raid

I experienced a total drive failure.  Looking into it, I discovered
that the hard drive model that failed is a particularly bad one, so I
replaced not only the failed drive but also another drive of the same
model.  In the process, I ran into a problem where, on reboot, the
RAID device was inactive.

I finally found a solution to my problem in the earlier thread "raid5
reshape is stuck" that started on 15 May.  By the way, I am on Fedora
21:

> rpm -q mdadm
mdadm-3.3.2-1.fc21.x86_64

> uname -srvmpio
Linux 4.0.4-202.fc21.x86_64 #1 SMP Wed May 27 22:28:42 UTC 2015 x86_64
x86_64 x86_64 GNU/Linux

The short version of the story is that I replaced the dead drive and
let the raid5 partition rebuild.  Then I added a new drive and let the
partition rebuild.  Then I removed the not-yet-dead drive and here is
where I ran into the same problem as the other poster.  Basically, I
did this to replace the still-working-but-suspect device, after the
partition completed rebuilding when I replaced the actually-dead
drive:

mdadm --manage /dev/md125 --add /dev/sdf1
mdadm --grow --raid-devices=5 /dev/md125

 ... wait for the rebuild to complete

mdadm --fail /dev/md125 /dev/sdd2
mdadm --remove /dev/md125 /dev/sdd2
mdadm --grow --raid-devices=4 /dev/md125

mdadm: this change will reduce the size of the array.
       use --grow --array-size first to truncate array.
       e.g. mdadm --grow /dev/md125 --array-size 118964736

mdadm --grow /dev/md125 --array-size 118964736
mdadm --grow --raid-devices=4 /dev/md125

... this failed with a mysterious complaint about my first partition
(Cannot set new_offset).  Research got me to try:

mdadm --grow --raid-devices=4 /dev/md125 --backup-file /root/md125.backup

.... here everything ground to a halt.  The reshape was at 0% and
there was no disk activity.
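
As an aside, a quick way to confirm a reshape really is stuck (assuming
the array is /dev/md125, as here) is to watch the md counters:

cat /proc/mdstat                         # the reshape line shows percent done and speed
cat /sys/block/md125/md/sync_action      # reads "reshape" while one is in progress
cat /sys/block/md125/md/sync_completed   # if this value never changes, nothing is moving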

The solution was to edit
/lib/systemd/system/mdadm-grow-continue@.service to look like this (it
was important that the backup file was placed in /tmp and not in /root
or anywhere else.  SELinux allowed mdadm to create a file in /tmp but
not anywhere else I tried):

#  This file is part of mdadm.
#
#  mdadm is free software; you can redistribute it and/or modify it
#  under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.

[Unit]
Description=Manage MD Reshape on /dev/%I
DefaultDependencies=no

[Service]
ExecStart=/usr/sbin/mdadm --grow --continue /dev/%I --backup-file=/tmp/raid-backup-file
StandardInput=null
#StandardOutput=null
#StandardError=null
KillMode=none

I had to comment out the standard out and error lines to see why the
service was failing.  I was pulling out my hair.  The raid device
failed to initialize, so my computer dumped me into runlevel 1.
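
With the StandardOutput/StandardError lines commented out, the unit's
output should end up in the journal, so something like this (the
instance name is my assumption; it should match the md device) would
have shown the failure directly:

systemctl status mdadm-grow-continue@md125.service
journalctl -b -u mdadm-grow-continue@md125.service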

When the process finished after the above fix, I ended up in a weird state:

    Number   Major   Minor   RaidDevice State
       0       8        2        0    active sync   /dev/sda2
       1       8       17        1    active sync   /dev/sdb1
       5       8       33        2    active sync   /dev/sdc1
       6       0        0        6    removed

       6       8       49        -    spare   /dev/sdd1

but that is probably a result of what I tried in order to bring it
back.  I could "stop" the raid and manually recreate it, and the
filesystems on it were fine, but it wouldn't come up without me doing
that.  I decided to try failing and re-adding that disk to see whether
it would work now that it had been able to complete a sync.  I did a
fail, remove, and add on /dev/sdd1 and it very quickly synced and came
into service.  The command "mdadm --detail /dev/md125" now shows a
happy raid5 with four partitions in it, all "active sync".  So all I
had to do was add the --backup-file to the command to "grow" down to 4
devices, and also to mdadm-grow-continue@.service.
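
For reference, a sketch of the complete fix in one place (the backup
file name is the one I used; the unit instance name is my assumption):

mdadm --grow --raid-devices=4 /dev/md125 --backup-file=/tmp/raid-backup-file
# and in /lib/systemd/system/mdadm-grow-continue@.service:
#   ExecStart=/usr/sbin/mdadm --grow --continue /dev/%I --backup-file=/tmp/raid-backup-file
systemctl daemon-reload
systemctl restart mdadm-grow-continue@md125.service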

I thought I'd let you know, in particular, that adding
--backup-file=/tmp/raid-backup-file to the service file worked to get
the process unstuck, and that due to SELinux it had to be in /tmp.
Also, should the "Cannot set new_offset" complaint maybe suggest
trying again with a backup file?

                 Eddie


* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Phil Turmel @ 2015-07-12 13:45 UTC
  To: Edward Kuns, linux-raid

Hi Edward,

On 07/12/2015 02:02 AM, Edward Kuns wrote:

[trim /]

> The short version of the story is that I replaced the dead drive and
> let the raid5 partition rebuild.  Then I added a new drive and let the
> partition rebuild.  Then I removed the not-yet-dead drive and here is
> where I ran into the same problem as the other poster.  Basically, I
> did this to replace the still-working-but-suspect device, after the
> partition completed rebuilding when I replaced the actually-dead
> drive:
> 
> mdadm --manage /dev/md125 --add /dev/sdf1
> mdadm --grow --raid-devices=5 /dev/md125
> 
>  ... wait for the rebuild to complete
> 
> mdadm --fail /dev/md125 /dev/sdd2
> mdadm --remove /dev/md125 /dev/sdd2
> mdadm --grow --raid-devices=4 /dev/md125
> 
> mdadm: this change will reduce the size of the array.
>        use --grow --array-size first to truncate array.
>        e.g. mdadm --grow /dev/md125 --array-size 118964736
> 
> mdadm --grow /dev/md125 --array-size 118964736
> mdadm --grow --raid-devices=4 /dev/md125
> 
> ... this failed with a mysterious complaint about my first partition
> (Cannot set new_offset).  Research got me to try:
> 
> mdadm --grow --raid-devices=4 /dev/md125 --backup-file /root/md125.backup

Why were you using --grow for these operations only to reverse it?  This
is dangerous if you have a layer or filesystem on your array that
doesn't support shrinking.  None of the --grow operations were necessary
in this sequence to achieve the end result of replacing disks.

> .... here everything ground to a halt.  The reshape was at 0% and
> there was no disk activity.
> 
> The solution was to edit
> /lib/systemd/system/mdadm-grow-continue@.service to look like this (it
> was important that the backup file was placed in /tmp and not in /root
> or anywhere else.  SELinux allowed mdadm to create a file in /tmp but
> not anywhere else I tried):

I'm not an SELinux guy, so I can't help with the rest, but you should
know that many modern distros delete /tmp on reboot and/or play games
with namespaces to isolate different users' /tmp spaces.

[trim /]

> I did a fail, remove, and
> add on /dev/sdd1  and it very quickly synced and came into service.
> The command "mdadm --detail /dev/md125" now shows a happy raid5 with
> four partitions in it, all "active sync"

These are the only operations you should have done in the first place.
Although I would have put the --add first, so the --fail operation would
have triggered a rebuild onto the spare right away.  At no point should
you have changed the number of raid devices.

And for the still-running but suspect drive, the --replace operation
would have been the right choice, again, after --add of a spare.
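
For example, something like this (a sketch using the device names from
your mail; --replace needs a reasonably recent mdadm and kernel) would
have done the whole job while staying fully redundant:

mdadm /dev/md125 --add /dev/sdf1                        # new disk goes in as a spare
mdadm /dev/md125 --replace /dev/sdd2 --with /dev/sdf1   # copy the suspect member onto it

When the copy finishes, the replaced device is marked faulty and can
then be removed with --remove.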

HTH,

Phil


* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Edward Kuns @ 2015-07-12 19:24 UTC
  To: Phil Turmel; +Cc: linux-raid

On Sun, Jul 12, 2015 at 8:45 AM, Phil Turmel <philip@turmel.org> wrote:
> Why were you using --grow for these operations only to reverse it?  This
> is dangerous if you have a layer or filesystem on your array that
> doesn't support shrinking.  None of the --grow operations were necessary
> in this sequence to achieve the end result of replacing disks.
[snip]
> At no point should you have changed the number of raid devices.
[snip]
> And for the still-running but suspect drive, the --replace operation
> would have been the right choice, again, after --add of a spare.

I didn't mention the steps I did to replace the failed drive because
that went flawlessly.  I did a fail and remove on it to be sure, but
got complaints that it was already failed/removed.  When I did an add
for the replacement drive, it came in and synced automatically.  I
only ran into trouble trying to replace the "not yet dead but suspect"
drive.  I was following examples on the Internet.  The example I was
following was a clearly a bad one.  The examples I found didn't
suggest the --replace option.  This is ultimately my fault for not
being familiar enough with this.  Now I know better.

FWIW, I had LVM on top of the raid5, with two partitions (/var and an
extra storage one) on the LVM.  (I think there is some spare space
too.)  The goal, of course, is being able to survive any single-drive
failure, which I did.

You said this is dangerous.  I went from 4->5 and then immediately
5->4 drives.  I didn't expand the LVM on the raid5, and the
replacement partition was a little bigger than the original.  Next
time, I'll use --replace, obviously.  I just want to understand why it
is dangerous.  As long as the replacement partition is as big as the
one it is replacing, isn't this just extra work, and more chance of
running into problems like the one I ran into?  But other than that,
it shouldn't risk the actual data stored on the RAID, should it?

> many modern distros delete /tmp on reboot and/or play
> games with namespaces to isolate different users' /tmp spaces.

So if the machine crashes during a rebuild, you may lose that backup
file, depending on the distro.  OK.  Is there a better solution to
this?  Unfortunately, at the time of the failed shrink (the reshape
that refused to start), stdout and stderr were not going to
/var/log/messages, so I have no idea what the complaint was at that
time.  Does this service send so much output to stdout/stderr that
it's useful to suppress it?  If I'd seen something in
/var/log/messages, it would have been more clear that there was a
service with a complaint that was the cause of the rebuild failing to
start.  I wouldn't have done as much thrashing trying to figure out
why.

> These are the only operations you should have done in the first place.
> Although I would have put the --add first, so the --fail operation would
> have triggered a rebuild onto the spare right away.

I did the fail/remove/add at the very end, after replacing the dead
drive, after finally completing the "don't do it this way again"
grow-to-5-then-shrink-to-4 process to replace the not-yet-dead drive.
After the shrink finally completed, the new 4th drive showed as a
spare and removed at the same time.  i.e., this dump from my first
EMail:

    Number   Major   Minor   RaidDevice State
       0       8        2        0    active sync   /dev/sda2
       1       8       17        1    active sync   /dev/sdb1
       5       8       33        2    active sync   /dev/sdc1
       6       0        0        6    removed

       6       8       49        -    spare   /dev/sdd1

Doing a fail, then remove, then add on that 4th partition (sdd1)
brought it back and it very quickly synced.  I did a forced fsck on
both partitions to be sure, and both were clean.
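
Spelled out (the LV names are made up, and this assumes ext4
filesystems that are unmounted while being checked):

mdadm /dev/md125 --fail /dev/sdd1
mdadm /dev/md125 --remove /dev/sdd1
mdadm /dev/md125 --add /dev/sdd1
e2fsck -f /dev/VolGroup/var        # -f forces a full check even though the fs is marked clean
e2fsck -f /dev/VolGroup/storage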

       Thanks

              Eddie


* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Phil Turmel @ 2015-07-13 13:54 UTC
  To: Edward Kuns; +Cc: linux-raid, NeilBrown

Hi Eddie,

On 07/12/2015 03:24 PM, Edward Kuns wrote:
> On Sun, Jul 12, 2015 at 8:45 AM, Phil Turmel <philip@turmel.org> wrote:
>> Why were you using --grow for these operations only to reverse it?  This
>> is dangerous if you have a layer or filesystem on your array that
>> doesn't support shrinking.  None of the --grow operations were necessary
>> in this sequence to achieve the end result of replacing disks.
> [snip]
>> At no point should you have changed the number of raid devices.
> [snip]
>> And for the still-running but suspect drive, the --replace operation
>> would have been the right choice, again, after --add of a spare.
> 
> I didn't mention the steps I did to replace the failed drive because
> that went flawlessly.  I did a fail and remove on it to be sure, but
> got complaints that it was already failed/removed.  When I did an add
> for the replacement drive, it came in and synced automatically.  I
> only ran into trouble trying to replace the "not yet dead but suspect"
> drive.  I was following examples on the Internet.  The example I was
> following was clearly a bad one.  The examples I found didn't
> suggest the --replace option.  This is ultimately my fault for not
> being familiar enough with this.  Now I know better.

Even without the --replace operation, --grow should never have been
used.  On older kernels without support for --replace, the correct
operation is --add spare then --fail, --remove.

> FWIW, I had LVM on top of the raid5, with two partitions (/var and an
> extra storage one) on the LVM.  (I think there is some spare space
> too.)  The goal, of course, is being able to survive any single-drive
> failure, which I did.
> 
> You said this is dangerous.  I went from 4->5 and then immediately
> 5->4 drives.  I didn't expand the LVM on the raid5, and the
> replacement partition was a little bigger than the original.  Next
> time, I'll use --replace, obviously.  I just want to understand why it
> is dangerous.  As long as the replacement partition is as big as the
> one it is replacing, isn't this just extra work, and more chance of
> running into problems like the one I ran into?  But other than that,
> it shouldn't risk the actual data stored on the RAID, should it?

In theory, no.  But the --grow operation has to move virtually every
data block to a new location, and in your case, then back to its
original location.  Lots of unnecessary data movement that has a low but
non-zero error-rate.

Also, the complex operations in --grow have produced somewhat more than
their fair share of mdadm bugs.  Stuck reshapes are usually recoverable,
but typically only with assistance from this list.  Drive failures
during reshapes can be particularly sticky, especially when the failure
is of the device holding a critical section backup.

>> many modern distros delete /tmp on reboot and/or play
>> games with namespaces to isolate different users' /tmp spaces.
> 
> So if the machine crashes during a rebuild, you may lose that backup
> file, depending on the distro.  OK.  Is there a better solution to
> this?  Unfortunately, at the time of the failed shrink (the reshape
> that refused to start), stdout and stderr were not going to
> /var/log/messages, so I have no idea what the complaint was at that
> time.  Does this service send so much output to stdout/stderr that
> it's useful to suppress it?  If I'd seen something in
> /var/log/messages, it would have been more clear that there was a
> service with a complaint that was the cause of the rebuild failing to
> start.  I wouldn't have done as much thrashing trying to figure out
> why.

I don't use systemd so can't advise on this.  Without systemd, mdadm
just runs mdmon in the background and it all just works.

>> These are the only operations you should have done in the first place.
>> Although I would have put the --add first, so the --fail operation would
>> have triggered a rebuild onto the spare right away.
> 
> I did the fail/remove/add at the very end, after replacing the dead
> drive, after finally completing the "don't do it this way again"
> grow-to-5-then-shrink-to-4 process to replace the not-yet-dead drive.
> After the shrink finally completed, the new 4th drive showed as a
> spare and removed at the same time.  i.e., this dump from my first

Growing and shrinking didn't do anything to replace your suspect drive.
 It just moved the data blocks around on your other drives, all while
not redundant.

> EMail:
> 
>     Number   Major   Minor   RaidDevice State
>        0       8        2        0    active sync   /dev/sda2
>        1       8       17        1    active sync   /dev/sdb1
>        5       8       33        2    active sync   /dev/sdc1
>        6       0        0        6    removed
> 
>        6       8       49        -    spare   /dev/sdd1

It seems there is a corner case where, at completion of a shrink in
which one device becomes a spare, the new spare doesn't trigger the
recovery code that would pull it into service.

Probably never noticed because reshaping a degraded array is *uncommon*.
 :-)

This one is for Neil, I think...

Phil


* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Wols Lists @ 2015-07-13 18:37 UTC
  To: Edward Kuns, Phil Turmel; +Cc: linux-raid

On 12/07/15 20:24, Edward Kuns wrote:
>> > many modern distros delete /tmp on reboot and/or play
>> > games with namespaces to isolate different users' /tmp spaces.

> So if the machine crashes during a rebuild, you may lose that backup
> file, depending on the distro.  OK. 

Please note that this is the DEFINED behaviour of /tmp, so it has a very
high probability of happening.

If you want temporary data to survive a reboot, put it in /var/tmp.

Oh - and if SELinux only lets you put it in /tmp, what happens if you
don't have a separate /tmp partition? You can't put the backup file on
the partition you are rebuilding, and SELinux won't let you put it
anywhere else? That's a big disaster in the making ...

Cheers,
Wol


* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Edward Kuns @ 2015-07-13 21:38 UTC
  To: Phil Turmel; +Cc: linux-raid, NeilBrown

On Mon, Jul 13, 2015 at 8:54 AM, Phil Turmel <philip@turmel.org> wrote:
> Hi Eddie,
> On older kernels without support for --replace, the correct
> operation is --add spare then --fail, --remove.

Makes sense.  That was my original plan, since I didn't know about the
replace option.  Doing otherwise was a bad decision on my part.

To make sure I understand this: 1) If you start out with a 4-drive
healthy raid5 array and do add / fail / remove, the "fail" step
immediately removes that drive from being an active participant in the
array and causes the new drive to be populated with data recalculated
from parity, right?  2) The new drive will sit in the array as a
"spare" until it is needed, which doesn't happen until the "fail"
step?  And, 3)  The "replace" option, instead, does the logical
equivalent of moving all the data off one drive onto a spare but
doesn't involve the other drives in a parity recalculation?

>> it shouldn't risk the actual data stored on the RAID, should it?
>
> In theory, no.  But the --grow operation has to move virtually every
> data block to a new location, and in your case, then back to its
> original location.  Lots of unnecessary data movement that has a
> low but non-zero error-rate.
>
> Also, the complex operations in --grow have produced somewhat
> more than their fair share of mdadm bugs.  Stuck reshapes are usually
> recoverable, but typically only with assistance from this list.  Drive
> failures during reshapes can be particularly sticky, especially when
> the failure is of the device holding a critical section backup.

That all makes perfect sense, thanks.

> I don't use systemd so can't advise on this.  Without systemd, mdadm
> just runs mdmon in the background and it all just works.

I can't exactly say I use it by choice.  I'd change distros but that
would only delay the inevitable.

> Growing and shrinking didn't do anything to replace your suspect drive.
> It just moved the data blocks around on your other drives, all while
> not redundant.

I'm confused here.  I started the grow 4->5 with a healthy raid5 with
4 drives.  One of the four drives was "suspect" in that I expected it
to fail at some point in the near future -- but it hadn't yet failed.  I
thought this grow would give me a raid with four data drives + one
parity drive, all working.  (And it seemed to.)  And then I could fail
the suspect drive and go back down to three data drives + one parity.
The final output of the shrink certainly agrees with what you say, but
I clearly don't understand it.  I don't understand how going from 4
healthy drives to 5 healthy drives, and then failing and removing one
of them and shrinking back down to 4 drives, ended up with 3 good and
one spare.  But that is what happened.

> It seems there is a corner case where, at completion of a shrink in
> which one device becomes a spare, the new spare doesn't trigger the
> recovery code that would pull it into service.
>
> Probably never noticed because reshaping a degraded array is *uncommon*.
>  :-)

It would be nice if my error in judgement helps save someone else in
the future!  If there is any data I can gather from my server that
will help, I can get it.  Although I won't be reproducing this
experiment any time in the future on a server that has any data I care
about.  But note that I didn't reshape a degraded array.  I reshaped a
healthy array and ended up with a degraded one.

                   Eddie


* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
From: Edward Kuns @ 2015-07-13 22:07 UTC
  To: Wols Lists; +Cc: Phil Turmel, linux-raid

On Mon, Jul 13, 2015 at 1:37 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> Please note that this is the DEFINED behaviour of /tmp, so it has a very
> high probability of happening.
>
> If you want temporary data to survive a reboot, put it in /var/tmp.

OK.  Looking more carefully, I see my "/tmp" partition is of type
tmpfs.  So yes, on reboot it would have been totally clean, exactly as
you say.  In my case, my /var partition was on the raid5 being
reshaped, so /var/tmp wasn't an option for me.
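
For anyone else checking, the quick test is:

findmnt -T /tmp    # FSTYPE of "tmpfs" means RAM-backed, so it starts empty after every boot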

> Oh - and if SELinux only lets you put it in /tmp, what happens if you
> don't have a separate /tmp partition? You can't put the backup file on
> the partition you are rebuilding, and SELinux won't let you put it
> anywhere else? That's a big disaster in the making ...

Well, SELinux will let you put it anywhere that is labeled to allow
it.  So if it needs to go in some folder that isn't labeled properly,
then some labeling needs to be done.  I didn't want to deal with the
(minor) hassle of creating a label, or of having to work out which
label would be appropriate, so I wanted to find a folder that was
already allowed by the existing labeling.  So I tried a bunch of
folders in succession until one worked.

What is the interaction between the backup file I had to specify in
/lib/systemd/system/mdadm-grow-continue@.service and the backup file I
had to specify on the command line to do the --grow to shrink the
array?  It kind of looks like the backup file on the "mdadm" command
line doesn't really matter, except that I had to specify one, because
"mdadm --grow --raid-devices=4 /dev/md125" wouldn't *try* to start
without a backup file specified, but then just crashed in the
mdadm-grow-continue service.  Specifying a (different) backup file
there and restarting the service got the reshape to complete.

This raises a big question with SELinux.  When a backup file is truly
needed, mdadm needs the ability to write the backup file to more than
one partition (not simultaneously, but potentially any of them),
depending on which raid
device is being modified.  This means that some custom labeling may
need to be done either in advance to prepare for recovery or
on-the-fly in a recovery situation.
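
If it ever comes to that, the usual workflow would be something like
the following sketch (module name made up; this assumes auditd logged
the denial):

ausearch -m avc -ts recent -c mdadm                      # find the denial for the backup-file write
ausearch -m avc -ts recent -c mdadm | audit2allow -M mdadm-backup
semodule -i mdadm-backup.pp                              # install the generated local policy module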

             Eddie

