* new OSD re-using old OSD id fails to boot
@ 2015-12-05 16:49 Loic Dachary
2015-12-09 1:13 ` David Zafman
0 siblings, 1 reply; 9+ messages in thread
From: Loic Dachary @ 2015-12-05 16:49 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
Hi Sage,
The problem described at "new OSD re-using old OSD id fails to boot" http://tracker.ceph.com/issues/13988 consistently fails the ceph-disk suite on master. I wonder if it could be a side effect of the recent optimizations introduced in the monitor?
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: new OSD re-using old OSD id fails to boot
2015-12-05 16:49 new OSD re-using old OSD id fails to boot Loic Dachary
@ 2015-12-09 1:13 ` David Zafman
2015-12-09 2:50 ` Sage Weil
2015-12-09 13:06 ` Loic Dachary
0 siblings, 2 replies; 9+ messages in thread
From: David Zafman @ 2015-12-09 1:13 UTC (permalink / raw)
To: Loic Dachary, Sage Weil; +Cc: Ceph Development
Remember, I really think we want a disk replacement feature that would
retain the OSD id so that it avoids unnecessary data movement. See
tracker http://tracker.ceph.com/issues/13732
David
On 12/5/15 8:49 AM, Loic Dachary wrote:
> Hi Sage,
>
> The problem described at "new OSD re-using old OSD id fails to boot" http://tracker.ceph.com/issues/13988 consistently fails the ceph-disk suite on master. I wonder if it could be a side effect of the recent optimizations introduced in the monitor?
>
> Cheers
>
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 1:13 ` David Zafman
@ 2015-12-09 2:50 ` Sage Weil
2015-12-09 10:39 ` Wei-Chung Cheng
2015-12-09 13:06 ` Loic Dachary
1 sibling, 1 reply; 9+ messages in thread
From: Sage Weil @ 2015-12-09 2:50 UTC (permalink / raw)
To: David Zafman; +Cc: Loic Dachary, Ceph Development
On Tue, 8 Dec 2015, David Zafman wrote:
> Remember I really think we want a disk replacement feature that would retain
> the OSD id so that it avoids unnecessary data movement. See tracker
> http://tracker.ceph.com/issues/13732
Yeah, I totally agree. We just need to form an opinion on how... probably
starting with the user experience. Ideally we'd go from up + in to down +
in to down + out, then pull the drive and replace, and then initialize a
new OSD with the same id... and journal partition. Something like
ceph-disk recreate id=N uuid=U <osd device path>
I.e., it could use the uuid (which the cluster has in the OSDMap) to find
(and re-use) the journal device.
For a journal failure it'd probably be different... but maybe not?
Any other ideas?
sage
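The motivation for reusing the id can be illustrated with a toy placement simulation. This is only a hash-ranking stand-in for CRUSH, with made-up PG counts, not Ceph code: replacing a disk under the same OSD id leaves the computed mapping untouched, while recreating the OSD under a fresh id remaps a large fraction of PGs.

```python
import hashlib

def place(pg, osds, replicas=2):
    """Rank OSDs by a hash of (pg, osd id) -- a toy stand-in for CRUSH."""
    ranked = sorted(osds, key=lambda o: hashlib.sha256(f"{pg}:{o}".encode()).hexdigest())
    return ranked[:replicas]

osds = [0, 1, 2, 3]
before = {pg: place(pg, osds) for pg in range(100)}

# Replace osd.2's disk but keep the id: the mapping is computed from the
# same (pg, id) pairs, so nothing moves.
after_same_id = {pg: place(pg, osds) for pg in range(100)}
moved_same = sum(before[pg] != after_same_id[pg] for pg in range(100))

# Recreate the OSD under a fresh id (2 -> 4): many PGs remap.
osds_new = [0, 1, 3, 4]
after_new_id = {pg: place(pg, osds_new) for pg in range(100)}
moved_new = sum(before[pg] != after_new_id[pg] for pg in range(100))

print(f"same id: {moved_same} PGs moved, new id: {moved_new} PGs moved")
```

With the same id, zero PGs move by construction; with a new id, every PG that ranked the old or new id highly gets remapped, which is the double movement the `ceph-disk recreate` idea avoids.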
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 2:50 ` Sage Weil
@ 2015-12-09 10:39 ` Wei-Chung Cheng
2015-12-09 13:08 ` Loic Dachary
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Wei-Chung Cheng @ 2015-12-09 10:39 UTC (permalink / raw)
To: Sage Weil; +Cc: David Zafman, Loic Dachary, Ceph Development
Hi Loic,
I tried to reproduce this problem on my CentOS 7 machine, but I cannot
hit the same issue.
This is my version:
ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
Could you describe it in more detail?
Hi David, Sage,
Most of the time, by the time we notice an OSD failure, the OSD is
already in the `out` state.
We cannot avoid the redundant data movement unless we set noout when
the failure occurs.
Is that right? (That is, once an OSD goes into the `out` state, it
triggers some redundant data movement.)
Could we try the traditional hot-spare behavior? (Keep some disks as
spares and automatically replace a broken device.)
That would let us replace the failed OSD before it goes into the `out` state.
Or should we always set noout?
In fact, I think David and Loic are describing two different problems.
(Though the two problems are equally important :p)
If you have any questions, feel free to let me know.
Thanks!
vicente
2015-12-09 10:50 GMT+08:00 Sage Weil <sweil@redhat.com>:
> On Tue, 8 Dec 2015, David Zafman wrote:
>> Remember I really think we want a disk replacement feature that would retain
>> the OSD id so that it avoids unnecessary data movement. See tracker
>> http://tracker.ceph.com/issues/13732
>
> Yeah, I totally agree. We just need to form an opinion on how... probably
> starting with the user experience. Ideally we'd go from up + in to down +
> in to down + out, then pull the drive and replace, and then initialize a
> new OSD with the same id... and journal partition. Something like
>
> ceph-disk recreate id=N uuid=U <osd device path>
>
> I.e., it could use the uuid (which the cluster has in the OSDMap) to find
> (and re-use) the journal device.
>
> For a journal failure it'd probably be different... but maybe not?
>
> Any other ideas?
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 1:13 ` David Zafman
2015-12-09 2:50 ` Sage Weil
@ 2015-12-09 13:06 ` Loic Dachary
1 sibling, 0 replies; 9+ messages in thread
From: Loic Dachary @ 2015-12-09 13:06 UTC (permalink / raw)
To: David Zafman, Sage Weil; +Cc: Ceph Development
On 09/12/2015 02:13, David Zafman wrote:
>
> Remember I really think we want a disk replacement feature that would retain the OSD id so that it avoids unnecessary data movement. See tracker http://tracker.ceph.com/issues/13732
I remember this, yes, it is a good idea :-) Would that help fix http://tracker.ceph.com/issues/13988 ?
> David
>
> On 12/5/15 8:49 AM, Loic Dachary wrote:
>> Hi Sage,
>>
>> The problem described at "new OSD re-using old OSD id fails to boot" http://tracker.ceph.com/issues/13988 consistently fails the ceph-disk suite on master. I wonder if it could be a side effect of the recent optimizations introduced in the monitor?
>>
>> Cheers
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 10:39 ` Wei-Chung Cheng
@ 2015-12-09 13:08 ` Loic Dachary
2015-12-09 14:00 ` Sage Weil
2015-12-09 19:55 ` David Zafman
2 siblings, 0 replies; 9+ messages in thread
From: Loic Dachary @ 2015-12-09 13:08 UTC (permalink / raw)
To: Wei-Chung Cheng; +Cc: Ceph Development
On 09/12/2015 11:39, Wei-Chung Cheng wrote:
> Hi Loic,
>
> I try to reproduce this problem on my CentOS7.
> I can not do the same issue.
> This is my version:
> ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> Would you describe more detail?
I reproduced the problem yesterday once more by running the ceph-disk suite. It happens every time.
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 10:39 ` Wei-Chung Cheng
2015-12-09 13:08 ` Loic Dachary
@ 2015-12-09 14:00 ` Sage Weil
2015-12-09 19:55 ` David Zafman
2 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2015-12-09 14:00 UTC (permalink / raw)
To: Wei-Chung Cheng; +Cc: David Zafman, Loic Dachary, Ceph Development
On Wed, 9 Dec 2015, Wei-Chung Cheng wrote:
> Hi Loic,
>
> I tried to reproduce this problem on my CentOS 7 machine, but I cannot
> hit the same issue.
> This is my version:
> ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> Could you describe it in more detail?
>
>
> Hi David, Sage,
>
> Most of the time, by the time we notice an OSD failure, the OSD is
> already in the `out` state.
> We cannot avoid the redundant data movement unless we set noout when
> the failure occurs.
> Is that right? (That is, once an OSD goes into the `out` state, it
> triggers some redundant data movement.)
>
> Could we try the traditional hot-spare behavior? (Keep some disks as
> spares and automatically replace a broken device.)
>
> That would let us replace the failed OSD before it goes into the `out` state.
> Or should we always set noout?
I don't think there is a problem with 'out' if the osd id is reused and
the crush position remains the same. And I expect the OSD will usually be
replaced by a disk of a similar size. If the replacement is smaller (or
size zero, i.e. removed entirely) then you get double movement, but if
it's the same size or larger I think it's fine.
The sequence would be something like
up + in
down + in
5-10 minutes go by
down + out (marked out by monitor)
new replicas uniformly distributed across cluster
days go by
disk removed
new disk inserted
ceph-disk recreate ... recreates osd dir w/ the same id, new uuid
on startup, osd adjusts crush weight (maybe.. usually by a smallish amount)
up + in
replicas migrate back to new device
sage
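The down/out transitions above can be sketched as a tiny state function. This is a simplification, not monitor code: the 300-second constant stands in for the 5-10 minute grace window mentioned above, and the real mon_osd_down_out_interval default and monitor logic may differ.

```python
DOWN_OUT_INTERVAL = 300  # seconds; stand-in for mon_osd_down_out_interval

def osd_state(down_at, now, noout=False):
    """Return the (up/down, in/out) flags for an OSD that fails at time down_at."""
    if now < down_at:
        return ("up", "in")
    if noout or now - down_at < DOWN_OUT_INTERVAL:
        return ("down", "in")   # grace period: no re-replication yet
    return ("down", "out")      # marked out by the monitor; recovery starts

print(osd_state(100, 50))               # before the failure
print(osd_state(100, 200))              # within the grace window
print(osd_state(100, 500))              # past the grace window
print(osd_state(100, 500, noout=True))  # noout pins the OSD "in"
```

The noout branch is what makes the quick-swap scenario discussed in this thread possible: the OSD stays "in", so no replicas are re-created elsewhere while the drive is being replaced.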
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 10:39 ` Wei-Chung Cheng
2015-12-09 13:08 ` Loic Dachary
2015-12-09 14:00 ` Sage Weil
@ 2015-12-09 19:55 ` David Zafman
2015-12-09 20:08 ` Sage Weil
2 siblings, 1 reply; 9+ messages in thread
From: David Zafman @ 2015-12-09 19:55 UTC (permalink / raw)
To: Wei-Chung Cheng, Sage Weil; +Cc: Loic Dachary, Ceph Development
On 12/9/15 2:39 AM, Wei-Chung Cheng wrote:
> Hi Loic,
>
> I tried to reproduce this problem on my CentOS 7 machine, but I cannot
> hit the same issue.
> This is my version:
> ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> Could you describe it in more detail?
>
>
> Hi David, Sage,
>
> Most of the time, by the time we notice an OSD failure, the OSD is
> already in the `out` state.
> We cannot avoid the redundant data movement unless we set noout when
> the failure occurs.
> Is that right? (That is, once an OSD goes into the `out` state, it
> triggers some redundant data movement.)
Yes, one case would be that during the 5-minute down window of an OSD
disk failure, the noout flag can be set if a spare disk is available.
Another scenario would be a bad SMART status, or noticing EIO errors from
a disk, prompting a replacement. So if a spare disk is already installed
or you have hot-swappable drives, it would be nice to replace the drive
and let recovery copy back all the data that should be there. Using
noout would be critical to this effort.
I don't understand why Sage suggests below that a down+out phase would
be required during the replacement.
>
> Could we try the traditional hot-spare behavior? (Keep some disks as
> spares and automatically replace a broken device.)
>
> That would let us replace the failed OSD before it goes into the `out` state.
> Or should we always set noout?
>
> In fact, I think David and Loic are describing two different problems.
> (Though the two problems are equally important :p)
>
> If you have any questions, feel free to let me know.
>
> Thanks!
> vicente
>
>
> 2015-12-09 10:50 GMT+08:00 Sage Weil <sweil@redhat.com>:
>> On Tue, 8 Dec 2015, David Zafman wrote:
>>> Remember I really think we want a disk replacement feature that would retain
>>> the OSD id so that it avoids unnecessary data movement. See tracker
>>> http://tracker.ceph.com/issues/13732
>> Yeah, I totally agree. We just need to form an opinion on how... probably
>> starting with the user experience. Ideally we'd go from up + in to down +
>> in to down + out, then pull the drive and replace, and then initialize a
Here ^^^^^^^^^^^^
>> new OSD with the same id... and journal partition. Something like
>>
>> ceph-disk recreate id=N uuid=U <osd device path>
>>
>> I.e., it could use the uuid (which the cluster has in the OSDMap) to find
>> (and re-use) the journal device.
>>
>> For a journal failure it'd probably be different... but maybe not?
>>
>> Any other ideas?
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: new OSD re-using old OSD id fails to boot
2015-12-09 19:55 ` David Zafman
@ 2015-12-09 20:08 ` Sage Weil
0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2015-12-09 20:08 UTC (permalink / raw)
To: David Zafman; +Cc: Wei-Chung Cheng, Loic Dachary, Ceph Development
On Wed, 9 Dec 2015, David Zafman wrote:
> On 12/9/15 2:39 AM, Wei-Chung Cheng wrote:
> > Hi Loic,
> >
> > I tried to reproduce this problem on my CentOS 7 machine, but I cannot
> > hit the same issue.
> > This is my version:
> > ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> > Could you describe it in more detail?
> >
> >
> > Hi David, Sage,
> >
> > Most of the time, by the time we notice an OSD failure, the OSD is
> > already in the `out` state.
> > We cannot avoid the redundant data movement unless we set noout when
> > the failure occurs.
> > Is that right? (That is, once an OSD goes into the `out` state, it
> > triggers some redundant data movement.)
> Yes, one case would be that during the 5-minute down window of an OSD disk
> failure, the noout flag can be set if a spare disk is available. Another
> scenario would be a bad SMART status, or noticing EIO errors from a disk,
> prompting a replacement. So if a spare disk is already installed or you have
> hot-swappable drives, it would be nice to replace the drive and let recovery
> copy back all the data that should be there. Using noout would be critical to
> this effort.
>
> I don't understand why Sage suggests below that a down+out phase would be
> required during the replacement.
Hmm, I wasn't thinking about a hot spare scenario. We've always assumed
that there is no point to hot spares--you may as well have them
participating in the cluster, doing useful work, and let the failure
rebalance distributed across all disks (and not hammer the replacement).
sage
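Sage's point about not hammering the replacement can be made concrete with back-of-the-envelope arithmetic. All of the numbers below are assumptions for illustration, not measurements: with a dedicated hot spare, every recovered replica funnels through one disk, while declustered recovery writes in parallel across all survivors.

```python
# All numbers are illustrative assumptions, not measurements.
total_gb = 4000    # data held by the failed disk
disks = 100        # surviving disks that can absorb recovery writes
disk_mb_s = 150    # per-disk sequential write throughput

# Hot spare: every recovered replica is written to one replacement disk.
spare_hours = total_gb * 1024 / disk_mb_s / 3600

# Declustered: replicas are re-created in parallel across all survivors.
declustered_hours = total_gb * 1024 / (disk_mb_s * disks) / 3600

print(f"hot spare: {spare_hours:.1f} h, declustered: {declustered_hours:.2f} h")
```

The speedup is simply the number of participating disks, which is why letting spares do useful work in the cluster and spreading the rebuild is preferred over a dedicated hot spare.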
> >
> > Could we try the traditional hot-spare behavior? (Keep some disks as
> > spares and automatically replace a broken device.)
> >
> > That would let us replace the failed OSD before it goes into the `out` state.
> > Or should we always set noout?
> >
> > In fact, I think David and Loic are describing two different problems.
> > (Though the two problems are equally important :p)
> >
> > If you have any questions, feel free to let me know.
> >
> > Thanks!
> > vicente
> >
> >
> > 2015-12-09 10:50 GMT+08:00 Sage Weil <sweil@redhat.com>:
> > > On Tue, 8 Dec 2015, David Zafman wrote:
> > > > Remember I really think we want a disk replacement feature that would
> > > > retain
> > > > the OSD id so that it avoids unnecessary data movement. See tracker
> > > > http://tracker.ceph.com/issues/13732
> > > Yeah, I totally agree. We just need to form an opinion on how... probably
> > > starting with the user experience. Ideally we'd go from up + in to down +
> > > in to down + out, then pull the drive and replace, and then initialize a
> Here ^^^^^^^^^^^^
> > > new OSD with the same id... and journal partition. Something like
> > >
> > > ceph-disk recreate id=N uuid=U <osd device path>
> > >
> > > I.e., it could use the uuid (which the cluster has in the OSDMap) to find
> > > (and re-use) the journal device.
> > >
> > > For a journal failure it'd probably be different... but maybe not?
> > >
> > > Any other ideas?
> > >
> > > sage
Thread overview: 9+ messages
2015-12-05 16:49 new OSD re-using old OSD id fails to boot Loic Dachary
2015-12-09 1:13 ` David Zafman
2015-12-09 2:50 ` Sage Weil
2015-12-09 10:39 ` Wei-Chung Cheng
2015-12-09 13:08 ` Loic Dachary
2015-12-09 14:00 ` Sage Weil
2015-12-09 19:55 ` David Zafman
2015-12-09 20:08 ` Sage Weil
2015-12-09 13:06 ` Loic Dachary