LKML Archive mirror
* Crashes with 874bbfe600a6 in 3.18.25
@ 2016-01-20 21:19 Jan Kara
  2016-01-20 21:39 ` Shaohua Li
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Kara @ 2016-01-20 21:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: LKML, stable, Tejun Heo, Daniel Bilik, Sasha Levin


Hello,

a friend of mine started seeing crashes with the 3.18.25 kernel - once
an appropriate load is put on the machine it crashes within minutes. He
tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
Linus' tree; in the stable tree the commit ID is 1e7af294dd03) "workqueue: make
sure delayed work run in local cpu" makes the kernel stable again. I'm
attaching a screenshot of the crash - sadly the initial part is missing, but
it seems that we crashed while processing timers on an otherwise idle CPU. This
is a production machine so experimentation is not easy, but if we really
need more information it may be possible to reproduce the issue again and
gather it.

Does anyone have an idea what is going on? I was looking into the code for a
while but so far I have no good explanation.  It would be good to understand
the cause instead of just blindly reverting the commit from the stable tree...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: delayed-work-oops.png --]
[-- Type: image/png, Size: 23695 bytes --]


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-20 21:19 Crashes with 874bbfe600a6 in 3.18.25 Jan Kara
@ 2016-01-20 21:39 ` Shaohua Li
  2016-01-21  9:52   ` Jan Kara
  0 siblings, 1 reply; 57+ messages in thread
From: Shaohua Li @ 2016-01-20 21:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, stable, Tejun Heo, Daniel Bilik, Sasha Levin

On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> Hello,
> 
> a friend of mine started seeing crashes with 3.18.25 kernel - once
> appropriate load is put on the machine it crashes within minutes. He
> tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
> Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
> sure delayed work run in local cpu" makes the kernel stable again. I'm
> attaching screenshot of the crash - sadly the initial part is missing but
> it seems that we crashed when processing timers on otherwise idle CPU. This
> is a production machine so experimentation is not easy but if we really
> need more information it may be possible to reproduce the issue again and
> gather it.
> 
> Anyone has idea what is going on? I was looking into the code for a while
> but so far I have no good explanation.  It would be good to understand the
> cause instead of just blindly reverting the commit from stable tree...

Tejun fixed a bug in the timer code: 22b886dd10180939. Is it included in 3.18.25?

Thanks,
Shaohua


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-20 21:39 ` Shaohua Li
@ 2016-01-21  9:52   ` Jan Kara
  2016-01-21 13:29     ` Sasha Levin
  2016-01-22  1:10     ` Sasha Levin
  0 siblings, 2 replies; 57+ messages in thread
From: Jan Kara @ 2016-01-21  9:52 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jan Kara, LKML, stable, Tejun Heo, Daniel Bilik, Sasha Levin

On Wed 20-01-16 13:39:01, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > a friend of mine started seeing crashes with 3.18.25 kernel - once
> > appropriate load is put on the machine it crashes within minutes. He
> > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
> > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
> > sure delayed work run in local cpu" makes the kernel stable again. I'm
> > attaching screenshot of the crash - sadly the initial part is missing but
> > it seems that we crashed when processing timers on otherwise idle CPU. This
> > is a production machine so experimentation is not easy but if we really
> > need more information it may be possible to reproduce the issue again and
> > gather it.
> > 
> > Anyone has idea what is going on? I was looking into the code for a while
> > but so far I have no good explanation.  It would be good to understand the
> > cause instead of just blindly reverting the commit from stable tree...
> 
> Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25?

That doesn't seem to be included in 3.18-stable although it was CCed to stable.
Sasha?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-21  9:52   ` Jan Kara
@ 2016-01-21 13:29     ` Sasha Levin
  2016-01-22  1:10     ` Sasha Levin
  1 sibling, 0 replies; 57+ messages in thread
From: Sasha Levin @ 2016-01-21 13:29 UTC (permalink / raw)
  To: Jan Kara, Shaohua Li; +Cc: LKML, stable, Tejun Heo, Daniel Bilik

On 01/21/2016 04:52 AM, Jan Kara wrote:
> On Wed 20-01-16 13:39:01, Shaohua Li wrote:
>> > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
>>> > > Hello,
>>> > > 
>>> > > a friend of mine started seeing crashes with 3.18.25 kernel - once
>>> > > appropriate load is put on the machine it crashes within minutes. He
>>> > > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
>>> > > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
>>> > > sure delayed work run in local cpu" makes the kernel stable again. I'm
>>> > > attaching screenshot of the crash - sadly the initial part is missing but
>>> > > it seems that we crashed when processing timers on otherwise idle CPU. This
>>> > > is a production machine so experimentation is not easy but if we really
>>> > > need more information it may be possible to reproduce the issue again and
>>> > > gather it.
>>> > > 
>>> > > Anyone has idea what is going on? I was looking into the code for a while
>>> > > but so far I have no good explanation.  It would be good to understand the
>>> > > cause instead of just blindly reverting the commit from stable tree...
>> > 
>> > Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25?
> That doesn't seem to be included in 3.18-stable although it was CCed to stable.
> Sasha?

Yup, that's not in. I'll include it and release it along with 3.18.26 on the weekend.


Thanks,
Sasha


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-21  9:52   ` Jan Kara
  2016-01-21 13:29     ` Sasha Levin
@ 2016-01-22  1:10     ` Sasha Levin
  2016-01-22 16:09       ` Tejun Heo
  1 sibling, 1 reply; 57+ messages in thread
From: Sasha Levin @ 2016-01-22  1:10 UTC (permalink / raw)
  To: Jan Kara, Shaohua Li, Tejun Heo; +Cc: LKML, stable, Tejun Heo, Daniel Bilik

On 01/21/2016 04:52 AM, Jan Kara wrote:
> On Wed 20-01-16 13:39:01, Shaohua Li wrote:
>> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
>>> Hello,
>>>
>>> a friend of mine started seeing crashes with 3.18.25 kernel - once
>>> appropriate load is put on the machine it crashes within minutes. He
>>> tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
>>> Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
>>> sure delayed work run in local cpu" makes the kernel stable again. I'm
>>> attaching screenshot of the crash - sadly the initial part is missing but
>>> it seems that we crashed when processing timers on otherwise idle CPU. This
>>> is a production machine so experimentation is not easy but if we really
>>> need more information it may be possible to reproduce the issue again and
>>> gather it.
>>>
>>> Anyone has idea what is going on? I was looking into the code for a while
>>> but so far I have no good explanation.  It would be good to understand the
>>> cause instead of just blindly reverting the commit from stable tree...
>>
>> Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25?
> 
> That doesn't seem to be included in 3.18-stable although it was CCed to stable.
> Sasha?

Looks like it requires more than a trivial backport (I think). Tejun?


Thanks,
Sasha


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-22  1:10     ` Sasha Levin
@ 2016-01-22 16:09       ` Tejun Heo
  2016-01-23  2:20         ` Ben Hutchings
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-01-22 16:09 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Jan Kara, Shaohua Li, LKML, stable, Daniel Bilik, Thomas Gleixner

(cc'ing Thomas)

On Thu, Jan 21, 2016 at 08:10:20PM -0500, Sasha Levin wrote:
> On 01/21/2016 04:52 AM, Jan Kara wrote:
> > On Wed 20-01-16 13:39:01, Shaohua Li wrote:
> >> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> >>> Hello,
> >>>
> >>> a friend of mine started seeing crashes with 3.18.25 kernel - once
> >>> appropriate load is put on the machine it crashes within minutes. He
> >>> tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
> >>> Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
> >>> sure delayed work run in local cpu" makes the kernel stable again. I'm
> >>> attaching screenshot of the crash - sadly the initial part is missing but
> >>> it seems that we crashed when processing timers on otherwise idle CPU. This
> >>> is a production machine so experimentation is not easy but if we really
> >>> need more information it may be possible to reproduce the issue again and
> >>> gather it.
> >>>
> >>> Anyone has idea what is going on? I was looking into the code for a while
> >>> but so far I have no good explanation.  It would be good to understand the
> >>> cause instead of just blindly reverting the commit from stable tree...
> >>
> >> Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25?
> > 
> > That doesn't seem to be included in 3.18-stable although it was CCed to stable.
> > Sasha?
> 
> Looks like it requires more than trivial backport (I think). Tejun?

The timer migration has changed quite a bit.  Given that we've never
seen the vmstat work crashing in the 3.18 era, I wonder whether the right
thing to do here is to revert 874bbfe600a6 from 3.18-stable?

Thanks.

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-22 16:09       ` Tejun Heo
@ 2016-01-23  2:20         ` Ben Hutchings
  2016-01-23 16:11           ` Thomas Gleixner
  0 siblings, 1 reply; 57+ messages in thread
From: Ben Hutchings @ 2016-01-23  2:20 UTC (permalink / raw)
  To: Tejun Heo, Sasha Levin
  Cc: Jan Kara, Shaohua Li, LKML, stable, Daniel Bilik, Thomas Gleixner


On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> (cc'ing Thomas)
> 
> On Thu, Jan 21, 2016 at 08:10:20PM -0500, Sasha Levin wrote:
> > On 01/21/2016 04:52 AM, Jan Kara wrote:
> > > On Wed 20-01-16 13:39:01, Shaohua Li wrote:
> > > > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > a friend of mine started seeing crashes with 3.18.25 kernel - once
> > > > > appropriate load is put on the machine it crashes within minutes. He
> > > > > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
> > > > > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
> > > > > sure delayed work run in local cpu" makes the kernel stable again. I'm
> > > > > attaching screenshot of the crash - sadly the initial part is missing but
> > > > > it seems that we crashed when processing timers on otherwise idle CPU. This
> > > > > is a production machine so experimentation is not easy but if we really
> > > > > need more information it may be possible to reproduce the issue again and
> > > > > gather it.
> > > > > 
> > > > > Anyone has idea what is going on? I was looking into the code for a while
> > > > > but so far I have no good explanation.  It would be good to understand the
> > > > > cause instead of just blindly reverting the commit from stable tree...
> > > > 
> > > > Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25?
> > > 
> > > That doesn't seem to be included in 3.18-stable although it was CCed to stable.
> > > Sasha?
> > 
> > Looks like it requires more than trivial backport (I think). Tejun?
> 
> The timer migration has changed quite a bit.  Given that we've never
> seen vmstat work crashing in 3.18 era, I wonder whether the right
> thing to do here is reverting 874bbfe600a6 from 3.18 stable?

It's not just 3.18 that has this; 874bbfe600a6 was backported to all
stable branches from 3.10 onward.  Only the 4.2-ckt branch has
22b886dd10180939.

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans.
                                                               - John Lennon



* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-23  2:20         ` Ben Hutchings
@ 2016-01-23 16:11           ` Thomas Gleixner
  2016-01-26  9:34             ` Jan Kara
  0 siblings, 1 reply; 57+ messages in thread
From: Thomas Gleixner @ 2016-01-23 16:11 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Tejun Heo, Sasha Levin, Jan Kara, Shaohua Li, LKML, stable,
	Daniel Bilik


On Sat, 23 Jan 2016, Ben Hutchings wrote:
> On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > Looks like it requires more than trivial backport (I think). Tejun?
> > 
> > The timer migration has changed quite a bit.  Given that we've never
> > seen vmstat work crashing in 3.18 era, I wonder whether the right
> > thing to do here is reverting 874bbfe600a6 from 3.18 stable?
> 
> It's not just 3.18 that has this; 874bbfe600a6 was backported to all
> stable branches from 3.10 onward.  Only the 4.2-ckt branch has
> 22b886dd10180939.

22b886dd10180939 fixes a bug which was introduced with the timer wheel
overhaul in 4.2. So only 4.2/3 should have it backported.

Thanks,

	tglx


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-23 16:11           ` Thomas Gleixner
@ 2016-01-26  9:34             ` Jan Kara
  2016-01-26  9:49               ` Thomas Gleixner
  2016-01-26 11:14               ` Petr Mladek
  0 siblings, 2 replies; 57+ messages in thread
From: Jan Kara @ 2016-01-26  9:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ben Hutchings, Tejun Heo, Sasha Levin, Jan Kara, Shaohua Li, LKML,
	stable, Daniel Bilik

On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > Looks like it requires more than trivial backport (I think). Tejun?
> > > 
> > > The timer migration has changed quite a bit.  Given that we've never
> > > seen vmstat work crashing in 3.18 era, I wonder whether the right
> > > thing to do here is reverting 874bbfe600a6 from 3.18 stable?
> > 
> > It's not just 3.18 that has this; 874bbfe600a6 was backported to all
> > stable branches from 3.10 onward.  Only the 4.2-ckt branch has
> > 22b886dd10180939.
> 
> 22b886dd10180939 fixes a bug which was introduced with the timer wheel
> overhaul in 4.2. So only 4.2/3 should have it backported.

Thanks for the explanation. So do I understand correctly that timers always
run on the calling CPU in kernels prior to 4.2, and thus commit 874bbfe600a6
(to run the timer for delayed work on the calling CPU) doesn't make sense
there? If that is true, then reverting the commit from older stable kernels
is probably the easiest way to resolve the crashes.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-26  9:34             ` Jan Kara
@ 2016-01-26  9:49               ` Thomas Gleixner
  2016-01-26 11:14               ` Petr Mladek
  1 sibling, 0 replies; 57+ messages in thread
From: Thomas Gleixner @ 2016-01-26  9:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik


On Tue, 26 Jan 2016, Jan Kara wrote:
> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> > On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > > Looks like it requires more than trivial backport (I think). Tejun?
> > > > 
> > > > The timer migration has changed quite a bit.  Given that we've never
> > > > seen vmstat work crashing in 3.18 era, I wonder whether the right
> > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable?
> > > 
> > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all
> > > stable branches from 3.10 onward.  Only the 4.2-ckt branch has
> > > 22b886dd10180939.
> > 
> > 22b886dd10180939 fixes a bug which was introduced with the timer wheel
> > overhaul in 4.2. So only 4.2/3 should have it backported.
> 
> Thanks for explanation. So do I understand right that timers are always run
> on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to
> run timer for delayed work on the calling CPU) doesn't make sense there? If
> that is true than reverting the commit from older stable kernels is
> probably the easiest way to resolve the crashes.

I was merely referring to 22b886dd10180939, which is a bug fix for things we
reworked in the timer wheel core code in 4.2. It's completely unrelated to the
problem at hand.

Non-pinned timers can be migrated due to power-saving decisions since
2.6.36. What changed over time is how the decision is made, but the general
principle still applies.

The timer code was completely unchanged between 3.18 and 4.0, and even with
the larger overhaul in 4.2 we did not change the migration logic. We merely
changed the internal implementation of the timer wheel.

I have no idea how 874bbfe600a6 can result in crashes on older kernels. Can
you ask the reporter to enable DEBUG_OBJECTS so we might get an idea of what
goes wrong with that timer?
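
For reference, the relevant debugobjects options are the following (assuming
the reporter can rebuild the kernel; CONFIG_DEBUG_OBJECTS_WORK additionally
tracks the work item itself):

	CONFIG_DEBUG_OBJECTS=y
	CONFIG_DEBUG_OBJECTS_TIMERS=y
	CONFIG_DEBUG_OBJECTS_WORK=y

With those enabled, debugobjects warns when a timer or work item is used in
an unexpected state (e.g. activated while already active, or freed while
still pending), which is the kind of hint we are missing here.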

Thanks,

	tglx


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-26  9:34             ` Jan Kara
  2016-01-26  9:49               ` Thomas Gleixner
@ 2016-01-26 11:14               ` Petr Mladek
  2016-01-26 13:09                 ` Thomas Gleixner
  1 sibling, 1 reply; 57+ messages in thread
From: Petr Mladek @ 2016-01-26 11:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Thomas Gleixner, Ben Hutchings, Tejun Heo, Sasha Levin,
	Shaohua Li, LKML, stable, Daniel Bilik

On Tue 2016-01-26 10:34:00, Jan Kara wrote:
> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> > On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > > Looks like it requires more than trivial backport (I think). Tejun?
> > > > 
> > > > The timer migration has changed quite a bit.  Given that we've never
> > > > seen vmstat work crashing in 3.18 era, I wonder whether the right
> > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable?
> > > 
> > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all
> > > stable branches from 3.10 onward.  Only the 4.2-ckt branch has
> > > 22b886dd10180939.
> > 
> > 22b886dd10180939 fixes a bug which was introduced with the timer wheel
> > overhaul in 4.2. So only 4.2/3 should have it backported.
> 
> Thanks for explanation. So do I understand right that timers are always run
> on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to
> run timer for delayed work on the calling CPU) doesn't make sense there? If
> that is true than reverting the commit from older stable kernels is
> probably the easiest way to resolve the crashes.

The commit 874bbfe600a6 ("workqueue: make sure delayed work run in
local cpu") forces the timer to run on the local CPU. It might be correct
for vmstat. But I wonder if it might break some other delayed work
user that depends on running on a different CPU.

Best Regards,
Petr


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-26 11:14               ` Petr Mladek
@ 2016-01-26 13:09                 ` Thomas Gleixner
  2016-02-03  9:35                   ` Jiri Slaby
  0 siblings, 1 reply; 57+ messages in thread
From: Thomas Gleixner @ 2016-01-26 13:09 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML,
	stable, Daniel Bilik


On Tue, 26 Jan 2016, Petr Mladek wrote:
> On Tue 2016-01-26 10:34:00, Jan Kara wrote:
> > On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> > > On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > > > Looks like it requires more than trivial backport (I think). Tejun?
> > > > > 
> > > > > The timer migration has changed quite a bit.  Given that we've never
> > > > > seen vmstat work crashing in 3.18 era, I wonder whether the right
> > > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable?
> > > > 
> > > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all
> > > > stable branches from 3.10 onward.  Only the 4.2-ckt branch has
> > > > 22b886dd10180939.
> > > 
> > > 22b886dd10180939 fixes a bug which was introduced with the timer wheel
> > > overhaul in 4.2. So only 4.2/3 should have it backported.
> > 
> > Thanks for explanation. So do I understand right that timers are always run
> > on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to
> > run timer for delayed work on the calling CPU) doesn't make sense there? If
> > that is true than reverting the commit from older stable kernels is
> > probably the easiest way to resolve the crashes.
> 
> The commit 874bbfe600a6 ("workqueue: make sure delayed work run in
> local cpu") forces the timer to run on the local CPU. It might be correct
> for vmstat. But I wonder if it might break some other delayed work
> user that depends on running on different CPU.

The default of add_timer() is to run on the current cpu. It only moves the
timer to a different cpu when the power saving code says so. So 874bbfe600a6
enforces that the timer runs on the cpu on which queue_delayed_work() is
called, but before that commit it was likely that the timer was queued on the
calling cpu anyway. So there is nothing which can depend on running on a
different CPU, except callers of queue_delayed_work_on() which provide the
target cpu explicitly. 874bbfe600a6 does not affect those callers at all.

Now, what's different is:

+       if (cpu == WORK_CPU_UNBOUND)
+               cpu = raw_smp_processor_id();
        dwork->cpu = cpu;

So before that change dwork->cpu was set to WORK_CPU_UNBOUND. Now it's set to
the current cpu, but I can't see how that matters.
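
For readers without the tree at hand, the tail of __queue_delayed_work()
before 874bbfe600a6 looked roughly like this (paraphrased, not a verbatim
quote of the old source):

	dwork->wq = wq;
	dwork->cpu = cpu;		/* WORK_CPU_UNBOUND for queue_delayed_work() */
	timer->expires = jiffies + delay;

	if (unlikely(cpu != WORK_CPU_UNBOUND))
		add_timer_on(timer, cpu);	/* caller asked for a specific cpu */
	else
		add_timer(timer);		/* migratable, normally fires locally */

After the commit the add_timer() branch is gone: the timer is always pinned
with add_timer_on(), and dwork->cpu records a concrete cpu number even for
WORK_CPU_UNBOUND callers, which is what the hunk above shows.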

Thanks,

	tglx


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-01-26 13:09                 ` Thomas Gleixner
@ 2016-02-03  9:35                   ` Jiri Slaby
  2016-02-03 10:41                     ` Thomas Gleixner
  2016-02-03 12:28                     ` Michal Hocko
  0 siblings, 2 replies; 57+ messages in thread
From: Jiri Slaby @ 2016-02-03  9:35 UTC (permalink / raw)
  To: Thomas Gleixner, Petr Mladek
  Cc: Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML,
	stable, Daniel Bilik

On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> On Tue, 26 Jan 2016, Petr Mladek wrote:
>> On Tue 2016-01-26 10:34:00, Jan Kara wrote:
>>> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
>>>> On Sat, 23 Jan 2016, Ben Hutchings wrote:
>>>>> On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
>>>>>>> Looks like it requires more than trivial backport (I think). Tejun?
>>>>>>
>>>>>> The timer migration has changed quite a bit.  Given that we've never
>>>>>> seen vmstat work crashing in 3.18 era, I wonder whether the right
>>>>>> thing to do here is reverting 874bbfe600a6 from 3.18 stable?
>>>>>
>>>>> It's not just 3.18 that has this; 874bbfe600a6 was backported to all
>>>>> stable branches from 3.10 onward.  Only the 4.2-ckt branch has
>>>>> 22b886dd10180939.
>>>>
>>>> 22b886dd10180939 fixes a bug which was introduced with the timer wheel
>>>> overhaul in 4.2. So only 4.2/3 should have it backported.
>>>
>>> Thanks for explanation. So do I understand right that timers are always run
>>> on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to
>>> run timer for delayed work on the calling CPU) doesn't make sense there? If
>>> that is true than reverting the commit from older stable kernels is
>>> probably the easiest way to resolve the crashes.
>>
>> The commit 874bbfe600a6 ("workqueue: make sure delayed work run in
>> local cpu") forces the timer to run on the local CPU. It might be correct
>> for vmstat. But I wonder if it might break some other delayed work
>> user that depends on running on different CPU.
> 
> The default of add_timer() is to run on the current cpu. It only moves the
> timer to a different cpu when the power saving code says so. So 874bbfe600a6
> enforces that the timer runs on the cpu on which queue_delayed_work() is
> called, but before that commit it was likely that the timer was queued on the
> calling cpu. So there is nothing which can depend on running on a different
> CPU, except callers of queue_delayed_work_on() which provide the target cpu
> explicitely. 874bbfe600a6 does not affect those callers at all.
> 
> Now, what's different is:
> 
> +       if (cpu == WORK_CPU_UNBOUND)
> +               cpu = raw_smp_processor_id();
>         dwork->cpu = cpu;
> 
> So before that change dwork->cpu was set to WORK_CPU_UNBOUND. Now it's set to
> the current cpu, but I can't see how that matters.

What happens in later kernels when the cpu is offlined before the
delayed_work timer ticks? In stable 3.12, with the patch, this scenario
results in an oops:
 #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88
    [exception RIP: __queue_work+121]
    RIP: ffffffff81071989  RSP: ffff8c03fdd63e30  RFLAGS: 00010086
    RAX: ffff88048b96bc00  RBX: ffff8c03e9bcc800  RCX: ffff880473820478
    RDX: 0000000000000400  RSI: 0000000000000004  RDI: ffff880473820458
    RBP: 0000000000000000   R8: ffff8c03fdd71f40   R9: ffff8c03ea4c4002
    R10: 0000000000000000  R11: 0000000000000005  R12: ffff880473820458
    R13: 00000000000000a8  R14: 000000000000e328  R15: 00000000000000a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611
 #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7
 #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5
 #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c
#10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665
#11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835
#12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd

The CPU was 168, and that one was offlined in the meantime. So
__queue_work fails at:
  if (!(wq->flags & WQ_UNBOUND))
    pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
  else
    pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
    ^^^                           ^^^^ NODE is -1
      \ pwq is NULL

  if (last_pool && last_pool != pwq->pool) { <--- BOOM

Any ideas?

thanks,
-- 
js
suse labs


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03  9:35                   ` Jiri Slaby
@ 2016-02-03 10:41                     ` Thomas Gleixner
  2016-02-03 12:28                     ` Michal Hocko
  1 sibling, 0 replies; 57+ messages in thread
From: Thomas Gleixner @ 2016-02-03 10:41 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Petr Mladek, Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin,
	Shaohua Li, LKML, stable, Daniel Bilik

On Wed, 3 Feb 2016, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> What happens in later kernels, when the cpu is offlined before the
> delayed_work timer ticks? In stable 3.12, with the patch, this  scenario
> results in an oops:
>  #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88
>     [exception RIP: __queue_work+121]
>     RIP: ffffffff81071989  RSP: ffff8c03fdd63e30  RFLAGS: 00010086
>     RAX: ffff88048b96bc00  RBX: ffff8c03e9bcc800  RCX: ffff880473820478
>     RDX: 0000000000000400  RSI: 0000000000000004  RDI: ffff880473820458
>     RBP: 0000000000000000   R8: ffff8c03fdd71f40   R9: ffff8c03ea4c4002
>     R10: 0000000000000000  R11: 0000000000000005  R12: ffff880473820458
>     R13: 00000000000000a8  R14: 000000000000e328  R15: 00000000000000a8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611
>  #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7
>  #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5
>  #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c
> #10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665
> #11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835
> #12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd
> 
> The CPU was 168, and that one was offlined in the meantime. So
> __queue_work fails at:
>   if (!(wq->flags & WQ_UNBOUND))
>     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
>   else
>     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
>     ^^^                           ^^^^ NODE is -1
>       \ pwq is NULL
> 
>   if (last_pool && last_pool != pwq->pool) { <--- BOOM

I don't see how that works on later kernels. If cpu_to_node() returns -1 we
access outside of the array bounds....

Thanks,

	tglx


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03  9:35                   ` Jiri Slaby
  2016-02-03 10:41                     ` Thomas Gleixner
@ 2016-02-03 12:28                     ` Michal Hocko
  2016-02-03 16:24                       ` Tejun Heo
  1 sibling, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2016-02-03 12:28 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Tejun Heo,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

[I wasn't aware of this email thread before so I am jumping in late]

On Wed 03-02-16 10:35:32, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> > On Tue, 26 Jan 2016, Petr Mladek wrote:
[...]
> >> The commit 874bbfe600a6 ("workqueue: make sure delayed work run in
> >> local cpu") forces the timer to run on the local CPU. It might be correct
> >> for vmstat. But I wonder if it might break some other delayed work
> >> user that depends on running on different CPU.
> > 
> > The default of add_timer() is to run on the current cpu. It only moves the
> > timer to a different cpu when the power saving code says so. So 874bbfe600a6
> > enforces that the timer runs on the cpu on which queue_delayed_work() is
> > called, but before that commit it was likely that the timer was queued on the
> > calling cpu. So there is nothing which can depend on running on a different
> > CPU, except callers of queue_delayed_work_on() which provide the target cpu
> > explicitely. 874bbfe600a6 does not affect those callers at all.
> > 
> > Now, what's different is:
> > 
> > +       if (cpu == WORK_CPU_UNBOUND)
> > +               cpu = raw_smp_processor_id();
> >         dwork->cpu = cpu;
> > 
> > So before that change dwork->cpu was set to WORK_CPU_UNBOUND. Now it's set to
> > the current cpu, but I can't see how that matters.

It matters because if somebody did queue_delayed_work() and the current
cpu gets offlined, then even though the associated timer gets migrated,
__queue_work() no longer sees the associated cpu as WORK_CPU_UNBOUND and
won't reset it to the local CPU, so the following path will go
kaboom...
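
To make that path concrete, this is roughly (paraphrased from the 3.12/3.18
workqueue code, so treat it as an approximation) what runs when the migrated
timer finally fires:

	void delayed_work_timer_fn(unsigned long __data)
	{
		struct delayed_work *dwork = (struct delayed_work *)__data;

		/* re-queue on whatever cpu was recorded at queueing time */
		__queue_work(dwork->cpu, dwork->wq, &dwork->work);
	}

	static void __queue_work(int cpu, struct workqueue_struct *wq,
				 struct work_struct *work)
	{
		...
		/* only an explicit WORK_CPU_UNBOUND falls back to the local cpu */
		if (cpu == WORK_CPU_UNBOUND)
			cpu = raw_smp_processor_id();
		...
	}

With 874bbfe600a6, dwork->cpu is always a concrete cpu number, so an offlined
cpu sails past that check and ends up in cpu_to_node() below: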

> The CPU was 168, and that one was offlined in the meantime. So
> __queue_work fails at:
>   if (!(wq->flags & WQ_UNBOUND))
>     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
>   else
>     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
>     ^^^                           ^^^^ NODE is -1
>       \ pwq is NULL
> 
>   if (last_pool && last_pool != pwq->pool) { <--- BOOM

So I think 874bbfe600a6 is really bogus. It should be reverted. We
already have a proper fix for vmstat, 176bed1de5bf ("vmstat: explicitly
schedule per-cpu work on the CPU we need it to run on"), which should be
used for the stable trees as a replacement.
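
For reference, the shape of that vmstat fix (sketched here from its subject
line, so the details may differ from the actual commit) is simply to re-arm
the per-cpu work with the explicit _on() variant:

	/* sketch only -- the real vmstat_update() does more than this */
	static void vmstat_update(struct work_struct *w)
	{
		refresh_cpu_vm_stats();

		/* re-arm explicitly on this cpu instead of schedule_delayed_work() */
		schedule_delayed_work_on(smp_processor_id(),
					 this_cpu_ptr(&vmstat_work),
					 round_jiffies_relative(sysctl_stat_interval));
	}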

-- 
Michal Hocko
SUSE Labs


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 12:28                     ` Michal Hocko
@ 2016-02-03 16:24                       ` Tejun Heo
  2016-02-03 16:48                         ` Michal Hocko
                                           ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-03 16:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > The CPU was 168, and that one was offlined in the meantime. So
> > __queue_work fails at:
> >   if (!(wq->flags & WQ_UNBOUND))
> >     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> >   else
> >     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> >     ^^^                           ^^^^ NODE is -1
> >       \ pwq is NULL
> > 
> >   if (last_pool && last_pool != pwq->pool) { <--- BOOM

So, the proper fix here is keeping the cpu <-> node mapping stable across
cpu on/offlining, which has been worked on for a long time now.
The patchset is pending and it fixes other issues too.

> So I think 874bbfe600a6 is really bogus. It should be reverted. We
> already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> schedule per-cpu work on the CPU we need it to run on"). This which
> should be used for the stable trees as a replacement.

It's not bogus.  We can't flip a property that has been guaranteed
without any provision for verification.  Why do you think vmstat blew
up in the first place?  vmstat would be the canary case as it runs
frequently on all systems.  It's exactly the sign that we can't break
this guarantee willy-nilly.

Thanks.

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 16:24                       ` Tejun Heo
@ 2016-02-03 16:48                         ` Michal Hocko
  2016-02-03 16:59                           ` Tejun Heo
  2016-02-03 17:01                         ` Mike Galbraith
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2016-02-03 16:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed 03-02-16 11:24:41, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > >   if (!(wq->flags & WQ_UNBOUND))
> > >     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > >   else
> > >     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> > >     ^^^                           ^^^^ NODE is -1
> > >       \ pwq is NULL
> > > 
> > >   if (last_pool && last_pool != pwq->pool) { <--- BOOM
> 
> So, the proper fix here is keeping cpu <-> node mapping stable across
> cpu on/offlining which has been being worked on for a long time now.
> The patchst is pending and it fixes other issues too.

What if that node was memory-offlined as well? It just doesn't make any
sense to stick to the old node when the old cpu has already gone away. If
anything, and if add_timer_on() is really required even for WORK_CPU_UNBOUND,
then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
__queue_work() can actually fall back to the local CPU properly and handle
the offline cpu properly.

> > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> > schedule per-cpu work on the CPU we need it to run on"). This which
> > should be used for the stable trees as a replacement.
> 
> It's not bogus.  We can't flip a property that has been guaranteed
> without any provision for verification.  Why do you think vmstat blow
> up in the first place?

Because it wants to have a strong per-cpu guarantee while it failed to
say so. My understanding was that this is exactly what
queue_delayed_work_on() is for, while WORK_CPU_UNBOUND says that the
caller doesn't really insist on any particular CPU (the local CPU is just
preferred).
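
For the record, that is also how the two entry points relate in
include/linux/workqueue.h (reproduced approximately, not verbatim):
queue_delayed_work() is just the _on() variant with WORK_CPU_UNBOUND,
i.e. "no particular CPU requested":

	static inline bool queue_delayed_work(struct workqueue_struct *wq,
					      struct delayed_work *dwork,
					      unsigned long delay)
	{
		return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
	}

Whether callers may nevertheless rely on the local-CPU behaviour they have
been getting is exactly the point under dispute here.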

-- 
Michal Hocko
SUSE Labs


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 16:48                         ` Michal Hocko
@ 2016-02-03 16:59                           ` Tejun Heo
  2016-02-04  6:37                             ` Michal Hocko
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-02-03 16:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
> > So, the proper fix here is keeping cpu <-> node mapping stable across
> > cpu on/offlining which has been being worked on for a long time now.
> > The patchst is pending and it fixes other issues too.
> 
> What if that node was memory offlined as well? It just doesn't make any
> sense to stick to the old node when the old cpu went away already. If

Whether a memory node is offlined or not doesn't affect how cpus map
to the node.  The mapping is something which is fixed at the physical and
firmware level while the system is running.  If the node becomes
memory-less, what changes is the memory allocation strategy for
the node, not how cpus map to nodes.  The only problem here is that we
currently lose how we mapped logical IDs to physical ones across
off/online cycles.

> anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
> __queue_work can actually move on to the local CPU properly and handle
> the offline cpu properly.

delayed_work->cpu is determined at queueing time.  Dealing with
offlined cpus at execution time is completely fine.  There's no need to
"preserve" anything.

> > It's not bogus.  We can't flip a property that has been guaranteed
> > without any provision for verification.  Why do you think vmstat blow
> > up in the first place?
> 
> Because it wants to have a strong per-cpu guarantee while it used
> to fail to tell so. My understanding was that this is exactly what
> queue_delayed_work_on is for while WORK_CPU_UNBOUND tells that the
> caller doesn't really insist on any particular CPU (just local CPU is
> preferred).

What you said just doesn't fit reality.  Again, think about why
vmstat crashed.  Why is this difficult to understand?

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 16:24                       ` Tejun Heo
  2016-02-03 16:48                         ` Michal Hocko
@ 2016-02-03 17:01                         ` Mike Galbraith
  2016-02-03 17:06                           ` Tejun Heo
  2016-02-03 18:46                         ` Thomas Gleixner
  2016-02-05  5:44                         ` Mike Galbraith
  3 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-03 17:01 UTC (permalink / raw)
  To: Tejun Heo, Michal Hocko
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, 2016-02-03 at 11:24 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > >   if (!(wq->flags & WQ_UNBOUND))
> > >     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > >   else
> > >     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> > >     ^^^                           ^^^^ NODE is -1
> > >       \ pwq is NULL
> > > 
> > >   if (last_pool && last_pool != pwq->pool) { <--- BOOM
> 
> So, the proper fix here is keeping cpu <-> node mapping stable across
> cpu on/offlining which has been being worked on for a long time now.
> The patchst is pending and it fixes other issues too.

Hm, so it's ok to queue work to an offline CPU?  What happens if it
doesn't come back for an eternity or two?

	-Mike


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 17:01                         ` Mike Galbraith
@ 2016-02-03 17:06                           ` Tejun Heo
  2016-02-03 17:13                             ` Mike Galbraith
                                               ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-03 17:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> Hm, so it's ok to queue work to an offline CPU?  What happens if it
> doesn't come back for an eternity or two?

Right now, it just loses affinity.  A more interesting case is a cpu
going offline while work items bound to the cpu are still running, and
the root problem is that we've never distinguished between affinity
for correctness and affinity as an optimization, and thus can't flush or
warn on the stragglers.  The plan is to ensure that all correctness users
specify the CPU explicitly.  Once we're there, we can warn on illegal usages.

Thanks.

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 17:06                           ` Tejun Heo
@ 2016-02-03 17:13                             ` Mike Galbraith
  2016-02-03 17:15                               ` Tejun Heo
  2016-02-04  2:00                             ` Mike Galbraith
  2016-02-04 10:04                             ` Mike Galbraith
  2 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-03 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > doesn't come back for an eternity or two?
> 
> Right now, it just loses affinity.  A more interesting case is a cpu
> going offline whlie work items bound to the cpu are still running and
> the root problem is that we've never distinguished between affinity
> for correctness and optimization and thus can't flush or warn on the
> stagglers.  The plan is to ensure that all correctness users specify
> the CPU explicitly.  Once we're there, we can warn on illegal usages.

Ah, and the rest (the vast majority) can then be safely deflected away
from nohz_full cpus.

	-Mike


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 17:13                             ` Mike Galbraith
@ 2016-02-03 17:15                               ` Tejun Heo
  0 siblings, 0 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-03 17:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Wed, Feb 03, 2016 at 06:13:15PM +0100, Mike Galbraith wrote:
> Ah, and the rest (the vast majority) can then be safely deflected away
> from nohz_full cpus.

Yeap, it should be possible to bounce the majority of work items across
CPUs all we want.

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 16:24                       ` Tejun Heo
  2016-02-03 16:48                         ` Michal Hocko
  2016-02-03 17:01                         ` Mike Galbraith
@ 2016-02-03 18:46                         ` Thomas Gleixner
  2016-02-03 19:01                           ` Tejun Heo
  2016-02-05  5:44                         ` Mike Galbraith
  3 siblings, 1 reply; 57+ messages in thread
From: Thomas Gleixner @ 2016-02-03 18:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, 3 Feb 2016, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > >   if (!(wq->flags & WQ_UNBOUND))
> > >     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > >   else
> > >     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> > >     ^^^                           ^^^^ NODE is -1
> > >       \ pwq is NULL
> > > 
> > >   if (last_pool && last_pool != pwq->pool) { <--- BOOM
> 
> So, the proper fix here is keeping cpu <-> node mapping stable across
> cpu on/offlining which has been being worked on for a long time now.
> The patchst is pending and it fixes other issues too.
> 
> > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> > schedule per-cpu work on the CPU we need it to run on"). This which
> > should be used for the stable trees as a replacement.
> 
> It's not bogus.  We can't flip a property that has been guaranteed
> without any provision for verification.  Why do you think vmstat blow
> up in the first place?  vmstat would be the canary case as it runs
> frequently on all systems.  It's exactly the sign that we can't break
> this guarantee willy-nilly.

You're in complete failure denial mode once again.

Fact is:

  That patch breaks stuff because there is no stable cpu -> node mapping
  across cpu on/offlining. As a result this selects unbound_pwq_by_node()
  with node -1.

The reason why you need to do that work->cpu assignment might be legitimate,
but that does not justify exposing systems to a lurking out-of-bounds
access which results in a NULL pointer dereference.

As long as cpu_to_node(cpu) can return -1, we need a sanity check there. And
we need that now and not at some point in the future when the patches
establishing a stable cpu -> node mapping are finished.

Stop arguing around a bug which really exists and was exposed by this patch.

Thanks,

	tglx


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 18:46                         ` Thomas Gleixner
@ 2016-02-03 19:01                           ` Tejun Heo
  2016-02-03 19:05                             ` Thomas Gleixner
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-02-03 19:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

Hello, Thomas.

On Wed, Feb 03, 2016 at 07:46:11PM +0100, Thomas Gleixner wrote:
> > > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> > > schedule per-cpu work on the CPU we need it to run on"). This which
> > > should be used for the stable trees as a replacement.
> > 
> > It's not bogus.  We can't flip a property that has been guaranteed
> > without any provision for verification.  Why do you think vmstat blow
> > up in the first place?  vmstat would be the canary case as it runs
> > frequently on all systems.  It's exactly the sign that we can't break
> > this guarantee willy-nilly.
> 
> You're in complete failure denial mode once again.

Well, you're in an unnecessary escalation mode as usual.  Was the
attitude really necessary?  Chill out and read the thread again.
Michal is saying the dwork->cpu assignment was bogus and I was
refuting that.

> Fact is:
> 
>   That patch breaks stuff because there is no stable cpu -> node mapping
>   accross cpu on/offlining. As a result this selects unbound_pwq_by_node() on
>   node -1.
> 
> The reason why you need to do that work->cpu assignment might be legitimate,
> but that does not justify that you expose systems to a lurking out of bounds
> access which results in a NULL pointer dereference.
> 
> As long as cpu_to_node(cpu) can return -1, we need a sanity check there. And
> we need that now and not at some point in the future when the patches
> establishing a stable cpu -> node mapping are finished.
> 
> Stop arguing around a bug which really exists and was exposed by this patch.

Michal brought it up here, but there's a different thread where Mike
reported the NUMA_NO_NODE issue and I already posted the fix.

 http://lkml.kernel.org/g/20160203185425.GK14091@mtj.duckdns.org

Thanks.

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 19:01                           ` Tejun Heo
@ 2016-02-03 19:05                             ` Thomas Gleixner
  2016-02-03 19:15                               ` Tejun Heo
  0 siblings, 1 reply; 57+ messages in thread
From: Thomas Gleixner @ 2016-02-03 19:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, 3 Feb 2016, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 07:46:11PM +0100, Thomas Gleixner wrote:
> > > > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> > > > schedule per-cpu work on the CPU we need it to run on"). This which
> > > > should be used for the stable trees as a replacement.
> > > 
> > > It's not bogus.  We can't flip a property that has been guaranteed
> > > without any provision for verification.  Why do you think vmstat blow
> > > up in the first place?  vmstat would be the canary case as it runs
> > > frequently on all systems.  It's exactly the sign that we can't break
> > > this guarantee willy-nilly.
> > 
> > You're in complete failure denial mode once again.
> 
> Well, you're in an unnecessary escalation mode as usual.  Was the
> attitude really necessary?  Chill out and read the thread again.
> Michal is saying the dwork->cpu assignment was bogus and I was
> refuting that.

Right, but at the same time you could have admitted that the current state is
buggy and needs a sanity check in unbound_pwq_by_node().
 
> Michal brought it up here but there's a different thread where Mike
> reported NUMA_NO_NODE issue and I already posted the fix.
> 
>  http://lkml.kernel.org/g/20160203185425.GK14091@mtj.duckdns.org

5 minutes ago, w/o cc'ing the people who participated in that discussion.

Thanks,

	tglx


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 19:05                             ` Thomas Gleixner
@ 2016-02-03 19:15                               ` Tejun Heo
  0 siblings, 0 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-03 19:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, Feb 03, 2016 at 08:05:57PM +0100, Thomas Gleixner wrote:
> > Well, you're in an unnecessary escalation mode as usual.  Was the
> > attitude really necessary?  Chill out and read the thread again.
> > Michal is saying the dwork->cpu assignment was bogus and I was
> > refuting that.
>
> Right, but at the same time you could have admitted, that the current state is
> buggy and needs a sanity check in unbound_pwq_by_node().

It's crashing.  Of course it's buggy.  The main discussion of the bug
was on the other thread, and I was trying to clear up the confusion
posted on this thread.

> > Michal brought it up here but there's a different thread where Mike
> > reported NUMA_NO_NODE issue and I already posted the fix.
> > 
> >  http://lkml.kernel.org/g/20160203185425.GK14091@mtj.duckdns.org
> 
> 5 minute ago w/o cc'ing the people who participated in that discussion.

Yeah, I got to the thread this morning and got your email right after
sending out the patch.  I don't know.  The handling of this wasn't out
of the norm.  I asked this multiple times but let me try again.  Can
we please try to stay civil and technical?  There are times where
escalation is necessary but this one was gratuitous.

Thanks.

-- 
tejun


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 17:06                           ` Tejun Heo
  2016-02-03 17:13                             ` Mike Galbraith
@ 2016-02-04  2:00                             ` Mike Galbraith
  2016-02-05 16:49                               ` Tejun Heo
  2016-02-04 10:04                             ` Mike Galbraith
  2 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-04  2:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > doesn't come back for an eternity or two?
> 
> Right now, it just loses affinity.  A more interesting case is a cpu
> going offline whlie work items bound to the cpu are still running and
> the root problem is that we've never distinguished between affinity
> for correctness and optimization and thus can't flush or warn on the
> stagglers.  The plan is to ensure that all correctness users specify
> the CPU explicitly.  Once we're there, we can warn on illegal usages.

Isn't it the case that, currently at least, each and every spot that
requires execution on a specific CPU yet does not take active measures
to deal with hotplug events is in fact buggy?  The timer code clearly
states that the user is responsible, and so do both workqueue.[ch].

It surprised me to hear that some think they have an ironclad
guarantee, given the null and void clause is prominently displayed.

	-Mike


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 16:59                           ` Tejun Heo
@ 2016-02-04  6:37                             ` Michal Hocko
  2016-02-04  7:40                               ` Michal Hocko
  0 siblings, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2016-02-04  6:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed 03-02-16 11:59:01, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
[...]
> > anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> > then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
> > __queue_work can actually move on to the local CPU properly and handle
> > the offline cpu properly.
> 
> delayed_work->cpu is determined on queueing time.  Dealing with
> offlined cpus at execution is completley fine.  There's no need to
> "preserve" anything.

I've seen you have posted a fix in the meantime, but just for my
understanding: why is the following not an appropriate fix?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c579dbab2e36..52bb11cf20d1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1459,9 +1459,9 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 
 	dwork->wq = wq;
 	/* timer isn't guaranteed to run in this cpu, record earlier */
+	dwork->cpu = cpu;
 	if (cpu == WORK_CPU_UNBOUND)
 		cpu = raw_smp_processor_id();
-	dwork->cpu = cpu;
 	timer->expires = jiffies + delay;
 
 	add_timer_on(timer, cpu);

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04  6:37                             ` Michal Hocko
@ 2016-02-04  7:40                               ` Michal Hocko
  0 siblings, 0 replies; 57+ messages in thread
From: Michal Hocko @ 2016-02-04  7:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Thu 04-02-16 07:37:23, Michal Hocko wrote:
> On Wed 03-02-16 11:59:01, Tejun Heo wrote:
> > On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
> [...]
> > > anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> > > then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
> > > __queue_work can actually move on to the local CPU properly and handle
> > > the offline cpu properly.
> > 
> > delayed_work->cpu is determined on queueing time.  Dealing with
> > offlined cpus at execution is completely fine.  There's no need to
> > "preserve" anything.
> 
> I've seen you have posted a fix in the meantime, but just for my
> understanding: why is the following not an appropriate fix?
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c579dbab2e36..52bb11cf20d1 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1459,9 +1459,9 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
>  
>  	dwork->wq = wq;
>  	/* timer isn't guaranteed to run in this cpu, record earlier */
> +	dwork->cpu = cpu;
>  	if (cpu == WORK_CPU_UNBOUND)
>  		cpu = raw_smp_processor_id();
> -	dwork->cpu = cpu;
>  	timer->expires = jiffies + delay;
>  
>  	add_timer_on(timer, cpu);

Ok, so after some more thinking about that, this won't really help for
a memoryless CPU, which would still have NUMA_NO_NODE associated with it
AFAIU. So this is definitely better handled at the unbound_pwq_by_node
level.
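
To make that concrete, handling it at that level would be roughly the
following. This is only a sketch, assuming that falling back to the
default unbound pwq (wq->dfl_pwq) is acceptable for a CPU whose node is
unknown:

	static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
							  int node)
	{
		/* sketch: an offlined CPU may map to NUMA_NO_NODE */
		if (unlikely(node == NUMA_NO_NODE))
			return wq->dfl_pwq;

		return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
	}
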
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 17:06                           ` Tejun Heo
  2016-02-03 17:13                             ` Mike Galbraith
  2016-02-04  2:00                             ` Mike Galbraith
@ 2016-02-04 10:04                             ` Mike Galbraith
  2016-02-04 10:46                               ` Thomas Gleixner
  2 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-04 10:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > doesn't come back for an eternity or two?
> 
> Right now, it just loses affinity....

WRT affinity...

Somebody somewhere queues a delayed work, a timer is started on CPUX,
work is targeted at CPUX.  Now wash/rinse/repeat mod_delayed_work()
along with migrations.  Should __queue_delayed_work() not refrain from
altering dwork->cpu once set?

I'm also wondering why 22b886dd only applies to kernels >= 4.2.

<quote>
Regardless of the previous CPU a timer was on, add_timer_on()
currently simply sets timer->flags to the new CPU.  As the caller must
be seeing the timer as idle, this is locally fine, but the timer
leaving the old base while unlocked can lead to race conditions as
follows.

Let's say timer was on cpu 0.

  cpu 0                                 cpu 1
  -----------------------------------------------------------------------------
  del_timer(timer) succeeds
                                        del_timer(timer)
                                          lock_timer_base(timer) locks cpu_0_base
  add_timer_on(timer, 1)
    spin_lock(&cpu_1_base->lock)
    timer->flags set to cpu_1_base
    operates on @timer                    operates on @timer
</quote>

What's the difference between...
     timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
and...
     timer_set_base(timer, base);

...that makes that fix unneeded prior to 4.2?  We take the same locks
in < 4.2 kernels, so seemingly both will diddle concurrently above.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04 10:04                             ` Mike Galbraith
@ 2016-02-04 10:46                               ` Thomas Gleixner
  2016-02-04 11:07                                 ` Mike Galbraith
  2016-02-04 11:20                                 ` Jan Kara
  0 siblings, 2 replies; 57+ messages in thread
From: Thomas Gleixner @ 2016-02-04 10:46 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Thu, 4 Feb 2016, Mike Galbraith wrote:
> On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > > doesn't come back for an eternity or two?
> > 
> > Right now, it just loses affinity....
> 
> WRT affinity...
> 
> Somebody somewhere queues a delayed work, a timer is started on CPUX,
> work is targeted at CPUX.  Now wash/rinse/repeat mod_delayed_work()
> along with migrations.  Should __queue_delayed_work() not refrain from
> altering dwork->cpu once set?
> 
> I'm also wondering why 22b886dd only applies to kernels >= 4.2.
> 
> <quote>
> Regardless of the previous CPU a timer was on, add_timer_on()
> currently simply sets timer->flags to the new CPU.  As the caller must
> be seeing the timer as idle, this is locally fine, but the timer
> leaving the old base while unlocked can lead to race conditions as
> follows.
> 
> Let's say timer was on cpu 0.
> 
>   cpu 0                                 cpu 1
>   -----------------------------------------------------------------------------
>   del_timer(timer) succeeds
>                                         del_timer(timer)
>                                           lock_timer_base(timer) locks cpu_0_base
>   add_timer_on(timer, 1)
>     spin_lock(&cpu_1_base->lock)
>     timer->flags set to cpu_1_base
>     operates on @timer                    operates on @timer
> </quote>
> 
> What's the difference between...
>      timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> and...
>      timer_set_base(timer, base);
> 
> ...that makes that fix unneeded prior to 4.2?  We take the same locks
> in < 4.2 kernels, so seemingly both will diddle concurrently above.

Indeed, you are right.

The same can happen on pre-4.2, just the fix does not apply as we changed
the internals of how the base is managed in the timer itself. Backport below.

Thanks,

	tglx

8<----------------------------

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -956,13 +956,26 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *new_base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *base;
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
+
+	/*
+	 * If @timer was on a different CPU, it must be migrated with the
+	 * old base locked to prevent other operations proceeding with the
+	 * wrong base locked.  See lock_timer_base().
+	 */
+	base = lock_timer_base(timer, &flags);
+	if (base != new_base) {
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
+	}
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04 10:46                               ` Thomas Gleixner
@ 2016-02-04 11:07                                 ` Mike Galbraith
  2016-02-04 11:20                                 ` Jan Kara
  1 sibling, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2016-02-04 11:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Thu, 2016-02-04 at 11:46 +0100, Thomas Gleixner wrote:
> On Thu, 4 Feb 2016, Mike Galbraith wrote:

> > I'm also wondering why 22b886dd only applies to kernels >= 4.2.
> > 
> > 
> > Regardless of the previous CPU a timer was on, add_timer_on()
> > currently simply sets timer->flags to the new CPU.  As the caller must
> > be seeing the timer as idle, this is locally fine, but the timer
> > leaving the old base while unlocked can lead to race conditions as
> > follows.
> > 
> > Let's say timer was on cpu 0.
> > 
> >   cpu 0                                 cpu 1
> >   -----------------------------------------------------------------------------
> >   del_timer(timer) succeeds
> >                                         del_timer(timer)
> >                                           lock_timer_base(timer) locks cpu_0_base
> >   add_timer_on(timer, 1)
> >     spin_lock(&cpu_1_base->lock)
> >     timer->flags set to cpu_1_base
> >     operates on @timer                    operates on @timer
> > 
> > 
> > What's the difference between...
> >      timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> > and...
> >      timer_set_base(timer, base);
> > 
> > ...that makes that fix unneeded prior to 4.2?  We take the same locks
> > in < 4.2 kernels, so seemingly both will diddle concurrently above.
> 
> Indeed, you are right.

Whew, thanks for confirming, looking for what the hell I was missing
wasn't going well at all, ate most of my day.

> The same can happen on pre-4.2, just the fix does not apply as we changed
> the internals of how the base is managed in the timer itself. Backport below.

Exactly what I did locally.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04 10:46                               ` Thomas Gleixner
  2016-02-04 11:07                                 ` Mike Galbraith
@ 2016-02-04 11:20                                 ` Jan Kara
  2016-02-04 16:39                                   ` Daniel Bilik
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Kara @ 2016-02-04 11:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek,
	Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Thu 04-02-16 11:46:47, Thomas Gleixner wrote:
> On Thu, 4 Feb 2016, Mike Galbraith wrote:
> > On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> > > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > > > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > > > doesn't come back for an eternity or two?
> > > 
> > > Right now, it just loses affinity....
> > 
> > WRT affinity...
> > 
> > Somebody somewhere queues a delayed work, a timer is started on CPUX,
> > work is targeted at CPUX.  Now wash/rinse/repeat mod_delayed_work()
> > along with migrations.  Should __queue_delayed_work() not refrain from
> > altering dwork->cpu once set?
> > 
> > I'm also wondering why 22b886dd only applies to kernels >= 4.2.
> > 
> > <quote>
> > Regardless of the previous CPU a timer was on, add_timer_on()
> > currently simply sets timer->flags to the new CPU.  As the caller must
> > be seeing the timer as idle, this is locally fine, but the timer
> > leaving the old base while unlocked can lead to race conditions as
> > follows.
> > 
> > Let's say timer was on cpu 0.
> > 
> >   cpu 0                                 cpu 1
> >   -----------------------------------------------------------------------------
> >   del_timer(timer) succeeds
> >                                         del_timer(timer)
> >                                           lock_timer_base(timer) locks cpu_0_base
> >   add_timer_on(timer, 1)
> >     spin_lock(&cpu_1_base->lock)
> >     timer->flags set to cpu_1_base
> >     operates on @timer                    operates on @timer
> > </quote>
> > 
> > What's the difference between...
> >      timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> > and...
> >      timer_set_base(timer, base);
> > 
> > ...that makes that fix unneeded prior to 4.2?  We take the same locks
> > in < 4.2 kernels, so seemingly both will diddle concurrently above.
> 
> Indeed, you are right.
> 
> The same can happen on pre-4.2, just the fix does not apply as we changed
> the internals of how the base is managed in the timer itself. Backport below.

Thanks for the backport, Thomas, and to Mike for persistence :). I've asked
my friend who is seeing crashes with 3.18.25 to try whether this patch
fixes the issues. It may take some time, so stay tuned...

								Honza

> 8<----------------------------
> 
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -956,13 +956,26 @@ EXPORT_SYMBOL(add_timer);
>   */
>  void add_timer_on(struct timer_list *timer, int cpu)
>  {
> -	struct tvec_base *base = per_cpu(tvec_bases, cpu);
> +	struct tvec_base *new_base = per_cpu(tvec_bases, cpu);
> +	struct tvec_base *base;
>  	unsigned long flags;
>  
>  	timer_stats_timer_set_start_info(timer);
>  	BUG_ON(timer_pending(timer) || !timer->function);
> -	spin_lock_irqsave(&base->lock, flags);
> -	timer_set_base(timer, base);
> +
> +	/*
> +	 * If @timer was on a different CPU, it must be migrated with the
> +	 * old base locked to prevent other operations proceeding with the
> +	 * wrong base locked.  See lock_timer_base().
> +	 */
> +	base = lock_timer_base(timer, &flags);
> +	if (base != new_base) {
> +		timer_set_base(timer, NULL);
> +		spin_unlock(&base->lock);
> +		base = new_base;
> +		spin_lock(&base->lock);
> +		timer_set_base(timer, base);
> +	}
>  	debug_activate(timer, timer->expires);
>  	internal_add_timer(base, timer);
>  	spin_unlock_irqrestore(&base->lock, flags);
> 
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04 11:20                                 ` Jan Kara
@ 2016-02-04 16:39                                   ` Daniel Bilik
  2016-02-05  2:40                                     ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Daniel Bilik @ 2016-02-04 16:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Thomas Gleixner, Mike Galbraith, LKML, stable

On Thu, 4 Feb 2016 12:20:44 +0100
Jan Kara <jack@suse.cz> wrote:

> Thanks for the backport, Thomas, and to Mike for persistence :). I've asked
> my friend who is seeing crashes with 3.18.25 to try whether this patch
> fixes the issues. It may take some time, so stay tuned...

Patch tested, and it really fixes the crash we were experiencing on 3.18.25
with commit 874bbfe+. But it seems to introduce a (rather scary) regression.
The tested host shows abnormal cpu usage in both kernel and userland under
the same load and traffic pattern. One picture is worth a thousand words,
so I've taken snapshots of our graphs, see here:
http://neosystem.cz/test/linux-3.18.25/
The host was running 3.18.25 with commit 874bbfe+ (1e7af29+ on
3.18-stable) reverted. With this commit included, it crashed within
minutes. Around 13:30 we booted 3.18.25 with commit 874bbfe+ included and
with the patch from Thomas. And around 15:40 we booted the host with the
previous kernel, just to ensure this abnormal behaviour was really caused
by the test kernel.
Also interesting: in addition to the high cpu usage, there is an abnormally
high number of zombie processes reported by the system.

HTH.

--
						Daniel Bilik
						neosystem.cz

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04 16:39                                   ` Daniel Bilik
@ 2016-02-05  2:40                                     ` Mike Galbraith
  2016-02-05  8:11                                       ` Daniel Bilik
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-05  2:40 UTC (permalink / raw)
  To: Daniel Bilik, Jan Kara; +Cc: Thomas Gleixner, LKML, stable

On Thu, 2016-02-04 at 17:39 +0100, Daniel Bilik wrote:
> On Thu, 4 Feb 2016 12:20:44 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > Thanks for backport Thomas and to Mike for persistence :). I've asked my
> > friend seeing crashes with 3.18.25 to try whether this patch fixes the
> > issues. It may take some time so stay tuned...
> 
> Patch tested and it really fixes the crash we were experiencing on 3.18.25
> with commit 874bbfe+. But it seems to introduce a (rather scary) regression.
> Tested host shows abnormal cpu usage in both kernel and userland under the
> same load and traffic pattern. One picture is worth a thousand words, so
> I've taken snapshots of our graphs, see here:
> http://neosystem.cz/test/linux-3.18.25/
> The host was running 3.18.25 with commit 874bbfe+ (1e7af29+ on
> 3.18-stable) reverted. With this commit included, it crashed within
> minutes. Around 13:30 we booted 3.18.25 with commit 874bbfe+ included and
> with the patch from Thomas. And around 15:40 we've booted the host with
> previous kernel, just to ensure this abnormal behaviour was really caused
> by the test kernel.
> Also interesting, in addition to high cpu usage, there is abnormally high
> number of zombie processes reported by the system.

IMHO you should restore the CC list and re-post.  (If I were the
maintainer of either the workqueue code or 3.18-stable, I'd be highly
interested in this finding).

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-03 16:24                       ` Tejun Heo
                                           ` (2 preceding siblings ...)
  2016-02-03 18:46                         ` Thomas Gleixner
@ 2016-02-05  5:44                         ` Mike Galbraith
  3 siblings, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2016-02-05  5:44 UTC (permalink / raw)
  To: Tejun Heo, Michal Hocko
  Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings,
	Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, 2016-02-03 at 11:24 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > >   if (!(wq->flags & WQ_UNBOUND))
> > >     pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > >   else
> > >     pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> > >     ^^^                           ^^^^ NODE is -1
> > >       \ pwq is NULL
> > > 
> > >   if (last_pool && last_pool != pwq->pool) { <--- BOOM
> 
> So, the proper fix here is keeping cpu <-> node mapping stable across
> cpu on/offlining which has been being worked on for a long time now.
> The patchset is pending and it fixes other issues too.
> 
> > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > already have a proper fix for vmstat 176bed1de5bf ("vmstat:
> > explicitly
> > schedule per-cpu work on the CPU we need it to run on"). This which
> > should be used for the stable trees as a replacement.
> 
> It's not bogus.  We can't flip a property that has been guaranteed
> without any provision for verification.  Why do you think vmstat blew
> up in the first place?  vmstat would be the canary case as it runs
> frequently on all systems.  It's exactly the sign that we can't break
> this guarantee willy-nilly.

If the intent of the below is to fulfill a guarantee...

+       /* timer isn't guaranteed to run in this cpu, record earlier */
+       if (cpu == WORK_CPU_UNBOUND)
+               cpu = raw_smp_processor_id();
        dwork->cpu = cpu;
        timer->expires = jiffies + delay;
  
-       if (unlikely(cpu != WORK_CPU_UNBOUND))
-               add_timer_on(timer, cpu);
-       else
-               add_timer(timer);
+       add_timer_on(timer, cpu);

...it appears to be incomplete.  Hotplug aside, when adding a timer
with the expectation that it stay put, should it not also be pinned?

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05  2:40                                     ` Mike Galbraith
@ 2016-02-05  8:11                                       ` Daniel Bilik
  2016-02-05  8:33                                         ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Daniel Bilik @ 2016-02-05  8:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jan Kara, Thomas Gleixner, Tejun Heo, Michal Hocko, Jiri Slaby,
	Petr Mladek, Sasha Levin, Shaohua Li, LKML, stable

On Fri, 05 Feb 2016 03:40:46 +0100
Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> On Thu, 2016-02-04 at 17:39 +0100, Daniel Bilik wrote:
> > On Thu, 4 Feb 2016 12:20:44 +0100
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > Thanks for the backport, Thomas, and to Mike for persistence :). I've
> > > asked my friend who is seeing crashes with 3.18.25 to try whether this
> > > patch fixes the issues. It may take some time, so stay tuned...
> > 
> > Patch tested and it really fixes the crash we were experiencing on
> > 3.18.25 with commit 874bbfe+. But it seems to introduce a (rather scary)
> > regression. Tested host shows abnormal cpu usage in both kernel and
> > userland under the same load and traffic pattern. One picture is worth
> > a thousand words, so I've taken snapshots of our graphs, see here:
> > http://neosystem.cz/test/linux-3.18.25/
> > The host was running 3.18.25 with commit 874bbfe+ (1e7af29+ on
> > 3.18-stable) reverted. With this commit included, it crashed within
> > minutes. Around 13:30 we booted 3.18.25 with commit 874bbfe+ included
> > and with the patch from Thomas. And around 15:40 we've booted the host
> > with previous kernel, just to ensure this abnormal behaviour was
> > really caused by the test kernel.
> > Also interesting, in addition to high cpu usage, there is abnormally
> > high number of zombie processes reported by the system.
> 
> IMHO you should restore the CC list and re-post.  (If I were the
> maintainer of either the workqueue code or 3.18-stable, I'd be highly
> interested in this finding).

Sorry, I hadn't realized that the patch proposed by Thomas is already on its
way to stable. CC restored and re-posting.

--
						Daniel Bilik
						neosystem.cz

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05  8:11                                       ` Daniel Bilik
@ 2016-02-05  8:33                                         ` Mike Galbraith
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2016-02-05  8:33 UTC (permalink / raw)
  To: Daniel Bilik
  Cc: Jan Kara, Thomas Gleixner, Tejun Heo, Michal Hocko, Jiri Slaby,
	Petr Mladek, Sasha Levin, Shaohua Li, LKML, stable

On Fri, 2016-02-05 at 09:11 +0100, Daniel Bilik wrote:
> On Fri, 05 Feb 2016 03:40:46 +0100
> Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> > IMHO you should restore the CC list and re-post.  (If I were the
> > maintainer of either the workqueue code or 3.18-stable, I'd be highly
> > interested in this finding).
> 
> Sorry, I hadn't realized that the patch proposed by Thomas is already on its
> way to stable. CC restored and re-posting.

I don't know where it's at, but where things stand is this: the fix is
needed, yet when combined with the patch which at least uncovered the
fact that it's needed, the two aren't playing well together according
to your test result.  Given that both patches are already in upstream
kernels, and presumably Thomas's patch will eventually wander to stable
to fix things up, there might be some maintainer interest.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-04  2:00                             ` Mike Galbraith
@ 2016-02-05 16:49                               ` Tejun Heo
  2016-02-05 20:47                                 ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-02-05 16:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

Hello, Mike.

On Thu, Feb 04, 2016 at 03:00:17AM +0100, Mike Galbraith wrote:
> Isn't it the case that, currently at least, each and every spot that
> requires execution on a specific CPU yet does not take active measures
> to deal with hotplug events is in fact buggy?  The timer code clearly
> states that the user is responsible, and so do both workqueue.[ch].

Yeah, the usages which require affinity for correctness must flush the
work items from a cpu down callback.
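
For illustration, a minimal sketch of that pattern on a 3.18-era kernel;
my_wq, my_dwork and my_cpu_callback are placeholder names, and it uses
the old-style cpu notifier interface:

	static DEFINE_PER_CPU(struct delayed_work, my_dwork);
	static struct workqueue_struct *my_wq;

	static int my_cpu_callback(struct notifier_block *nb,
				   unsigned long action, void *hcpu)
	{
		unsigned int cpu = (unsigned long)hcpu;

		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_DOWN_PREPARE:
			/* affinity is for correctness: nothing may be left
			 * pending or running on the outgoing CPU */
			cancel_delayed_work_sync(&per_cpu(my_dwork, cpu));
			break;
		case CPU_ONLINE:
			queue_delayed_work_on(cpu, my_wq,
					      &per_cpu(my_dwork, cpu), HZ);
			break;
		}
		return NOTIFY_OK;
	}

(registered once with register_cpu_notifier() during init)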

> I was surprised to hear that some think they have an ironclad
> guarantee, given that the null and void clause is prominently displayed.

Nobody is (or at least should be) expecting workqueue to handle
affinity across CPU offlining events.  That is not the problem.  The
problem is that currently queue_work(work) and
queue_work_on(smp_processor_id(), work) are identical and there likely
are affinity-for-correctness users which are doing the former.
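
Concretely, both of the following behave the same today, but only the
first one states the requirement; system_wq and my_work are just
placeholders for this sketch:

	/* affinity needed for correctness: says so explicitly */
	queue_work_on(smp_processor_id(), system_wq, &my_work);

	/* no affinity requirement: today this also runs locally, but
	 * that is an implementation detail, not a guarantee */
	queue_work(system_wq, &my_work);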

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05 16:49                               ` Tejun Heo
@ 2016-02-05 20:47                                 ` Mike Galbraith
  2016-02-05 20:54                                   ` Tejun Heo
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-05 20:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Fri, 2016-02-05 at 11:49 -0500, Tejun Heo wrote:
> Hello, Mike.
> 
> On Thu, Feb 04, 2016 at 03:00:17AM +0100, Mike Galbraith wrote:
> > Isn't it the case that, currently at least, each and every spot that
> > requires execution on a specific CPU yet does not take active measures
> > to deal with hotplug events is in fact buggy?  The timer code clearly
> > states that the user is responsible, and so do both workqueue.[ch].
> 
> Yeah, the usages which require affinity for correctness must flush the
> work items from a cpu down callback.

Good, we agree.  Now bear with me a moment..

That very point is what makes it wrong for the workqueue code to ever
target a work item.  The instant it does target selection, correctness
may be at stake; it doesn't know, thus it must assume the full onus,
which it has neither the knowledge nor the time to do.  That's how we
exploded on node = -1, trying to help out the user by doing his job,
but then not doing the whole job.  IMHO, a better plan is to let the
user screw it up all by himself.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05 20:47                                 ` Mike Galbraith
@ 2016-02-05 20:54                                   ` Tejun Heo
  2016-02-05 20:59                                     ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-02-05 20:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

Hello, Mike.

On Fri, Feb 05, 2016 at 09:47:11PM +0100, Mike Galbraith wrote:
> That very point is what makes it wrong for the workqueue code to ever
> target a work item.  The instant it does target selection, correctness
> may be at stake, it doesn't know, thus it must assume the full onus,
> which it has neither the knowledge nor the time to do.  That's how we
> exploded on node = -1, trying to help out the user by doing his job,

I have a hard time seeing the NUMA_NO_NODE bug as something that is
indicative of anything.  It is a dumb bug on the mm side which puts
everyone using cpu_to_node() at risk.

> but then not doing the whole job.  IMHO, a better plan is to let the
> user screw it up all by himself.

What are you suggesting?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05 20:54                                   ` Tejun Heo
@ 2016-02-05 20:59                                     ` Mike Galbraith
  2016-02-05 21:06                                       ` Tejun Heo
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-05 20:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:

> What are you suggesting?

That 874bbfe6 should die.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05 20:59                                     ` Mike Galbraith
@ 2016-02-05 21:06                                       ` Tejun Heo
  2016-02-06 13:07                                         ` Henrique de Moraes Holschuh
  2016-02-09 15:31                                         ` Mike Galbraith
  0 siblings, 2 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-05 21:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> 
> > What are you suggesting?
> 
> That 874bbfe6 should die.

Yeah, it's gonna be killed.  The commit is there because the behavior
change broke things.  We don't want to guarantee it but have been and
can't change it right away just because we don't like it when things
may break from it.  The plan is to implement a debug option to force
workqueue to always execute these work items on a foreign cpu to weed
out breakages.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05 21:06                                       ` Tejun Heo
@ 2016-02-06 13:07                                         ` Henrique de Moraes Holschuh
  2016-02-07  5:19                                           ` Mike Galbraith
  2016-02-09 15:31                                         ` Mike Galbraith
  1 sibling, 1 reply; 57+ messages in thread
From: Henrique de Moraes Holschuh @ 2016-02-06 13:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik

On Fri, 05 Feb 2016, Tejun Heo wrote:
> On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> > 
> > > What are you suggesting?
> > 
> > That 874bbfe6 should die.
> 
> Yeah, it's gonna be killed.  The commit is there because the behavior
> change broke things.  We don't want to guarantee it but have been and
> can't change it right away just because we don't like it when things
> may break from it.  The plan is to implement a debug option to force
> workqueue to always execute these work items on a foreign cpu to weed
> out breakages.

Is there a path to filter down sane behavior (whichever one it might be) to
the affected stable/LTS kernels?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-06 13:07                                         ` Henrique de Moraes Holschuh
@ 2016-02-07  5:19                                           ` Mike Galbraith
  2016-02-07  5:59                                             ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-07  5:19 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh, Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Sat, 2016-02-06 at 11:07 -0200, Henrique de Moraes Holschuh wrote:
> On Fri, 05 Feb 2016, Tejun Heo wrote:
> > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> > > 
> > > > What are you suggesting?
> > > 
> > > That 874bbfe6 should die.
> > 
> > Yeah, it's gonna be killed.  The commit is there because the behavior
> > change broke things.  We don't want to guarantee it but have been and
> > can't change it right away just because we don't like it when things
> > may break from it.  The plan is to implement a debug option to force
> > workqueue to always execute these work items on a foreign cpu to weed
> > out breakages.
> 
> Is there a path to filter down sane behavior (whichever one it might be) to
> the affected stable/LTS kernels?

What Michal said: replace 874bbfe6 with 176bed1d.  Without 22b886dd,
874bbfe6 is a landmine; it uses add_timer_on() as if it were mod_timer(),
which it is not, or rather was not until 22b886dd came along, and it still
does not look like the mod_timer() alias that add_timer() is.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-07  5:19                                           ` Mike Galbraith
@ 2016-02-07  5:59                                             ` Mike Galbraith
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2016-02-07  5:59 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh, Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik

On Sun, 2016-02-07 at 06:19 +0100, Mike Galbraith wrote:
> On Sat, 2016-02-06 at 11:07 -0200, Henrique de Moraes Holschuh wrote:
> > On Fri, 05 Feb 2016, Tejun Heo wrote:
> > > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> > > > 
> > > > > What are you suggesting?
> > > > 
> > > > That 874bbfe6 should die.
> > > 
> > > Yeah, it's gonna be killed.  The commit is there because the behavior
> > > change broke things.  We don't want to guarantee it but have been and
> > > can't change it right away just because we don't like it when things
> > > may break from it.  The plan is to implement a debug option to force
> > > workqueue to always execute these work items on a foreign cpu to weed
> > > out breakages.
> > 
> > Is there a path to filter down sane behavior (whichever one it might be) to
> > the affected stable/LTS kernels?
> 
> What Michal said: replace 874bbfe6 with 176bed1d.  Without 22b886dd,
> 874bbfe6 is a landmine; it uses add_timer_on() as if it were mod_timer(),
> which it is not, or rather was not until 22b886dd came along, and it still
> does not look like the mod_timer() alias that add_timer() is.

BTW, with the 874bbfe6 22b886dd pair, mundane workqueue timers are no
longer deflected to housekeeper CPUs, so NO_HZ_FULL regresses.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-05 21:06                                       ` Tejun Heo
  2016-02-06 13:07                                         ` Henrique de Moraes Holschuh
@ 2016-02-09 15:31                                         ` Mike Galbraith
  2016-02-09 16:39                                           ` Linus Torvalds
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-09 15:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik, Greg Kroah-Hartman, Linus Torvalds

On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote:
> On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> > 
> > > What are you suggesting?
> > 
> > That 874bbfe6 should die.
> 
> Yeah, it's gonna be killed.  The commit is there because the behavior
> change broke things.  We don't want to guarantee it but have been and
> can't change it right away just because we don't like it when things
> may break from it.  The plan is to implement a debug option to force
> workqueue to always execute these work items on a foreign cpu to weed
> out breakages.

A niggling question remaining is when is it gonna be killed?

1. Meanwhile, 874bbfe6 was sent to 2.6.31+, meaning that every stable
tree where it landed which did not ALSO receive 22b886dd has become
destabilized.  We have two 3.12-stability reports (one the hotplug
explosion that you provided a workaround for, one the corruption) and
one corruption report for 3.18.  Both breakage types would be sort of
fixed up if 22b886dd and your hotplug workaround (which does _not_
guarantee survival) were applied everywhere, however...

2. We also have a report from the 3.18 corruption victim that adding
22b886dd did NOT restore the stable status quo; rather, it replaced the
corruption that 874bbfe6 caused with a performance regression.

3. 874bbfe6 + 22b886dd also inflicts a NO_HZ_FULL regression. 
 Admittedly not a huge deal, but another regression nonetheless.

The only evidence I've seen that anything at all was broken by the
changes that triggered the inception of 874bbfe6 in the first place was
the b0rked vmstat thing that Linus had already fixed with 176bed1d.  So
where is the breakage you mention that makes keeping 874bbfe6 the
prudent thing to do vs just reverting 874bbfe6 immediately, perhaps
22b886dd as well given it is fallout thereof, and getting that sent off
to stable?

It looks for all the world as if the sole excuse for either to exist is
to prevent any other stupid mistakes like the vmstat thing from being
exposed for what they are by actively hiding them, when in fact, that
hiding doesn't survive a hotplug event (as we saw in the crash analysis
I showed you).  Surely there's a better reason to keep that commit than
hiding bugs that can only remain hidden until they meet hotplug.  What
is it?

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 15:31                                         ` Mike Galbraith
@ 2016-02-09 16:39                                           ` Linus Torvalds
  2016-02-09 16:50                                             ` Tejun Heo
  0 siblings, 1 reply; 57+ messages in thread
From: Linus Torvalds @ 2016-02-09 16:39 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek,
	Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik, Greg Kroah-Hartman

On Tue, Feb 9, 2016 at 7:31 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote:
>> >
>> > That 874bbfe6 should die.
>>
>> Yeah, it's gonna be killed.  The commit is there because the behavior
>> change broke things.  We don't want to guarantee it but have been and
>> can't change it right away just because we don't like it when things
>> may break from it.  The plan is to implement a debug option to force
>> workqueue to always execute these work items on a foreign cpu to weed
>> out breakages.
>
> A niggling question remaining is when is it gonna be killed?

It probably should be killed sooner rather than later.

Just document that if you need something to run on a _particular_ cpu,
you need to use "schedule_delayed_work_on()" and "add_timer_on()".

The proper fix was 176bed1de5bf, and 874bbfe6 was just wrong.

                 Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 16:39                                           ` Linus Torvalds
@ 2016-02-09 16:50                                             ` Tejun Heo
  2016-02-09 17:04                                               ` Mike Galbraith
  2016-02-09 17:04                                               ` Linus Torvalds
  0 siblings, 2 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-09 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

Hello,

On Tue, Feb 09, 2016 at 08:39:15AM -0800, Linus Torvalds wrote:
> > A niggling question remaining is when is it gonna be killed?
> 
> It probably should be killed sooner rather than later.
> 
> Just document that if you need something to run on a _particular_ cpu,
> you need to use "schedule_delayed_work_on()" and "add_timer_on()".

I'll queue a patch to put unbound work items on foreign cpus (maybe
every Nth to reduce perf impact).  Wanted to align it to rc1 and then
let it get tested during the devel cycle but missed this window.  It's
a bit late in devel cycle but we can still do it in this cycle.

> The proper fix was 176bed1de5bf, and 874bbfe6 was just wrong.

idk, not doing so is likely to cause subtle bugs which are difficult
to track down.  The problem with -stable is 874bbfe6 being backported
without the matching timer fix.  The right thing to do now probably is
reverting 874bbfe6 for -stable kernels which don't get the timer fix.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 16:50                                             ` Tejun Heo
@ 2016-02-09 17:04                                               ` Mike Galbraith
  2016-02-09 17:54                                                 ` Tejun Heo
  2016-02-09 17:04                                               ` Linus Torvalds
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-09 17:04 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable,
	Daniel Bilik, Greg Kroah-Hartman

On Tue, 2016-02-09 at 11:50 -0500, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 09, 2016 at 08:39:15AM -0800, Linus Torvalds wrote:
> > > A niggling question remaining is when is it gonna be killed?
> > 
> > It probably should be killed sooner rather than later.
> > 
> > Just document that if you need something to run on a _particular_
> > cpu,
> > you need to use "schedule_delayed_work_on()" and "add_timer_on()".
> 
> I'll queue a patch to put unbound work items on foreign cpus (maybe
> every Nth to reduce perf impact).  Wanted to align it to rc1 and then
> let it get tested during the devel cycle but missed this window.  It's
> a bit late in devel cycle but we can still do it in this cycle.

Or do something like the below, and get guinea pigs for free. 

workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs

WORK_CPU_UNBOUND work items queued to a bound workqueue always run
locally.  This is a good thing normally, but not when the user has
asked us to keep unbound work away from certain CPUs.  Round robin
these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
trumps performance.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/workqueue.c |   27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -303,6 +303,9 @@ static bool workqueue_freezing;		/* PL:
 
 static cpumask_var_t wq_unbound_cpumask; /* PL: low level cpumask for all unbound wqs */
 
+/* CPU where WORK_CPU_UNBOUND work was last round robin scheduled from this CPU */
+static DEFINE_PER_CPU(unsigned int, wq_unbound_rr_cpu_last);
+
 /* the per-cpu worker pools */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
 				     cpu_worker_pools);
@@ -1298,6 +1301,28 @@ static bool is_chained_work(struct workq
 	return worker && worker->current_pwq->wq == wq;
 }
 
+/*
+ * When queueing WORK_CPU_UNBOUND work to a !WQ_UNBOUND queue, round
+ * robin among wq_unbound_cpumask to avoid perturbing sensitive tasks.
+ */
+static unsigned int select_round_robin_cpu(unsigned int cpu)
+{
+	int new_cpu;
+
+	if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
+		return cpu;
+	if (cpumask_empty(wq_unbound_cpumask))
+		return cpu;
+	new_cpu = __this_cpu_read(wq_unbound_rr_cpu_last);
+	new_cpu = cpumask_next_and(new_cpu, wq_unbound_cpumask, cpu_online_mask);
+	if (unlikely(new_cpu >= nr_cpu_ids))
+		new_cpu = cpumask_first_and(wq_unbound_cpumask, cpu_online_mask);
+	if (unlikely(WARN_ON_ONCE(new_cpu >= nr_cpu_ids)))
+		return cpu;
+	__this_cpu_write(wq_unbound_rr_cpu_last, new_cpu);
+	return new_cpu;
+}
+
 static void __queue_work(int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
@@ -1323,7 +1348,7 @@ static void __queue_work(int cpu, struct
 		return;
 retry:
 	if (req_cpu == WORK_CPU_UNBOUND)
-		cpu = raw_smp_processor_id();
+		cpu = select_round_robin_cpu(raw_smp_processor_id());
 
 	/* pwq which will be used unless @work is executing elsewhere */
 	if (!(wq->flags & WQ_UNBOUND))

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 16:50                                             ` Tejun Heo
  2016-02-09 17:04                                               ` Mike Galbraith
@ 2016-02-09 17:04                                               ` Linus Torvalds
  2016-02-09 17:51                                                 ` Tejun Heo
  1 sibling, 1 reply; 57+ messages in thread
From: Linus Torvalds @ 2016-02-09 17:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

On Tue, Feb 9, 2016 at 8:50 AM, Tejun Heo <tj@kernel.org> wrote:
>
> idk, not doing so is likely to cause subtle bugs which are difficult
> to track down.  The problem with -stable is 874bbfe6 being backported
> without the matching timer fix.

Well, according to this thread, even with the timer fix the end result
then shows odd problems, _and_ has a NO_HZ_FULL regression.

I do agree about subtle bugs, but we haven't actually seen any other
ones than the vmstat breakage so far.

Also, I suspect that to flush out any bugs, we might want to

 (a) actually dequeue timers and work queues that are bound to a
particular CPU when a CPU goes down.

     Sure, we *could* make it a rule that everybody who binds a timer
to a particular CPU should just register the cpu-down thing, but why
make a rule that you have to make extra work? People who do per-cpu
work should have a setup function for when a new CPU comes _up_, but
why make people do pointless extra crap for the cpu-down case when the
generic code could just do it for them.

 (b) maybe one of the test-bots could be encouraged to do a lot of cpu
offlining/onlining as a stress test.

That (a) part is important in that it avoids the subtle bug where some
timer or workqueue entry ends up being run on the wrong CPU after all,
just because the target CPU went down.

And the (b) part would hopefully flush out things that didn't start
things properly when a new cpu comes online.

Hmm? The above is obviously a longer-term thing and a bigger change,
but I think we should be able to just revert 874bbfe6 without anything
else going on, since I don't think we ever found anything else than
vmstat that had issues.

                    Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 17:04                                               ` Linus Torvalds
@ 2016-02-09 17:51                                                 ` Tejun Heo
  2016-02-09 18:06                                                   ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-02-09 17:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

Hello,

On Tue, Feb 09, 2016 at 09:04:18AM -0800, Linus Torvalds wrote:
> On Tue, Feb 9, 2016 at 8:50 AM, Tejun Heo <tj@kernel.org> wrote:
> > idk, not doing so is likely to cause subtle bugs which are difficult
> > to track down.  The problem with -stable is 874bbfe6 being backported
> > without the matching timer fix.
> 
> Well, according to this thread, even witht he timer fix the end result
> then shows odd problems, _and_ has a NO_HZ_FULL regression.

I don't know what that odd problem is indicating; it's likely we're
seeing another issue exposed by these changes, or a bug introduced
during the backport, but yeah, it's problematic.

> I do agree about subtle bugs, but we haven't actually seen any other
> ones than the vmstat breakage so far.

The thing with vmstat is that it's the work item most likely to expose
the issue, as it runs constantly on all systems, and we started seeing
it triggering soon after timer migration became more common.  I'd be
surprised if we don't discover a lot more, subtler ones down the road.
Maybe most of them won't trigger often enough to matter much, but it's
a bit scary.

> Also, I suspect that to flush out any bugs, we might want to
> 
>  (a) actually dequeue timers and work queues that are bound to a
> particular CPU when a CPU goes down.
> 
>      Sure, we *could* make it a rule that everybody who binds a timer
> to a particular CPU should just register the cpu-down thing, but why
> make a rule that you have to make extra work? People who do per-cpu
> work should have a setup function for when a new CPU comes _up_, but
> why make people do pointless extra crap for the cpu-down case when the
> generic code could just do it for them.

The same goes for work items and timers.  If we want to do explicit
dequeueing or flushing of cpu-bound stuff on cpu down, we'll have to
either dedicate the *_on() interfaces to correctness or introduce a
separate set of interfaces so that optimization and correctness usages
can be told apart.  The current situation is that work items which are
explicitly shut down on cpu-down are correctness usages while the ones
which are not are optimization usages.  I'll try to scan through the
usages and see what the actual proportions are like.  Maybe we can get
away with declaring that _on() usages are absolute.

>  (b) maybe one of the test-bots could be encouraged to do a lot of cpu
> offlining/onlining as a stress test.
> 
> That (a) part is important in that it avoids the subtle bug where some
> timer or workqueue entry ends up being run on the wrong CPU after all,
> just because the target CPU went down.
> 
> And the (b) part would hopefully flush out things that didn't start
> things properly when a new cpu comes online.
> 
> Hmm? The above is obviously a longer-term thing and a bigger change,
> but I think we should be able to just revert 874bbfe6 without anything
> else going on, since I don't think we ever found anything else than
> vmstat that had issues.

So, how about reverting 874bbfe6 and performing random foreign
queueing during -rc's for a couple cycles so that we can at least find
out the broken ones quickly in devel branch and backport fixes as
they're found?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 17:04                                               ` Mike Galbraith
@ 2016-02-09 17:54                                                 ` Tejun Heo
  2016-02-09 17:56                                                   ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Tejun Heo @ 2016-02-09 17:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

Hello, Mike.

On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote:
> workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs
> 
> WORK_CPU_UNBOUND work items queued to a bound workqueue always run
> locally.  This is a good thing normally, but not when the user has
> asked us to keep unbound work away from certain CPUs.  Round robin
> these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
> trumps performance.

I don't think doing this by default for everyone is a good idea.  A
lot of workqueue usages tend to touch whatever the scheduler was
touching after all.  Doing things per-cpu is generally a pretty good
thing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 17:54                                                 ` Tejun Heo
@ 2016-02-09 17:56                                                   ` Mike Galbraith
  2016-02-09 18:02                                                     ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-09 17:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

On Tue, 2016-02-09 at 12:54 -0500, Tejun Heo wrote:
> Hello, Mike.
> 
> On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote:
> > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask
> > CPUs
> > 
> > WORK_CPU_UNBOUND work items queued to a bound workqueue always run
> > locally.  This is a good thing normally, but not when the user has
> > asked us to keep unbound work away from certain CPUs.  Round robin
> > these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
> > trumps performance.
> 
> I don't think doing this by default for everyone is a good idea.  A
> lot of workqueue usages tend to touch whatever the scheduler was
> touching after all.  Doing things per-cpu is generally a pretty good
> thing.

It doesn't do anything unless the user twiddles the mask to exclude
certain (think no_hz_full) CPUs, so there are no clueless victims.

	-Mike

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 17:56                                                   ` Mike Galbraith
@ 2016-02-09 18:02                                                     ` Mike Galbraith
  2016-02-09 18:27                                                       ` Tejun Heo
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2016-02-09 18:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

On Tue, 2016-02-09 at 18:56 +0100, Mike Galbraith wrote:
> On Tue, 2016-02-09 at 12:54 -0500, Tejun Heo wrote:
> > Hello, Mike.
> > 
> > On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote:
> > > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask
> > > CPUs
> > > 
> > > WORK_CPU_UNBOUND work items queued to a bound workqueue always
> > > run
> > > locally.  This is a good thing normally, but not when the user
> > > has
> > > asked us to keep unbound work away from certain CPUs.  Round
> > > robin
> > > these to wq_unbound_cpumask CPUs instead, as perturbation
> > > avoidance
> > > trumps performance.
> > 
> > I don't think doing this by default for everyone is a good idea.  A
> > lot of workqueue usages tend to touch whatever the scheduler was
> > touching after all.  Doing things per-cpu is generally a pretty
> > good
> > thing.
> 
> It doesn't do anything unless the user twiddles the mask to exclude
> certain (think no_hz_full) CPUs, so there are no clueless victims.

(a plus: testers/robots can twiddle mask to help find bugs, _and_
nohz_full people can use it if they so choose)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 17:51                                                 ` Tejun Heo
@ 2016-02-09 18:06                                                   ` Linus Torvalds
  0 siblings, 0 replies; 57+ messages in thread
From: Linus Torvalds @ 2016-02-09 18:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

On Tue, Feb 9, 2016 at 9:51 AM, Tejun Heo <tj@kernel.org> wrote:
>>
>>  (a) actually dequeue timers and work queues that are bound to a
>> particular CPU when a CPU goes down.
>>
> The same goes for work items and timers.  If we want to do
> explicit dequeueing or flushing of cpu-bound stuff on cpu down, we'll
> have to either dedicate *_on() interfaces for correctness or introduce
> a separate set of interfaces to use for optimization and correctness.

We already do that. "add_timer_on()" for timers, and cpu !=
WORK_CPU_UNBOUND for work items.
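
As a minimal sketch of that distinction (illustration only; my_wq,
my_work, my_dwork and my_timer are made-up names, assumed to be
initialized elsewhere):

#include <linux/jiffies.h>
#include <linux/timer.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_wq;
static struct work_struct my_work;
static struct delayed_work my_dwork;
static struct timer_list my_timer;

static void queueing_examples(void)
{
	/* No CPU requested: queue_work() is
	 * queue_work_on(WORK_CPU_UNBOUND, ...), i.e. "run this
	 * somewhere reasonable". */
	queue_work(my_wq, &my_work);

	/* Explicit CPU requested through the *_on() interfaces; these
	 * are the requests that should be treated as absolute. */
	queue_work_on(3, my_wq, &my_work);
	queue_delayed_work_on(3, my_wq, &my_dwork, HZ);
	add_timer_on(&my_timer, 3);
}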

>    Maybe we can get away with
> declaring that _on() usages are absolute.

I really think that anything else would be odd as hell. If you asked
for a timer (or work) on a particular CPU, and you get it on another
one, that's a bug.

It's much better to just dequeue those entries and say "sorry, your
CPU went away".

Of course, we could play around with just running them early at CPU-down
time (and anybody trying to requeue would get an error because the CPU
is in the process of going down), but that sounds like more work for
any users, and like a much more fundamental difference. The "just
silently dequeue" makes more sense, and pairs well with anything that
sets things up on CPU-up time (which a percpu entity will have to do
anyway).
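
For a per-cpu user, that pairing might look something like the sketch
below (made-up driver code using the cpu notifier API of that era;
my_percpu_dwork and the callback are assumptions, initialized
elsewhere with INIT_DELAYED_WORK() and registered with
register_cpu_notifier()):

#include <linux/cpu.h>
#include <linux/jiffies.h>
#include <linux/notifier.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* Hypothetical per-cpu delayed work, one instance per possible CPU. */
static DEFINE_PER_CPU(struct delayed_work, my_percpu_dwork);

static int my_cpu_callback(struct notifier_block *nb, unsigned long action,
			   void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_ONLINE:
		/* Set up on CPU-up: (re)arm this CPU's work. */
		queue_delayed_work_on(cpu, system_wq,
				      &per_cpu(my_percpu_dwork, cpu), HZ);
		break;
	case CPU_DOWN_PREPARE:
		/* Dequeue on CPU-down instead of letting pending work
		 * silently end up on some other CPU. */
		cancel_delayed_work_sync(&per_cpu(my_percpu_dwork, cpu));
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block my_cpu_nb = {
	.notifier_call = my_cpu_callback,
};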

> So, how about reverting 874bbfe6 and performing random foreign
> queueing during -rc's for a couple of cycles so that we can at least
> find the broken ones quickly in the devel branch and backport fixes as
> they're found?

Yeah, that sounds good to me. Having some "cpu work/timer debug"
config option that ends up spreading out non-cpu-specific timers and
work in order to find bugs sounds like a good idea. And I don't think
it should be limited to -rc releases; I think lots of people might be
willing to run that (the same way we had people - and even
distributions - that did PAGEALLOC_DEBUG, which is a much bigger
hammer).

             Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Crashes with 874bbfe600a6 in 3.18.25
  2016-02-09 18:02                                                     ` Mike Galbraith
@ 2016-02-09 18:27                                                       ` Tejun Heo
  0 siblings, 0 replies; 57+ messages in thread
From: Tejun Heo @ 2016-02-09 18:27 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner,
	Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li,
	LKML, stable, Daniel Bilik, Greg Kroah-Hartman

Hello, Mike.

On Tue, Feb 09, 2016 at 07:02:35PM +0100, Mike Galbraith wrote:
> > It doesn't do anything unless the user twiddles the mask to exclude
> > certain (think nohz_full) CPUs, so there are no clueless victims.
> 
> (a plus: testers/robots can twiddle the mask to help find bugs, _and_
> nohz_full people can use it if they so choose)

Ah, yeah, it makes sense then.  I'm gonna play with it a bit and add a
debug option to always force the behavior.
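
For reference, such a force switch might look something like this on
the workqueue side (a sketch only, reusing the selection helper idea
from earlier in the thread; the knob name is an assumption, not
necessarily what ends up merged):

#include <linux/cpumask.h>
#include <linux/moduleparam.h>

/* Hypothetical debug knob: when set, WORK_CPU_UNBOUND work queued to
 * a per-cpu workqueue never stays local, so foreign queueing gets
 * exercised even on machines whose admins never touch the mask. */
static bool wq_debug_force_rr_cpu;
module_param_named(debug_force_rr_cpu, wq_debug_force_rr_cpu, bool, 0644);

static int wq_select_unbound_cpu(int cpu)
{
	int new_cpu;

	/* Stay local only when the knob is off and the local CPU is
	 * allowed by wq_unbound_cpumask. */
	if (!wq_debug_force_rr_cpu &&
	    cpumask_test_cpu(cpu, wq_unbound_cpumask))
		return cpu;

	/* Otherwise pick some other allowed online CPU (a real version
	 * would round robin as in the earlier sketch). */
	new_cpu = cpumask_any_and(wq_unbound_cpumask, cpu_online_mask);
	return new_cpu < nr_cpu_ids ? new_cpu : cpu;
}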

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2016-02-09 18:28 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-20 21:19 Crashes with 874bbfe600a6 in 3.18.25 Jan Kara
2016-01-20 21:39 ` Shaohua Li
2016-01-21  9:52   ` Jan Kara
2016-01-21 13:29     ` Sasha Levin
2016-01-22  1:10     ` Sasha Levin
2016-01-22 16:09       ` Tejun Heo
2016-01-23  2:20         ` Ben Hutchings
2016-01-23 16:11           ` Thomas Gleixner
2016-01-26  9:34             ` Jan Kara
2016-01-26  9:49               ` Thomas Gleixner
2016-01-26 11:14               ` Petr Mladek
2016-01-26 13:09                 ` Thomas Gleixner
2016-02-03  9:35                   ` Jiri Slaby
2016-02-03 10:41                     ` Thomas Gleixner
2016-02-03 12:28                     ` Michal Hocko
2016-02-03 16:24                       ` Tejun Heo
2016-02-03 16:48                         ` Michal Hocko
2016-02-03 16:59                           ` Tejun Heo
2016-02-04  6:37                             ` Michal Hocko
2016-02-04  7:40                               ` Michal Hocko
2016-02-03 17:01                         ` Mike Galbraith
2016-02-03 17:06                           ` Tejun Heo
2016-02-03 17:13                             ` Mike Galbraith
2016-02-03 17:15                               ` Tejun Heo
2016-02-04  2:00                             ` Mike Galbraith
2016-02-05 16:49                               ` Tejun Heo
2016-02-05 20:47                                 ` Mike Galbraith
2016-02-05 20:54                                   ` Tejun Heo
2016-02-05 20:59                                     ` Mike Galbraith
2016-02-05 21:06                                       ` Tejun Heo
2016-02-06 13:07                                         ` Henrique de Moraes Holschuh
2016-02-07  5:19                                           ` Mike Galbraith
2016-02-07  5:59                                             ` Mike Galbraith
2016-02-09 15:31                                         ` Mike Galbraith
2016-02-09 16:39                                           ` Linus Torvalds
2016-02-09 16:50                                             ` Tejun Heo
2016-02-09 17:04                                               ` Mike Galbraith
2016-02-09 17:54                                                 ` Tejun Heo
2016-02-09 17:56                                                   ` Mike Galbraith
2016-02-09 18:02                                                     ` Mike Galbraith
2016-02-09 18:27                                                       ` Tejun Heo
2016-02-09 17:04                                               ` Linus Torvalds
2016-02-09 17:51                                                 ` Tejun Heo
2016-02-09 18:06                                                   ` Linus Torvalds
2016-02-04 10:04                             ` Mike Galbraith
2016-02-04 10:46                               ` Thomas Gleixner
2016-02-04 11:07                                 ` Mike Galbraith
2016-02-04 11:20                                 ` Jan Kara
2016-02-04 16:39                                   ` Daniel Bilik
2016-02-05  2:40                                     ` Mike Galbraith
2016-02-05  8:11                                       ` Daniel Bilik
2016-02-05  8:33                                         ` Mike Galbraith
2016-02-03 18:46                         ` Thomas Gleixner
2016-02-03 19:01                           ` Tejun Heo
2016-02-03 19:05                             ` Thomas Gleixner
2016-02-03 19:15                               ` Tejun Heo
2016-02-05  5:44                         ` Mike Galbraith
