From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757685AbcBDLUg (ORCPT <rfc822;w@1wt.eu>);
	Thu, 4 Feb 2016 06:20:36 -0500
Received: from mx2.suse.de ([195.135.220.15]:57042 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1757244AbcBDLUc (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 4 Feb 2016 06:20:32 -0500
Date: Thu, 4 Feb 2016 12:20:44 +0100
From: Jan Kara <jack@suse.cz>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>, Tejun Heo <tj@kernel.org>,
        Michal Hocko <mhocko@kernel.org>, Jiri Slaby <jslaby@suse.cz>,
        Petr Mladek <pmladek@suse.com>, Jan Kara <jack@suse.cz>,
        Ben Hutchings <ben@decadent.org.uk>,
        Sasha Levin <sasha.levin@oracle.com>, Shaohua Li <shli@fb.com>,
        LKML <linux-kernel@vger.kernel.org>, stable@vger.kernel.org,
        Daniel Bilik <daniel.bilik@neosystem.cz>
Subject: Re: Crashes with 874bbfe600a6 in 3.18.25
Message-ID: <20160204112044.GE4956@quack.suse.cz>
References: <20160126093400.GV24938@quack.suse.cz>
 <20160126111438.GA731@pathway.suse.cz>
 <alpine.DEB.2.11.1601261352010.3886@nanos>
 <56B1C9E4.4020400@suse.cz>
 <20160203122855.GB6762@dhcp22.suse.cz>
 <20160203162441.GE14091@mtj.duckdns.org>
 <1454518913.6148.15.camel@gmail.com>
 <20160203170652.GI14091@mtj.duckdns.org>
 <1454580263.3407.114.camel@gmail.com>
 <alpine.DEB.2.11.1602041130380.25254@nanos>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.11.1602041130380.25254@nanos>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 04-02-16 11:46:47, Thomas Gleixner wrote:
> On Thu, 4 Feb 2016, Mike Galbraith wrote:
> > On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> > > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > > > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > > > doesn't come back for an eternity or two?
> > > 
> > > Right now, it just loses affinity....
> > 
> > WRT affinity...
> > 
> > Somebody somewhere queues a delayed work, a timer is started on CPUX,
> > work is targeted at CPUX.  Now wash/rinse/repeat mod_delayed_work()
> > along with migrations.  Should __queue_delayed_work() not refrain from
> > altering dwork->cpu once set?
> > 
> > I'm also wondering why 22b886dd only applies to kernels >= 4.2.
> > 
> > <quote>
> > Regardless of the previous CPU a timer was on, add_timer_on()
> > currently simply sets timer->flags to the new CPU.  As the caller must
> > be seeing the timer as idle, this is locally fine, but the timer
> > leaving the old base while unlocked can lead to race conditions as
> > follows.
> > 
> > Let's say timer was on cpu 0.
> > 
> >   cpu 0                                 cpu 1
> >   -----------------------------------------------------------------------------
> >   del_timer(timer) succeeds
> >                                         del_timer(timer)
> >                                           lock_timer_base(timer) locks cpu_0_base
> >   add_timer_on(timer, 1)
> >     spin_lock(&cpu_1_base->lock)
> >     timer->flags set to cpu_1_base
> >     operates on @timer                    operates on @timer
> > </quote>
> > 
> > What's the difference between...
> >      timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu;
> > and...
> >      timer_set_base(timer, base);
> > 
> > ...that makes that fix unneeded prior to 4.2?  We take the same locks
> > in < 4.2 kernels, so seemingly both will diddle concurrently above.
> 
> Indeed, you are right.
> 
> The same can happen on pre 4.2, just the fix does not apply as we changed the
> internals how the base is managed in the timer itself. Backport below.

Thanks for backport Thomas and to Mike for persistence :). I've asked my
friend seeing crashes with 3.18.25 to try whether this patch fixes the
issues. It may take some time so stay tuned...

								Honza

> 8<----------------------------
> 
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -956,13 +956,26 @@ EXPORT_SYMBOL(add_timer);
>   */
>  void add_timer_on(struct timer_list *timer, int cpu)
>  {
> -	struct tvec_base *base = per_cpu(tvec_bases, cpu);
> +	struct tvec_base *new_base = per_cpu(tvec_bases, cpu);
> +	struct tvec_base *base;
>  	unsigned long flags;
>  
>  	timer_stats_timer_set_start_info(timer);
>  	BUG_ON(timer_pending(timer) || !timer->function);
> -	spin_lock_irqsave(&base->lock, flags);
> -	timer_set_base(timer, base);
> +
> +	/*
> +	 * If @timer was on a different CPU, it must be migrated with the
> +	 * old base locked to prevent other operations proceeding with the
> +	 * wrong base locked.  See lock_timer_base().
> +	 */
> +	base = lock_timer_base(timer, &flags);
> +	if (base != new_base) {
> +		timer_set_base(timer, NULL);
> +		spin_unlock(&base->lock);
> +		base = new_base;
> +		spin_lock(&base->lock);
> +		timer_set_base(timer, base);
> +	}
>  	debug_activate(timer, timer->expires);
>  	internal_add_timer(base, timer);
>  	spin_unlock_irqrestore(&base->lock, flags);
> 
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR