Message-ID: <1454551217.3677.27.camel@gmail.com>
Subject: Re: Crashes with 874bbfe600a6 in 3.18.25
From: Mike Galbraith
To: Tejun Heo
Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara,
    Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable@vger.kernel.org,
    Daniel Bilik
Date: Thu, 04 Feb 2016 03:00:17 +0100
In-Reply-To: <20160203170652.GI14091@mtj.duckdns.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU?  What happens if it
> > doesn't come back for an eternity or two?
>
> Right now, it just loses affinity.  A more interesting case is a cpu
> going offline while work items bound to the cpu are still running and
> the root problem is that we've never distinguished between affinity
> for correctness and optimization and thus can't flush or warn on the
> stragglers.  The plan is to ensure that all correctness users specify
> the CPU explicitly.
> Once we're there, we can warn on illegal usages.

Isn't it the case that, currently at least, each and every spot that
requires execution on a specific CPU yet does not take active measures
to deal with hotplug events is in fact buggy?  The timer code clearly
states that the user is responsible, and so do both workqueue.[ch].

I was surprised to hear that some think they have an ironclad
guarantee, given the null and void clause is prominently displayed.

	-Mike