From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1757008AbcBIPbe (ORCPT ); Tue, 9 Feb 2016 10:31:34 -0500
Received: from mail-wm0-f47.google.com ([74.125.82.47]:38003 "EHLO
        mail-wm0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1755356AbcBIPbb (ORCPT );
        Tue, 9 Feb 2016 10:31:31 -0500
Message-ID: <1455031885.3807.74.camel@gmail.com>
Subject: Re: Crashes with 874bbfe600a6 in 3.18.25
From: Mike Galbraith
To: Tejun Heo
Cc: Michal Hocko , Jiri Slaby , Thomas Gleixner , Petr Mladek ,
        Jan Kara , Ben Hutchings , Sasha Levin , Shaohua Li , LKML ,
        stable@vger.kernel.org, Daniel Bilik , Greg Kroah-Hartman ,
        Linus Torvalds
Date: Tue, 09 Feb 2016 16:31:25 +0100
In-Reply-To: <20160205210606.GH4401@htj.duckdns.org>
References: <56B1C9E4.4020400@suse.cz>
        <20160203122855.GB6762@dhcp22.suse.cz>
        <20160203162441.GE14091@mtj.duckdns.org>
        <1454518913.6148.15.camel@gmail.com>
        <20160203170652.GI14091@mtj.duckdns.org>
        <1454551217.3677.27.camel@gmail.com>
        <20160205164923.GC4401@htj.duckdns.org>
        <1454705231.3819.151.camel@gmail.com>
        <20160205205456.GG4401@htj.duckdns.org>
        <1454705989.3819.158.camel@gmail.com>
        <20160205210606.GH4401@htj.duckdns.org>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.16.5
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote:
> On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> >
> > > What are you suggesting?
> >
> > That 874bbfe6 should die.
>
> Yeah, it's gonna be killed.  The commit is there because the behavior
> change broke things.  We don't want to guarantee it but have been and
> can't change it right away just because we don't like it when things
> may break from it.
> The plan is to implement a debug option to force
> workqueue to always execute these work items on a foreign cpu to weed
> out breakages.

A niggling question remaining is when is it gonna be killed?

1. Meanwhile, 874bbfe6 was sent to 2.6.31+, meaning that every stable
tree where it landed which did not ALSO receive 22b886dd has become
destabilized.  We have two 3.12-stability reports, one the hotplug
explosion that you provided a workaround for, one the corruption, and
one corruption report for 3.18.  Both breakage types would be sort of
fixed up if 22b886dd and your hotplug workaround (which does _not_
guarantee survival) were applied everywhere, however...

2. We also have a report from the 3.18 corruption victim that adding
22b886dd did NOT restore the stable status quo, rather it replaced the
corruption that 874bbfe6 caused with a performance regression.

3. 874bbfe6 + 22b886dd also inflicts a NO_HZ_FULL regression.
Admittedly not a huge deal, but another regression nonetheless.

The only evidence I've seen that anything at all was broken by the
changes that triggered the inception of 874bbfe6 in the first place was
the b0rked vmstat thing that Linus had already fixed with 176bed1d.

So where is the breakage you mention that makes keeping 874bbfe6 the
prudent thing to do vs just reverting 874bbfe6 immediately, perhaps
22b886dd as well given it is fallout thereof, and getting that sent off
to stable?

It looks for all the world as if the sole excuse for either to exist is
to prevent any other stupid mistakes like the vmstat thing from being
exposed for what they are by actively hiding them, when in fact, that
hiding doesn't survive a hotplug event (as we saw in the crash analysis
I showed you).

Surely there's a better reason to keep that commit than hiding bugs
that can only remain hidden until they meet hotplug.  What is it?

	-Mike