From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753237AbbGBLiI (ORCPT <rfc822;w@1wt.eu>);
	Thu, 2 Jul 2015 07:38:08 -0400
Received: from foss.arm.com ([217.140.101.70]:49185 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753415AbbGBLiF (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 2 Jul 2015 07:38:05 -0400
Date: Thu, 2 Jul 2015 12:40:32 +0100
From: Morten Rasmussen <morten.rasmussen@arm.com>
To: Yuyang Du <yuyang.du@intel.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Rabin Vincent <rabin.vincent@axis.com>,
        "mingo@redhat.com" <mingo@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150702114032.GA7598@e105550-lin.cambridge.arm.com>
References: <20150630143057.GA31689@axis.com>
 <1435728995.9397.7.camel@gmail.com>
 <20150701145551.GA15690@axis.com>
 <20150701204404.GH25159@twins.programming.kicks-ass.net>
 <20150701232511.GA5197@intel.com>
 <1435824347.5351.18.camel@gmail.com>
 <20150702010539.GB5197@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150702010539.GB5197@intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jul 02, 2015 at 09:05:39AM +0800, Yuyang Du wrote:
> Hi Mike,
> 
> On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> > On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> > 
> > > That being said, it is also obvious to prevent the livelock from happening:
> > > idle pulling until the source rq's nr_running is 1, becuase otherwise we
> > > just avoid idleness by making another idleness.
> > 
> > Yeah, but that's just the symptom, not the disease.  Better for the idle
> > balance symptom may actually be to only pull one when idle balancing.
> > After all, the immediate goal is to find something better to do than
> > idle, not to achieve continual perfect (is the enemy of good) balance.
> > 
> Symptom? :)
> 
> You mean "pull one and stop, can't be greedy"? Right, but still need to
> assure you don't make another idle CPU (meaning until nr_running == 1), which
> is the cure to disease.
> 
> I am ok with at most "pull one", but probably we stick to the load_balance()
> by pulling an fair amount, assuming load_balance() magically computes the
> right imbalance, otherwise you may have to do multiple "pull one"s.

Talking about the disease and looking at the debug data that Rabin has
provided I think the problem is due to the way blocked load is handled
(or not handled) in calculate_imbalance().

We have three entities in the root cfs_rq on cpu1:

1. Task entity pid 7, load_avg_contrib = 5.
2. Task entity pid 30, load_avg_contrib = 10.
3. Group entity, load_avg_contrib = 118, but contains task entity pid
   413 further down the hierarchy with task_h_load() = 0. The 118 comes
   from the blocked load contribution in the system.slice task group.

calculate_imbalance() figures out the average loads are:

cpu0: load/capacity = 0*1024/1024 = 0
cpu1: load/capacity = (5 + 10 + 118)*1024/1024 = 133

domain: load/capacity = (0 + 133)*1024/(2*1024) = 62

env->imbalance = 62

Rabin reported env->imbalance = 60 after pulling the rcu task with
load_avg_contrib = 5. It doesn't match my numbers exactly, but it pretty
close ;-)

detach_tasks() will attempts to pull 62 based on tasks task_h_load() but
the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
empty the src_rq.

IOW, since task groups include blocked load in the load_avg_contrib (see
__update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
imbalance includes blocked load and hence env->imbalance >=
sum(task_h_load(p)) for all tasks p on the rq. Which leads to
detach_tasks() emptying the rq completely in the reported scenario where
blocked load > runnable load.

Whether emptying the src_rq is the right thing to do depends on on your
point of view. Does balanced load (runnable+blocked) take priority over
keeping cpus busy or not? For idle_balance() it seems intuitively
correct to not empty the rq and hence you could consider env->imbalance
to be too big.

I think we will see more of this kind of problems if we include
weighted_cpuload() as well. Parts of the imbalance calculation code is
quite old and could use some attention first.

A short term fix could be what Yuyang propose, stop pulling tasks when
there is only one left in detach_tasks(). It won't affect active load
balance where we may want to migrate the last task as it active load
balance doesn't use detach_tasks().

Morten