From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753172AbbGBHRX (ORCPT <rfc822;w@1wt.eu>);
	Thu, 2 Jul 2015 03:17:23 -0400
Received: from mga03.intel.com ([134.134.136.65]:54965 "EHLO mga03.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753145AbbGBHRG (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 2 Jul 2015 03:17:06 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.15,390,1432623600"; 
   d="scan'208";a="598723881"
Date: Thu, 2 Jul 2015 07:25:11 +0800
From: Yuyang Du <yuyang.du@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Rabin Vincent <rabin.vincent@axis.com>,
        Mike Galbraith <umgwanakikbuti@gmail.com>,
        "mingo@redhat.com" <mingo@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>,
        Morten Rasmussen <morten.rasmussen@arm.com>
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150701232511.GA5197@intel.com>
References: <20150630143057.GA31689@axis.com>
 <1435728995.9397.7.camel@gmail.com>
 <20150701145551.GA15690@axis.com>
 <20150701204404.GH25159@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150701204404.GH25159@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

We have two basic kinds of load balancing: idle balancing and periodic balancing.

It takes a tick for periodic balancing to happen again, so the livelock
cannot come from it.

In contrast, idle balancing occurs as needed, at arbitrary times,
so it can livelock.

And obviously, the idle balancing livelock is to be expected: one CPU pulls
tasks from the other, which makes the other idle, and this repeats...

That being said, it is also obvious how to prevent the livelock: idle
balancing should pull only until the source rq's nr_running is 1, because
otherwise we just avoid idleness on one CPU by creating it on another.

I hope the patch at the end works.

On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> >  PID: 413    TASK: 8edda408  CPU: 1   COMMAND: "rngd"
> >   task_h_load():     0 [ = (load_avg_contrib {    0} * cfs_rq->h_load {    0}) / (cfs_rq->runnable_load_avg {    0} + 1) ]
> >   SE: 8edda450 load_avg_contrib:     0 load.weight:  1024 PARENT: 8fffbd00 GROUPNAME: (null)
> >   SE: 8fffbd00 load_avg_contrib:     0 load.weight:     2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
> >   SE: 8f531f80 load_avg_contrib:     0 load.weight:  1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> >   SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT: 00000000 GROUPNAME: system.slice
> 
> Firstly, a group (parent) load_avg_contrib should never be less than
> that of its constituent parts, therefore the top 3 SEs should have at
> least 118 too.

I think the listing goes downward from child to parent, so in this case
"parent is bigger than child" holds.

But if the child is the only contributor to the parent's load (which is probably
the case here), then load_avg_contrib should not jump suddenly from 0 to 118.

So this might be due to __update_group_entity_contrib(). I did not look into this
complex function in detail, but a glance suggests it is at least inconsistent,
if not arbitrary, so it likely will not satisfy "parent is bigger than child".

But this patch series http://comments.gmane.org/gmane.linux.kernel/1981970 should
be very consistent in computing every SE's load average, and thus satisfy the
universal truth...

> Now its been a while since I looked at the per entity load tracking
> stuff so some of the details have left me, but while it looks like we
> add the se->avg.load_avg_contrib to its cfs->runnable_load, we do not
> propagate that into the corresponding (group) se.
> 
> This means the se->avg.load_avg_contrib is accounted per cpu without
> migration benefits. So if our task just got migrated onto a cpu that
> hasn't ran the group in a while, the group will not have accumulated
> runtime.

Yes, this is true: the migration has nothing to do with either the source
or the destination group SEs. Down this road, if we want the addition and
subtraction after migration, we need to do it bottom-up along the SE
hierarchy, and do it without rq->lock (?). I need to think about it for a while.

Thanks,
Yuyang
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..f7cc1ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
 		return 0;
 
 	while (!list_empty(tasks)) {
+
+		if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
+			break;
+
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
 		env->loop++;