From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753172AbbGBHRX (ORCPT ); Thu, 2 Jul 2015 03:17:23 -0400
Received: from mga03.intel.com ([134.134.136.65]:54965 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753145AbbGBHRG (ORCPT ); Thu, 2 Jul 2015 03:17:06 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.15,390,1432623600"; d="scan'208";a="598723881"
Date: Thu, 2 Jul 2015 07:25:11 +0800
From: Yuyang Du
To: Peter Zijlstra
Cc: Rabin Vincent, Mike Galbraith, "mingo@redhat.com", "linux-kernel@vger.kernel.org", Paul Turner, Ben Segall, Morten Rasmussen
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150701232511.GA5197@intel.com>
References: <20150630143057.GA31689@axis.com> <1435728995.9397.7.camel@gmail.com> <20150701145551.GA15690@axis.com> <20150701204404.GH25159@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150701204404.GH25159@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

We have two basic kinds of load balancing: idle balancing and periodic
balancing.

Periodic balancing needs a tick before it can happen again, so the
livelock cannot come from it.

Idle balancing, on the contrary, occurs as needed at arbitrary times, so
the livelock could happen there.

And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
tasks from the other, makes the other idle, and this iterates...

That being said, it is also obvious how to prevent the livelock from
happening: idle-pull only until the source rq's nr_running is 1, because
otherwise we just avoid idleness here by creating idleness there.

I hope the patch at the end works.
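
For illustration only, the ping-pong described above can be modeled in
user space. The sketch below is not kernel code: struct rq_model,
idle_pull_one() and count_migrations() are hypothetical names made up
for this example, and a run queue is reduced to its task count. The
guard flag mimics the nr_running <= 1 check proposed in the patch at
the end.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a run queue: only the task count is modeled. */
struct rq_model {
	unsigned int nr_running;
};

/*
 * Model one newly-idle balance pass: the idle dst pulls one task from src.
 * With guard set, dst refuses to take src's last runnable task, which is
 * the behavior the nr_running <= 1 check is meant to give.
 * Returns true if a task was moved.
 */
static bool idle_pull_one(struct rq_model *dst, struct rq_model *src, bool guard)
{
	assert(dst->nr_running == 0);	/* only an idle CPU idle-balances */

	if (src->nr_running == 0)
		return false;		/* nothing to pull */
	if (guard && src->nr_running <= 1)
		return false;		/* pulling would only move the idleness */

	src->nr_running--;
	dst->nr_running++;
	return true;
}

/*
 * Two CPUs idle-balance against each other, starting with nr_tasks tasks
 * on CPU 0 and none on CPU 1. Counts migrations over at most max_rounds
 * rounds; stops early once no CPU is idle or a pull is refused.
 */
static unsigned int count_migrations(bool guard, unsigned int nr_tasks,
				     unsigned int max_rounds)
{
	struct rq_model cpu[2] = { { .nr_running = nr_tasks }, { .nr_running = 0 } };
	unsigned int moved = 0;

	for (unsigned int i = 0; i < max_rounds; i++) {
		int dst = cpu[0].nr_running == 0 ? 0 : 1;
		int src = 1 - dst;

		if (cpu[dst].nr_running != 0)
			break;		/* nobody is idle, nothing to balance */
		if (!idle_pull_one(&cpu[dst], &cpu[src], guard))
			break;		/* balancing has settled */
		moved++;
	}
	return moved;
}
```

Under these assumptions, count_migrations(false, 1, 100) moves the
single task on every round, count_migrations(true, 1, 100) never moves
it, and count_migrations(true, 2, 100) moves exactly one task before
both CPUs are busy.
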
On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> > PID: 413 TASK: 8edda408 CPU: 1 COMMAND: "rngd"
> > task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
> > SE: 8edda450 load_avg_contrib: 0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
> > SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
> > SE: 8f531f80 load_avg_contrib: 0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> > SE: 8f456e00 load_avg_contrib: 118 load.weight: 911 PARENT: 00000000 GROUPNAME: system.slice
>
> Firstly, a group (parent) load_avg_contrib should never be less than
> that of its constituent parts, therefore the top 3 SEs should have at
> least 118 too.

I think reading downward goes toward the parent, so in this case "parent
is bigger than child" is OK.

But if the child is the only contributor to the parent's load (which is
probably the case here), then load_avg_contrib should not jump suddenly
from 0 to 118.

So this might be due to __update_group_entity_contrib(). I have not
looked at this complex function in detail, but a glance suggests it is
at least inconsistent, if not arbitrary, so it likely will not satisfy
"parent is bigger than child".

But this patch series
http://comments.gmane.org/gmane.linux.kernel/1981970
should be very consistent in computing every SE's load average, and thus
satisfy the universal truth...

> Now its been a while since I looked at the per entity load tracking
> stuff so some of the details have left me, but while it looks like we
> add the se->avg.load_avg_contrib to its cfs->runnable_load, we do not
> propagate that into the corresponding (group) se.
>
> This means the se->avg.load_avg_contrib is accounted per cpu without
> migration benefits. So if our task just got migrated onto a cpu that
> hasn't ran the group in a while, the group will not have accumulated
> runtime.

Yes, this is true: the migration has nothing to do with either the
source or the destination group SEs.

Down this road, if we want the addition and subtraction after migration,
we need to do it bottom-up along the SEs, and do it without rq->lock (?).
I need to think about it for a while.

Thanks,
Yuyang

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..f7cc1ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
 		return 0;
 
 	while (!list_empty(tasks)) {
+
+		if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
+			break;
+
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
 		env->loop++;