From: Rabin Vincent <rabin.vincent@axis.com>
To: mingo@redhat.com, peterz@infradead.org
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Date: Tue, 30 Jun 2015 16:30:58 +0200
Message-ID: <20150630143057.GA31689@axis.com>

Hi,

We're seeing a livelock where two CPUs both loop with interrupts
disabled in pick_next_task_fair() / idle_balance() and continuously
fetch all tasks from each other.  This has been observed on a 3.18
kernel but the relevant code paths appear to be the same in current
mainline.

The problem situation appears to be this:

cpu0                                        cpu1


pick_next_task
 take rq0->lock
 pick_next_task_fair
  running=0
  idle_balance()
   drop rq0->lock


                                        wakeup A,B on cpu0


                                        pick_next_task
                                         take rq1->lock
                                         pick_next_task_fair
                                          running=0
                                          idle_balance()
                                           drop rq1->lock
                                           load_balance()
                                           busiest=rq0
                                           take rq0->lock
                                           detach A,B
                                           drop rq0->lock
                                           take rq1->lock
                                           attach A,B
                                           pulled_task = 2
                                           drop rq1->lock

   load_balance()
   busiest=rq1
   take rq1->lock
   detach A,B
   drop rq1->lock
   take rq0->lock
   attach A,B
   pulled_task = 2
   drop rq0->lock

                                          running=0
                                          idle_balance()
                                          busiest=rq0, pull A,B, etc.


  running = 0
  load_balance()
  busiest=rq1, pull A,B, etc

And this goes on, with interrupts disabled on both CPUs, for at least
100 ms (which is when a hardware watchdog hits).

The conditions needed, apart from the right timing, are:

 - cgroups.  One of the tasks, say A, needs to be in a CPU cgroup.  When
   the problem occurs, A's ->se has zero load_avg_contrib and
   task_h_load(A) is zero.  However, A's se->parent->parent has a
   (relatively) high load_avg_contrib, so cpu0's cfs_rq has a relatively
   high runnable_load_avg.  find_busiest_group() therefore detects an
   imbalance, and detach_tasks() detaches all tasks.

 - PREEMPT=n.  Otherwise, the code under #ifdef in detach_tasks() would
   ensure that we'd only ever pull a maximum of one task during idle
   balancing.

 - cpu0 and cpu1 are threads on the same core (cpus_share_cache() ==
   true).  Otherwise, cpu1 will not be able to wake up tasks on cpu0
   while cpu0 has interrupts disabled (since an IPI would be required).
   Turning off the default TTWU_QUEUE feature would have the same
   effect.

I see two simple ways to prevent the livelock.  One is simply to remove
the #ifdef:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..74c94dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5937,7 +5937,6 @@ static int detach_tasks(struct lb_env *env)
 		detached++;
 		env->imbalance -= load;
 
-#ifdef CONFIG_PREEMPT
 		/*
 		 * NEWIDLE balancing is a source of latency, so preemptible
 		 * kernels will stop after the first task is detached to minimize
@@ -5945,7 +5944,6 @@ static int detach_tasks(struct lb_env *env)
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
 			break;
-#endif
 
 		/*
 		 * We only want to steal up to the prescribed amount of

Or this should work too:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..13358cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,6 +7487,8 @@ static int idle_balance(struct rq *this_rq)
 	 */
 	if (this_rq->cfs.h_nr_running && !pulled_task)
 		pulled_task = 1;
+	else if (!this_rq->cfs.h_nr_running && pulled_task)
+		pulled_task = 0;
 
 out:
 	/* Move the next balance forward */

But it seems that the real problem is that detach_tasks() can remove
all tasks when cgroups are involved, as described above?

Thanks.

/Rabin


Thread overview: 31+ messages
2015-06-30 14:30 Rabin Vincent [this message]
2015-07-01  5:36 ` [PATCH?] Livelock in pick_next_task_fair() / idle_balance() Mike Galbraith
2015-07-01 14:55   ` Rabin Vincent
2015-07-01 15:47     ` Mike Galbraith
2015-07-01 20:44     ` Peter Zijlstra
2015-07-01 23:25       ` Yuyang Du
2015-07-02  8:05         ` Mike Galbraith
2015-07-02  1:05           ` Yuyang Du
2015-07-02 10:25             ` Mike Galbraith
2015-07-02 11:40             ` Morten Rasmussen
2015-07-02 19:37               ` Yuyang Du
2015-07-03  9:34                 ` Morten Rasmussen
2015-07-03 16:38                   ` Peter Zijlstra
2015-07-05 22:31                     ` Yuyang Du
2015-07-09 14:32                       ` Morten Rasmussen
2015-07-09 23:24                         ` Yuyang Du
2015-07-05 20:12                   ` Yuyang Du
2015-07-06 17:36                     ` Dietmar Eggemann
2015-07-07 11:17                       ` Rabin Vincent
2015-07-13 17:43                         ` Dietmar Eggemann
2015-07-09 13:53                     ` Morten Rasmussen
2015-07-09 22:34                       ` Yuyang Du
2015-07-02 10:53         ` Peter Zijlstra
2015-07-02 11:44           ` Morten Rasmussen
2015-07-02 18:42             ` Yuyang Du
2015-07-03  4:42               ` Mike Galbraith
2015-07-03 16:39         ` Peter Zijlstra
2015-07-05 22:11           ` Yuyang Du
2015-07-09  6:15             ` Stefan Ekenberg
2015-07-26 18:57             ` Yuyang Du
2015-08-03 17:05             ` [tip:sched/core] sched/fair: Avoid pulling all tasks in idle balancing tip-bot for Yuyang Du
