Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: <mingo@redhat.com>, <juri.lelli@redhat.com>,
	<vincent.guittot@linaro.org>, <dietmar.eggemann@arm.com>,
	<rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>,
	<bristot@redhat.com>, <vschneid@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kprateek.nayak@amd.com>,
	<wuyun.abel@bytedance.com>, <tglx@linutronix.de>, <efault@gmx.de>,
	<tim.c.chen@intel.com>, <yu.c.chen.y@gmail.com>
Subject: Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
Date: Tue, 7 May 2024 23:15:58 +0800
Message-ID: <ZjpFruUiBiNi6VSO@chenyu5-mobl2>
In-Reply-To: <20240405110010.934104715@infradead.org>

On 2024-04-05 at 12:28:04 +0200, Peter Zijlstra wrote:
> Allow applications to directly set a suggested request/slice length using
> sched_attr::sched_runtime.
> 
> The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
> 
> Applications should strive to use their periodic runtime at a high
> confidence interval (95%+) as the target slice. Using a smaller slice
> will introduce undue preemptions, while using a larger value will
> increase latency.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
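
For reference, a minimal user-space sketch of how an application might pass
such a slice suggestion. It assumes a kernel with this series applied; on
other kernels sched_runtime may simply be ignored for SCHED_OTHER tasks. The
struct name slice_attr and the helpers clamp_slice_ns() and request_slice()
are hypothetical names used for illustration only; the sched_setattr() system
call and the 48-byte first-version layout of struct sched_attr come from the
existing kernel ABI. The clamp helper merely mirrors the 0.1 ms to 100 ms
range quoted above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SLICE_MIN_NS 100000ULL		/* 0.1 ms, lower bound quoted above */
#define SLICE_MAX_NS 100000000ULL	/* 100 ms, upper bound quoted above */

/*
 * Local copy of the first-version sched_attr layout. It is given a
 * different (hypothetical) name to avoid clashing with libc or kernel
 * headers that may already define struct sched_attr.
 */
struct slice_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* carries the slice suggestion, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static_assert(sizeof(struct slice_attr) == 48,
	      "must match SCHED_ATTR_SIZE_VER0");

/* Illustration only: the value a request would be clamped to. */
uint64_t clamp_slice_ns(uint64_t ns)
{
	if (ns < SLICE_MIN_NS)
		return SLICE_MIN_NS;
	if (ns > SLICE_MAX_NS)
		return SLICE_MAX_NS;
	return ns;
}

/*
 * Ask for a slice of slice_ns nanoseconds for the calling thread.
 * Note: sched_nice is passed as 0, so this also requests nice 0.
 */
int request_slice(uint64_t slice_ns)
{
	struct slice_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;		/* SCHED_OTHER */
	attr.sched_nice = 0;
	attr.sched_runtime = slice_ns;

	/* pid 0 means the calling thread; flags must be 0 */
	return (int)syscall(SYS_sched_setattr, 0, &attr, 0);
}
```

request_slice(100000) would ask for a 0.1 ms slice; the call returns 0 on
success and -1 with errno set on failure.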

Is it possible to leverage this task slice to do better task wakeup placement?
The idea is that the smaller the wakee's slice, the fewer idle CPUs should be
scanned. This can reduce wakeup latency and inhibit costly task migration,
especially on large systems.

We ran some experiments with netperf and saw performance improvements:


From 9cb806476586d7048fcbd0f66d0101f0dbb8fd2b Mon Sep 17 00:00:00 2001
From: Chen Yu <yu.c.chen@intel.com>
Date: Tue, 7 May 2024 22:36:29 +0800
Subject: [RFC PATCH] sched/eevdf: Use customized slice to reduce wakeup latency
 and inhibit task migration

Problem 1:
The overhead of task migration is high on many-core systems. Migration
incurs a performance penalty due to broken cache locality and higher
cache-to-cache latency.

Problem 2:
During wakeup, searching for an idle CPU is costly on many-core systems.
In addition, accessing other CPUs' rq statistics causes cache contention:

available_idle_cpu(cpu) -> idle_cpu(cpu) -> {rq->curr, rq->nr_running}

Although SIS_UTIL throttles the scan depth based on system utilization,
there is a need to further limit the scan depth for specific workloads,
especially for short-duration wakees.

Now we have an interface to customize the request/slice. The smaller the
slice, the earlier the task can be picked, and the lower the wakeup latency
the task expects. Leverage the wakee's slice to further throttle the idle
CPU scan depth: the shorter the slice, the fewer CPUs to scan.
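
As a worked illustration of the proportional scaling, here is a small
user-space model. It is only a sketch: scale_scan_depth() is a hypothetical
name, it omits the feature-flag and custom-slice checks, and it is not the
kernel function itself. The 3 msec base slice and 0.1 msec custom slice used
in the note after it are the values from the test setup described in this
message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: shrink the scan depth nr by the ratio of the
 * wakee's slice to the global base slice. A slice at or above the
 * base slice leaves nr unchanged.
 */
int scale_scan_depth(int nr, uint64_t slice_ns, uint64_t base_slice_ns)
{
	assert(base_slice_ns > 0);

	if (nr <= 0 || slice_ns >= base_slice_ns)
		return nr;

	/* integer division truncates, so small depths can reach 0 */
	return (int)((uint64_t)nr * slice_ns / base_slice_ns);
}
```

With a 3 msec base slice and a 0.1 msec custom slice the ratio is 1/30, so a
depth of 60 shrinks to 2, a depth of 30 shrinks to 1, and any depth below 30
truncates to 0. In the patch, a result of 0 or less makes select_idle_cpu()
return -1 without scanning.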

Tested on a 2-socket system with 240 CPUs. With SNC (sub-NUMA cluster)
enabled, each LLC domain has 60 CPUs. There is a noticeable improvement in
netperf. (With SNC disabled, more improvement should be seen because the
cache-to-cache (C2C) latency is higher.)

The global slice (sysctl_sched_base_slice) is 3 msec by default on my Ubuntu
22.04, and the customized slice is set to 0.1 msec for both netperf and netserver:

for i in $(seq 1 $job); do
	netperf_slice -e 100000 -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 &
done

case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	60-threads	 1.00 (  1.60)	 +0.35 (  1.73)
TCP_RR          	120-threads	 1.00 (  1.34)	 -0.96 (  1.24)
TCP_RR          	180-threads	 1.00 (  1.59)	+92.20 (  4.24)
TCP_RR          	240-threads	 1.00 (  9.71)	+43.11 (  2.97)

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 23 ++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edc23f6588a3..f269ae7d6e24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7368,6 +7368,24 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+/*
+ * Scale the number of idle CPUs to scan according to the wakee's
+ * customized slice. The smaller the slice is, the earlier the task
+ * wants to be picked up, thus the lower wakeup latency the task expects.
+ * The baseline is the global sysctl_sched_base_slice. A task slice
+ * smaller than the global one shrinks the scan number.
+ */
+static int adjust_idle_scan(struct task_struct *p, int nr)
+{
+	if (!sched_feat(SIS_FAST))
+		return nr;
+
+	if (!p->se.custom_slice || p->se.slice >= sysctl_sched_base_slice)
+		return nr;
+
+	return div_u64(nr * p->se.slice, sysctl_sched_base_slice);
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
-			/* because !--nr is the condition to stop scan */
-			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
 			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
+			if (nr <= 0)
 				return -1;
 		}
 	}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..176324236018 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FAST, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1
