From: Chen Yu <yu.c.chen@intel.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>, <mingo@redhat.com>,
	<juri.lelli@redhat.com>, <vincent.guittot@linaro.org>,
	<dietmar.eggemann@arm.com>, <rostedt@goodmis.org>,
	<bsegall@google.com>, <mgorman@suse.de>, <bristot@redhat.com>,
	<vschneid@redhat.com>, <linux-kernel@vger.kernel.org>,
	<wuyun.abel@bytedance.com>, <tglx@linutronix.de>, <efault@gmx.de>,
	<tim.c.chen@intel.com>, <yu.c.chen.y@gmail.com>
Subject: Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
Date: Tue, 14 May 2024 17:18:55 +0800
Message-ID: <ZkMsf4Fz7/AFoQfC@chenyu5-mobl2>
In-Reply-To: <422fc38c-6096-8804-17ce-1420661743e8@amd.com>

Hi Prateek,

On 2024-05-13 at 09:37:07 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 5/7/2024 8:45 PM, Chen Yu wrote:
> > On 2024-04-05 at 12:28:04 +0200, Peter Zijlstra wrote:
> >> Allow applications to directly set a suggested request/slice length using
> >> sched_attr::sched_runtime.
> >>
> >> The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> >> which is 1/10 of the tick period at HZ=1000 and 10 times the tick
> >> period at HZ=100.
> >>
> >> Applications should strive to use their periodic runtime at a high
> >> confidence interval (95%+) as the target slice. Using a smaller slice
> >> will introduce undue preemptions, while using a larger value will
> >> increase latency.
> >>
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>
> > 
> > Is it possible to leverage this task slice to do better task wakeup placement?
> > The idea is that the smaller the slice the wakee has, the fewer idle CPUs it
> > should scan. This can reduce wakeup latency and inhibit costly task migration,
> > especially on large systems.
> > 
> > We did some experiments and got some performance improvements:
> > 
> > 
> > From 9cb806476586d7048fcbd0f66d0101f0dbb8fd2b Mon Sep 17 00:00:00 2001
> > From: Chen Yu <yu.c.chen@intel.com>
> > Date: Tue, 7 May 2024 22:36:29 +0800
> > Subject: [RFC PATCH] sched/eevdf: Use customized slice to reduce wakeup latency
> >  and inhibit task migration
> > 
> > Problem 1:
> > The overhead of task migration is high on many-core systems. This
> > overhead brings a performance penalty due to broken cache locality and
> > higher cache-to-cache latency.
> > 
> > Problem 2:
> > During wakeup, the time spent searching for an idle CPU is costly on
> > many-core systems. Besides, accessing other CPUs' rq statistics brings
> > cache contention:
> > 
> > available_idle_cpu(cpu) -> idle_cpu(cpu) -> {rq->curr, rq->nr_running}
> > 
> > Although SIS_UTIL throttles the scan depth based on system utilization,
> > there is a requirement to further limit the scan depth for specific
> > workloads, especially for short-duration wakees.
> > 
> > Now we have an interface to customize the request/slice. The smaller the
> > slice, the earlier the task can be picked up, and the lower the wakeup
> > latency the task expects. Leverage the wakee's slice to further throttle
> > the idle CPU scan depth: the shorter the slice, the fewer CPUs to scan.
> > 
> > Tested on a 2-socket, 240-CPU system. With SNC (sub-NUMA cluster) enabled,
> > each LLC domain has 60 CPUs. There is a noticeable improvement in netperf.
> > (With SNC disabled, more improvement should be seen because the
> > cache-to-cache latency is higher.)
> > 
> > The global slice is 3 msec (sysctl_sched_base_slice) by default on my
> > Ubuntu 22.04, and the customized slice is set to 0.1 msec for both
> > netperf and netserver:
> > 
> > for i in $(seq 1 $job); do
> > 	netperf_slice -e 100000 -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 &
> > done
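> > 
> > (For reference, the slice can be requested from userspace via the
> > sched_setattr() syscall. Below is a minimal sketch, assuming the
> > sched_attr layout from include/uapi/linux/sched/types.h; there is no
> > glibc wrapper, so the raw syscall is used. "netperf_slice" above is a
> > patched netperf; presumably its -e option does the equivalent of:)
> > 
> > #define _GNU_SOURCE
> > #include <stdint.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <sys/syscall.h>
> > 
> > struct sched_attr {
> > 	uint32_t size;
> > 	uint32_t sched_policy;
> > 	uint64_t sched_flags;
> > 	int32_t  sched_nice;
> > 	uint32_t sched_priority;
> > 	uint64_t sched_runtime;		/* reused as the slice request, in ns */
> > 	uint64_t sched_deadline;
> > 	uint64_t sched_period;
> > };
> > 
> > /* Request a 0.1 msec slice for the calling task (pid 0 == self). */
> > static int set_custom_slice(void)
> > {
> > 	struct sched_attr attr;
> > 
> > 	memset(&attr, 0, sizeof(attr));
> > 	attr.size = sizeof(attr);
> > 	attr.sched_policy = 0;			/* SCHED_OTHER */
> > 	attr.sched_runtime = 100000;		/* 100000 ns == 0.1 msec */
> > 
> > 	return syscall(SYS_sched_setattr, 0, &attr, 0);
> > }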
> > 
> > case            	load    	baseline(std%)	compare% (std%)
> > TCP_RR          	60-threads	 1.00 (  1.60)	 +0.35 (  1.73)
> > TCP_RR          	120-threads	 1.00 (  1.34)	 -0.96 (  1.24)
> > TCP_RR          	180-threads	 1.00 (  1.59)	+92.20 (  4.24)
> > TCP_RR          	240-threads	 1.00 (  9.71)	+43.11 (  2.97)
> > 
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >  kernel/sched/fair.c     | 23 ++++++++++++++++++++---
> >  kernel/sched/features.h |  1 +
> >  2 files changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index edc23f6588a3..f269ae7d6e24 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7368,6 +7368,24 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> >  
> >  #endif /* CONFIG_SCHED_SMT */
> >  
> > +/*
> > + * Scale the number of idle CPUs to scan according to the wakee's
> > + * customized slice. The smaller the slice, the earlier the task
> > + * wants to be picked up, thus the lower the wakeup latency the task
> > + * expects. The baseline is the global sysctl_sched_base_slice. A task
> > + * slice smaller than the global one shrinks the scan number.
> > + */
> > +static int adjust_idle_scan(struct task_struct *p, int nr)
> > +{
> > +	if (!sched_feat(SIS_FAST))
> > +		return nr;
> > +
> > +	if (!p->se.custom_slice || p->se.slice >= sysctl_sched_base_slice)
> > +		return nr;
> > +
> > +	return div_u64(nr * p->se.slice, sysctl_sched_base_slice);
> > +}
> > +
> >  /*
> >   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > @@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  	if (sched_feat(SIS_UTIL)) {
> >  		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> >  		if (sd_share) {
> > -			/* because !--nr is the condition to stop scan */
> > -			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> > +			nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
> >  			/* overloaded LLC is unlikely to have idle cpu/core */
> > -			if (nr == 1)
> > +			if (nr <= 0)
> 
> I was wondering if this would preserve the current behavior with
> SIS_FAST toggled off? Since the implementation below still does a
> "--nr <= 0", wouldn't it effectively visit one CPU fewer overall now?
>
> Have you tried something similar to the below hunk?
> 
> 	/* because !--nr is the condition to stop scan */
> 	nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan)) + 1;
> 	if (nr == 1)
> 		return -1;
>

Yeah, right, to keep the scan depth consistent, the "+1" should be kept.
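
Something like this, then (an untested sketch on top of the RFC patch,
which keeps the scan depth identical when the task has no custom slice):

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan)) + 1;
			/* overloaded LLC is unlikely to have idle cpu/core */
			if (nr == 1)
				return -1;
		}
	}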
 
> I agree with Mike that looking at slice to limit scan depth seems odd.
> My experience with netperf is that the workload cares more about the
> server and client being co-located on the closest cache domain, and by
> limiting scan depth using the slice, this is indirectly achieved since
> all the wakeups carry the WF_SYNC flag.
>

Exactly. This is the original motivation.
 
> P.S. Have you tried using the slice in __select_idle_cpu()? Similar to
> the sched_idle_cpu() check, perhaps an additional sched_preempt_short_cpu()
> which compares rq->curr->se.slice with the waking task's slice and
> returns that cpu if SIS_SHORT can help run the workload quicker?

This is a good idea; it seems it would benefit PREEMPT_SHORT. If the
customized task slice is introduced, we can leverage this hint for
latency-related optimizations. Task wakeup is one thing, and I can also
think of other aspects, like idle load balancing, etc. I'm not sure what
the proper usage of the task slice is, though; this is why I sent this RFC.
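
For the wakeup part, here is a rough sketch of what I understand you are
suggesting (sched_preempt_short_cpu() is hypothetical, not existing code,
and the condition likely needs more care):

static inline bool sched_preempt_short_cpu(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (!sched_feat(SIS_SHORT) || !p->se.custom_slice)
		return false;

	/*
	 * A non-idle CPU is still a candidate if its current fair task
	 * has a longer slice than the wakee: the wakee may win wakeup
	 * preemption there quickly.
	 */
	return rq->curr->sched_class == &fair_sched_class &&
	       p->se.slice < rq->curr->se.slice;
}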

> Note:
> This will not work if the SIS scan itself is the largest overhead in the
> wakeup cycle and not the task placement itself. Previously during
> SIS_UTIL testing, to measure the overheads of scan vs placement, we
> would do a full scan but return the result that SIS_UTIL would have
> returned to determine the overhead of the search itself.
>

Regarding the task placement, do you mean the time between when a task is
enqueued and when it is picked up? Do you have any recommendation on which
workload can expose the scan overhead the most?
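
Just to check my understanding of that measurement: inside select_idle_cpu(),
scan the whole LLC to pay the full search cost, but return only what the
throttled scan would have picked? Roughly (an untested sketch):

	int i, limited_pick = -1, scanned = 0;

	for_each_cpu_wrap(i, cpus, target) {
		/* full scan: touch every CPU's rq statistics */
		bool idle = available_idle_cpu(i);

		/* ...but remember only what the SIS_UTIL-throttled scan returns */
		if (scanned++ < nr && idle && limited_pick < 0)
			limited_pick = i;
	}

	return limited_pick;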

thanks,
Chenyu
 
> >  				return -1;
> >  		}
> >  	}
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 143f55df890b..176324236018 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> >   */
> >  SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_FAST, true)
> >  
> >  /*
> >   * Issue a WARN when we do multiple update_rq_clock() calls
> 
> --
> Thanks and Regards,
> Prateek
