Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

LKML Archive mirror
 help / color / mirror / Atom feed

From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: <mingo@redhat.com>, <juri.lelli@redhat.com>,
	<vincent.guittot@linaro.org>, <dietmar.eggemann@arm.com>,
	<rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>,
	<bristot@redhat.com>, <vschneid@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kprateek.nayak@amd.com>,
	<wuyun.abel@bytedance.com>, <tglx@linutronix.de>, <efault@gmx.de>,
	<yu.chen.surf@gmail.com>
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue
Date: Sat, 6 Apr 2024 17:23:25 +0800	[thread overview]
Message-ID: <ZhEUjX1Nw0y2eJ1o@chenyu5-mobl2> (raw)
In-Reply-To: <20240405110010.631664251@infradead.org>

On 2024-04-05 at 12:28:02 +0200, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
> 
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
> 
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
> 
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
> 
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
> 
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
> 
> XXX test the cfs-throttle stuff
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

Tested schbench on xeon server, which has 240 CPUs/2 sockets.
schbench -m 2 -r 100
the result seems ok to me.

baseline:
NO_DELAY_DEQUEUE
NO_DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1658446 total samples)
	  50.0th: 5          (361126 samples)
	  90.0th: 11         (654121 samples)
	* 99.0th: 25         (123032 samples)
	  99.9th: 673        (13845 samples)
	  min=1, max=8337
Request Latencies percentiles (usec) runtime 100 (s) (1662381 total samples)
	  50.0th: 14992      (524771 samples)
	  90.0th: 15344      (657370 samples)
	* 99.0th: 15568      (129769 samples)
	  99.9th: 15888      (10017 samples)
	  min=3529, max=43841
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 16544      (37 samples)
	* 50.0th: 16608      (30 samples)
	  90.0th: 16736      (31 samples)
	  min=16403, max=17698
average rps: 16623.81


DELAY_DEQUEUE
DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1668161 total samples)
	  50.0th: 6          (394867 samples)
	  90.0th: 12         (653021 samples)
	* 99.0th: 31         (142636 samples)
	  99.9th: 755        (14547 samples)
	  min=1, max=5226
Request Latencies percentiles (usec) runtime 100 (s) (1671859 total samples)
	  50.0th: 14384      (511809 samples)
	  90.0th: 14992      (653508 samples)
	* 99.0th: 15408      (149257 samples)
	  99.9th: 15984      (12090 samples)
	  min=3546, max=38360
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 16672      (45 samples)
	* 50.0th: 16736      (52 samples)
	  90.0th: 16736      (0 samples)
	  min=16629, max=16800
average rps: 16718.59


The 99th wakeup latency increases a little bit, and should be in the acceptible
range(25 -> 31 us). Meanwhile the throughput increases accordingly. Here are
the possible reason I can think of:

1. wakeup latency: The time to find an eligible entity in the tree
   during wakeup might take longer - if there are more delayed-dequeue
   tasks in the tree.
2. throughput: Inhibit task dequeue can decrease the ratio to touch the
   task group's load_avg: dequeue_entity()-> { update_load_avg(), update_cfs_group()),
   which reduces the cache contention among CPUs, and improves throughput.


> +	} else {
> +		bool sleep = flags & DEQUEUE_SLEEP;
> +
> +		SCHED_WARN_ON(sleep && se->sched_delayed);
> +		update_curr(cfs_rq);
> +
> +		if (sched_feat(DELAY_DEQUEUE) && sleep &&
> +		    !entity_eligible(cfs_rq, se)) {

Regarding the elibigle check, it was found that there could be an overflow
issue, and it brings false negative of entity_eligible(), which was described here:
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
and also reported on another machine
https://lore.kernel.org/lkml/ZeCo7STWxq+oyN2U@gmail.com/
I don't have good idea to avoid that overflow properly, while I'm trying to
reproduce it locally, do you have any guidance on how to address it?

thanks,
Chenyu

next prev parent reply	other threads:[~2024-04-06  9:23 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-05 10:27 [RFC][PATCH 00/10] sched/fair: Complete EEVDF Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 01/10] sched/eevdf: Add feature comments Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 02/10] sched/eevdf: Remove min_vruntime_copy Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 03/10] sched/fair: Cleanup pick_task_fair() vs throttle Peter Zijlstra
2024-04-05 21:11   ` Benjamin Segall
2024-04-05 10:27 ` [RFC][PATCH 04/10] sched/fair: Cleanup pick_task_fair()s curr Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 05/10] sched/fair: Unify pick_{,next_}_task_fair() Peter Zijlstra
2024-04-06  2:20   ` Mike Galbraith
2024-04-05 10:28 ` [RFC][PATCH 06/10] sched: Allow sched_class::dequeue_task() to fail Peter Zijlstra
2024-04-05 10:28 ` [RFC][PATCH 07/10] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
2024-04-05 10:28 ` [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue Peter Zijlstra
2024-04-06  9:23   ` Chen Yu [this message]
2024-04-08  9:06     ` Peter Zijlstra
2024-04-11  1:32       ` Yan-Jie Wang
2024-04-25 10:25         ` Peter Zijlstra
2024-04-12 10:42   ` K Prateek Nayak
2024-04-15 10:56     ` Mike Galbraith
2024-04-16  3:18       ` K Prateek Nayak
2024-04-16  5:36         ` Mike Galbraith
2024-04-18 16:24           ` Mike Galbraith
2024-04-18 17:08             ` K Prateek Nayak
2024-04-24 15:20             ` Peter Zijlstra
2024-04-25 11:28             ` Peter Zijlstra
2024-04-26 10:56               ` Peter Zijlstra
2024-04-26 11:16                 ` Peter Zijlstra
2024-04-26 16:03                   ` Mike Galbraith
2024-04-27  6:42                     ` Mike Galbraith
2024-04-28 16:32                       ` Mike Galbraith
2024-04-29 12:14                         ` Peter Zijlstra
2024-04-15 17:07   ` Luis Machado
2024-04-24 15:15     ` Luis Machado
2024-04-25 10:42       ` Peter Zijlstra
2024-04-25 11:49         ` Peter Zijlstra
2024-04-26  9:32           ` Peter Zijlstra
2024-04-26  9:36             ` Peter Zijlstra
2024-04-26 10:16             ` Luis Machado
2024-04-29 14:33             ` Luis Machado
2024-05-02 10:26               ` Luis Machado
2024-05-10 14:49                 ` Luis Machado
2024-05-15  9:36                   ` Peter Zijlstra
2024-05-15 11:48                     ` Peter Zijlstra
2024-05-15 18:03                       ` Mike Galbraith
2024-04-26 10:15         ` Luis Machado
2024-04-20  5:57   ` Mike Galbraith
2024-04-22 13:13   ` Tobias Huschle
2024-04-05 10:28 ` [RFC][PATCH 09/10] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
2024-04-05 10:28 ` [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Peter Zijlstra
2024-04-06  8:16   ` Hillf Danton
2024-05-07  5:34   ` Mike Galbraith
2024-05-15 10:13     ` Peter Zijlstra
2024-05-07 15:15   ` Chen Yu
2024-05-08 13:52     ` Mike Galbraith
2024-05-09  3:48       ` Chen Yu
2024-05-09  5:00         ` Mike Galbraith
2024-05-13  4:07     ` K Prateek Nayak
2024-05-14  9:18       ` Chen Yu
2024-05-14 15:23         ` K Prateek Nayak
2024-05-14 16:15           ` Chen Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZhEUjX1Nw0y2eJ1o@chenyu5-mobl2 \
    --to=yu.c.chen@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=efault@gmx.de \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=wuyun.abel@bytedance.com \
    --cc=yu.chen.surf@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).