LKML Archive mirror
 help / color / mirror / Atom feed
From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: <mingo@redhat.com>, <juri.lelli@redhat.com>,
	<vincent.guittot@linaro.org>, <dietmar.eggemann@arm.com>,
	<rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>,
	<bristot@redhat.com>, <vschneid@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kprateek.nayak@amd.com>,
	<wuyun.abel@bytedance.com>, <tglx@linutronix.de>, <efault@gmx.de>,
	<yu.chen.surf@gmail.com>
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue
Date: Sat, 6 Apr 2024 17:23:25 +0800	[thread overview]
Message-ID: <ZhEUjX1Nw0y2eJ1o@chenyu5-mobl2> (raw)
In-Reply-To: <20240405110010.631664251@infradead.org>

On 2024-04-05 at 12:28:02 +0200, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
> 
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
> 
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
> 
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
> 
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
> 
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
> 
> XXX test the cfs-throttle stuff
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

Tested schbench on xeon server, which has 240 CPUs/2 sockets.
schbench -m 2 -r 100
the result seems ok to me.

baseline:
NO_DELAY_DEQUEUE
NO_DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1658446 total samples)
	  50.0th: 5          (361126 samples)
	  90.0th: 11         (654121 samples)
	* 99.0th: 25         (123032 samples)
	  99.9th: 673        (13845 samples)
	  min=1, max=8337
Request Latencies percentiles (usec) runtime 100 (s) (1662381 total samples)
	  50.0th: 14992      (524771 samples)
	  90.0th: 15344      (657370 samples)
	* 99.0th: 15568      (129769 samples)
	  99.9th: 15888      (10017 samples)
	  min=3529, max=43841
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 16544      (37 samples)
	* 50.0th: 16608      (30 samples)
	  90.0th: 16736      (31 samples)
	  min=16403, max=17698
average rps: 16623.81


DELAY_DEQUEUE
DELAY_ZERO
Wakeup Latencies percentiles (usec) runtime 100 (s) (1668161 total samples)
	  50.0th: 6          (394867 samples)
	  90.0th: 12         (653021 samples)
	* 99.0th: 31         (142636 samples)
	  99.9th: 755        (14547 samples)
	  min=1, max=5226
Request Latencies percentiles (usec) runtime 100 (s) (1671859 total samples)
	  50.0th: 14384      (511809 samples)
	  90.0th: 14992      (653508 samples)
	* 99.0th: 15408      (149257 samples)
	  99.9th: 15984      (12090 samples)
	  min=3546, max=38360
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 16672      (45 samples)
	* 50.0th: 16736      (52 samples)
	  90.0th: 16736      (0 samples)
	  min=16629, max=16800
average rps: 16718.59


The 99th wakeup latency increases a little bit, and should be in the acceptible
range(25 -> 31 us). Meanwhile the throughput increases accordingly. Here are
the possible reason I can think of:

1. wakeup latency: The time to find an eligible entity in the tree
   during wakeup might take longer - if there are more delayed-dequeue
   tasks in the tree.
2. throughput: Inhibit task dequeue can decrease the ratio to touch the
   task group's load_avg: dequeue_entity()-> { update_load_avg(), update_cfs_group()),
   which reduces the cache contention among CPUs, and improves throughput.


> +	} else {
> +		bool sleep = flags & DEQUEUE_SLEEP;
> +
> +		SCHED_WARN_ON(sleep && se->sched_delayed);
> +		update_curr(cfs_rq);
> +
> +		if (sched_feat(DELAY_DEQUEUE) && sleep &&
> +		    !entity_eligible(cfs_rq, se)) {

Regarding the elibigle check, it was found that there could be an overflow
issue, and it brings false negative of entity_eligible(), which was described here:
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
and also reported on another machine
https://lore.kernel.org/lkml/ZeCo7STWxq+oyN2U@gmail.com/
I don't have good idea to avoid that overflow properly, while I'm trying to
reproduce it locally, do you have any guidance on how to address it?

thanks,
Chenyu

  reply	other threads:[~2024-04-06  9:23 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-05 10:27 [RFC][PATCH 00/10] sched/fair: Complete EEVDF Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 01/10] sched/eevdf: Add feature comments Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 02/10] sched/eevdf: Remove min_vruntime_copy Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 03/10] sched/fair: Cleanup pick_task_fair() vs throttle Peter Zijlstra
2024-04-05 21:11   ` Benjamin Segall
2024-04-05 10:27 ` [RFC][PATCH 04/10] sched/fair: Cleanup pick_task_fair()s curr Peter Zijlstra
2024-04-05 10:27 ` [RFC][PATCH 05/10] sched/fair: Unify pick_{,next_}_task_fair() Peter Zijlstra
2024-04-06  2:20   ` Mike Galbraith
2024-04-05 10:28 ` [RFC][PATCH 06/10] sched: Allow sched_class::dequeue_task() to fail Peter Zijlstra
2024-04-05 10:28 ` [RFC][PATCH 07/10] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
2024-04-05 10:28 ` [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue Peter Zijlstra
2024-04-06  9:23   ` Chen Yu [this message]
2024-04-08  9:06     ` Peter Zijlstra
2024-04-11  1:32       ` Yan-Jie Wang
2024-04-25 10:25         ` Peter Zijlstra
2024-04-12 10:42   ` K Prateek Nayak
2024-04-15 10:56     ` Mike Galbraith
2024-04-16  3:18       ` K Prateek Nayak
2024-04-16  5:36         ` Mike Galbraith
2024-04-18 16:24           ` Mike Galbraith
2024-04-18 17:08             ` K Prateek Nayak
2024-04-24 15:20             ` Peter Zijlstra
2024-04-25 11:28             ` Peter Zijlstra
2024-04-26 10:56               ` Peter Zijlstra
2024-04-26 11:16                 ` Peter Zijlstra
2024-04-26 16:03                   ` Mike Galbraith
2024-04-27  6:42                     ` Mike Galbraith
2024-04-28 16:32                       ` Mike Galbraith
2024-04-29 12:14                         ` Peter Zijlstra
2024-04-15 17:07   ` Luis Machado
2024-04-24 15:15     ` Luis Machado
2024-04-25 10:42       ` Peter Zijlstra
2024-04-25 11:49         ` Peter Zijlstra
2024-04-26  9:32           ` Peter Zijlstra
2024-04-26  9:36             ` Peter Zijlstra
2024-04-26 10:16             ` Luis Machado
2024-04-29 14:33             ` Luis Machado
2024-05-02 10:26               ` Luis Machado
2024-05-10 14:49                 ` Luis Machado
2024-05-15  9:36                   ` Peter Zijlstra
2024-05-15 11:48                     ` Peter Zijlstra
2024-05-15 18:03                       ` Mike Galbraith
2024-04-26 10:15         ` Luis Machado
2024-04-20  5:57   ` Mike Galbraith
2024-04-22 13:13   ` Tobias Huschle
2024-04-05 10:28 ` [RFC][PATCH 09/10] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
2024-04-05 10:28 ` [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Peter Zijlstra
2024-04-06  8:16   ` Hillf Danton
2024-05-07  5:34   ` Mike Galbraith
2024-05-15 10:13     ` Peter Zijlstra
2024-05-07 15:15   ` Chen Yu
2024-05-08 13:52     ` Mike Galbraith
2024-05-09  3:48       ` Chen Yu
2024-05-09  5:00         ` Mike Galbraith
2024-05-13  4:07     ` K Prateek Nayak
2024-05-14  9:18       ` Chen Yu
2024-05-14 15:23         ` K Prateek Nayak
2024-05-14 16:15           ` Chen Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZhEUjX1Nw0y2eJ1o@chenyu5-mobl2 \
    --to=yu.c.chen@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=efault@gmx.de \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=wuyun.abel@bytedance.com \
    --cc=yu.chen.surf@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).