BPF Archive mirror
From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
	vschneid@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@kernel.org, joshdon@google.com,
	brho@google.com, pjt@google.com, derkling@google.com,
	haoluo@google.com, dvernet@meta.com, dschatzberg@meta.com,
	dskarlat@cs.cmu.edu, riel@surriel.com, changwoo@igalia.com,
	himadrics@inria.fr, memxor@gmail.com, andrea.righi@canonical.com,
	joel@joelfernandes.org, linux-kernel@vger.kernel.org,
	bpf@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class
Date: Wed, 15 May 2024 10:41:18 -1000	[thread overview]
Message-ID: <ZkUd7oUr11VGme1p@slm.duckdns.org> (raw)
In-Reply-To: <20240513080359.GI30852@noisy.programming.kicks-ass.net>

Hello, Peter.

On Mon, May 13, 2024 at 10:03:59AM +0200, Peter Zijlstra wrote:
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
> > The hierarchical scheduling overhead isn't the main motivation for us. We
> > can't use the CPU controller for all workloads and while it'd be nice to
> > improve that,
> 
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Did not previous versions use this very argument in order to push for
> all this?

Being able to experiment with potential solutions for problems like
hierarchical scheduling overhead is important and something we wanted to
demonstrate with sched_ext. It's true that the current hierarchical
scheduling is too expensive to deploy on certain workloads, but, as I wrote
before, it's also not that difficult to work around and isn't a high-priority
problem for us.

> By improving the cgroup mess -- I very much agree that the cgroup thing
> is not very nice. This whole argument goes away and we all get a better
> cgroup implementation.

Improving the cgroup CPU controller performance would be great. However, I
don't see how that'd be an argument against sched_ext. Sure, with sched_ext,
we can easily test out potential ideas which can lower the hierarchical
scheduling overhead, but, if anything, that should make us want it more. Why
wouldn't we want the same ability for other problems too?

> > This view works only if you assume that the entire world contains only a
> > handful of developers who can work on schedulers. The only way that would be
> > the case is if the barrier of entry is raised unreasonably high. Sometimes a
> > high barrier of entry can't be avoided or is beneficial. However, if it's
> > pushed up high enough to leave only a handful of people to work on an area
> > as large as scheduling, something probably is wrong.
> 
> I've never really felt there were too few sched patches to stare at on
> any one day (quite the opposite on many days in fact).
> 
> There have also always been plenty out of tree scheduler patches --
> although I rarely if ever have time to look at them.
...
> > I believe we agree that we want more people contributing to the scheduling
> > area. 
> 
> I think therein lies the rub -- contribution. If we were to do this
> thing, random loadable BPF schedulers, then how do we ensure people will
> contribute back?

Everything has costs and benefits. Forcing potential contributors into a
single narrow funnel has the benefit of concentrating the effort, as you're
pointing out. However, the cost is that it's a single funnel. In addition to
the inherent downsides of having only one of anything, it can handle only so
much and pushes people away from even considering contributing.

There are multiple types of contributions. Getting concrete patches into the
main scheduler is one. Trying out wildly different ideas and exploring the
problem space is another. Providing a viable competing implementation can be
an important contribution too by keeping everyone on their toes. If we
concentrate only on direct code contributions, we can lose sight of the
bigger picture, costing us in other areas.

During the short period of time that we've been experimenting with
sched_ext, we've already found multiple fairly generic approaches that show
significant gains. That's not because people who have been playing with
sched_ext have special abilities, but rather because there are plenty of
sometimes obvious things which have been difficult to try with the in-kernel
scheduler. Sure, anyone can modify the kernel, but, without a practical way
to publish, deploy and maintain such modifications, it's difficult to justify
the effort when the chance of landing upstream is so low. If
our experience up to this point is any indication, capable engineers who are
interested in the area don't seem to be in particularly short supply. What
is in short supply is an environment in which they can participate, develop
and refine their ideas.

Opportunity cost is often more difficult to appreciate, but it is as real as
any other cost. While there may be more than enough patches for you to
review, we are leaving a lot of opportunities unpursued and potential
contributors outside the fence because the funnel is too narrow and the
barrier of entry too high. Yes, there are benefits to the current setup where
we tell everyone to contribute to a single code base, but at this point I
believe it's costing us more than it's benefiting us.

> That is, from where I am sitting I see $vendor mandate their $enterprise
> product needs their $BPF scheduler. At which point $vendor will have no
> incentive to ever contribute back.
> 
> And customers of $vendor that want to run additional workloads on
> their machine are then stuck with that scheduler, irrespective of it
> being suitable for them or not. This is not a good experience.

The above scenario sounds contrived to me. The situation is already like
this with vendor-patched kernels. Just as with patched kernels, the vendor
has to share the code of sched_ext schedulers due to the GPL. After all, the
BPF verifier will flat out reject loading any non-GPL programs. In addition,
sched_ext has benefits in terms of user experience. Because sched_ext is
designed to be supplemental to the default scheduler, its users have an easy
out - falling back to CFS/EEVDF by simply unloading the sched_ext scheduler.
With patched kernels, they'd have to reboot and a stock kernel might not
even be available.
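As an illustrative sketch (following the standard libbpf convention, not code
quoted from this patchset), the license declaration a sched_ext BPF scheduler
carries looks like this; the verifier reads it at load time and refuses the
program if the string isn't GPL-compatible:

```c
#include <bpf/bpf_helpers.h>

/* The verifier inspects this section when the program is loaded and
 * rejects any non-GPL-compatible license string, so the scheduler's
 * source is necessarily shareable under the GPL. */
char _license[] SEC("license") = "GPL";
```

Unloading is equally lightweight: when the scheduler's struct_ops map goes
away (for example, its userspace loader exits), the BPF scheduler detaches
and tasks fall back to the default class without a reboot.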

> So I don't at all mind people playing around with schedulers -- they can
> do so today, there are a ton of out of tree patches to start or learn
> from, or like I said, it really isn't all that hard to just rip out fair
> and write something new.
> 
> Open source, you get to do your own thing. Have at.
> 
> But part of what made Linux work so well, is in my opinion the GPL. GPL
> forces people to contribute back -- to work on the shared project. And I
> see the whole BPF thing as a run-around on that.
> 
> Even the large cloud vendors and service providers (Amazon, Google,
> Facebook etc.) contribute back because of rebase pain -- as you well
> know. The rebase pain offsets the 'TIVO hole'.

Two things are being conflated here. What the GPL gives us is that ideas and
code don't get locked up behind a paywall. If someone bases their work on a
GPL project, others get to take a look at what they did and can learn and
copy from it. The upstream pressure is a separate mechanism which nudges
people towards upstream because the overhead of rebasing is painful
regardless of the license requirements.

The upstream pressure works well, but, as I wrote above, it can also be
pushed too far, to the point where it costs long-term development rather than
benefiting it. Controlling too tightly runs the risk of driving changes and
proposals worth considering underground and pushing potential contributors
away. It may be difficult to judge and agree on exactly where the current
situation sits, but it is not difficult to see signs of stress. Even just for
us, scheduling is one of the common pain points for both server workloads and
Oculus. Talking to other organizations, we hear similar concerns.

You said two conflicting things: that people can have at it because it's open
source, but at the same time that even large organizations are forced into
the funnel by the rebase pain. It's true that even for large organizations,
deviating from upstream is expensive. However, big orgs can still do it
because the benefit usually scales with the number of machines, allowing them
to cross the break-even point and pay for it.

But the same pain applies to smaller organizations, researchers and
individuals. Imagine how big a deterrent the current situation is for
them. It's extremely challenging for them to build a user base and community
when it's so awkward to deploy and painful to maintain custom kernels. Some
still persevere but most would be discouraged even from starting if the
prospect of their work being useful is so slim. This limits potential
contributions from a lot of organizations.

CFS / EEVDF is an excellent general-purpose scheduler. It is obviously the
most used and most important scheduler in the whole world. It's difficult to
believe that the only way to get enough people to contribute to it is by
suppressing alternatives. The current approach of funneling potential
contributors into a single code base with a very high bar creates a lot of
pain for those potential contributors, and probably feels unnecessarily
punitive to anyone new to the space. If we really have to worry about losing
contributors to the main Linux scheduler just because sched_ext creates an
additional space that interested engineers can work in, something has gone
really wrong, and I don't believe that matches reality.

> But with the BPF muck; where is the motivation to help improve things?
> 
> Keeping a rando github repo with BPF schedulers is not contributing.
> That's just a repo with multiple out of tree schedulers to be ignored.
> Who will put in the effort of upsteaming things if they can hack up a
> BPF and throw it over the wall?

I wouldn't be so dismissive of development happening outside the kernel
tree. We already see strong community collaboration in the SCX repo, which
serves as an umbrella project for the sched_ext schedulers. Different
schedulers are chasing different directions, but they actively learn and
borrow from each other. It can definitely serve as an incubator for proving
and refining new ideas which can then be adopted widely, and for growing
scheduling engineers.

For example, it's still early, but Changwoo's work on interactivity in
scx_lavd seems generally useful and has already been adopted by scx_rusty.
It's something which could easily be applied to EEVDF too. Changwoo may or
may not work on EEVDF directly (he says he wants to), but the code change
necessary is neither big nor difficult. Figuring out what actually works was
the hard part, not the implementation. Not all ideas will be like this, but
this one serves as a good example of how contribution is not just directly
writing patches and how work outside the tree can benefit the kernel.

> So yeah, I'm very much NOT supportive of this effort. From where I'm
> sitting there is simply not a single benefit. You're not making my life
> better, so why would I care?
> 
> How does this BPF muck translate into better quality patches for me?

I'm not sure whether it would make your life better but I firmly believe
that it will benefit overall Linux scheduling in the long term. You don't
necessarily have to care. We'll do our best to ensure that it bothers you as
little as possible.

Maybe I'm mistaken and we won't find much that'd be useful enough for EEVDF,
but maybe there are enough things we haven't tried yet that will make things
better for everyone. I believe the latter, and the indications so far seem to
agree. You don't have to share my optimism, but wouldn't it at least be
worthwhile to find out?

Thanks.

-- 
tejun
