Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

From: Eric Wong <normalperson@yhbt.net>
To: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Linux API <linux-api@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Davide Libenzi <davidel@xmailserver.org>,
	Michael Kerrisk-manpages <mtk.manpages@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
Date: Tue, 10 Feb 2015 04:49:39 +0000
Message-ID: <20150210044939.GA15616@dcvr.yhbt.net>
In-Reply-To: <54D98209.2080901@akamai.com>

Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
> >> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> >>> On 02/09/2015 12:06 PM, Jason Baron wrote:
> >>>> Epoll file descriptors that are added to a shared wakeup source are always
> >>>> added in a non-exclusive manner. That means that when we have multiple epoll
> >>>> fds attached to a shared wakeup source they are all woken up. This can
> >>>> lead to excessive cpu usage and uneven load distribution.
> >>>>
> >>>> This patch introduces two new 'events' flags that are intended to be used
> >>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> >>>> source in an exclusive manner such that the minimum number of threads are
> >>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> >>>> also be added to the 'events' flag, such that we round robin around the set
> >>>> of waiting threads.
> >>>>
> >>>> An implementation note is that in the epoll wakeup routine,
> >>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> >>>> wakeup, only when there are current waiters. The idea is to use this additional
> >>>> heuristic in order to minimize wakeup latencies.
> >>> I don't understand what this is intended to do.
> >>>
> >>> If an event has EPOLLONESHOT, then only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
> >>>
> >> hmm...so with EPOLLONESHOT you basically get notified once about an event. If I have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular use case...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
> > Now I understand.  Why are you using multiple epollfds?
> >
> > --Andy
> 
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.
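
A rough sketch of that pattern in C, assuming one epoll fd shared by all
worker threads.  MAX_EVENTS, oneshot_add(), oneshot_rearm(),
worker_batch() and drain_fd() are hypothetical names made up for
illustration; a real handler would do application work instead of
discarding data.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64	/* hypothetical batch size; too large risks head-of-line blocking */

/* Register fd on the shared epoll fd; it is disabled after one event fires. */
int oneshot_add(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };

	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Re-enable a one-shot fd once the owning thread is done with it. */
int oneshot_rearm(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };

	return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Placeholder handler (an assumption): read and discard whatever is pending. */
void drain_fd(int fd)
{
	char buf[4096];

	(void)read(fd, buf, sizeof(buf));
}

/*
 * One worker iteration: take up to MAX_EVENTS events in a single
 * epoll_wait() call, handle all of them, and re-arm each fd.  The worker
 * does not call epoll_wait() again until the whole batch is done.
 * Returns the number of events handled, or -1 on error.
 */
int worker_batch(int epfd, int timeout_ms, void (*handle)(int fd))
{
	struct epoll_event evs[MAX_EVENTS];
	int n, i;

	assert(handle != NULL);
	n = epoll_wait(epfd, evs, MAX_EVENTS, timeout_ms);
	for (i = 0; i < n; i++) {
		handle(evs[i].data.fd);
		if (oneshot_rearm(epfd, evs[i].data.fd) < 0)
			return -1;
	}
	return n;
}
```

Each worker thread would call worker_batch() in a loop on the same epfd.
Because every fd is one-shot, an fd is owned by exactly one thread until
that thread re-arms it, and the next event may land on a different
thread.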

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.

Do you have a userland use case to share?

> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
 
> More specifically, in the case of a single listen socket, it's natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe, is very specific to TCP/UDP, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do their own round-robin in userland to
pick a different epollfd to call EPOLL_CTL_ADD on.
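
A minimal sketch of such an acceptor in C, under stated assumptions:
struct dispatcher, dispatch_one() and acceptor_thread() are hypothetical
names, the per-worker epoll fds already exist, and error handling is
reduced to "stop on the first failure" (a real loop would likely retry
on EINTR or ECONNABORTED).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical acceptor state: one listen socket, N per-worker epoll fds. */
struct dispatcher {
	int listen_fd;
	const int *epfds;	/* epoll fds owned by the worker threads */
	size_t nr_epfds;
	size_t next;		/* index of the epoll fd to use next */
};

/*
 * Accept one connection and hand it to the next epoll fd in rotation.
 * Returns the accepted fd, or -1 on error.
 */
int dispatch_one(struct dispatcher *d)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int cfd, epfd;

	assert(d->nr_epfds > 0);
	cfd = accept(d->listen_fd, NULL, NULL);
	if (cfd < 0)
		return -1;

	/* Userland round-robin: pick the target, then advance the cursor. */
	epfd = d->epfds[d->next];
	d->next = (d->next + 1) % d->nr_epfds;

	ev.data.fd = cfd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev) < 0) {
		close(cfd);
		return -1;
	}
	return cfd;
}

/* Thread body: does nothing but loop on accept() + EPOLL_CTL_ADD. */
void *acceptor_thread(void *arg)
{
	struct dispatcher *d = arg;

	while (dispatch_one(d) >= 0)
		;	/* simplification: any error ends the loop */
	return NULL;
}
```

acceptor_thread() has the signature pthread_create() expects, so it can
be started with a pointer to a filled-in struct dispatcher.  Only the
acceptor touches the listen socket, so the workers never see a shared
wakeup source and each connection lands on exactly one epollfd.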
