From: Jason Baron <jbaron@akamai.com>
To: akpm@linux-foundation.org
Cc: mingo@kernel.org, peterz@infradead.org, viro@ftp.linux.org.uk,
	mtk.manpages@gmail.com, normalperson@yhbt.net, m@silodev.com,
	corbet@lwn.net, luto@amacapital.net,
	torvalds@linux-foundation.org, hagen@jauu.net,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org
Subject: [PATCH] epoll: add EPOLLEXCLUSIVE flag
Date: Mon,  7 Dec 2015 22:23:47 -0500
Message-ID: <e26b962fcb6f87a81b38d5dc812cf33fd7549c8a.1449523436.git.jbaron@akamai.com>
In-Reply-To: <cover.1449523436.git.jbaron@akamai.com>

Currently, epoll file descriptors, or epfds (the fd returned from
epoll_create[1]()), that are added to a shared wakeup source are always
added in a non-exclusive manner. As a result, when multiple epfds are
attached to a shared fd source, all of them are woken up on every event,
which creates thundering-herd behavior.

Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation. This new
flag allows for exclusive wakeups when there are multiple epfds attached to
a shared fd event source.
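
A minimal userspace sketch of the usage described above, assuming a kernel
that carries this patch. The helper names add_exclusive() and
mod_exclusive_rejected() are hypothetical and exist only for illustration;
the fallback #define copies the value from this patch for libc headers that
do not provide it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Value taken from this patch; older libc headers may not define it. */
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/*
 * Hypothetical helper: create a new epfd and attach `fd` to it in
 * exclusive mode. Returns the epfd, or -1 with errno set.
 */
int add_exclusive(int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
	int epfd;

	assert(fd >= 0);	/* caller must pass an open descriptor */

	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;

	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		int saved = errno;

		close(epfd);
		errno = saved;
		return -1;
	}
	return epfd;
}

/*
 * Hypothetical helper: try to pass EPOLLEXCLUSIVE through EPOLL_CTL_MOD,
 * which this patch disallows. Returns 1 if the kernel rejected the
 * request with EINVAL, 0 otherwise.
 */
int mod_exclusive_rejected(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) < 0 && errno == EINVAL;
}
```

In a typical setup, each worker would call add_exclusive() on the same
shared fd (for example a listening socket) and then block in epoll_wait()
on its own epfd.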

The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait(). The idea is to search for threads which are
idle and ready to process the wakeup events. Thus, we queue an event to at
least 1 epfd, but may still potentially queue an event to all epfds that
are attached to the shared fd source.
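
The walk described above can be modeled with a small userspace simulation.
This is a simplified sketch, not the kernel code: struct sim_epfd,
sim_callback() and sim_wake_exclusive() are hypothetical names, and the
model assumes every callback queues an event and reports whether a thread
was blocked in epoll_wait() on that epfd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one epfd attached in exclusive mode. */
struct sim_epfd {
	int has_blocked_waiter;	/* a thread is sleeping in epoll_wait() */
	int queued_events;	/* events queued to this epfd so far */
};

/*
 * Models the patched callback: always queue the event, then report
 * whether a blocked thread was available to be woken.
 */
static int sim_callback(struct sim_epfd *ep)
{
	ep->queued_events++;
	return ep->has_blocked_waiter;
}

/*
 * Models the walk over the exclusive waiters: stop at the first epfd
 * whose callback reports a wakeup. Returns how many epfds had an event
 * queued, which is at least 1 and at most n for n > 0.
 */
size_t sim_wake_exclusive(struct sim_epfd *eps, size_t n)
{
	size_t visited = 0;

	assert(eps != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		visited++;
		if (sim_callback(&eps[i]))
			break;
	}
	return visited;
}
```

With waiters { busy, idle, idle }, the model queues an event to the first
two epfds and never touches the third. If no epfd has a blocked thread,
every epfd gets an event queued, which matches the worst case noted above.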

Performance testing was done by Madars Vitolins using a modified version of
Enduro/X. Using the 'EPOLLEXCLUSIVE' flag reduced the run time of this
particular workload from 860s to 24s.

Tested-by: Madars Vitolins <m@silodev.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 24 +++++++++++++++++++++---
 include/uapi/linux/eventpoll.h |  3 +++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..ae1dbcf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1081,9 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLEXCLUSIVE)
+		return ewake;
+
 	return 1;
 }
 
@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1862,6 +1871,15 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		goto error_tgt_fput;
 
 	/*
+	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
+	 * so EPOLLEXCLUSIVE is not allowed for an EPOLL_CTL_MOD operation.
+	 * Also, we do not currently support nested exclusive wakeups.
+	 */
+	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+		goto error_tgt_fput;
+
+	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
 	 */
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..1c31549 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Set exclusive wakeup mode for the target file descriptor */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
2.6.1

