LKML Archive mirror
* [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
@ 2016-02-04 15:39 Jason Baron
  2016-02-04 21:44 ` Madars Vitolins
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2016-02-04 15:39 UTC (permalink / raw)
  To: akpm, torvalds, mtk.manpages
  Cc: mingo, peterz, viro, normalperson, m, corbet, luto, hagen,
	linux-kernel, linux-fsdevel, linux-api

In the current implementation of the EPOLLEXCLUSIVE flag (added for 4.5-rc1),
if epoll waiters create different POLL* sets and register them as exclusive
against the same target fd, the wakeup stops after finding the first idle
waiter. This means that waiters could miss wakeups in certain cases.

For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);
So if one epoll set, epfd, is added to pipe p with POLLIN and a second set,
epfd2, is added to pipe p with POLLRDNORM, only epfd may receive the wakeup,
since the current implementation stops after it finds any intersection of
events with a waiter that is blocked in epoll_wait().
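
For illustration, a minimal userspace sketch of that scenario (not from
the patch; it assumes the pre-restriction behavior where any event mask
was accepted alongside EPOLLEXCLUSIVE):

#include <unistd.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)	/* uapi value from 4.5-rc1 */
#endif

int main(void)
{
	int pipefd[2];
	struct epoll_event ev = { 0 };
	int epfd = epoll_create1(0);
	int epfd2 = epoll_create1(0);

	if (pipe(pipefd) < 0)
		return 1;

	/* epfd waits exclusively for EPOLLIN ... */
	ev.events = EPOLLIN | EPOLLEXCLUSIVE;
	ev.data.fd = pipefd[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

	/* ... while epfd2 waits exclusively for EPOLLRDNORM only. */
	ev.events = EPOLLRDNORM | EPOLLEXCLUSIVE;
	epoll_ctl(epfd2, EPOLL_CTL_ADD, pipefd[0], &ev);

	/* The pipe wakeup carries POLLIN | POLLRDNORM, but the exclusive
	 * walk may stop at epfd's waiter, so epfd2 can miss the event. */
	write(pipefd[1], "x", 1);
	return 0;
}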

We could potentially address this by requiring that all epoll waiters added
to p pass the same set of POLL* events. That is, the first EPOLL_CTL_ADD that
passes EPOLLEXCLUSIVE establishes the set of POLL* flags to be used by any
other epfds that are added as EPOLLEXCLUSIVE. However, I think that might be
a somewhat confusing interface, as we would have to reference count the
number of users for that set, and so userspace would have to keep track of
that count, or we would need a more involved interface. It also adds some
shared state that we'd have to store somewhere. I don't think anybody will
want to bloat __wait_queue_head for this.

I think what we could do instead is to simply restrict EPOLLEXCLUSIVE such
that it can only be specified with EPOLLIN and/or EPOLLOUT. That way, if the
wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit the first
idle waiter that specifies the EPOLLIN bit, since any remaining waiters that
only have 'POLLOUT' set wouldn't need to be woken. Likewise, we can do the
same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. If both
'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least one
example of this I saw in fs/pipe.c), then we just wake the entire exclusive
list. Having both 'POLLOUT' and 'POLLIN' set should not be on any
performance-critical path, so I think that's ok (in fs/pipe.c it's in
pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by default
in any exclusive set. Thus, the user can specify EPOLLERR and/or EPOLLHUP but
is not required to do so.
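
As a sketch of what this restriction means for userspace (illustrative
only; epfd, fd and fd2 are assumed to be pre-existing descriptors):

#include <stdio.h>
#include <sys/epoll.h>

static void demo_restriction(int epfd, int fd, int fd2)
{
	struct epoll_event ev;

	ev.events = EPOLLIN | EPOLLOUT | EPOLLEXCLUSIVE;	/* allowed */
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
		perror("exclusive EPOLLIN|EPOLLOUT add");

	ev.events = EPOLLPRI | EPOLLEXCLUSIVE;	/* outside the allowed bits */
	ev.data.fd = fd2;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd2, &ev) < 0)
		perror("exclusive EPOLLPRI add");	/* fails with this patch */
}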

Since epoll waiters may be interested in other events besides EPOLLIN,
EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 'dup'
call on the target fd and adding that as one normally would with
EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT events are what we
are interested in balancing, the 'dup' approach could perhaps be applied to
only one of the waiter threads. However, I think that EPOLLIN, EPOLLOUT,
EPOLLERR and EPOLLHUP should be sufficient for the majority of use-cases.
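
A minimal sketch of that 'dup' workaround (illustrative; the extra
events chosen here are arbitrary):

#include <unistd.h>
#include <sys/epoll.h>

static int add_with_extra_events(int epfd, int fd)
{
	struct epoll_event ev;
	int fd2 = dup(fd);

	if (fd2 < 0)
		return -1;

	ev.events = EPOLLIN | EPOLLEXCLUSIVE;	/* balanced wakeups */
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
		return -1;

	/* the dup'd fd carries the other events via a normal add */
	ev.events = EPOLLRDHUP | EPOLLPRI;
	ev.data.fd = fd2;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd2, &ev);
}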

Since EPOLLEXCLUSIVE is intended to be used with a target fd shared among
multiple epfds, where between 1 and n of the epfds may receive an event, it
does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would get an
event. Thus, EPOLLONESHOT is not allowed to be specified in conjunction with
EPOLLEXCLUSIVE.

EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE. With the limited number of allowed flags this seems less
interesting, but it could be relaxed at some future point.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ae1dbcf..cde6074 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -94,6 +94,11 @@
 /* Epoll private bits inside the event mask */
 #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
+#define EPOLLINOUT_BITS (POLLIN | POLLOUT)
+
+#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | \
+				EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)
+
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
 
@@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * wait list.
 	 */
 	if (waitqueue_active(&ep->wq)) {
-		ewake = 1;
+		if ((epi->event.events & EPOLLEXCLUSIVE) &&
+					!((unsigned long)key & POLLFREE)) {
+			switch ((unsigned long)key & EPOLLINOUT_BITS) {
+			case POLLIN:
+				if (epi->event.events & POLLIN)
+					ewake = 1;
+				break;
+			case POLLOUT:
+				if (epi->event.events & POLLOUT)
+					ewake = 1;
+				break;
+			case 0:
+				ewake = 1;
+				break;
+			}
+		}
 		wake_up_locked(&ep->wq);
 	}
 	if (waitqueue_active(&ep->poll_wait))
@@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 * so EPOLLEXCLUSIVE is not allowed for an EPOLL_CTL_MOD operation.
 	 * Also, we do not currently support nested exclusive wakeups.
 	 */
-	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
-		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
-		goto error_tgt_fput;
+	if (epds.events & EPOLLEXCLUSIVE) {
+		if (op == EPOLL_CTL_MOD)
+			goto error_tgt_fput;
+		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
+				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
+			goto error_tgt_fput;
+	}
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_modify(ep, epi, &epds);
+			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
+				epds.events |= POLLERR | POLLHUP;
+				error = ep_modify(ep, epi, &epds);
+			}
 		} else
 			error = -ENOENT;
 		break;
-- 
2.6.1


* [PATCH] epoll: add EPOLLEXCLUSIVE flag
  @ 2015-12-08  3:23 ` Jason Baron
  0 siblings, 0 replies; 44+ results
From: Jason Baron @ 2015-12-08  3:23 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, viro, mtk.manpages, normalperson, m, corbet, luto,
	torvalds, hagen, linux-kernel, linux-fsdevel, linux-api

Currently, epoll file descriptors or epfds (the fd returned from
epoll_create[1]()) that are added to a shared wakeup source are always
added in a non-exclusive manner. This means that when we have multiple
epfds attached to a shared fd source they are all woken up. This creates
thundering herd type behavior.

Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation. This new
flag allows for exclusive wakeups when there are multiple epfds attached to
a shared fd event source.

The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait(). The idea is to search for threads which are
idle and ready to process the wakeup events. Thus, we queue an event to at
least 1 epfd, but may still potentially queue an event to all epfds that
are attached to the shared fd source.
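
For illustration, a minimal per-thread usage sketch (not part of the
patch; 'shared_fd' is an assumed, already-open shared descriptor):

#include <pthread.h>
#include <sys/epoll.h>

extern int shared_fd;	/* assumed: the shared wakeup source */

static void *worker(void *arg)
{
	struct epoll_event ev, out;
	int epfd = epoll_create1(0);

	(void)arg;
	ev.events = EPOLLIN | EPOLLEXCLUSIVE;
	ev.data.fd = shared_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, shared_fd, &ev);

	for (;;) {
		/* one event wakes at least one worker, not all of them */
		if (epoll_wait(epfd, &out, 1, -1) == 1)
			; /* consume the event, e.g. accept() */
	}
	return NULL;
}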

Performance testing was done by Madars Vitolins using a modified version of
Enduro/X. The use of the 'EPOLLEXCLUSIVE' flag reduced the runtime of this
particular workload from 860s down to 24s.

Tested-by: Madars Vitolins <m@silodev.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 24 +++++++++++++++++++++---
 include/uapi/linux/eventpoll.h |  3 +++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..ae1dbcf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1081,9 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLEXCLUSIVE)
+		return ewake;
+
 	return 1;
 }
 
@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1862,6 +1871,15 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		goto error_tgt_fput;
 
 	/*
+	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
+	 * so EPOLLEXCLUSIVE is not allowed for an EPOLL_CTL_MOD operation.
+	 * Also, we do not currently support nested exclusive wakeups.
+	 */
+	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+		goto error_tgt_fput;
+
+	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
 	 */
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..1c31549 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Set exclusive wakeup mode for the target file descriptor */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
2.6.1



* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18  8:07   ` Ingo Molnar
@ 2015-02-18 15:42     ` Jason Baron
  2015-02-18 16:33       ` Ingo Molnar
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2015-02-18 15:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra

On 02/18/2015 03:07 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> Epoll file descriptors that are added to a shared wakeup 
>> source are always added in a non-exclusive manner. That 
>> means that when we have multiple epoll fds attached to a 
>> shared wakeup source they are all woken up. This can lead 
>> to excessive cpu usage and uneven load distribution.
>>
>> This patch introduces two new 'events' flags that are 
>> intended to be used with EPOLL_CTL_ADD operations. 
>> EPOLLEXCLUSIVE, adds the epoll fd to the event source in 
>> an exclusive manner such that the minimum number of 
>> threads are woken. EPOLLROUNDROBIN, which depends on 
>> EPOLLEXCLUSIVE also being set, can also be added to the 
>> 'events' flag, such that we round robin through the set 
>> of waiting threads.
>>
>> An implementation note is that in the epoll wakeup 
>> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, 
>> we return 1, for a successful wakeup, only when there are 
>> current waiters. The idea is to use this additional 
>> heuristic in order minimize wakeup latencies.
>>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>> ---
>>  fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
>>  include/uapi/linux/eventpoll.h |  6 ++++++
>>  2 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index d77f944..382c832 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -92,7 +92,8 @@
>>   */
>>  
>>  /* Epoll private bits inside the event mask */
>> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
>> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
>> +			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>>  
>>  /* Maximum number of nesting allowed inside epoll sets */
>>  #define EP_MAX_NESTS 4
>> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>>  	unsigned long flags;
>>  	struct epitem *epi = ep_item_from_wait(wait);
>>  	struct eventpoll *ep = epi->ep;
>> +	int ewake = 0;
>>  
>>  	if ((unsigned long)key & POLLFREE) {
>>  		ep_pwq_from_wait(wait)->whead = NULL;
>> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>>  	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>>  	 * wait list.
>>  	 */
>> -	if (waitqueue_active(&ep->wq))
>> +	if (waitqueue_active(&ep->wq)) {
>> +		ewake = 1;
>>  		wake_up_locked(&ep->wq);
>> +	}
>>  	if (waitqueue_active(&ep->poll_wait))
>>  		pwake++;
>>  
>> @@ -1078,6 +1082,8 @@ out_unlock:
>>  	if (pwake)
>>  		ep_poll_safewake(&ep->poll_wait);
>>  
>> +	if (epi->event.events & EPOLLROUNDROBIN)
>> +		return ewake;
>>  	return 1;
>>  }
>>  
>> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>>  		pwq->whead = whead;
>>  		pwq->base = epi;
>> -		add_wait_queue(whead, &pwq->wait);
>> +		if (epi->event.events & EPOLLROUNDROBIN)
>> +			add_wait_queue_rr(whead, &pwq->wait);
>> +		else if (epi->event.events & EPOLLEXCLUSIVE)
>> +			add_wait_queue_exclusive(whead, &pwq->wait);
>> +		else
>> +			add_wait_queue(whead, &pwq->wait);
>>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>>  		epi->nwait++;
>>  	} else {
>> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>>  		struct epoll_event __user *, event)
>>  {
>> -	int error;
>> -	int full_check = 0;
>> +	int error, full_check = 0, wait_flags = 0;
>>  	struct fd f, tf;
>>  	struct eventpoll *ep;
>>  	struct epitem *epi;
>> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>>  	if (f.file == tf.file || !is_file_epoll(f.file))
>>  		goto error_tgt_fput;
>>  
>> +	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
>> +	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
>> +	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
>> +		goto error_tgt_fput;
>> +
>>  	/*
>>  	 * At this point it is safe to assume that the "private_data" contains
>>  	 * our own data structure.
>> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
>> index bc81fb2..10260a1 100644
>> --- a/include/uapi/linux/eventpoll.h
>> +++ b/include/uapi/linux/eventpoll.h
>> @@ -26,6 +26,12 @@
>>  #define EPOLL_CTL_DEL 2
>>  #define EPOLL_CTL_MOD 3
>>  
>> +/* Balance wakeups for a shared event source */
>> +#define EPOLLROUNDROBIN (1 << 27)
>> +
>> +/* Add exclusively */
>> +#define EPOLLEXCLUSIVE (1 << 28)
>> +
>>  /*
>>   * Request the handling of system wakeup events so as to prevent system suspends
>>   * from happening while those events are being processed.
> So let me rephrase the justification of your two patches:
>
> Unlike regular waitqueue usage (where threads remove 
> themselves from the waitqueue once they receive a wakeup), 
> epoll waitqueues are static and 'persistent': epoll target 
> threads are on the poll waitqueue indefinitely, only 
> register/unregister removes threads from them.
>
> So they are not really 'wait queues', but static 'task 
> lists', and are thus exposed to classic thundering herd 
> scheduling problems and scheduling assymetries: a single 
> event on a shared event source will wake all epoll 
> 'task-lists', and not only will it wake them, but due to 
> the static nature of the lists, even an exclusive wakeup 
> will iterate along the list with O(N) overhead, until it 
> finds a wakeable thread.
>
> As the number of lists and the number of threads in the 
> lists increases this scales suboptimally, and it also looks 
> slightly odd that a random set of epoll worker threads is 
> 'more equal' than the others, in receiving a comparably 
> higher proportion of events.

Yes, in fact we are currently working around these imbalances
by periodically doing register/unregister (EPOLL_CTL_ADD/EPOLL_CTL_DEL)
to reset the order of the queues. This resolves the
balancing to an extent, but not the spurious wakeups.

>
> The solution is to add this new ABI to allow epoll events 
> to be actively load-balanced both between the persistent 
> 'task lists', and to also allow the individual task lists 
> to act as dynamic runqueues: the head of the list is likely 
> to be sleeping, newly woken tasks get moved to the tail of 
> the list.
>
> This has two main advantages: firstly it solves the O(N) 
> (micro-)problem, but it also more evenly distributes events 
> both between task-lists and within epoll groups as tasks as 
> well.

It's solving two issues - spurious wakeups, and more
even loading of threads. The event distribution is more
even between 'epoll groups' with this patch; however,
if multiple threads are blocking on a single 'epoll group',
this patch does not affect the event distribution there.
Currently, threads are already added to an 'epoll group'
as exclusive, so when you have multiple threads blocking
on an epoll group, only one wakes up. In our use case, we
have a 1-to-1 mapping between threads and epoll groups, so
we don't have spurious or unbalanced wakeups there.

That suggests the alternative user-space model for
addressing this problem: have a single epoll
group added with EPOLLONESHOT. In this way threads
can pull work or events off of a single queue, work on
the event and then re-arm (such that other threads don't
see events from that source in the meantime).
This, however, means all threads work on all events, and
they all have to synchronize to an extent on the single queue.
That is, all register/unregister and re-arm events on that
queue have to be visible or synchronized for all the waiters.
This model also doesn't allow userspace to partition events that
are naturally local to a thread, since there is a single epoll group.
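
For concreteness, a minimal sketch of that single-queue EPOLLONESHOT
model ('shared_epfd' is a hypothetical shared epoll fd whose sources
were added with EPOLLIN | EPOLLONESHOT):

#include <sys/epoll.h>

static void oneshot_worker(int shared_epfd)
{
	struct epoll_event out, ev;

	for (;;) {
		if (epoll_wait(shared_epfd, &out, 1, -1) != 1)
			continue;

		/* ... handle the event; the source stays disarmed, so no
		 * other thread sees it in the meantime ... */

		ev.events = EPOLLIN | EPOLLONESHOT;	/* re-arm */
		ev.data.fd = out.data.fd;
		epoll_ctl(shared_epfd, EPOLL_CTL_MOD, out.data.fd, &ev);
	}
}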

The second userspace model is to have worker threads with
their own separate epoll groups, and then have separate
thread(s) to address the shared wakeup sources. Then the
threads that are waiting on the shared wakeup sources can
distribute the events fairly to the worker threads. This involves
extra context switching for shared events, and I think it ends up
degenerating back into the original problem.

> The disadvantages: slightly higher management micro-costs, 
> plus a global waitqueue list, which used to be read-mostly, 
> is now actively dirtied by every event, adding more global 
> serialization. The latter is somewhat muted by the fact 
> that the waitqueue lock itself is already a global 
> serialization point today and got dirtied by every event, 
> and the list head is next to it, usually in the same 
> cacheline.

Yes, I'm a bit concerned about the changes to the core
wakeup function; however, in the non-rotate case the
only additional write is to initialize the 'rotate_list' on entry.
I measured the latency of __wake_up_common() for
the case where this code was added and we were not
doing 'rotate', and I didn't measure any additional latency
with ftrace, but it perhaps warrants more careful testing.

The outstanding issues I have are:

1) Does epoll need two new flags here, EPOLLEXCLUSIVE and
EPOLLROUNDROBIN? That is, should they just be combined, since
EPOLLROUNDROBIN depends on EPOLLEXCLUSIVE, or is there
a valid use for just EPOLLEXCLUSIVE (wake up the first waiter
but don't do the balancing)?

2) The concern Andy raised regarding potential starvation.
That is, could an adversarial thread cause us to miss wakeups
if it can add itself exclusively to the shared wakeup source?
Currently, the adversarial thread would simply need to be
able to open the file in question. For things like pipes, this is
not an issue, but perhaps it is for files in the global
namespace...

Thanks,

-Jason


* [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  @ 2015-02-17 19:33 ` Jason Baron
  2015-02-18  8:07   ` Ingo Molnar
       [not found]       ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com>
  0 siblings, 2 replies; 44+ results
From: Jason Baron @ 2015-02-17 19:33 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel, linux-api

Epoll file descriptors that are added to a shared wakeup source are always
added in a non-exclusive manner. That means that when we have multiple epoll
fds attached to a shared wakeup source they are all woken up. This can
lead to excessive cpu usage and uneven load distribution.

This patch introduces two new 'events' flags that are intended to be used
with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
source in an exclusive manner such that the minimum number of threads are
woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
also be added to the 'events' flag, such that we round robin through the set
of waiting threads.

An implementation note is that in the epoll wakeup routine,
'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
wakeup, only when there are current waiters. The idea is to use this
additional heuristic in order to minimize wakeup latencies.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
 include/uapi/linux/eventpoll.h |  6 ++++++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..382c832 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,8 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
+			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1082,8 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLROUNDROBIN)
+		return ewake;
 	return 1;
 }
 
@@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLROUNDROBIN)
+			add_wait_queue_rr(whead, &pwq->wait);
+		else if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
 SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		struct epoll_event __user *, event)
 {
-	int error;
-	int full_check = 0;
+	int error, full_check = 0, wait_flags = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
@@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
+	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
+	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..10260a1 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,12 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Balance wakeups for a shared event source */
+#define EPOLLROUNDROBIN (1 << 27)
+
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
1.8.2.rc2



* [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  @ 2015-02-09 20:06 ` Jason Baron
  2015-02-09 20:18   ` Andy Lutomirski
  2015-02-09 20:27   ` Michael Kerrisk
  0 siblings, 2 replies; 44+ results
From: Jason Baron @ 2015-02-09 20:06 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

Epoll file descriptors that are added to a shared wakeup source are always
added in a non-exclusive manner. That means that when we have multiple epoll
fds attached to a shared wakeup source they are all woken up. This can
lead to excessive cpu usage and uneven load distribution.

This patch introduces two new 'events' flags that are intended to be used
with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
source in an exclusive manner such that the minimum number of threads are
woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
also be added to the 'events' flag, such that we round robin around the set
of waiting threads.

An implementation note is that in the epoll wakeup routine,
'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
wakeup, only when there are current waiters. The idea is to use this
additional heuristic in order to minimize wakeup latencies.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
 include/uapi/linux/eventpoll.h |  6 ++++++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..382c832 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,8 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
+			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1082,8 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLROUNDROBIN)
+		return ewake;
 	return 1;
 }
 
@@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLROUNDROBIN)
+			add_wait_queue_rr(whead, &pwq->wait);
+		else if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
 SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		struct epoll_event __user *, event)
 {
-	int error;
-	int full_check = 0;
+	int error, full_check = 0, wait_flags = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
@@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
+	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
+	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..10260a1 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,12 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Balance wakeups for a shared event source */
+#define EPOLLROUNDROBIN (1 << 27)
+
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
1.8.2.rc2



* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 17:51             ` Ingo Molnar
@ 2015-02-19  3:26               ` Jason Baron
  2015-02-22  0:24                 ` Eric Wong
  2015-02-18 22:18               ` Eric Wong
  1 sibling, 1 reply; 44+ results
From: Jason Baron @ 2015-02-19  3:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra,
	luto@amacapital.net >> Andy Lutomirski

On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>>> [...] However, I think the userspace API change is less 
>>> clear since epoll_wait() doesn't currently have an 
>>> 'input' events argument as epoll_ctl() does.
>> ... but the change would be a bit clearer and somewhat 
>> more flexible: LIFO or FIFO queueing, right?
>>
>> But having the queueing model as part of the epoll 
>> context is a legitimate approach as well.
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
>
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.
>
> Where events are task driven the scheduler will already try 
> to pair up waker and wakee so it might not show up in 
> measurements that markedly.
>

Right, so this makes me think that we may want to potentially
support a variety of wakeup policies. Adding these to the
generic wakeup code is just going to be too messy. So, perhaps
a better approach here would be to register a single
wait_queue_t with the event source queue that will always
be woken up, and then layer any epoll balancing/irq affinity
policies on top of that. So in essence we end up with two
queue layers, but I think it provides much nicer isolation
between layers. Also, the bulk of the changes are going to be
isolated to the epoll code, and we avoid Andy's concern about
missing or starving out wakeups.

So here's a stab at how this API could look:

1. ep1 = epoll_create1(EPOLL_POLICY);

So EPOLL_POLICY here could the round robin policy described
here, or the irq affinity or other ideas. The idea is to create
an fd that is local to the process, such that other processes
can not subsequently attach to it and affect our policy.

2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);

This associates ep1 with the event source. ep1 can be
associated with or added to at most 1 wakeup source. This call
would largely just form the association, but not queue anything
to the fd_source wait queue.

3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
    epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
    epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
     .
     .
     .

Finally, we add the epoll sets to the event source (indirectly via
ep1). So the first add would actually queue the callback to the
fd_source, while the subsequent calls would simply queue things
to the 'nested' wakeup queue associated with ep1.

So any existing epoll/poll/select calls could be queued as well
to fd_source and will operate independently from this mechanism,
as the fd_source queue continues to be 'wake all'. Also, there
should be no changes necessary to __wake_up_common(), other
than potentially passing more back through the
wait_queue_func_t, such as 'nr_exclusive'.

Thanks,

-Jason

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-17 19:33 ` [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
@ 2015-02-18  8:07   ` Ingo Molnar
  2015-02-18 15:42     ` Jason Baron
       [not found]       ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com>
  1 sibling, 1 reply; 44+ results
From: Ingo Molnar @ 2015-02-18  8:07 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra


* Jason Baron <jbaron@akamai.com> wrote:

> Epoll file descriptors that are added to a shared wakeup 
> source are always added in a non-exclusive manner. That 
> means that when we have multiple epoll fds attached to a 
> shared wakeup source they are all woken up. This can lead 
> to excessive cpu usage and uneven load distribution.
> 
> This patch introduces two new 'events' flags that are 
> intended to be used with EPOLL_CTL_ADD operations. 
> EPOLLEXCLUSIVE, adds the epoll fd to the event source in 
> an exclusive manner such that the minimum number of 
> threads are woken. EPOLLROUNDROBIN, which depends on 
> EPOLLEXCLUSIVE also being set, can also be added to the 
> 'events' flag, such that we round robin through the set 
> of waiting threads.
> 
> An implementation note is that in the epoll wakeup 
> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, 
> we return 1, for a successful wakeup, only when there are 
> current waiters. The idea is to use this additional 
> heuristic in order minimize wakeup latencies.
> 
> Signed-off-by: Jason Baron <jbaron@akamai.com>
> ---
>  fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
>  include/uapi/linux/eventpoll.h |  6 ++++++
>  2 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index d77f944..382c832 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,8 @@
>   */
>  
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
> +			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>  
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	unsigned long flags;
>  	struct epitem *epi = ep_item_from_wait(wait);
>  	struct eventpoll *ep = epi->ep;
> +	int ewake = 0;
>  
>  	if ((unsigned long)key & POLLFREE) {
>  		ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>  	 * wait list.
>  	 */
> -	if (waitqueue_active(&ep->wq))
> +	if (waitqueue_active(&ep->wq)) {
> +		ewake = 1;
>  		wake_up_locked(&ep->wq);
> +	}
>  	if (waitqueue_active(&ep->poll_wait))
>  		pwake++;
>  
> @@ -1078,6 +1082,8 @@ out_unlock:
>  	if (pwake)
>  		ep_poll_safewake(&ep->poll_wait);
>  
> +	if (epi->event.events & EPOLLROUNDROBIN)
> +		return ewake;
>  	return 1;
>  }
>  
> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>  		pwq->whead = whead;
>  		pwq->base = epi;
> -		add_wait_queue(whead, &pwq->wait);
> +		if (epi->event.events & EPOLLROUNDROBIN)
> +			add_wait_queue_rr(whead, &pwq->wait);
> +		else if (epi->event.events & EPOLLEXCLUSIVE)
> +			add_wait_queue_exclusive(whead, &pwq->wait);
> +		else
> +			add_wait_queue(whead, &pwq->wait);
>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>  		epi->nwait++;
>  	} else {
> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>  		struct epoll_event __user *, event)
>  {
> -	int error;
> -	int full_check = 0;
> +	int error, full_check = 0, wait_flags = 0;
>  	struct fd f, tf;
>  	struct eventpoll *ep;
>  	struct epitem *epi;
> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>  	if (f.file == tf.file || !is_file_epoll(f.file))
>  		goto error_tgt_fput;
>  
> +	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
> +	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
> +	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
> +		goto error_tgt_fput;
> +
>  	/*
>  	 * At this point it is safe to assume that the "private_data" contains
>  	 * our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index bc81fb2..10260a1 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,12 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
>  
> +/* Balance wakeups for a shared event source */
> +#define EPOLLROUNDROBIN (1 << 27)
> +
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent system suspends
>   * from happening while those events are being processed.

So let me rephrase the justification of your two patches:

Unlike regular waitqueue usage (where threads remove 
themselves from the waitqueue once they receive a wakeup), 
epoll waitqueues are static and 'persistent': epoll target 
threads are on the poll waitqueue indefinitely, only 
register/unregister removes threads from them.

So they are not really 'wait queues', but static 'task 
lists', and are thus exposed to classic thundering herd 
scheduling problems and scheduling assymetries: a single 
event on a shared event source will wake all epoll 
'task-lists', and not only will it wake them, but due to 
the static nature of the lists, even an exclusive wakeup 
will iterate along the list with O(N) overhead, until it 
finds a wakeable thread.

As the number of lists and the number of threads in the 
lists increases this scales suboptimally, and it also looks 
slightly odd that a random set of epoll worker threads is 
'more equal' than the others, in receiving a comparably 
higher proportion of events.

The solution is to add this new ABI to allow epoll events 
to be actively load-balanced both between the persistent 
'task lists', and to also allow the individual task lists 
to act as dynamic runqueues: the head of the list is likely 
to be sleeping, newly woken tasks get moved to the tail of 
the list.

This has two main advantages: firstly it solves the O(N) 
(micro-)problem, but it also more evenly distributes events 
both between task-lists and within epoll groups as tasks as 
well.

The disadvantages: slightly higher management micro-costs, 
plus a global waitqueue list, which used to be read-mostly, 
is now actively dirtied by every event, adding more global 
serialization. The latter is somewhat muted by the fact 
that the waitqueue lock itself is already a global 
serialization point today and got dirtied by every event, 
and the list head is next to it, usually in the same 
cacheline.

Did I get it right?

Thanks,

	Ingo


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-10  4:49           ` Eric Wong
@ 2015-02-10 19:16             ` Jason Baron
  2015-02-10 19:32               ` Eric Wong
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2015-02-10 19:16 UTC (permalink / raw)
  To: Eric Wong
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

On 02/09/2015 11:49 PM, Eric Wong wrote:
> Jason Baron <jbaron@akamai.com> wrote:
>> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
>>> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
>>>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>>>> Epoll file descriptors that are added to a shared wakeup source are always
>>>>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>>>>> fds attached to a shared wakeup source they are all woken up. This can
>>>>>> lead to excessive cpu usage and uneven load distribution.
>>>>>>
>>>>>> This patch introduces two new 'events' flags that are intended to be used
>>>>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
>>>>>> source in an exclusive manner such that the minimum number of threads are
>>>>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>>>>>> also be added to the 'events' flag, such that we round robin around the set
>>>>>> of waiting threads.
>>>>>>
>>>>>> An implementation note is that in the epoll wakeup routine,
>>>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
>>>>>> wakeup, only when there are current waiters. The idea is to use this additional
>>>>>> heuristic in order minimize wakeup latencies.
>>>>> I don't understand what this is intended to do.
>>>>>
>>>>> If an event has EPOLLONESHOT, then this only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>>>>>
>>>> hmm...so with EPOLLONESHOT you basically get notified once about an event. If i have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular usecase...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
>>> Now I understand.  Why are you using multiple epollfds?
>>>
>>> --Andy
>> So the multiple epollfds is really a way to partition the set of
>> events. Otherwise, I have all the threads contending on all the events
>> that are being generated. So I'm not sure if that is scalable.
> I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
> maxevents value is sufficient for you.  All events would be shared, so
> they can migrate between threads(*).  Each thread takes a largish set of
> events on every epoll_wait call and doesn't call epoll_wait again until
> it's done with the whole set it got.
>
> You'll hit more contention on EPOLL_CTL_MOD with shared events and a
> single epoll, but I think it's a better goal to make that lock-free.

It's not just EPOLL_CTL_MOD; there's also going to be contention on
epoll add and remove, since there is only 1 epoll fd in this case. I'm
also concerned about the balancing of the workload among threads in the
single queue case. I think it's quite reasonable to have user-space
partition the set of events among threads as it sees fit using multiple
epoll fds.

However, currently this multiple epoll fd scheme does not handle events
from a shared event source well. As I mentioned, there is a thundering
herd wakeup in this case, and the wakeups are unbalanced. In fact, we
have an application that currently does EPOLL_CTL_DELs followed by
EPOLL_CTL_ADDs periodically against a shared wakeup source in order to
re-balance the wakeup queues. This solves the balancing issues to an
extent, but not the thundering herd. I'd like to move this logic down
into the kernel with this patch set.

> (*) Too large a maxevents will lead to head-of-line blocking, but from
> what I'm inferring, you already risk that with multiple epollfds and
> separate threads working on them.
>
> Do you have a userland use case to share?

I've been trying to describe the use case; maybe I haven't been doing a
good job :(

>> In the use-case I'm trying to describe, I've partitioned a large set
>> of the events, but there may still be some event sources that we wish
>> to share among all of the threads (or even subsets of them), so as not
>> to overload any one in particular.
>  
>> More specifically, in the case of a single listen socket, its natural
>> to call accept() on the thread that has been woken up, but without
>> doing round robin, you quickly get into a very unbalanced load, and in
>> addition you waste a lot of cpu doing unnecessary wakeups. There are
>> other approaches to solve this, specifically using SO_REUSEPORT, which
>> creates a separate socket per-thread and gets one back to the
>> separately partitioned events case previously described. However,
>> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
>> does not have knowledge of the threads that are actively waiting as
>> the epoll code does.
> Did you try my suggestion of using a dedicated thread (or thread pool)
> which does nothing but loop on accept() + EPOLL_CTL_ADD?
>
> Those dedicated threads could do its own round-robin in userland to pick
> a different epollfd to call EPOLL_CTL_ADD on.

Thanks for your suggestion! I'm not actively working on the user-space
code here, but I will pass it along.

I would prefer, though, not to have to context switch the 'accept'
thread on and off the cpu every time there is a new connection. So the
approach suggested here essentially moves this dedicated thread (or
threads) down into the kernel and avoids the creation of these threads
entirely.
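
For reference, a sketch of the suggested userland alternative
('listen_fd', 'worker_epfds' and 'nworkers' are hypothetical names):

#include <sys/epoll.h>
#include <sys/socket.h>

extern int listen_fd;		/* hypothetical shared listening socket */
extern int worker_epfds[];	/* hypothetical per-worker epoll fds */
extern unsigned int nworkers;

static void *acceptor(void *arg)
{
	unsigned int next = 0;

	(void)arg;
	for (;;) {
		struct epoll_event ev;
		int c = accept(listen_fd, NULL, NULL);

		if (c < 0)
			continue;
		ev.events = EPOLLIN;
		ev.data.fd = c;
		/* round-robin new connections across the worker epfds */
		epoll_ctl(worker_epfds[next++ % nworkers],
			  EPOLL_CTL_ADD, c, &ev);
	}
	return NULL;
}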

Thanks,

-Jason


* [PATCH net-next] epoll: add EPOLLEXCLUSIVE support
@ 2012-02-14 20:48 Hagen Paul Pfeifer
  2012-02-14 21:06 ` Eric Dumazet
  2012-02-14 21:23 ` David Miller
  0 siblings, 2 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-02-14 20:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Hagen Paul Pfeifer, Davide Libenzi, Eric Dumazet

High-performance servers sometimes create one listening socket (e.g. port
80), create an epoll file descriptor and add the socket. Afterwards they
create SC_NPROCESSORS_ONLN threads and wait for events. This often results
in a thundering herd problem because all CPUs are scheduled.

This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
If a descriptor is added with this flag, only one CPU is scheduled in.
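
For illustration, a sketch of that server pattern (not part of the
patch; 'make_listen_socket()' is an assumed helper, and the flag value
is the one proposed here):

#include <pthread.h>
#include <unistd.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1 << 29)	/* value proposed by this patch */
#endif

extern int make_listen_socket(void);	/* assumed: socket/bind/listen */

static void *wait_loop(void *arg)
{
	int epfd = *(int *)arg;
	struct epoll_event out;

	for (;;) {
		if (epoll_wait(epfd, &out, 1, -1) == 1)
			; /* accept() and service the connection */
	}
	return NULL;
}

int main(void)
{
	int listen_fd = make_listen_socket();
	int epfd = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE,
				  .data.fd = listen_fd };
	long i, n = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tid;

	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
	for (i = 0; i < n; i++)
		pthread_create(&tid, NULL, wait_loop, &epfd);
	pause();
	return 0;
}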

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Reported-by: Li Yu <raise.sail@gmail.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/eventpoll.c            |    7 +++++--
 include/linux/eventpoll.h |    3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aabdfc3..bb442b1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -88,7 +88,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -913,7 +913,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (unlikely(epi->event.events & EPOLLEXCLUSIVE))
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 657ab55..d334389 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Set Exclusive wake up behaviour for the target file descriptor */
+#define EPOLLEXCLUSIVE (1 << 29)
+
 /* Set the One Shot behaviour for the target file descriptor */
 #define EPOLLONESHOT (1 << 30)
 
-- 
1.7.9



* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 22:45       ` Andy Lutomirski
@ 2015-02-10  3:59         ` Jason Baron
  2015-02-10  4:49           ` Eric Wong
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2015-02-10  3:59 UTC (permalink / raw)
  To: Andy Lutomirski, Linux API
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong,
	Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>> Epoll file descriptors that are added to a shared wakeup source are always
>>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>>> fds attached to a shared wakeup source they are all woken up. This can
>>>> lead to excessive cpu usage and uneven load distribution.
>>>>
>>>> This patch introduces two new 'events' flags that are intended to be used
>>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
>>>> source in an exclusive manner such that the minimum number of threads are
>>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>>>> also be added to the 'events' flag, such that we round robin around the set
>>>> of waiting threads.
>>>>
>>>> An implementation note is that in the epoll wakeup routine,
>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
>>>> wakeup, only when there are current waiters. The idea is to use this additional
>>>> heuristic in order minimize wakeup latencies.
>>> I don't understand what this is intended to do.
>>>
>>> If an event has EPOLLONESHOT, then this only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>>>
>> hmm...so with EPOLLONESHOT you basically get notified once about an event. If i have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular usecase...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
> Now I understand.  Why are you using multiple epollfds?
>
> --Andy

So using multiple epollfds is really a way to partition the set of events. Otherwise, I have all the threads contending on all the events that are being generated. So I'm not sure if that is scalable.

In the use-case I'm trying to describe, I've partitioned a large set of the events, but there may still be some event sources that we wish to share among all of the threads (or even subsets of them), so as not to overload any one in particular.

More specifically, in the case of a single listen socket, it's natural to call accept() on the thread that has been woken up, but without doing round robin, you quickly get into a very unbalanced load, and in addition you waste a lot of cpu doing unnecessary wakeups. There are other approaches to solve this, specifically using SO_REUSEPORT, which creates a separate socket per-thread and gets one back to the separately partitioned events case previously described. However, SO_REUSEPORT, I believe, is very specific to tcp/udp, and in addition does not have knowledge of the threads that are actively waiting as the epoll code does.

Thanks,

-Jason


* [PATCH] epoll: add EPOLLEXCLUSIVE support
@ 2012-03-25 19:12 Hagen Paul Pfeifer
  0 siblings, 0 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-25 19:12 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, Hagen Paul Pfeifer

High-performance servers sometimes create one listening socket (e.g. port
80), create an epoll file descriptor and add the socket. Afterwards they
create SC_NPROCESSORS_ONLN threads and wait for events. This often results
in a thundering herd problem because all CPUs are scheduled.

This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
If a descriptor is added with this flag, only one CPU is scheduled in.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---

Dave rejected the patch, saying it is not network specific. Because there
is no epoll maintainer, I am now sending it directly.

 fs/eventpoll.c            |    7 +++++--
 include/linux/eventpoll.h |    3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 629e9ed..16d787f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -88,7 +88,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -969,7 +969,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (unlikely(epi->event.events & EPOLLEXCLUSIVE))
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 657ab55..d334389 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Set Exclusive wake up behaviour for the target file descriptor */
+#define EPOLLEXCLUSIVE (1 << 29)
+
 /* Set the One Shot behaviour for the target file descriptor */
 #define EPOLLONESHOT (1 << 30)
 
-- 
1.7.9.1



* [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
@ 2012-03-28 13:57 Hagen Paul Pfeifer
  2012-03-28 14:09 ` richard -rw- weinberger
  0 siblings, 1 reply; 44+ results
  0 siblings, 1 reply; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-28 13:57 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, Hagen Paul Pfeifer

High-performance servers sometimes create one listening socket (e.g. port
80), create an epoll file descriptor and add the socket. Afterwards they
create SC_NPROCESSORS_ONLN threads and wait for events. This often results
in a thundering herd problem because all CPUs are scheduled.

This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
If a descriptor is added with this flag, only one CPU is scheduled in.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
Dave rejected the patch, saying it is not network specific. Because there
is no epoll maintainer, this time I am sending it directly.

 fs/eventpoll.c            |    7 +++++--
 include/linux/eventpoll.h |    3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 629e9ed..16d787f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -88,7 +88,7 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -969,7 +969,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (unlikely(epi->event.events & EPOLLEXCLUSIVE))
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 657ab55..d334389 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -26,6 +26,9 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Set Exclusive wake up behaviour for the target file descriptor */
+#define EPOLLEXCLUSIVE (1 << 29)
+
 /* Set the One Shot behaviour for the target file descriptor */
 #define EPOLLONESHOT (1 << 30)
 
-- 
1.7.9.1


^ permalink raw reply related	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 17:38 99%         ` Jason Baron
  2015-02-18 17:45 99%           ` Ingo Molnar
@ 2015-02-18 23:12 99%           ` Andy Lutomirski
  1 sibling, 0 replies; 44+ results
From: Andy Lutomirski @ 2015-02-18 23:12 UTC (permalink / raw)
  To: Jason Baron
  Cc: Davide Libenzi, Michael Kerrisk, Linux FS Devel, Eric Wong,
	Andrew Morton, Peter Zijlstra, linux-kernel@vger.kernel.org,
	Al Viro, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	Ingo Molnar, Linux API, Ingo Molnar

On Feb 18, 2015 9:38 AM, "Jason Baron" <jbaron@akamai.com> wrote:
>
> On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> > * Jason Baron <jbaron@akamai.com> wrote:
> >
> >>> This has two main advantages: firstly it solves the
> >>> O(N) (micro-)problem, but it also more evenly
> >>> distributes events both between task-lists and within
> >>> epoll groups as tasks as well.
> >> Its solving 2 issues - spurious wakeups, and more even
> >> loading of threads. The event distribution is more even
> >> between 'epoll groups' with this patch, however, if
> >> multiple threads are blocking on a single 'epoll group',
> >> this patch does not affect the event distribution
> >> there. [...]
> > Regarding your last point, are you sure about that?
> >
> > If we have say 16 epoll threads registered, and if the list
> > is static (no register/unregister activity), then the
> > wakeup pattern is in strict order of the list: threads
> > closer to the list head will be woken more frequently, in a
> > wake-once fashion. So if threads do just quick work and go
> > back to sleep quickly, then typically only the first 2-3
> > threads will get any runtime in practice - the wakeup
> > iteration never gets 'deep' into the list.
> >
> > With the round-robin shuffling of the list, the threads get
> > shuffled to the tail on wakeup, which distributes events
> > evenly: all 16 epoll threads will accumulate an even
> > distribution of runtime, statistically.
> >
> > Have I misunderstood this somehow?
> >
> >
>
> So in the case of multiple threads per epoll set, we currently
> add to the head of wakeup queue exclusively in 'epoll_wait()',
> and then subsequently remove from the queue once
> 'epoll_wait()' returns. So I don't think this patch addresses
> balancing on a per epoll set basis.
>
> I think we could address the case you describe by simply doing
> __add_wait_queue_tail_exclusive() instead of
> __add_wait_queue_exclusive() in epoll_wait(). However, I think
> the userspace API change is less clear since epoll_wait() doesn't
> currently have an 'input' events argument as epoll_ctl() does.

FWIW there's currently discussion about adding a new epoll API for
batch epoll_ctl.  It could be worth coordinating with that effort to see
if some variant could address both use cases.

I'm still nervous about changing the per-fd wakeup stuff to do
anything other than waking everything.  After all, epoll and poll can
be used concurrently.

What about a slightly different approach: could an epoll fd support
multiple contexts?  For example, an fd could be set (with epoll_ctl or
the new batch stuff) to wake any epoll waiter, one specific epoll
waiter, an epoll waiter preferably on the waking cpu, etc.

This would have the benefit of keeping the wakeup changes localized to
the epoll code.

--Andy

>
> Thanks,
>
> -Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[relevance 99%]

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-10  3:59 99%         ` Jason Baron
@ 2015-02-10  4:49 99%           ` Eric Wong
  2015-02-10 19:16 99%             ` Jason Baron
  0 siblings, 1 reply; 44+ results
From: Eric Wong @ 2015-02-10  4:49 UTC (permalink / raw)
  To: Jason Baron
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
> >> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> >>> On 02/09/2015 12:06 PM, Jason Baron wrote:
> >>>> Epoll file descriptors that are added to a shared wakeup source are always
> >>>> added in a non-exclusive manner. That means that when we have multiple epoll
> >>>> fds attached to a shared wakeup source they are all woken up. This can
> >>>> lead to excessive cpu usage and uneven load distribution.
> >>>>
> >>>> This patch introduces two new 'events' flags that are intended to be used
> >>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> >>>> source in an exclusive manner such that the minimum number of threads are
> >>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> >>>> also be added to the 'events' flag, such that we round robin around the set
> >>>> of waiting threads.
> >>>>
> >>>> An implementation note is that in the epoll wakeup routine,
> >>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> >>>> wakeup, only when there are current waiters. The idea is to use this additional
> >>>> heuristic in order to minimize wakeup latencies.
> >>> I don't understand what this is intended to do.
> >>>
> >>> If an event has EPOLLONESHOT, then only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
> >>>
> >> hmm...so with EPOLLONESHOT you basically get notified once about an event. If i have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular usecase...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
> > Now I understand.  Why are you using multiple epollfds?
> >
> > --Andy
> 
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.
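
Roughly, that pattern could look like the sketch below (handle() is a
hypothetical stand-in for the per-event work, and EPOLLIN is assumed):

#include <sys/epoll.h>

#define MAXEVENTS 256

extern void handle(int fd);     /* hypothetical */

void worker_loop(int epfd)
{
        struct epoll_event events[MAXEVENTS];
        int i, n;

        for (;;) {
                n = epoll_wait(epfd, events, MAXEVENTS, -1);
                for (i = 0; i < n; i++) {
                        handle(events[i].data.fd);
                        /* EPOLLONESHOT disarms an entry once it fires;
                         * re-arm it when this thread is done with it */
                        events[i].events = EPOLLIN | EPOLLONESHOT;
                        epoll_ctl(epfd, EPOLL_CTL_MOD,
                                  events[i].data.fd, &events[i]);
                }
        }
}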

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.

Do you have a userland use case to share?

> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
 
> More specifically, in the case of a single listen socket, its natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do its own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.
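
A rough sketch of that dedicated-acceptor idea (the acceptor_ctx layout
and its field names are hypothetical; error handling elided):

#define _GNU_SOURCE             /* for accept4() */
#include <pthread.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>

struct acceptor_ctx {
        int listen_fd;
        int *epfds;             /* one epoll fd per worker thread */
        size_t nr_epfds;
};

static void *acceptor(void *arg)
{
        struct acceptor_ctx *c = arg;
        size_t i = 0;

        for (;;) {
                int cfd = accept4(c->listen_fd, NULL, NULL, SOCK_NONBLOCK);
                struct epoll_event ev = { .events = EPOLLIN, .data.fd = cfd };

                if (cfd < 0)
                        continue;
                /* userland round-robin across the workers' epoll fds */
                epoll_ctl(c->epfds[i], EPOLL_CTL_ADD, cfd, &ev);
                i = (i + 1) % c->nr_epfds;
        }
        return NULL;
}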

^ permalink raw reply	[relevance 99%]

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:18 99%   ` Andy Lutomirski
@ 2015-02-09 21:32 99%     ` Jason Baron
  2015-02-09 22:45 99%       ` Andy Lutomirski
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2015-02-09 21:32 UTC (permalink / raw)
  To: Andy Lutomirski, peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> On 02/09/2015 12:06 PM, Jason Baron wrote:
>> Epoll file descriptors that are added to a shared wakeup source are always
>> added in a non-exclusive manner. That means that when we have multiple epoll
>> fds attached to a shared wakeup source they are all woken up. This can
>> lead to excessive cpu usage and uneven load distribution.
>>
>> This patch introduces two new 'events' flags that are intended to be used
>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
>> source in an exclusive manner such that the minimum number of threads are
>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>> also be added to the 'events' flag, such that we round robin around the set
>> of waiting threads.
>>
>> An implementation note is that in the epoll wakeup routine,
>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
>> wakeup, only when there are current waiters. The idea is to use this additional
>> heuristic in order to minimize wakeup latencies.
>
> I don't understand what this is intended to do.
>
> If an event has EPOLLONESHOT, then only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>

hmm...so with EPOLLONESHOT you basically get notified once about an event. If I have multiple epoll fds (say 1 per thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular use case...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
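
The re-arm cost looks roughly like this sketch (the per-thread epfd array
and the event mask are assumed for illustration):

#include <sys/epoll.h>

/* hypothetical helper: after a one-shot event fires, every per-thread
 * epoll set has to be re-armed for the shared source */
static void rearm_all(int *epfd, int nthreads, int source_fd)
{
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                  .data.fd = source_fd };
        int i;

        for (i = 0; i < nthreads; i++)
                epoll_ctl(epfd[i], EPOLL_CTL_MOD, source_fd, &ev);
}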

> If an event has EPOLLET, then the considerations are similar to EPOLLONESHOT, right?
>

EPOLLET is still going to cause this thundering herd.

> If an event is a normal level-triggered non-one-shot event, then I don't understand how a round-robin wakeup makes any sense.  It's level-triggered, after all.

Yeah, so the current behavior is to wake up all of the threads. I'm trying to add a new mode where it load balances among the threads interested in the event. Perhaps the test program I attached to 0/2 will show the issue better?

Also, this originally came up in the context of a single listening socket which was attached to multiple epoll fds, each in a separate thread. With the attached patch, I can measure a large decrease in cpu usage and better balancing behavior among the accepting threads.

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 17:51 99%             ` Ingo Molnar
  2015-02-19  3:26 99%               ` Jason Baron
@ 2015-02-18 22:18 99%               ` Eric Wong
  1 sibling, 0 replies; 44+ results
From: Eric Wong @ 2015-02-18 22:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Baron, peterz, mingo, viro, akpm, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra

Ingo Molnar <mingo@kernel.org> wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > > [...] However, I think the userspace API change is less 
> > > clear since epoll_wait() doesn't currently have an 
> > > 'input' events argument as epoll_ctl() does.
> > 
> > ... but the change would be a bit clearer and somewhat 
> > more flexible: LIFO or FIFO queueing, right?
> > 
> > But having the queueing model as part of the epoll 
> > context is a legitimate approach as well.
> 
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
> 
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.

Right.  One thing in the back of my mind has been to have CPU
affinity for epoll.  Either having everything in an epoll set
favor a certain CPU or even having affinity down to the epitem
level (so concurrent epoll_wait callers end up favoring the
same epitems).

I'm not convinced this series is worth doing without a
comparison against my previous suggestion to use a dedicated
thread which only makes blocking accept4 + EPOLL_CTL_ADD calls.

The majority of epoll events in a typical server should not be
for listen sockets, so I'd rather not bloat existing code paths
for them.  For web servers nowadays, the benefit of maintaining
long-lived connections to avoid handshakes is even greater with
increasing HTTPS and HTTP2 adoption; so listen socket events
should become less common.

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 15:42 99%     ` Jason Baron
@ 2015-02-18 16:33 99%       ` Ingo Molnar
  2015-02-18 17:38 99%         ` Jason Baron
  0 siblings, 1 reply; 44+ results
From: Ingo Molnar @ 2015-02-18 16:33 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra


* Jason Baron <jbaron@akamai.com> wrote:

> > This has two main advantages: firstly it solves the 
> > O(N) (micro-)problem, but it also more evenly 
> > distributes events both between task-lists and within 
> > epoll groups as tasks as well.
> 
> Its solving 2 issues - spurious wakeups, and more even 
> loading of threads. The event distribution is more even 
> between 'epoll groups' with this patch, however, if 
> multiple threads are blocking on a single 'epoll group', 
> this patch does not affect the event distribution
> there. [...]

Regarding your last point, are you sure about that?

If we have say 16 epoll threads registered, and if the list 
is static (no register/unregister activity), then the 
wakeup pattern is in strict order of the list: threads 
closer to the list head will be woken more frequently, in a 
wake-once fashion. So if threads do just quick work and go 
back to sleep quickly, then typically only the first 2-3 
threads will get any runtime in practice - the wakeup 
iteration never gets 'deep' into the list.

With the round-robin shuffling of the list, the threads get 
shuffled to the tail on wakeup, which distributes events 
evenly: all 16 epoll threads will accumulate an even 
distribution of runtime, statistically.

Have I misunderstood this somehow?

Thanks,

	Ingo

^ permalink raw reply	[relevance 99%]

* Re: [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
  2016-02-04 22:59 99%   ` Andrew Morton
@ 2016-02-05  9:49 99%     ` Madars Vitolins
  0 siblings, 0 replies; 44+ results
From: Madars Vitolins @ 2016-02-05  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Baron, torvalds, mtk.manpages, mingo, peterz, viro,
	normalperson, corbet, luto, hagen, linux-kernel, linux-fsdevel,
	linux-api

Andrew Morton @ 2016-02-05 00:59 wrote:
> On Thu, 04 Feb 2016 23:44:05 +0200 Madars Vitolins <m@silodev.com> 
> wrote:
> 
>> Hi Jason,
>> 
>> 
>> Just ran the original tests with this patch (eventpoll.c from
>> 4.5-rc2 + the patch below). Got the same good results, no regression.
>> 
>> $ time ./bankcl
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 1359856158.04 USD
>> Account balance is: 101528948.40 USD
>> 
>> real	0m41.826s
>> user	0m29.513s
>> sys	0m6.490s
>> 
>> 
>> Test case:
>> https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/
>> 
>> PS,
>> 
>> The original case's 0m24.953s vs the 0m41.826s now is probably related
>> to my pc setup. As I just re-ran the test with the original patch, I got
>> the same ~41 sec.
>> 
>> So I am fine with this patch!
>> 
> 
> Thanks, I shall add your Tested-by:
> 
> One thing we're sorely missing is an epoll test suite, in
> tools/testing/selftests.  If anyone has anything which we can use to
> kick things off, please hand it over ;)

Not a bad idea. We probably need a "tools/testing/selftests/eventpoll" 
folder, under which we should have test cases for various epoll scenarios 
with a common "run.sh". In my spare time I can try to build a case for the 
EXCLUSIVE flag (with queues & multiple processes :) ).
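
As a starting point, a minimal skeleton could count how many waiters wake
for a single event on a shared pipe (an untested sketch; it assumes headers
that define EPOLLEXCLUSIVE, i.e. 4.5+, and the pass/fail check is only
hinted at by the printf):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

#define NTHREADS 4

static int pfd[2];
static int wakeups;

static void *waiter(void *arg)
{
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

        epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);
        if (epoll_wait(epfd, &ev, 1, 2000) > 0) /* 2s timeout */
                __sync_fetch_and_add(&wakeups, 1);
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];
        int i;

        pipe(pfd);
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, waiter, NULL);
        sleep(1);               /* let all waiters block */
        write(pfd[1], "x", 1);  /* one event on the shared source */
        for (i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
        printf("woken: %d of %d (expect fewer than all)\n", wakeups, NTHREADS);
        return 0;
}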

Thanks,
Madars

^ permalink raw reply	[relevance 99%]

* Re: [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
  2016-02-04 21:44 99% ` Madars Vitolins
@ 2016-02-04 22:59 99%   ` Andrew Morton
  2016-02-05  9:49 99%     ` Madars Vitolins
  0 siblings, 1 reply; 44+ results
From: Andrew Morton @ 2016-02-04 22:59 UTC (permalink / raw)
  To: Madars Vitolins
  Cc: Jason Baron, torvalds, mtk.manpages, mingo, peterz, viro,
	normalperson, corbet, luto, hagen, linux-kernel, linux-fsdevel,
	linux-api

On Thu, 04 Feb 2016 23:44:05 +0200 Madars Vitolins <m@silodev.com> wrote:

> Hi Jason,
> 
> 
> Just ran the original tests with this patch (eventpoll.c from 
> 4.5-rc2 + the patch below). Got the same good results, no regression.
> 
> $ time ./bankcl
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 1359856158.04 USD
> Account balance is: 101528948.40 USD
> 
> real	0m41.826s
> user	0m29.513s
> sys	0m6.490s
> 
> 
> Test case: 
> https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/
> 
> PS,
> 
> The original case's 0m24.953s vs the 0m41.826s now is probably related to 
> my pc setup. As I just re-ran the test with the original patch, I got the 
> same ~41 sec.
> 
> So I am fine with this patch!
> 

Thanks, I shall add your Tested-by:

One thing we're sorely missing is an epoll test suite, in
tools/testing/selftests.  If anyone has anything which we can use to
kick things off, please hand it over ;)

^ permalink raw reply	[relevance 99%]

* Re: [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
  2016-02-04 15:39 99% [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT Jason Baron
@ 2016-02-04 21:44 99% ` Madars Vitolins
  2016-02-04 22:59 99%   ` Andrew Morton
  0 siblings, 1 reply; 44+ results
From: Madars Vitolins @ 2016-02-04 21:44 UTC (permalink / raw)
  To: Jason Baron
  Cc: akpm, torvalds, mtk.manpages, mingo, peterz, viro, normalperson,
	corbet, luto, hagen, linux-kernel, linux-fsdevel, linux-api

Hi Jason,


Just ran the original tests with this patch (eventpoll.c from 
4.5-rc2 + the patch below). Got the same good results, no regression.

$ time ./bankcl
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 101528948.40 USD

real	0m41.826s
user	0m29.513s
sys	0m6.490s


Test case: 
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

PS,

The original case's 0m24.953s vs the 0m41.826s now is probably related to 
my pc setup. As I just re-ran the test with the original patch, I got the 
same ~41 sec.

So I am fine with this patch!

Thanks,
Madars


Jason Baron @ 2016-02-04 17:39 wrote:
> In the current implementation of the EPOLLEXCLUSIVE flag (added for 
> 4.5-rc1),
> if epoll waiters create different POLL* sets and register them as 
> exclusive
> against the same target fd, the current implementation will stop waking 
> any
> further waiters once it finds the first idle waiter. This means that 
> waiters
> could miss wakeups in certain cases.
> 
> For example, when we wake up a pipe for reading we do:
> wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);
> So if one epoll set or epfd is added to pipe p with POLLIN and a second 
> set
> epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the 
> wakeup
> since the current implementation will stop after it finds any 
> intersection
> of events with a waiter that is blocked in epoll_wait().
> 
> We could potentially address this by requiring all epoll waiters that 
> are
> added to p be required to pass the same set of POLL* events. IE the 
> first
> EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* 
> flags to
> be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I 
> think
> it might be somewhat confusing interface as we would have to reference 
> count
> the number of users for that set, and so userspace would have to keep 
> track
> of that count, or we would need a more involved interface. It also adds 
> some
> shared state that we'd have store somewhere. I don't think anybody will 
> want
> to bloat __wait_queue_head for this.
> 
> I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE 
> such
> that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way 
> if
> the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit 
> the
> first idle waiter that specifies the EPOLLIN bit, since any remaining 
> waiters
> that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we 
> can do
> the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. 
> If both
> 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least 
> one
> example of this I saw in fs/pipe.c), then we just wake the entire 
> exclusive
> list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any
> performance critical path, so I think that's ok (in fs/pipe.c its in
> pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by 
> default
> in any exclusive set. Thus, the user can specify EPOLLERR and/or 
> EPOLLHUP but
> is not required to do so.
> 
> Since epoll waiters may be interested in other events as well besides 
> EPOLLIN,
> EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 
> 'dup' call
> on the target fd and adding that as one normally would with 
> EPOLL_CTL_ADD. Since
> I think that the POLLIN and POLLOUT events are what we are interest in
> balancing,
> I think that the 'dup' thing could perhaps be added to only one of the 
> waiter
> threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP 
> should
> be sufficient for the majority of use-cases.
> 
> Since EPOLLEXCLUSIVE is intended to be used with a target fd shared 
> among
> multiple epfds, where between 1 and n of the epfds may receive an 
> event, it
> does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would 
> get an
> event. Thus, it is not allowed to be specified in conjunction with
> EPOLLEXCLUSIVE.
> 
> EPOLL_CTL_MOD is also not allowed if the fd was previously added as
> EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
> interesting, but this could be relaxed at some further point.
> 
> Signed-off-by: Jason Baron <jbaron@akamai.com>
> ---
>  fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index ae1dbcf..cde6074 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -94,6 +94,11 @@
>  /* Epoll private bits inside the event mask */
>  #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | 
> EPOLLEXCLUSIVE)
> 
> +#define EPOLLINOUT_BITS (POLLIN | POLLOUT)
> +
> +#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | 
> \
> +				EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)
> +
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> 
> @@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	 * wait list.
>  	 */
>  	if (waitqueue_active(&ep->wq)) {
> -		ewake = 1;
> +		if ((epi->event.events & EPOLLEXCLUSIVE) &&
> +					!((unsigned long)key & POLLFREE)) {
> +			switch ((unsigned long)key & EPOLLINOUT_BITS) {
> +			case POLLIN:
> +				if (epi->event.events & POLLIN)
> +					ewake = 1;
> +				break;
> +			case POLLOUT:
> +				if (epi->event.events & POLLOUT)
> +					ewake = 1;
> +				break;
> +			case 0:
> +				ewake = 1;
> +				break;
> +			}
> +		}
>  		wake_up_locked(&ep->wq);
>  	}
>  	if (waitqueue_active(&ep->poll_wait))
> @@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
>  	 * Also, we do not currently supported nested exclusive wakeups.
>  	 */
> -	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
> -		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
> -		goto error_tgt_fput;
> +	if (epds.events & EPOLLEXCLUSIVE) {
> +		if (op == EPOLL_CTL_MOD)
> +			goto error_tgt_fput;
> +		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
> +				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
> +			goto error_tgt_fput;
> +	}
> 
>  	/*
>  	 * At this point it is safe to assume that the "private_data" 
> contains
> @@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  		break;
>  	case EPOLL_CTL_MOD:
>  		if (epi) {
> -			epds.events |= POLLERR | POLLHUP;
> -			error = ep_modify(ep, epi, &epds);
> +			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
> +				epds.events |= POLLERR | POLLHUP;
> +				error = ep_modify(ep, epi, &epds);
> +			}
>  		} else
>  			error = -ENOENT;
>  		break;

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
       [not found]       ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com>
@ 2015-02-27 22:24 99%     ` Jason Baron
  0 siblings, 0 replies; 44+ results
From: Jason Baron @ 2015-02-27 22:24 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: normalperson, linux-kernel, Michael Kerrisk-manpages,
	Peter Zijlstra, linux-fsdevel, viro, davidel, Andrew Morton,
	linux-api, mingo

Hi,

v3 of this series implements this idea using a different
approach:

http://lkml.iu.edu/hypermail/linux/kernel/1502.3/00667.html

If that still meets your needs it would be helpful to know in
order to move this forward.

Looking back at your posting, I was concerned about the
test case not lining up with the kernel code change.

Thanks,

-Jason

On 02/27/2015 03:56 PM, Hagen Paul Pfeifer wrote:
>
> Applause! Nice patch, sad that I submitted this patch ~3 years ago with
> exactly the same naming (EPOLLEXCLUSIVE) & a nearly identical commit
> message, and you rejected the patch ...
>
> Hagen
>


^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-22  0:24 99%                 ` Eric Wong
@ 2015-02-25 15:48 99%                   ` Jason Baron
  0 siblings, 0 replies; 44+ results
From: Jason Baron @ 2015-02-25 15:48 UTC (permalink / raw)
  To: Eric Wong
  Cc: Ingo Molnar, peterz, mingo, viro, akpm, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra,
	luto@amacapital.net >> Andy Lutomirski

On 02/21/2015 07:24 PM, Eric Wong wrote:
> Jason Baron <jbaron@akamai.com> wrote:
>> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
>>> * Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>>>> [...] However, I think the userspace API change is less 
>>>>> clear since epoll_wait() doesn't currently have an 
>>>>> 'input' events argument as epoll_ctl() does.
>>>> ... but the change would be a bit clearer and somewhat 
>>>> more flexible: LIFO or FIFO queueing, right?
>>>>
>>>> But having the queueing model as part of the epoll 
>>>> context is a legitimate approach as well.
>>> Btw., there's another optimization that the networking code 
>>> already does when processing incoming packets: waking up a 
>>> thread on the local CPU, where the wakeup is running.
>>>
>>> Doing the same on epoll would have real scalability 
>>> advantages where incoming events are IRQ driven and are 
>>> distributed amongst multiple CPUs.
>>>
>>> Where events are task driven the scheduler will already try 
>>> to pair up waker and wakee so it might not show up in 
>>> measurements that markedly.
>>>
>> Right, so this makes me think that we may want to potentially
>> support a variety of wakeup policies. Adding these to the
>> generic wake up code is just going to be too messy. So, perhaps
>> a better approach here would be to register a single
>> wait_queue_t with the event source queue that will always
>> be woken up, and then layer any epoll balancing/irq affinity
>> policies on top of that. So in essence we end up with sort of
>> two queues layers, but I think it provides much nicer isolation
>> between layers. Also, the bulk of the changes are going to be
>> isolated to the epoll code, and we avoid Andy's concern about
>> missing, or starving out wakeups.
>>
>> So here's a stab at how this API could look:
>>
>> 1. ep1 = epoll_create1(EPOLL_POLICY);
>>
>> So EPOLL_POLICY here could be the round robin policy described
>> here, or the irq affinity or other ideas. The idea is to create
>> an fd that is local to the process, such that other processes
>> can not subsequently attach to it and affect our policy.
> I'm not against defining more policies if needed.
> Maybe FIFO vs LIFO is a good case for this.
>
> For affinity, it could probably be done transparently based on
> epoll_wait retrievals + EPOLL_CTL_MOD operations.
>
>> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
>>
>> This associates ep1 with the event source. ep1 can be
>> associated with or added to at most 1 wakeup source. This call
>> would largely just form the association, but not queue anything
>> to the fd_source wait queue.
> This would mean one extra FD for every fd_source, but that's
> only a handful of FDs (listen sockets), correct?

Yes, one extra epoll fd per shared wakeup source, so this should
result in very few additional fds.

>> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
>>     epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
>>     epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>>      .
>>      .
>>      .
>>
>> Finally, we add the epoll sets to the event source (indirectly via
>> ep1). So the first add would actually queue the callback to the
>> fd_source. While the subsequent calls would simply queue things
>> to the 'nested' wakeup queue associated with ep1.
> I'm not sure I follow, wouldn't this increase the number of wakeups?

I agree, my text there is confusing...I've posted this idea as
v3 of this series, so hopefully that clarifies this approach.

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-19  3:26 99%               ` Jason Baron
@ 2015-02-22  0:24 99%                 ` Eric Wong
  2015-02-25 15:48 99%                   ` Jason Baron
  0 siblings, 1 reply; 44+ results
From: Eric Wong @ 2015-02-22  0:24 UTC (permalink / raw)
  To: Jason Baron
  Cc: Ingo Molnar, peterz, mingo, viro, akpm, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra,
	luto@amacapital.net >> Andy Lutomirski

Jason Baron <jbaron@akamai.com> wrote:
> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> >>> [...] However, I think the userspace API change is less 
> >>> clear since epoll_wait() doesn't currently have an 
> >>> 'input' events argument as epoll_ctl() does.
> >> ... but the change would be a bit clearer and somewhat 
> >> more flexible: LIFO or FIFO queueing, right?
> >>
> >> But having the queueing model as part of the epoll 
> >> context is a legitimate approach as well.
> > Btw., there's another optimization that the networking code 
> > already does when processing incoming packets: waking up a 
> > thread on the local CPU, where the wakeup is running.
> >
> > Doing the same on epoll would have real scalability 
> > advantages where incoming events are IRQ driven and are 
> > distributed amongst multiple CPUs.
> >
> > Where events are task driven the scheduler will already try 
> > to pair up waker and wakee so it might not show up in 
> > measurements that markedly.
> >
> 
> Right, so this makes me think that we may want to potentially
> support a variety of wakeup policies. Adding these to the
> generic wake up code is just going to be too messy. So, perhaps
> a better approach here would be to register a single
> wait_queue_t with the event source queue that will always
> be woken up, and then layer any epoll balancing/irq affinity
> policies on top of that. So in essence we end up with sort of
> two queues layers, but I think it provides much nicer isolation
> between layers. Also, the bulk of the changes are going to be
> isolated to the epoll code, and we avoid Andy's concern about
> missing, or starving out wakeups.
> 
> So here's a stab at how this API could look:
> 
> 1. ep1 = epoll_create1(EPOLL_POLICY);
> 
> So EPOLL_POLICY here could be the round robin policy described
> here, or the irq affinity or other ideas. The idea is to create
> an fd that is local to the process, such that other processes
> can not subsequently attach to it and affect our policy.

I'm not against defining more policies if needed.
Maybe FIFO vs LIFO is a good case for this.

For affinity, it could probably be done transparently based on
epoll_wait retrievals + EPOLL_CTL_MOD operations.

> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
> 
> This associates ep1 with the event source. ep1 can be
> associated with or added to at most 1 wakeup source. This call
> would largely just form the association, but not queue anything
> to the fd_source wait queue.

This would mean one extra FD for every fd_source, but that's
only a handful of FDs (listen sockets), correct?

> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
>     epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
>     epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>      .
>      .
>      .
> 
> Finally, we add the epoll sets to the event source (indirectly via
> ep1). So the first add would actually queue the callback to the
> fd_source. While the subsequent calls would simply queue things
> to the 'nested' wakeup queue associated with ep1.

I'm not sure I follow, wouldn't this increase the number of wakeups?

> So any existing epoll/poll/select calls could be queued as well
> to fd_source and will operate independently from this mechanism,
> as the fd_source queue continues to be 'wake all'. Also, there
> should be no changes necessary to __wake_up_common(), other
> than potentially passing more back through the
> wait_queue_func_t, such as 'nr_exclusive'.

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 17:45 99%           ` Ingo Molnar
@ 2015-02-18 17:51 99%             ` Ingo Molnar
  2015-02-19  3:26 99%               ` Jason Baron
  2015-02-18 22:18 99%               ` Eric Wong
  0 siblings, 2 replies; 44+ results
From: Ingo Molnar @ 2015-02-18 17:51 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra


* Ingo Molnar <mingo@kernel.org> wrote:

> > [...] However, I think the userspace API change is less 
> > clear since epoll_wait() doesn't currently have an 
> > 'input' events argument as epoll_ctl() does.
> 
> ... but the change would be a bit clearer and somewhat 
> more flexible: LIFO or FIFO queueing, right?
> 
> But having the queueing model as part of the epoll 
> context is a legitimate approach as well.

Btw., there's another optimization that the networking code 
already does when processing incoming packets: waking up a 
thread on the local CPU, where the wakeup is running.

Doing the same on epoll would have real scalability 
advantages where incoming events are IRQ driven and are 
distributed amongst multiple CPUs.

Where events are task driven the scheduler will already try 
to pair up waker and wakee so it might not show up in 
measurements that markedly.

Thanks,

	Ingo

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 17:38 99%         ` Jason Baron
@ 2015-02-18 17:45 99%           ` Ingo Molnar
  2015-02-18 17:51 99%             ` Ingo Molnar
  2015-02-18 23:12 99%           ` Andy Lutomirski
  1 sibling, 1 reply; 44+ results
From: Ingo Molnar @ 2015-02-18 17:45 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra


* Jason Baron <jbaron@akamai.com> wrote:

> So in the case of multiple threads per epoll set, we 
> currently add to the head of wakeup queue exclusively in 
> 'epoll_wait()', and then subsequently remove from the 
> queue once 'epoll_wait()' returns. So I don't think this 
> patch addresses balancing on a per epoll set basis.

Okay, so I was confused about how the code works.

> I think we could address the case you describe by simply 
> doing __add_wait_queue_tail_exclusive() instead of 
> __add_wait_queue_exclusive() in epoll_wait(). [...]

Yes.

> [...] However, I think the userspace API change is less 
> clear since epoll_wait() doesn't currently have an 
> 'input' events argument as epoll_ctl() does.

... but the change would be a bit clearer and somewhat more 
flexible: LIFO or FIFO queueing, right?

But having the queueing model as part of the epoll context 
is a legitimate approach as well.

Thanks,

	Ingo

^ permalink raw reply	[relevance 99%]

* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-18 16:33 99%       ` Ingo Molnar
@ 2015-02-18 17:38 99%         ` Jason Baron
  2015-02-18 17:45 99%           ` Ingo Molnar
  2015-02-18 23:12 99%           ` Andy Lutomirski
  0 siblings, 2 replies; 44+ results
From: Jason Baron @ 2015-02-18 17:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra

On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>>> This has two main advantages: firstly it solves the 
>>> O(N) (micro-)problem, but it also more evenly 
>>> distributes events both between task-lists and within 
>>> epoll groups as tasks as well.
>> Its solving 2 issues - spurious wakeups, and more even 
>> loading of threads. The event distribution is more even 
>> between 'epoll groups' with this patch, however, if 
>> multiple threads are blocking on a single 'epoll group', 
>> this patch does not affect the event distribution
>> there. [...]
> Regarding your last point, are you sure about that?
>
> If we have say 16 epoll threads registered, and if the list 
> is static (no register/unregister activity), then the 
> wakeup pattern is in strict order of the list: threads 
> closer to the list head will be woken more frequently, in a 
> wake-once fashion. So if threads do just quick work and go 
> back to sleep quickly, then typically only the first 2-3 
> threads will get any runtime in practice - the wakeup 
> iteration never gets 'deep' into the list.
>
> With the round-robin shuffling of the list, the threads get 
> shuffled to the tail on wakeup, which distributes events 
> evenly: all 16 epoll threads will accumulate an even 
> distribution of runtime, statistically.
>
> Have I misunderstood this somehow?
>
>

So in the case of multiple threads per epoll set, we currently
add to the head of the wakeup queue exclusively in 'epoll_wait()',
and then subsequently remove from the queue once
'epoll_wait()' returns. So I don't think this patch addresses
balancing on a per epoll set basis.

I think we could address the case you describe by simply doing
__add_wait_queue_tail_exclusive() instead of
__add_wait_queue_exclusive() in epoll_wait(). However, I think
the userspace API change is less clear since epoll_wait() doesn't
currently have an 'input' events argument as epoll_ctl() does.
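
In eventpoll.c terms it would be roughly the following one-liner (an
untested sketch against the ep_poll() of this era):

 	init_waitqueue_entry(&wait, current);
-	__add_wait_queue_exclusive(&ep->wq, &wait);
+	__add_wait_queue_tail_exclusive(&ep->wq, &wait);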

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-10 19:16 99%             ` Jason Baron
@ 2015-02-10 19:32 99%               ` Eric Wong
  0 siblings, 0 replies; 44+ results
From: Eric Wong @ 2015-02-10 19:32 UTC (permalink / raw)
  To: Jason Baron
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 11:49 PM, Eric Wong wrote:
> > Do you have a userland use case to share?
> 
> I've been trying to describe the use case, maybe I haven't been doing a good
> job :(

Sorry, I meant if you had any public code.

Anyways, I've restarted work on another project, which I'll hopefully be
able to share in a few weeks; it might be a good public candidate for
epoll performance testing.

> > Did you try my suggestion of using a dedicated thread (or thread pool)
> > which does nothing but loop on accept() + EPOLL_CTL_ADD?
> >
> > Those dedicated threads could do its own round-robin in userland to pick
> > a different epollfd to call EPOLL_CTL_ADD on.
> 
> Thanks for your suggestion! I'm not actively working on the user-space
> code here, but I will pass it along.
> 
> I would prefer, though, not to have to context switch the 'accept' thread
> on and off the cpu every time there is a new connection. So the approach
> suggested here essentially moves this dedicated thread (or threads) down
> into the kernel and avoids the creation of these threads entirely.

For cmogstored, I stopped using TCP_DEFER_ACCEPT when using the
dedicated thread.  This approach offloads to epoll and ends up giving
behavior similar to the infinite TCP_DEFER_ACCEPT that Linux <= 2.6.31
used to provide.

^ permalink raw reply	[relevance 99%]

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 21:32 99%     ` Jason Baron
@ 2015-02-09 22:45 99%       ` Andy Lutomirski
  2015-02-10  3:59 99%         ` Jason Baron
  0 siblings, 1 reply; 44+ results
From: Andy Lutomirski @ 2015-02-09 22:45 UTC (permalink / raw)
  To: Jason Baron, Linux API
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong,
	Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>> Epoll file descriptors that are added to a shared wakeup source are always
>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>> fds attached to a shared wakeup source they are all woken up. This can
>>> lead to excessive cpu usage and uneven load distribution.
>>>
>>> This patch introduces two new 'events' flags that are intended to be used
>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
>>> source in an exclusive manner such that the minimum number of threads are
>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>>> also be added to the 'events' flag, such that we round robin around the set
>>> of waiting threads.
>>>
>>> An implementation note is that in the epoll wakeup routine,
>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
>>> wakeup, only when there are current waiters. The idea is to use this additional
>>> heuristic in order to minimize wakeup latencies.
>>
>> I don't understand what this is intended to do.
>>
>> If an event has EPOLLONESHOT, then only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>>
>
> hmm...so with EPOLLONESHOT you basically get notified once about an event. If i have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular usecase...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.

Now I understand.  Why are you using multiple epollfds?

--Andy

>
>> If an event has EPOLLET, then the considerations are similar to EPOLLONESHOT, right?
>>
>
> EPOLLET is still going to cause this thundering herd.
>
>> If an event is a normal level-triggered non-one-shot event, then I don't understand how a round-robin wakeup makes any sense.  It's level-triggered, after all.
>
> Yeah, so the current behavior is to wake up all of the threads. I'm trying to add a new mode where it load balances among the threads interested in the event. Perhaps, the test program I attached to 0/2 will show the issue better?
>
> Also, this originally came up in the context of a single listening socket which was attached to multiple epoll fds each in a separate thread. With the attached patch, I can measure a large decrease in cpu usage and better balancing behavior among the accepting threads.
>
> Thanks,
>
> -Jason



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[relevance 99%]

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:06 99% ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
  2015-02-09 20:18 99%   ` Andy Lutomirski
@ 2015-02-09 20:27 99%   ` Michael Kerrisk
  1 sibling, 0 replies; 44+ results
From: Michael Kerrisk @ 2015-02-09 20:27 UTC (permalink / raw)
  To: Jason Baron
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, normalperson,
	Davide Libenzi, Linux Kernel, Linux-Fsdevel, Linux API

[CC += linux-api@vger.kernel.org]


On Mon, Feb 9, 2015 at 9:06 PM, Jason Baron <jbaron@akamai.com> wrote:
> Epoll file descriptors that are added to a shared wakeup source are always
> added in a non-exclusive manner. That means that when we have multiple epoll
> fds attached to a shared wakeup source they are all woken up. This can
> lead to excessive cpu usage and uneven load distribution.
>
> This patch introduces two new 'events' flags that are intended to be used
> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> source in an exclusive manner such that the minimum number of threads are
> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> also be added to the 'events' flag, such that we round robin around the set
> of waiting threads.
>
> An implementation note is that in the epoll wakeup routine,
> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> wakeup, only when there are current waiters. The idea is to use this additional
> heuristic in order to minimize wakeup latencies.
>
> Signed-off-by: Jason Baron <jbaron@akamai.com>
> ---
>  fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
>  include/uapi/linux/eventpoll.h |  6 ++++++
>  2 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index d77f944..382c832 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,8 @@
>   */
>
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
> +                        EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>         unsigned long flags;
>         struct epitem *epi = ep_item_from_wait(wait);
>         struct eventpoll *ep = epi->ep;
> +       int ewake = 0;
>
>         if ((unsigned long)key & POLLFREE) {
>                 ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>          * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>          * wait list.
>          */
> -       if (waitqueue_active(&ep->wq))
> +       if (waitqueue_active(&ep->wq)) {
> +               ewake = 1;
>                 wake_up_locked(&ep->wq);
> +       }
>         if (waitqueue_active(&ep->poll_wait))
>                 pwake++;
>
> @@ -1078,6 +1082,8 @@ out_unlock:
>         if (pwake)
>                 ep_poll_safewake(&ep->poll_wait);
>
> +       if (epi->event.events & EPOLLROUNDROBIN)
> +               return ewake;
>         return 1;
>  }
>
> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>                 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>                 pwq->whead = whead;
>                 pwq->base = epi;
> -               add_wait_queue(whead, &pwq->wait);
> +               if (epi->event.events & EPOLLROUNDROBIN)
> +                       add_wait_queue_rr(whead, &pwq->wait);
> +               else if (epi->event.events & EPOLLEXCLUSIVE)
> +                       add_wait_queue_exclusive(whead, &pwq->wait);
> +               else
> +                       add_wait_queue(whead, &pwq->wait);
>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>                 epi->nwait++;
>         } else {
> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>                 struct epoll_event __user *, event)
>  {
> -       int error;
> -       int full_check = 0;
> +       int error, full_check = 0, wait_flags = 0;
>         struct fd f, tf;
>         struct eventpoll *ep;
>         struct epitem *epi;
> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>         if (f.file == tf.file || !is_file_epoll(f.file))
>                 goto error_tgt_fput;
>
> +       wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
> +       if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
> +           ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
> +               goto error_tgt_fput;
> +
>         /*
>          * At this point it is safe to assume that the "private_data" contains
>          * our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index bc81fb2..10260a1 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,12 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
>
> +/* Balance wakeups for a shared event source */
> +#define EPOLLROUNDROBIN (1 << 27)
> +
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent system suspends
>   * from happening while those events are being processed.
> --
> 1.8.2.rc2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[relevance 99%]

* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:06 99% ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
@ 2015-02-09 20:18 99%   ` Andy Lutomirski
  2015-02-09 21:32 99%     ` Jason Baron
  2015-02-09 20:27 99%   ` Michael Kerrisk
  1 sibling, 1 reply; 44+ results
From: Andy Lutomirski @ 2015-02-09 20:18 UTC (permalink / raw)
  To: Jason Baron, peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

On 02/09/2015 12:06 PM, Jason Baron wrote:
> Epoll file descriptors that are added to a shared wakeup source are always
> added in a non-exclusive manner. That means that when we have multiple epoll
> fds attached to a shared wakeup source they are all woken up. This can
> lead to excessive cpu usage and uneven load distribution.
>
> This patch introduces two new 'events' flags that are intended to be used
> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> source in an exclusive manner such that the minimum number of threads are
> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> also be added to the 'events' flag, such that we round robin around the set
> of waiting threads.
>
> An implementation note is that in the epoll wakeup routine,
> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> wakeup, only when there are current waiters. The idea is to use this additional
> heuristic in order to minimize wakeup latencies.

I don't understand what this is intended to do.

If an event has EPOLLONESHOT, then only one thread should be woken 
regardless, right?  If not, isn't that just a bug that should be fixed?

If an event has EPOLLET, then the considerations are similar to 
EPOLLONESHOT, right?

If an event is a normal level-triggered non-one-shot event, then I don't 
understand how a round-robin wakeup makes any sense.  It's 
level-triggered, after all.

--Andy

^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-29 15:05 99%         ` Hagen Paul Pfeifer
  2012-03-29 15:53 99%           ` Jason Baron
@ 2012-04-05 22:30 99%           ` Andy Lutomirski
  1 sibling, 0 replies; 44+ results
From: Andy Lutomirski @ 2012-04-05 22:30 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Jason Baron, richard -rw- weinberger, torvalds, LKML, Al Viro,
	Lucas De Marchi, Andrew Morton, linux-fsdevel, eric.dumazet

On 03/29/2012 08:05 AM, Hagen Paul Pfeifer wrote:
> * Jason Baron | 2012-03-29 10:16:53 [-0400]:
> 
>> Right, for level triggered events, they all wake up. However, if you use
>> edge triggered, ie add 'EPOLLET', then the event gets 'consumed' by the
>> first thread that wakes up, and the subsequent waiters wouldn't get woken
>> up. IE you'll get one wakeup.
> 
I addressed level triggered, right - it matches the model. But I didn't want
to wake up every thread anyway. I don't want to abandon level-triggered
functioning.

I think that what you want is a mode in which, once epoll_wait returns,
the bits it returns get cleared from the event mask.  Am I right?

(To get this right, you'd need another flag that disables the automatic
addition of POLLERR and POLLHUP.)

If you do that and add an extra syscall that simultaneously does a bunch
of epoll_ctls, a timerfd_settime, and an epoll_wait, I'll be extra happy.

--Andy

^ permalink raw reply	[relevance 99%]
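
The mode Andy describes could be approximated from userspace today with an
extra epoll_ctl per wakeup. A rough sketch, where 'registered' tracks the
currently armed mask and 'epfd'/'fd'/'clear_returned_bits' are illustrative
names invented here:

>>>>>>>>>>>
#include <sys/epoll.h>

/* After a wakeup, strip the bits that were just returned from the
 * registered mask, so the same condition does not wake us again until
 * the mask is rearmed. Purely illustrative; it also shows why a flag
 * suppressing the implicit POLLERR/POLLHUP would be needed, since
 * those two bits cannot be masked out via EPOLL_CTL_MOD. */
static int clear_returned_bits(int epfd, int fd, unsigned int *registered,
			       unsigned int returned)
{
	struct epoll_event ev = { 0 };

	*registered &= ~returned;
	ev.events = *registered;
	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
<<<<<<<<<<<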

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-29 18:54 99%               ` Jason Baron
@ 2012-03-29 21:19 99%                 ` Hagen Paul Pfeifer
  0 siblings, 0 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-29 21:19 UTC (permalink / raw)
  To: Jason Baron
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

* Jason Baron | 2012-03-29 14:54:17 [-0400]:

>But the behavior of the testcase you've supplied is not changed by the
>'EPOLLEXCLUSIVE' support. So is this not the right testcase?

Hi Jason,

I am currently abroad; the example yesterday was just a quick hack somewhere
between hotel and meeting to demonstrate the behaviour. ;)

I am travelling with a company laptop and my test code is on another PC; I
have access to my git tree and just wanted to get this patch merged in this
window.

Hagen


^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-29 16:32 99%             ` Hagen Paul Pfeifer
@ 2012-03-29 18:54 99%               ` Jason Baron
  2012-03-29 21:19 99%                 ` Hagen Paul Pfeifer
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2012-03-29 18:54 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

On Thu, Mar 29, 2012 at 06:32:22PM +0200, Hagen Paul Pfeifer wrote:
> * Jason Baron | 2012-03-29 11:53:24 [-0400]:
> 
> >I was trying to better understand the use-case, since at least for the
> >test case you posted, 'EPOLLET' already does what you want.
> >
> >Also, the 'EPOLLEXCLUSIVE' flag in your patch addresses multiple threads
> >blocking on *different* epoll fds. However, if multiple threads are
> >blocked on a single epoll fd, they will all be woken even if 'EPOLLEXCLUSIVE'
> >is set. Shouldn't 'EPOLLEXCLUSIVE' affect that case too?
> 
> Hey Jason,
> 
> I just wanted to address the "main use-case" (as implemented in a bunch of
> network servers): one listen socket (say port 80) is created and an epoll fd
> is created. The listen socket is added to the set and n threads are created
> afterwards. So now you have the situation that one listening socket is added
> to the set and all threads are woken if a new client connects. This patch
> reduces the useless wake-all-threads overhead by waking only one thread.
> 
> Hagen

Hi,

But the behavior of the testcase you've supplied is not changed by the
'EPOLLEXCLUSIVE' support. So is this not the right testcase?

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-29 15:53 99%           ` Jason Baron
@ 2012-03-29 16:32 99%             ` Hagen Paul Pfeifer
  2012-03-29 18:54 99%               ` Jason Baron
  0 siblings, 1 reply; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-29 16:32 UTC (permalink / raw)
  To: Jason Baron
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

* Jason Baron | 2012-03-29 11:53:24 [-0400]:

>I was trying to better understand the use-case, since at least for the
>test case you posted, 'EPOLLET' already does what you want.
>
>Also, the 'EPOLLEXCLUSIVE' flag in your patch addresses multiple threads
>blocking on *different* epoll fds. However, if multiple threads are
>blocked on a single epoll fd, they will all be woken even if 'EPOLLEXCLUSIVE'
>is set. Shouldn't 'EPOLLEXCLUSIVE' affect that case too?

Hey Jason,

I just wanted to address the "main use-case" (as implemented in a bunch of
network servers): one listen socket (say port 80) is created and an epoll fd
is created. The listen socket is added to the set and n threads are created
afterwards. So now you have the situation that one listening socket is added
to the set and all threads are woken if a new client connects. This patch
reduces the useless wake-all-threads overhead by waking only one thread.

Hagen

^ permalink raw reply	[relevance 99%]
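
The pattern Hagen describes, sketched below; error handling is omitted, and
port 8080 and NWORKERS are illustrative values chosen for this sketch. Every
incoming connection here wakes all NWORKERS threads, which is exactly the
overhead the flag targets:

>>>>>>>>>>>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/epoll.h>

#define NWORKERS 4

static void *worker(void *arg)
{
	int epfd = *(int *) arg;
	struct epoll_event ev;

	for (;;) {
		if (epoll_wait(epfd, &ev, 1, -1) != 1)
			continue;
		/* every worker is woken; only one wins the accept() race */
		int c = accept(ev.data.fd, NULL, NULL);
		if (c >= 0)
			close(c);
	}
	return NULL;
}

int main(void)
{
	int i, lfd = socket(AF_INET, SOCK_STREAM, 0);
	int epfd = epoll_create1(0);
	struct sockaddr_in sa;
	struct epoll_event ev;
	pthread_t tid[NWORKERS];

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(8080);

	bind(lfd, (struct sockaddr *) &sa, sizeof(sa));
	listen(lfd, 128);
	fcntl(lfd, F_SETFL, O_NONBLOCK);	/* losers of the race get EAGAIN */

	ev.events = EPOLLIN;
	ev.data.fd = lfd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

	for (i = 0; i < NWORKERS; i++)
		pthread_create(&tid[i], NULL, worker, &epfd);
	pthread_join(tid[0], NULL);	/* workers loop forever */
	return 0;
}
<<<<<<<<<<<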

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-29 15:05 99%         ` Hagen Paul Pfeifer
@ 2012-03-29 15:53 99%           ` Jason Baron
  2012-03-29 16:32 99%             ` Hagen Paul Pfeifer
  2012-04-05 22:30 99%           ` Andy Lutomirski
  1 sibling, 1 reply; 44+ results
From: Jason Baron @ 2012-03-29 15:53 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

On Thu, Mar 29, 2012 at 05:05:41PM +0200, Hagen Paul Pfeifer wrote:
> * Jason Baron | 2012-03-29 10:16:53 [-0400]:
> 
> >Right, for level triggered events, they all wake up. However, if you use
> >edge triggered, i.e. add 'EPOLLET', then the event gets 'consumed' by the
> >first thread that wakes up, and the subsequent waiters wouldn't get woken
> >up, i.e. you'll get one wakeup.
> 
> I addressed level triggered, right - it matches the model. But I didn't want
> to wake up every thread anyway. I don't want to abandon level-triggered
> functioning.
> 
> Any objection against this flag?
> 

I was trying to better understand the use-case, since at least for the
test case you posted, 'EPOLLET' already does what you want.

Also, the 'EPOLLEXCLUSIVE' flag in your patch addresses multiple threads
blocking on *different* epoll fds. However, if multiple threads are
blocked on a single epoll fd, they will all be woken even if 'EPOLLEXCLUSIVE'
is set. Shouldn't 'EPOLLEXCLUSIVE' affect that case too?

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-29 14:16 99%       ` Jason Baron
@ 2012-03-29 15:05 99%         ` Hagen Paul Pfeifer
  2012-03-29 15:53 99%           ` Jason Baron
  2012-04-05 22:30 99%           ` Andy Lutomirski
  0 siblings, 2 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-29 15:05 UTC (permalink / raw)
  To: Jason Baron
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

* Jason Baron | 2012-03-29 10:16:53 [-0400]:

>Right, for level triggered events, they all wake up. However, if you use
>edge triggered, i.e. add 'EPOLLET', then the event gets 'consumed' by the
>first thread that wakes up, and the subsequent waiters wouldn't get woken
>up, i.e. you'll get one wakeup.

I addressed level triggered, right - it matches the model. But I didn't want
to wake up every thread anyway. I don't want to abandon level-triggered
functioning.

Any objection against this flag?

Hagen

^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-28 19:58 99%     ` Hagen Paul Pfeifer
@ 2012-03-29 14:16 99%       ` Jason Baron
  2012-03-29 15:05 99%         ` Hagen Paul Pfeifer
  2012-03-29 14:51 99%       ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 44+ results
From: Jason Baron @ 2012-03-29 14:16 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

On Wed, Mar 28, 2012 at 09:58:48PM +0200, Hagen Paul Pfeifer wrote:
> 
> >Hmmm...Looking at ep_poll() it does an '__add_wait_queue_exclusive()'.
> >So, I *think* epoll_wait() should do what you want, if you are waiting
> >on the same epfd in all the threads.
> >
> >I think the case you are describing is where each thread does its own
> >epoll_create(), and then a subsequent epoll_wait() on the fd from the
> >create?
> >
> >So, I *think* you can get what you want without adding this flag.
> 
> ;) sorry:
> 
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> epoll_wait returned
> 
> 
> minimal example:
> 
> >>>>>>>>>>>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
> #include <sys/epoll.h>
> 
> #define	AMAX 16
> #define	NTHREADS 10
> 
> static void *runner(void *args)
> {
> 	int fd = *(int *) args;
> 	struct epoll_event events[AMAX];
> 
> 	/* block until the epoll fd reports an event */
> 	epoll_wait(fd, events, AMAX, -1);
> 	write(1, "epoll_wait returned\n", 20);
> 
> 	return NULL;
> }
> 
> int main(int ac, char **av)
> {
> 	int i, evfd, pipefd[2];
> 	pthread_t thread_id[NTHREADS];
> 	struct epoll_event epoll_ev;
> 
> 	pipe(pipefd);
> 	evfd = epoll_create(64);
> 
> 	memset(&epoll_ev, 0, sizeof(struct epoll_event));
> 	epoll_ev.events   = EPOLLIN | EPOLLPRI | EPOLLERR | EPOLLHUP;
> 	epoll_ctl(evfd, EPOLL_CTL_ADD, pipefd[0], &epoll_ev);
> 
> 	for (i = 0; i < NTHREADS; i++)
> 		pthread_create(&thread_id[i], NULL, runner, &evfd);
> 
> 	sleep(1);
> 	/* make the pipe readable, then hang up the write end: all wake */
> 	write(pipefd[1], "x", 1);
> 	close(pipefd[1]);
> 	sleep(1);
> 
> 	return EXIT_SUCCESS;
> }

Right, for level triggered events, they all wake up. However, if you use
edge triggered, i.e. add 'EPOLLET', then the event gets 'consumed' by the
first thread that wakes up, and the subsequent waiters wouldn't get woken
up, i.e. you'll get one wakeup.

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]
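
The edge-triggered variant Jason describes is a one-flag change to the
registration in the example above. A sketch, with 'epfd' and 'fd' illustrative
and 'add_edge_triggered' a name invented here:

>>>>>>>>>>>
#include <sys/epoll.h>

/* With EPOLLET the wakeup fires on the state change only, so the event
 * is 'consumed' by the first thread that wakes; the remaining waiters
 * stay blocked until the next edge. */
static int add_edge_triggered(int epfd, int fd)
{
	struct epoll_event ev = { 0 };

	ev.events = EPOLLIN | EPOLLET;
	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
<<<<<<<<<<<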

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-28 19:58 99%     ` Hagen Paul Pfeifer
  2012-03-29 14:16 99%       ` Jason Baron
@ 2012-03-29 14:51 99%       ` Hagen Paul Pfeifer
  1 sibling, 0 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-29 14:51 UTC (permalink / raw)
  To: Jason Baron
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

Ping? Any signed-off-by for this patch now? Andrew?

^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-28 16:21 99%   ` Jason Baron
@ 2012-03-28 19:58 99%     ` Hagen Paul Pfeifer
  2012-03-29 14:16 99%       ` Jason Baron
  2012-03-29 14:51 99%       ` Hagen Paul Pfeifer
  0 siblings, 2 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-03-28 19:58 UTC (permalink / raw)
  To: Jason Baron
  Cc: richard -rw- weinberger, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

* Jason Baron | 2012-03-28 12:21:08 [-0400]:

>Hmmm...Looking at ep_poll() it does an '__add_wait_queue_exclusive()'.
>So, I *think* epoll_wait() should do what you want, if you are waiting
>on the same epfd in all the threads.
>
>I think the case you are describing is where each thread does its own
>epoll_create(), and then a subsequent epoll_wait() on the fd from the
>create?
>
>So, I *think* you can get what you want without adding this flag.

;) sorry:

epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned
epoll_wait returned


minimal example:

>>>>>>>>>>>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/epoll.h>

#define	AMAX 16
#define	NTHREADS 10

static void *runner(void *args)
{
	int fd = *(int *) args;
	struct epoll_event events[AMAX];

	/* block until the epoll fd reports an event */
	epoll_wait(fd, events, AMAX, -1);
	write(1, "epoll_wait returned\n", 20);

	return NULL;
}

int main(int ac, char **av)
{
	int i, evfd, pipefd[2];
	pthread_t thread_id[NTHREADS];
	struct epoll_event epoll_ev;

	pipe(pipefd);
	evfd = epoll_create(64);

	memset(&epoll_ev, 0, sizeof(struct epoll_event));
	epoll_ev.events   = EPOLLIN | EPOLLPRI | EPOLLERR | EPOLLHUP;
	epoll_ctl(evfd, EPOLL_CTL_ADD, pipefd[0], &epoll_ev);

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&thread_id[i], NULL, runner, &evfd);

	sleep(1);
	/* make the pipe readable, then hang up the write end: all wake */
	write(pipefd[1], "x", 1);
	close(pipefd[1]);
	sleep(1);

	return EXIT_SUCCESS;
}
<<<<<<<<<<<


Cheers, Hagen

^ permalink raw reply	[relevance 99%]

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-28 14:09 99% ` richard -rw- weinberger
@ 2012-03-28 16:21 99%   ` Jason Baron
  2012-03-28 19:58 99%     ` Hagen Paul Pfeifer
  0 siblings, 1 reply; 44+ results
From: Jason Baron @ 2012-03-28 16:21 UTC (permalink / raw)
  To: richard -rw- weinberger
  Cc: Hagen Paul Pfeifer, torvalds, LKML, Al Viro, Lucas De Marchi,
	Andrew Morton, linux-fsdevel, eric.dumazet

On Wed, Mar 28, 2012 at 04:09:24PM +0200, richard -rw- weinberger wrote:
> On Wed, Mar 28, 2012 at 3:57 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> > High performance servers sometimes create one listening socket (e.g. port
> > 80), create an epoll file descriptor and add the socket. Afterwards they
> > create _SC_NPROCESSORS_ONLN threads and wait for events. This often
> > results in a thundering herd problem because all CPUs are scheduled.
> >
> > This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
> > If a descriptor is added with this flag only one CPU is scheduled in.
> >
> > Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> > ---
> > Dave rejected the patch and said it is not network specific. Because there
> > is no epoll maintainer, it is sent directly this time.
> 
> CC'ing maintainers for you...
> Please use scripts/get_maintainer.pl.
> 

Hmmm...Looking at ep_poll() it does an '__add_wait_queue_exclusive()'.
So, I *think* epoll_wait() should do what you want, if you are waiting
on the same epfd in all the threads.

I think the case you are describing is where each thread does its own
epoll_create(), and then a subsequent epoll_wait() on the fd from the
create?

So, I *think* you can get what you want without adding this flag.

Thanks,

-Jason

^ permalink raw reply	[relevance 99%]
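
The two setups Jason distinguishes differ only in where epoll_create() happens.
A sketch of the per-thread-epfd case, which ep_poll()'s exclusive wait does not
cover; 'target_fd' is an illustrative shared descriptor and the function name
is invented here:

>>>>>>>>>>>
#include <sys/epoll.h>

/* Each thread builds its own private epoll set around the same target
 * fd. A single event on target_fd then wakes every epfd, one per
 * thread - the case EPOLLEXCLUSIVE is meant to address. */
static void *per_thread_epfd_worker(void *arg)
{
	int target_fd = *(int *) arg;
	int epfd = epoll_create1(0);
	struct epoll_event ev = { 0 };

	ev.events = EPOLLIN;
	ev.data.fd = target_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, target_fd, &ev);
	epoll_wait(epfd, &ev, 1, -1);
	return NULL;
}
<<<<<<<<<<<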

* Re: [PATCH Resend] epoll: add EPOLLEXCLUSIVE support
  2012-03-28 13:57 99% [PATCH Resend] " Hagen Paul Pfeifer
@ 2012-03-28 14:09 99% ` richard -rw- weinberger
  2012-03-28 16:21 99%   ` Jason Baron
  0 siblings, 1 reply; 44+ results
From: richard -rw- weinberger @ 2012-03-28 14:09 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: torvalds, LKML, Al Viro, Lucas De Marchi, Andrew Morton, jbaron,
	linux-fsdevel

On Wed, Mar 28, 2012 at 3:57 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> High performance servers sometimes create one listening socket (e.g. port
> 80), create an epoll file descriptor and add the socket. Afterwards they
> create _SC_NPROCESSORS_ONLN threads and wait for events. This often
> results in a thundering herd problem because all CPUs are scheduled.
>
> This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
> If a descriptor is added with this flag only one CPU is scheduled in.
>
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> ---
> Dave rejected the patch and said it is not network specific. Because there
> is no epoll maintainer, it is sent directly this time.

CC'ing maintainers for you...
Please use scripts/get_maintainer.pl.

-- 
Thanks,
//richard

^ permalink raw reply	[relevance 99%]

* Re: [PATCH net-next] epoll: add EPOLLEXCLUSIVE support
  2012-02-14 21:06 99% ` Eric Dumazet
@ 2012-02-14 21:38 99%   ` Hagen Paul Pfeifer
  0 siblings, 0 replies; 44+ results
From: Hagen Paul Pfeifer @ 2012-02-14 21:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev, Davide Libenzi

* Eric Dumazet | 2012-02-14 22:06:15 [+0100]:

>Seems pretty good to me.
>
>Do you have some performance numbers to share ?

No, but I did some tests with one of my network performance tools. I imagine
that I can *construct* test-cases and add some 'perf stat cs:u' statistics.
IMHO it is not fair to present some artificially tuned performance numbers.
There are use-cases where EPOLLEXCLUSIVE can be really helpful; yes, I think
that this flag SHOULD be a userspace default. ;-)

Hagen

^ permalink raw reply	[relevance 99%]

* Re: [PATCH net-next] epoll: add EPOLLEXCLUSIVE support
  2012-02-14 20:48 99% [PATCH net-next] epoll: add EPOLLEXCLUSIVE support Hagen Paul Pfeifer
  2012-02-14 21:06 99% ` Eric Dumazet
@ 2012-02-14 21:23 99% ` David Miller
  1 sibling, 0 replies; 44+ results
From: David Miller @ 2012-02-14 21:23 UTC (permalink / raw)
  To: hagen; +Cc: linux-kernel, netdev, davidel, eric.dumazet

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Tue, 14 Feb 2012 21:48:04 +0100

> High performance servers sometimes create one listening socket (e.g. port
> 80), create an epoll file descriptor and add the socket. Afterwards they
> create _SC_NPROCESSORS_ONLN threads and wait for events. This often
> results in a thundering herd problem because all CPUs are scheduled.
> 
> This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
> If a descriptor is added with this flag only one CPU is scheduled in.
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Reported-by: Li Yu <raise.sail@gmail.com>

This is not a networking specific change and therefore should not
be submitted via my tree(s).

^ permalink raw reply	[relevance 99%]

* Re: [PATCH net-next] epoll: add EPOLLEXCLUSIVE support
  2012-02-14 20:48 99% [PATCH net-next] epoll: add EPOLLEXCLUSIVE support Hagen Paul Pfeifer
@ 2012-02-14 21:06 99% ` Eric Dumazet
  2012-02-14 21:38 99%   ` Hagen Paul Pfeifer
  2012-02-14 21:23 99% ` David Miller
  1 sibling, 1 reply; 44+ results
From: Eric Dumazet @ 2012-02-14 21:06 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: linux-kernel, netdev, Davide Libenzi

On Tuesday, 14 February 2012 at 21:48 +0100, Hagen Paul Pfeifer wrote:
> High performance servers sometimes create one listening socket (e.g. port
> 80), create an epoll file descriptor and add the socket. Afterwards they
> create _SC_NPROCESSORS_ONLN threads and wait for events. This often
> results in a thundering herd problem because all CPUs are scheduled.
> 
> This patch adds an additional flag to epoll_ctl(2) called EPOLLEXCLUSIVE.
> If a descriptor is added with this flag only one CPU is scheduled in.
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Reported-by: Li Yu <raise.sail@gmail.com>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> ---

Seems pretty good to me.

Do you have some performance numbers to share?

^ permalink raw reply	[relevance 99%]

Results 1-44 of 44 | reverse | options above
-- pct% links below jump to the message on this page, permalinks otherwise --
2012-02-14 20:48 99% [PATCH net-next] epoll: add EPOLLEXCLUSIVE support Hagen Paul Pfeifer
2012-02-14 21:06 99% ` Eric Dumazet
2012-02-14 21:38 99%   ` Hagen Paul Pfeifer
2012-02-14 21:23 99% ` David Miller
2012-03-28 13:57 99% [PATCH Resend] " Hagen Paul Pfeifer
2012-03-28 14:09 99% ` richard -rw- weinberger
2012-03-28 16:21 99%   ` Jason Baron
2012-03-28 19:58 99%     ` Hagen Paul Pfeifer
2012-03-29 14:16 99%       ` Jason Baron
2012-03-29 15:05 99%         ` Hagen Paul Pfeifer
2012-03-29 15:53 99%           ` Jason Baron
2012-03-29 16:32 99%             ` Hagen Paul Pfeifer
2012-03-29 18:54 99%               ` Jason Baron
2012-03-29 21:19 99%                 ` Hagen Paul Pfeifer
2012-04-05 22:30 99%           ` Andy Lutomirski
2012-03-29 14:51 99%       ` Hagen Paul Pfeifer
2012-03-25 19:12 99% [PATCH] " Hagen Paul Pfeifer
2016-02-04 15:39 99% [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT Jason Baron
2016-02-04 21:44 99% ` Madars Vitolins
2016-02-04 22:59 99%   ` Andrew Morton
2016-02-05  9:49 99%     ` Madars Vitolins
2015-02-17 19:33     [PATCH v2 0/2] Add epoll round robin wakeup mode Jason Baron
2015-02-17 19:33 99% ` [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
2015-02-18  8:07 99%   ` Ingo Molnar
2015-02-18 15:42 99%     ` Jason Baron
2015-02-18 16:33 99%       ` Ingo Molnar
2015-02-18 17:38 99%         ` Jason Baron
2015-02-18 17:45 99%           ` Ingo Molnar
2015-02-18 17:51 99%             ` Ingo Molnar
2015-02-19  3:26 99%               ` Jason Baron
2015-02-22  0:24 99%                 ` Eric Wong
2015-02-25 15:48 99%                   ` Jason Baron
2015-02-18 22:18 99%               ` Eric Wong
2015-02-18 23:12 99%           ` Andy Lutomirski
     [not found]       ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com>
2015-02-27 22:24 99%     ` Jason Baron
2015-02-09 20:05     [PATCH 0/2] Add epoll round robin wakeup mode Jason Baron
2015-02-09 20:06 99% ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
2015-02-09 20:18 99%   ` Andy Lutomirski
2015-02-09 21:32 99%     ` Jason Baron
2015-02-09 22:45 99%       ` Andy Lutomirski
2015-02-10  3:59 99%         ` Jason Baron
2015-02-10  4:49 99%           ` Eric Wong
2015-02-10 19:16 99%             ` Jason Baron
2015-02-10 19:32 99%               ` Eric Wong
2015-02-09 20:27 99%   ` Michael Kerrisk
2015-12-08  3:23     [PATCH] epoll: add exclusive wakeups flag Jason Baron
2015-12-08  3:23 99% ` [PATCH] epoll: add EPOLLEXCLUSIVE flag Jason Baron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).