* [PATCH v2 0/2] Add epoll round robin wakeup mode @ 2015-02-17 19:33 Jason Baron 2015-02-17 19:33 ` [PATCH v2 1/2] sched/wait: add " Jason Baron ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Jason Baron @ 2015-02-17 19:33 UTC (permalink / raw) To: peterz, mingo, viro Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api When we share a wakeup source among multiple epoll fds, we end up with thundering herd wakeups, since there is currently no way to add to the wakeup source exclusively. This series introduces two new epoll flags: EPOLLEXCLUSIVE, for adding to a wakeup source exclusively, and EPOLLROUNDROBIN, which is to be used in conjunction with EPOLLEXCLUSIVE to evenly distribute the wakeups. This patch was originally motivated by a desire to improve wakeup balance and cpu usage for a listen() socket shared amongst multiple epoll fd sets. See: http://lwn.net/Articles/632590/ for the previous test program and testing results. Epoll manpage text: EPOLLEXCLUSIVE Provides exclusive wakeups when attaching multiple epoll fds to a shared wakeup source. Must be specified with an EPOLL_CTL_ADD operation. EPOLLROUNDROBIN Provides balancing for exclusive wakeups when attaching multiple epoll fds to a shared wakeup source. Depends on EPOLLEXCLUSIVE being set and must be specified with an EPOLL_CTL_ADD operation. Thanks, -Jason Jason Baron (2): sched/wait: add round robin wakeup mode epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN fs/eventpoll.c | 25 ++++++++++++++++++++----- include/linux/wait.h | 11 +++++++++++ include/uapi/linux/eventpoll.h | 6 ++++++ kernel/sched/wait.c | 10 ++++++++-- 4 files changed, 45 insertions(+), 7 deletions(-) -- 1.8.2.rc2 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 1/2] sched/wait: add round robin wakeup mode 2015-02-17 19:33 [PATCH v2 0/2] Add epoll round robin wakeup mode Jason Baron @ 2015-02-17 19:33 ` Jason Baron 2015-02-17 19:33 ` [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron 2015-02-17 19:46 ` [PATCH v2 0/2] Add epoll round robin wakeup mode Andy Lutomirski 2 siblings, 0 replies; 19+ messages in thread From: Jason Baron @ 2015-02-17 19:33 UTC (permalink / raw) To: peterz, mingo, viro Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api The motivation for this flag is to allow the distribution of wakeups from a shared source in a balanced manner. Currently, we can add threads exclusively but that often results in the same thread woken up again and again. In the case where we are trying to balance work across threads this is not desirable. The WQ_FLAG_ROUND_ROBIN is restricted to being exclusive as well, otherwise we do not know who is being woken up. Signed-off-by: Jason Baron <jbaron@akamai.com> --- include/linux/wait.h | 11 +++++++++++ kernel/sched/wait.c | 10 ++++++++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/include/linux/wait.h b/include/linux/wait.h index 2232ed1..bbdef98 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -16,6 +16,7 @@ int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *ke /* __wait_queue::flags */ #define WQ_FLAG_EXCLUSIVE 0x01 #define WQ_FLAG_WOKEN 0x02 +#define WQ_FLAG_ROUND_ROBIN 0x04 struct __wait_queue { unsigned int flags; @@ -109,6 +110,16 @@ static inline int waitqueue_active(wait_queue_head_t *q) extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait); extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait); + +/* + * rr relies on exclusive, otherwise we don't know which entry was woken + */ +static inline void add_wait_queue_rr(wait_queue_head_t *q, wait_queue_t *wait) +{ + wait->flags |= WQ_FLAG_ROUND_ROBIN; + 
add_wait_queue_exclusive(q, wait); +} + extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait); static inline void __add_wait_queue(wait_queue_head_t *head, wait_queue_t *new) diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index 852143a..dcb75dd 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -66,14 +66,20 @@ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key) { wait_queue_t *curr, *next; + LIST_HEAD(rotate_list); list_for_each_entry_safe(curr, next, &q->task_list, task_list) { unsigned flags = curr->flags; if (curr->func(curr, mode, wake_flags, key) && - (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) - break; + (flags & WQ_FLAG_EXCLUSIVE)) { + if ((flags & WQ_FLAG_ROUND_ROBIN) && (nr_exclusive > 0)) + list_move_tail(&curr->task_list, &rotate_list); + if (!--nr_exclusive) + break; + } } + list_splice_tail(&rotate_list, &q->task_list); } /** -- 1.8.2.rc2 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-17 19:33 [PATCH v2 0/2] Add epoll round robin wakeup mode Jason Baron 2015-02-17 19:33 ` [PATCH v2 1/2] sched/wait: add " Jason Baron @ 2015-02-17 19:33 ` Jason Baron 2015-02-18 8:07 ` Ingo Molnar [not found] ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com> 2015-02-17 19:46 ` [PATCH v2 0/2] Add epoll round robin wakeup mode Andy Lutomirski 2 siblings, 2 replies; 19+ messages in thread From: Jason Baron @ 2015-02-17 19:33 UTC (permalink / raw) To: peterz, mingo, viro Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api Epoll file descriptors that are added to a shared wakeup source are always added in a non-exclusive manner. That means that when we have multiple epoll fds attached to a shared wakeup source they are all woken up. This can lead to excessive cpu usage and uneven load distribution. This patch introduces two new 'events' flags that are intended to be used with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event source in an exclusive manner such that the minimum number of threads are woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can also be added to the 'events' flag, such that we round robin through the set of waiting threads. An implementation note is that in the epoll wakeup routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful wakeup, only when there are current waiters. The idea is to use this additional heuristic in order to minimize wakeup latencies. 
Signed-off-by: Jason Baron <jbaron@akamai.com> --- fs/eventpoll.c | 25 ++++++++++++++++++++----- include/uapi/linux/eventpoll.h | 6 ++++++ 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index d77f944..382c832 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -92,7 +92,8 @@ */ /* Epoll private bits inside the event mask */ -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET) +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \ + EPOLLEXCLUSIVE | EPOLLROUNDROBIN) /* Maximum number of nesting allowed inside epoll sets */ #define EP_MAX_NESTS 4 @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k unsigned long flags; struct epitem *epi = ep_item_from_wait(wait); struct eventpoll *ep = epi->ep; + int ewake = 0; if ((unsigned long)key & POLLFREE) { ep_pwq_from_wait(wait)->whead = NULL; @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k * Wake up ( if active ) both the eventpoll wait list and the ->poll() * wait list. 
*/ - if (waitqueue_active(&ep->wq)) + if (waitqueue_active(&ep->wq)) { + ewake = 1; wake_up_locked(&ep->wq); + } if (waitqueue_active(&ep->poll_wait)) pwake++; @@ -1078,6 +1082,8 @@ out_unlock: if (pwake) ep_poll_safewake(&ep->poll_wait); + if (epi->event.events & EPOLLROUNDROBIN) + return ewake; return 1; } @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); pwq->whead = whead; pwq->base = epi; - add_wait_queue(whead, &pwq->wait); + if (epi->event.events & EPOLLROUNDROBIN) + add_wait_queue_rr(whead, &pwq->wait); + else if (epi->event.events & EPOLLEXCLUSIVE) + add_wait_queue_exclusive(whead, &pwq->wait); + else + add_wait_queue(whead, &pwq->wait); list_add_tail(&pwq->llink, &epi->pwqlist); epi->nwait++; } else { @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size) SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event) { - int error; - int full_check = 0; + int error, full_check = 0, wait_flags = 0; struct fd f, tf; struct eventpoll *ep; struct epitem *epi; @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (f.file == tf.file || !is_file_epoll(f.file)) goto error_tgt_fput; + wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN); + if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) && + ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file)))))) + goto error_tgt_fput; + /* * At this point it is safe to assume that the "private_data" contains * our own data structure. 
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index bc81fb2..10260a1 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -26,6 +26,12 @@ #define EPOLL_CTL_DEL 2 #define EPOLL_CTL_MOD 3 +/* Balance wakeups for a shared event source */ +#define EPOLLROUNDROBIN (1 << 27) + +/* Add exclusively */ +#define EPOLLEXCLUSIVE (1 << 28) + /* * Request the handling of system wakeup events so as to prevent system suspends * from happening while those events are being processed. -- 1.8.2.rc2 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-17 19:33 ` [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron @ 2015-02-18 8:07 ` Ingo Molnar 2015-02-18 15:42 ` Jason Baron [not found] ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com> 1 sibling, 1 reply; 19+ messages in thread From: Ingo Molnar @ 2015-02-18 8:07 UTC (permalink / raw) To: Jason Baron Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra * Jason Baron <jbaron@akamai.com> wrote: > Epoll file descriptors that are added to a shared wakeup > source are always added in a non-exclusive manner. That > means that when we have multiple epoll fds attached to a > shared wakeup source they are all woken up. This can lead > to excessive cpu usage and uneven load distribution. > > This patch introduces two new 'events' flags that are > intended to be used with EPOLL_CTL_ADD operations. > EPOLLEXCLUSIVE, adds the epoll fd to the event source in > an exclusive manner such that the minimum number of > threads are woken. EPOLLROUNDROBIN, which depends on > EPOLLEXCLUSIVE also being set, can also be added to the > 'events' flag, such that we round robin through the set > of waiting threads. > > An implementation note is that in the epoll wakeup > routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, > we return 1, for a successful wakeup, only when there are > current waiters. The idea is to use this additional > heuristic in order minimize wakeup latencies. 
> > Signed-off-by: Jason Baron <jbaron@akamai.com> > --- > fs/eventpoll.c | 25 ++++++++++++++++++++----- > include/uapi/linux/eventpoll.h | 6 ++++++ > 2 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index d77f944..382c832 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -92,7 +92,8 @@ > */ > > /* Epoll private bits inside the event mask */ > -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET) > +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \ > + EPOLLEXCLUSIVE | EPOLLROUNDROBIN) > > /* Maximum number of nesting allowed inside epoll sets */ > #define EP_MAX_NESTS 4 > @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k > unsigned long flags; > struct epitem *epi = ep_item_from_wait(wait); > struct eventpoll *ep = epi->ep; > + int ewake = 0; > > if ((unsigned long)key & POLLFREE) { > ep_pwq_from_wait(wait)->whead = NULL; > @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k > * Wake up ( if active ) both the eventpoll wait list and the ->poll() > * wait list. 
> */ > - if (waitqueue_active(&ep->wq)) > + if (waitqueue_active(&ep->wq)) { > + ewake = 1; > wake_up_locked(&ep->wq); > + } > if (waitqueue_active(&ep->poll_wait)) > pwake++; > > @@ -1078,6 +1082,8 @@ out_unlock: > if (pwake) > ep_poll_safewake(&ep->poll_wait); > > + if (epi->event.events & EPOLLROUNDROBIN) > + return ewake; > return 1; > } > > @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, > init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); > pwq->whead = whead; > pwq->base = epi; > - add_wait_queue(whead, &pwq->wait); > + if (epi->event.events & EPOLLROUNDROBIN) > + add_wait_queue_rr(whead, &pwq->wait); > + else if (epi->event.events & EPOLLEXCLUSIVE) > + add_wait_queue_exclusive(whead, &pwq->wait); > + else > + add_wait_queue(whead, &pwq->wait); > list_add_tail(&pwq->llink, &epi->pwqlist); > epi->nwait++; > } else { > @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size) > SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, > struct epoll_event __user *, event) > { > - int error; > - int full_check = 0; > + int error, full_check = 0, wait_flags = 0; > struct fd f, tf; > struct eventpoll *ep; > struct epitem *epi; > @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, > if (f.file == tf.file || !is_file_epoll(f.file)) > goto error_tgt_fput; > > + wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN); > + if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) && > + ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file)))))) > + goto error_tgt_fput; > + > /* > * At this point it is safe to assume that the "private_data" contains > * our own data structure. 
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h > index bc81fb2..10260a1 100644 > --- a/include/uapi/linux/eventpoll.h > +++ b/include/uapi/linux/eventpoll.h > @@ -26,6 +26,12 @@ > #define EPOLL_CTL_DEL 2 > #define EPOLL_CTL_MOD 3 > > +/* Balance wakeups for a shared event source */ > +#define EPOLLROUNDROBIN (1 << 27) > + > +/* Add exclusively */ > +#define EPOLLEXCLUSIVE (1 << 28) > + > /* > * Request the handling of system wakeup events so as to prevent system suspends > * from happening while those events are being processed. So let me rephrase the justification of your two patches: Unlike regular waitqueue usage (where threads remove themselves from the waitqueue once they receive a wakeup), epoll waitqueues are static and 'persistent': epoll target threads are on the poll waitqueue indefinitely, only register/unregister removes threads from them. So they are not really 'wait queues', but static 'task lists', and are thus exposed to classic thundering herd scheduling problems and scheduling asymmetries: a single event on a shared event source will wake all epoll 'task-lists', and not only will it wake them, but due to the static nature of the lists, even an exclusive wakeup will iterate along the list with O(N) overhead, until it finds a wakeable thread. As the number of lists and the number of threads in the lists increases this scales suboptimally, and it also looks slightly odd that a random set of epoll worker threads is 'more equal' than the others, in receiving a comparably higher proportion of events. The solution is to add this new ABI to allow epoll events to be actively load-balanced both between the persistent 'task lists', and to also allow the individual task lists to act as dynamic runqueues: the head of the list is likely to be sleeping, newly woken tasks get moved to the tail of the list. 
This has two main advantages: firstly it solves the O(N) (micro-)problem, and secondly it more evenly distributes events, both between task-lists and among the tasks within each epoll group. The disadvantages: slightly higher management micro-costs, plus a global waitqueue list, which used to be read-mostly, is now actively dirtied by every event, adding more global serialization. The latter is somewhat muted by the fact that the waitqueue lock itself is already a global serialization point today and gets dirtied by every event, and the list head is next to it, usually in the same cacheline. Did I get it right? Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 8:07 ` Ingo Molnar @ 2015-02-18 15:42 ` Jason Baron 2015-02-18 16:33 ` Ingo Molnar 0 siblings, 1 reply; 19+ messages in thread From: Jason Baron @ 2015-02-18 15:42 UTC (permalink / raw) To: Ingo Molnar Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra On 02/18/2015 03:07 AM, Ingo Molnar wrote: > * Jason Baron <jbaron@akamai.com> wrote: > >> Epoll file descriptors that are added to a shared wakeup >> source are always added in a non-exclusive manner. That >> means that when we have multiple epoll fds attached to a >> shared wakeup source they are all woken up. This can lead >> to excessive cpu usage and uneven load distribution. >> >> This patch introduces two new 'events' flags that are >> intended to be used with EPOLL_CTL_ADD operations. >> EPOLLEXCLUSIVE, adds the epoll fd to the event source in >> an exclusive manner such that the minimum number of >> threads are woken. EPOLLROUNDROBIN, which depends on >> EPOLLEXCLUSIVE also being set, can also be added to the >> 'events' flag, such that we round robin through the set >> of waiting threads. >> >> An implementation note is that in the epoll wakeup >> routine, 'ep_poll_callback()', if EPOLLROUNDROBIN is set, >> we return 1, for a successful wakeup, only when there are >> current waiters. The idea is to use this additional >> heuristic in order minimize wakeup latencies. 
>> >> Signed-off-by: Jason Baron <jbaron@akamai.com> >> --- >> fs/eventpoll.c | 25 ++++++++++++++++++++----- >> include/uapi/linux/eventpoll.h | 6 ++++++ >> 2 files changed, 26 insertions(+), 5 deletions(-) >> >> diff --git a/fs/eventpoll.c b/fs/eventpoll.c >> index d77f944..382c832 100644 >> --- a/fs/eventpoll.c >> +++ b/fs/eventpoll.c >> @@ -92,7 +92,8 @@ >> */ >> >> /* Epoll private bits inside the event mask */ >> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET) >> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \ >> + EPOLLEXCLUSIVE | EPOLLROUNDROBIN) >> >> /* Maximum number of nesting allowed inside epoll sets */ >> #define EP_MAX_NESTS 4 >> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k >> unsigned long flags; >> struct epitem *epi = ep_item_from_wait(wait); >> struct eventpoll *ep = epi->ep; >> + int ewake = 0; >> >> if ((unsigned long)key & POLLFREE) { >> ep_pwq_from_wait(wait)->whead = NULL; >> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k >> * Wake up ( if active ) both the eventpoll wait list and the ->poll() >> * wait list. 
>> */ >> - if (waitqueue_active(&ep->wq)) >> + if (waitqueue_active(&ep->wq)) { >> + ewake = 1; >> wake_up_locked(&ep->wq); >> + } >> if (waitqueue_active(&ep->poll_wait)) >> pwake++; >> >> @@ -1078,6 +1082,8 @@ out_unlock: >> if (pwake) >> ep_poll_safewake(&ep->poll_wait); >> >> + if (epi->event.events & EPOLLROUNDROBIN) >> + return ewake; >> return 1; >> } >> >> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, >> init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); >> pwq->whead = whead; >> pwq->base = epi; >> - add_wait_queue(whead, &pwq->wait); >> + if (epi->event.events & EPOLLROUNDROBIN) >> + add_wait_queue_rr(whead, &pwq->wait); >> + else if (epi->event.events & EPOLLEXCLUSIVE) >> + add_wait_queue_exclusive(whead, &pwq->wait); >> + else >> + add_wait_queue(whead, &pwq->wait); >> list_add_tail(&pwq->llink, &epi->pwqlist); >> epi->nwait++; >> } else { >> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size) >> SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, >> struct epoll_event __user *, event) >> { >> - int error; >> - int full_check = 0; >> + int error, full_check = 0, wait_flags = 0; >> struct fd f, tf; >> struct eventpoll *ep; >> struct epitem *epi; >> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, >> if (f.file == tf.file || !is_file_epoll(f.file)) >> goto error_tgt_fput; >> >> + wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN); >> + if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) && >> + ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file)))))) >> + goto error_tgt_fput; >> + >> /* >> * At this point it is safe to assume that the "private_data" contains >> * our own data structure. 
>> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h >> index bc81fb2..10260a1 100644 >> --- a/include/uapi/linux/eventpoll.h >> +++ b/include/uapi/linux/eventpoll.h >> @@ -26,6 +26,12 @@ >> #define EPOLL_CTL_DEL 2 >> #define EPOLL_CTL_MOD 3 >> >> +/* Balance wakeups for a shared event source */ >> +#define EPOLLROUNDROBIN (1 << 27) >> + >> +/* Add exclusively */ >> +#define EPOLLEXCLUSIVE (1 << 28) >> + >> /* >> * Request the handling of system wakeup events so as to prevent system suspends >> * from happening while those events are being processed. > So let me rephrase the justification of your two patches: > > Unlike regular waitqueue usage (where threads remove > themselves from the waitqueue once they receive a wakeup), > epoll waitqueues are static and 'persistent': epoll target > threads are on the poll waitqueue indefinitely, only > register/unregister removes threads from them. > > So they are not really 'wait queues', but static 'task > lists', and are thus exposed to classic thundering herd > scheduling problems and scheduling assymetries: a single > event on a shared event source will wake all epoll > 'task-lists', and not only will it wake them, but due to > the static nature of the lists, even an exclusive wakeup > will iterate along the list with O(N) overhead, until it > finds a wakeable thread. > > As the number of lists and the number of threads in the > lists increases this scales suboptimally, and it also looks > slightly odd that a random set of epoll worker threads is > 'more equal' than the others, in receiving a comparably > higher proportion of events. yes, in fact we are currently working around these imbalances by doing register/unregister (EPOLL_CTL_ADD/EPOLL_CTL_DEL), periodically to re-set the order of the queues. This resolves the balancing to an extent, but not the spurious wakeups. 
> The solution is to add this new ABI to allow epoll events > to be actively load-balanced both between the persistent > 'task lists', and to also allow the individual task lists > to act as dynamic runqueues: the head of the list is likely > to be sleeping, newly woken tasks get moved to the tail of > the list. > > This has two main advantages: firstly it solves the O(N) > (micro-)problem, and secondly it more evenly distributes > events, both between task-lists and among the tasks within > each epoll group. It's solving two issues - spurious wakeups, and more even loading of threads. The event distribution is more even between 'epoll groups' with this patch, however, if multiple threads are blocking on a single 'epoll group', this patch does not affect the event distribution there. Currently, threads are added to an 'epoll group' as exclusive already, so when you have multiple threads blocking on an epoll group, only one wakes up. In our use case, we have a 1-to-1 mapping between threads and epoll groups, so we don't have spurious or un-balanced wakeups there. That suggests the alternative user-space model for addressing this problem. That is, to have a single epoll group added with EPOLLONESHOT. In this way threads can pull work or events off of a single queue, work on the event and then re-arm (such that other threads don't see events from that source in the meantime). This, however, means all threads work on all events, and they all have to synchronize to an extent on the single queue. That is, all register/unregister and re-arm events on that queue have to be visible to and synchronized across all the waiters. This model also doesn't allow userspace to partition events that are naturally local to a thread, since there is only a single epoll group. The second userspace model is to have worker threads with their own separate epoll groups, and then have separate thread(s) to address the shared wakeup sources. 
Then the threads that are waiting on the shared wakeup sources can distribute the events fairly to the worker threads. This involves extra context switching for shared events, and I think it ends up degenerating back into the original problem. > The disadvantages: slightly higher management micro-costs, > plus a global waitqueue list, which used to be read-mostly, > is now actively dirtied by every event, adding more global > serialization. The latter is somewhat muted by the fact > that the waitqueue lock itself is already a global > serialization point today and gets dirtied by every event, > and the list head is next to it, usually in the same > cacheline. Yes, I'm a bit concerned about the changes to the core wakeup function; however, in the non-rotate case, the only additional write is to initialize the 'rotate_list' on entry. I measured the latency of __wake_up_common() for the case where this code was added and we were not doing 'rotate', and I didn't measure any additional latency with ftrace, but it perhaps warrants more careful testing. The outstanding issues I have are: 1) Does epoll need two new flags here, EPOLLEXCLUSIVE and EPOLLROUNDROBIN? I.e., should they just be combined, since EPOLLROUNDROBIN depends on EPOLLEXCLUSIVE, or is there a valid use for just EPOLLEXCLUSIVE (wake up the first waiter but don't do the balancing)? 2) The concern Andy raised regarding potential starvation. That is, could an adversarial thread cause us to miss wakeups if it can add itself exclusively to the shared wakeup source? Currently, the adversarial thread would simply need to be able to open the file in question. For things like pipe, this is not an issue, but perhaps it is for files in the global namespace... Thanks, -Jason ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 15:42 ` Jason Baron @ 2015-02-18 16:33 ` Ingo Molnar 2015-02-18 17:38 ` Jason Baron 0 siblings, 1 reply; 19+ messages in thread From: Ingo Molnar @ 2015-02-18 16:33 UTC (permalink / raw) To: Jason Baron Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra * Jason Baron <jbaron@akamai.com> wrote: > > This has two main advantages: firstly it solves the > > O(N) (micro-)problem, but it also more evenly > > distributes events both between task-lists and within > > epoll groups as tasks as well. > > Its solving 2 issues - spurious wakeups, and more even > loading of threads. The event distribution is more even > between 'epoll groups' with this patch, however, if > multiple threads are blocking on a single 'epoll group', > this patch does not affect the the event distribution > there. [...] Regarding your last point, are you sure about that? If we have say 16 epoll threads registered, and if the list is static (no register/unregister activity), then the wakeup pattern is in strict order of the list: threads closer to the list head will be woken more frequently, in a wake-once fashion. So if threads do just quick work and go back to sleep quickly, then typically only the first 2-3 threads will get any runtime in practice - the wakeup iteration never gets 'deep' into the list. With the round-robin shuffling of the list, the threads get shuffled to the tail on wakeup, which distributes events evenly: all 16 epoll threads will accumulate an even distribution of runtime, statistically. Have I misunderstood this somehow? Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 16:33 ` Ingo Molnar @ 2015-02-18 17:38 ` Jason Baron 2015-02-18 17:45 ` Ingo Molnar 2015-02-18 23:12 ` Andy Lutomirski 0 siblings, 2 replies; 19+ messages in thread From: Jason Baron @ 2015-02-18 17:38 UTC (permalink / raw) To: Ingo Molnar Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra On 02/18/2015 11:33 AM, Ingo Molnar wrote: > * Jason Baron <jbaron@akamai.com> wrote: > >>> This has two main advantages: firstly it solves the >>> O(N) (micro-)problem, but it also more evenly >>> distributes events both between task-lists and within >>> epoll groups as tasks as well. >> Its solving 2 issues - spurious wakeups, and more even >> loading of threads. The event distribution is more even >> between 'epoll groups' with this patch, however, if >> multiple threads are blocking on a single 'epoll group', >> this patch does not affect the the event distribution >> there. [...] > Regarding your last point, are you sure about that? > > If we have say 16 epoll threads registered, and if the list > is static (no register/unregister activity), then the > wakeup pattern is in strict order of the list: threads > closer to the list head will be woken more frequently, in a > wake-once fashion. So if threads do just quick work and go > back to sleep quickly, then typically only the first 2-3 > threads will get any runtime in practice - the wakeup > iteration never gets 'deep' into the list. > > With the round-robin shuffling of the list, the threads get > shuffled to the tail on wakeup, which distributes events > evenly: all 16 epoll threads will accumulate an even > distribution of runtime, statistically. > > Have I misunderstood this somehow? 
> So in the case of multiple threads per epoll set, we currently add to the head of the wakeup queue exclusively in 'epoll_wait()', and then subsequently remove from the queue once 'epoll_wait()' returns. So I don't think this patch addresses balancing on a per epoll set basis. I think we could address the case you describe by simply doing __add_wait_queue_tail_exclusive() instead of __add_wait_queue_exclusive() in epoll_wait(). However, I think the userspace API change is less clear since epoll_wait() doesn't currently have an 'input' events argument as epoll_ctl() does. Thanks, -Jason ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 17:38 ` Jason Baron @ 2015-02-18 17:45 ` Ingo Molnar 2015-02-18 17:51 ` Ingo Molnar 2015-02-18 23:12 ` Andy Lutomirski 1 sibling, 1 reply; 19+ messages in thread From: Ingo Molnar @ 2015-02-18 17:45 UTC (permalink / raw) To: Jason Baron Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra * Jason Baron <jbaron@akamai.com> wrote: > So in the case of multiple threads per epoll set, we > currently add to the head of wakeup queue exclusively in > 'epoll_wait()', and then subsequently remove from the > queue once 'epoll_wait()' returns. So I don't think this > patch addresses balancing on a per epoll set basis. Okay, so I was confused about how the code works. > I think we could address the case you describe by simply > doing __add_wait_queue_tail_exclusive() instead of > __add_wait_queue_exclusive() in epoll_wait(). [...] Yes. > [...] However, I think the userspace API change is less > clear since epoll_wait() doesn't currently have an > 'input' events argument as epoll_ctl() does. ... but the change would be a bit clearer and somewhat more flexible: LIFO or FIFO queueing, right? But having the queueing model as part of the epoll context is a legitimate approach as well. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 17:45 ` Ingo Molnar @ 2015-02-18 17:51 ` Ingo Molnar 2015-02-18 22:18 ` Eric Wong 2015-02-19 3:26 ` Jason Baron 0 siblings, 2 replies; 19+ messages in thread From: Ingo Molnar @ 2015-02-18 17:51 UTC (permalink / raw) To: Jason Baron Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra * Ingo Molnar <mingo@kernel.org> wrote: > > [...] However, I think the userspace API change is less > > clear since epoll_wait() doesn't currently have an > > 'input' events argument as epoll_ctl() does. > > ... but the change would be a bit clearer and somewhat > more flexible: LIFO or FIFO queueing, right? > > But having the queueing model as part of the epoll > context is a legitimate approach as well. Btw., there's another optimization that the networking code already does when processing incoming packets: waking up a thread on the local CPU, where the wakeup is running. Doing the same on epoll would have real scalability advantages where incoming events are IRQ driven and are distributed amongst multiple CPUs. Where events are task driven the scheduler will already try to pair up waker and wakee so it might not show up in measurements that markedly. Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 17:51 ` Ingo Molnar @ 2015-02-18 22:18 ` Eric Wong 2015-02-19 3:26 ` Jason Baron 1 sibling, 0 replies; 19+ messages in thread From: Eric Wong @ 2015-02-18 22:18 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Baron, peterz, mingo, viro, akpm, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra Ingo Molnar <mingo@kernel.org> wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > [...] However, I think the userspace API change is less > > > clear since epoll_wait() doesn't currently have an > > > 'input' events argument as epoll_ctl() does. > > > > ... but the change would be a bit clearer and somewhat > > more flexible: LIFO or FIFO queueing, right? > > > > But having the queueing model as part of the epoll > > context is a legitimate approach as well. > > Btw., there's another optimization that the networking code > already does when processing incoming packets: waking up a > thread on the local CPU, where the wakeup is running. > > Doing the same on epoll would have real scalability > advantages where incoming events are IRQ driven and are > distributed amongst multiple CPUs. Right. One thing in the back of my mind has been to have CPU affinity for epoll. Either having everything in an epoll set favor a certain CPU or even having affinity down to the epitem level (so concurrent epoll_wait callers end up favoring the same epitems). I'm not convinced this series is worth doing without a comparison against my previous suggestion to use a dedicated thread which only makes blocking accept4 + EPOLL_CTL_ADD calls. The majority of epoll events in a typical server should not be for listen sockets, so I'd rather not bloat existing code paths for them. 
For web servers nowadays, the case for maintaining long-lived connections to avoid handshakes is even stronger with increasing HTTPS and HTTP/2 adoption; so listen socket events should become less common. ^ permalink raw reply [flat|nested] 19+ messages in thread
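[Editor's note] Eric's dedicated-thread alternative can be sketched from userspace (a hedged illustration of the pattern, not his implementation): one thread sits in a blocking accept loop and spreads each new connection across the worker epoll sets with EPOLL_CTL_ADD, so the listen socket itself never sits on more than one wakeup queue. Assumes Linux (select.epoll).

```python
import select
import socket
import threading

def acceptor(listener, epolls, assigned, n_conns):
    """Single blocking accept loop: hand each new connection to a worker
    epoll set round-robin, so only this thread ever waits on the listener."""
    for i in range(n_conns):
        conn, _ = listener.accept()
        ep = epolls[i % len(epolls)]
        ep.register(conn.fileno(), select.EPOLLIN)   # the EPOLL_CTL_ADD step
        assigned[i % len(epolls)].append(conn)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]

epolls = [select.epoll() for _ in range(2)]          # two worker epoll sets
assigned = [[], []]
t = threading.Thread(target=acceptor, args=(listener, epolls, assigned, 4))
t.start()

clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(4)]
t.join()
print([len(a) for a in assigned])                    # [2, 2]: spread evenly
```

With this layout only connected-socket events reach the worker epoll sets, which is consistent with Eric's point that listen-socket events are a small minority of the total.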
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 17:51 ` Ingo Molnar 2015-02-18 22:18 ` Eric Wong @ 2015-02-19 3:26 ` Jason Baron 2015-02-22 0:24 ` Eric Wong 1 sibling, 1 reply; 19+ messages in thread From: Jason Baron @ 2015-02-19 3:26 UTC (permalink / raw) To: Ingo Molnar Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, luto@amacapital.net >> Andy Lutomirski On 02/18/2015 12:51 PM, Ingo Molnar wrote: > * Ingo Molnar <mingo@kernel.org> wrote: > >>> [...] However, I think the userspace API change is less >>> clear since epoll_wait() doesn't currently have an >>> 'input' events argument as epoll_ctl() does. >> ... but the change would be a bit clearer and somewhat >> more flexible: LIFO or FIFO queueing, right? >> >> But having the queueing model as part of the epoll >> context is a legitimate approach as well. > Btw., there's another optimization that the networking code > already does when processing incoming packets: waking up a > thread on the local CPU, where the wakeup is running. > > Doing the same on epoll would have real scalability > advantages where incoming events are IRQ driven and are > distributed amongst multiple CPUs. > > Where events are task driven the scheduler will already try > to pair up waker and wakee so it might not show up in > measurements that markedly. > Right, so this makes me think that we may want to potentially support a variety of wakeup policies. Adding these to the generic wake up code is just going to be too messy. So, perhaps a better approach here would be to register a single wait_queue_t with the event source queue that will always be woken up, and then layer any epoll balancing/irq affinity policies on top of that. So in essence we end up with sort of two queues layers, but I think it provides much nicer isolation between layers. 
Also, the bulk of the changes are going to be isolated to the epoll code, and we avoid Andy's concern about missing or starving out wakeups. So here's a stab at how this API could look: 1. ep1 = epoll_create1(EPOLL_POLICY); So EPOLL_POLICY here could be the round robin policy described here, or the irq affinity or other ideas. The idea is to create an fd that is local to the process, such that other processes cannot subsequently attach to it and affect our policy. 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL); This associates ep1 with the event source. ep1 can be associated with or added to at most 1 wakeup source. This call would largely just form the association, but not queue anything to the fd_source wait queue. 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event); epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event); epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event); . . . Finally, we add the epoll sets to the event source (indirectly via ep1). So the first add would actually queue the callback to the fd_source, while the subsequent calls would simply queue things to the 'nested' wakeup queue associated with ep1. So any existing epoll/poll/select calls could be queued as well to fd_source and will operate independently from this mechanism, as the fd_source queue continues to be 'wake all'. Also, there should be no changes necessary to __wake_up_common(), other than potentially passing more back through the wait_queue_func_t, such as 'nr_exclusive'. Thanks, -Jason ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-19 3:26 ` Jason Baron @ 2015-02-22 0:24 ` Eric Wong 2015-02-25 15:48 ` Jason Baron 0 siblings, 1 reply; 19+ messages in thread From: Eric Wong @ 2015-02-22 0:24 UTC (permalink / raw) To: Jason Baron Cc: Ingo Molnar, peterz, mingo, viro, akpm, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, luto@amacapital.net >> Andy Lutomirski Jason Baron <jbaron@akamai.com> wrote: > On 02/18/2015 12:51 PM, Ingo Molnar wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > > > >>> [...] However, I think the userspace API change is less > >>> clear since epoll_wait() doesn't currently have an > >>> 'input' events argument as epoll_ctl() does. > >> ... but the change would be a bit clearer and somewhat > >> more flexible: LIFO or FIFO queueing, right? > >> > >> But having the queueing model as part of the epoll > >> context is a legitimate approach as well. > > Btw., there's another optimization that the networking code > > already does when processing incoming packets: waking up a > > thread on the local CPU, where the wakeup is running. > > > > Doing the same on epoll would have real scalability > > advantages where incoming events are IRQ driven and are > > distributed amongst multiple CPUs. > > > > Where events are task driven the scheduler will already try > > to pair up waker and wakee so it might not show up in > > measurements that markedly. > > > > Right, so this makes me think that we may want to potentially > support a variety of wakeup policies. Adding these to the > generic wake up code is just going to be too messy. So, perhaps > a better approach here would be to register a single > wait_queue_t with the event source queue that will always > be woken up, and then layer any epoll balancing/irq affinity > policies on top of that. 
So in essence we end up with sort of > two queues layers, but I think it provides much nicer isolation > between layers. Also, the bulk of the changes are going to be > isolated to the epoll code, and we avoid Andy's concern about > missing, or starving out wakeups. > > So here's a stab at how this API could look: > > 1. ep1 = epoll_create1(EPOLL_POLICY); > > So EPOLL_POLICY here could the round robin policy described > here, or the irq affinity or other ideas. The idea is to create > an fd that is local to the process, such that other processes > can not subsequently attach to it and affect our policy. I'm not against defining more policies if needed. Maybe FIFO vs LIFO is a good case for this. For affinity, it could probably be done transparently based on epoll_wait retrievals + EPOLL_CTL_MOD operations. > 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL); > > This associates ep1 with the event source. ep1 can be > associated with or added to at most 1 wakeup source. This call > would largely just form the association, but not queue anything > to the fd_source wait queue. This would mean one extra FD for every fd_source, but that's only a handful of FDs (listen sockets), correct? > 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event); > epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event); > epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event); > . > . > . > > Finally, we add the epoll sets to the event source (indirectly via > ep1). So the first add would actually queue the callback to the > fd_source. While the subsequent calls would simply queue things > to the 'nested' wakeup queue associated with ep1. I'm not sure I follow, wouldn't this increase the number of wakeups? > So any existing epoll/poll/select calls could be queued as well > to fd_source and will operate independenly from this mechanism, > as the fd_source queue continues to be 'wake all'. 
Also, there > should be no changes necessary to __wake_up_common(), other > than potentially passing more back though the > wait_queue_func_t, such as 'nr_exclusive'. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-22 0:24 ` Eric Wong @ 2015-02-25 15:48 ` Jason Baron 0 siblings, 0 replies; 19+ messages in thread From: Jason Baron @ 2015-02-25 15:48 UTC (permalink / raw) To: Eric Wong Cc: Ingo Molnar, peterz, mingo, viro, akpm, davidel, mtk.manpages, linux-kernel, linux-fsdevel, linux-api, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, luto@amacapital.net >> Andy Lutomirski On 02/21/2015 07:24 PM, Eric Wong wrote: > Jason Baron <jbaron@akamai.com> wrote: >> On 02/18/2015 12:51 PM, Ingo Molnar wrote: >>> * Ingo Molnar <mingo@kernel.org> wrote: >>> >>>>> [...] However, I think the userspace API change is less >>>>> clear since epoll_wait() doesn't currently have an >>>>> 'input' events argument as epoll_ctl() does. >>>> ... but the change would be a bit clearer and somewhat >>>> more flexible: LIFO or FIFO queueing, right? >>>> >>>> But having the queueing model as part of the epoll >>>> context is a legitimate approach as well. >>> Btw., there's another optimization that the networking code >>> already does when processing incoming packets: waking up a >>> thread on the local CPU, where the wakeup is running. >>> >>> Doing the same on epoll would have real scalability >>> advantages where incoming events are IRQ driven and are >>> distributed amongst multiple CPUs. >>> >>> Where events are task driven the scheduler will already try >>> to pair up waker and wakee so it might not show up in >>> measurements that markedly. >>> >> Right, so this makes me think that we may want to potentially >> support a variety of wakeup policies. Adding these to the >> generic wake up code is just going to be too messy. So, perhaps >> a better approach here would be to register a single >> wait_queue_t with the event source queue that will always >> be woken up, and then layer any epoll balancing/irq affinity >> policies on top of that. 
So in essence we end up with sort of >> two queues layers, but I think it provides much nicer isolation >> between layers. Also, the bulk of the changes are going to be >> isolated to the epoll code, and we avoid Andy's concern about >> missing, or starving out wakeups. >> >> So here's a stab at how this API could look: >> >> 1. ep1 = epoll_create1(EPOLL_POLICY); >> >> So EPOLL_POLICY here could the round robin policy described >> here, or the irq affinity or other ideas. The idea is to create >> an fd that is local to the process, such that other processes >> can not subsequently attach to it and affect our policy. > I'm not against defining more policies if needed. > Maybe FIFO vs LIFO is a good case for this. > > For affinity, it could probably be done transparently based on > epoll_wait retrievals + EPOLL_CTL_MOD operations. > >> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL); >> >> This associates ep1 with the event source. ep1 can be >> associated with or added to at most 1 wakeup source. This call >> would largely just form the association, but not queue anything >> to the fd_source wait queue. > This would mean one extra FD for every fd_source, but that's > only a handful of FDs (listen sockets), correct? Yes, one extra epoll fd per shared wakeup source, so this should result in very few additional fds. >> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event); >> epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event); >> epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event); >> . >> . >> . >> >> Finally, we add the epoll sets to the event source (indirectly via >> ep1). So the first add would actually queue the callback to the >> fd_source. While the subsequent calls would simply queue things >> to the 'nested' wakeup queue associated with ep1. > I'm not sure I follow, wouldn't this increase the number of wakeups? I agree, my text there is confusing...I've posted this idea as v3 of this series, so hopefully that clarifies this approach. 
Thanks, -Jason ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN 2015-02-18 17:38 ` Jason Baron 2015-02-18 17:45 ` Ingo Molnar @ 2015-02-18 23:12 ` Andy Lutomirski 1 sibling, 0 replies; 19+ messages in thread From: Andy Lutomirski @ 2015-02-18 23:12 UTC (permalink / raw) To: Jason Baron Cc: Davide Libenzi, Michael Kerrisk, Linux FS Devel, Eric Wong, Andrew Morton, Peter Zijlstra, linux-kernel@vger.kernel.org, Al Viro, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Linux API, Ingo Molnar On Feb 18, 2015 9:38 AM, "Jason Baron" <jbaron@akamai.com> wrote: > > On 02/18/2015 11:33 AM, Ingo Molnar wrote: > > * Jason Baron <jbaron@akamai.com> wrote: > > > >>> This has two main advantages: firstly it solves the > >>> O(N) (micro-)problem, but it also more evenly > >>> distributes events both between task-lists and within > >>> epoll groups as tasks as well. > >> Its solving 2 issues - spurious wakeups, and more even > >> loading of threads. The event distribution is more even > >> between 'epoll groups' with this patch, however, if > >> multiple threads are blocking on a single 'epoll group', > >> this patch does not affect the the event distribution > >> there. [...] > > Regarding your last point, are you sure about that? > > > > If we have say 16 epoll threads registered, and if the list > > is static (no register/unregister activity), then the > > wakeup pattern is in strict order of the list: threads > > closer to the list head will be woken more frequently, in a > > wake-once fashion. So if threads do just quick work and go > > back to sleep quickly, then typically only the first 2-3 > > threads will get any runtime in practice - the wakeup > > iteration never gets 'deep' into the list. > > > > With the round-robin shuffling of the list, the threads get > > shuffled to the tail on wakeup, which distributes events > > evenly: all 16 epoll threads will accumulate an even > > distribution of runtime, statistically. 
> > > > Have I misunderstood this somehow? > > > > > > So in the case of multiple threads per epoll set, we currently > add to the head of wakeup queue exclusively in 'epoll_wait()', > and then subsequently remove from the > queue once > 'epoll_wait()' returns. So I don't think this patch addresses > balancing on a per epoll set basis. > > I think we could address the case you describe by simply doing > __add_wait_queue_tail_exclusive() instead of > __add_wait_queue_exclusive() in epoll_wait(). However, I think > the userspace API change is less clear since epoll_wait() doesn't > currently have an 'input' events argument as epoll_ctl() does. FWIW there's currently discussion about adding a new epoll API for batch epoll_ctl. It could be worth coordinating with that effort to see if some variant could address both use cases. I'm still nervous about changing the per-fd wakeup stuff to do anything other than waking everything. After all, epoll and poll can be used concurrently. What about a slightly different approach: could an epoll fd support multiple contexts? For example, an fd could be set (with epoll_ctl or the new batch stuff) to wake any epoll waiter, one specific epoll waiter, an epoll waiter preferably on the waking cpu, etc. This would have the benefit of keeping the wakeup changes localized to the epoll code. --Andy > > Thanks, > > -Jason > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com>]
* Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN [not found] ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com> @ 2015-02-27 22:24 ` Jason Baron 0 siblings, 0 replies; 19+ messages in thread From: Jason Baron @ 2015-02-27 22:24 UTC (permalink / raw) To: Hagen Paul Pfeifer Cc: normalperson, linux-kernel, Michael Kerrisk-manpages, Peter Zijlstra, linux-fsdevel, viro, davidel, Andrew Morton, linux-api, mingo Hi, v3 of this series implements this idea using a different approach: http://lkml.iu.edu/hypermail/linux/kernel/1502.3/00667.html If that still meets your needs, it would be helpful to know, in order to move this forward. Looking back at your posting, I was concerned about the test case not lining up with the kernel code change. Thanks, -Jason On 02/27/2015 03:56 PM, Hagen Paul Pfeifer wrote: > > Applause! Nice patch, sad that I submitted this patch ~3 years ago with > exactly the same naming (EPOLLEXCLUSIVE) & nearly exact commit message and > you rejected the patch ... > > Hagen > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 0/2] Add epoll round robin wakeup mode 2015-02-17 19:33 [PATCH v2 0/2] Add epoll round robin wakeup mode Jason Baron 2015-02-17 19:33 ` [PATCH v2 1/2] sched/wait: add " Jason Baron 2015-02-17 19:33 ` [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron @ 2015-02-17 19:46 ` Andy Lutomirski 2015-02-17 20:33 ` Jason Baron 2 siblings, 1 reply; 19+ messages in thread From: Andy Lutomirski @ 2015-02-17 19:46 UTC (permalink / raw) To: Jason Baron Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong, Davide Libenzi, Michael Kerrisk-manpages, linux-kernel@vger.kernel.org, Linux FS Devel, Linux API On Tue, Feb 17, 2015 at 11:33 AM, Jason Baron <jbaron@akamai.com> wrote: > When we are sharing a wakeup source among multiple epoll fds, we end up with > thundering herd wakeups, since there is currently no way to add to the > wakeup source exclusively. This series introduces 2 new epoll flags, > EPOLLEXCLUSIVE for adding to a wakeup source exclusively. And EPOLLROUNDROBIN > which is to be used in conjunction to EPOLLEXCLUSIVE to evenly > distribute the wakeups. This patch was originally motivated by a desire to > improve wakeup balance and cpu usage for a listen socket() shared amongst > multiple epoll fd sets. > > See: http://lwn.net/Articles/632590/ for previous test program and testing > resutls. > > Epoll manpage text: > > EPOLLEXCLUSIVE > Provides exclusive wakeups when attaching multiple epoll fds to a > shared wakeup source. Must be specified with an EPOLL_CTL_ADD operation. > > EPOLLROUNDROBIN > Provides balancing for exclusive wakeups when attaching multiple epoll > fds to a shared wakeup soruce. Depends on EPOLLEXCLUSIVE being set and > must be specified with an EPOLL_CTL_ADD operation. > > Thanks, What permissions do you need on the file descriptor to do this? This will be the first case where a poll-like operation has side effects, and that's rather weird IMO. 
--Andy > > -Jason > > > Jason Baron (2): > sched/wait: add round robin wakeup mode > epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN > > fs/eventpoll.c | 25 ++++++++++++++++++++----- > include/linux/wait.h | 11 +++++++++++ > include/uapi/linux/eventpoll.h | 6 ++++++ > kernel/sched/wait.c | 10 ++++++++-- > 4 files changed, 45 insertions(+), 7 deletions(-) > > -- > 1.8.2.rc2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 0/2] Add epoll round robin wakeup mode 2015-02-17 19:46 ` [PATCH v2 0/2] Add epoll round robin wakeup mode Andy Lutomirski @ 2015-02-17 20:33 ` Jason Baron 2015-02-17 21:09 ` Andy Lutomirski 0 siblings, 1 reply; 19+ messages in thread From: Jason Baron @ 2015-02-17 20:33 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong, Davide Libenzi, Michael Kerrisk-manpages, linux-kernel@vger.kernel.org, Linux FS Devel, Linux API On 02/17/2015 02:46 PM, Andy Lutomirski wrote: > On Tue, Feb 17, 2015 at 11:33 AM, Jason Baron <jbaron@akamai.com> wrote: >> When we are sharing a wakeup source among multiple epoll fds, we end up with >> thundering herd wakeups, since there is currently no way to add to the >> wakeup source exclusively. This series introduces 2 new epoll flags, >> EPOLLEXCLUSIVE for adding to a wakeup source exclusively. And EPOLLROUNDROBIN >> which is to be used in conjunction to EPOLLEXCLUSIVE to evenly >> distribute the wakeups. This patch was originally motivated by a desire to >> improve wakeup balance and cpu usage for a listen socket() shared amongst >> multiple epoll fd sets. >> >> See: http://lwn.net/Articles/632590/ for previous test program and testing >> resutls. >> >> Epoll manpage text: >> >> EPOLLEXCLUSIVE >> Provides exclusive wakeups when attaching multiple epoll fds to a >> shared wakeup source. Must be specified with an EPOLL_CTL_ADD operation. >> >> EPOLLROUNDROBIN >> Provides balancing for exclusive wakeups when attaching multiple epoll >> fds to a shared wakeup soruce. Depends on EPOLLEXCLUSIVE being set and >> must be specified with an EPOLL_CTL_ADD operation. >> >> Thanks, > What permissions do you need on the file descriptor to do this? This > will be the first case where a poll-like operation has side effects, > and that's rather weird IMO. 
> So in the case where you have both non-exclusive and exclusive waiters, all of the non-exclusive waiters will continue to get woken up. However, I think you're getting at having multiple exclusive waiters and potentially 'starving' out other exclusive waiters. In general, I think wait queues are associated with a 'struct file', so I think unless you are sharing your fd table, this isn't an issue. However, there may be cases where this is not true? In which case, perhaps, we could limit this to CAP_SYS_ADMIN... Thanks, -Jason ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 0/2] Add epoll round robin wakeup mode 2015-02-17 20:33 ` Jason Baron @ 2015-02-17 21:09 ` Andy Lutomirski 2015-02-18 3:15 ` Jason Baron 0 siblings, 1 reply; 19+ messages in thread From: Andy Lutomirski @ 2015-02-17 21:09 UTC (permalink / raw) To: Jason Baron Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong, Davide Libenzi, Michael Kerrisk-manpages, linux-kernel@vger.kernel.org, Linux FS Devel, Linux API On Tue, Feb 17, 2015 at 12:33 PM, Jason Baron <jbaron@akamai.com> wrote: > On 02/17/2015 02:46 PM, Andy Lutomirski wrote: >> On Tue, Feb 17, 2015 at 11:33 AM, Jason Baron <jbaron@akamai.com> wrote: >>> When we are sharing a wakeup source among multiple epoll fds, we end up with >>> thundering herd wakeups, since there is currently no way to add to the >>> wakeup source exclusively. This series introduces 2 new epoll flags, >>> EPOLLEXCLUSIVE for adding to a wakeup source exclusively. And EPOLLROUNDROBIN >>> which is to be used in conjunction to EPOLLEXCLUSIVE to evenly >>> distribute the wakeups. This patch was originally motivated by a desire to >>> improve wakeup balance and cpu usage for a listen socket() shared amongst >>> multiple epoll fd sets. >>> >>> See: http://lwn.net/Articles/632590/ for previous test program and testing >>> resutls. >>> >>> Epoll manpage text: >>> >>> EPOLLEXCLUSIVE >>> Provides exclusive wakeups when attaching multiple epoll fds to a >>> shared wakeup source. Must be specified with an EPOLL_CTL_ADD operation. >>> >>> EPOLLROUNDROBIN >>> Provides balancing for exclusive wakeups when attaching multiple epoll >>> fds to a shared wakeup soruce. Depends on EPOLLEXCLUSIVE being set and >>> must be specified with an EPOLL_CTL_ADD operation. >>> >>> Thanks, >> What permissions do you need on the file descriptor to do this? This >> will be the first case where a poll-like operation has side effects, >> and that's rather weird IMO. 
>> > > So in the case where you have both non-exclusive and exclusive > waiters, all of the non-exclusive waiters will continue to get woken > up. However, I think you're getting at having multiple exclusive > waiters and potentially 'starving' out other exclusive waiters. > > In general, I think wait queues are associated with a 'struct file', > so I think unless you are sharing your fd table, this isn't an issue. > However, there may be cases where this is not true? In which > case, perhaps, we could limit this to CAP_SYS_ADMIN... There's also SCM_RIGHTS, which can be used in conjunction with file sealing and such. In general, I feel like this patch series solves a problem that isn't well understood and does it by adding a rather strange new mechanism. Is there really a problem that can't be addressed by more normal epoll features? --Andy > > Thanks, > > -Jason > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 0/2] Add epoll round robin wakeup mode 2015-02-17 21:09 ` Andy Lutomirski @ 2015-02-18 3:15 ` Jason Baron 0 siblings, 0 replies; 19+ messages in thread From: Jason Baron @ 2015-02-18 3:15 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong, Davide Libenzi, Michael Kerrisk-manpages, linux-kernel@vger.kernel.org, Linux FS Devel, Linux API, Linus Torvalds, Mathieu Desnoyers, edumazet On 02/17/2015 04:09 PM, Andy Lutomirski wrote: > On Tue, Feb 17, 2015 at 12:33 PM, Jason Baron <jbaron@akamai.com> wrote: >> On 02/17/2015 02:46 PM, Andy Lutomirski wrote: >>> On Tue, Feb 17, 2015 at 11:33 AM, Jason Baron <jbaron@akamai.com> wrote: >>>> When we are sharing a wakeup source among multiple epoll fds, we end up with >>>> thundering herd wakeups, since there is currently no way to add to the >>>> wakeup source exclusively. This series introduces 2 new epoll flags, >>>> EPOLLEXCLUSIVE for adding to a wakeup source exclusively. And EPOLLROUNDROBIN >>>> which is to be used in conjunction to EPOLLEXCLUSIVE to evenly >>>> distribute the wakeups. This patch was originally motivated by a desire to >>>> improve wakeup balance and cpu usage for a listen socket() shared amongst >>>> multiple epoll fd sets. >>>> >>>> See: http://lwn.net/Articles/632590/ for previous test program and testing >>>> resutls. >>>> >>>> Epoll manpage text: >>>> >>>> EPOLLEXCLUSIVE >>>> Provides exclusive wakeups when attaching multiple epoll fds to a >>>> shared wakeup source. Must be specified with an EPOLL_CTL_ADD operation. >>>> >>>> EPOLLROUNDROBIN >>>> Provides balancing for exclusive wakeups when attaching multiple epoll >>>> fds to a shared wakeup soruce. Depends on EPOLLEXCLUSIVE being set and >>>> must be specified with an EPOLL_CTL_ADD operation. >>>> >>>> Thanks, >>> What permissions do you need on the file descriptor to do this? 
This >>> will be the first case where a poll-like operation has side effects, >>> and that's rather weird IMO. >>> >> So in the case where you have both non-exclusive and exclusive >> waiters, all of the non-exclusive waiters will continue to get woken >> up. However, I think you're getting at having multiple exclusive >> waiters and potentially 'starving' out other exclusive waiters. >> >> In general, I think wait queues are associated with a 'struct file', >> so I think unless you are sharing your fd table, this isn't an issue. >> However, there may be cases where this is not true? In which >> case, perhaps, we could limit this to CAP_SYS_ADMIN... > There's also SCM_RIGHTS, which can be used in conjunction with file > sealing and such. > > In general, I feel like this patch series solves a problem that isn't > well understood and does it by adding a rather strange new mechanism. > Is there really a problem that can't be addressed by more normal epoll > features? > > --Andy Hmm... so I dug through some of the Linux archives a bit and this problem seems to crop up every so often without resolution. So I do believe that it's an issue that people are more generally interested in. See: http://lkml.iu.edu/hypermail/linux/kernel/1201.1/02620.html http://marc.info/?l=linux-kernel&m=128638781921073&w=2 In the latter thread, Linus suggests adding it to the "requested events" field passed to poll: http://marc.info/?l=linux-kernel&m=128639416832335&w=2 So, I think that this series at least moves in that suggested direction. Thanks, -Jason ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2015-02-27 22:24 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-02-17 19:33 [PATCH v2 0/2] Add epoll round robin wakeup mode Jason Baron 2015-02-17 19:33 ` [PATCH v2 1/2] sched/wait: add " Jason Baron 2015-02-17 19:33 ` [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron 2015-02-18 8:07 ` Ingo Molnar 2015-02-18 15:42 ` Jason Baron 2015-02-18 16:33 ` Ingo Molnar 2015-02-18 17:38 ` Jason Baron 2015-02-18 17:45 ` Ingo Molnar 2015-02-18 17:51 ` Ingo Molnar 2015-02-18 22:18 ` Eric Wong 2015-02-19 3:26 ` Jason Baron 2015-02-22 0:24 ` Eric Wong 2015-02-25 15:48 ` Jason Baron 2015-02-18 23:12 ` Andy Lutomirski [not found] ` <CAPh34mcPNQELwZCDTHej+HK=bpWgJ=jb1LeCtKoUHVgoDJOJoQ@mail.gmail.com> 2015-02-27 22:24 ` Jason Baron 2015-02-17 19:46 ` [PATCH v2 0/2] Add epoll round robin wakeup mode Andy Lutomirski 2015-02-17 20:33 ` Jason Baron 2015-02-17 21:09 ` Andy Lutomirski 2015-02-18 3:15 ` Jason Baron