* [PATCH 0/2] Add epoll round robin wakeup mode
@ 2015-02-09 20:05 Jason Baron
  2015-02-09 20:05 ` [PATCH 1/2] sched/wait: add " Jason Baron
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jason Baron @ 2015-02-09 20:05 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

Hi,

When we are sharing a wakeup source among multiple epoll fds, we end up with
thundering herd wakeups, since there is currently no way to add to the
wakeup source exclusively. This series introduces 2 new epoll flags:
EPOLLEXCLUSIVE, for adding to a wakeup source exclusively, and
EPOLLROUNDROBIN, which is to be used in conjunction with EPOLLEXCLUSIVE to
evenly distribute the wakeups. I'm showing perf results from the simple
pipe() use case below, but this patch was originally motivated by a desire
to improve wakeup balance and cpu usage for a shared listen socket.

Perf stat, 3.19.0-rc7+, 4 core, Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz:

pipe test wake all:

 Performance counter stats for './wake':

      10837.480396      task-clock (msec)         #    1.879 CPUs utilized          
           2047108      context-switches          #    0.189 M/sec                  
            214491      cpu-migrations            #    0.020 M/sec                  
               247      page-faults               #    0.023 K/sec                  
       23655687888      cycles                    #    2.183 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       11242141621      instructions              #    0.48  insns per cycle        
        2313479486      branches                  #  213.470 M/sec                  
          13679036      branch-misses             #    0.59% of all branches        

       5.768295821 seconds time elapsed

pipe test wake balanced:

 Performance counter stats for './wake -o':

        291.250312      task-clock (msec)         #    0.094 CPUs utilized          
             40308      context-switches          #    0.138 M/sec                  
              1448      cpu-migrations            #    0.005 M/sec                  
               248      page-faults               #    0.852 K/sec                  
         646407197      cycles                    #    2.219 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
         364256883      instructions              #    0.56  insns per cycle        
          65775397      branches                  #  225.838 M/sec                  
            535637      branch-misses             #    0.81% of all branches        

       3.086694452 seconds time elapsed

Rough epoll manpage text:

EPOLLEXCLUSIVE
	Provides exclusive wakeups when attaching multiple epoll fds to a
	shared wakeup source. Must be specified on an EPOLL_CTL_ADD operation.

EPOLLROUNDROBIN
	Provides balancing for exclusive wakeups when attaching multiple epoll
	fds to a shared wakeup source. Must be specified with EPOLLEXCLUSIVE
	during an EPOLL_CTL_ADD operation.


Thanks,

-Jason

The test program used for the measurements above:

#include <unistd.h>
#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define NUM_THREADS 100
#define NUM_EVENTS 20000
#define EPOLLEXCLUSIVE (1 << 28)
#define EPOLLBALANCED (1 << 27)	/* EPOLLROUNDROBIN in the patch */

int optimize, exclusive;
int p[2];
pthread_t threads[NUM_THREADS];
int event_count[NUM_THREADS];

struct epoll_event evt = {
	.events = EPOLLIN 
};

void die(const char *msg) {
    perror(msg);
    exit(-1);
}

void *run_func(void *ptr)
{
	int ret;
	int epfd;
	char buf[4];
	int id = *(int *)ptr;
	struct epoll_event e;

	if ((epfd = epoll_create(1)) < 0)
		die("create");

	if (optimize)
		evt.events |= (EPOLLBALANCED | EPOLLEXCLUSIVE);
	else if (exclusive)
		evt.events |= EPOLLEXCLUSIVE;
	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &evt);
	if (ret)
		perror("epoll_ctl add");

	/* count the events this thread actually consumes */
	while (1) {
		ret = epoll_wait(epfd, &e, 1, -1);
		ret = read(p[0], buf, sizeof(buf));
		if (ret == 4)
			event_count[id]++;
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, j;
	int id[NUM_THREADS];
	int total = 0;
	int nohit = 0;

	if (argc == 2) {
		if (strcmp(argv[1], "-o") == 0)
			optimize = 1;
		if (strcmp(argv[1], "-e") == 0)
			exclusive = 1;
	}

	if (pipe(p) < 0)
		die("pipe");

	for (i = 0; i < NUM_THREADS; i++) {
		id[i] = i;
		pthread_create(&threads[i], NULL, run_func, &id[i]);
	} 

	for (j = 0; j < NUM_EVENTS; j++) {
		write(p[1], p, sizeof(int));
		usleep(100);
	}

	for (i = 0; i < NUM_THREADS; i++) {
		pthread_cancel(threads[i]);
		pthread_join(threads[i], NULL);
		printf("joined: %d\n", i);
		printf("event count: %d\n", event_count[i]);
		total += event_count[i];
		if (!event_count[i])
			nohit++;
	}

	printf("total events is: %d\n", total);
	printf("nohit is: %d\n", nohit);
	return 0;
}


Jason Baron (2):
  sched/wait: add round robin wakeup mode
  epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

 fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
 include/linux/wait.h           | 11 +++++++++++
 include/uapi/linux/eventpoll.h |  6 ++++++
 kernel/sched/wait.c            |  5 ++++-
 4 files changed, 41 insertions(+), 6 deletions(-)

-- 
1.8.2.rc2



* [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-09 20:05 [PATCH 0/2] Add epoll round robin wakeup mode Jason Baron
@ 2015-02-09 20:05 ` Jason Baron
  2015-02-09 20:26   ` Michael Kerrisk
  2015-02-09 21:50   ` Peter Zijlstra
  2015-02-09 20:06 ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
  2015-02-09 20:25 ` [PATCH 0/2] Add epoll round robin wakeup mode Michael Kerrisk
  2 siblings, 2 replies; 18+ messages in thread
From: Jason Baron @ 2015-02-09 20:05 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

The motivation for this flag is to allow the distribution of wakeups from
a shared source in a balanced manner. Currently, we can add threads
exclusively, but that often results in the same thread being woken up again
and again. In the case where we are trying to balance work across threads,
this is not desirable.

WQ_FLAG_ROUND_ROBIN is restricted to exclusive waiters as well; otherwise we
do not know which entry was woken up.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/linux/wait.h | 11 +++++++++++
 kernel/sched/wait.c  |  5 ++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2232ed1..bbdef98 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -16,6 +16,7 @@ int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *ke
 /* __wait_queue::flags */
 #define WQ_FLAG_EXCLUSIVE	0x01
 #define WQ_FLAG_WOKEN		0x02
+#define WQ_FLAG_ROUND_ROBIN	0x04
 
 struct __wait_queue {
 	unsigned int		flags;
@@ -109,6 +110,16 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 
 extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
 extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait);
+
+/*
+ * rr relies on exclusive, otherwise we don't know which entry was woken
+ */
+static inline void add_wait_queue_rr(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	wait->flags |= WQ_FLAG_ROUND_ROBIN;
+	add_wait_queue_exclusive(q, wait);
+}
+
 extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
 
 static inline void __add_wait_queue(wait_queue_head_t *head, wait_queue_t *new)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..17d1039 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -71,8 +71,11 @@ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
 		unsigned flags = curr->flags;
 
 		if (curr->func(curr, mode, wake_flags, key) &&
-				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+			       (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) {
+			if (flags & WQ_FLAG_ROUND_ROBIN)
+				list_move_tail(&curr->task_list, &q->task_list);
 			break;
+		}
 	}
 }
 
-- 
1.8.2.rc2



* [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:05 [PATCH 0/2] Add epoll round robin wakeup mode Jason Baron
  2015-02-09 20:05 ` [PATCH 1/2] sched/wait: add " Jason Baron
@ 2015-02-09 20:06 ` Jason Baron
  2015-02-09 20:18   ` Andy Lutomirski
  2015-02-09 20:27   ` Michael Kerrisk
  2015-02-09 20:25 ` [PATCH 0/2] Add epoll round robin wakeup mode Michael Kerrisk
  2 siblings, 2 replies; 18+ messages in thread
From: Jason Baron @ 2015-02-09 20:06 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

Epoll file descriptors that are added to a shared wakeup source are always
added in a non-exclusive manner. That means that when we have multiple epoll
fds attached to a shared wakeup source they are all woken up. This can
lead to excessive cpu usage and uneven load distribution.

This patch introduces two new 'events' flags that are intended to be used
with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
source in an exclusive manner, such that the minimum number of threads are
woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
also be added to the 'events' flags, such that we round robin around the set
of waiting threads.

An implementation note is that in the epoll wakeup routine,
'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
successful wakeup, only when there are current waiters. The idea is to use
this additional heuristic in order to minimize wakeup latencies.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
 include/uapi/linux/eventpoll.h |  6 ++++++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..382c832 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,8 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
+			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1082,8 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLROUNDROBIN)
+		return ewake;
 	return 1;
 }
 
@@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLROUNDROBIN)
+			add_wait_queue_rr(whead, &pwq->wait);
+		else if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
 SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		struct epoll_event __user *, event)
 {
-	int error;
-	int full_check = 0;
+	int error, full_check = 0, wait_flags = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
@@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
+	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
+	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..10260a1 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,12 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Balance wakeups for a shared event source */
+#define EPOLLROUNDROBIN (1 << 27)
+
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
1.8.2.rc2



* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:06 ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
@ 2015-02-09 20:18   ` Andy Lutomirski
  2015-02-09 21:32     ` Jason Baron
  2015-02-09 20:27   ` Michael Kerrisk
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2015-02-09 20:18 UTC (permalink / raw)
  To: Jason Baron, peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

On 02/09/2015 12:06 PM, Jason Baron wrote:
> [...]

I don't understand what this is intended to do.

If an event has EPOLLONESHOT, then only one thread should be woken
regardless, right?  If not, isn't that just a bug that should be fixed?

If an event has EPOLLET, then the considerations are similar to 
EPOLLONESHOT, right?

If an event is a normal level-triggered non-one-shot event, then I don't 
understand how a round-robin wakeup makes any sense.  It's 
level-triggered, after all.

--Andy


* Re: [PATCH 0/2] Add epoll round robin wakeup mode
  2015-02-09 20:05 [PATCH 0/2] Add epoll round robin wakeup mode Jason Baron
  2015-02-09 20:05 ` [PATCH 1/2] sched/wait: add " Jason Baron
  2015-02-09 20:06 ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
@ 2015-02-09 20:25 ` Michael Kerrisk
  2 siblings, 0 replies; 18+ messages in thread
From: Michael Kerrisk @ 2015-02-09 20:25 UTC (permalink / raw)
  To: Jason Baron
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, normalperson,
	Davide Libenzi, Linux Kernel, Linux-Fsdevel, Linux API

[CC += linux-api@vger.kernel.org]

Jason,

Since this is a kernel-user-space API change, please CC linux-api@.
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to linux-api@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html


Thanks,

Michael


On Mon, Feb 9, 2015 at 9:05 PM, Jason Baron <jbaron@akamai.com> wrote:
> [...]


-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


* Re: [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-09 20:05 ` [PATCH 1/2] sched/wait: add " Jason Baron
@ 2015-02-09 20:26   ` Michael Kerrisk
  2015-02-09 21:50   ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Kerrisk @ 2015-02-09 20:26 UTC (permalink / raw)
  To: Jason Baron
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, normalperson,
	Davide Libenzi, Linux Kernel, Linux-Fsdevel, Linux API

[CC += linux-api@vger.kernel.org]


On Mon, Feb 9, 2015 at 9:05 PM, Jason Baron <jbaron@akamai.com> wrote:
> [...]



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:06 ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
  2015-02-09 20:18   ` Andy Lutomirski
@ 2015-02-09 20:27   ` Michael Kerrisk
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Kerrisk @ 2015-02-09 20:27 UTC (permalink / raw)
  To: Jason Baron
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, normalperson,
	Davide Libenzi, Linux Kernel, Linux-Fsdevel, Linux API

[CC += linux-api@vger.kernel.org]


On Mon, Feb 9, 2015 at 9:06 PM, Jason Baron <jbaron@akamai.com> wrote:
> [...]



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 20:18   ` Andy Lutomirski
@ 2015-02-09 21:32     ` Jason Baron
  2015-02-09 22:45       ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-02-09 21:32 UTC (permalink / raw)
  To: Andy Lutomirski, peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, linux-kernel,
	linux-fsdevel

On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> On 02/09/2015 12:06 PM, Jason Baron wrote:
>> [...]
>
> I don't understand what this is intended to do.
>
> If an event has EPOLLONESHOT, then this only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>

hmm...so with EPOLLONESHOT you basically get notified once about an event.
If I have multiple epoll fds (say 1 per-thread) attached to a single source
in EPOLLONESHOT, then all threads will potentially get woken up once per
event. Then, I would have to re-arm all of them. So I don't think this
addresses this particular use case...what I am trying to avoid is this mass
wakeup or thundering herd for a shared event source.

> If an event has EPOLLET, then the considerations are similar to EPOLLONESHOT, right?
>

EPOLLET is still going to cause this thundering herd.

> If an event is a normal level-triggered non-one-shot event, then I don't understand how a round-robin wakeup makes any sense.  It's level-triggered, after all.

Yeah, so the current behavior is to wake up all of the threads. I'm trying
to add a new mode where it load balances among the threads interested in
the event. Perhaps the test program I attached to 0/2 will show the issue
better?

Also, this originally came up in the context of a single listening socket
which was attached to multiple epoll fds, each in a separate thread. With
the attached patch, I can measure a large decrease in cpu usage and better
balancing behavior among the accepting threads. A sketch of that setup
follows.
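
To make it concrete, each thread would do roughly the following (an
illustrative sketch only; 'listen_fd' is the shared listening socket and
'handle()' stands in for the application logic):

void *worker(void *unused)
{
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLEXCLUSIVE | EPOLLROUNDROBIN,
	};
	int epfd = epoll_create(1);

	/* each thread attaches its own epoll fd to the one listen socket */
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
	for (;;) {
		/* with the new flags, only one waiter is woken per event */
		if (epoll_wait(epfd, &ev, 1, -1) < 1)
			continue;
		handle(accept(listen_fd, NULL, NULL));
	}
}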

Thanks,

-Jason


* Re: [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-09 20:05 ` [PATCH 1/2] sched/wait: add " Jason Baron
  2015-02-09 20:26   ` Michael Kerrisk
@ 2015-02-09 21:50   ` Peter Zijlstra
  2015-02-10  4:06     ` Jason Baron
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-02-09 21:50 UTC (permalink / raw)
  To: Jason Baron
  Cc: mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel

On Mon, Feb 09, 2015 at 08:05:57PM +0000, Jason Baron wrote:
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 852143a..17d1039 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -71,8 +71,11 @@ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>  		unsigned flags = curr->flags;
>  
>  		if (curr->func(curr, mode, wake_flags, key) &&
> -				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> +			       (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) {
> +			if (flags & WQ_FLAG_ROUND_ROBIN)
> +				list_move_tail(&curr->task_list, &q->task_list);
>  			break;
> +		}
>  	}
>  }

I think you meant to write something like:

	if (curr->func(curr, mode, wake_flags, key) &&
	    (flags & WQ_FLAG_EXCLUSIVE)) {
		if (flags & WQ_FLAG_ROUND_ROBIN)
			list_move_tail(&curr->task_list, &q->task_list);
		if (!--nr_exclusive)
			break;
	}

Otherwise can only work for nr_exclusive==1.


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 21:32     ` Jason Baron
@ 2015-02-09 22:45       ` Andy Lutomirski
  2015-02-10  3:59         ` Jason Baron
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2015-02-09 22:45 UTC (permalink / raw)
  To: Jason Baron, Linux API
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong,
	Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>> [...]
>>
>> I don't understand what this is intended to do.
>>
>> If an event has EPOLLONESHOT, then only one thread should be woken
>> regardless, right?  If not, isn't that just a bug that should be fixed?
>
> hmm...so with EPOLLONESHOT you basically get notified once about an event.
> If I have multiple epoll fds (say 1 per-thread) attached to a single source
> in EPOLLONESHOT, then all threads will potentially get woken up once per
> event. Then, I would have to re-arm all of them. So I don't think this
> addresses this particular use case...what I am trying to avoid is this mass
> wakeup or thundering herd for a shared event source.

Now I understand.  Why are you using multiple epollfds?

--Andy

>
>> If an event has EPOLLET, then the considerations are similar to EPOLLONESHOT, right?
>>
>
> EPOLLET is still going to cause this thundering herd.
>
>> If an event is a normal level-triggered non-one-shot event, then I don't understand how a round-robin wakeup makes any sense.  It's level-triggered, after all.
>
> Yeah, so the current behavior is to wake up all of the threads. I'm trying to add a new mode where it load balances among the threads interested in the event. Perhaps, the test program I attached to 0/2 will show the issue better?
>
> Also, this originally came up in the context of a single listening socket which was attached to multiple epoll fds each in a separate thread. With the attached patch, I can measure a large decrease in cpu usage and better balancing behavior among the accepting threads.
>
> Thanks,
>
> -Jason



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-09 22:45       ` Andy Lutomirski
@ 2015-02-10  3:59         ` Jason Baron
  2015-02-10  4:49           ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-02-10  3:59 UTC (permalink / raw)
  To: Andy Lutomirski, Linux API
  Cc: Peter Zijlstra, Ingo Molnar, Al Viro, Andrew Morton, Eric Wong,
	Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
>> [...]
> Now I understand.  Why are you using multiple epollfds?
>
> --Andy

So the multiple epollfds is really a way to partition the set of events.
Otherwise, I have all the threads contending on all the events that are
being generated. So I'm not sure if that is scalable.

In the use-case I'm trying to describe, I've partitioned a large set of
the events, but there may still be some event sources that we wish to
share among all of the threads (or even subsets of them), so as not to
overload any one in particular.

More specifically, in the case of a single listen socket, it's natural
to call accept() on the thread that has been woken up, but without doing
round robin, you quickly get into a very unbalanced load, and in addition
you waste a lot of cpu doing unnecessary wakeups. There are other
approaches to solve this, specifically using SO_REUSEPORT, which creates
a separate socket per-thread and gets one back to the separately
partitioned events case previously described. However, SO_REUSEPORT, I
believe, is very specific to tcp/udp, and in addition does not have
knowledge of the threads that are actively waiting as the epoll code
does. A sketch of that alternative is below.
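
For comparison, the SO_REUSEPORT approach looks roughly like this (a
minimal sketch, error handling omitted): each thread creates, binds and
listens on its own socket, and the kernel spreads incoming connections
across the sockets bound to the same port:

int reuseport_listen_fd(unsigned short port)
{
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	/* allow multiple sockets to bind the same address/port */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(fd, SOMAXCONN);
	return fd;
}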

Thanks,

-Jason


* Re: [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-09 21:50   ` Peter Zijlstra
@ 2015-02-10  4:06     ` Jason Baron
  2015-02-10  9:03       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-02-10  4:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel

On 02/09/2015 04:50 PM, Peter Zijlstra wrote:
> On Mon, Feb 09, 2015 at 08:05:57PM +0000, Jason Baron wrote:
>> [...]
> I think you meant to write something like:
>
> 	if (curr->func(curr, mode, wake_flags, key) &&
> 	    (flags & WQ_FLAG_EXCLUSIVE)) {
> 		if (flags & WQ_FLAG_ROUND_ROBIN)
> 			list_move_tail(&curr->task_list, &q->task_list);
> 		if (!--nr_exclusive)
> 			break;
> 	}
>
> Otherwise can only work for nr_exclusive==1.

Indeed. I'm also wondering if it's worth avoiding the list_move_tail()
for the case where nr_exclusive is initially 0, i.e. the wake-all case,
where we are just going to end up doing a bunch of list_move_tail()
calls, but end up in the same state.

Thanks,

-Jason





* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-10  3:59         ` Jason Baron
@ 2015-02-10  4:49           ` Eric Wong
  2015-02-10 19:16             ` Jason Baron
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2015-02-10  4:49 UTC (permalink / raw)
  To: Jason Baron
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > [...]
> > Now I understand.  Why are you using multiple epollfds?
> >
> > --Andy
> 
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.
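
A minimal sketch of that pattern ('shared_epfd' is the single epoll fd
used by all threads and 'handle()' is the application handler; both names
are hypothetical):

struct epoll_event events[256];
int i, n;

for (;;) {
	n = epoll_wait(shared_epfd, events, 256, -1);
	for (i = 0; i < n; i++) {
		int fd = events[i].data.fd;

		/* EPOLLONESHOT disarmed this fd at wakeup, so no other
		 * thread can be handling it concurrently */
		handle(fd);
		/* re-arm once we're done with it */
		events[i].events = EPOLLIN | EPOLLONESHOT;
		epoll_ctl(shared_epfd, EPOLL_CTL_MOD, fd, &events[i]);
	}
}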

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.

Do you have a userland use case to share?

> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
 
> More specifically, in the case of a single listen socket, its natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do its own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.
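
Roughly (with 'listen_fd' as the shared listening socket and
'worker_epfd[]' as the per-thread epoll fds; all names hypothetical):

int next = 0;

for (;;) {
	struct epoll_event ev = { .events = EPOLLIN };
	int cfd = accept(listen_fd, NULL, NULL);

	if (cfd < 0)
		continue;
	ev.data.fd = cfd;
	/* spread new connections across the workers' epoll sets */
	epoll_ctl(worker_epfd[next], EPOLL_CTL_ADD, cfd, &ev);
	next = (next + 1) % NUM_WORKERS;
}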


* Re: [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-10  4:06     ` Jason Baron
@ 2015-02-10  9:03       ` Peter Zijlstra
  2015-02-10 15:59         ` Jason Baron
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-02-10  9:03 UTC (permalink / raw)
  To: Jason Baron
  Cc: mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel

On Mon, Feb 09, 2015 at 11:06:17PM -0500, Jason Baron wrote:
> On 02/09/2015 04:50 PM, Peter Zijlstra wrote:
> > On Mon, Feb 09, 2015 at 08:05:57PM +0000, Jason Baron wrote:
> >> [...]
> > I think you meant to write something like:
> >
> > 	if (curr->func(curr, mode, wake_flags, key) &&
> > 	    (flags & WQ_FLAG_EXCLUSIVE)) {
> > 		if (flags & WQ_FLAG_ROUND_ROBIN)
> > 			list_move_tail(&curr->task_list, &q->task_list);
> > 		if (!--nr_exclusive)
> > 			break;
> > 	}
> >
> > Otherwise can only work for nr_exclusive==1.
> 
> Indeed. I'm also wondering if its worth avoiding the list_move_tail()
> for the case where nr_exclusive is initially 0. IE the wake all case,
> where we are just going to end up doing a bunch of list_move_tail()
> calls, but end up in the same state.

After writing this email, it occurred to me that you could probably do
this with a custom wake function.

Where autoremove_wake_function() does a list_del_init() you could do a
rotate_wake_function() that does list_move_tail().

That would avoid the entire WQ flag muckery.
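
For reference, autoremove_wake_function() in kernel/sched/wait.c is
roughly:

int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync,
			     void *key)
{
	int ret = default_wake_function(wait, mode, sync, key);

	if (ret)
		list_del_init(&wait->task_list);
	return ret;
}

so the rotate variant would do a list_move_tail() in place of the
list_del_init().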


* Re: [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-10  9:03       ` Peter Zijlstra
@ 2015-02-10 15:59         ` Jason Baron
  2015-02-10 16:11           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-02-10 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel

On 02/10/2015 04:03 AM, Peter Zijlstra wrote:
> [...]
> After writing this email, it occurred to me that you could probably do
> this with a custom wake function.
>
> Where autoremove_wake_function() does a list_del_init() you could do a
> rotate_wake_function() that does list_move_tail().
>
> That would avoid the entire WQ flag muckery.

hmmm...but don't we need the head/tail of the list to add it back too?

Further, we can't just append to tail while walking the list b/c
otherwise it can result in multiple wakeups to the same item. So I could
add to a local list, for example, in __wake_up_common(). And then just
add that to the tail once the list_for_each() finishes.
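
Something like the following, for example (an untested sketch against
__wake_up_common(), reusing the WQ_FLAG_ROUND_ROBIN flag from patch 1):

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key)
{
	wait_queue_t *curr, *next;
	LIST_HEAD(rotated);

	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
		unsigned flags = curr->flags;

		if (curr->func(curr, mode, wake_flags, key) &&
		    (flags & WQ_FLAG_EXCLUSIVE)) {
			/* park woken rr waiters on a local list so we
			 * don't visit them again during this walk */
			if (flags & WQ_FLAG_ROUND_ROBIN)
				list_move_tail(&curr->task_list, &rotated);
			if (!--nr_exclusive)
				break;
		}
	}
	/* requeue the rotated waiters behind all existing waiters */
	list_splice_tail(&rotated, &q->task_list);
}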

In terms of the flag, maybe another option would be to have the
wait_queue_func_t return a 'ROTATE_ME' value instead
of 1, since I think we currently only make use of 0 and 1?

Thanks,

-Jason


* Re: [PATCH 1/2] sched/wait: add round robin wakeup mode
  2015-02-10 15:59         ` Jason Baron
@ 2015-02-10 16:11           ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-02-10 16:11 UTC (permalink / raw)
  To: Jason Baron
  Cc: mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	linux-kernel, linux-fsdevel

On Tue, Feb 10, 2015 at 10:59:01AM -0500, Jason Baron wrote:
> hmmm...but don't we need the head/tail of the list to add it back too?

Ah, good point that ;-)

> Further, we can't just append to tail while walking the list b/c
> otherwise it can result in multiple wakeups to the same item. So I could
> add to a local list, for example, in __wake_up_common(). And then just
> add that to the tail once the list_for_each() finishes.

True; you can do horrible things, but I think that is the safest option
indeed.

> In terms of the flag, maybe another option would be to have the
> wait_queue_func_t return a 'ROTATE_ME' value instead
> of 1, since I think we currently only make use of 0 and 1?

Let's stick with the flag then.


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-10  4:49           ` Eric Wong
@ 2015-02-10 19:16             ` Jason Baron
  2015-02-10 19:32               ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Baron @ 2015-02-10 19:16 UTC (permalink / raw)
  To: Eric Wong
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

On 02/09/2015 11:49 PM, Eric Wong wrote:
> Jason Baron <jbaron@akamai.com> wrote:
>> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
>>> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@akamai.com> wrote:
>>>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>>>> Epoll file descriptors that are added to a shared wakeup source are always
>>>>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>>>>> fds attached to a shared wakeup source they are all woken up. This can
>>>>>> lead to excessive cpu usage and uneven load distribution.
>>>>>>
>>>>>> This patch introduces two new 'events' flags that are intended to be used
>>>>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
>>>>>> source in an exclusive manner, such that the minimum number of threads are
>>>>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>>>>>> be added to the 'events' flags as well, such that we round robin through the
>>>>>> set of waiting threads.
>>>>>>
>>>>>> An implementation note: in the epoll wakeup routine,
>>>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1 (a successful
>>>>>> wakeup) only when there are current waiters. The idea is to use this
>>>>>> additional heuristic in order to minimize wakeup latencies.
>>>>> I don't understand what this is intended to do.
>>>>>
>>>>> If an event has EPOLLONESHOT, then only one thread should be woken regardless, right?  If not, isn't that just a bug that should be fixed?
>>>>>
>>>> hmm...so with EPOLLONESHOT you basically get notified once about an event. If I have multiple epoll fds (say one per thread) attached to a single source with EPOLLONESHOT, then all threads will potentially get woken up once per event. Then I would have to re-arm all of them. So I don't think this addresses this particular use case...what I am trying to avoid is the mass wakeup, or thundering herd, for a shared event source.
>>> Now I understand.  Why are you using multiple epollfds?
>>>
>>> --Andy
>> So using multiple epollfds is really a way to partition the set of
>> events. Otherwise, I have all the threads contending on all the events
>> that are being generated. So I'm not sure that is scalable.
> I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
> maxevents value is sufficient for you.  All events would be shared, so
> they can migrate between threads(*).  Each thread takes a largish set of
> events on every epoll_wait call and doesn't call epoll_wait again until
> it's done with the whole set it got.
>
> You'll hit more contention on EPOLL_CTL_MOD with shared events and a
> single epoll, but I think it's a better goal to make that lock-free.

It's not just EPOLL_CTL_MOD; there's also going to be contention on
epoll add and remove, since there is only one epoll fd in this case. I'm
also concerned about the balancing of the workload among threads in the
single-queue case. I think it's quite reasonable to have user-space
partition the set of events among threads as it sees fit, using multiple
epoll fds.

However, this multiple-epoll-fd scheme currently does not handle events
from a shared event source well. As I mentioned, there is a thundering
herd wakeup in this case, and the wakeups are unbalanced. In fact, we
have an application that currently does EPOLL_CTL_DELs followed by
EPOLL_CTL_ADDs periodically against a shared wakeup source in order to
re-balance the wakeup queues. This solves the balancing issue to an
extent, but not the thundering herd. I'd like to move this logic down
into the kernel with this patch set.
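
For reference, a sketch of the intended usage from user space (the flag
values, 1 << 28 and 1 << 27, are the provisional ones used for testing
in this thread and are not in any released uapi header; listen_fd is
assumed to be an already-bound listen socket):

	#include <sys/epoll.h>

	#define EPOLLEXCLUSIVE (1 << 28)
	#define EPOLLROUNDROBIN (1 << 27)

	/* per worker thread: a private epoll fd sharing one listen fd */
	int epfd = epoll_create1(0);
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLEXCLUSIVE | EPOLLROUNDROBIN,
		.data.fd = listen_fd,
	};
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);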

> (*) Too large a maxevents will lead to head-of-line blocking, but from
> what I'm inferring, you already risk that with multiple epollfds and
> separate threads working on them.
>
> Do you have a userland use case to share?

I've been trying to describe the use case, maybe I haven't been doing a good
job :(

>> In the use-case I'm trying to describe, I've partitioned a large set
>> of the events, but there may still be some event sources that we wish
>> to share among all of the threads (or even subsets of them), so as not
>> to overload any one in particular.
>  
>> More specifically, in the case of a single listen socket, it's natural
>> to call accept() on the thread that has been woken up, but without
>> doing round robin you quickly end up with a very unbalanced load, and
>> in addition you waste a lot of cpu doing unnecessary wakeups. There are
>> other approaches to solving this, specifically SO_REUSEPORT, which
>> creates a separate socket per thread and gets one back to the
>> separately partitioned events case described previously. However,
>> SO_REUSEPORT, I believe, is very specific to tcp/udp, and in addition
>> it does not have knowledge of the threads that are actively waiting as
>> the epoll code does.
> Did you try my suggestion of using a dedicated thread (or thread pool)
> which does nothing but loop on accept() + EPOLL_CTL_ADD?
>
> Those dedicated threads could do their own round-robin in userland to pick
> a different epollfd to call EPOLL_CTL_ADD on.

Thanks for your suggestion! I'm not actively working on the user-space
code here, but I will pass it along.

I would prefer, though, not to have to context-switch the 'accept'
thread on and off the cpu every time there is a new connection. The
approach suggested here essentially moves this dedicated thread (or
threads) down into the kernel and avoids creating these threads
entirely.
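
For comparison, a sketch of that userland alternative (epfds[] and
nworkers are assumed to be set up elsewhere, one epoll fd per worker):

	/* dedicated accept thread: round robin new fds across workers */
	for (unsigned long i = 0; ; i++) {
		int cfd = accept(listen_fd, NULL, NULL);

		if (cfd < 0)
			continue;
		struct epoll_event ev = { .events = EPOLLIN, .data.fd = cfd };
		epoll_ctl(epfds[i % nworkers], EPOLL_CTL_ADD, cfd, &ev);
	}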

Thanks,

-Jason


* Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
  2015-02-10 19:16             ` Jason Baron
@ 2015-02-10 19:32               ` Eric Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2015-02-10 19:32 UTC (permalink / raw)
  To: Jason Baron
  Cc: Andy Lutomirski, Linux API, Peter Zijlstra, Ingo Molnar, Al Viro,
	Andrew Morton, Davide Libenzi, Michael Kerrisk-manpages,
	linux-kernel@vger.kernel.org, Linux FS Devel

Jason Baron <jbaron@akamai.com> wrote:
> On 02/09/2015 11:49 PM, Eric Wong wrote:
> > Do you have a userland use case to share?
> 
> I've been trying to describe the use case, maybe I haven't been doing a good
> job :(

Sorry, I meant if you had any public code.

Anyways, I've restarted work on another project which I'll hopefully be
able to share in a few weeks; it might be a good public candidate for
epoll performance testing.

> > Did you try my suggestion of using a dedicated thread (or thread pool)
> > which does nothing but loop on accept() + EPOLL_CTL_ADD?
> >
> > Those dedicated threads could do their own round-robin in userland to pick
> > a different epollfd to call EPOLL_CTL_ADD on.
> 
> Thanks for your suggestion! I'm not actively working on the user-space
> code here, but I will pass it along.
> 
> I would prefer, though, not to have to context-switch the 'accept'
> thread on and off the cpu every time there is a new connection. The
> approach suggested here essentially moves this dedicated thread (or
> threads) down into the kernel and avoids creating these threads
> entirely.

For cmogstored, I stopped using TCP_DEFER_ACCEPT when using the
dedicated thread.  This approach offloads to epoll and ends up giving
behavior similar to the effectively infinite TCP_DEFER_ACCEPT timeout
in Linux <= 2.6.31.
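
For reference, TCP_DEFER_ACCEPT takes a timeout in seconds (the value
below is arbitrary); per the note above, the timeout only became
effective after 2.6.31:

	int secs = 30;
	setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
		   &secs, sizeof(secs));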


Thread overview: 18+ messages
2015-02-09 20:05 [PATCH 0/2] Add epoll round robin wakeup mode Jason Baron
2015-02-09 20:05 ` [PATCH 1/2] sched/wait: add " Jason Baron
2015-02-09 20:26   ` Michael Kerrisk
2015-02-09 21:50   ` Peter Zijlstra
2015-02-10  4:06     ` Jason Baron
2015-02-10  9:03       ` Peter Zijlstra
2015-02-10 15:59         ` Jason Baron
2015-02-10 16:11           ` Peter Zijlstra
2015-02-09 20:06 ` [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN Jason Baron
2015-02-09 20:18   ` Andy Lutomirski
2015-02-09 21:32     ` Jason Baron
2015-02-09 22:45       ` Andy Lutomirski
2015-02-10  3:59         ` Jason Baron
2015-02-10  4:49           ` Eric Wong
2015-02-10 19:16             ` Jason Baron
2015-02-10 19:32               ` Eric Wong
2015-02-09 20:27   ` Michael Kerrisk
2015-02-09 20:25 ` [PATCH 0/2] Add epoll round robin wakeup mode Michael Kerrisk
