All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [RFC Patch] tcp: make icsk_retransmit_timer pinned
@ 2019-11-01 22:16 Cong Wang
  2019-11-01 22:30 ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Cong Wang @ 2019-11-01 22:16 UTC (permalink / raw
  To: netdev; +Cc: Cong Wang, Thomas Gleixner, Eric Dumazet

While investigating the spinlock contention on resetting TCP
retransmit timer:

  61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
   ...
    - 58.83% tcp_v4_rcv
      - 58.80% tcp_v4_do_rcv
         - 58.80% tcp_rcv_established
            - 52.88% __tcp_push_pending_frames
               - 52.88% tcp_write_xmit
                  - 28.16% tcp_event_new_data_sent
                     - 28.15% sk_reset_timer
                        + mod_timer
                  - 24.68% tcp_schedule_loss_probe
                     - 24.68% sk_reset_timer
                        + 24.68% mod_timer

it turns out to be a serious timer migration issue. After collecting timer_start
trace events for tcp_write_timer, it shows more than 77% times this timer got
migrated to a difference CPU:

	$ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
	1303826
	$ wc -l tcp_timer_trace.txt
	1681068 tcp_timer_trace.txt
	$ python
	Python 2.7.5 (default, Jul 11 2019, 17:13:53)
	[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> 1303826 / 1681068.0
	0.7755938486723916

And all of those migration happened during an idle CPU serving a network RX
softirq.  So, the logic of testing CPU idleness in idle_cpu() is false positive.
I don't know whether we should relax it for this scenario particuarly, something
like:

-	if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
+	if ((!idle_cpu(cpu) || in_serving_softirq()) &&
+	    housekeeping_cpu(cpu, HK_FLAG_TIMER))
 		return cpu;

(There could be better way than in_serving_softirq() to measure the idleness,
of course.)

Or simply just make the TCP retransmit timer pinned. At least this approach
has the minimum impact.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/inet_connection_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index eb30fc1770de..de5510ddb1c8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
+	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
 	timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
 	timer_setup(&sk->sk_timer, keepalive_handler, 0);
 	icsk->icsk_pending = icsk->icsk_ack.pending = 0;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned
  2019-11-01 22:16 [RFC Patch] tcp: make icsk_retransmit_timer pinned Cong Wang
@ 2019-11-01 22:30 ` Eric Dumazet
  2019-11-01 22:43   ` Cong Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2019-11-01 22:30 UTC (permalink / raw
  To: Cong Wang; +Cc: netdev, Thomas Gleixner

On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> While investigating the spinlock contention on resetting TCP
> retransmit timer:
>
>   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
>    ...
>     - 58.83% tcp_v4_rcv
>       - 58.80% tcp_v4_do_rcv
>          - 58.80% tcp_rcv_established
>             - 52.88% __tcp_push_pending_frames
>                - 52.88% tcp_write_xmit
>                   - 28.16% tcp_event_new_data_sent
>                      - 28.15% sk_reset_timer
>                         + mod_timer
>                   - 24.68% tcp_schedule_loss_probe
>                      - 24.68% sk_reset_timer
>                         + 24.68% mod_timer
>
> it turns out to be a serious timer migration issue. After collecting timer_start
> trace events for tcp_write_timer, it shows more than 77% times this timer got
> migrated to a difference CPU:
>
>         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
>         1303826
>         $ wc -l tcp_timer_trace.txt
>         1681068 tcp_timer_trace.txt
>         $ python
>         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
>         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
>         Type "help", "copyright", "credits" or "license" for more information.
>         >>> 1303826 / 1681068.0
>         0.7755938486723916
>
> And all of those migration happened during an idle CPU serving a network RX
> softirq.  So, the logic of testing CPU idleness in idle_cpu() is false positive.
> I don't know whether we should relax it for this scenario particuarly, something
> like:
>
> -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
>                 return cpu;
>
> (There could be better way than in_serving_softirq() to measure the idleness,
> of course.)
>
> Or simply just make the TCP retransmit timer pinned. At least this approach
> has the minimum impact.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/ipv4/inet_connection_sock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index eb30fc1770de..de5510ddb1c8 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
>         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
>         timer_setup(&sk->sk_timer, keepalive_handler, 0);
>         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> --
> 2.21.0
>

Now you are talking ...

We have disabled /proc/sys/kernel/timer_migration on all Google servers,
because this made no sense on servers really, and not only for tcp timers.

This has been a hot topic years ago ( random example :
https://lore.kernel.org/patchwork/patch/947052/ )

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned
  2019-11-01 22:30 ` Eric Dumazet
@ 2019-11-01 22:43   ` Cong Wang
  2019-11-01 23:44     ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Cong Wang @ 2019-11-01 22:43 UTC (permalink / raw
  To: Eric Dumazet; +Cc: netdev, Thomas Gleixner

On Fri, Nov 1, 2019 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > While investigating the spinlock contention on resetting TCP
> > retransmit timer:
> >
> >   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> >    ...
> >     - 58.83% tcp_v4_rcv
> >       - 58.80% tcp_v4_do_rcv
> >          - 58.80% tcp_rcv_established
> >             - 52.88% __tcp_push_pending_frames
> >                - 52.88% tcp_write_xmit
> >                   - 28.16% tcp_event_new_data_sent
> >                      - 28.15% sk_reset_timer
> >                         + mod_timer
> >                   - 24.68% tcp_schedule_loss_probe
> >                      - 24.68% sk_reset_timer
> >                         + 24.68% mod_timer
> >
> > it turns out to be a serious timer migration issue. After collecting timer_start
> > trace events for tcp_write_timer, it shows more than 77% times this timer got
> > migrated to a difference CPU:
> >
> >         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
> >         1303826
> >         $ wc -l tcp_timer_trace.txt
> >         1681068 tcp_timer_trace.txt
> >         $ python
> >         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
> >         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> >         Type "help", "copyright", "credits" or "license" for more information.
> >         >>> 1303826 / 1681068.0
> >         0.7755938486723916
> >
> > And all of those migration happened during an idle CPU serving a network RX
> > softirq.  So, the logic of testing CPU idleness in idle_cpu() is false positive.
> > I don't know whether we should relax it for this scenario particuarly, something
> > like:
> >
> > -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> > +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> > +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
> >                 return cpu;
> >
> > (There could be better way than in_serving_softirq() to measure the idleness,
> > of course.)
> >
> > Or simply just make the TCP retransmit timer pinned. At least this approach
> > has the minimum impact.
> >
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >  net/ipv4/inet_connection_sock.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index eb30fc1770de..de5510ddb1c8 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
> >  {
> >         struct inet_connection_sock *icsk = inet_csk(sk);
> >
> > -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> > +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
> >         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
> >         timer_setup(&sk->sk_timer, keepalive_handler, 0);
> >         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> > --
> > 2.21.0
> >
>
> Now you are talking ...
>
> We have disabled /proc/sys/kernel/timer_migration on all Google servers,
> because this made no sense on servers really, and not only for tcp timers.

So let's make the sysctl timer_migration disabled by default? It is
always how we want to trade off CPU power saving with latency.

Did you measure how much CPU power it increases after disabling it?
If not much, we can certainly make it disabled by default.

>
> This has been a hot topic years ago ( random example :
> https://lore.kernel.org/patchwork/patch/947052/ )

Yeah, this specific patch has been merged for a long time,
but I know you are not just talking about this single one. :)

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned
  2019-11-01 22:43   ` Cong Wang
@ 2019-11-01 23:44     ` Eric Dumazet
  2019-11-01 23:51       ` Cong Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2019-11-01 23:44 UTC (permalink / raw
  To: Cong Wang; +Cc: netdev, Thomas Gleixner

On Fri, Nov 1, 2019 at 3:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> So let's make the sysctl timer_migration disabled by default? It is
> always how we want to trade off CPU power saving with latency.

At least timer_migration sysctl deserves a proper documentation.
I do not see any.

And maybe automatically disable it for hosts with more than 64
possible cpus would make sense,
but that  is only a suggestion. I won't fight this battle.

(All sysctls can be set by admins, we do not need to change the default)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned
  2019-11-01 23:44     ` Eric Dumazet
@ 2019-11-01 23:51       ` Cong Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Cong Wang @ 2019-11-01 23:51 UTC (permalink / raw
  To: Eric Dumazet; +Cc: netdev, Thomas Gleixner

On Fri, Nov 1, 2019 at 4:44 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > So let's make the sysctl timer_migration disabled by default? It is
> > always how we want to trade off CPU power saving with latency.
>
> At least timer_migration sysctl deserves a proper documentation.
> I do not see any.
>
> And maybe automatically disable it for hosts with more than 64
> possible cpus would make sense,
> but that  is only a suggestion. I won't fight this battle.
>
> (All sysctls can be set by admins, we do not need to change the default)

Make people rely on the default value, as obviously not everyone
is able to revise all of the sysctl's.

Anyway, I read it as the patch makes TCP retransmit timer pinned
not interesting, therefore let's discard it. We can at least carry it
by ourselves, so not a big deal.

Thanks!

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-11-01 23:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-11-01 22:16 [RFC Patch] tcp: make icsk_retransmit_timer pinned Cong Wang
2019-11-01 22:30 ` Eric Dumazet
2019-11-01 22:43   ` Cong Wang
2019-11-01 23:44     ` Eric Dumazet
2019-11-01 23:51       ` Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.