From: Leonard Crestez <cdleonard@gmail.com>
To: Eric Dumazet <edumazet@google.com>,
	Matt Mathis <mattmathis@google.com>,
	Neal Cardwell <ncardwell@google.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Willem de Bruijn <willemb@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	David Ahern <dsahern@kernel.org>,
	John Heffner <johnwheffner@gmail.com>,
	Leonard Crestez <lcrestez@drivenets.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	Roopa Prabhu <roopa@cumulusnetworks.com>,
	netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC 1/3] tcp: Consider mtu probing for tcp_xmit_size_goal
Date: Mon, 17 May 2021 16:42:35 +0300	[thread overview]
Message-ID: <e6a3dfab-ccea-1a0c-6fd8-bfca466aefba@gmail.com> (raw)
In-Reply-To: <CANn89i+4x+YLVmPNSSiOEB4isQYussWSLqFb5x+0hQ5hyS4j_A@mail.gmail.com>

On 5/11/21 4:04 PM, Eric Dumazet wrote:
> On Tue, May 11, 2021 at 2:04 PM Leonard Crestez <cdleonard@gmail.com> wrote:
>>
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>>
>> Linux checks for (probe_size + (1 + reorder) * mss_cache) bytes to be
>> available in the send buffer and if that condition is not met it will
>> send anyway using the current MSS. The feature can be made to work by
>> sending very large chunks of data from userspace (for example 128k) but
>> for small writes on fast links tcp mtu probes almost never happen.
> 
> Why should they happen ?
> 
> I am not sure the kernel should perform extra checks just because
> applications are not properly written.

My tests show that an application writing a few KB at a time almost never 
triggers enough MTU probing to reach an MTU of 9200. The reasons for this 
were surprisingly difficult to pin down.

It seems that only writing in very large chunks (for example 160k) makes it 
happen, far more than the size_needed calculated inside tcp_mtu_probe 
(which is about 50k). This seems unreasonable. Ideally Linux should try 
to accumulate enough data for a probe (as the RFC suggests), but at the very 
least it should send probes that fit inside a single userspace write.
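
To put rough numbers on it, here is a tiny standalone model of the 
size_needed formula quoted above (all values are made up for illustration, 
not taken from my test setup):

/* Standalone model of size_needed = probe_size + (1 + reordering) * mss_cache.
 * All values are hypothetical, picked for a connection already probing
 * toward a 9200-byte MTU; nothing here comes from my actual measurements.
 */
#include <stdio.h>

int main(void)
{
	unsigned int probe_size = 9100;	/* midpoint of the current MTU search range */
	unsigned int mss_cache  = 8900;	/* MSS after earlier successful probes */
	unsigned int reordering = 3;	/* default reordering metric */

	unsigned int size_needed = probe_size + (1 + reordering) * mss_cache;

	printf("size_needed = %u bytes (~%u KB)\n", size_needed, size_needed >> 10);
	/* A few-KB write() never queues this much at once, which matches the
	 * observation that small writes rarely trigger a probe. */
	return 0;
}

Once the MSS has grown most of the way toward 9200, size_needed lands in the 
tens of kilobytes, so a small-write sender only crosses it by accident.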

I dug a little deeper and what seems to happen is this:

  * size_needed is ~60k
  * Once the head of the write queue reaches size_needed, tcp_push_one is
    called, which sends everything and ignores MTU probing.
  * size_needed is reached again and tcp_push_pending_frames is called.
    By this point cwnd has shrunk below 11 (due to the previous burst), so
    probing is skipped again in favor of just sending in mss-sized chunks.

This happens repeatedly: a sender-limited application performing periodic 
128k writes will see its MSS stuck below the path MTU.

I don't understand the push_one logic and why it completely skips MTU 
probing; it looks like an optimization that doesn't take RFC4821 into 
account.
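
To make the sequence above concrete, here is a minimal standalone model of 
the gating as I understand it: tcp_write_xmit only considers tcp_mtu_probe 
when push_one == 0, and the probe is skipped again once cwnd is small. This 
is my own simplified paraphrase, not kernel code:

/* Minimal model of the two-step failure described above (not kernel code). */
#include <stdbool.h>
#include <stdio.h>

/* Pretend tcp_mtu_probe(): 1 = probe sent, -1 = send normally. */
static int model_mtu_probe(unsigned int queued, unsigned int size_needed,
			   unsigned int cwnd)
{
	if (cwnd < 11)
		return -1;	/* cwnd too small, skip probing */
	if (queued < size_needed)
		return -1;	/* not enough data queued */
	return 1;
}

static void model_write_xmit(bool push_one, unsigned int queued,
			     unsigned int size_needed, unsigned int cwnd)
{
	/* Probing is only even attempted on the !push_one path. */
	if (!push_one && model_mtu_probe(queued, size_needed, cwnd) > 0) {
		printf("probe sent\n");
		return;
	}
	printf("sent in mss-sized chunks (push_one=%d, cwnd=%u)\n",
	       push_one, cwnd);
}

int main(void)
{
	/* 1) The queue fills up to size_needed, but tcp_push_one fires first. */
	model_write_xmit(true, 60000, 60000, 20);
	/* 2) Pending frames are pushed later, but cwnd has already shrunk. */
	model_write_xmit(false, 60000, 60000, 8);
	return 0;
}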

>> This patch tries to take mtu probe into account in tcp_xmit_size_goal, a
>> function which otherwise attempts to accumulate a packet suitable for
>> TSO. No delays are introduced beyond existing autocork heuristics.
> 
> 
> MTU probing should not be attempted for every write().
> This belongs to some kind of slow path, once in a while.

MTU probing is only attempted every 10 minutes, but once a probe is 
pending it does have a slight impact on every write. This is already the 
case today: tcp_write_xmit calls tcp_mtu_probe almost every time.

I had an idea for reducing the overhead of the size_needed check, but it 
turns out I was indeed mistaken about what tcp_xmit_size_goal does: I 
thought it returned roughly one MSS when all GSO features are disabled, but 
this is not so.

>>   static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>>                                         int large_allowed)
>>   {
>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct tcp_sock *tp = tcp_sk(sk);
>>          u32 new_size_goal, size_goal;
>>
>>          if (!large_allowed)
>>                  return mss_now;
>> @@ -932,11 +933,19 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>>                  tp->gso_segs = min_t(u16, new_size_goal / mss_now,
>>                                       sk->sk_gso_max_segs);
>>                  size_goal = tp->gso_segs * mss_now;
>>          }
>>
>> -       return max(size_goal, mss_now);
>> +       size_goal = max(size_goal, mss_now);
>> +
>> +       if (unlikely(icsk->icsk_mtup.wait_data)) {
>> +               int mtu_probe_size_needed = tcp_mtu_probe_size_needed(sk, NULL);
>> +               if (mtu_probe_size_needed > 0)
>> +                       size_goal = max(size_goal, (u32)mtu_probe_size_needed);
>> +       }
> 
> 
> I think you are mistaken.
> 
> This function usually returns 64KB depending on MSS.
>   Have you really tested this part ?

I assumed that with all GSO features disabled this function returns one 
MSS, but this is not true. My patch had a positive effect only because I 
made tcp_mtu_probe return 0 instead of -1 when not enough data is 
queued.
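
For my own notes, here is the size_goal arithmetic from the 
tcp_xmit_size_goal hunk quoted above as a standalone model (the header 
reserve and the defaults are guesses on my part, purely illustrative):

/* Standalone model of the size_goal arithmetic (not kernel code). With
 * typical values it comes out just under 64 KB rather than one MSS, which
 * matches the "usually returns 64KB" point above. The header reserve is a
 * stand-in for MAX_TCP_HEADER.
 */
#include <stdio.h>

int main(void)
{
	unsigned int mss_now = 1448;		/* typical MSS on a 1500 MTU path */
	unsigned int gso_max_size = 65536;	/* assumed sk_gso_max_size default */
	unsigned int header_reserve = 320;	/* guessed MAX_TCP_HEADER */

	unsigned int new_size_goal = gso_max_size - 1 - header_reserve;
	unsigned int gso_segs = new_size_goal / mss_now;
	unsigned int size_goal = gso_segs * mss_now;

	printf("size_goal = %u bytes (%u segments of %u)\n",
	       size_goal, gso_segs, mss_now);
	return 0;
}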

I don't fully understand the implications of returning 0 there, though. If 
tcp_mtu_probe returns zero, what guarantee is there that the data will 
eventually be sent even if no further userspace writes happen?

I'd welcome any suggestions.

--
Regards,
Leonard

