From: Mina Almasry <almasrymina@google.com>
To: Pavel Begunkov <asml.silence@gmail.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org,
linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
sparclinux@vger.kernel.org, linux-renesas-soc@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Donald Hunter" <donald.hunter@gmail.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Richard Henderson" <richard.henderson@linaro.org>,
"Ivan Kokshaysky" <ink@jurassic.park.msu.ru>,
"Matt Turner" <mattst88@gmail.com>,
"Thomas Bogendoerfer" <tsbogend@alpha.franken.de>,
"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
"Helge Deller" <deller@gmx.de>,
"Andreas Larsson" <andreas@gaisler.com>,
"Sergey Shtylyov" <s.shtylyov@omp.ru>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Masami Hiramatsu" <mhiramat@kernel.org>,
"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
"Arnd Bergmann" <arnd@arndb.de>,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Martin KaFai Lau" <martin.lau@linux.dev>,
"Eduard Zingerman" <eddyz87@gmail.com>,
"Song Liu" <song@kernel.org>,
"Yonghong Song" <yonghong.song@linux.dev>,
"John Fastabend" <john.fastabend@gmail.com>,
"KP Singh" <kpsingh@kernel.org>,
"Stanislav Fomichev" <sdf@google.com>,
"Hao Luo" <haoluo@google.com>, "Jiri Olsa" <jolsa@kernel.org>,
"Steffen Klassert" <steffen.klassert@secunet.com>,
"Herbert Xu" <herbert@gondor.apana.org.au>,
"David Ahern" <dsahern@kernel.org>,
"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
"Shuah Khan" <shuah@kernel.org>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Christian König" <christian.koenig@amd.com>,
"Bagas Sanjaya" <bagasdotme@gmail.com>,
"Christoph Hellwig" <hch@infradead.org>,
"Nikolay Aleksandrov" <razor@blackwall.org>,
"David Wei" <dw@davidwei.uk>, "Jason Gunthorpe" <jgg@ziepe.ca>,
"Yunsheng Lin" <linyunsheng@huawei.com>,
"Shailend Chand" <shailend@google.com>,
"Harshitha Ramamurthy" <hramamurthy@google.com>,
"Shakeel Butt" <shakeel.butt@linux.dev>,
"Jeroen de Borst" <jeroendb@google.com>,
"Praveen Kaligineedi" <pkaligineedi@google.com>,
"Willem de Bruijn" <willemb@google.com>,
"Kaiyuan Zhang" <kaiyuanz@google.com>
Subject: Re: [PATCH net-next v12 10/13] tcp: RX path for devmem TCP
Date: Fri, 21 Jun 2024 13:31:29 -0700
Message-ID: <CAHS8izMce36FwLhFB0znHQYmxpe5hmTSXtZA7+b5VsmSJUfhRw@mail.gmail.com>
In-Reply-To: <20a6a727-d9f2-495c-bf75-72c27740dd82@gmail.com>
On Mon, Jun 17, 2024 at 9:36 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 6/13/24 02:35, Mina Almasry wrote:
> >
> > The pages awaiting freeing are stored in the newly added
> > sk->sk_user_frags, and each page passed to userspace is get_page()'d.
> > This reference is dropped once the userspace indicates that it is
> > done reading this page. All pages are released when the socket is
> > destroyed.
>
> One small concern is that if the pool gets destroyed (i.e.
> page_pool_destroy) before sockets holding netiov, page pool will
> semi-busily poll until the sockets die or such and will spam with
> pr_warn(). E.g. when a user drops the nl but leaks data sockets
> and continues with its userspace business. You can probably do
> it in a loop and create dozens of such pending
> page_pool_release_retry().
>
Yes, true, but this is not really an issue with netiovs per se; it's a
quirk of the page_pool in general. If a non-devmem page_pool is
destroyed while there are pages waiting in the receive queues to be
recvmsg'd, the behavior you described happens anyway, AFAIU.

Jakub did some work to improve this. IIRC he disabled the regular
warning, and he reparents the orphan page_pools so they appear in the
stats of his netlink API.

Since this behavior already applies to pages, I did not seek to
improve it while adding devmem support; I just retained it. We could
improve it in a separate patchset, but I do not see this behavior as a
critical issue, especially since the alarming pr_warn has been
removed.
> > +static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> > +			      unsigned int max_frags)
> > +{
> > +	int err, k;
> > +
> > +	if (p->idx < p->max)
> > +		return 0;
> > +
> > +	xa_lock_bh(&sk->sk_user_frags);
> > +
> > +	tcp_xa_pool_commit_locked(sk, p);
> > +
> > +	for (k = 0; k < max_frags; k++) {
> > +		err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> > +				 XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> > +		if (err)
> > +			break;
> > +	}
> > +
> > +	xa_unlock_bh(&sk->sk_user_frags);
> > +
> > +	p->max = k;
> > +	p->idx = 0;
> > +	return k ? 0 : err;
> > +}
>
> Personally, I'd prefer this optimisation to be in a separate patch,
> especially since there is some degree of hackiness to it.
>
>
To be honest, this optimization is very necessary from my POV. We ran
into real production problems due to excessive locking when we used
regular xa_alloc(), and Eric implemented this optimization to resolve
that. I simply squashed the optimization into this upstream series.

If absolutely necessary I can refactor it into a separate patch or
carry the optimization locally, but this seems like a problem everyone
looking to use devmem TCP will rediscover, so it is probably worth just
having here?
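
For illustration, the idea behind the batching can be sketched in
userspace C. This is only a minimal sketch under assumptions, not the
code in the patch: id_space, token_pool, POOL_BATCH and the atomic_flag
spinlock are hypothetical stand-ins for sk->sk_user_frags, struct
tcp_xa_pool and xa_lock_bh(), and the sketch leaves out the commit step
(tcp_xa_pool_commit_locked()) and any handling of reserved but unused
IDs.

```c
#include <assert.h>
#include <stdatomic.h>

#define POOL_BATCH 8 /* hypothetical batch size, not taken from the patch */

/* Hypothetical stand-in for the shared ID space (sk->sk_user_frags). */
struct id_space {
	atomic_flag lock;         /* toy spinlock standing in for xa_lock_bh() */
	unsigned int next_id;     /* next free ID */
	unsigned int limit;       /* IDs >= limit cannot be allocated */
	unsigned long lock_count; /* how many times the lock was taken */
};

/* Per-call cache of reserved IDs, mirroring idx/max in tcp_xa_pool. */
struct token_pool {
	unsigned int tokens[POOL_BATCH];
	unsigned int max; /* number of valid entries in tokens[] */
	unsigned int idx; /* next unused entry */
};

static void id_space_init(struct id_space *s, unsigned int limit)
{
	atomic_flag_clear(&s->lock);
	s->next_id = 0;
	s->limit = limit;
	s->lock_count = 0;
}

/* Reserve up to max_frags IDs with a single lock round trip.
 * Returns 0 if at least one token is available, -1 otherwise.
 */
static int token_pool_refill(struct id_space *s, struct token_pool *p,
			     unsigned int max_frags)
{
	unsigned int k;

	if (p->idx < p->max)
		return 0; /* cached tokens remain, no locking needed */

	if (max_frags > POOL_BATCH)
		max_frags = POOL_BATCH;

	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
	s->lock_count++;
	for (k = 0; k < max_frags && s->next_id < s->limit; k++)
		p->tokens[k] = s->next_id++;
	atomic_flag_clear_explicit(&s->lock, memory_order_release);

	p->max = k;
	p->idx = 0;
	return k ? 0 : -1;
}

/* Hand out one token, refilling the cache only when it runs dry. */
static int token_pool_get(struct id_space *s, struct token_pool *p,
			  unsigned int *token)
{
	if (token_pool_refill(s, p, POOL_BATCH))
		return -1;
	assert(p->idx < p->max);
	*token = p->tokens[p->idx++];
	return 0;
}
```

With a batch of 8, handing out 20 tokens takes the lock 3 times in this
sketch, where a per-token allocation would take it 20 times.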
> > + /* if remaining_len is not satisfied yet, we need to go to the
> > + * next frag in the frag_list to satisfy remaining_len.
> > + */
> > + skb = skb_shinfo(skb)->frag_list ?: skb->next;
> > +
> > + offset = offset - start;
>
> It's an offset into the current skb, isn't it? Wouldn't
> offset = 0; be less confusing?
>
Seems so, AFAICT. Let me try to apply this and see if it trips up any tests.
> > + } while (skb);
> > +
> > + if (remaining_len) {
> > + err = -EFAULT;
> > + goto out;
> > + }
>
> Having data left is not a fault,
I think it is. The caller of tcp_recvmsg_dmabuf() expects all of
remaining_len to be used up; otherwise it messes up the math in
the caller. __skb_datagram_iter(), which is the equivalent of this
function for pages, regards leftover data as a fault and also returns
-EFAULT, AFAICT.
> and to get here you
> need to get an skb with no data left, which shouldn't
> happen. Seems like everything you need is covered by
> the "!sent" check below.
>
I think we can get here if we run out of skbs with data, no?
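
To make the case I have in mind concrete, here is a minimal userspace
sketch, assuming a simplified model rather than the real skb walk:
struct seg and consume_exact() are hypothetical names, and a chain of
segments stands in for the skb and its frag_list. It resets the offset
to 0 after the first segment that contributes data, and it reports
-EFAULT only when the chain runs out before remaining_len is satisfied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one skb's worth of payload. */
struct seg {
	size_t len;
	struct seg *next;
};

/* Consume exactly remaining_len bytes starting at offset into the chain.
 * Returns 0 on success, or -EFAULT if the chain ends first.
 * *consumed reports how many bytes were actually available.
 */
static int consume_exact(const struct seg *seg, size_t offset,
			 size_t remaining_len, size_t *consumed)
{
	assert(consumed != NULL);
	*consumed = 0;

	for (; seg && remaining_len; seg = seg->next) {
		size_t avail, n;

		if (offset >= seg->len) {
			/* The starting point lies beyond this segment. */
			offset -= seg->len;
			continue;
		}

		avail = seg->len - offset;
		n = avail < remaining_len ? avail : remaining_len;
		*consumed += n;
		remaining_len -= n;

		/* Later segments are always read from their start. */
		offset = 0;
	}

	/* Ran out of segments with data still owed to the caller. */
	return remaining_len ? -EFAULT : 0;
}
```

With two segments of 4 and 6 bytes, asking for 5 bytes at offset 2
succeeds, while asking for 20 bytes at offset 0 consumes the 10
available bytes and then returns -EFAULT.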
> > @@ -2503,6 +2504,15 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head)
> > void tcp_v4_destroy_sock(struct sock *sk)
> > {
> > struct tcp_sock *tp = tcp_sk(sk);
> > + __maybe_unused unsigned long index;
> > + __maybe_unused void *netmem;
>
> How about adding a function to get rid of __maybe_unused?
>
> static void sock_release_devmem_frags() {
> #ifdef PP
> unsigned index;
> ...
> #endif /* PP */
> }
>
Will do.
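
Roughly the shape I would expect, as a userspace sketch under
assumptions and not the final patch: CONFIG_DEMO_PAGE_POOL, struct
demo_sock and both function names are hypothetical, and a small array
stands in for sk->sk_user_frags. Keeping the locals inside the #ifdef
within a helper means the caller needs no __maybe_unused annotations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical config switch standing in for the page pool option. */
#define CONFIG_DEMO_PAGE_POOL 1

/* Hypothetical stand-in for the socket and its outstanding frags. */
struct demo_sock {
	void *frags[4];
	size_t nr_frags;
	size_t released; /* counts frags dropped at destroy time */
};

/* All config-dependent locals live inside the helper. */
static void demo_release_devmem_frags(struct demo_sock *sk)
{
#ifdef CONFIG_DEMO_PAGE_POOL
	size_t index;

	for (index = 0; index < sk->nr_frags; index++) {
		sk->frags[index] = NULL; /* drop the reference held for userspace */
		sk->released++;
	}
	sk->nr_frags = 0;
#else
	(void)sk; /* nothing to release without the page pool */
#endif
}

/* The destroy path stays free of #ifdef and unused locals. */
static void demo_destroy_sock(struct demo_sock *sk)
{
	assert(sk != NULL);
	demo_release_devmem_frags(sk);
}
```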
> Also, even though you wire it up for TCP, since ->sk_user_frags
> is in struct sock I'd expect the release to be somewhere in the
> generic sock path like __sk_destruct(), and same for init.
> Perhaps, it's better to leave it for later.
>
--
Thanks,
Mina