From: Pavel Begunkov <asml.silence@gmail.com>
To: Mina Almasry <almasrymina@google.com>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org,
linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
sparclinux@vger.kernel.org, linux-renesas-soc@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org
Cc: "David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Donald Hunter" <donald.hunter@gmail.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Richard Henderson" <richard.henderson@linaro.org>,
"Ivan Kokshaysky" <ink@jurassic.park.msu.ru>,
"Matt Turner" <mattst88@gmail.com>,
"Thomas Bogendoerfer" <tsbogend@alpha.franken.de>,
"James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
"Helge Deller" <deller@gmx.de>,
"Andreas Larsson" <andreas@gaisler.com>,
"Sergey Shtylyov" <s.shtylyov@omp.ru>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Masami Hiramatsu" <mhiramat@kernel.org>,
"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
"Arnd Bergmann" <arnd@arndb.de>,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Martin KaFai Lau" <martin.lau@linux.dev>,
"Eduard Zingerman" <eddyz87@gmail.com>,
"Song Liu" <song@kernel.org>,
"Yonghong Song" <yonghong.song@linux.dev>,
"John Fastabend" <john.fastabend@gmail.com>,
"KP Singh" <kpsingh@kernel.org>,
"Stanislav Fomichev" <sdf@google.com>,
"Hao Luo" <haoluo@google.com>, "Jiri Olsa" <jolsa@kernel.org>,
"Steffen Klassert" <steffen.klassert@secunet.com>,
"Herbert Xu" <herbert@gondor.apana.org.au>,
"David Ahern" <dsahern@kernel.org>,
"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
"Shuah Khan" <shuah@kernel.org>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Christian König" <christian.koenig@amd.com>,
"Bagas Sanjaya" <bagasdotme@gmail.com>,
"Christoph Hellwig" <hch@infradead.org>,
"Nikolay Aleksandrov" <razor@blackwall.org>,
"David Wei" <dw@davidwei.uk>, "Jason Gunthorpe" <jgg@ziepe.ca>,
"Yunsheng Lin" <linyunsheng@huawei.com>,
"Shailend Chand" <shailend@google.com>,
"Harshitha Ramamurthy" <hramamurthy@google.com>,
"Shakeel Butt" <shakeel.butt@linux.dev>,
"Jeroen de Borst" <jeroendb@google.com>,
"Praveen Kaligineedi" <pkaligineedi@google.com>,
"Willem de Bruijn" <willemb@google.com>,
"Kaiyuan Zhang" <kaiyuanz@google.com>
Subject: Re: [PATCH net-next v12 10/13] tcp: RX path for devmem TCP
Date: Mon, 17 Jun 2024 17:36:24 +0100
Message-ID: <20a6a727-d9f2-495c-bf75-72c27740dd82@gmail.com>
In-Reply-To: <20240613013557.1169171-11-almasrymina@google.com>
On 6/13/24 02:35, Mina Almasry wrote:
> In tcp_recvmsg_locked(), detect if the skb being received by the user
> is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> flag - pass it to tcp_recvmsg_devmem() for custom handling.
>
> tcp_recvmsg_devmem() copies any data in the skb header to the linear
> buffer, and returns a cmsg to the user indicating the number of bytes
> returned in the linear buffer.
>
> tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> and returns to the user a cmsg_devmem indicating the location of the
> data in the dmabuf device memory. cmsg_devmem contains this information:
>
> 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> 2. the size of the frag. 'frag_size'.
> 3. an opaque token 'frag_token' to return to the kernel when the buffer
> is to be released.
>
> The pages awaiting freeing are stored in the newly added
> sk->sk_user_frags, and each page passed to userspace is get_page()'d.
> This reference is dropped once the userspace indicates that it is
> done reading this page. All pages are released when the socket is
> destroyed.
One small concern: if the pool gets destroyed (i.e.
page_pool_destroy()) before the sockets holding netiovs are gone,
the page pool will semi-busily poll until those sockets die, and
will spam pr_warn() in the meantime. E.g. when a user drops the
netlink socket but leaks data sockets and continues with its
userspace business. You could probably do that in a loop and
create dozens of such pending page_pool_release_retry() works.
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
...
> +static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p,
> + unsigned int max_frags)
> +{
> + int err, k;
> +
> + if (p->idx < p->max)
> + return 0;
> +
> + xa_lock_bh(&sk->sk_user_frags);
> +
> + tcp_xa_pool_commit_locked(sk, p);
> +
> + for (k = 0; k < max_frags; k++) {
> + err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k],
> + XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL);
> + if (err)
> + break;
> + }
> +
> + xa_unlock_bh(&sk->sk_user_frags);
> +
> + p->max = k;
> + p->idx = 0;
> + return k ? 0 : err;
> +}
Personally, I'd prefer this optimisation to be in a separate patch,
especially since there is some degree of hackiness to it.
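For readers following along, the batching that the quoted tcp_xa_pool_refill() performs can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct token_table, struct token_pool, POOL_BATCH, pool_refill() and pool_take() are hypothetical names, a counter stands in for the xa_lock_bh()/xa_unlock_bh() pair, and sequential integers stand in for __xa_alloc(). It shows just the shape of the idea: reserve several tokens in one locked section, then hand them out without retaking the lock.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BATCH 4              /* hypothetical batch size */

/* Hypothetical stand-in for the socket-wide token table. */
struct token_table {
	unsigned int next_token;  /* next token id to hand out */
	unsigned int lock_rounds; /* how many times the "lock" was taken */
};

/* Hypothetical stand-in for the per-call cache (tcp_xa_pool in the patch). */
struct token_pool {
	unsigned int tokens[POOL_BATCH];
	size_t max;               /* tokens reserved in the current batch */
	size_t idx;               /* next unused token in the batch */
};

/* Reserve up to 'want' tokens in one locked section.
 * Does nothing while the current batch still has unused tokens.
 */
static int pool_refill(struct token_table *t, struct token_pool *p, size_t want)
{
	size_t k;

	if (p->idx < p->max)
		return 0;
	if (want > POOL_BATCH)
		want = POOL_BATCH;

	t->lock_rounds++;         /* models taking and dropping the lock once */
	for (k = 0; k < want; k++)
		p->tokens[k] = t->next_token++;

	p->max = k;
	p->idx = 0;
	return k ? 0 : -1;        /* fail only if nothing could be reserved */
}

/* Hand out the next reserved token; the caller must have refilled first. */
static unsigned int pool_take(struct token_pool *p)
{
	return p->tokens[p->idx++];
}
```

With a batch of three, three consecutive pool_take() calls cost a single locked section, which is the saving the optimisation is after.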
> +
> +/* On error, returns the -errno. On success, returns number of bytes sent to the
> + * user. May not consume all of @remaining_len.
> + */
> +static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> + unsigned int offset, struct msghdr *msg,
> + int remaining_len)
> +{
> + struct dmabuf_cmsg dmabuf_cmsg = { 0 };
> + struct tcp_xa_pool tcp_xa_pool;
> + unsigned int start;
> + int i, copy, n;
> + int sent = 0;
> + int err = 0;
> +
> + tcp_xa_pool.max = 0;
> + tcp_xa_pool.idx = 0;
> + do {
> + start = skb_headlen(skb);
> +
> + if (skb_frags_readable(skb)) {
> + err = -ENODEV;
> + goto out;
> + }
> +
> + /* Copy header. */
> + copy = start - offset;
> + if (copy > 0) {
> + copy = min(copy, remaining_len);
> +
> + n = copy_to_iter(skb->data + offset, copy,
> + &msg->msg_iter);
> + if (n != copy) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + offset += copy;
> + remaining_len -= copy;
> +
> + /* First a dmabuf_cmsg for # bytes copied to user
> + * buffer.
> + */
> + memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg));
> + dmabuf_cmsg.frag_size = copy;
> + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR,
> + sizeof(dmabuf_cmsg), &dmabuf_cmsg);
> + if (err || msg->msg_flags & MSG_CTRUNC) {
> + msg->msg_flags &= ~MSG_CTRUNC;
> + if (!err)
> + err = -ETOOSMALL;
> + goto out;
> + }
> +
> + sent += copy;
> +
> + if (remaining_len == 0)
> + goto out;
> + }
> +
> + /* after that, send information of dmabuf pages through a
> + * sequence of cmsg
> + */
> + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> + skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> + struct net_iov *niov;
> + u64 frag_offset;
> + int end;
> +
> + /* !skb_frags_readable() should indicate that ALL the
> + * frags in this skb are dmabuf net_iovs. We're checking
> + * for that flag above, but also check individual frags
> + * here. If the tcp stack is not setting
> + * skb_frags_readable() correctly, we still don't want
> + * to crash here.
> + */
> + if (!skb_frag_net_iov(frag)) {
> + net_err_ratelimited("Found non-dmabuf skb with net_iov");
> + err = -ENODEV;
> + goto out;
> + }
> +
> + niov = skb_frag_net_iov(frag);
> + end = start + skb_frag_size(frag);
> + copy = end - offset;
> +
> + if (copy > 0) {
> + copy = min(copy, remaining_len);
> +
> + frag_offset = net_iov_virtual_addr(niov) +
> + skb_frag_off(frag) + offset -
> + start;
> + dmabuf_cmsg.frag_offset = frag_offset;
> + dmabuf_cmsg.frag_size = copy;
> + err = tcp_xa_pool_refill(sk, &tcp_xa_pool,
> + skb_shinfo(skb)->nr_frags - i);
> + if (err)
> + goto out;
> +
> + /* Will perform the exchange later */
> + dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
> + dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov);
> +
> + offset += copy;
> + remaining_len -= copy;
> +
> + err = put_cmsg(msg, SOL_SOCKET,
> + SO_DEVMEM_DMABUF,
> + sizeof(dmabuf_cmsg),
> + &dmabuf_cmsg);
> + if (err || msg->msg_flags & MSG_CTRUNC) {
> + msg->msg_flags &= ~MSG_CTRUNC;
> + if (!err)
> + err = -ETOOSMALL;
> + goto out;
> + }
> +
> + atomic_long_inc(&niov->pp_ref_count);
> + tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag);
> +
> + sent += copy;
> +
> + if (remaining_len == 0)
> + goto out;
> + }
> + start = end;
> + }
> +
> + tcp_xa_pool_commit(sk, &tcp_xa_pool);
> + if (!remaining_len)
> + goto out;
> +
> + /* if remaining_len is not satisfied yet, we need to go to the
> + * next frag in the frag_list to satisfy remaining_len.
> + */
> + skb = skb_shinfo(skb)->frag_list ?: skb->next;
> +
> + offset = offset - start;
It's an offset into the current skb, isn't it? Wouldn't
offset = 0; be less confusing?
> + } while (skb);
> +
> + if (remaining_len) {
> + err = -EFAULT;
> + goto out;
> + }
Having data left is not a fault, and to get here you
need to get an skb with no data left, which shouldn't
happen. Seems like everything you need is covered by
the "!sent" check below.
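As an illustration of the "!sent" check mentioned here, the return convention at the quoted out: label can be reduced to a few lines of userspace C. This is a sketch, not the kernel code: recv_result() is a hypothetical name, and the point is only that an error reaches the caller when no bytes were delivered, while any partial progress is reported as a byte count.

```c
#include <assert.h>
#include <errno.h>

/* Models the tail of the quoted function: bytes already delivered
 * take precedence over a later error.
 */
static int recv_result(int sent, int err)
{
	if (!sent)
		sent = err;
	return sent;
}
```

For example, recv_result(0, -EFAULT) yields -EFAULT, while recv_result(100, -EFAULT) yields 100 and the error is dropped for this call.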
> +
> +out:
> + tcp_xa_pool_commit(sk, &tcp_xa_pool);
> + if (!sent)
> + sent = err;
> +
> + return sent;
> +}
> +
...
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index de0c8f43448ab..57e48b75ac02a 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -79,6 +79,7 @@
> #include <linux/seq_file.h>
> #include <linux/inetdevice.h>
> #include <linux/btf_ids.h>
> +#include <linux/skbuff_ref.h>
>
> #include <crypto/hash.h>
> #include <linux/scatterlist.h>
> @@ -2503,6 +2504,15 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head)
> void tcp_v4_destroy_sock(struct sock *sk)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> + __maybe_unused unsigned long index;
> + __maybe_unused void *netmem;
How about adding a function to get rid of __maybe_unused?
static void sock_release_devmem_frags() {
#ifdef PP
unsigned index;
...
#endif /* PP */
}
Also, even though you wire it up for TCP, since ->sk_user_frags
is in struct sock I'd expect the release to be somewhere in the
generic sock path like __sk_destruct(), and same for init.
Perhaps it's better to leave it for later.
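A minimal userspace sketch of the helper idea, for illustration only. All names here are hypothetical: DEMO_HAVE_POOL stands in for the config option, struct demo_sock for the socket with its frag table, and clearing a slot stands in for dropping a page reference. The point is that the loop variable is declared inside the #ifdef, in the helper, so the caller needs no __maybe_unused locals.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HAVE_POOL 1          /* hypothetical stand-in for the config option */
#define DEMO_MAX_FRAGS 8          /* hypothetical table size */

/* Hypothetical stand-in for a socket and its table of user-held frags. */
struct demo_sock {
	void *user_frags[DEMO_MAX_FRAGS];
	size_t put_calls;         /* counts released references */
};

static void demo_release_frags(struct demo_sock *sk)
{
#ifdef DEMO_HAVE_POOL
	size_t index;             /* declared only where it is used */

	for (index = 0; index < DEMO_MAX_FRAGS; index++) {
		if (!sk->user_frags[index])
			continue;
		sk->user_frags[index] = NULL;  /* models dropping the reference */
		sk->put_calls++;
	}
#else
	(void)sk;                 /* nothing to release without the pool */
#endif
}

/* The destroy path stays free of conditionally-used locals. */
static void demo_destroy_sock(struct demo_sock *sk)
{
	demo_release_frags(sk);
	/* the rest of the teardown would follow here */
}
```

The same helper could equally be called from a generic destruct path rather than a TCP-specific one, since it only touches the socket's own table.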
> +
> +#ifdef CONFIG_PAGE_POOL
> + xa_for_each(&sk->sk_user_frags, index, netmem)
> + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem));
> +#endif
> +
> + xa_destroy(&sk->sk_user_frags);
>
> trace_tcp_destroy_sock(sk);
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index bc67f6b9efae4..5d563312efe14 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -624,6 +624,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>
> __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
>
> + xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
> +
> return newsk;
> }
> EXPORT_SYMBOL(tcp_create_openreq_child);
--
Pavel Begunkov