From: Dragos Tatulea <dtatulea@nvidia.com>
To: "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Dragos Tatulea <dtatulea@nvidia.com>,
Mina Almasry <almasrymina@google.com>,
Jacob Keller <jacob.e.keller@intel.com>
Cc: Jianbo Liu <jianbol@nvidia.com>, <netdev@vger.kernel.org>,
<linux-kernel@vger.kernel.org>
Subject: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref
Date: Wed, 24 Apr 2024 19:56:46 +0300
Message-ID: <20240424165646.1625690-2-dtatulea@nvidia.com>
The patch referenced in the Fixes tag introduced the skb_frag_unref API.
This API still has a dark corner: when skb->pp_recycle is false and the
fragment being unreferenced is backed by a page_pool page, skb_frag_unref
can leak the page out of the page_pool, as in the following example:
BUG: Bad page state in process swapper/4 pfn:103423
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x103423000 pfn:0x103423
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffffffff()
raw: 0200000000000000 dead000000000040 ffff888106f38000 0000000000000000
raw: 0000000103423000 0000000000000041 00000000ffffffff 0000000000000000
page dumped because: page_pool leak
Modules linked in: act_mirred act_csum act_pedit act_gact cls_flower
act_ct nf_flow_table sch_ingress xt_conntrack xt_MASQUERADE
nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi
scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
ib_uverbs ib_core zram zsmalloc mlx5_core fuse
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rc4+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x53/0x70
bad_page+0x6f/0xf0
free_unref_page_prepare+0x271/0x420
free_unref_page+0x38/0x120
___pskb_trim+0x261/0x390
skb_segment+0xf60/0x1040
tcp_gso_segment+0xe8/0x4e0
inet_gso_segment+0x155/0x3d0
skb_mac_gso_segment+0x99/0x100
__skb_gso_segment+0xb4/0x160
? netif_skb_features+0x95/0x2f0
validate_xmit_skb+0x16c/0x330
validate_xmit_skb_list+0x4c/0x70
sch_direct_xmit+0x23e/0x350
__dev_queue_xmit+0x334/0xbc0
tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
tcf_mirred_act+0xd7/0x4b0 [act_mirred]
? tcf_pedit_act+0x6f/0x540 [act_pedit]
tcf_action_exec+0x82/0x170
fl_classify+0x1ee/0x200 [cls_flower]
? tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
? mlx5e_completion_event+0x39/0x40 [mlx5_core]
? mlx5_eq_comp_int+0x1d7/0x1f0 [mlx5_core]
tcf_classify+0x26a/0x470
tc_run+0xa2/0x120
? handle_irq_event+0x53/0x80
? kvm_clock_get_cycles+0x11/0x20
__netif_receive_skb_core.constprop.0+0x932/0xee0
__netif_receive_skb_list_core+0xfe/0x1f0
netif_receive_skb_list_internal+0x1b5/0x2b0
napi_gro_complete.constprop.0+0xee/0x120
dev_gro_receive+0x3f4/0x710
napi_gro_receive+0x7d/0x220
mlx5e_handle_rx_cqe_mpwrq+0x10d/0x1d0 [mlx5_core]
mlx5e_poll_rx_cq+0x8b/0x6f0 [mlx5_core]
mlx5e_napi_poll+0xdc/0x6c0 [mlx5_core]
__napi_poll+0x25/0x1b0
net_rx_action+0x2c1/0x330
__do_softirq+0xcb/0x28c
irq_exit_rcu+0x69/0x90
common_interrupt+0x85/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x26/0x40
RIP: 0010:default_idle+0x17/0x20
Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e
fa 8b 05 76 3f 0a 01 85 c0 7e 07 0f 00 2d 1d 74 41 00 fb f4 <fa> c3 0f
1f 80 00 00 00 00 f3 0f 1e fa 65 48 8b 35 04 b3 42 7e f0
RSP: 0018:ffff88810087bed8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8881008415c0 RCX: 000000e116d359fb
RDX: 0000000000000000 RSI: ffffffff8223e1d1 RDI: 000000000003f214
RBP: 0000000000000004 R08: 000000000003f214 R09: 000000e116d359fb
R10: 000000e116d359fb R11: 000000000005dfee R12: 0000000000000004
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
default_idle_call+0x3d/0xf0
do_idle+0x1ce/0x1e0
cpu_startup_entry+0x29/0x30
start_secondary+0x109/0x130
common_startup_64+0x129/0x138
</TASK>
How it happens:
-> skb_segment
-> clone skb into nskb
-> call __pskb_trim(nskb)
-> call pskb_expand_head(nskb) because nskb is cloned
-> call skb_release_data(nskb) because nskb is cloned
-> nskb->pp_recycle = 0
-> skb_frag_unref(nskb, i)
-> nskb->pp_recycle is false and the frag is a page_pool page
-> fragment is released via put_page(), hence leaking
from the page_pool.
Something tells me that this is not the only issue of this kind...
The patch itself contains a suggested fix for this specific bug: it
forces the unref in ___pskb_trim() to go through the page_pool recycling
path. If the page is not a page_pool page, it still falls back to being
released as a regular page. The alternative would be to save the
skb->pp_recycle flag before pskb_expand_head() and use it later during
the unref.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
Cc: Mina Almasry <almasrymina@google.com>
---
net/core/skbuff.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37c858dc11a6..ab75b4f876ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
skb_shinfo(skb)->nr_frags = i;
for (; i < nfrags; i++)
- skb_frag_unref(skb, i);
+ __skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
if (skb_has_frag_list(skb))
skb_drop_fraglist(skb);
--
2.44.0