From: Dragos Tatulea <dtatulea@nvidia.com>
To: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	Mina Almasry <almasrymina@google.com>,
	Jacob Keller <jacob.e.keller@intel.com>
Cc: Jianbo Liu <jianbol@nvidia.com>, <netdev@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref
Date: Wed, 24 Apr 2024 19:56:46 +0300
Message-ID: <20240424165646.1625690-2-dtatulea@nvidia.com>

The patch referenced in the Fixes tag introduced the skb_frag_unref API.
This API still has a dark corner: when skb->pp_recycle is false and the
fragment being unreferenced is backed by a page_pool page,
skb_frag_unref can leak that page out of the page_pool, as in the
following example:

 BUG: Bad page state in process swapper/4  pfn:103423
 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x103423000 pfn:0x103423
 flags: 0x200000000000000(node=0|zone=2)
 page_type: 0xffffffff()
 raw: 0200000000000000 dead000000000040 ffff888106f38000 0000000000000000
 raw: 0000000103423000 0000000000000041 00000000ffffffff 0000000000000000
 page dumped because: page_pool leak
 Modules linked in: act_mirred act_csum act_pedit act_gact cls_flower
 act_ct nf_flow_table sch_ingress xt_conntrack xt_MASQUERADE
 nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
 br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi
 scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
 ib_uverbs ib_core zram zsmalloc mlx5_core fuse
 CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rc4+ #2
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x53/0x70
  bad_page+0x6f/0xf0
  free_unref_page_prepare+0x271/0x420
  free_unref_page+0x38/0x120
  ___pskb_trim+0x261/0x390
  skb_segment+0xf60/0x1040
  tcp_gso_segment+0xe8/0x4e0
  inet_gso_segment+0x155/0x3d0
  skb_mac_gso_segment+0x99/0x100
  __skb_gso_segment+0xb4/0x160
  ? netif_skb_features+0x95/0x2f0
  validate_xmit_skb+0x16c/0x330
  validate_xmit_skb_list+0x4c/0x70
  sch_direct_xmit+0x23e/0x350
  __dev_queue_xmit+0x334/0xbc0
  tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
  tcf_mirred_act+0xd7/0x4b0 [act_mirred]
  ? tcf_pedit_act+0x6f/0x540 [act_pedit]
  tcf_action_exec+0x82/0x170
  fl_classify+0x1ee/0x200 [cls_flower]
  ? tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
  ? mlx5e_completion_event+0x39/0x40 [mlx5_core]
  ? mlx5_eq_comp_int+0x1d7/0x1f0 [mlx5_core]
  tcf_classify+0x26a/0x470
  tc_run+0xa2/0x120
  ? handle_irq_event+0x53/0x80
  ? kvm_clock_get_cycles+0x11/0x20
  __netif_receive_skb_core.constprop.0+0x932/0xee0
  __netif_receive_skb_list_core+0xfe/0x1f0
  netif_receive_skb_list_internal+0x1b5/0x2b0
  napi_gro_complete.constprop.0+0xee/0x120
  dev_gro_receive+0x3f4/0x710
  napi_gro_receive+0x7d/0x220
  mlx5e_handle_rx_cqe_mpwrq+0x10d/0x1d0 [mlx5_core]
  mlx5e_poll_rx_cq+0x8b/0x6f0 [mlx5_core]
  mlx5e_napi_poll+0xdc/0x6c0 [mlx5_core]
  __napi_poll+0x25/0x1b0
  net_rx_action+0x2c1/0x330
  __do_softirq+0xcb/0x28c
  irq_exit_rcu+0x69/0x90
  common_interrupt+0x85/0xa0
  </IRQ>
  <TASK>
  asm_common_interrupt+0x26/0x40
 RIP: 0010:default_idle+0x17/0x20
 Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e
 fa 8b 05 76 3f 0a 01 85 c0 7e 07 0f 00 2d 1d 74 41 00 fb f4 <fa> c3 0f
 1f 80 00 00 00 00 f3 0f 1e fa 65 48 8b 35 04 b3 42 7e f0
 RSP: 0018:ffff88810087bed8 EFLAGS: 00000246
 RAX: 0000000000000000 RBX: ffff8881008415c0 RCX: 000000e116d359fb
 RDX: 0000000000000000 RSI: ffffffff8223e1d1 RDI: 000000000003f214
 RBP: 0000000000000004 R08: 000000000003f214 R09: 000000e116d359fb
 R10: 000000e116d359fb R11: 000000000005dfee R12: 0000000000000004
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  default_idle_call+0x3d/0xf0
  do_idle+0x1ce/0x1e0
  cpu_startup_entry+0x29/0x30
  start_secondary+0x109/0x130
  common_startup_64+0x129/0x138
  </TASK>

How it happens:
-> skb_segment
  -> clone skb into nskb
  -> call __pskb_trim(nskb)
    -> call pskb_expand_head(nskb) because nskb is cloned
      -> call skb_release_data(nskb) because nskb is cloned
        -> nskb->pp_recycle = 0
    -> skb_frag_unref(nskb, i)
      -> nskb->pp_recycle is false, but the frag is a page_pool page
      -> the fragment is released via put_page(), hence leaking
         from the page_pool
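
For reference, this is roughly what the helper looks like (paraphrased,
not verbatim): the recycle decision is taken from skb->pp_recycle,
which skb_release_data() has already cleared by the time ___pskb_trim()
drops the frags:

	static inline void skb_frag_unref(struct sk_buff *skb, int f)
	{
		struct skb_shared_info *shinfo = skb_shinfo(skb);

		if (!skb_zcopy_managed(skb))
			__skb_frag_unref(&shinfo->frags[f],
					 skb->pp_recycle);
	}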

Something tells me that this is not the only issue of this kind...

The patch itself contains a suggested fix for this specific bug: it
forces the unref in ___pskb_trim to recycle to the page_pool. If the
page is not a page_pool page, it is released as a regular page.
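
This is safe because of how __skb_frag_unref handles the recycle flag
(paraphrased, not verbatim, CONFIG_PAGE_POOL guard omitted):
napi_pp_put_page() returns false for pages that do not come from a
page_pool, so those still take the regular put_page() path:

	static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
	{
		struct page *page = skb_frag_page(frag);

		/* napi_pp_put_page() returns false for non-page_pool
		 * pages, so those fall through to put_page() below. */
		if (recycle && napi_pp_put_page(page, false))
			return;
		put_page(page);
	}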

The alternative would be to save the skb->pp_recycle flag before
pskb_expand_head and use it later during the unref, along the lines of
the sketch below.
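
Untested, just to illustrate the idea (context lines are approximate,
indented to avoid confusion with the actual patch):

  @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
  +	bool pp_recycle = skb->pp_recycle;
  +
   	if (skb_cloned(skb) &&
   	    unlikely(pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
   		return -ENOMEM;
   ...
   		for (; i < nfrags; i++)
  -			skb_frag_unref(skb, i);
  +			__skb_frag_unref(&skb_shinfo(skb)->frags[i],
  +					 pp_recycle);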

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
Cc: Mina Almasry <almasrymina@google.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37c858dc11a6..ab75b4f876ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			skb_frag_unref(skb, i);
+			__skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
 
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
-- 
2.44.0

