[RFC] net: Fix one page_pool page leak from skb_frag_unref

Message ID 20240424165646.1625690-2-dtatulea@nvidia.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series: [RFC] net: Fix one page_pool page leak from skb_frag_unref

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 928 this patch: 928
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 938 this patch: 938
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 939 this patch: 939
netdev/checkpatch warning WARNING: Co-developed-by: must be immediately followed by Signed-off-by:
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 53 this patch: 53
netdev/source_inline success Was 0 now: 0

Commit Message

Dragos Tatulea April 24, 2024, 4:56 p.m. UTC
The patch referenced in the fixes tag introduced the skb_frag_unref API.
This API still has a dark corner: when skb->pp_recycle is false and a
fragment being unreferenced is backed by a page_pool page, skb_frag_unref
can leak the page out of the page_pool, like in the following example:

 BUG: Bad page state in process swapper/4  pfn:103423
 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x103423000 pfn:0x103423
 flags: 0x200000000000000(node=0|zone=2)
 page_type: 0xffffffff()
 raw: 0200000000000000 dead000000000040 ffff888106f38000 0000000000000000
 raw: 0000000103423000 0000000000000041 00000000ffffffff 0000000000000000
 page dumped because: page_pool leak
 Modules linked in: act_mirred act_csum act_pedit act_gact cls_flower
 act_ct nf_flow_table sch_ingress xt_conntrack xt_MASQUERADE
 nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
 br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi
 scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
 ib_uverbs ib_core zram zsmalloc mlx5_core fuse
 CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rc4+ #2
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x53/0x70
  bad_page+0x6f/0xf0
  free_unref_page_prepare+0x271/0x420
  free_unref_page+0x38/0x120
  ___pskb_trim+0x261/0x390
  skb_segment+0xf60/0x1040
  tcp_gso_segment+0xe8/0x4e0
  inet_gso_segment+0x155/0x3d0
  skb_mac_gso_segment+0x99/0x100
  __skb_gso_segment+0xb4/0x160
  ? netif_skb_features+0x95/0x2f0
  validate_xmit_skb+0x16c/0x330
  validate_xmit_skb_list+0x4c/0x70
  sch_direct_xmit+0x23e/0x350
  __dev_queue_xmit+0x334/0xbc0
  tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
  tcf_mirred_act+0xd7/0x4b0 [act_mirred]
  ? tcf_pedit_act+0x6f/0x540 [act_pedit]
  tcf_action_exec+0x82/0x170
  fl_classify+0x1ee/0x200 [cls_flower]
  ? tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
  ? mlx5e_completion_event+0x39/0x40 [mlx5_core]
  ? mlx5_eq_comp_int+0x1d7/0x1f0 [mlx5_core]
  tcf_classify+0x26a/0x470
  tc_run+0xa2/0x120
  ? handle_irq_event+0x53/0x80
  ? kvm_clock_get_cycles+0x11/0x20
  __netif_receive_skb_core.constprop.0+0x932/0xee0
  __netif_receive_skb_list_core+0xfe/0x1f0
  netif_receive_skb_list_internal+0x1b5/0x2b0
  napi_gro_complete.constprop.0+0xee/0x120
  dev_gro_receive+0x3f4/0x710
  napi_gro_receive+0x7d/0x220
  mlx5e_handle_rx_cqe_mpwrq+0x10d/0x1d0 [mlx5_core]
  mlx5e_poll_rx_cq+0x8b/0x6f0 [mlx5_core]
  mlx5e_napi_poll+0xdc/0x6c0 [mlx5_core]
  __napi_poll+0x25/0x1b0
  net_rx_action+0x2c1/0x330
  __do_softirq+0xcb/0x28c
  irq_exit_rcu+0x69/0x90
  common_interrupt+0x85/0xa0
  </IRQ>
  <TASK>
  asm_common_interrupt+0x26/0x40
 RIP: 0010:default_idle+0x17/0x20
 Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e
 fa 8b 05 76 3f 0a 01 85 c0 7e 07 0f 00 2d 1d 74 41 00 fb f4 <fa> c3 0f
 1f 80 00 00 00 00 f3 0f 1e fa 65 48 8b 35 04 b3 42 7e f0
 RSP: 0018:ffff88810087bed8 EFLAGS: 00000246
 RAX: 0000000000000000 RBX: ffff8881008415c0 RCX: 000000e116d359fb
 RDX: 0000000000000000 RSI: ffffffff8223e1d1 RDI: 000000000003f214
 RBP: 0000000000000004 R08: 000000000003f214 R09: 000000e116d359fb
 R10: 000000e116d359fb R11: 000000000005dfee R12: 0000000000000004
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  default_idle_call+0x3d/0xf0
  do_idle+0x1ce/0x1e0
  cpu_startup_entry+0x29/0x30
  start_secondary+0x109/0x130
  common_startup_64+0x129/0x138
  </TASK>

How it happens:
-> skb_segment
  -> clone skb into nskb
  -> call __pskb_trim(nskb)
    -> call pskb_expand_head(nskb) because nskb is cloned
      -> call skb_release_data(nskb) because nskb is cloned
        -> nskb->pp_recycle = 0
    -> skb_frag_unref(nskb, i)
      -> nskb->pp_recycle is false, frag is a page_pool page
      -> fragment is released via put_page(), hence leaking
         from the page_pool (see the sketch below)
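
For context, the recycle gate in the unref helper looks roughly like this
(a simplified sketch of skb_page_unref() from include/linux/skbuff_ref.h,
quoted in full later in this thread; not the verbatim upstream code):

static inline void skb_page_unref(struct page *page, bool recycle)
{
#ifdef CONFIG_PAGE_POOL
	/* Only recycle when the caller claims a page_pool ref is held. */
	if (recycle && napi_pp_put_page(page))
		return;
#endif
	/* With recycle == false, a page_pool page falls through to here
	 * and escapes the pool via put_page() -- the leak shown above.
	 */
	put_page(page);
}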

Something tells me that this is not the only issue of this kind...

The patch itself contains a suggested fix for this specific bug: it
forces the unref in ___pskb_trim to recycle to the page_pool. If the
page is not a page_pool page, it will be released as a regular page.

The alternative would be to save the skb->pp_recycle flag before
pskb_expand_head and use it later during the unref, as sketched below.
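
A rough, untested sketch of that alternative, expressed as a diff against
___pskb_trim() (hunk headers and context are approximate, not a real patch):

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
+	/* Sample pp_recycle before pskb_expand_head() resets it. */
+	bool pp_recycle = skb->pp_recycle;
+
 	if (skb_cloned(skb) &&
 	    unlikely(pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
 		return -ENOMEM;
@@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 		for (; i < nfrags; i++)
-			skb_frag_unref(skb, i);
+			__skb_frag_unref(&skb_shinfo(skb)->frags[i], pp_recycle);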

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
Cc: Mina Almasry <almasrymina@google.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Dragos Tatulea April 24, 2024, 5:25 p.m. UTC | #1
On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> The patch referenced in the fixes tag introduced the skb_frag_unref API.
This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
confusion.

> [... rest of commit message and splat snipped ...]

One more interesting point is the comment in skb_release_data [1] and its
commit message [2] from Ilias. Looking at the commit message ...

[1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803

> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
> Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> Cc: Mina Almasry <almasrymina@google.com>
> ---
>  net/core/skbuff.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 37c858dc11a6..ab75b4f876ce 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
>  		skb_shinfo(skb)->nr_frags = i;
>  
>  		for (; i < nfrags; i++)
> -			skb_frag_unref(skb, i);
> +			__skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
>  
>  		if (skb_has_frag_list(skb))
>  			skb_drop_fraglist(skb);
Mina Almasry April 24, 2024, 10:08 p.m. UTC | #2
On Wed, Apr 24, 2024 at 10:25 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> > The patch referenced in the fixes tag introduced the skb_frag_unref API.
> This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
> confusion.
>
> > [... commit message and splat snipped ...]
>
> One more interesting point is the comment in skb_release_data [1] and its
> commit message [2] from Ilias. Looking at the commit message ...
>

Sorry for the bug. I don't think this is the right fix to be honest
though. As you mention this is only 1 such instance of a general
underlying issue.

The underlying issue is that before the fixes commit, skb_frag_ref()
grabbed a regular page ref. After the fixes commit, skb_frag_ref()
grabs a page-pool ref. The unref path always dropped a regular page
ref, thanks to this commit as you point out:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803

AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
("skbuff: Fix a potential race while recycling page_pool packets").
The reason is that now that skb_frag_ref() can grab page-pool refs, we
don't need to make sure there is only 1 SKB that triggers the recycle
path anymore. All the skb and its clones can obtain page-pool refs,
and in the unref path we drop the page-pool refs. page_pool_put_page()
detects correctly that the last page-pool ref is put and recycles the
page only then.
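
In other words, the intended symmetry (a minimal sketch, assuming the
skb_frag_ref()/__skb_frag_unref() semantics from the patch under
discussion):

	/* Each clone takes its own page_pool ref on the frag page... */
	skb_frag_ref(skb, i);		/* clone A: pp_ref_count++ */
	skb_frag_ref(skb, i);		/* clone B: pp_ref_count++ */

	/* ...and each release drops one; only the last put recycles. */
	__skb_frag_unref(&skb_shinfo(skb)->frags[i], true); /* pp_ref_count-- */
	__skb_frag_unref(&skb_shinfo(skb)->frags[i], true); /* last ref: recycle */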

I'm unable to repro this issue. Do you need to do anything special? Or
is 1 flow that hits skb_segment() good enough?

Reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
recycling page_pool packets") shows no regressions for me. Is it
possible to try that out? If that doesn't work, I think I prefer
reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
rather than merging this fix to make sure we removed the underlying
cause of the issue.

Dragos Tatulea April 25, 2024, 8:17 a.m. UTC | #3
On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> On Wed, Apr 24, 2024 at 10:25 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> > [... earlier discussion snipped ...]
> 
> Sorry for the bug. I don't think this is the right fix to be honest
> though. As you mention this is only 1 such instance of a general
> underlying issue.
> 
Right. It was a conversation starter :).

> The underlying issue is that before the fixes commit, skb_frag_ref()
> grabbed a regular page ref. After the fixes commit, skb_frag_ref()
> grabs a page-pool ref.
> 
It is still possible for skb_frag_ref() to grab a regular page ref on a
page_pool page: for example in skb_segment(), when an nskb is created (not
cloned, so pp_recycle is off) and frags with page_pool pages are added to
it. I was seeing this in my test as well.

If the intention is to always use page_pool ref/unref for page_pool pages,
then the recycle flag check doesn't belong in skb_frag_ref/unref().
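
i.e. something along these lines (hypothetical sketch; is_pp_page()
stands in for the page->pp_magic/PP_SIGNATURE check that shows up later
in this thread):

	/* Hypothetical type-based ref: dispatch on what the page is,
	 * not on the skb's pp_recycle bit.
	 */
	if (is_pp_page(page))
		napi_pp_get_page(page);	/* page_pool ref */
	else
		get_page(page);		/* regular page ref */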

>  The unref path always dropped a regular page
> ref, thanks to this commit as you point out:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> 
> AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> ("skbuff: Fix a potential race while recycling page_pool packets").
> The reason is that now that skb_frag_ref() can grab page-pool refs, we
> don't need to make sure there is only 1 SKB that triggers the recycle
> path anymore. All the skb and its clones can obtain page-pool refs,
> and in the unref path we drop the page-pool refs. page_pool_put_page()
> detects correctly that the last page-pool ref is put and recycles the
> page only then.
> 
I don't think this is a good way forward. For example, skb->pp_recycle is used
as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
flag states. This could interfere with that.

> I'm unable to repro this issue. Do you need to do anything special? Or
> is 1 flow that hits skb_segment() good enough?
> 
I don't have a very easy repro. But the test is a forwarding one.

> Reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> recycling page_pool packets") shows no regressions for me. Is it
> possible to try that out?
> 
I can try it out, it might work. But again, I fear that simply reverting
this change could trigger other unexpected behaviours. Also, it doesn't
address the skb_frag_ref behaviour mentioned above.

>  If that doesn't work, I think I prefer
> reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> rather than merging this fix to make sure we removed the underlying
> cause of the issue.
This is the safest bet.

So, to recap, I see 2 possibilities:

1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe, but it
will probably have to come back in one way or another.
2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule of
always referencing/dereferencing pages based on their type (page_pool or
normal). 

Thanks,
Dragos
> 
> > [1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
> > [2]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> > 
> > > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
> > > Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > > Cc: Mina Almasry <almasrymina@google.com>
> > > ---
> > >  net/core/skbuff.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 37c858dc11a6..ab75b4f876ce 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
> > >               skb_shinfo(skb)->nr_frags = i;
> > > 
> > >               for (; i < nfrags; i++)
> > > -                     skb_frag_unref(skb, i);
> > > +                     __skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
> > > 
> > >               if (skb_has_frag_list(skb))
> > >                       skb_drop_fraglist(skb);
> > 
> 
>
Mina Almasry April 25, 2024, 7:20 p.m. UTC | #4
On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> >  If that doesn't work, I think I prefer
> > reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > rather than merging this fix to make sure we removed the underlying
> > cause of the issue.
> This is the safest bet.
>
> So, to recap, I see 2 possibilities:
>
> 1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe, but it
> will probably have to come back in one way or another.
> 2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule of
> always referencing/dereferencing pages based on their type (page_pool or
> normal).
>

If this works, I would be very happy. I personally think ref/unref
should be done based on the page type. For me the net stack using the
regular {get|put}_page on a pp page isn't great. It requires special
handling to make sure the ref + unref are in sync. Also if the last pp
ref is dropped while there are pending regular refs,
__page_pool_page_can_be_recycled() check will fail and the page will
not be recycled.

On the other hand, since 0a149ab78ee2 ("page_pool: transition to
reference count management after page draining") I'm not sure there is
any reason to continue to use get/put_page on pp-pages, we can use the
new pp-ref instead.

I don't see any regressions with this diff (needs cleanup), but your
test setup seems much much better than mine (I think this is the
second reffing issue you managed to repro):

diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
index 4dcdbe9fbc5f..4c72227dce1b 100644
--- a/include/linux/skbuff_ref.h
+++ b/include/linux/skbuff_ref.h
@@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page)
 static inline void skb_page_ref(struct page *page, bool recycle)
 {
 #ifdef CONFIG_PAGE_POOL
-       if (recycle && napi_pp_get_page(page))
+       if (napi_pp_get_page(page))
                return;
 #endif
        get_page(page);
@@ -69,7 +69,7 @@ static inline void
 skb_page_unref(struct page *page, bool recycle)
 {
 #ifdef CONFIG_PAGE_POOL
-       if (recycle && napi_pp_put_page(page))
+       if (napi_pp_put_page(page))
                return;
 #endif
        put_page(page);
Dragos Tatulea April 25, 2024, 7:47 p.m. UTC | #5
On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> > [... recap snipped ...]
> 
> If this works, I would be very happy. I personally think ref/unref
> should be done based on the page type. For me the net stack using the
> regular {get|put}_page on a pp page isn't great. It requires special
> handling to make sure the ref + unref are in sync. Also if the last pp
> ref is dropped while there are pending regular refs,
> __page_pool_page_can_be_recycled() check will fail and the page will
> not be recycled.
> 
> On the other hand, since 0a149ab78ee2 ("page_pool: transition to
> reference count management after page draining") I'm not sure there is
> any reason to continue to use get/put_page on pp-pages, we can use the
> new pp-ref instead.
> 
> I don't see any regressions with this diff (needs cleanup), but your
> test setup seems much much better than mine (I think this is the
> second reffing issue you manage to repro):
> 
> diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
> index 4dcdbe9fbc5f..4c72227dce1b 100644
> --- a/include/linux/skbuff_ref.h
> +++ b/include/linux/skbuff_ref.h
> @@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page)
>  static inline void skb_page_ref(struct page *page, bool recycle)
>  {
>  #ifdef CONFIG_PAGE_POOL
> -       if (recycle && napi_pp_get_page(page))
> +       if (napi_pp_get_page(page))
>                 return;
>  #endif
>         get_page(page);
> @@ -69,7 +69,7 @@ static inline void
>  skb_page_unref(struct page *page, bool recycle)
>  {
>  #ifdef CONFIG_PAGE_POOL
> -       if (recycle && napi_pp_put_page(page))
> +       if (napi_pp_put_page(page))
>                 return;
>  #endif
>         put_page(page);
> 
> 
This is option 2. I thought this would fix everything. But I just tested and
it's not the case: we are now reaching a negative pp_ref_count. So probably
somewhere a regular page reference is still being taken...

Thanks,
Dragos
Mina Almasry April 25, 2024, 8:42 p.m. UTC | #6
On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> > [... recap and diff snipped ...]
> This is option 2. I thought this would fix everything. But I just tested and
> it's not the case: we are now reaching a negative pp_ref_count. So probably
> somewhere a regular page reference is still being taken...
>

I would guess the most likely root cause of this would be a call site
that does get_page() instead of skb_frag_ref(), right?

The other possibility would be something like:

- page is not pp_page
- skb_page_ref(page) // obtains a regular reference.
- page is converted to pp_page
- skb_page_unref(page) // drops a pp reference.

But I'm not aware of non-pp pages ever being converted to pp pages.

You probably figured this out already, but if you would like to dig
further instead of reverting the offending patch, this diff would
probably catch the get_page() callsite, no? (on my test setup this
debug code doesn't trigger).

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b0ee64225de..a22a676f4b6b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1473,8 +1473,14 @@ static inline void folio_get(struct folio *folio)
        folio_ref_inc(folio);
 }

+static inline bool debug_is_pp_page(struct page *page)
+{
+       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
 static inline void get_page(struct page *page)
 {
+       WARN_ON_ONCE(debug_is_pp_page(page));
        folio_get(page_folio(page));
 }

@@ -1569,6 +1575,8 @@ static inline void put_page(struct page *page)
 {
        struct folio *folio = page_folio(page);

+       WARN_ON_ONCE(debug_is_pp_page(page));
+
        /*
         * For some devmap managed pages we need to catch refcount transition
         * from 2 to 1:
Dragos Tatulea April 26, 2024, 2:59 p.m. UTC | #7
On Thu, 2024-04-25 at 13:42 -0700, Mina Almasry wrote:
> On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> > 
> > [... recap and diff snipped ...]
> > This is option 2. I thought this would fix everything. But I just tested and
> > it's not the case: we are now reaching a negative pp_ref_count. 
> > 
I was tired and botched the revert of the code in this RFC when testing.
Dropping the recycle flag works as expected. Do we need an RFC v2 or is
this non-RFC material?

Thanks,
Dragos

Mina Almasry April 26, 2024, 7:03 p.m. UTC | #8
On Fri, Apr 26, 2024 at 7:59 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Thu, 2024-04-25 at 13:42 -0700, Mina Almasry wrote:
> > On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> > >
> > > On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> > > > [... diff snipped ...]
> > > This is option 2. I thought this would fix everything. But I just tested and
> > > it's not the case: we are now reaching a negative pp_ref_count.
> > >
> I was tired and botched the revert of the code in this RFC when testing.
> Dropping the recycle flag works as expected. Do we need an RFC v2 or is this non
> RFC material?
>
> Thanks,
> Dragos
>

IMO, it needs to be cleaned up to remove the now-dead bool recycle flag,
but other than that I think it's good as a non-RFC.

To save you some time I did the said clean up and here is a patch
passing make allmodconfig build + checkpatch/kdoc checks + tests on my
end:

https://pastebin.com/bX5KcHTb

I could submit it to the list if you prefer.

FWIW, if this approach shows no regressions, I think we may be able to
deprecate the skb->pp_recycle flag entirely, but I can look into that as a
separate change.

--
Thanks,
Mina
Jakub Kicinski April 26, 2024, 11:05 p.m. UTC | #9
On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > [... snipped ...]
> I don't think this is a good way forward. For example, skb->pp_recycle is used
> as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> flag states. This could interfere with that.

That's a bit speculative, right? The simple invariant we are trying to
hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
reference skb is holding on that frag is a pp reference, not page
reference.

skb_gro_receive() needs to maintain that invariant, if it doesn't
we need to fix it..
Jakub Kicinski April 26, 2024, 11:07 p.m. UTC | #10
On Wed, 24 Apr 2024 15:08:42 -0700 Mina Almasry wrote:
> I'm unable to repro this issue. Do you need to do anything special? Or
> is 1 flow that hits skb_segment() good enough?

At some point we may want to start writing unit tests to make sure 
we don't regress the 5 corner cases we found previously.. ;)
Should be easy to operate on skbs under kunit.
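
For illustration, the rough shape such a test could take (a hypothetical
KUnit sketch, not an existing test: the page_pool setup, the assertions,
and the refcount-based leak check are all assumptions):

static void skb_pp_trim_clone_leak_test(struct kunit *test)
{
	struct page_pool_params params = {
		.order		= 0,
		.pool_size	= 8,
		.nid		= NUMA_NO_NODE,
	};
	struct page_pool *pool = page_pool_create(&params);
	struct sk_buff *skb, *clone;
	struct page *page;
	int ref_before;

	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, pool);

	page = page_pool_alloc_pages(pool, GFP_KERNEL);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, page);

	/* Build an skb whose only frag is a page_pool page. */
	skb = alloc_skb(0, GFP_KERNEL);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, skb);
	skb_add_rx_frag(skb, 0, page, 0, PAGE_SIZE, PAGE_SIZE);
	skb->pp_recycle = 1;

	clone = skb_clone(skb, GFP_KERNEL);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, clone);

	ref_before = page_ref_count(page);

	/* Trimming the clone forces pskb_expand_head() + frag unref,
	 * i.e. the path that leaked in the splat above.
	 */
	KUNIT_EXPECT_EQ(test, pskb_trim(clone, 0), 0);
	kfree_skb(clone);

	/* A put_page() in the unref path shows up as a lower base
	 * page refcount; the page_pool ref path leaves it untouched.
	 */
	KUNIT_EXPECT_EQ(test, page_ref_count(page), ref_before);

	kfree_skb(skb);
	page_pool_destroy(pool);
}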
Jakub Kicinski April 26, 2024, 11:08 p.m. UTC | #11
On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> -       if (recycle && napi_pp_get_page(page))
> +       if (napi_pp_get_page(page))

Pretty sure you can't do that. The "recycle" here is a concurrency
guarantee. A guarantee someone is holding a pp ref on that page,
a ref which will not go away while napi_pp_get_page() is executing.
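
Roughly the interleaving that guarantee rules out (a sketch of the race
as described; the exact call sequence is an assumption):

/*
 * CPU 0: caller without a pp ref        CPU 1: last pp ref holder
 *
 * skb_page_ref(page, recycle)
 *   napi_pp_get_page(page)
 *     reads the page's pp state         __skb_frag_unref(frag, true)
 *                                         napi_pp_put_page(page)
 *                                           last pp ref drops; page is
 *                                           recycled or released
 *     takes a pp ref on a page that
 *     may already have left the pool
 */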
Mina Almasry April 27, 2024, 4:24 a.m. UTC | #12
On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> > -       if (recycle && napi_pp_get_page(page))
> > +       if (napi_pp_get_page(page))
>
> Pretty sure you can't do that. The "recycle" here is a concurrency
> guarantee. A guarantee someone is holding a pp ref on that page,
> a ref which will not go away while napi_pp_get_page() is executing.

I don't mean to argue, but I think the get_page()/put_page() pair we
do in the page ref path is susceptible to the same issue. AFAIU it's
not safe to get_page() if another CPU can be dropping the last ref,
get_page_unless_zero() should be used instead.

Since get_page() is fine in the page ref path without such a guarantee,
it's not obvious to me why we need this guarantee in the pp ref path,
but I could be missing some subtlety. At any rate, if you prefer us
going down the road of reverting commit 2cc3aeb5eccc ("skbuff: Fix a
potential race while recycling page_pool packets"), I think that could
also fix the issue.
Dragos Tatulea April 29, 2024, 7:39 a.m. UTC | #13
On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > > [... snipped ...]
> > I don't think this is a good way forward. For example, skb->pp_recycle is used
> > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> > flag states. This could interfere with that.
> 
> That's a bit speculative, right? The simple invariant we are trying to
> hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
> reference skb is holding on that frag is a pp reference, not page
> reference.
> 
Yes, it was a speculative statement. After re-reading it and the code of
skb_gro_receive() it makes less sense now.

Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race
while recycling page_pool packets") seems less scary now. I just hope we don't
bump into too many scenarios similar to the ipsec one...

> skb_gro_receive() needs to maintain that invariant, if it doesn't
> we need to fix it..
> 

Thanks,
Dragos
Jakub Kicinski April 29, 2024, 3 p.m. UTC | #14
On Fri, 26 Apr 2024 21:24:09 -0700 Mina Almasry wrote:
> On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:  
> > > -       if (recycle && napi_pp_get_page(page))
> > > +       if (napi_pp_get_page(page))  
> >
> > Pretty sure you can't do that. The "recycle" here is a concurrency
> > guarantee. A guarantee someone is holding a pp ref on that page,
> > a ref which will not go away while napi_pp_get_page() is executing.  
> 
> I don't mean to argue, but I think the get_page()/put_page() pair we
> do in the page ref path is susceptible to the same issue. AFAIU it's
> not safe to get_page() if another CPU can be dropping the last ref,
> get_page_unless_zero() should be used instead.

Whoever gave us the pointer to operate on has a reference, so the page
can't disappear. get_page() is safe. The problem with pp is that we
don't know whether the caller has a pp ref or a page ref. IOW the pp
ref may not be owned by whoever called us.

I guess the situation may actually be worse and we can only pp-ref a
page if both the "source" and "destination" skbs have pp_recycle = 1 :S

> Since get_page() is good in the page ref path without some guarantee,
> it's not obvious to me why we need this guarantee in the pp ref path,
> but I could be missing some subtlety. At any rate, if you prefer us
> going down the road of reverting commit 2cc3aeb5eccc ("skbuff: Fix a
> potential race while recycling page_pool packets"), I think that could
> also fix the issue.
Dragos Tatulea May 1, 2024, 6:20 a.m. UTC | #15
On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote:
> On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> > [... snipped ...]
> 
Gentle ping. Not sure how to proceed with this:

1) Revert commit 2cc3aeb5eccc
("skbuff: Fix a potential race while recycling page_pool packets"). I tested
this btw and it works (for this specific scenario).

2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
for now.

Thanks,
Dragos
Mina Almasry May 1, 2024, 7:48 a.m. UTC | #16
On Tue, Apr 30, 2024 at 11:20 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote:
> > [... snipped ...]
> Gentle ping. Not sure how to proceed with this:
>
> 1) Revert commit 2cc3aeb5eccc
> ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> this btw and it works (for this specific scenario).
>
> 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> for now.
>

I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
well. If we continue to run into edge cases after doing #1, I think we
may want to do #2, and I can look to reland it with the kunit tests
Jakub suggested to reproduce these edge cases.

I can upload #1 in the morning if there are no objections. I don't see
any regressions with #1, but I was never able to reproduce this issue.
Dragos Tatulea May 1, 2024, 7:58 a.m. UTC | #17
On Wed, 2024-05-01 at 00:48 -0700, Mina Almasry wrote:
> On Tue, Apr 30, 2024 at 11:20 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> > 
> > On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote:
> > > On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> > > > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > > > > >  The unref path always dropped a regular page
> > > > > > ref, thanks to this commit as you point out:
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> > > > > > 
> > > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> > > > > > ("skbuff: Fix a potential race while recycling page_pool packets").
> > > > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we
> > > > > > no longer need to make sure only one SKB triggers the recycle path.
> > > > > > The skb and all its clones can obtain page-pool refs, and in the
> > > > > > unref path we drop the page-pool refs. page_pool_put_page()
> > > > > > correctly detects when the last page-pool ref is put and only then
> > > > > > recycles the page.
> > > > > > 
> > > > > I don't think this is a good way forward. For example, skb->pp_recycle is used
> > > > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> > > > > flag states. This could interfere with that.
> > > > 
> > > > That's a bit speculative, right? The simple invariant we are trying to
> > > > maintain is that if skb->pp_recycle && skb_frag_is_pp(skb, i), then the
> > > > reference the skb is holding on that frag is a pp reference, not a page
> > > > reference.
> > > > 
> > > Yes, it was a speculative statement. After re-reading it and the code
> > > of skb_gro_receive(), the concern makes less sense now.
> > > 
> > > Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race
> > > while recycling page_pool packets") seems less scary now. I just hope we don't
> > > bump into too many scenarios similar to the ipsec one...
> > > 
> > > > skb_gro_receive() needs to maintain that invariant; if it doesn't,
> > > > we need to fix it.
> > > > 
> > > 
> > Gentle ping. Not sure how to proceed with this:
> > 
> > 1) Revert commit 2cc3aeb5eccc
> > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > this btw and it works (for this specific scenario).
> > 
> > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > for now.
> > 
> 
> I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> well. If we continue to run into edge cases after doing #1, I think we
> may want to do #2, and I can look to reland it with the kunit tests
> Jakub suggested to reproduce these edge cases.
> 
> I can upload #1 in the morning if there are no objections. I don't see
> any regressions with #1, but I was never able to reproduce this issue.
> 
Yes, please. And once you post it, we can also take it internally to
check all the other tests that were failing in the same place.

Thanks,
Dragos
Jakub Kicinski May 1, 2024, 2:24 p.m. UTC | #18
On Wed, 1 May 2024 00:48:43 -0700 Mina Almasry wrote:
> > 1) Revert commit 2cc3aeb5eccc
> > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > this btw and it works (for this specific scenario).
> >
> > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > for now.
> 
> I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> well.

I vote #2, actually :( Or #3: make the page pool ref safe to acquire
concurrently, but that plus fixing all the places where we do crazy
things may be tricky.

Even taking the ref is not as simple as using atomic_long_inc_not_zero()
sadly, partly because we try to keep the refcount at one, in an apparent
attempt to avoid dirtying the cache line twice.
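
For reference, here is a simplified sketch of that fast path (assumed
shape based on the current helpers, not verbatim upstream code). The
read-then-set sequence is exactly what a concurrent
atomic_long_inc_not_zero() would race against:

/* Simplified sketch of the pp unref fast path: when the refs being
 * dropped are the only outstanding ones, the count is reset to 1 with
 * a plain atomic_long_set() instead of an atomic RMW. An increment
 * landing between the read and the set would be lost.
 */
static inline long page_pool_unref_page_sketch(struct page *page, long nr)
{
	/* Fast path: we are dropping the only outstanding refs. */
	if (atomic_long_read(&page->pp_ref_count) == nr) {
		if (nr != 1)
			atomic_long_set(&page->pp_ref_count, 1);
		return 0;	/* caller may recycle the page */
	}
	/* Slow path: other refs remain, do the atomic sub. */
	return atomic_long_sub_return(nr, &page->pp_ref_count);
}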

So maybe a partial revert to stop the bleeding and retry after more
testing is the way to go?

I had a quick look at the code and there is also a bunch of functions
which "shift" frags from one skb to another, without checking whether
the pp_recycle state matches.
Jakub Kicinski May 1, 2024, 2:28 p.m. UTC | #19
On Wed, 1 May 2024 07:24:34 -0700 Jakub Kicinski wrote:
> I vote #2, actually :( Or #3: make the page pool ref safe to acquire
> concurrently, but that plus fixing all the places where we do crazy
> things may be tricky.
> 
> Even taking the ref is not as simple as using atomic_long_inc_not_zero()
> sadly, partly because we try to keep the refcount at one, in an apparent
> attempt to avoid dirtying the cache line twice.
> 
> So maybe a partial revert to stop the bleeding and retry after more
> testing is the way to go?
> 
> I had a quick look at the code and there is also a bunch of functions
> which "shift" frags from one skb to another, without checking whether
> the pp_recycle state matches.

BTW these two refs seem to look at the wrong skb:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0c8b82750000..afd3336928d0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2148,7 +2148,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			skb_frag_ref(skb, i);
+			skb_frag_ref(n, i);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -5934,7 +5934,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	 * since we set nr_frags to 0.
 	 */
 	for (i = 0; i < from_shinfo->nr_frags; i++)
-		__skb_frag_ref(&from_shinfo->frags[i], from->pp_recycle);
+		__skb_frag_ref(&from_shinfo->frags[i], to->pp_recycle);
 
 	to->truesize += delta;
 	to->len += len;
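
For context, a minimal sketch of the ref helpers these hunks feed into
(simplified from the current shape, not verbatim upstream code). The
recycle flag of the skb that will *hold* the new reference decides
whether a pp ref or a plain page ref is taken, which is why both call
sites must look at the destination skb (n / to), not the source:

/* Sketch: `recycle` must describe the skb that will own the new
 * reference on the frag.
 */
static inline void __skb_frag_ref_sketch(skb_frag_t *frag, bool recycle)
{
	struct page *page = skb_frag_page(frag);

	/* pp-aware skbs take a page_pool ref on pp-backed frags ... */
	if (recycle && napi_pp_get_page(page))
		return;
	/* ... everything else takes a plain page ref. */
	get_page(page);
}

static inline void skb_frag_ref_sketch(struct sk_buff *skb, int i)
{
	__skb_frag_ref_sketch(&skb_shinfo(skb)->frags[i], skb->pp_recycle);
}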
Mina Almasry May 1, 2024, 4:23 p.m. UTC | #20
On Wed, May 1, 2024 at 7:24 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 1 May 2024 00:48:43 -0700 Mina Almasry wrote:
> > > 1) Revert commit 2cc3aeb5eccc
> > > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > > this btw and it works (for this specific scenario).
> > >
> > > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > > for now.
> >
> > I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> > well.
>
> I vote #2, actually :( Or #3: make the page pool ref safe to acquire
> concurrently, but that plus fixing all the places where we do crazy
> things may be tricky.
>
> Even taking the ref is not as simple as using atomic_long_inc_not_zero()
> sadly, partly because we try to keep the refcount at one, in an apparent
> attempt to avoid dirtying the cache line twice.
>
> So maybe a partial revert to stop the bleeding and retry after more
> testing is the way to go?
>

OK, I will upload a revert sometime today.

> I had a quick look at the code and there is also a bunch of functions
> which "shift" frags from one skb to another, without checking whether
> the pp_recycle state matches.

You posted a diff; I will pick it up in a separate patch.
Mina Almasry May 2, 2024, 8:08 p.m. UTC | #21
On Mon, Apr 29, 2024 at 8:00 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 26 Apr 2024 21:24:09 -0700 Mina Almasry wrote:
> > On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> > > > -       if (recycle && napi_pp_get_page(page))
> > > > +       if (napi_pp_get_page(page))
> > >
> > > Pretty sure you can't do that. The "recycle" here is a concurrency
> > > guarantee: a guarantee that someone is holding a pp ref on that page,
> > > a ref which will not go away while napi_pp_get_page() is executing.
> >
> > I don't mean to argue, but I think the get_page()/put_page() pair we
> > do in the page ref path is susceptible to the same issue. AFAIU it's
> > not safe to get_page() if another CPU can be dropping the last ref;
> > get_page_unless_zero() should be used instead.
>

I uploaded a revert for review, but before relanding I need to
understand the concern here a bit better. AFAICT the diff you're
responding to is safe, and it works very well with devmem, so it would
be my preferred approach for relanding (though there are other options
if you are convinced it's bad). FWIW my thoughts:

> Whoever gave us the pointer to operate on has a reference, so the page
> can't disappear. get_page() is safe.

Agreed.

> The problem with pp is that we
> don't know whether the caller has a pp ref or a page ref. IOW the pp
> ref may not be owned by whoever called us.
>

OK, this is where I'm not sure anymore. The diff you're replying to
attempts to enforce the invariant: "if anyone wants a reference on an
skb_frag, skb_frag_ref will take a pp ref on pp frags
(is_pp_page==true) and a page ref on non-pp frags
(is_pp_page==false)".

Additionally, the page doesn't transition from pp to non-pp or vice
versa while anyone is holding a pp ref, because
page_pool_set_pp_info() is called right after the page is obtained
from the buddy allocator (before the page pool hands it out) and
page_pool_clear_pp_info() is called only after all the pp refs are
dropped.

So:

1. We know the caller has a ref (otherwise get_page() wouldn't be safe
in the non-pp case).
2. We know that the page has not transitioned from pp to non-pp or
vice versa since the caller obtained the ref (from code inspection, pp
info is not changed until all the refs are dropped for pp pages).
3. AFAICT, it follows that if the page is pp, then the caller has a pp
ref, and if the page is non-pp, then the caller has a page ref.
4. So, if is_pp_page==true, then the caller has a pp ref, and
napi_pp_get_page() should be concurrently safe.
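
To make the argument concrete, a hypothetical assertion helper (not in
the tree; is_pp_page() here is the static helper in skbuff.c) encoding
steps 1-4:

/* Hypothetical helper: given that the caller already holds a ref of
 * the matching kind, mirroring the ref by page type is safe.
 */
static inline void skb_frag_ref_checked(struct sk_buff *skb, int i)
{
	struct page *page = skb_frag_page(&skb_shinfo(skb)->frags[i]);

	if (is_pp_page(page)) {
		/* By steps 1-3 the caller holds a pp ref, so
		 * pp_ref_count cannot hit zero under us.
		 */
		DEBUG_NET_WARN_ON_ONCE(atomic_long_read(&page->pp_ref_count) < 1);
		page_pool_ref_page(page);
	} else {
		/* Non-pp frag: the caller holds a page ref, so a plain
		 * get_page() (no get_page_unless_zero()) is safe.
		 */
		get_page(page);
	}
}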

AFAICT the only way my mental model is broken is if there is code in
the core net stack doing a raw get_page() rather than skb_frag_ref().
I would like to get rid of such call sites if they exist; I think they
would not interact well with devmem (though they could be made to work
with some effort).

--
Thanks,
Mina
Jakub Kicinski May 3, 2024, 12:46 a.m. UTC | #22
On Thu, 2 May 2024 13:08:23 -0700 Mina Almasry wrote:
> OK, this is where I'm not sure anymore. The diff you're replying to
> attempts to enforce the invariant: "if anyone wants a reference on an
> skb_frag, skb_frag_ref will take a pp ref on pp frags
> (is_pp_page==true) and a page ref on non-pp frags
> (is_pp_page==false)".
> 
> Additionally, the page doesn't transition from pp to non-pp or vice
> versa while anyone is holding a pp ref, because
> page_pool_set_pp_info() is called right after the page is obtained
> from the buddy allocator (before the page pool hands it out) and
> page_pool_clear_pp_info() is called only after all the pp refs are
> dropped.
> 
> So:
> 
> 1. We know the caller has a ref (otherwise get_page() wouldn't be safe
> in the non-pp case).
> 2. We know that the page has not transitioned from pp to non-pp or
> vice versa since the caller obtained the ref (from code inspection, pp

How do we know that?

> info is not changed until all the refs are dropped for pp pages).
> 3. AFAICT, it follows that if the page is pp, then the caller has a pp
> ref, and if the page is non-pp, then the caller has a page ref.

If that's true we have nothing to worry about.

> 4. So, if is_pp_page==true, then the caller has a pp ref, and
> napi_pp_get_page() should be concurrently safe.
> 
> AFAICT the only way my mental model is broken is if there is code in
> the core net stack doing a raw get_page() rather than skb_frag_ref().

Not just. We also get pages fed from the outside, which may be PP pages.
Right now it'd be okay because "from outside" pages would end up in Tx.
Tx always allocates skbs with ->pp_recycle = 0, so we'll hold full refs.
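
A hedged illustration of that assumption (the helper name is made up;
the two-argument __skb_frag_ref() is the one touched by the diff in
#19):

/* Tx skbs are built with pp_recycle == 0, so a frag ref taken through
 * them is a full page ref and never touches the page's pp_ref_count,
 * even if the page happens to belong to a page_pool.
 */
static void tx_frag_ref_example(struct sk_buff *skb, int i)
{
	WARN_ON_ONCE(skb->pp_recycle);	/* Tx constructors leave this 0 */

	/* recycle == false: this resolves to get_page(), not a pp ref */
	__skb_frag_ref(&skb_shinfo(skb)->frags[i], skb->pp_recycle);
}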

> I would like to get rid of such call sites if they exist; I think they
> would not interact well with devmem (though they could be made to work
> with some effort).

Maybe if we convert more code to operate on netmem_ref it will become
clearer where raw pages get fed in, and therefore where we have to be
very careful?
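
A sketch of the direction that suggests (hedged, illustrative only;
netmem_ref and the page_to_netmem()/netmem_to_page() casts exist, but
the helper below does not). Once helpers take netmem_ref instead of
struct page *, feeding a raw page in requires an explicit conversion,
which makes such call sites greppable and auditable:

/* Hypothetical netmem-typed ref helper: a caller holding a raw page
 * would have to convert via page_to_netmem() at the call site,
 * e.g. netmem_get_sketch(page_to_netmem(page)).
 */
static inline void netmem_get_sketch(netmem_ref netmem)
{
	get_page(netmem_to_page(netmem));
}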

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37c858dc11a6..ab75b4f876ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2634,7 +2634,7 @@  int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 		skb_shinfo(skb)->nr_frags = i;
 
 		for (; i < nfrags; i++)
-			skb_frag_unref(skb, i);
+			__skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
 
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
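
For readers following the RFC: the hunk above switches ___pskb_trim()
from skb_frag_unref(skb, i), which consults skb->pp_recycle, to
__skb_frag_unref(..., true), which always tries the page_pool path
first. A simplified sketch of the helper's semantics (assumed shape,
not verbatim upstream code):

/* With recycle == true, a pp-backed frag is returned to its pool even
 * when skb->pp_recycle is 0, instead of leaking through a bare
 * put_page(), which is what produced the "page_pool leak" splat above.
 */
static inline void __skb_frag_unref_sketch(skb_frag_t *frag, bool recycle)
{
	struct page *page = skb_frag_page(frag);

	if (recycle && napi_pp_put_page(page, false))
		return;		/* page handed back to its page_pool */
	put_page(page);		/* plain page ref drop */
}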