| Message ID | 168244294384.1741095.6037010854411310099.stgit@firesoul (mailing list archive) |
| --- | --- |
| State | Superseded |
| Delegated to: | Netdev Maintainers |
| Series | page_pool: new approach for leak detection and shutdown phase |
On Tue, Apr 25, 2023 at 07:15:43PM +0200, Jesper Dangaard Brouer wrote:
> Performance is very important for page pool (PP). This add the use of
> static_key APIs for regaining a single instruction, which makes the
> new PP shutdown scheme zero impact.
>
> We are uncertain if this is 100% correct, because static_key APIs uses
> a mutex lock and it is uncertain if all contexts that can return pages
> can support this. We could spawn a workqueue (like we just removed) to
> workaround this issue.

With debug atomic sleep the issue should be trivial to see.
iirc the callers of xdp_flush_frame_bulk() need to do it under an
rcu_read_lock equivalent, which is not sleepable, and mutexes should warn.
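Alexei's point is visible in the jump-label implementation itself: the decrement path takes sleeping locks before patching code. A rough pseudocode sketch of the call chain, paraphrased from my reading of kernel/jump_label.c (line-level details may differ by version):

```
/* Rough pseudocode of the call chain; paraphrased, not exact kernel code */
static_key_slow_dec(key)
    cpus_read_lock()            /* percpu_down_read(&cpu_hotplug_lock) */
        might_sleep()           /* -> DEBUG_ATOMIC_SLEEP splat in atomic ctx */
    __static_key_slow_dec_cpuslocked(key)
        jump_label_lock()       /* mutex_lock(&jump_label_mutex) */
        jump_label_update(key)  /* text_poke + IPI broadcast on x86 */
```

Since the page-return paths run in softirq context or under rcu_read_lock(), none of those locks may be taken there.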
On 26/04/2023 01.30, Alexei Starovoitov wrote:
> On Tue, Apr 25, 2023 at 07:15:43PM +0200, Jesper Dangaard Brouer wrote:
>> Performance is very important for page pool (PP). This add the use of
>> static_key APIs for regaining a single instruction, which makes the
>> new PP shutdown scheme zero impact.
>>
>> We are uncertain if this is 100% correct, because static_key APIs uses
>> a mutex lock and it is uncertain if all contexts that can return pages
>> can support this. We could spawn a workqueue (like we just removed) to
>> workaround this issue.
>
> With debug atomic sleep the issue should be trivial to see.
> iirc the callers of xdp_flush_frame_bulk() need to do it under rcu_read_lock equivalent,
> which is not sleepable and mutex-es should warn.

Sure, adding CONFIG_DEBUG_ATOMIC_SLEEP and (keeping) CONFIG_DEBUG_MUTEXES.

Testing this with the veth changes that added a page_pool in commit
0ebab78cbcbf ("net: veth: add page_pool for page recycling"), which makes
it easier to create/trigger inflight packets via softnet_data->defer_list.

This confirms my suspicion: the patch is not correct. The dmesg warning
looks like below.
(The XXX printk is just from my own debugging patch)

The call site net_rx_action+0xe7 is skb_defer_free_flush:

 $ ./scripts/faddr2line vmlinux net_rx_action+0xe7
 net_rx_action+0xe7/0x360:
 skb_defer_free_flush at net/core/dev.c:6615
 (inlined by) skb_defer_free_flush at net/core/dev.c:6601
 (inlined by) net_rx_action at net/core/dev.c:6677

[ 1540.073256] XXX page_pool_shutdown_attempt() inflight=24
[ 1540.078654] XXX page_pool_destroy() Enter into shutdown phase - inflight=24
[ 1540.094234] XXX page_pool_shutdown_attempt() inflight=23
[ 1540.099626] XXX page_pool_shutdown_attempt() inflight=22
[ 1540.105012] XXX page_pool_shutdown_attempt() inflight=21
[ 1540.110402] XXX page_pool_shutdown_attempt() inflight=20
[ 1540.115793] XXX page_pool_shutdown_attempt() inflight=19
[ 1540.121183] XXX page_pool_shutdown_attempt() inflight=18
[ 1540.126564] XXX page_pool_shutdown_attempt() inflight=17
[ 1540.131945] XXX page_pool_shutdown_attempt() inflight=16
[ 1540.137329] XXX page_pool_shutdown_attempt() inflight=15
[ 1540.142710] XXX page_pool_shutdown_attempt() inflight=14
[ 1540.148110] XXX page_pool_shutdown_attempt() inflight=13
[ 1540.153512] XXX page_pool_shutdown_attempt() inflight=12
[ 1540.312284] XXX page_pool_shutdown_attempt() inflight=11
[ 1540.317712] XXX page_pool_shutdown_attempt() inflight=10
[ 1540.323128] XXX page_pool_shutdown_attempt() inflight=9
[ 1540.328459] XXX page_pool_shutdown_attempt() inflight=8
[ 1540.333788] XXX page_pool_shutdown_attempt() inflight=7
[ 1540.339111] XXX page_pool_shutdown_attempt() inflight=6
[ 1540.344439] XXX page_pool_shutdown_attempt() inflight=5
[ 1540.349768] XXX page_pool_shutdown_attempt() inflight=4
[ 1540.355091] XXX page_pool_shutdown_attempt() inflight=3
[ 1540.360420] XXX page_pool_shutdown_attempt() inflight=2
[ 1540.365741] XXX page_pool_shutdown_attempt() inflight=1
[ 1540.371064] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[ 1540.380052] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 26501, name: netperf
[ 1540.388171] preempt_count: 101, expected: 0
[ 1540.392452] RCU nest depth: 2, expected: 0
[ 1540.396640] CPU: 2 PID: 26501 Comm: netperf Not tainted 6.3.0-rc7-net-next-pp-shutdown-vm-lock-dbg+ #67
[ 1540.406147] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[ 1540.413749] Call Trace:
[ 1540.416289]  <IRQ>
[ 1540.418395]  dump_stack_lvl+0x32/0x50
[ 1540.422160]  __might_resched+0x11c/0x160
[ 1540.426182]  cpus_read_lock+0x16/0x60
[ 1540.429942]  static_key_slow_dec+0x17/0x50
[ 1540.434137]  page_pool_shutdown_attempt+0x50/0x60
[ 1540.438940]  page_pool_return_skb_page+0x68/0xe0
[ 1540.443654]  skb_free_head+0x4f/0x90
[ 1540.447326]  skb_release_data+0x142/0x1a0
[ 1540.451435]  napi_consume_skb+0x6b/0x180
[ 1540.455448]  net_rx_action+0xe7/0x360
[ 1540.459209]  __do_softirq+0xcb/0x2b1
[ 1540.462885]  do_softirq+0x63/0x90
[ 1540.466297]  </IRQ>
[ 1540.468487]  <TASK>
[ 1540.470671]  __local_bh_enable_ip+0x64/0x70
[ 1540.474948]  __dev_queue_xmit+0x257/0x8a0
[ 1540.479049]  ? __ip_local_out+0x48/0x160
[ 1540.483070]  ip_finish_output2+0x25c/0x540
[ 1540.487255]  __ip_queue_xmit+0x171/0x420
[ 1540.491266]  ? ___slab_alloc+0xd1/0x670
[ 1540.495202]  __tcp_transmit_skb+0x83c/0x950
[ 1540.499486]  tcp_write_xmit+0x373/0xa60
[ 1540.503419]  __tcp_push_pending_frames+0x32/0xf0
[ 1540.508125]  tcp_sendmsg_locked+0x3a0/0xa10
[ 1540.512408]  tcp_sendmsg+0x27/0x40
[ 1540.515907]  sock_sendmsg+0x8b/0xa0
[ 1540.519494]  ? sockfd_lookup_light+0x12/0x70
[ 1540.523856]  __sys_sendto+0xeb/0x130
[ 1540.527531]  ? __switch_to_asm+0x3a/0x80
[ 1540.531549]  ? rseq_get_rseq_cs+0x32/0x290
[ 1540.535736]  ? rseq_ip_fixup+0x14f/0x1e0
[ 1540.539751]  __x64_sys_sendto+0x20/0x30
[ 1540.543686]  do_syscall_64+0x3a/0x90
[ 1540.547357]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1540.552502] RIP: 0033:0x7f389f713680
[ 1540.556174] Code: 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 68 c3 0f 1f 80 00 00 00 00 55 48 83 ec 20 48
[ 1540.575075] RSP: 002b:00007ffcb4d4a538 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 1540.582755] RAX: ffffffffffffffda RBX: 000000000063fd28 RCX: 00007f389f713680
[ 1540.589983] RDX: 0000000000004000 RSI: 0000000001ac3c40 RDI: 0000000000000004
[ 1540.597208] RBP: 00007ffcb4d4a570 R08: 0000000000000000 R09: 0000000000000000
[ 1540.604438] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000063fcf8
[ 1540.611664] R13: 00007ffcb4d4a910 R14: 0000000000000000 R15: 00007f389fa87000
[ 1540.618896]  </TASK>
[ 1540.621199] ------------[ cut here ]------------
[ 1540.625911] WARNING: CPU: 2 PID: 26501 at kernel/smp.c:912 smp_call_function_many_cond+0x2c1/0x2e0
[ 1540.635001] Modules linked in: veth nf_defrag_ipv6 nf_defrag_ipv4 ib_umad rdma_ucm ib_ipoib rdma_cm iw_cm irdma ib_cm ice intel_uncore_frequency intel_uncore_frequency_common coretemp kvm_intel mlx5_ib kvm ib_uverbs irqbypass rapl intel_cstate ib_core i2c_i801 intel_uncore pcspkr i2c_smbus bfq acpi_ipmi wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel drm fuse zram zsmalloc mlx5_core ixgbe i40e igb igc sd_mod t10_pi mlxfw psample hid_generic crc64_rocksoft_generic i2c_algo_bit crc64_rocksoft i2c_core mdio ptp pps_core [last unloaded: bpfilter]
[ 1540.684653] CPU: 2 PID: 26501 Comm: netperf Tainted: G W 6.3.0-rc7-net-next-pp-shutdown-vm-lock-dbg+ #67
[ 1540.695642] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[ 1540.703239] RIP: 0010:smp_call_function_many_cond+0x2c1/0x2e0
[ 1540.709090] Code: fe ff ff 8b 44 24 14 48 c7 c6 a0 04 fc 82 8b 15 85 f4 e6 01 48 8b 3c 24 8d 48 01 48 63 c9 e8 36 c5 4b 00 89 c2 e9 c5 fd ff ff <0f> 0b e9 86 fd ff ff 8b 7c 24 28 e8 4f b8 f7 ff eb af 66 66 2e 0f
[ 1540.727999] RSP: 0018:ffffc9000003cda8 EFLAGS: 00010206
[ 1540.733322] RAX: 0000000000000102 RBX: ffffffff8383b2ea RCX: 0000000000000003
[ 1540.740552] RDX: 0000000000000000 RSI: ffffffff810271a0 RDI: ffffffff82fc04a0
[ 1540.747782] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 1540.755008] R10: 000000087c9e9000 R11: ffffffff82c4f620 R12: 0000000000000000
[ 1540.762241] R13: ffffffff8383b300 R14: 0000000000000002 R15: 0000000000000001
[ 1540.769474] FS: 00007f389f954740(0000) GS:ffff88887fc80000(0000) knlGS:0000000000000000
[ 1540.777685] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1540.783528] CR2: 00007f1e27582000 CR3: 000000015775c004 CR4: 00000000003706e0
[ 1540.790761] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1540.797987] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1540.805220] Call Trace:
[ 1540.807763]  <IRQ>
[ 1540.809868]  ? __pfx_do_sync_core+0x10/0x10
[ 1540.814146]  on_each_cpu_cond_mask+0x20/0x40
[ 1540.818514]  text_poke_bp_batch+0x9e/0x250
[ 1540.822708]  text_poke_finish+0x1b/0x30
[ 1540.826644]  arch_jump_label_transform_apply+0x16/0x30
[ 1540.831880]  __static_key_slow_dec_cpuslocked.part.0+0x2f/0x40
[ 1540.837809]  static_key_slow_dec+0x30/0x50
[ 1540.841999]  page_pool_shutdown_attempt+0x50/0x60
[ 1540.846803]  page_pool_return_skb_page+0x68/0xe0
[ 1540.851517]  skb_free_head+0x4f/0x90
[ 1540.855194]  skb_release_data+0x142/0x1a0
[ 1540.859300]  napi_consume_skb+0x6b/0x180
[ 1540.863320]  net_rx_action+0xe7/0x360
[ 1540.867082]  __do_softirq+0xcb/0x2b1
[ 1540.870756]  do_softirq+0x63/0x90
[ 1540.874171]  </IRQ>
[ 1540.876359]  <TASK>
[ 1540.878553]  __local_bh_enable_ip+0x64/0x70
[ 1540.882829]  __dev_queue_xmit+0x257/0x8a0
[ 1540.886938]  ? __ip_local_out+0x48/0x160
[ 1540.890959]  ip_finish_output2+0x25c/0x540
[ 1540.895155]  __ip_queue_xmit+0x171/0x420
[ 1540.899175]  ? ___slab_alloc+0xd1/0x670
[ 1540.903108]  __tcp_transmit_skb+0x83c/0x950
[ 1540.907393]  tcp_write_xmit+0x373/0xa60
[ 1540.911327]  __tcp_push_pending_frames+0x32/0xf0
[ 1540.916041]  tcp_sendmsg_locked+0x3a0/0xa10
[ 1540.920324]  tcp_sendmsg+0x27/0x40
[ 1540.923821]  sock_sendmsg+0x8b/0xa0
[ 1540.927411]  ? sockfd_lookup_light+0x12/0x70
[ 1540.931780]  __sys_sendto+0xeb/0x130
[ 1540.935456]  ? __switch_to_asm+0x3a/0x80
[ 1540.939477]  ? rseq_get_rseq_cs+0x32/0x290
[ 1540.943670]  ? rseq_ip_fixup+0x14f/0x1e0
[ 1540.947693]  __x64_sys_sendto+0x20/0x30
[ 1540.951626]  do_syscall_64+0x3a/0x90
[ 1540.955300]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1540.960451] RIP: 0033:0x7f389f713680
[ 1540.964124] Code: 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 68 c3 0f 1f 80 00 00 00 00 55 48 83 ec 20 48
[ 1540.983035] RSP: 002b:00007ffcb4d4a538 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 1540.990724] RAX: ffffffffffffffda RBX: 000000000063fd28 RCX: 00007f389f713680
[ 1540.997957] RDX: 0000000000004000 RSI: 0000000001ac3c40 RDI: 0000000000000004
[ 1541.005187] RBP: 00007ffcb4d4a570 R08: 0000000000000000 R09: 0000000000000000
[ 1541.012422] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000063fcf8
[ 1541.019649] R13: 00007ffcb4d4a910 R14: 0000000000000000 R15: 00007f389fa87000
[ 1541.026881]  </TASK>
[ 1541.029166] ---[ end trace 0000000000000000 ]---
[ 1541.033925] XXX page_pool_shutdown_attempt() inflight=0
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ce7e8dda6403..3821d8874b15 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -25,6 +25,8 @@
 
 #define BIAS_MAX	LONG_MAX
 
+DEFINE_STATIC_KEY_FALSE(pp_shutdown_phase);
+
 #ifdef CONFIG_PAGE_POOL_STATS
 /* alloc_stat_inc is intended to be used in softirq context */
 #define alloc_stat_inc(pool, __stat)	(pool->alloc_stats.__stat++)
@@ -378,7 +380,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	int i, nr_pages;
 
 	/* API usage BUG: PP in shutdown phase, cannot alloc new pages */
-	if (WARN_ON(pool->p.flags & PP_FLAG_SHUTDOWN))
+	if (static_key_enabled(&pp_shutdown_phase) &&
+	    WARN_ON(pool->p.flags & PP_FLAG_SHUTDOWN))
 		return NULL;
 
 	/* Don't support bulk alloc for high-order pages */
@@ -609,7 +612,7 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, page);
 	}
-	if (pool->p.flags & PP_FLAG_SHUTDOWN)
+	if (static_branch_unlikely(&pp_shutdown_phase))
 		page_pool_shutdown_attempt(pool);
 }
 EXPORT_SYMBOL(page_pool_put_defragged_page);
@@ -659,7 +662,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 		page_pool_return_page(pool, data[i]);
 
 out:
-	if (pool->p.flags & PP_FLAG_SHUTDOWN)
+	if (static_branch_unlikely(&pp_shutdown_phase))
 		page_pool_shutdown_attempt(pool);
 }
 EXPORT_SYMBOL(page_pool_put_page_bulk);
@@ -817,7 +820,15 @@ static int page_pool_release(struct page_pool *pool)
 noinline
 static void page_pool_shutdown_attempt(struct page_pool *pool)
 {
-	page_pool_release(pool);
+	int inflight;
+
+	if (!(pool->p.flags & PP_FLAG_SHUTDOWN))
+		return;
+
+	inflight = page_pool_release(pool);
+
+	if (static_key_enabled(&pp_shutdown_phase) && !inflight)
+		static_branch_dec(&pp_shutdown_phase);
 }
 
 void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
@@ -861,6 +872,7 @@ void page_pool_destroy(struct page_pool *pool)
 	 * Enter into shutdown phase, and retry release to handle races.
 	 */
 	pool->p.flags |= PP_FLAG_SHUTDOWN;
+	static_branch_inc(&pp_shutdown_phase);
 	page_pool_shutdown_attempt(pool);
 }
 EXPORT_SYMBOL(page_pool_destroy);
Performance is very important for page pool (PP). This adds the use of
static_key APIs to regain a single instruction, which makes the new PP
shutdown scheme zero impact.

We are uncertain if this is 100% correct, because the static_key APIs use
a mutex lock, and it is uncertain if all contexts that can return pages
can support this. We could spawn a workqueue (like the one we just
removed) to work around this issue. Seeking input on whether this is
worth the complexity.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
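For reference, the workqueue workaround mentioned above could look roughly like this (hypothetical kernel-style sketch, not from the series; a real version would need one work item per pool and care about module unload and pending work at exit):

```
/* Hypothetical sketch: defer the sleeping static_branch_dec() to
 * process context, so page-return paths running in softirq/RCU
 * context never touch the jump-label mutex or cpus_read_lock().
 */
static void pp_shutdown_dec_workfn(struct work_struct *work)
{
	static_branch_dec(&pp_shutdown_phase);	/* may sleep: fine here */
}
static DECLARE_WORK(pp_shutdown_dec_work, pp_shutdown_dec_workfn);

static void page_pool_shutdown_attempt(struct page_pool *pool)
{
	...
	if (static_key_enabled(&pp_shutdown_phase) && !inflight)
		schedule_work(&pp_shutdown_dec_work);	/* was: static_branch_dec() */
}
```

iirc there is also existing deferred-key machinery (struct static_key_deferred and static_key_slow_dec_deferred() in include/linux/jump_label_ratelimit.h) that pushes the decrement to a workqueue, though it is aimed at rate-limiting toggles rather than at making the call site atomic-safe.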