Message ID | 8e3383d0bacd084f0e33d9158d24bd411f1bf6ba.1684796705.git.peilin.ye@bytedance.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net/sched: Fixes for sch_ingress and sch_clsact | expand |
On 22/05/2023 20:55, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc [...]
Hi Peilin!
With V2 patches 5 and 6 applied I was still able to trigger an oops.
Branch is 'net' + patches 5 & 6:
145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
clsact Qdiscs before grafting
1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
allocation
Kernel config is the same as in the syzbot report.
Note that this was on a _single core_ VM.
I will double check if v1 is triggering this issue (basically run the
repro for a long time). For multi-core my VM is running OOM even on a
32Gb system. I will check if we have a spare server to run the repro.
[ 695.782780][T12033]
==================================================================
[ 695.783617][T12033] BUG: KASAN: slab-use-after-free in
mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.784323][T12033] Write of size 8 at addr ffff888060cafb08 by task
repro/12033
[ 695.784996][T12033]
[ 695.785210][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted
6.4.0-rc2-00187-g145f639b9403 #1
[ 695.785981][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 695.786883][T12033] Call Trace:
[ 695.787178][T12033] <TASK>
[ 695.787444][T12033] dump_stack_lvl+0xd9/0x1b0
[ 695.787871][T12033] print_report+0xc4/0x5f0
[ 695.788283][T12033] ? __virt_addr_valid+0x5e/0x2d0
[ 695.788736][T12033] ? __phys_addr+0xc6/0x140
[ 695.789138][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] kasan_report+0xc0/0xf0
[ 695.789604][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] ? ingress_init+0x1c0/0x1c0
[ 695.789604][T12033] tcf_chain0_head_change.isra.0+0xb9/0x120
[ 695.789604][T12033] tc_new_tfilter+0x1ebb/0x22b0
[ 695.789604][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 695.789604][T12033] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 695.789604][T12033] ? kasan_quarantine_put+0x102/0x230
[ 695.789604][T12033] ? lockdep_hardirqs_on+0x7d/0x100
[ 695.789604][T12033] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? bpf_lsm_capable+0x9/0x10
[ 695.789604][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] ? rtnl_getlink+0xb10/0xb10
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? netdev_core_pick_tx+0x390/0x390
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] ? rtnl_getlink+0xb10/0xb10
[ 695.789604][T12033] ? netlink_ack+0x1370/0x1370
[ 695.789604][T12033] ? kasan_set_track+0x25/0x30
[ 695.789604][T12033] ? netlink_deliver_tap+0x1b1/0xd00
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] ? netlink_attachskb+0x880/0x880
[ 695.789604][T12033] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 695.789604][T12033] ? __phys_addr_symbol+0x30/0x70
[ 695.789604][T12033] ? __check_object_size+0x323/0x740
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] ? netlink_unicast+0x800/0x800
[ 695.789604][T12033] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 695.789604][T12033] ? netlink_unicast+0x800/0x800
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ? kernel_sendmsg+0x50/0x50
[ 695.789604][T12033] ? __copy_msghdr+0x460/0x460
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] ? do_recvmmsg+0x700/0x700
[ 695.789604][T12033] ? find_held_lock+0x2d/0x110
[ 695.789604][T12033] ? __might_fault+0xe5/0x190
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? rcu_is_watching+0x12/0xb0
[ 695.789604][T12033] ? xfd_validate_state+0x5d/0x180
[ 695.789604][T12033] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 695.789604][T12033] ? unlock_page_memcg+0x2d0/0x2d0
[ 695.789604][T12033] ? do_futex+0x350/0x350
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] ? syscall_enter_from_user_mode+0x26/0x80
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033] RIP: 0033:0x7f4aca44a89d
[ 695.789604][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 695.789604][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 695.789604][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX:
00007f4aca44a89d
[ 695.789604][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 695.789604][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09:
0000000000000000
[ 695.789604][T12033] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 695.789604][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15:
00007f4aca2cf000
[ 695.789604][T12033] </TASK>
[ 695.789604][T12033]
[ 695.789604][T12033] Allocated by task 12031:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] kasan_set_track+0x25/0x30
[ 695.789604][T12033] __kasan_kmalloc+0xa2/0xb0
[ 695.789604][T12033] __kmalloc_node+0x60/0x100
[ 695.789604][T12033] qdisc_alloc+0xb3/0xa90
[ 695.789604][T12033] qdisc_create+0xcf/0x1020
[ 695.789604][T12033] tc_modify_qdisc+0x495/0x1ab0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x439/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033]
[ 695.789604][T12033] Freed by task 15:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] kasan_set_track+0x25/0x30
[ 695.789604][T12033] kasan_save_free_info+0x2e/0x40
[ 695.789604][T12033] ____kasan_slab_free+0x15e/0x1b0
[ 695.789604][T12033] slab_free_freelist_hook+0x10b/0x1e0
[ 695.789604][T12033] __kmem_cache_free+0xaf/0x2e0
[ 695.789604][T12033] rcu_core+0x7f7/0x1ac0
[ 695.789604][T12033] __do_softirq+0x1d8/0x8fd
[ 695.789604][T12033]
[ 695.789604][T12033] Last potentially related work creation:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] __kasan_record_aux_stack+0xbf/0xd0
[ 695.789604][T12033] __call_rcu_common.constprop.0+0x9a/0x790
[ 695.789604][T12033] qdisc_put_unlocked+0x74/0x90
[ 695.789604][T12033] tcf_block_release+0x90/0xa0
[ 695.789604][T12033] tc_new_tfilter+0xa5e/0x22b0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033]
[ 695.789604][T12033] Second to last potentially related work creation:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] __kasan_record_aux_stack+0xbf/0xd0
[ 695.789604][T12033] __call_rcu_common.constprop.0+0x9a/0x790
[ 695.789604][T12033] rht_deferred_worker+0x10fd/0x2010
[ 695.789604][T12033] process_one_work+0x9f9/0x15f0
[ 695.789604][T12033] worker_thread+0x687/0x1110
[ 695.789604][T12033] kthread+0x334/0x430
[ 695.789604][T12033] ret_from_fork+0x1f/0x30
[ 695.789604][T12033]
[ 695.789604][T12033] The buggy address belongs to the object at
ffff888060caf800
[ 695.789604][T12033] which belongs to the cache kmalloc-1k of size 1024
[ 695.789604][T12033] The buggy address is located 776 bytes inside of
[ 695.789604][T12033] freed 1024-byte region [ffff888060caf800,
ffff888060cafc00)
[ 695.789604][T12033]
[ 695.789604][T12033] The buggy address belongs to the physical page:
[ 695.789604][T12033] page:ffffea0001832b00 refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x60cac
[ 695.789604][T12033] head:ffffea0001832b00 order:2 entire_mapcount:0
nr_pages_mapped:0 pincount:0
[ 695.789604][T12033] flags:
0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[ 695.789604][T12033] page_type: 0xffffffff()
[ 695.789604][T12033] raw: 00fff00000010200 ffff888012441dc0
ffffea0000bac600 dead000000000002
[ 695.789604][T12033] raw: 0000000000000000 0000000000080008
00000001ffffffff 0000000000000000
[ 695.789604][T12033] page dumped because: kasan: bad access detected
[ 695.789604][T12033] page_owner tracks the page as allocated
[ 695.789604][T12033] page last allocated via order 2, migratetype
Unmovable, gfp_mask
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC),
pid 13324, tgid 13318 (repro), ts 170262603079, free_ts 0
[ 695.789604][T12033] get_page_from_freelist+0xe71/0x2e80
[ 695.789604][T12033] __alloc_pages+0x1c8/0x4a0
[ 695.789604][T12033] alloc_pages+0x1a9/0x270
[ 695.789604][T12033] allocate_slab+0x24e/0x380
[ 695.789604][T12033] ___slab_alloc+0x89a/0x1400
[ 695.789604][T12033] __slab_alloc.constprop.0+0x56/0xa0
[ 695.789604][T12033] __kmem_cache_alloc_node+0x126/0x330
[ 695.789604][T12033] kmalloc_trace+0x25/0xe0
[ 695.789604][T12033] fl_change+0x1b3/0x51e0
[ 695.789604][T12033] tc_new_tfilter+0x992/0x22b0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] page_owner free stack trace missing
[ 695.789604][T12033]
[ 695.789604][T12033] Memory state around the buggy address:
[ 695.789604][T12033] ffff888060cafa00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ffff888060cafa80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] >ffff888060cafb00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ^
[ 695.789604][T12033] ffff888060cafb80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ffff888060cafc00: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 695.789604][T12033]
==================================================================
[ 695.996261][T12042] __nla_validate_parse: 32 callbacks suppressed
[ 695.996271][T12042] netlink: 24 bytes leftover after parsing
attributes in process `repro'.
[ 696.473670][T12046] netlink: 24 bytes leftover after parsing
attributes in process `repro'.
[ 696.660496][T12033] Kernel panic - not syncing: KASAN: panic_on_warn
set ...
[ 696.661250][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted
6.4.0-rc2-00187-g145f639b9403 #1
[ 696.661947][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 696.662768][T12033] Call Trace:
[ 696.663031][T12033] <TASK>
[ 696.663268][T12033] dump_stack_lvl+0xd9/0x1b0
[ 696.663659][T12033] panic+0x689/0x730
[ 696.663977][T12033] ? panic_smp_self_stop+0xa0/0xa0
[ 696.664396][T12033] ? preempt_schedule_thunk+0x1a/0x20
[ 696.664829][T12033] ? preempt_schedule_common+0x45/0xc0
[ 696.665263][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.665698][T12033] check_panic_on_warn+0xab/0xb0
[ 696.666087][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.666519][T12033] end_report+0xe9/0x120
[ 696.666861][T12033] kasan_report+0xcd/0xf0
[ 696.667209][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.667639][T12033] mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.668064][T12033] ? ingress_init+0x1c0/0x1c0
[ 696.668451][T12033] tcf_chain0_head_change.isra.0+0xb9/0x120
[ 696.668964][T12033] tc_new_tfilter+0x1ebb/0x22b0
[ 696.669391][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 696.669608][T12033] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 696.669608][T12033] ? kasan_quarantine_put+0x102/0x230
[ 696.669608][T12033] ? lockdep_hardirqs_on+0x7d/0x100
[ 696.669608][T12033] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? bpf_lsm_capable+0x9/0x10
[ 696.669608][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 696.669608][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 696.669608][T12033] ? rtnl_getlink+0xb10/0xb10
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? netdev_core_pick_tx+0x390/0x390
[ 696.669608][T12033] netlink_rcv_skb+0x166/0x440
[ 696.669608][T12033] ? rtnl_getlink+0xb10/0xb10
[ 696.669608][T12033] ? netlink_ack+0x1370/0x1370
[ 696.669608][T12033] ? kasan_set_track+0x25/0x30
[ 696.669608][T12033] ? netlink_deliver_tap+0x1b1/0xd00
[ 696.669608][T12033] netlink_unicast+0x530/0x800
[ 696.669608][T12033] ? netlink_attachskb+0x880/0x880
[ 696.669608][T12033] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 696.669608][T12033] ? __phys_addr_symbol+0x30/0x70
[ 696.669608][T12033] ? __check_object_size+0x323/0x740
[ 696.669608][T12033] netlink_sendmsg+0x90b/0xe10
[ 696.669608][T12033] ? netlink_unicast+0x800/0x800
[ 696.669608][T12033] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 696.669608][T12033] ? netlink_unicast+0x800/0x800
[ 696.669608][T12033] sock_sendmsg+0xd9/0x180
[ 696.669608][T12033] ____sys_sendmsg+0x264/0x910
[ 696.669608][T12033] ? kernel_sendmsg+0x50/0x50
[ 696.669608][T12033] ? __copy_msghdr+0x460/0x460
[ 696.669608][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 696.669608][T12033] ? do_recvmmsg+0x700/0x700
[ 696.669608][T12033] ? find_held_lock+0x2d/0x110
[ 696.669608][T12033] ? __might_fault+0xe5/0x190
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] __sys_sendmmsg+0x18e/0x430
[ 696.669608][T12033] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? rcu_is_watching+0x12/0xb0
[ 696.669608][T12033] ? xfd_validate_state+0x5d/0x180
[ 696.669608][T12033] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 696.669608][T12033] ? unlock_page_memcg+0x2d0/0x2d0
[ 696.669608][T12033] ? do_futex+0x350/0x350
[ 696.669608][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 696.669608][T12033] ? syscall_enter_from_user_mode+0x26/0x80
[ 696.669608][T12033] do_syscall_64+0x38/0xb0
[ 696.669608][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 696.669608][T12033] RIP: 0033:0x7f4aca44a89d
[ 696.669608][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 696.669608][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 696.669608][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX:
00007f4aca44a89d
[ 696.669608][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 696.669608][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09:
0000000000000000
[ 696.669608][T12033] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 696.669608][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15:
00007f4aca2cf000
[ 696.669608][T12033] </TASK>
[ 696.669608][T12033] Kernel Offset: disabled
[ 696.669608][T12033] Rebooting in 86400 seconds..
Hi Pedro, On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote: > With V2 patches 5 and 6 applied I was still able to trigger an oops. > > Branch is 'net' + patches 5 & 6: > 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and > clsact Qdiscs before grafting > 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs > 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file > allocation > > Kernel config is the same as in the syzbot report. > Note that this was on a _single core_ VM. > I will double check if v1 is triggering this issue (basically run the repro > for a long time). For multi-core my VM is running OOM even on a 32Gb system. > I will check if we have a spare server to run the repro. Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the "!ingress" path in qdisc_graft(). I think that's why you are still seeing the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible after [1,2/6], and I think we'll need a different reproducer for [5,6/6]. However I just noticed that for some reason my git-send-email in my new setup didn't auto-generate From: tags with my work email, so Author: will be my personal email (I have to send patches from personal email to avoid "[External] " subject prefixes) ... I will fix it in v3 soon. Sorry in advance for spamming. Thanks, Peilin Ye
On 23/05/2023 01:40, Peilin Ye wrote: > Hi Pedro, > > On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote: >> With V2 patches 5 and 6 applied I was still able to trigger an oops. >> >> Branch is 'net' + patches 5 & 6: >> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and >> clsact Qdiscs before grafting >> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs >> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file >> allocation >> >> Kernel config is the same as in the syzbot report. >> Note that this was on a _single core_ VM. >> I will double check if v1 is triggering this issue (basically run the repro >> for a long time). For multi-core my VM is running OOM even on a 32Gb system. >> I will check if we have a spare server to run the repro. > > Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs > under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the > "!ingress" path in qdisc_graft(). I think that's why you are still seeing > the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible > after [1,2/6], and I think we'll need a different reproducer for [5,6/6]. > > However I just noticed that for some reason my git-send-email in my new > setup didn't auto-generate From: tags with my work email, so Author: will > be my personal email (I have to send patches from personal email to avoid > "[External] " subject prefixes) ... I will fix it in v3 soon. Sorry in > advance for spamming. > > Thanks, > Peilin Ye > I see, We need to make sure then, when the time comes, that all the required patches are back ported in the same bundle so we don't have a partial fix; given that they target different commit tags. I was still able to trigger an oops with the full patchset: [ 104.944353][ T6588] ------------[ cut here ]------------ [ 104.944896][ T6588] jump label: negative count! [ 104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at kernel/jump_label.c:263 static_key_slow_try_dec+0xf2/0x110 [ 104.946795][ T6588] Modules linked in: [ 104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 6.4.0-rc2-00191-g4a3f9100193d #3 [ 104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 [ 104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110 [ 104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f [ 104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286 [ 104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 0000000000000000 [ 104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 0000000000000001 [ 104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 0000000000000000 [ 104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff [ 104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: ffffffff8e7b0680 [ 104.954746][ T6588] FS: 00007f76c65d56c0(0000) GS:ffff8881f5a00000(0000) knlGS:0000000000000000 [ 104.955430][ T6588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4: 0000000000350ef0 [ 104.956559][ T6588] Call Trace: [ 104.956829][ T6588] <TASK> [ 104.957062][ T6588] ? clsact_egress_block_get+0x40/0x40 [ 104.957507][ T6588] static_key_slow_dec+0x60/0xc0 [ 104.957906][ T6588] qdisc_create+0xa45/0x1090 [ 104.958274][ T6588] ? tc_get_qdisc+0xb70/0xb70 [ 104.958646][ T6588] tc_modify_qdisc+0x491/0x1b70 [ 104.959031][ T6588] ? qdisc_create+0x1090/0x1090 [ 104.959420][ T6588] ? bpf_lsm_capable+0x9/0x10 [ 104.959797][ T6588] ? qdisc_create+0x1090/0x1090 [ 104.960175][ T6588] rtnetlink_rcv_msg+0x439/0xd30 [ 104.960625][ T6588] ? rtnl_getlink+0xb10/0xb10 [ 104.960995][ T6588] ? __x64_sys_sendmmsg+0x9c/0x100 [ 104.961397][ T6588] ? do_syscall_64+0x38/0xb0 [ 104.961767][ T6588] ? netdev_core_pick_tx+0x390/0x390 [ 104.962195][ T6588] netlink_rcv_skb+0x166/0x440 [ 104.962584][ T6588] ? rtnl_getlink+0xb10/0xb10 [ 104.962954][ T6588] ? netlink_ack+0x1370/0x1370 [ 104.963336][ T6588] ? kasan_set_track+0x25/0x30 [ 104.963731][ T6588] ? netlink_deliver_tap+0x1b1/0xd00 [ 104.964156][ T6588] netlink_unicast+0x530/0x800 [ 104.964537][ T6588] ? netlink_attachskb+0x880/0x880 [ 104.964951][ T6588] ? __sanitizer_cov_trace_switch+0x54/0x90 [ 104.965405][ T6588] ? __phys_addr_symbol+0x30/0x70 [ 104.965793][ T6588] ? __check_object_size+0x323/0x740 [ 104.966205][ T6588] netlink_sendmsg+0x90b/0xe10 [ 104.966577][ T6588] ? netlink_unicast+0x800/0x800 [ 104.966965][ T6588] ? bpf_lsm_socket_sendmsg+0x9/0x10 [ 104.967374][ T6588] ? netlink_unicast+0x800/0x800 [ 104.967761][ T6588] sock_sendmsg+0xd9/0x180 [ 104.968109][ T6588] ____sys_sendmsg+0x264/0x910 [ 104.968490][ T6588] ? kernel_sendmsg+0x50/0x50 [ 104.968871][ T6588] ? __copy_msghdr+0x460/0x460 [ 104.969244][ T6588] ? lockdep_hardirqs_on_prepare+0x410/0x410 [ 104.969716][ T6588] ? find_held_lock+0x2d/0x110 [ 104.970088][ T6588] ___sys_sendmsg+0x11d/0x1b0 [ 104.970491][ T6588] ? do_recvmmsg+0x700/0x700 [ 104.970853][ T6588] ? __fget_files+0x260/0x420 [ 104.971221][ T6588] ? reacquire_held_locks+0x4b0/0x4b0 [ 104.971649][ T6588] ? __fget_files+0x282/0x420 [ 104.972018][ T6588] ? __fget_light+0xe6/0x270 [ 104.972384][ T6588] __sys_sendmmsg+0x18e/0x430 [ 104.972766][ T6588] ? __ia32_sys_sendmsg+0xb0/0xb0 [ 104.973167][ T6588] ? reacquire_held_locks+0x4b0/0x4b0 [ 104.973588][ T6588] ? rcu_is_watching+0x12/0xb0 [ 104.973958][ T6588] ? xfd_validate_state+0x5d/0x180 [ 104.974355][ T6588] ? restore_fpregs_from_fpstate+0xc1/0x1d0 [ 104.974796][ T6588] ? unlock_page_memcg+0x2d0/0x2d0 [ 104.975178][ T6588] ? do_futex+0x350/0x350 [ 104.975501][ T6588] __x64_sys_sendmmsg+0x9c/0x100 [ 104.975867][ T6588] ? syscall_enter_from_user_mode+0x26/0x80 [ 104.976307][ T6588] do_syscall_64+0x38/0xb0 [ 104.976649][ T6588] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 104.977088][ T6588] RIP: 0033:0x7f76c66ee89d [ 104.977417][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48 [ 104.978835][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203 ORIG_RAX: 0000000000000133 [ 104.979468][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX: 00007f76c66ee89d [ 104.980073][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007 [ 104.980739][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09: 0000000000000000 [ 104.981352][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12: fffffffffffffeb8 [ 104.981968][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15: 00007f76c65b5000 [ 104.982587][ T6588] </TASK> [ 104.982833][ T6588] Kernel panic - not syncing: kernel: panic_on_warn set ... [ 104.983391][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted 6.4.0-rc2-00191-g4a3f9100193d #3 [ 104.984054][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 [ 104.984854][ T6588] Call Trace: [ 104.985121][ T6588] <TASK> [ 104.985359][ T6588] dump_stack_lvl+0xd9/0x1b0 [ 104.985733][ T6588] panic+0x689/0x730 [ 104.986053][ T6588] ? panic_smp_self_stop+0xa0/0xa0 [ 104.986467][ T6588] ? show_trace_log_lvl+0x28e/0x400 [ 104.986888][ T6588] ? static_key_slow_try_dec+0xf2/0x110 [ 104.987329][ T6588] check_panic_on_warn+0xab/0xb0 [ 104.987729][ T6588] __warn+0xf2/0x380 [ 104.988046][ T6588] ? static_key_slow_try_dec+0xf2/0x110 [ 104.988476][ T6588] report_bug+0x3bc/0x580 [ 104.988838][ T6588] handle_bug+0x3c/0x70 [ 104.989169][ T6588] exc_invalid_op+0x17/0x40 [ 104.989529][ T6588] asm_exc_invalid_op+0x1a/0x20 [ 104.989914][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110 [ 104.990402][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f [ 104.990702][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286 [ 104.990702][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: 0000000000000000 [ 104.990702][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: 0000000000000001 [ 104.990702][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: 0000000000000000 [ 104.990702][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff [ 104.990702][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: ffffffff8e7b0680 [ 104.990702][ T6588] ? __warn_printk+0x187/0x310 [ 104.990702][ T6588] ? clsact_egress_block_get+0x40/0x40 [ 104.990702][ T6588] static_key_slow_dec+0x60/0xc0 [ 104.990702][ T6588] qdisc_create+0xa45/0x1090 [ 104.990702][ T6588] ? tc_get_qdisc+0xb70/0xb70 [ 104.990702][ T6588] tc_modify_qdisc+0x491/0x1b70 [ 104.990702][ T6588] ? qdisc_create+0x1090/0x1090 [ 104.990702][ T6588] ? bpf_lsm_capable+0x9/0x10 [ 104.990702][ T6588] ? qdisc_create+0x1090/0x1090 [ 104.990702][ T6588] rtnetlink_rcv_msg+0x439/0xd30 [ 104.990702][ T6588] ? rtnl_getlink+0xb10/0xb10 [ 104.990702][ T6588] ? __x64_sys_sendmmsg+0x9c/0x100 [ 104.990702][ T6588] ? do_syscall_64+0x38/0xb0 [ 104.990702][ T6588] ? netdev_core_pick_tx+0x390/0x390 [ 104.990702][ T6588] netlink_rcv_skb+0x166/0x440 [ 104.990702][ T6588] ? rtnl_getlink+0xb10/0xb10 [ 104.990702][ T6588] ? netlink_ack+0x1370/0x1370 [ 104.990702][ T6588] ? kasan_set_track+0x25/0x30 [ 104.990702][ T6588] ? netlink_deliver_tap+0x1b1/0xd00 [ 104.990702][ T6588] netlink_unicast+0x530/0x800 [ 104.990702][ T6588] ? netlink_attachskb+0x880/0x880 [ 104.990702][ T6588] ? __sanitizer_cov_trace_switch+0x54/0x90 [ 104.990702][ T6588] ? __phys_addr_symbol+0x30/0x70 [ 104.990702][ T6588] ? __check_object_size+0x323/0x740 [ 104.990702][ T6588] netlink_sendmsg+0x90b/0xe10 [ 104.990702][ T6588] ? netlink_unicast+0x800/0x800 [ 104.990702][ T6588] ? bpf_lsm_socket_sendmsg+0x9/0x10 [ 104.990702][ T6588] ? netlink_unicast+0x800/0x800 [ 104.990702][ T6588] sock_sendmsg+0xd9/0x180 [ 104.990702][ T6588] ____sys_sendmsg+0x264/0x910 [ 104.990702][ T6588] ? kernel_sendmsg+0x50/0x50 [ 104.990702][ T6588] ? __copy_msghdr+0x460/0x460 [ 104.990702][ T6588] ? lockdep_hardirqs_on_prepare+0x410/0x410 [ 104.990702][ T6588] ? find_held_lock+0x2d/0x110 [ 104.990702][ T6588] ___sys_sendmsg+0x11d/0x1b0 [ 104.990702][ T6588] ? do_recvmmsg+0x700/0x700 [ 104.990702][ T6588] ? __fget_files+0x260/0x420 [ 104.990702][ T6588] ? reacquire_held_locks+0x4b0/0x4b0 [ 104.990702][ T6588] ? __fget_files+0x282/0x420 [ 104.990702][ T6588] ? __fget_light+0xe6/0x270 [ 104.990702][ T6588] __sys_sendmmsg+0x18e/0x430 [ 104.990702][ T6588] ? __ia32_sys_sendmsg+0xb0/0xb0 [ 104.990702][ T6588] ? reacquire_held_locks+0x4b0/0x4b0 [ 104.990702][ T6588] ? rcu_is_watching+0x12/0xb0 [ 104.990702][ T6588] ? xfd_validate_state+0x5d/0x180 [ 104.990702][ T6588] ? restore_fpregs_from_fpstate+0xc1/0x1d0 [ 104.990702][ T6588] ? unlock_page_memcg+0x2d0/0x2d0 [ 104.990702][ T6588] ? do_futex+0x350/0x350 [ 104.990702][ T6588] __x64_sys_sendmmsg+0x9c/0x100 [ 104.990702][ T6588] ? syscall_enter_from_user_mode+0x26/0x80 [ 104.990702][ T6588] do_syscall_64+0x38/0xb0 [ 104.990702][ T6588] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 104.990702][ T6588] RIP: 0033:0x7f76c66ee89d [ 104.990702][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48 [ 104.990702][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203 ORIG_RAX: 0000000000000133 [ 104.990702][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX: 00007f76c66ee89d [ 104.990702][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007 [ 104.990702][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09: 0000000000000000 [ 104.990702][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12: fffffffffffffeb8 [ 104.990702][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15: 00007f76c65b5000 [ 104.990702][ T6588] </TASK> [ 104.990702][ T6588] Kernel Offset: disabled [ 104.990702][ T6588] Rebooting in 86400 seconds..
Hi Peilin, Thanks again for the great analysis and investing effort into fixing this! On Mon 22 May 2023 at 16:55, Peilin Ye <yepeilin.cs@gmail.com> wrote: > mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in > ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs > access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for > clsact Qdiscs and miniq_egress. > > Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER > requests (thanks Hillf Danton for the hint), when replacing ingress or > clsact Qdiscs, for example, the old Qdisc ("@old") could access the same > miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"), > causing race conditions [1] including a use-after-free bug in > mini_qdisc_pair_swap() reported by syzbot: > > BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 > Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901 > ... > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:88 [inline] > dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 > print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319 > print_report mm/kasan/report.c:430 [inline] > kasan_report+0x11c/0x130 mm/kasan/report.c:536 > mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 > tcf_chain_head_change_item net/sched/cls_api.c:495 [inline] > tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509 > tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline] > tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline] > tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266 > ... > > @old and @new should not affect each other. In other words, @old should > never modify miniq_{in,e}gress after @new, and @new should not update > @old's RCU state. Fixing without changing sch_api.c turned out to be > difficult (please refer to Closes: for discussions). Instead, make sure > @new's first call always happen after @old's last call, in > qdisc_destroy(), has finished: > > In qdisc_graft(), return -EAGAIN and tell the caller to replay > (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter > requests, and call qdisc_destroy() for @old before grafting @new. > > Introduce qdisc_refcount_dec_if_one() as the counterpart of > qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce > a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check, > just like qdisc_put() etc. > > Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact > Qdiscs". > > [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under > TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only > create under TC_H_INGRESS") on eth0 that has 8 transmission queues: > > Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then > adds a flower filter X to A. > > Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and > b2) to replace A, then adds a flower filter Y to B. > > Thread 1 A's refcnt Thread 2 > RTM_NEWQDISC (A, RTNL-locked) > qdisc_create(A) 1 > qdisc_graft(A) 9 > > RTM_NEWTFILTER (X, RTNL-unlocked) > __tcf_qdisc_find(A) 10 > tcf_chain0_head_change(A) > mini_qdisc_pair_swap(A) (1st) > | > | RTM_NEWQDISC (B, RTNL-locked) > RCU sync 2 qdisc_graft(B) > | 1 notify_and_destroy(A) > | > tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-unlocked) > qdisc_destroy(A) tcf_chain0_head_change(B) > tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd) > mini_qdisc_pair_swap(A) (3rd) | > ... ... > > Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its > mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during > ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets > on eth0 will not find filter Y in sch_handle_ingress(). > > This is only one of the possible consequences of concurrently accessing > miniq_{in,e}gress pointers. The point is clear though: again, A should > never modify those per-net_device pointers after B, and B should not > update A's RCU state. > > Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops") > Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops") > Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/ > Cc: Hillf Danton <hdanton@sina.com> > Cc: Vlad Buslov <vladbu@mellanox.com> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> > --- > changes in v2: > - replay the request if the current Qdisc has any ongoing RTNL-unlocked > filter requests (Vlad) > - minor changes in code comments and commit log > > include/net/sch_generic.h | 8 ++++++++ > net/sched/sch_api.c | 32 ++++++++++++++++++++++++++------ > net/sched/sch_generic.c | 14 +++++++++++--- > 3 files changed, 45 insertions(+), 9 deletions(-) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index fab5ba3e61b7..3e9cc43cbc90 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc) > refcount_inc(&qdisc->refcnt); > } > > +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc) > +{ > + if (qdisc->flags & TCQ_F_BUILTIN) > + return true; > + return refcount_dec_if_one(&qdisc->refcnt); > +} > + > /* Intended to be used by unlocked users, when concurrent qdisc release is > * possible. > */ > @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head); > struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue, > struct Qdisc *qdisc); > void qdisc_reset(struct Qdisc *qdisc); > +void qdisc_destroy(struct Qdisc *qdisc); > void qdisc_put(struct Qdisc *qdisc); > void qdisc_put_unlocked(struct Qdisc *qdisc); > void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len); > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index f72a581666a2..b3bafa6c1b44 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > if ((q && q->flags & TCQ_F_INGRESS) || > (new && new->flags & TCQ_F_INGRESS)) { > ingress = 1; > - if (!dev_ingress_queue(dev)) { > + dev_queue = dev_ingress_queue(dev); > + if (!dev_queue) { > NL_SET_ERR_MSG(extack, "Device does not have an ingress queue"); > return -ENOENT; > } > + > + /* Replay if the current ingress (or clsact) Qdisc has ongoing > + * RTNL-unlocked filter request(s). This is the counterpart of that > + * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). > + */ > + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) > + return -EAGAIN; > } > > if (dev->flags & IFF_UP) > @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > qdisc_put(old); > } > } else { > - dev_queue = dev_ingress_queue(dev); > - old = dev_graft_qdisc(dev_queue, new); > + old = dev_graft_qdisc(dev_queue, NULL); > + > + /* {ingress,clsact}_destroy() @old before grafting @new to avoid > + * unprotected concurrent accesses to net_device::miniq_{in,e}gress > + * pointer(s) in mini_qdisc_pair_swap(). > + */ > + qdisc_notify(net, skb, n, classid, old, new, extack); > + qdisc_destroy(old); > + > + dev_graft_qdisc(dev_queue, new); > } > > skip: > @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, > > if (new && new->ops->attach) > new->ops->attach(new); > - } else { > - notify_and_destroy(net, skb, n, classid, old, new, extack); > } > > if (dev->flags & IFF_UP) > @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > struct Qdisc *p = NULL; > int err; > > +replay: Perhaps also set q and p to NULL here? Even though on cursory look you should get the same lookup result since the function is called under rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case something touches this.") and tc_new_tfilter() got some non-obvious bugs after I introduced replay there without re-setting some of the required variables. > err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX, > rtm_tca_policy, extack); > if (err < 0) > @@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > return -ENOENT; > } > err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack); > - if (err != 0) > + if (err != 0) { > + if (err == -EAGAIN) > + goto replay; > return err; > + } > } else { > qdisc_notify(net, skb, n, clid, NULL, q, NULL); > } > @@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > if (err) { > if (q) > qdisc_put(q); > + if (err == -EAGAIN) > + goto replay; > return err; > } > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 37e41f972f69..e14ed47f961c 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head) > qdisc_free(q); > } > > -static void qdisc_destroy(struct Qdisc *qdisc) > +static void __qdisc_destroy(struct Qdisc *qdisc) > { > const struct Qdisc_ops *ops = qdisc->ops; > > @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc) > call_rcu(&qdisc->rcu, qdisc_free_cb); > } > > +void qdisc_destroy(struct Qdisc *qdisc) > +{ > + if (qdisc->flags & TCQ_F_BUILTIN) > + return; > + > + __qdisc_destroy(qdisc); > +} > + > void qdisc_put(struct Qdisc *qdisc) > { > if (!qdisc) > @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc) > !refcount_dec_and_test(&qdisc->refcnt)) > return; > > - qdisc_destroy(qdisc); > + __qdisc_destroy(qdisc); > } > EXPORT_SYMBOL(qdisc_put); > > @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc) > !refcount_dec_and_rtnl_lock(&qdisc->refcnt)) > return; > > - qdisc_destroy(qdisc); > + __qdisc_destroy(qdisc); > rtnl_unlock(); > } > EXPORT_SYMBOL(qdisc_put_unlocked);
Hi Vlad, On Tue, May 23, 2023 at 05:04:40PM +0300, Vlad Buslov wrote: > > @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, > > struct Qdisc *p = NULL; > > int err; > > > > +replay: > > Perhaps also set q and p to NULL here? Even though on cursory look you > should get the same lookup result since the function is called under > rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case > something touches this.") and tc_new_tfilter() got some non-obvious bugs > after I introduced replay there without re-setting some of the required > variables. Sure, I'll reinitialize tcm, q and p after "replay:" in next version, just like tc_modify_qdisc(). Thanks for the suggestion! Peilin Ye
On Tue, May 23, 2023 at 08:36:35AM -0300, Pedro Tammela wrote: > > Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs > > under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the > > "!ingress" path in qdisc_graft(). I think that's why you are still seeing > > the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible > > after [1,2/6], and I think we'll need a different reproducer for [5,6/6]. > > I was still able to trigger an oops with the full patchset: > > [ 104.944353][ T6588] ------------[ cut here ]------------ > [ 104.944896][ T6588] jump label: negative count! > [ 104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at kernel/jump_label.c:263 > static_key_slow_try_dec+0xf2/0x110 > [ 104.946795][ T6588] Modules linked in: > [ 104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted > 6.4.0-rc2-00191-g4a3f9100193d #3 > [ 104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX, > 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 > [ 104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110 > [ 104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d > c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c > ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f > [ 104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286 > [ 104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX: > 0000000000000000 > [ 104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI: > 0000000000000001 > [ 104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09: > 0000000000000000 > [ 104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12: > 00000000ffffffff > [ 104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15: > ffffffff8e7b0680 > [ 104.954746][ T6588] FS: 00007f76c65d56c0(0000) GS:ffff8881f5a00000(0000) > knlGS:0000000000000000 > [ 104.955430][ T6588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4: > 0000000000350ef0 > [ 104.956559][ T6588] Call Trace: > [ 104.956829][ T6588] <TASK> > [ 104.957062][ T6588] ? clsact_egress_block_get+0x40/0x40 > [ 104.957507][ T6588] static_key_slow_dec+0x60/0xc0 > [ 104.957906][ T6588] qdisc_create+0xa45/0x1090 > [ 104.958274][ T6588] ? tc_get_qdisc+0xb70/0xb70 > [ 104.958646][ T6588] tc_modify_qdisc+0x491/0x1b70 > [ 104.959031][ T6588] ? qdisc_create+0x1090/0x1090 > [ 104.959420][ T6588] ? bpf_lsm_capable+0x9/0x10 > [ 104.959797][ T6588] ? qdisc_create+0x1090/0x1090 Ah, qdisc_create() calls ->destroy() even "if ops->init() failed". We should check sch->parent in {ingress,clsact}_destroy() too. I'll update [1,2/6] in v5. Thanks for reporting this! Seems like I should've run the reproducer nevertheless. I'll run it before posting v5. Thanks, Peilin Ye
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index fab5ba3e61b7..3e9cc43cbc90 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc) refcount_inc(&qdisc->refcnt); } +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc) +{ + if (qdisc->flags & TCQ_F_BUILTIN) + return true; + return refcount_dec_if_one(&qdisc->refcnt); +} + /* Intended to be used by unlocked users, when concurrent qdisc release is * possible. */ @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head); struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue, struct Qdisc *qdisc); void qdisc_reset(struct Qdisc *qdisc); +void qdisc_destroy(struct Qdisc *qdisc); void qdisc_put(struct Qdisc *qdisc); void qdisc_put_unlocked(struct Qdisc *qdisc); void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len); diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index f72a581666a2..b3bafa6c1b44 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, if ((q && q->flags & TCQ_F_INGRESS) || (new && new->flags & TCQ_F_INGRESS)) { ingress = 1; - if (!dev_ingress_queue(dev)) { + dev_queue = dev_ingress_queue(dev); + if (!dev_queue) { NL_SET_ERR_MSG(extack, "Device does not have an ingress queue"); return -ENOENT; } + + /* Replay if the current ingress (or clsact) Qdisc has ongoing + * RTNL-unlocked filter request(s). This is the counterpart of that + * qdisc_refcount_inc_nz() call in __tcf_qdisc_find(). + */ + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) + return -EAGAIN; } if (dev->flags & IFF_UP) @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, qdisc_put(old); } } else { - dev_queue = dev_ingress_queue(dev); - old = dev_graft_qdisc(dev_queue, new); + old = dev_graft_qdisc(dev_queue, NULL); + + /* {ingress,clsact}_destroy() @old before grafting @new to avoid + * unprotected concurrent accesses to net_device::miniq_{in,e}gress + * pointer(s) in mini_qdisc_pair_swap(). + */ + qdisc_notify(net, skb, n, classid, old, new, extack); + qdisc_destroy(old); + + dev_graft_qdisc(dev_queue, new); } skip: @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent, if (new && new->ops->attach) new->ops->attach(new); - } else { - notify_and_destroy(net, skb, n, classid, old, new, extack); } if (dev->flags & IFF_UP) @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, struct Qdisc *p = NULL; int err; +replay: err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX, rtm_tca_policy, extack); if (err < 0) @@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, return -ENOENT; } err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack); - if (err != 0) + if (err != 0) { + if (err == -EAGAIN) + goto replay; return err; + } } else { qdisc_notify(net, skb, n, clid, NULL, q, NULL); } @@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n, if (err) { if (q) qdisc_put(q); + if (err == -EAGAIN) + goto replay; return err; } diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 37e41f972f69..e14ed47f961c 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head) qdisc_free(q); } -static void qdisc_destroy(struct Qdisc *qdisc) +static void __qdisc_destroy(struct Qdisc *qdisc) { const struct Qdisc_ops *ops = qdisc->ops; @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc) call_rcu(&qdisc->rcu, qdisc_free_cb); } +void qdisc_destroy(struct Qdisc *qdisc) +{ + if (qdisc->flags & TCQ_F_BUILTIN) + return; + + __qdisc_destroy(qdisc); +} + void qdisc_put(struct Qdisc *qdisc) { if (!qdisc) @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc) !refcount_dec_and_test(&qdisc->refcnt)) return; - qdisc_destroy(qdisc); + __qdisc_destroy(qdisc); } EXPORT_SYMBOL(qdisc_put); @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc) !refcount_dec_and_rtnl_lock(&qdisc->refcnt)) return; - qdisc_destroy(qdisc); + __qdisc_destroy(qdisc); rtnl_unlock(); } EXPORT_SYMBOL(qdisc_put_unlocked);
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for clsact Qdiscs and miniq_egress. Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER requests (thanks Hillf Danton for the hint), when replacing ingress or clsact Qdiscs, for example, the old Qdisc ("@old") could access the same miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"), causing race conditions [1] including a use-after-free bug in mini_qdisc_pair_swap() reported by syzbot: BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901 ... Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319 print_report mm/kasan/report.c:430 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:536 mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 tcf_chain_head_change_item net/sched/cls_api.c:495 [inline] tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509 tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline] tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline] tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266 ... @old and @new should not affect each other. In other words, @old should never modify miniq_{in,e}gress after @new, and @new should not update @old's RCU state. Fixing without changing sch_api.c turned out to be difficult (please refer to Closes: for discussions). Instead, make sure @new's first call always happen after @old's last call, in qdisc_destroy(), has finished: In qdisc_graft(), return -EAGAIN and tell the caller to replay (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter requests, and call qdisc_destroy() for @old before grafting @new. Introduce qdisc_refcount_dec_if_one() as the counterpart of qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check, just like qdisc_put() etc. Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs". [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only create under TC_H_INGRESS") on eth0 that has 8 transmission queues: Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then adds a flower filter X to A. Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and b2) to replace A, then adds a flower filter Y to B. Thread 1 A's refcnt Thread 2 RTM_NEWQDISC (A, RTNL-locked) qdisc_create(A) 1 qdisc_graft(A) 9 RTM_NEWTFILTER (X, RTNL-unlocked) __tcf_qdisc_find(A) 10 tcf_chain0_head_change(A) mini_qdisc_pair_swap(A) (1st) | | RTM_NEWQDISC (B, RTNL-locked) RCU sync 2 qdisc_graft(B) | 1 notify_and_destroy(A) | tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-unlocked) qdisc_destroy(A) tcf_chain0_head_change(B) tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd) mini_qdisc_pair_swap(A) (3rd) | ... ... Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets on eth0 will not find filter Y in sch_handle_ingress(). This is only one of the possible consequences of concurrently accessing miniq_{in,e}gress pointers. The point is clear though: again, A should never modify those per-net_device pointers after B, and B should not update A's RCU state. Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops") Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops") Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/ Cc: Hillf Danton <hdanton@sina.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> --- changes in v2: - replay the request if the current Qdisc has any ongoing RTNL-unlocked filter requests (Vlad) - minor changes in code comments and commit log include/net/sch_generic.h | 8 ++++++++ net/sched/sch_api.c | 32 ++++++++++++++++++++++++++------ net/sched/sch_generic.c | 14 +++++++++++--- 3 files changed, 45 insertions(+), 9 deletions(-)