Message ID: 20231206210959.1035724-1-yonghong.song@linux.dev (mailing list archive)
State: Superseded
Delegated to: BPF
Series: [bpf-next,v4] bpf: Fix a race condition between btf_put() and map_free()
On 12/7/2023 5:09 AM, Yonghong Song wrote: > When running `./test_progs -j` in my local vm with latest kernel, > I once hit a kasan error like below: > > [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830 > [ 1887.186498] > [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494 > [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 > [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred > [ 1887.190341] Call Trace: > [ 1887.190666] <TASK> > [ 1887.190949] dump_stack_lvl+0xac/0xe0 > [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0 > [ 1887.192019] ? panic+0x3c0/0x3c0 > [ 1887.192449] print_report+0x14f/0x720 > [ 1887.192930] ? preempt_count_sub+0x1c/0xd0 > [ 1887.193459] ? __virt_addr_valid+0xac/0x120 > [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.194572] kasan_report+0xc3/0x100 > [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0 > [ 1887.196736] ? preempt_count_sub+0x1c/0xd0 > [ 1887.197270] ? preempt_count_sub+0x1c/0xd0 > [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40 > [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260 > [ 1887.198883] array_map_free+0x1a3/0x260 > [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0 > [ 1887.199943] process_scheduled_works+0x3a2/0x6c0 > [ 1887.200549] worker_thread+0x633/0x890 > [ 1887.201047] ? __kthread_parkme+0xd7/0xf0 > [ 1887.201574] ? kthread+0x102/0x1d0 > [ 1887.202020] kthread+0x1ab/0x1d0 > [ 1887.202447] ? pr_cont_work+0x270/0x270 > [ 1887.202954] ? kthread_blkcg+0x50/0x50 > [ 1887.203444] ret_from_fork+0x34/0x50 > [ 1887.203914] ? kthread_blkcg+0x50/0x50 > [ 1887.204397] ret_from_fork_asm+0x11/0x20 > [ 1887.204913] </TASK> > [ 1887.204913] </TASK> > [ 1887.205209] > [ 1887.205416] Allocated by task 2197: > [ 1887.205881] kasan_set_track+0x3f/0x60 > [ 1887.206366] __kasan_kmalloc+0x6e/0x80 > [ 1887.206856] __kmalloc+0xac/0x1a0 > [ 1887.207293] btf_parse_fields+0xa15/0x1480 > [ 1887.207836] btf_parse_struct_metas+0x566/0x670 > [ 1887.208387] btf_new_fd+0x294/0x4d0 > [ 1887.208851] __sys_bpf+0x4ba/0x600 > [ 1887.209292] __x64_sys_bpf+0x41/0x50 > [ 1887.209762] do_syscall_64+0x4c/0xf0 > [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b > [ 1887.210868] > [ 1887.211074] Freed by task 36: > [ 1887.211460] kasan_set_track+0x3f/0x60 > [ 1887.211951] kasan_save_free_info+0x28/0x40 > [ 1887.212485] ____kasan_slab_free+0x101/0x180 > [ 1887.213027] __kmem_cache_free+0xe4/0x210 > [ 1887.213514] btf_free+0x5b/0x130 > [ 1887.213918] rcu_core+0x638/0xcc0 > [ 1887.214347] __do_softirq+0x114/0x37e > SNIP > Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new") > Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com>
On 12/6/23 1:09 PM, Yonghong Song wrote: > When running `./test_progs -j` in my local vm with latest kernel, > I once hit a kasan error like below: > > [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830 > [ 1887.186498] > [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494 > [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 > [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred > [ 1887.190341] Call Trace: > [ 1887.190666] <TASK> > [ 1887.190949] dump_stack_lvl+0xac/0xe0 > [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0 > [ 1887.192019] ? panic+0x3c0/0x3c0 > [ 1887.192449] print_report+0x14f/0x720 > [ 1887.192930] ? preempt_count_sub+0x1c/0xd0 > [ 1887.193459] ? __virt_addr_valid+0xac/0x120 > [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.194572] kasan_report+0xc3/0x100 > [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0 > [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0 > [ 1887.196736] ? preempt_count_sub+0x1c/0xd0 > [ 1887.197270] ? preempt_count_sub+0x1c/0xd0 > [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40 > [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260 > [ 1887.198883] array_map_free+0x1a3/0x260 > [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0 > [ 1887.199943] process_scheduled_works+0x3a2/0x6c0 > [ 1887.200549] worker_thread+0x633/0x890 > [ 1887.201047] ? __kthread_parkme+0xd7/0xf0 > [ 1887.201574] ? kthread+0x102/0x1d0 > [ 1887.202020] kthread+0x1ab/0x1d0 > [ 1887.202447] ? pr_cont_work+0x270/0x270 > [ 1887.202954] ? kthread_blkcg+0x50/0x50 > [ 1887.203444] ret_from_fork+0x34/0x50 > [ 1887.203914] ? kthread_blkcg+0x50/0x50 > [ 1887.204397] ret_from_fork_asm+0x11/0x20 > [ 1887.204913] </TASK> > [ 1887.204913] </TASK> > [ 1887.205209] > [ 1887.205416] Allocated by task 2197: > [ 1887.205881] kasan_set_track+0x3f/0x60 > [ 1887.206366] __kasan_kmalloc+0x6e/0x80 > [ 1887.206856] __kmalloc+0xac/0x1a0 > [ 1887.207293] btf_parse_fields+0xa15/0x1480 > [ 1887.207836] btf_parse_struct_metas+0x566/0x670 > [ 1887.208387] btf_new_fd+0x294/0x4d0 > [ 1887.208851] __sys_bpf+0x4ba/0x600 > [ 1887.209292] __x64_sys_bpf+0x41/0x50 > [ 1887.209762] do_syscall_64+0x4c/0xf0 > [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b > [ 1887.210868] > [ 1887.211074] Freed by task 36: > [ 1887.211460] kasan_set_track+0x3f/0x60 > [ 1887.211951] kasan_save_free_info+0x28/0x40 > [ 1887.212485] ____kasan_slab_free+0x101/0x180 > [ 1887.213027] __kmem_cache_free+0xe4/0x210 > [ 1887.213514] btf_free+0x5b/0x130 > [ 1887.213918] rcu_core+0x638/0xcc0 > [ 1887.214347] __do_softirq+0x114/0x37e > > The error happens at bpf_rb_root_free+0x1f8/0x2b0: > > 00000000000034c0 <bpf_rb_root_free>: > ; { > 34c0: f3 0f 1e fa endbr64 > 34c4: e8 00 00 00 00 callq 0x34c9 <bpf_rb_root_free+0x9> > 34c9: 55 pushq %rbp > 34ca: 48 89 e5 movq %rsp, %rbp > ... > ; if (rec && rec->refcount_off >= 0 && > 36aa: 4d 85 ed testq %r13, %r13 > 36ad: 74 a9 je 0x3658 <bpf_rb_root_free+0x198> > 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi > 36b3: e8 00 00 00 00 callq 0x36b8 <bpf_rb_root_free+0x1f8> > <==== kasan function > 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d > <==== use-after-free load > 36bc: 45 85 ff testl %r15d, %r15d > 36bf: 78 8c js 0x364d <bpf_rb_root_free+0x18d> > > So the problem is at rec->refcount_off in the above. 
> > I did some source code analysis and find the reason. > CPU A CPU B > bpf_map_put: > ... > btf_put with rcu callback > ... > bpf_map_free_deferred > with system_unbound_wq > ... ... ... > ... btf_free_rcu: ... > ... ... bpf_map_free_deferred: > ... ... > ... ---------> btf_struct_metas_free() > ... | race condition ... > ... ---------> map->ops->map_free() > ... > ... btf->struct_meta_tab = NULL > > In the above, map_free() corresponds to array_map_free() and eventually > calling bpf_rb_root_free() which calls: > ... > __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false); > ... > > Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code: > > meta = btf_find_struct_meta(btf, btf_id); > if (!meta) > return -EFAULT; > rec->fields[i].graph_root.value_rec = meta->record; > > So basically, 'value_rec' is a pointer to the record in struct_metas_tab. > And it is possible that that particular record has been freed by > btf_struct_metas_free() and hence we have a kasan error here. > > Actually it is very hard to reproduce the failure with current bpf/bpf-next > code, I only got the above error once. To increase reproducibility, I added > a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which > significantly increased reproducibility. > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 5e43ddd1b83f..aae5b5213e93 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work) > struct bpf_map *map = container_of(work, struct bpf_map, work); > struct btf_record *rec = map->record; > > + mdelay(100); > security_bpf_map_free(map); > bpf_map_release_memcg(map); > /* implementation dependent freeing */ > > To fix the problem, we need to have a reference on btf in order to > safeguard accessing field->graph_root.value_rec in map->ops->map_free(). > The function btf_parse_graph_root() is the place to get a btf reference. > The following are rough call stacks reaching bpf_parse_graph_root(): > > btf_parse > ... > btf_parse_fields > ... > btf_parse_graph_root > > map_check_btf > btf_parse_fields > ... > btf_parse_graph_root > > Looking at the above call stack, the btf_parse_graph_root() is indirectly > called from btf_parse() or map_check_btf(). > > We cannot take a reference in btf_parse() case since at that moment, > btf is still in the middle to self-validation and initial reference > (refcount_set(&btf->refcnt, 1)) has not been triggered yet. Thanks for the details analysis and clear explanation. It helps a lot. Sorry for jumping in late. I am trying to avoid making a special case for "bool has_btf_ref;" and "bool from_map_check". It seems to a bit too much to deal with the error path for btf_parse(). Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help? > It is also unnecessary to take a reference since the value_rec is > referring to a record in struct_meta_tab. If we optimize for not taking a refcnt, how about not taking a refcnt for all cases and postpone the btf_put(), instead of taking refcnt in one case but not another. Like your fix in v1. The failed selftest can be changed or even removed if it does not make sense anymore. > > But we do need to take a reference for map_check_btf() case as > the value_rec refers to a record in struct_meta_tab and we want > to make sure btf is not freed until value_rec usage at map_free > is done. > > So the fix is to get a btf reference for map_check_btf() case > whenever a graph_root is found. 
A boolean field 'has_btf_ref' is added to > struct btf_field_graph_root so later the btf reference can be properly > released. > > Rerun './test_progs -j' with the above mdelay() hack for a couple > of times and didn't observe the error. > Running Hou's test ([1]) is also successful. > > [1] https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@huaweicloud.com/ >
On 12/7/23 5:23 PM, Martin KaFai Lau wrote: > On 12/6/23 1:09 PM, Yonghong Song wrote: >> When running `./test_progs -j` in my local vm with latest kernel, >> I once hit a kasan error like below: >> >> [ 1887.184724] BUG: KASAN: slab-use-after-free in >> bpf_rb_root_free+0x1f8/0x2b0 >> [ 1887.185599] Read of size 4 at addr ffff888106806910 by task >> kworker/u12:2/2830 >> [ 1887.186498] >> [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: >> G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494 >> [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, >> 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 >> [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred >> [ 1887.190341] Call Trace: >> [ 1887.190666] <TASK> >> [ 1887.190949] dump_stack_lvl+0xac/0xe0 >> [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0 >> [ 1887.192019] ? panic+0x3c0/0x3c0 >> [ 1887.192449] print_report+0x14f/0x720 >> [ 1887.192930] ? preempt_count_sub+0x1c/0xd0 >> [ 1887.193459] ? __virt_addr_valid+0xac/0x120 >> [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0 >> [ 1887.194572] kasan_report+0xc3/0x100 >> [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0 >> [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0 >> [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0 >> [ 1887.196736] ? preempt_count_sub+0x1c/0xd0 >> [ 1887.197270] ? preempt_count_sub+0x1c/0xd0 >> [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40 >> [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260 >> [ 1887.198883] array_map_free+0x1a3/0x260 >> [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0 >> [ 1887.199943] process_scheduled_works+0x3a2/0x6c0 >> [ 1887.200549] worker_thread+0x633/0x890 >> [ 1887.201047] ? __kthread_parkme+0xd7/0xf0 >> [ 1887.201574] ? kthread+0x102/0x1d0 >> [ 1887.202020] kthread+0x1ab/0x1d0 >> [ 1887.202447] ? pr_cont_work+0x270/0x270 >> [ 1887.202954] ? kthread_blkcg+0x50/0x50 >> [ 1887.203444] ret_from_fork+0x34/0x50 >> [ 1887.203914] ? kthread_blkcg+0x50/0x50 >> [ 1887.204397] ret_from_fork_asm+0x11/0x20 >> [ 1887.204913] </TASK> >> [ 1887.204913] </TASK> >> [ 1887.205209] >> [ 1887.205416] Allocated by task 2197: >> [ 1887.205881] kasan_set_track+0x3f/0x60 >> [ 1887.206366] __kasan_kmalloc+0x6e/0x80 >> [ 1887.206856] __kmalloc+0xac/0x1a0 >> [ 1887.207293] btf_parse_fields+0xa15/0x1480 >> [ 1887.207836] btf_parse_struct_metas+0x566/0x670 >> [ 1887.208387] btf_new_fd+0x294/0x4d0 >> [ 1887.208851] __sys_bpf+0x4ba/0x600 >> [ 1887.209292] __x64_sys_bpf+0x41/0x50 >> [ 1887.209762] do_syscall_64+0x4c/0xf0 >> [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b >> [ 1887.210868] >> [ 1887.211074] Freed by task 36: >> [ 1887.211460] kasan_set_track+0x3f/0x60 >> [ 1887.211951] kasan_save_free_info+0x28/0x40 >> [ 1887.212485] ____kasan_slab_free+0x101/0x180 >> [ 1887.213027] __kmem_cache_free+0xe4/0x210 >> [ 1887.213514] btf_free+0x5b/0x130 >> [ 1887.213918] rcu_core+0x638/0xcc0 >> [ 1887.214347] __do_softirq+0x114/0x37e >> >> The error happens at bpf_rb_root_free+0x1f8/0x2b0: >> >> 00000000000034c0 <bpf_rb_root_free>: >> ; { >> 34c0: f3 0f 1e fa endbr64 >> 34c4: e8 00 00 00 00 callq 0x34c9 >> <bpf_rb_root_free+0x9> >> 34c9: 55 pushq %rbp >> 34ca: 48 89 e5 movq %rsp, %rbp >> ... 
>> ; if (rec && rec->refcount_off >= 0 && >> 36aa: 4d 85 ed testq %r13, %r13 >> 36ad: 74 a9 je 0x3658 >> <bpf_rb_root_free+0x198> >> 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi >> 36b3: e8 00 00 00 00 callq 0x36b8 >> <bpf_rb_root_free+0x1f8> >> <==== kasan function >> 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d >> <==== use-after-free load >> 36bc: 45 85 ff testl %r15d, %r15d >> 36bf: 78 8c js 0x364d >> <bpf_rb_root_free+0x18d> >> >> So the problem is at rec->refcount_off in the above. >> >> I did some source code analysis and find the reason. >> CPU A CPU B >> bpf_map_put: >> ... >> btf_put with rcu callback >> ... >> bpf_map_free_deferred >> with system_unbound_wq >> ... ... ... >> ... btf_free_rcu: ... >> ... ... bpf_map_free_deferred: >> ... ... >> ... ---------> btf_struct_metas_free() >> ... | race condition ... >> ... ---------> map->ops->map_free() >> ... >> ... btf->struct_meta_tab = NULL >> >> In the above, map_free() corresponds to array_map_free() and eventually >> calling bpf_rb_root_free() which calls: >> ... >> __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false); >> ... >> >> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with >> following code: >> >> meta = btf_find_struct_meta(btf, btf_id); >> if (!meta) >> return -EFAULT; >> rec->fields[i].graph_root.value_rec = meta->record; >> >> So basically, 'value_rec' is a pointer to the record in >> struct_metas_tab. >> And it is possible that that particular record has been freed by >> btf_struct_metas_free() and hence we have a kasan error here. >> >> Actually it is very hard to reproduce the failure with current >> bpf/bpf-next >> code, I only got the above error once. To increase reproducibility, I >> added >> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which >> significantly increased reproducibility. >> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 5e43ddd1b83f..aae5b5213e93 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct >> work_struct *work) >> struct bpf_map *map = container_of(work, struct bpf_map, work); >> struct btf_record *rec = map->record; >> >> + mdelay(100); >> security_bpf_map_free(map); >> bpf_map_release_memcg(map); >> /* implementation dependent freeing */ >> >> To fix the problem, we need to have a reference on btf in order to >> safeguard accessing field->graph_root.value_rec in map->ops->map_free(). >> The function btf_parse_graph_root() is the place to get a btf reference. >> The following are rough call stacks reaching bpf_parse_graph_root(): >> >> btf_parse >> ... >> btf_parse_fields >> ... >> btf_parse_graph_root >> >> map_check_btf >> btf_parse_fields >> ... >> btf_parse_graph_root >> >> Looking at the above call stack, the btf_parse_graph_root() is >> indirectly >> called from btf_parse() or map_check_btf(). >> >> We cannot take a reference in btf_parse() case since at that moment, >> btf is still in the middle to self-validation and initial reference >> (refcount_set(&btf->refcnt, 1)) has not been triggered yet. > > Thanks for the details analysis and clear explanation. It helps a lot. > > Sorry for jumping in late. > > I am trying to avoid making a special case for "bool has_btf_ref;" and > "bool from_map_check". It seems to a bit too much to deal with the > error path for btf_parse(). > > Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help? No, it does not. 
The core reason is what Hou mentioned in
https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
We simply cannot take a btf reference if called from btf_parse(). Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() so we take a ref on btf during btf_parse_fields(); then we have
  btf_put            <=== expects refcount == 0 to start the destruction process
    ...
      btf_record_free  <=== in which, if there is a graph_root, a btf reference will be held
so btf_put will never be able to actually free the btf data. Yes, the kasan problem will be resolved, but we will leak memory.
> >> It is also unnecessary to take a reference since the value_rec is >> referring to a record in struct_meta_tab. > > If we optimize for not taking a refcnt, how about not taking a refcnt > for all cases and postpone the btf_put(), instead of taking refcnt in > one case but not another. Like your fix in v1. The failed selftest can > be changed or even removed if it does not make sense anymore.
After a couple of iterations, I think taking the necessary reference approach sounds better, and it will be consistent with how kptr is handled. For kptr, btf_parse will ignore it. Unfortunately, for graph_root (list_head, rb_root), btf_parse and map_check will both process it, and that adds a little bit of complexity. Alexei also suggested the same take-a-reference approach: https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/ > >> >> But we do need to take a reference for map_check_btf() case as >> the value_rec refers to a record in struct_meta_tab and we want >> to make sure btf is not freed until value_rec usage at map_free >> is done. >> >> So the fix is to get a btf reference for map_check_btf() case >> whenever a graph_root is found. A boolean field 'has_btf_ref' is >> added to >> struct btf_field_graph_root so later the btf reference can be properly >> released. >> >> Rerun './test_progs -j' with the above mdelay() hack for a couple >> of times and didn't observe the error. >> Running Hou's test ([1]) is also successful. >> >> [1] >> https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@huaweicloud.com/ >> >
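To make the ownership loop above concrete, here is a small, self-contained C model of why an extra reference taken while parsing fields could never be dropped. The names model_btf, model_record and model_btf_put() are simplified stand-ins for struct btf, a struct_meta record and btf_put(); this is only a sketch of the cycle, not kernel code.

#include <stdio.h>
#include <stdlib.h>

struct model_record;                         /* stands in for a struct_meta record */

struct model_btf {                           /* stands in for struct btf */
	int refcnt;
	struct model_record *meta_rec;       /* owned via struct_meta_tab */
};

struct model_record {
	struct model_btf *owner_ref;         /* the would-be btf_get() on the owner */
};

/* stands in for btf_put(): frees only when the count drops to zero */
static void model_btf_put(struct model_btf *btf)
{
	if (--btf->refcnt > 0) {
		printf("refcnt still %d, btf (and its records) never freed\n", btf->refcnt);
		return;
	}
	/* the record free, which would drop owner_ref, only happens here ... */
	free(btf->meta_rec);
	free(btf);
	printf("btf freed\n");
}

int main(void)
{
	struct model_btf *btf = calloc(1, sizeof(*btf));

	btf->refcnt = 1;                      /* initial user reference */
	btf->meta_rec = calloc(1, sizeof(*btf->meta_rec));
	btf->meta_rec->owner_ref = btf;       /* extra ref taken while parsing fields */
	btf->refcnt++;

	model_btf_put(btf);                   /* user drops the last reference ... */
	return 0;                             /* ... but refcnt is stuck at 1: leak */
}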
On 12/7/23 7:59 PM, Yonghong Song wrote: > > On 12/7/23 5:23 PM, Martin KaFai Lau wrote: >> On 12/6/23 1:09 PM, Yonghong Song wrote: >>> When running `./test_progs -j` in my local vm with latest kernel, >>> I once hit a kasan error like below: >>> >>> [ 1887.184724] BUG: KASAN: slab-use-after-free in >>> bpf_rb_root_free+0x1f8/0x2b0 >>> [ 1887.185599] Read of size 4 at addr ffff888106806910 by task >>> kworker/u12:2/2830 >>> [ 1887.186498] >>> [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: >>> G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494 >>> [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, >>> 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 >>> [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred >>> [ 1887.190341] Call Trace: >>> [ 1887.190666] <TASK> >>> [ 1887.190949] dump_stack_lvl+0xac/0xe0 >>> [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0 >>> [ 1887.192019] ? panic+0x3c0/0x3c0 >>> [ 1887.192449] print_report+0x14f/0x720 >>> [ 1887.192930] ? preempt_count_sub+0x1c/0xd0 >>> [ 1887.193459] ? __virt_addr_valid+0xac/0x120 >>> [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0 >>> [ 1887.194572] kasan_report+0xc3/0x100 >>> [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0 >>> [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0 >>> [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0 >>> [ 1887.196736] ? preempt_count_sub+0x1c/0xd0 >>> [ 1887.197270] ? preempt_count_sub+0x1c/0xd0 >>> [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40 >>> [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260 >>> [ 1887.198883] array_map_free+0x1a3/0x260 >>> [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0 >>> [ 1887.199943] process_scheduled_works+0x3a2/0x6c0 >>> [ 1887.200549] worker_thread+0x633/0x890 >>> [ 1887.201047] ? __kthread_parkme+0xd7/0xf0 >>> [ 1887.201574] ? kthread+0x102/0x1d0 >>> [ 1887.202020] kthread+0x1ab/0x1d0 >>> [ 1887.202447] ? pr_cont_work+0x270/0x270 >>> [ 1887.202954] ? kthread_blkcg+0x50/0x50 >>> [ 1887.203444] ret_from_fork+0x34/0x50 >>> [ 1887.203914] ? kthread_blkcg+0x50/0x50 >>> [ 1887.204397] ret_from_fork_asm+0x11/0x20 >>> [ 1887.204913] </TASK> >>> [ 1887.204913] </TASK> >>> [ 1887.205209] >>> [ 1887.205416] Allocated by task 2197: >>> [ 1887.205881] kasan_set_track+0x3f/0x60 >>> [ 1887.206366] __kasan_kmalloc+0x6e/0x80 >>> [ 1887.206856] __kmalloc+0xac/0x1a0 >>> [ 1887.207293] btf_parse_fields+0xa15/0x1480 >>> [ 1887.207836] btf_parse_struct_metas+0x566/0x670 >>> [ 1887.208387] btf_new_fd+0x294/0x4d0 >>> [ 1887.208851] __sys_bpf+0x4ba/0x600 >>> [ 1887.209292] __x64_sys_bpf+0x41/0x50 >>> [ 1887.209762] do_syscall_64+0x4c/0xf0 >>> [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b >>> [ 1887.210868] >>> [ 1887.211074] Freed by task 36: >>> [ 1887.211460] kasan_set_track+0x3f/0x60 >>> [ 1887.211951] kasan_save_free_info+0x28/0x40 >>> [ 1887.212485] ____kasan_slab_free+0x101/0x180 >>> [ 1887.213027] __kmem_cache_free+0xe4/0x210 >>> [ 1887.213514] btf_free+0x5b/0x130 >>> [ 1887.213918] rcu_core+0x638/0xcc0 >>> [ 1887.214347] __do_softirq+0x114/0x37e >>> >>> The error happens at bpf_rb_root_free+0x1f8/0x2b0: >>> >>> 00000000000034c0 <bpf_rb_root_free>: >>> ; { >>> 34c0: f3 0f 1e fa endbr64 >>> 34c4: e8 00 00 00 00 callq 0x34c9 >>> <bpf_rb_root_free+0x9> >>> 34c9: 55 pushq %rbp >>> 34ca: 48 89 e5 movq %rsp, %rbp >>> ... 
>>> ; if (rec && rec->refcount_off >= 0 && >>> 36aa: 4d 85 ed testq %r13, %r13 >>> 36ad: 74 a9 je 0x3658 >>> <bpf_rb_root_free+0x198> >>> 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi >>> 36b3: e8 00 00 00 00 callq 0x36b8 >>> <bpf_rb_root_free+0x1f8> >>> <==== kasan function >>> 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d >>> <==== use-after-free load >>> 36bc: 45 85 ff testl %r15d, %r15d >>> 36bf: 78 8c js 0x364d >>> <bpf_rb_root_free+0x18d> >>> >>> So the problem is at rec->refcount_off in the above. >>> >>> I did some source code analysis and find the reason. >>> CPU A CPU B >>> bpf_map_put: >>> ... >>> btf_put with rcu callback >>> ... >>> bpf_map_free_deferred >>> with system_unbound_wq >>> ... ... ... >>> ... btf_free_rcu: ... >>> ... ... bpf_map_free_deferred: >>> ... ... >>> ... ---------> btf_struct_metas_free() >>> ... | race condition ... >>> ... ---------> map->ops->map_free() >>> ... >>> ... btf->struct_meta_tab = NULL >>> >>> In the above, map_free() corresponds to array_map_free() and eventually >>> calling bpf_rb_root_free() which calls: >>> ... >>> __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false); >>> ... >>> >>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with >>> following code: >>> >>> meta = btf_find_struct_meta(btf, btf_id); >>> if (!meta) >>> return -EFAULT; >>> rec->fields[i].graph_root.value_rec = meta->record; >>> >>> So basically, 'value_rec' is a pointer to the record in >>> struct_metas_tab. >>> And it is possible that that particular record has been freed by >>> btf_struct_metas_free() and hence we have a kasan error here. >>> >>> Actually it is very hard to reproduce the failure with current >>> bpf/bpf-next >>> code, I only got the above error once. To increase reproducibility, >>> I added >>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which >>> significantly increased reproducibility. >>> >>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >>> index 5e43ddd1b83f..aae5b5213e93 100644 >>> --- a/kernel/bpf/syscall.c >>> +++ b/kernel/bpf/syscall.c >>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct >>> work_struct *work) >>> struct bpf_map *map = container_of(work, struct bpf_map, >>> work); >>> struct btf_record *rec = map->record; >>> >>> + mdelay(100); >>> security_bpf_map_free(map); >>> bpf_map_release_memcg(map); >>> /* implementation dependent freeing */ >>> >>> To fix the problem, we need to have a reference on btf in order to >>> safeguard accessing field->graph_root.value_rec in >>> map->ops->map_free(). >>> The function btf_parse_graph_root() is the place to get a btf >>> reference. >>> The following are rough call stacks reaching bpf_parse_graph_root(): >>> >>> btf_parse >>> ... >>> btf_parse_fields >>> ... >>> btf_parse_graph_root >>> >>> map_check_btf >>> btf_parse_fields >>> ... >>> btf_parse_graph_root >>> >>> Looking at the above call stack, the btf_parse_graph_root() is >>> indirectly >>> called from btf_parse() or map_check_btf(). >>> >>> We cannot take a reference in btf_parse() case since at that moment, >>> btf is still in the middle to self-validation and initial reference >>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet. >> >> Thanks for the details analysis and clear explanation. It helps a lot. >> >> Sorry for jumping in late. >> >> I am trying to avoid making a special case for "bool has_btf_ref;" >> and "bool from_map_check". It seems to a bit too much to deal with >> the error path for btf_parse(). 
>> >> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help? > > No, it does not. The core reason is what Hao is mentioned in > https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/ > We simply cannot take btf reference if called from btf_parse(). > Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() > so we take ref for btf during btf_parse_fields(), then we have > btf_put <=== expect refcount == 0 to start the destruction process > ... > btf_record_free <=== in which if graph_root, a btf reference > will be hold > so btf_put will never be able to actually free btf data. > Yes, the kasan problem will be resolved but we leak memory. Let me send another version with better commit message. [...]
On 12/7/23 7:59 PM, Yonghong Song wrote: >> >> I am trying to avoid making a special case for "bool has_btf_ref;" and "bool >> from_map_check". It seems to a bit too much to deal with the error path for >> btf_parse(). >> >> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help? > > No, it does not. The core reason is what Hao is mentioned in > https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/ > We simply cannot take btf reference if called from btf_parse(). > Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() > so we take ref for btf during btf_parse_fields(), then we have > btf_put <=== expect refcount == 0 to start the destruction process > ... > btf_record_free <=== in which if graph_root, a btf reference will be hold > so btf_put will never be able to actually free btf data. ah. There is a loop like btf->struct_meta_tab->...btf. > Yes, the kasan problem will be resolved but we leak memory. > >> >>> It is also unnecessary to take a reference since the value_rec is >>> referring to a record in struct_meta_tab. >> >> If we optimize for not taking a refcnt, how about not taking a refcnt for all >> cases and postpone the btf_put(), instead of taking refcnt in one case but not >> another. Like your fix in v1. The failed selftest can be changed or even >> removed if it does not make sense anymore. > > After a couple of iterations, I think taking necessary reference approach sounds > better > and this will be consistent with how kptr is handled. For kptr, btf_parse will > ignore it. Got it. It is why kptr.btf got away with the loop. On the other hand, am I reading it correctly that kptr.btf only needs to take the refcnt for btf that is btf_is_kernel()? > Unfortunately, for graph_root (list_head, rb_root), btf_parse and map_check will > both > process it and that adds a little bit complexity. > Alexei also suggested the same taking reference approach: > https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/
Hi, On 12/8/2023 12:02 PM, Yonghong Song wrote: > > On 12/7/23 7:59 PM, Yonghong Song wrote: >> >> On 12/7/23 5:23 PM, Martin KaFai Lau wrote: >>> On 12/6/23 1:09 PM, Yonghong Song wrote: >>>> When running `./test_progs -j` in my local vm with latest kernel, >>>> I once hit a kasan error like below: >>>> >>>> SNIP >>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with >>>> following code: >>>> >>>> meta = btf_find_struct_meta(btf, btf_id); >>>> if (!meta) >>>> return -EFAULT; >>>> rec->fields[i].graph_root.value_rec = meta->record; >>>> >>>> So basically, 'value_rec' is a pointer to the record in >>>> struct_metas_tab. >>>> And it is possible that that particular record has been freed by >>>> btf_struct_metas_free() and hence we have a kasan error here. >>>> >>>> Actually it is very hard to reproduce the failure with current >>>> bpf/bpf-next >>>> code, I only got the above error once. To increase reproducibility, >>>> I added >>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), >>>> which >>>> significantly increased reproducibility. >>>> >>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >>>> index 5e43ddd1b83f..aae5b5213e93 100644 >>>> --- a/kernel/bpf/syscall.c >>>> +++ b/kernel/bpf/syscall.c >>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct >>>> work_struct *work) >>>> struct bpf_map *map = container_of(work, struct bpf_map, >>>> work); >>>> struct btf_record *rec = map->record; >>>> >>>> + mdelay(100); >>>> security_bpf_map_free(map); >>>> bpf_map_release_memcg(map); >>>> /* implementation dependent freeing */ >>>> >>>> To fix the problem, we need to have a reference on btf in order to >>>> safeguard accessing field->graph_root.value_rec in >>>> map->ops->map_free(). >>>> The function btf_parse_graph_root() is the place to get a btf >>>> reference. >>>> The following are rough call stacks reaching bpf_parse_graph_root(): >>>> >>>> btf_parse >>>> ... >>>> btf_parse_fields >>>> ... >>>> btf_parse_graph_root >>>> >>>> map_check_btf >>>> btf_parse_fields >>>> ... >>>> btf_parse_graph_root >>>> >>>> Looking at the above call stack, the btf_parse_graph_root() is >>>> indirectly >>>> called from btf_parse() or map_check_btf(). >>>> >>>> We cannot take a reference in btf_parse() case since at that moment, >>>> btf is still in the middle to self-validation and initial reference >>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet. >>> >>> Thanks for the details analysis and clear explanation. It helps a lot. >>> >>> Sorry for jumping in late. >>> >>> I am trying to avoid making a special case for "bool has_btf_ref;" >>> and "bool from_map_check". It seems to a bit too much to deal with >>> the error path for btf_parse(). Maybe we could move the common btf used by kptr and graph_root into bpf_record and let the callers of btf_parse_fields() and btf_record_free() to decide the life cycle of btf in btf_record, so there will be less intrusive and less special case. 
The following is the code snippets for the idea (only simply tested): diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 1ad0ed623f50..a0c4d696a231 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -230,6 +230,7 @@ struct btf_record { int spin_lock_off; int timer_off; int refcount_off; + struct btf *btf; struct btf_field fields[]; }; diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 966dace27fae..08b4a2784ee4 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3598,7 +3598,6 @@ static int btf_parse_kptr(struct btf *btf, struct btf_field *field, field->kptr.dtor = NULL; id = info->kptr.type_id; kptr_btf = btf; - btf_get(kptr_btf); goto found_dtor; } if (id < 0) @@ -3753,6 +3752,7 @@ struct btf_record *btf_parse_fields(struct btf *btf, const struct btf_type *t, if (!rec) return ERR_PTR(-ENOMEM); + rec->btf = btf; rec->spin_lock_off = -EINVAL; rec->timer_off = -EINVAL; rec->refcount_off = -EINVAL; diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 8ef269e66ba5..d9f4198158b6 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -56,6 +56,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) ret = PTR_ERR(inner_map_meta->record); goto free; } + if (inner_map_meta->record) + btf_get(inner_map_meta->record->btf); + /* Note: We must use the same BTF, as we also used btf_record_dup above * which relies on BTF being same for both maps, as some members like * record->fields.list_head have pointers like value_rec pointing into diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 30ed7756fc71..d2641e51a1a7 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -516,7 +516,8 @@ void btf_record_free(struct btf_record *rec) case BPF_KPTR_PERCPU: if (rec->fields[i].kptr.module) module_put(rec->fields[i].kptr.module); - btf_put(rec->fields[i].kptr.btf); + if (rec->fields[i].kptr.btf != rec->btf) + btf_put(rec->fields[i].kptr.btf); break; case BPF_LIST_HEAD: case BPF_LIST_NODE: @@ -537,8 +538,13 @@ void btf_record_free(struct btf_record *rec) void bpf_map_free_record(struct bpf_map *map) { + struct btf *btf = NULL; + + if (!IS_ERR_OR_NULL(map->record)) + btf = map->record->btf; btf_record_free(map->record); map->record = NULL; + btf_put(btf); } struct btf_record *btf_record_dup(const struct btf_record *rec) @@ -561,7 +567,8 @@ struct btf_record *btf_record_dup(const struct btf_record *rec) case BPF_KPTR_UNREF: case BPF_KPTR_REF: case BPF_KPTR_PERCPU: - btf_get(fields[i].kptr.btf); + if (fields[i].kptr.btf != rec->btf) + btf_get(fields[i].kptr.btf); if (fields[i].kptr.module && !try_module_get(fields[i].kptr.module)) { ret = -ENXIO; goto free; @@ -692,7 +699,10 @@ static void bpf_map_free_deferred(struct work_struct *work) { struct bpf_map *map = container_of(work, struct bpf_map, work); struct btf_record *rec = map->record; + struct btf *btf = NULL; + if (!IS_ERR_OR_NULL(rec)) + btf = rec->btf; security_bpf_map_free(map); bpf_map_release_memcg(map); /* implementation dependent freeing */ @@ -707,6 +717,7 @@ static void bpf_map_free_deferred(struct work_struct *work) * template bpf_map struct used during verification. */ btf_record_free(rec); + btf_put(btf); } static void bpf_map_put_uref(struct bpf_map *map) @@ -1036,6 +1047,8 @@ static int map_check_btf(struct bpf_map *map, struct btf *btf, if (!IS_ERR_OR_NULL(map->record)) { int i; + btf_get(map->record->btf); + if (!bpf_capable()) { ret = -EPERM; goto free_map_tab; WDYT ? >>> >>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse >>> help? 
>> >> No, it does not. The core reason is what Hao is mentioned in >> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/ >> >> We simply cannot take btf reference if called from btf_parse(). >> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() >> so we take ref for btf during btf_parse_fields(), then we have >> btf_put <=== expect refcount == 0 to start the destruction process >> ... >> btf_record_free <=== in which if graph_root, a btf reference >> will be hold >> so btf_put will never be able to actually free btf data. >> Yes, the kasan problem will be resolved but we leak memory. > Let me send another version with better commit message. > > [...]
On 12/8/23 12:16 AM, Martin KaFai Lau wrote: > On 12/7/23 7:59 PM, Yonghong Song wrote: >>> >>> I am trying to avoid making a special case for "bool has_btf_ref;" >>> and "bool from_map_check". It seems to a bit too much to deal with >>> the error path for btf_parse(). >>> >>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse >>> help? >> >> No, it does not. The core reason is what Hao is mentioned in >> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/ >> >> We simply cannot take btf reference if called from btf_parse(). >> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() >> so we take ref for btf during btf_parse_fields(), then we have >> btf_put <=== expect refcount == 0 to start the destruction process >> ... >> btf_record_free <=== in which if graph_root, a btf >> reference will be hold >> so btf_put will never be able to actually free btf data. > > ah. There is a loop like btf->struct_meta_tab->...btf. > >> Yes, the kasan problem will be resolved but we leak memory. >> >>> >>>> It is also unnecessary to take a reference since the value_rec is >>>> referring to a record in struct_meta_tab. >>> >>> If we optimize for not taking a refcnt, how about not taking a >>> refcnt for all cases and postpone the btf_put(), instead of taking >>> refcnt in one case but not another. Like your fix in v1. The failed >>> selftest can be changed or even removed if it does not make sense >>> anymore. >> >> After a couple of iterations, I think taking necessary reference >> approach sounds better >> and this will be consistent with how kptr is handled. For kptr, >> btf_parse will ignore it. > > Got it. It is why kptr.btf got away with the loop. > > On the other hand, am I reading it correctly that kptr.btf only needs > to take the refcnt for btf that is btf_is_kernel()? No. besides vmlinux and module btf, it also takes reference for prog btf, see static int btf_parse_kptr(const struct btf *btf, struct btf_field *field, struct btf_field_info *info) { ... if (id == -ENOENT) { /* btf_parse_kptr should only be called w/ btf = program BTF */ WARN_ON_ONCE(btf_is_kernel(btf)); /* Type exists only in program BTF. Assume that it's a MEM_ALLOC * kptr allocated via bpf_obj_new */ field->kptr.dtor = NULL; id = info->kptr.type_id; kptr_btf = (struct btf *)btf; btf_get(kptr_btf); goto found_dtor; } ... } > >> Unfortunately, for graph_root (list_head, rb_root), btf_parse and >> map_check will both >> process it and that adds a little bit complexity. >> Alexei also suggested the same taking reference approach: >> https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/ >> >
On 12/8/23 12:30 AM, Hou Tao wrote: > Hi, > > On 12/8/2023 12:02 PM, Yonghong Song wrote: >> On 12/7/23 7:59 PM, Yonghong Song wrote: >>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote: >>>> On 12/6/23 1:09 PM, Yonghong Song wrote: >>>>> When running `./test_progs -j` in my local vm with latest kernel, >>>>> I once hit a kasan error like below: >>>>> >>>>> > SNIP >>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with >>>>> following code: >>>>> >>>>> meta = btf_find_struct_meta(btf, btf_id); >>>>> if (!meta) >>>>> return -EFAULT; >>>>> rec->fields[i].graph_root.value_rec = meta->record; >>>>> >>>>> So basically, 'value_rec' is a pointer to the record in >>>>> struct_metas_tab. >>>>> And it is possible that that particular record has been freed by >>>>> btf_struct_metas_free() and hence we have a kasan error here. >>>>> >>>>> Actually it is very hard to reproduce the failure with current >>>>> bpf/bpf-next >>>>> code, I only got the above error once. To increase reproducibility, >>>>> I added >>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), >>>>> which >>>>> significantly increased reproducibility. >>>>> >>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >>>>> index 5e43ddd1b83f..aae5b5213e93 100644 >>>>> --- a/kernel/bpf/syscall.c >>>>> +++ b/kernel/bpf/syscall.c >>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct >>>>> work_struct *work) >>>>> struct bpf_map *map = container_of(work, struct bpf_map, >>>>> work); >>>>> struct btf_record *rec = map->record; >>>>> >>>>> + mdelay(100); >>>>> security_bpf_map_free(map); >>>>> bpf_map_release_memcg(map); >>>>> /* implementation dependent freeing */ >>>>> >>>>> To fix the problem, we need to have a reference on btf in order to >>>>> safeguard accessing field->graph_root.value_rec in >>>>> map->ops->map_free(). >>>>> The function btf_parse_graph_root() is the place to get a btf >>>>> reference. >>>>> The following are rough call stacks reaching bpf_parse_graph_root(): >>>>> >>>>> btf_parse >>>>> ... >>>>> btf_parse_fields >>>>> ... >>>>> btf_parse_graph_root >>>>> >>>>> map_check_btf >>>>> btf_parse_fields >>>>> ... >>>>> btf_parse_graph_root >>>>> >>>>> Looking at the above call stack, the btf_parse_graph_root() is >>>>> indirectly >>>>> called from btf_parse() or map_check_btf(). >>>>> >>>>> We cannot take a reference in btf_parse() case since at that moment, >>>>> btf is still in the middle to self-validation and initial reference >>>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet. >>>> Thanks for the details analysis and clear explanation. It helps a lot. >>>> >>>> Sorry for jumping in late. >>>> >>>> I am trying to avoid making a special case for "bool has_btf_ref;" >>>> and "bool from_map_check". It seems to a bit too much to deal with >>>> the error path for btf_parse(). > Maybe we could move the common btf used by kptr and graph_root into > bpf_record and let the callers of btf_parse_fields() and > btf_record_free() to decide the life cycle of btf in btf_record, so > there will be less intrusive and less special case. The following is the I didn't fully check the code but looks like we took a btf reference at map_check_btf() and free it at the end of bpf_map_free_deferred(). This is similar to my v1 patch, not exactly the same but similar since they all do btf_put() at the end of bpf_map_free_deferred(). Through discussion, doing on-demand btf_get()/btf_put() approach, similar to kptr approach, seems more favored. 
This also has advantage to free btf at its earlist possible point. > code snippets for the idea (only simply tested): > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 1ad0ed623f50..a0c4d696a231 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -230,6 +230,7 @@ struct btf_record { > int spin_lock_off; > int timer_off; > int refcount_off; > + struct btf *btf; > struct btf_field fields[]; > }; > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c > index 966dace27fae..08b4a2784ee4 100644 > --- a/kernel/bpf/btf.c > +++ b/kernel/bpf/btf.c > @@ -3598,7 +3598,6 @@ static int btf_parse_kptr(struct btf *btf, struct > btf_field *field, > field->kptr.dtor = NULL; > id = info->kptr.type_id; > kptr_btf = btf; > - btf_get(kptr_btf); > goto found_dtor; > } > if (id < 0) > @@ -3753,6 +3752,7 @@ struct btf_record *btf_parse_fields(struct btf > *btf, const struct btf_type *t, > if (!rec) > return ERR_PTR(-ENOMEM); > > + rec->btf = btf; > rec->spin_lock_off = -EINVAL; > rec->timer_off = -EINVAL; > rec->refcount_off = -EINVAL; > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c > index 8ef269e66ba5..d9f4198158b6 100644 > --- a/kernel/bpf/map_in_map.c > +++ b/kernel/bpf/map_in_map.c > @@ -56,6 +56,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) > ret = PTR_ERR(inner_map_meta->record); > goto free; > } > + if (inner_map_meta->record) > + btf_get(inner_map_meta->record->btf); > + > /* Note: We must use the same BTF, as we also used > btf_record_dup above > * which relies on BTF being same for both maps, as some members > like > * record->fields.list_head have pointers like value_rec > pointing into > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 30ed7756fc71..d2641e51a1a7 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -516,7 +516,8 @@ void btf_record_free(struct btf_record *rec) > case BPF_KPTR_PERCPU: > if (rec->fields[i].kptr.module) > module_put(rec->fields[i].kptr.module); > - btf_put(rec->fields[i].kptr.btf); > + if (rec->fields[i].kptr.btf != rec->btf) > + btf_put(rec->fields[i].kptr.btf); > break; > case BPF_LIST_HEAD: > case BPF_LIST_NODE: > @@ -537,8 +538,13 @@ void btf_record_free(struct btf_record *rec) > > void bpf_map_free_record(struct bpf_map *map) > { > + struct btf *btf = NULL; > + > + if (!IS_ERR_OR_NULL(map->record)) > + btf = map->record->btf; > btf_record_free(map->record); > map->record = NULL; > + btf_put(btf); > } > > struct btf_record *btf_record_dup(const struct btf_record *rec) > @@ -561,7 +567,8 @@ struct btf_record *btf_record_dup(const struct > btf_record *rec) > case BPF_KPTR_UNREF: > case BPF_KPTR_REF: > case BPF_KPTR_PERCPU: > - btf_get(fields[i].kptr.btf); > + if (fields[i].kptr.btf != rec->btf) > + btf_get(fields[i].kptr.btf); > if (fields[i].kptr.module && > !try_module_get(fields[i].kptr.module)) { > ret = -ENXIO; > goto free; > @@ -692,7 +699,10 @@ static void bpf_map_free_deferred(struct > work_struct *work) > { > struct bpf_map *map = container_of(work, struct bpf_map, work); > struct btf_record *rec = map->record; > + struct btf *btf = NULL; > > + if (!IS_ERR_OR_NULL(rec)) > + btf = rec->btf; > security_bpf_map_free(map); > bpf_map_release_memcg(map); > /* implementation dependent freeing */ > @@ -707,6 +717,7 @@ static void bpf_map_free_deferred(struct work_struct > *work) > * template bpf_map struct used during verification. 
> */ > btf_record_free(rec); > + btf_put(btf); > } > > static void bpf_map_put_uref(struct bpf_map *map) > @@ -1036,6 +1047,8 @@ static int map_check_btf(struct bpf_map *map, > struct btf *btf, > if (!IS_ERR_OR_NULL(map->record)) { > int i; > > + btf_get(map->record->btf); > + > if (!bpf_capable()) { > ret = -EPERM; > goto free_map_tab; > > > > WDYT ? > >>>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse >>>> help? >>> No, it does not. The core reason is what Hao is mentioned in >>> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/ >>> >>> We simply cannot take btf reference if called from btf_parse(). >>> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() >>> so we take ref for btf during btf_parse_fields(), then we have >>> btf_put <=== expect refcount == 0 to start the destruction process >>> ... >>> btf_record_free <=== in which if graph_root, a btf reference >>> will be hold >>> so btf_put will never be able to actually free btf data. >>> Yes, the kasan problem will be resolved but we leak memory. >> Let me send another version with better commit message. > >> [...]
On 12/8/23 8:45 AM, Yonghong Song wrote: > > On 12/8/23 12:16 AM, Martin KaFai Lau wrote: >> On 12/7/23 7:59 PM, Yonghong Song wrote: >>>> >>>> I am trying to avoid making a special case for "bool has_btf_ref;" and "bool >>>> from_map_check". It seems to a bit too much to deal with the error path for >>>> btf_parse(). >>>> >>>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help? >>> >>> No, it does not. The core reason is what Hao is mentioned in >>> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/ >>> We simply cannot take btf reference if called from btf_parse(). >>> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse() >>> so we take ref for btf during btf_parse_fields(), then we have >>> btf_put <=== expect refcount == 0 to start the destruction process >>> ... >>> btf_record_free <=== in which if graph_root, a btf reference will >>> be hold >>> so btf_put will never be able to actually free btf data. >> >> ah. There is a loop like btf->struct_meta_tab->...btf. >> >>> Yes, the kasan problem will be resolved but we leak memory. >>> >>>> >>>>> It is also unnecessary to take a reference since the value_rec is >>>>> referring to a record in struct_meta_tab. >>>> >>>> If we optimize for not taking a refcnt, how about not taking a refcnt for >>>> all cases and postpone the btf_put(), instead of taking refcnt in one case >>>> but not another. Like your fix in v1. The failed selftest can be changed or >>>> even removed if it does not make sense anymore. >>> >>> After a couple of iterations, I think taking necessary reference approach >>> sounds better >>> and this will be consistent with how kptr is handled. For kptr, btf_parse >>> will ignore it. >> >> Got it. It is why kptr.btf got away with the loop. >> >> On the other hand, am I reading it correctly that kptr.btf only needs to take >> the refcnt for btf that is btf_is_kernel()? > > No. besides vmlinux and module btf, it also takes reference for prog btf, see > > static int btf_parse_kptr(const struct btf *btf, struct btf_field *field, > struct btf_field_info *info) > { > ... > if (id == -ENOENT) { > /* btf_parse_kptr should only be called w/ btf = program BTF */ > WARN_ON_ONCE(btf_is_kernel(btf)); > /* Type exists only in program BTF. Assume that it's a MEM_ALLOC > * kptr allocated via bpf_obj_new > */ > field->kptr.dtor = NULL; > id = info->kptr.type_id; > kptr_btf = (struct btf *)btf; > btf_get(kptr_btf); I meant only kernel/module btf needs to take the refcnt, so there is no need to take the refcnt here for the (it)self btf. Sorry that I was not clear in my earlier comment. The record is capturing something either in the self btf or something in the kernel btf. The field->kptr.kptr is the one that may either point to a kernel or self btf, so it should be the only case that needs to check the following in btf_record_free(): if (btf_is_kernel(rec->fields[i].kptr.btf)) btf_put(rec->fields[i].kptr.btf); All other cases the record has a self btf (including field->graph_root.btf). The owner (map here) needs to ensure the self btf is freed after the record is freed. I was thinking if it can avoid doing different things based on where btf_parse_fields() is called by separating what type of btf always needs refcnt or not. Agree the approach in this patch will fix the issue also and I have acked v5. Thanks for the fix. > goto found_dtor; > } > ... 
> } > >> >>> Unfortunately, for graph_root (list_head, rb_root), btf_parse and map_check >>> will both >>> process it and that adds a little bit complexity. >>> Alexei also suggested the same taking reference approach: >>> https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/ >>
On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@linux.dev> wrote: > > > On 12/8/23 12:30 AM, Hou Tao wrote: > > Hi, > > > > On 12/8/2023 12:02 PM, Yonghong Song wrote: > >> On 12/7/23 7:59 PM, Yonghong Song wrote: > >>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote: > >>>> On 12/6/23 1:09 PM, Yonghong Song wrote: > >>>>> When running `./test_progs -j` in my local vm with latest kernel, > >>>>> I once hit a kasan error like below: > >>>>> > >>>>> > > SNIP > >>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with > >>>>> following code: > >>>>> > >>>>> meta = btf_find_struct_meta(btf, btf_id); > >>>>> if (!meta) > >>>>> return -EFAULT; > >>>>> rec->fields[i].graph_root.value_rec = meta->record; > >>>>> > >>>>> So basically, 'value_rec' is a pointer to the record in > >>>>> struct_metas_tab. > >>>>> And it is possible that that particular record has been freed by > >>>>> btf_struct_metas_free() and hence we have a kasan error here. > >>>>> > >>>>> Actually it is very hard to reproduce the failure with current > >>>>> bpf/bpf-next > >>>>> code, I only got the above error once. To increase reproducibility, > >>>>> I added > >>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), > >>>>> which > >>>>> significantly increased reproducibility. > >>>>> > >>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > >>>>> index 5e43ddd1b83f..aae5b5213e93 100644 > >>>>> --- a/kernel/bpf/syscall.c > >>>>> +++ b/kernel/bpf/syscall.c > >>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct > >>>>> work_struct *work) > >>>>> struct bpf_map *map = container_of(work, struct bpf_map, > >>>>> work); > >>>>> struct btf_record *rec = map->record; > >>>>> > >>>>> + mdelay(100); > >>>>> security_bpf_map_free(map); > >>>>> bpf_map_release_memcg(map); > >>>>> /* implementation dependent freeing */ > >>>>> > >>>>> To fix the problem, we need to have a reference on btf in order to > >>>>> safeguard accessing field->graph_root.value_rec in > >>>>> map->ops->map_free(). > >>>>> The function btf_parse_graph_root() is the place to get a btf > >>>>> reference. > >>>>> The following are rough call stacks reaching bpf_parse_graph_root(): > >>>>> > >>>>> btf_parse > >>>>> ... > >>>>> btf_parse_fields > >>>>> ... > >>>>> btf_parse_graph_root > >>>>> > >>>>> map_check_btf > >>>>> btf_parse_fields > >>>>> ... > >>>>> btf_parse_graph_root > >>>>> > >>>>> Looking at the above call stack, the btf_parse_graph_root() is > >>>>> indirectly > >>>>> called from btf_parse() or map_check_btf(). > >>>>> > >>>>> We cannot take a reference in btf_parse() case since at that moment, > >>>>> btf is still in the middle to self-validation and initial reference > >>>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet. > >>>> Thanks for the details analysis and clear explanation. It helps a lot. > >>>> > >>>> Sorry for jumping in late. > >>>> > >>>> I am trying to avoid making a special case for "bool has_btf_ref;" > >>>> and "bool from_map_check". It seems to a bit too much to deal with > >>>> the error path for btf_parse(). > > Maybe we could move the common btf used by kptr and graph_root into > > bpf_record and let the callers of btf_parse_fields() and > > btf_record_free() to decide the life cycle of btf in btf_record, so > > there will be less intrusive and less special case. The following is the > > I didn't fully check the code but looks like we took a > btf reference at map_check_btf() and free it at the end > of bpf_map_free_deferred(). 
This is similar to my v1 patch, > not exactly the same but similar since they all do > btf_put() at the end of bpf_map_free_deferred(). > > Through discussion, doing on-demand btf_get()/btf_put() > approach, similar to kptr approach, seems more favored. > This also has advantage to free btf at its earlist possible > point.
Sorry. Looks like I recommended the wrong path.
The approach of btf_parse_fields(... false | true) depending on where it's called and whether the returned struct btf_record * will be kept within a type or within a map is pushing complexity too far. A year from now we'll forget these subtle details.
There is an advantage to doing btf_put() early in bpf_map_put(), but in the common case it would be delayed until just after queue_work, which is a minor time delay. For free_after_mult_rcu_gp the delay is much longer, but the savings from freeing btf early are minor compared to the map itself.
I think it's cleaner to go back to v1 and simply move btf_put to bpf_map_free_deferred(). A lot fewer things to worry about. Especially considering that BPF_RB_ROOT may not be the last such special record-keeping type, and every new type would need to think hard about whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like. v1 avoids this future complexity.
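For reference, the direction described here (drop the extra-reference machinery and instead delay the existing btf_put()) would look roughly like the sketch below: move the btf_put() that bpf_map_put() currently does into bpf_map_free_deferred(), after map->ops->map_free() and btf_record_free(). This is only an illustration of the direction, not the actual v1/v6 patch; line numbers and unrelated context are elided.

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct btf *btf = map->btf;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
 	...
 	btf_record_free(rec);
+	/* Delay freeing of btf until map_free() and btf_record_free() are done,
+	 * since map_free() may still look at records owned by this btf.
+	 */
+	btf_put(btf);
 }
 
@@ void bpf_map_put(struct bpf_map *map)
 	if (atomic64_dec_and_test(&map->refcnt)) {
 		bpf_map_free_id(map);
-		btf_put(map->btf);
 		...
 	}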
On 12/13/23 8:17 PM, Alexei Starovoitov wrote: > On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@linux.dev> wrote: >> >> On 12/8/23 12:30 AM, Hou Tao wrote: >>> Hi, >>> >>> On 12/8/2023 12:02 PM, Yonghong Song wrote: >>>> On 12/7/23 7:59 PM, Yonghong Song wrote: >>>>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote: >>>>>> On 12/6/23 1:09 PM, Yonghong Song wrote: >>>>>>> When running `./test_progs -j` in my local vm with latest kernel, >>>>>>> I once hit a kasan error like below: >>>>>>> >>>>>>> >>> SNIP >>>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with >>>>>>> following code: >>>>>>> >>>>>>> meta = btf_find_struct_meta(btf, btf_id); >>>>>>> if (!meta) >>>>>>> return -EFAULT; >>>>>>> rec->fields[i].graph_root.value_rec = meta->record; >>>>>>> >>>>>>> So basically, 'value_rec' is a pointer to the record in >>>>>>> struct_metas_tab. >>>>>>> And it is possible that that particular record has been freed by >>>>>>> btf_struct_metas_free() and hence we have a kasan error here. >>>>>>> >>>>>>> Actually it is very hard to reproduce the failure with current >>>>>>> bpf/bpf-next >>>>>>> code, I only got the above error once. To increase reproducibility, >>>>>>> I added >>>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), >>>>>>> which >>>>>>> significantly increased reproducibility. >>>>>>> >>>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >>>>>>> index 5e43ddd1b83f..aae5b5213e93 100644 >>>>>>> --- a/kernel/bpf/syscall.c >>>>>>> +++ b/kernel/bpf/syscall.c >>>>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct >>>>>>> work_struct *work) >>>>>>> struct bpf_map *map = container_of(work, struct bpf_map, >>>>>>> work); >>>>>>> struct btf_record *rec = map->record; >>>>>>> >>>>>>> + mdelay(100); >>>>>>> security_bpf_map_free(map); >>>>>>> bpf_map_release_memcg(map); >>>>>>> /* implementation dependent freeing */ >>>>>>> >>>>>>> To fix the problem, we need to have a reference on btf in order to >>>>>>> safeguard accessing field->graph_root.value_rec in >>>>>>> map->ops->map_free(). >>>>>>> The function btf_parse_graph_root() is the place to get a btf >>>>>>> reference. >>>>>>> The following are rough call stacks reaching bpf_parse_graph_root(): >>>>>>> >>>>>>> btf_parse >>>>>>> ... >>>>>>> btf_parse_fields >>>>>>> ... >>>>>>> btf_parse_graph_root >>>>>>> >>>>>>> map_check_btf >>>>>>> btf_parse_fields >>>>>>> ... >>>>>>> btf_parse_graph_root >>>>>>> >>>>>>> Looking at the above call stack, the btf_parse_graph_root() is >>>>>>> indirectly >>>>>>> called from btf_parse() or map_check_btf(). >>>>>>> >>>>>>> We cannot take a reference in btf_parse() case since at that moment, >>>>>>> btf is still in the middle to self-validation and initial reference >>>>>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet. >>>>>> Thanks for the details analysis and clear explanation. It helps a lot. >>>>>> >>>>>> Sorry for jumping in late. >>>>>> >>>>>> I am trying to avoid making a special case for "bool has_btf_ref;" >>>>>> and "bool from_map_check". It seems to a bit too much to deal with >>>>>> the error path for btf_parse(). >>> Maybe we could move the common btf used by kptr and graph_root into >>> bpf_record and let the callers of btf_parse_fields() and >>> btf_record_free() to decide the life cycle of btf in btf_record, so >>> there will be less intrusive and less special case. 
>>> The following is the
>>
>> I didn't fully check the code but looks like we took a
>> btf reference at map_check_btf() and free it at the end
>> of bpf_map_free_deferred(). This is similar to my v1 patch,
>> not exactly the same but similar since they all do
>> btf_put() at the end of bpf_map_free_deferred().
>>
>> Through discussion, doing the on-demand btf_get()/btf_put()
>> approach, similar to the kptr approach, seems more favored.
>> This also has the advantage of freeing the btf at its earliest
>> possible point.
> Sorry. Looks like I recommended the wrong path.
>
> The approach of btf_parse_fields(... false | true)
> depending on where it's called and whether the returned struct btf_record *
> will be kept within a type or within a map
> is pushing complexity too far.
> A year from now we'll forget these subtle details.
> There is an advantage to doing btf_put() early in bpf_map_put(),
> but in the common case it would be delayed until just after queue_work,
> which is a minor time delay.
> For free_after_mult_rcu_gp it is much longer,
> but the savings from freeing the btf are minor compared to the map itself.
>
> I think it's cleaner to go back to v1 and simply move btf_put()
> to bpf_map_free_deferred().
> A lot fewer things to worry about.
> Especially considering that BPF_RB_ROOT may not be the last such special
> record-keeping type, and every new type would need to think
> hard about whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like.
> v1 avoids this future complexity.

No problem. Will send v6 tomorrow with v1.
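
As a rough illustration of Hou's alternative quoted above (moving the shared btf into the btf_record itself), the shape would be something like the following. The extra member and its ownership comment are assumptions added here for illustration, not code from any posted patch; the existing members reflect the struct btf_record layout of that kernel era:

  struct btf_record {
          u32 cnt;
          u32 field_mask;
          int spin_lock_off;
          int timer_off;
          int refcount_off;
          /* Hypothetical: one btf shared by all kptr/graph_root fields in
           * this record; whoever owns the record (a map, or the btf's own
           * struct_meta_tab) decides whether this pointer holds a reference.
           */
          struct btf *btf;
          struct btf_field fields[];
  };
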
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 10e5e4d8a00f..6689b1b15993 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@ struct btf_field_graph_root {
 	u32 value_btf_id;
 	u32 node_offset;
 	struct btf_record *value_rec;
+	bool has_btf_ref;
 };
 
 struct btf_field {
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 59d404e22814..f7d4f6594308 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -217,7 +217,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
-				    u32 field_mask, u32 value_size);
+				    u32 field_mask, u32 value_size, bool from_map_check);
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 63cf4128fc05..f23d6a7b8b93 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3665,7 +3665,8 @@ static int btf_parse_graph_root(const struct btf *btf,
 				struct btf_field *field,
 				struct btf_field_info *info,
 				const char *node_type_name,
-				size_t node_type_align)
+				size_t node_type_align,
+				bool from_map_check)
 {
 	const struct btf_type *t, *n = NULL;
 	const struct btf_member *member;
@@ -3696,6 +3697,9 @@ static int btf_parse_graph_root(const struct btf *btf,
 	if (offset % node_type_align)
 		return -EINVAL;
 
+	if (from_map_check)
+		btf_get((struct btf *)btf);
+	field->graph_root.has_btf_ref = from_map_check;
 	field->graph_root.btf = (struct btf *)btf;
 	field->graph_root.value_btf_id = info->graph_root.value_btf_id;
 	field->graph_root.node_offset = offset;
@@ -3706,17 +3710,19 @@ static int btf_parse_graph_root(const struct btf *btf,
 }
 
 static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
-			       struct btf_field_info *info)
+			       struct btf_field_info *info, bool from_map_check)
 {
 	return btf_parse_graph_root(btf, field, info, "bpf_list_node",
-				    __alignof__(struct bpf_list_node));
+				    __alignof__(struct bpf_list_node),
+				    from_map_check);
 }
 
 static int btf_parse_rb_root(const struct btf *btf, struct btf_field *field,
-			     struct btf_field_info *info)
+			     struct btf_field_info *info, bool from_map_check)
 {
 	return btf_parse_graph_root(btf, field, info, "bpf_rb_node",
-				    __alignof__(struct bpf_rb_node));
+				    __alignof__(struct bpf_rb_node),
+				    from_map_check);
 }
 
 static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
@@ -3732,7 +3738,7 @@ static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
 }
 
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
-				    u32 field_mask, u32 value_size)
+				    u32 field_mask, u32 value_size, bool from_map_check)
 {
 	struct btf_field_info info_arr[BTF_FIELDS_MAX];
 	u32 next_off = 0, field_type_size;
@@ -3798,12 +3804,14 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 				goto end;
 			break;
 		case BPF_LIST_HEAD:
-			ret = btf_parse_list_head(btf, &rec->fields[i], &info_arr[i]);
+			ret = btf_parse_list_head(btf, &rec->fields[i], &info_arr[i],
+						  from_map_check);
 			if (ret < 0)
 				goto end;
 			break;
 		case BPF_RB_ROOT:
-			ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i]);
+			ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i],
+						from_map_check);
 			if (ret < 0)
 				goto end;
 			break;
@@ -5390,7 +5398,8 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 		type = &tab->types[tab->cnt];
 		type->btf_id = i;
 		record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
-						  BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT, t->size);
+						  BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT, t->size,
+						  false);
 		/* The record cannot be unset, treat it as an error if so */
 		if (IS_ERR_OR_NULL(record)) {
 			ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ebaccf77d56e..33eaa3cd32e4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -521,8 +521,11 @@ void btf_record_free(struct btf_record *rec)
 			btf_put(rec->fields[i].kptr.btf);
 			break;
 		case BPF_LIST_HEAD:
-		case BPF_LIST_NODE:
 		case BPF_RB_ROOT:
+			if (rec->fields[i].graph_root.has_btf_ref)
+				btf_put(rec->fields[i].graph_root.btf);
+			break;
+		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
 		case BPF_TIMER:
@@ -570,8 +573,11 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 			}
 			break;
 		case BPF_LIST_HEAD:
-		case BPF_LIST_NODE:
 		case BPF_RB_ROOT:
+			if (fields[i].graph_root.has_btf_ref)
+				btf_get(fields[i].graph_root.btf);
+			break;
+		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
 		case BPF_TIMER:
@@ -1034,7 +1040,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
 				       BPF_RB_ROOT | BPF_REFCOUNT,
-				       map->value_size);
+				       map->value_size, true);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
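
For readers skimming the diff above, the reference pairing it introduces is split across two files; the following is just a condensed view of those hunks stitched together for readability, with no new logic:

  /* kernel/bpf/btf.c, btf_parse_graph_root(): only the map_check_btf() path
   * (from_map_check == true) pins the btf that value_rec will point into.
   */
  if (from_map_check)
          btf_get((struct btf *)btf);
  field->graph_root.has_btf_ref = from_map_check;

  /* kernel/bpf/syscall.c, btf_record_free() (and symmetrically
   * btf_record_dup()): drop (or take) the reference only when it was
   * actually pinned.
   */
  case BPF_LIST_HEAD:
  case BPF_RB_ROOT:
          if (rec->fields[i].graph_root.has_btf_ref)
                  btf_put(rec->fields[i].graph_root.btf);
          break;
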
When running `./test_progs -j` in my local vm with the latest kernel,
I once hit a kasan error like below:

  [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
  [ 1887.186498]
  [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494
  [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
  [ 1887.190341] Call Trace:
  [ 1887.190666] <TASK>
  [ 1887.190949] dump_stack_lvl+0xac/0xe0
  [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0
  [ 1887.192019] ? panic+0x3c0/0x3c0
  [ 1887.192449] print_report+0x14f/0x720
  [ 1887.192930] ? preempt_count_sub+0x1c/0xd0
  [ 1887.193459] ? __virt_addr_valid+0xac/0x120
  [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.194572] kasan_report+0xc3/0x100
  [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0
  [ 1887.196736] ? preempt_count_sub+0x1c/0xd0
  [ 1887.197270] ? preempt_count_sub+0x1c/0xd0
  [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40
  [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260
  [ 1887.198883] array_map_free+0x1a3/0x260
  [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0
  [ 1887.199943] process_scheduled_works+0x3a2/0x6c0
  [ 1887.200549] worker_thread+0x633/0x890
  [ 1887.201047] ? __kthread_parkme+0xd7/0xf0
  [ 1887.201574] ? kthread+0x102/0x1d0
  [ 1887.202020] kthread+0x1ab/0x1d0
  [ 1887.202447] ? pr_cont_work+0x270/0x270
  [ 1887.202954] ? kthread_blkcg+0x50/0x50
  [ 1887.203444] ret_from_fork+0x34/0x50
  [ 1887.203914] ? kthread_blkcg+0x50/0x50
  [ 1887.204397] ret_from_fork_asm+0x11/0x20
  [ 1887.204913] </TASK>
  [ 1887.204913] </TASK>
  [ 1887.205209]
  [ 1887.205416] Allocated by task 2197:
  [ 1887.205881] kasan_set_track+0x3f/0x60
  [ 1887.206366] __kasan_kmalloc+0x6e/0x80
  [ 1887.206856] __kmalloc+0xac/0x1a0
  [ 1887.207293] btf_parse_fields+0xa15/0x1480
  [ 1887.207836] btf_parse_struct_metas+0x566/0x670
  [ 1887.208387] btf_new_fd+0x294/0x4d0
  [ 1887.208851] __sys_bpf+0x4ba/0x600
  [ 1887.209292] __x64_sys_bpf+0x41/0x50
  [ 1887.209762] do_syscall_64+0x4c/0xf0
  [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b
  [ 1887.210868]
  [ 1887.211074] Freed by task 36:
  [ 1887.211460] kasan_set_track+0x3f/0x60
  [ 1887.211951] kasan_save_free_info+0x28/0x40
  [ 1887.212485] ____kasan_slab_free+0x101/0x180
  [ 1887.213027] __kmem_cache_free+0xe4/0x210
  [ 1887.213514] btf_free+0x5b/0x130
  [ 1887.213918] rcu_core+0x638/0xcc0
  [ 1887.214347] __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

  00000000000034c0 <bpf_rb_root_free>:
  ; {
    34c0: f3 0f 1e fa                   endbr64
    34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
    34c9: 55                            pushq   %rbp
    34ca: 48 89 e5                      movq    %rsp, %rbp
  ...
  ;       if (rec && rec->refcount_off >= 0 &&
    36aa: 4d 85 ed                      testq   %r13, %r13
    36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
    36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
    36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
          <==== kasan function
    36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
          <==== use-after-free load
    36bc: 45 85 ff                      testl   %r15d, %r15d
    36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is at rec->refcount_off in the above.

I did some source code analysis and found the reason:

  CPU A                               CPU B
  bpf_map_put:
    ...
    btf_put with rcu callback
    ...
    bpf_map_free_deferred
      with system_unbound_wq
    ...                ...                        ...
    ...                btf_free_rcu:              ...
    ...                ...                        bpf_map_free_deferred:
    ...                ...
    ...   --------->   btf_struct_metas_free()
    ...   | race condition                        ...
    ...   --------->                              map->ops->map_free()
    ...
    ...                btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free() and eventually
calls bpf_rb_root_free(), which does:

  ...
  __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
  ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the
following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
          return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to a record in struct_meta_tab.
It is possible that that particular record has been freed by
btf_struct_metas_free(), and hence we get the kasan error here.

Actually it is very hard to reproduce the failure with the current
bpf/bpf-next code; I only got the above error once. To make it easier to
reproduce, I added a delay in bpf_map_free_deferred() to delay
map->ops->map_free(), which significantly increased reproducibility:

  diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
  index 5e43ddd1b83f..aae5b5213e93 100644
  --- a/kernel/bpf/syscall.c
  +++ b/kernel/bpf/syscall.c
  @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
          struct bpf_map *map = container_of(work, struct bpf_map, work);
          struct btf_record *rec = map->record;

  +       mdelay(100);
          security_bpf_map_free(map);
          bpf_map_release_memcg(map);
          /* implementation dependent freeing */

To fix the problem, we need to hold a reference on the btf in order to
safeguard accessing field->graph_root.value_rec in map->ops->map_free().
The function btf_parse_graph_root() is the place to get a btf reference.
The following are rough call stacks reaching btf_parse_graph_root():

  btf_parse
    ...
      btf_parse_fields
        ...
          btf_parse_graph_root

  map_check_btf
    btf_parse_fields
      ...
        btf_parse_graph_root

Looking at the above call stacks, btf_parse_graph_root() is indirectly
called from btf_parse() or map_check_btf().

We cannot take a reference in the btf_parse() case since at that moment
the btf is still in the middle of self-validation and the initial reference
(refcount_set(&btf->refcnt, 1)) has not been set up yet. It is also
unnecessary to take a reference there, since the value_rec refers to a
record in the btf's own struct_meta_tab. But we do need to take a reference
in the map_check_btf() case, as the value_rec refers to a record in
struct_meta_tab and we want to make sure the btf is not freed until the
value_rec usage in map_free is done.

So the fix is to get a btf reference for the map_check_btf() case whenever
a graph_root is found. A boolean field 'has_btf_ref' is added to struct
btf_field_graph_root so that the btf reference can be properly released
later.

I reran './test_progs -j' with the above mdelay() hack a couple of times
and did not observe the error. Running Hou's test ([1]) is also successful.

  [1] https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@huaweicloud.com/

Cc: Hou Tao <houtao@huaweicloud.com>
Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h  |  1 +
 include/linux/btf.h  |  2 +-
 kernel/bpf/btf.c     | 27 ++++++++++++++++++---------
 kernel/bpf/syscall.c | 12 +++++++++---
 4 files changed, 29 insertions(+), 13 deletions(-)
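
Not part of the patch or the thread, but for context: the kind of BPF object that exercises both sides of this race is a map whose value embeds a bpf_rb_root, since loading it drives map_check_btf() -> btf_parse_fields() -> btf_parse_graph_root(), and destroying it drives array_map_free() -> bpf_obj_free_fields() -> bpf_rb_root_free(). A rough declaration in the style of the BPF selftests (assuming vmlinux.h, bpf_helpers.h, and the selftests' bpf_experimental.h for the __contains() tag) might look like:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include "bpf_experimental.h"

  struct node_data {
          struct bpf_rb_node node;
          int key;
  };

  struct map_value {
          struct bpf_spin_lock lock;
          /* rb_root in a map value; __contains() ties it to node_data.node */
          struct bpf_rb_root root __contains(node_data, node);
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct map_value);
  } rb_root_map SEC(".maps");

  char _license[] SEC("license") = "GPL";
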