[bpf] bpf: Fix a race condition between btf_put() and map_free()

Message ID 20231204173946.3066377-1-yonghong.song@linux.dev (mailing list archive)
State Superseded
Delegated to: BPF
Series [bpf] bpf: Fix a race condition between btf_put() and map_free()

Checks

Context Check Description
bpf/vmtest-bpf-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-2 success Logs for Validate matrix.py
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1117 this patch: 1117
netdev/cc_maintainers fail 1 blamed authors not CCed: memxor@gmail.com; 8 maintainers not CCed: haoluo@google.com jolsa@kernel.org kpsingh@kernel.org memxor@gmail.com john.fastabend@gmail.com martin.lau@linux.dev song@kernel.org sdf@google.com
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch fail ERROR: Avoid using diff content in the commit message - patch(1) might not work WARNING: Possible repeated word: 'that'
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-5 fail Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-6 fail Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-VM_Test-10 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-11 fail Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-18 fail Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-26 fail Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-PR fail PR summary

Commit Message

Yonghong Song Dec. 4, 2023, 5:39 p.m. UTC
When running `./test_progs -j` in my local VM with the latest kernel,
I once hit a KASAN error like the one below:

  [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
  [ 1887.186498]
  [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
  [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
  [ 1887.190341] Call Trace:
  [ 1887.190666]  <TASK>
  [ 1887.190949]  dump_stack_lvl+0xac/0xe0
  [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
  [ 1887.192019]  ? panic+0x3c0/0x3c0
  [ 1887.192449]  print_report+0x14f/0x720
  [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
  [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.194572]  kasan_report+0xc3/0x100
  [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
  [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
  [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
  [ 1887.198883]  array_map_free+0x1a3/0x260
  [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
  [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
  [ 1887.200549]  worker_thread+0x633/0x890
  [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
  [ 1887.201574]  ? kthread+0x102/0x1d0
  [ 1887.202020]  kthread+0x1ab/0x1d0
  [ 1887.202447]  ? pr_cont_work+0x270/0x270
  [ 1887.202954]  ? kthread_blkcg+0x50/0x50
  [ 1887.203444]  ret_from_fork+0x34/0x50
  [ 1887.203914]  ? kthread_blkcg+0x50/0x50
  [ 1887.204397]  ret_from_fork_asm+0x11/0x20
  [ 1887.204913]  </TASK>
  [ 1887.205209]
  [ 1887.205416] Allocated by task 2197:
  [ 1887.205881]  kasan_set_track+0x3f/0x60
  [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
  [ 1887.206856]  __kmalloc+0xac/0x1a0
  [ 1887.207293]  btf_parse_fields+0xa15/0x1480
  [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
  [ 1887.208387]  btf_new_fd+0x294/0x4d0
  [ 1887.208851]  __sys_bpf+0x4ba/0x600
  [ 1887.209292]  __x64_sys_bpf+0x41/0x50
  [ 1887.209762]  do_syscall_64+0x4c/0xf0
  [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
  [ 1887.210868]
  [ 1887.211074] Freed by task 36:
  [ 1887.211460]  kasan_set_track+0x3f/0x60
  [ 1887.211951]  kasan_save_free_info+0x28/0x40
  [ 1887.212485]  ____kasan_slab_free+0x101/0x180
  [ 1887.213027]  __kmem_cache_free+0xe4/0x210
  [ 1887.213514]  btf_free+0x5b/0x130
  [ 1887.213918]  rcu_core+0x638/0xcc0
  [ 1887.214347]  __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

  00000000000034c0 <bpf_rb_root_free>:
  ; {
    34c0: f3 0f 1e fa                   endbr64
    34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
    34c9: 55                            pushq   %rbp
    34ca: 48 89 e5                      movq    %rsp, %rbp
  ...
  ;       if (rec && rec->refcount_off >= 0 &&
    36aa: 4d 85 ed                      testq   %r13, %r13
    36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
    36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
    36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                        <==== kasan function
    36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                        <==== use-after-free load
    36bc: 45 85 ff                      testl   %r15d, %r15d
    36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is the use-after-free read of rec->refcount_off in the above.
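
For reference, the faulting load corresponds to this check (a rough sketch
of __bpf_obj_drop_impl() based on the inlined source annotation in the
disassembly above, not an exact copy of the kernel source):

  /* 'rec' points into the btf's struct_metas_tab, which may already have
   * been freed by the btf_free() rcu callback at this point.
   */
  if (rec && rec->refcount_off >= 0 &&
      !refcount_dec_and_test((refcount_t *)(obj + rec->refcount_off))) {
          /* object is still referenced, do not free it yet */
          return;
  }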

I did some source code analysis and found the reason.
                                  CPU A                        CPU B
  bpf_map_put:
    ...
    btf_put with rcu callback
    ...
    bpf_map_free_deferred
      with system_unbound_wq
    ...                          ...                           ...
    ...                          btf_free_rcu:                 ...
    ...                          ...                           bpf_map_free_deferred:
    ...                          ...
    ...         --------->       btf_struct_metas_free()
    ...         | race condition ...
    ...         --------->                                     map->ops->map_free()
    ...
    ...                          btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free(), which eventually
calls bpf_rb_root_free(), which in turn calls:
  ...
  __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
  ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
    return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to a record in struct_metas_tab,
and it is possible that this particular record has already been freed by
btf_struct_metas_free(), hence the KASAN error here.
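
To illustrate the aliasing, here is a simplified sketch of the structures
involved (abbreviated; see include/linux/bpf.h for the real definitions):

  struct btf_struct_meta {
          u32 btf_id;
          struct btf_record *record;   /* what value_rec points to */
  };

  /* graph_root.value_rec aliases meta->record rather than holding its own
   * copy, so it dangles once the btf's struct_meta_tab is freed.
   */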

Actually, the failure is very hard to reproduce with the current bpf/bpf-next
code; I only got the above error once. To make it easier to reproduce, I added
a delay in bpf_map_free_deferred() before map->ops->map_free(), which
significantly increased the reproduction rate:

  diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
  index 5e43ddd1b83f..aae5b5213e93 100644
  --- a/kernel/bpf/syscall.c
  +++ b/kernel/bpf/syscall.c
  @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;

  +     mdelay(100);
        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */

To fix the problem, I moved btf_put() to after map->ops->map_free() to ensure
the struct_metas remain available during map_free(). I reran './test_progs -j'
with the above mdelay() hack a couple of times and did not observe the error.

Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/syscall.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Andrii Nakryiko Dec. 5, 2023, 12:42 a.m. UTC | #1
On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> When running `./test_progs -j` in my local vm with latest kernel,
> I once hit a kasan error like below:
>
> SNIP
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0ed286b8a0f0..9c6c3738adfe 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
>  {
>         struct bpf_map *map = container_of(work, struct bpf_map, work);
>         struct btf_record *rec = map->record;
> +       struct btf *btf = map->btf;
>
>         security_bpf_map_free(map);
>         bpf_map_release_memcg(map);
>         /* implementation dependent freeing */
>         map->ops->map_free(map);
> +       /* Delay freeing of btf for maps, as map_free callback may need
> +        * struct_meta info which will be freed with btf_put().
> +        */
> +       btf_put(btf);

The change makes sense to me, but logically I'd put it after
btf_record_free(rec), just in case some of the btf records ever refer
back to the map's BTF or something (and just in general to keep it the
very last thing that happens).


But it also seems like CI is not happy ([0]), please take a look, thanks!

  [0] https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532


>         /* Delay freeing of btf_record for maps, as map_free
>          * callback usually needs access to them. It is better to do it here
>          * than require each callback to do the free itself manually.
> @@ -727,7 +732,6 @@ void bpf_map_put(struct bpf_map *map)
>         if (atomic64_dec_and_test(&map->refcnt)) {
>                 /* bpf_map_free_id() must be called first */
>                 bpf_map_free_id(map);
> -               btf_put(map->btf);
>                 INIT_WORK(&map->work, bpf_map_free_deferred);
>                 /* Avoid spawning kworkers, since they all might contend
>                  * for the same mutex like slab_mutex.
> --
> 2.34.1
>
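
The reordering Andrii suggests would look roughly like this at the end of
bpf_map_free_deferred() (a sketch, not the posted patch):

  /* implementation dependent freeing */
  map->ops->map_free(map);
  /* free the btf_record first ... */
  btf_record_free(rec);
  /* ... and drop the map's btf reference as the very last step */
  btf_put(btf);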
Hou Tao Dec. 5, 2023, 1:31 a.m. UTC | #2
Hi,

On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> When running `./test_progs -j` in my local vm with latest kernel,
>> I once hit a kasan error like below:

SNIP
>> Actually, the failure is very hard to reproduce with the current bpf/bpf-next
>> code; I only got the above error once. To make it easier to reproduce, I added
>> a delay in bpf_map_free_deferred() before map->ops->map_free(), which
>> significantly increased the reproduction rate.

I also found this problem when developing the "fix the release of inner
map" patch set. I have written a selftest which can reliably reproduce
the problem by using map-in-map + bpf_list_head. The reason for using
map-in-map is to delay the release of the inner map via call_rcu() as
well, so the free of the bpf_map happens after the release of the btf.
I will post it later.
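
A rough sketch of what the BPF side of such a selftest could look like
(hypothetical names, assuming the usual selftest headers such as
bpf_experimental.h; the point is that the inner map value carries a
bpf_list_head, so freeing the inner map walks the map's btf_record):

  struct node_data {
          struct bpf_list_node node;
          int data;
  };

  struct inner_map_value {
          struct bpf_spin_lock lock;
          struct bpf_list_head head __contains(node_data, node);
  };

  struct inner_map_type {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct inner_map_value);
  } inner_map SEC(".maps");

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
          __uint(max_entries, 1);
          __type(key, int);
          __array(values, struct inner_map_type);
  } outer_map SEC(".maps") = {
          .values = { [0] = &inner_map },
  };
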
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 0ed286b8a0f0..9c6c3738adfe 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>  {
>>         struct bpf_map *map = container_of(work, struct bpf_map, work);
>>         struct btf_record *rec = map->record;
>> +       struct btf *btf = map->btf;
>>
>>         security_bpf_map_free(map);
>>         bpf_map_release_memcg(map);
>>         /* implementation dependent freeing */
>>         map->ops->map_free(map);
>> +       /* Delay freeing of btf for maps, as map_free callback may need
>> +        * struct_meta info which will be freed with btf_put().
>> +        */
>> +       btf_put(btf);
> The change makes sense to me, but logically I'd put it after
> btf_record_free(rec), just in case some of the btf records ever refer
> back to the map's BTF or something (and just in general to keep it the
> very last thing that happens).
>
>
> But it also seems like CI is not happy ([0]), please take a look, thanks!
>
>   [0] https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532

The patch delays the release of the BTF id of the bpf map, so test_btf_id
failed. Can we fix the problem by optionally pinning the btf in
btf_field_graph_root, just like btf_field_kptr does, so the map BTF will
still be alive until the invocation of btf_record_free()? We need to make
the pinning optional, because the btf_record may be contained in the btf
directly (namely btf->struct_meta_tab) and is freed through btf_free().
Yonghong Song Dec. 5, 2023, 3:58 a.m. UTC | #3
On 12/4/23 7:42 PM, Andrii Nakryiko wrote:
> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> When running `./test_progs -j` in my local vm with latest kernel,
>> I once hit a kasan error like below:
>>
>> SNIP
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 0ed286b8a0f0..9c6c3738adfe 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>   {
>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>          struct btf_record *rec = map->record;
>> +       struct btf *btf = map->btf;
>>
>>          security_bpf_map_free(map);
>>          bpf_map_release_memcg(map);
>>          /* implementation dependent freeing */
>>          map->ops->map_free(map);
>> +       /* Delay freeing of btf for maps, as map_free callback may need
>> +        * struct_meta info which will be freed with btf_put().
>> +        */
>> +       btf_put(btf);
> The change makes sense to me, but logically I'd put it after
> btf_record_free(rec), just in case some of the btf records ever refer
> back to the map's BTF or something (and just in general to keep it the
> very last thing that happens).

Currently it is safe, as btf_record_free() does not touch anything freed
by btf_put(). But I will surely put btf_put() at the end, just in case
btf_record_free() changes in the future.

>
>
> But it also seems like CI is not happy ([0]), please take a look, thanks!
>
>    [0] https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532

It is a timing issue. The patch makes it take a little bit longer to free
the btf, and the test may fail as it can still retrieve the btf_id although
it has been freed. Adding one kern_sync_rcu() in user space seems to make
it reliable again, at least tentatively.
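
For reference, the user-space adjustment could look roughly like this (a
sketch; kern_sync_rcu() is the existing helper from
tools/testing/selftests/bpf/testing_helpers.c, the rest is hypothetical):

  int fd;

  /* After the last map fd is closed, the map (and now its btf) is freed
   * via workqueue and RCU callbacks, so give those a chance to run before
   * asserting that the BTF id is gone.
   */
  close(map_fd);
  kern_sync_rcu();
  fd = bpf_btf_get_fd_by_id(btf_id);
  ASSERT_LT(fd, 0, "btf_fd_by_id");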

Yonghong Song Dec. 5, 2023, 4:15 a.m. UTC | #4
On 12/4/23 8:31 PM, Hou Tao wrote:
> Hi,
>
> On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
>> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> When running `./test_progs -j` in my local vm with latest kernel,
>>> I once hit a kasan error like below:
> SNIP
> I also found this problem when developing the "fix the release of inner
> map" patch set. I have written a selftest which can reliably reproduce
> the problem by using map-in-map + bpf_list_head. The reason for using
> map-in-map is to delay the release of the inner map via call_rcu() as
> well, so the free of the bpf_map happens after the release of the btf.
> I will post it later.
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 0ed286b8a0f0..9c6c3738adfe 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>   {
>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>          struct btf_record *rec = map->record;
>>> +       struct btf *btf = map->btf;
>>>
>>>          security_bpf_map_free(map);
>>>          bpf_map_release_memcg(map);
>>>          /* implementation dependent freeing */
>>>          map->ops->map_free(map);
>>> +       /* Delay freeing of btf for maps, as map_free callback may need
>>> +        * struct_meta info which will be freed with btf_put().
>>> +        */
>>> +       btf_put(btf);
>> The change makes sense to me, but logically I'd put it after
>> btf_record_free(rec), just in case some of the btf records ever refer
>> back to the map's BTF or something (and just in general to keep it the
>> very last thing that happens).
>>
>>
>> But it also seems like CI is not happy ([0]), please take a look, thanks!
>>
>>    [0] https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532
> The patch delays the release of the BTF id of the bpf map, so test_btf_id
> failed. Can we fix the problem by optionally pinning the btf in
> btf_field_graph_root, just like btf_field_kptr does, so the map BTF will
> still be alive until the invocation of btf_record_free()? We need to make
> the pinning optional, because the btf_record may be contained in the btf
> directly (namely btf->struct_meta_tab) and is freed through btf_free().

Thanks for the suggestion, I guess you want two cases:
  - if map->record won't access any btf data (e.g., btf->struct_meta_tab),
    we should keep the current btf_put() workflow,
  - if map->record accesses some btf data, we should call btf_put()
    immediately before or after btf_record_free().

This could be done, but we need to be careful to find all cases where
btf data might be accessed through map->record. The current approach
is simpler. I will post v2 with the change Andrii suggested and a fix
for the failed test.

If people really want to fine-tune this along the lines of the above two
cases, I can investigate too.

Hou Tao Dec. 5, 2023, 6:30 a.m. UTC | #5
Hi,

On 12/5/2023 12:15 PM, Yonghong Song wrote:
>
> On 12/4/23 8:31 PM, Hou Tao wrote:
>> Hi,
>>
>> On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
>>> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song
>>> <yonghong.song@linux.dev> wrote:
>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>> I once hit a kasan error like below:
>> SNIP
>> The patch delays the release of the BTF id of the bpf map, so test_btf_id
>> failed. Can we fix the problem by optionally pinning the btf in
>> btf_field_graph_root, just like btf_field_kptr does, so the map BTF will
>> still be alive until the invocation of btf_record_free()? We need to make
>> the pinning optional, because the btf_record may be contained in the btf
>> directly (namely btf->struct_meta_tab) and is freed through btf_free().
>
> Thanks for the suggestion, I guess you want two cases:
>   - if map->record won't access any btf data (e.g., btf->struct_meta_tab),
>     we should keep the current btf_put() workflow,
>   - if map->record accesses some btf data, we should call btf_put()
>     immediately before or after btf_record_free().

Er, that is not what I want, although I have written a similar patch in
which bpf_map_put() calls btf_put() and sets map->btf to NULL if there
are no BPF_LIST_HEAD or BPF_RB_ROOT fields in map->record, and otherwise
calls btf_put() in bpf_map_free_deferred(). What I have suggested is to
optionally pin the btf in graph_root.btf just like btf_field_kptr does.
Yonghong Song Dec. 5, 2023, 7:01 a.m. UTC | #6
On 12/5/23 1:30 AM, Hou Tao wrote:
> Hi,
>
> On 12/5/2023 12:15 PM, Yonghong Song wrote:
>> On 12/4/23 8:31 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
>>>> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song
>>>> <yonghong.song@linux.dev> wrote:
>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>> I once hit a kasan error like below:
>>> SNIP
>>> The patch delays the release of the BTF id of the bpf map, so test_btf_id
>>> failed. Can we fix the problem by optionally pinning the btf in
>>> btf_field_graph_root, just like btf_field_kptr does, so the map BTF will
>>> still be alive until the invocation of btf_record_free()? We need to make
>>> the pinning optional, because the btf_record may be contained in the btf
>>> directly (namely btf->struct_meta_tab) and is freed through btf_free().
>> Thanks for the suggestion, I guess you want two cases:
>>   - if map->record won't access any btf data (e.g., btf->struct_meta_tab),
>>     we should keep the current btf_put() workflow,
>>   - if map->record accesses some btf data, we should call btf_put()
>>     immediately before or after btf_record_free().
> Er, that is not what I want, although I have written a similar patch in
> which bpf_map_put() calls btf_put() and sets map->btf to NULL if there
> are no BPF_LIST_HEAD or BPF_RB_ROOT fields in map->record, and otherwise
> calls btf_put() in bpf_map_free_deferred(). What I have suggested is to
> optionally pin the btf in graph_root.btf just like btf_field_kptr does.

Okay, I see what you mean. This is actually along the lines of what I was
thinking below about identifying *all* cases where btf data might be
accessed. I didn't explicitly describe this approach in detail, but the
idea is to take a reference count on the btf and release it later during
btf_record_free(). I think this should work. I then need to do an audit
to find other potential places, if any exist, that need similar handling.
The current approach is simpler, but it looks like we can do better with
the existing btf_field_kptr approach.

Alexei Starovoitov Dec. 5, 2023, 9:13 p.m. UTC | #7
On Mon, Dec 4, 2023 at 11:01 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> > Er, that is not what I want, although I have written a similar patch in
> > which bpf_map_put() calls btf_put() and sets map->btf to NULL if there
> > are no BPF_LIST_HEAD or BPF_RB_ROOT fields in map->record, and otherwise
> > calls btf_put() in bpf_map_free_deferred(). What I have suggested is to
> > optionally pin the btf in graph_root.btf just like btf_field_kptr does.
>
> Okay, I see what you mean. This is actually along the lines of what I was
> thinking below about identifying *all* cases where btf data might be
> accessed. I didn't explicitly describe this approach in detail, but the
> idea is to take a reference count on the btf and release it later during
> btf_record_free(). I think this should work. I then need to do an audit
> to find other potential places, if any exist, that need similar handling.
> The current approach is simpler, but it looks like we can do better with
> the existing btf_field_kptr approach.

imo that would be the only correct way to fix it.
We btf_get(kptr_btf) before saving it in kptr.btf in btf_parse_kptr() and
btf_put() it eventually in btf_record_free().
graph_root looks buggy.
It saves the btf pointer in btf_parse_graph_root() without taking a refcount.
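
A minimal sketch of that direction (the exact shape is assumed here, see
the v3 link below for what was actually posted; per Hou's note above, the
get/put may need to be conditional to avoid a refcount cycle when the
record lives in btf->struct_meta_tab itself):

  /* in btf_parse_graph_root(), mirroring btf_parse_kptr(): */
  field->graph_root.btf = (struct btf *)btf;
  btf_get(field->graph_root.btf);   /* keep struct_meta_tab/value_rec alive */

  /* ... and in btf_record_free(): */
  switch (rec->fields[i].type) {
  case BPF_LIST_HEAD:
  case BPF_RB_ROOT:
          btf_put(rec->fields[i].graph_root.btf);
          break;
  /* ... */
  }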
Yonghong Song Dec. 5, 2023, 10:50 p.m. UTC | #8
On 12/5/23 4:13 PM, Alexei Starovoitov wrote:
> imo that would be the only correct way to fix it.
> We btf_get(kptr_btf) before saving it in kptr.btf in btf_parse_kptr() and
> btf_put() it eventually in btf_record_free().
> graph_root looks buggy.
> It saves the btf pointer in btf_parse_graph_root() without taking a refcount.

Agreed. Just sent the v3 patch:

https://lore.kernel.org/bpf/20231205224812.813224-1-yonghong.song@linux.dev/

Patch

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0ed286b8a0f0..9c6c3738adfe 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,11 +694,16 @@  static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct btf *btf = map->btf;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
+	/* Delay freeing of btf for maps, as map_free callback may need
+	 * struct_meta info which will be freed with btf_put().
+	 */
+	btf_put(btf);
 	/* Delay freeing of btf_record for maps, as map_free
 	 * callback usually needs access to them. It is better to do it here
 	 * than require each callback to do the free itself manually.
@@ -727,7 +732,6 @@  void bpf_map_put(struct bpf_map *map)
 	if (atomic64_dec_and_test(&map->refcnt)) {
 		/* bpf_map_free_id() must be called first */
 		bpf_map_free_id(map);
-		btf_put(map->btf);
 		INIT_WORK(&map->work, bpf_map_free_deferred);
 		/* Avoid spawning kworkers, since they all might contend
 		 * for the same mutex like slab_mutex.