diff mbox series

[bpf-next,v4] bpf: Fix a race condition between btf_put() and map_free()

Message ID 20231206210959.1035724-1-yonghong.song@linux.dev (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series [bpf-next,v4] bpf: Fix a race condition between btf_put() and map_free() | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 7874 this patch: 7874
netdev/cc_maintainers fail 1 blamed authors not CCed: memxor@gmail.com; 8 maintainers not CCed: kpsingh@kernel.org sdf@google.com john.fastabend@gmail.com martin.lau@linux.dev song@kernel.org haoluo@google.com jolsa@kernel.org memxor@gmail.com
netdev/build_clang success Errors and warnings before: 2577 this patch: 2577
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 8409 this patch: 8409
netdev/checkpatch fail ERROR: Avoid using diff content in the commit message - patch(1) might not work WARNING: Possible repeated word: 'that' WARNING: line length of 100 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc

Commit Message

Yonghong Song Dec. 6, 2023, 9:09 p.m. UTC
When running `./test_progs -j` in my local vm with latest kernel,
I once hit a kasan error like below:

  [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
  [ 1887.186498]
  [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
  [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
  [ 1887.190341] Call Trace:
  [ 1887.190666]  <TASK>
  [ 1887.190949]  dump_stack_lvl+0xac/0xe0
  [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
  [ 1887.192019]  ? panic+0x3c0/0x3c0
  [ 1887.192449]  print_report+0x14f/0x720
  [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
  [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.194572]  kasan_report+0xc3/0x100
  [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
  [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
  [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
  [ 1887.198883]  array_map_free+0x1a3/0x260
  [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
  [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
  [ 1887.200549]  worker_thread+0x633/0x890
  [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
  [ 1887.201574]  ? kthread+0x102/0x1d0
  [ 1887.202020]  kthread+0x1ab/0x1d0
  [ 1887.202447]  ? pr_cont_work+0x270/0x270
  [ 1887.202954]  ? kthread_blkcg+0x50/0x50
  [ 1887.203444]  ret_from_fork+0x34/0x50
  [ 1887.203914]  ? kthread_blkcg+0x50/0x50
  [ 1887.204397]  ret_from_fork_asm+0x11/0x20
  [ 1887.204913]  </TASK>
  [ 1887.205209]
  [ 1887.205416] Allocated by task 2197:
  [ 1887.205881]  kasan_set_track+0x3f/0x60
  [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
  [ 1887.206856]  __kmalloc+0xac/0x1a0
  [ 1887.207293]  btf_parse_fields+0xa15/0x1480
  [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
  [ 1887.208387]  btf_new_fd+0x294/0x4d0
  [ 1887.208851]  __sys_bpf+0x4ba/0x600
  [ 1887.209292]  __x64_sys_bpf+0x41/0x50
  [ 1887.209762]  do_syscall_64+0x4c/0xf0
  [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
  [ 1887.210868]
  [ 1887.211074] Freed by task 36:
  [ 1887.211460]  kasan_set_track+0x3f/0x60
  [ 1887.211951]  kasan_save_free_info+0x28/0x40
  [ 1887.212485]  ____kasan_slab_free+0x101/0x180
  [ 1887.213027]  __kmem_cache_free+0xe4/0x210
  [ 1887.213514]  btf_free+0x5b/0x130
  [ 1887.213918]  rcu_core+0x638/0xcc0
  [ 1887.214347]  __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

  00000000000034c0 <bpf_rb_root_free>:
  ; {
    34c0: f3 0f 1e fa                   endbr64
    34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
    34c9: 55                            pushq   %rbp
    34ca: 48 89 e5                      movq    %rsp, %rbp
  ...
  ;       if (rec && rec->refcount_off >= 0 &&
    36aa: 4d 85 ed                      testq   %r13, %r13
    36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
    36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
    36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                        <==== kasan function
    36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                        <==== use-after-free load
    36bc: 45 85 ff                      testl   %r15d, %r15d
    36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is at rec->refcount_off in the above.

I did some source code analysis and found the reason.

                                  CPU A                        CPU B
  bpf_map_put:
    ...
    btf_put with rcu callback
    ...
    bpf_map_free_deferred
      with system_unbound_wq
    ...                          ...                           ...
    ...                          btf_free_rcu:                 ...
    ...                          ...                           bpf_map_free_deferred:
    ...                          ...
    ...         --------->       btf_struct_metas_free()
    ...         | race condition ...
    ...         --------->                                     map->ops->map_free()
    ...
    ...                          btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free(), which eventually
calls bpf_rb_root_free(), which in turn calls:
  ...
  __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
  ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
    return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to a record in struct_meta_tab,
and it is possible that this particular record has been freed by
btf_struct_metas_free(), hence the kasan error here.
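
For reference, struct btf_field_graph_root looks like below (as defined
in include/linux/bpf.h at this point; the exact layout may vary by version),
where 'value_rec' is the pointer that ends up dangling:

  struct btf_field_graph_root {
          struct btf *btf;
          u32 value_btf_id;
          u32 node_offset;
          struct btf_record *value_rec;   /* points into btf's struct_meta_tab */
  };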

Actually, it is very hard to reproduce the failure with the current
bpf/bpf-next code; I only got the above error once. To make the problem
easier to hit, I added a delay in bpf_map_free_deferred() to postpone
map->ops->map_free(), which significantly increased reproducibility.

  diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
  index 5e43ddd1b83f..aae5b5213e93 100644
  --- a/kernel/bpf/syscall.c
  +++ b/kernel/bpf/syscall.c
  @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;

  +     mdelay(100);
        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */
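
For context, the race window exists because bpf_map_put() drops its btf
reference before the deferred work runs. A simplified sketch of that logic
(approximate; the exact code varies across kernel versions):

  void bpf_map_put(struct bpf_map *map)
  {
          if (atomic64_dec_and_test(&map->refcnt)) {
                  bpf_map_free_id(map);
                  /* may free btf->struct_meta_tab via an rcu callback */
                  btf_put(map->btf);
                  /* map->ops->map_free() runs later from this work item */
                  INIT_WORK(&map->work, bpf_map_free_deferred);
                  queue_work(system_unbound_wq, &map->work);
          }
  }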

To fix the problem, we need to hold a reference on the btf in order to
safeguard access to field->graph_root.value_rec in map->ops->map_free().
The function btf_parse_graph_root() is the place to take that reference.
The following are rough call stacks reaching btf_parse_graph_root():

   btf_parse
     ...
       btf_parse_fields
         ...
           btf_parse_graph_root

   map_check_btf
     btf_parse_fields
       ...
         btf_parse_graph_root

Looking at the above call stacks, btf_parse_graph_root() is indirectly
called from btf_parse() or map_check_btf().

We cannot take a reference in the btf_parse() case since at that moment
the btf is still in the middle of self-validation and its initial reference
(refcount_set(&btf->refcnt, 1)) has not been set yet.
It is also unnecessary to take a reference since the value_rec is
referring to a record in struct_meta_tab.

But we do need to take a reference in the map_check_btf() case, as
the value_rec refers to a record in struct_meta_tab and we want
to make sure the btf is not freed until the value_rec usage in map_free
is done.

So the fix is to take a btf reference in the map_check_btf() case
whenever a graph_root is found. A boolean field 'has_btf_ref' is added to
struct btf_field_graph_root so that the btf reference can later be properly
released.
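
Roughly, the fix has the following shape (illustrative sketch only, with
parameter plumbing simplified; 'from_map_check' stands for however the
caller context is passed down through btf_parse_fields(), and the patch
itself is authoritative):

  static int btf_parse_graph_root(const struct btf *btf, struct btf_field *field,
                                  struct btf_field_info *info, bool from_map_check)
  {
          field->graph_root.btf = (struct btf *)btf;
          field->graph_root.value_btf_id = info->graph_root.value_btf_id;
          field->graph_root.node_offset = info->graph_root.node_offset;
          /* Only the map_check_btf() path takes a reference; the btf_parse()
           * path must not, since the initial refcount is not set yet and a
           * reference would create a refcount cycle via struct_meta_tab.
           */
          field->graph_root.has_btf_ref = from_map_check;
          if (from_map_check)
                  btf_get(field->graph_root.btf);
          return 0;
  }

with btf_record_free() then doing btf_put(field->graph_root.btf) only when
'has_btf_ref' is set.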

I reran './test_progs -j' with the above mdelay() hack a couple of
times and did not observe the error.
Running Hou's test ([1]) was also successful.

  [1] https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@huaweicloud.com/

Cc: Hou Tao <houtao@huaweicloud.com>
Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h  |  1 +
 include/linux/btf.h  |  2 +-
 kernel/bpf/btf.c     | 27 ++++++++++++++++++---------
 kernel/bpf/syscall.c | 12 +++++++++---
 4 files changed, 29 insertions(+), 13 deletions(-)

Comments

Hou Tao Dec. 7, 2023, 1:46 p.m. UTC | #1
On 12/7/2023 5:09 AM, Yonghong Song wrote:
> When running `./test_progs -j` in my local vm with latest kernel,
> I once hit a kasan error like below:

SNIP
> Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>

Acked-by: Hou Tao <houtao1@huawei.com>
Martin KaFai Lau Dec. 8, 2023, 1:23 a.m. UTC | #2
On 12/6/23 1:09 PM, Yonghong Song wrote:
> When running `./test_progs -j` in my local vm with latest kernel,
> I once hit a kasan error like below:
> 
SNIP
> We cannot take a reference in the btf_parse() case since at that moment
> the btf is still in the middle of self-validation and its initial reference
> (refcount_set(&btf->refcnt, 1)) has not been set yet.

Thanks for the detailed analysis and clear explanation. It helps a lot.

Sorry for jumping in late.

I am trying to avoid making a special case for "bool has_btf_ref;" and "bool 
from_map_check". It seems a bit too much to deal with the error path for 
btf_parse().

Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?

> It is also unnecessary to take a reference since the value_rec is
> referring to a record in struct_meta_tab.

If we optimize for not taking a refcnt, how about not taking a refcnt for all 
cases and postponing the btf_put(), instead of taking a refcnt in one case but not 
another, like your fix in v1? The failed selftest can be changed or even removed 
if it does not make sense anymore.

Yonghong Song Dec. 8, 2023, 3:59 a.m. UTC | #3
On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
> On 12/6/23 1:09 PM, Yonghong Song wrote:
>> When running `./test_progs -j` in my local vm with latest kernel,
>> I once hit a kasan error like below:
>>
SNIP
>> We cannot take a reference in the btf_parse() case since at that moment
>> the btf is still in the middle of self-validation and its initial reference
>> (refcount_set(&btf->refcnt, 1)) has not been set yet.
>
> Thanks for the detailed analysis and clear explanation. It helps a lot.
>
> Sorry for jumping in late.
>
> I am trying to avoid making a special case for "bool has_btf_ref;" and 
> "bool from_map_check". It seems a bit too much to deal with the 
> error path for btf_parse().
>
> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?

No, it does not. The core reason is what Hou mentioned in
   https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
We simply cannot take a btf reference if called from btf_parse().
Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
so that we take a ref on btf during btf_parse_fields(); then we have
      btf_put <=== expects refcount == 0 to start the destruction process
        ...
          btf_record_free <=== in which, for a graph_root, a btf reference will be held
so btf_put will never be able to actually free the btf data.
Yes, the kasan problem would be resolved, but we would leak memory.
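
For context, btf_put() only starts teardown once the refcount hits zero.
A simplified sketch (approximate; exact helpers may differ by version):

  void btf_put(struct btf *btf)
  {
          if (btf && refcount_dec_and_test(&btf->refcnt)) {
                  btf_free_id(btf);
                  /* btf_free_rcu -> btf_free -> btf_struct_metas_free is
                   * where the records holding those extra references would
                   * be freed, so the refcount could never reach zero here.
                   */
                  call_rcu(&btf->rcu, btf_free_rcu);
          }
  }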

>
>> It is also unnecessary to take a reference since the value_rec is
>> referring to a record in struct_meta_tab.
>
> If we optimize for not taking a refcnt, how about not taking a refcnt 
> for all cases and postponing the btf_put(), instead of taking a refcnt in 
> one case but not another, like your fix in v1? The failed selftest can 
> be changed or even removed if it does not make sense anymore.

After a couple of iterations, I think the approach of taking the necessary
references sounds better, and it is consistent with how kptr is handled.
For kptr, btf_parse will ignore it.
Unfortunately, for graph_root (list_head, rb_root), btf_parse and map_check
will both process it, and that adds a little bit of complexity.
Alexei also suggested the same take-a-reference approach:
   https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/

>
>>
>> But we do need to take a reference for map_check_btf() case as
>> the value_rec refers to a record in struct_meta_tab and we want
>> to make sure btf is not freed until value_rec usage at map_free
>> is done.
>>
>> So the fix is to get a btf reference for map_check_btf() case
>> whenever a graph_root is found. A boolean field 'has_btf_ref' is 
>> added to
>> struct btf_field_graph_root so later the btf reference can be properly
>> released.
>>
>> Rerun './test_progs -j' with the above mdelay() hack for a couple
>> of times and didn't observe the error.
>> Running Hou's test ([1]) is also successful.
>>
>>    [1] 
>> https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@huaweicloud.com/
>>
>
Yonghong Song Dec. 8, 2023, 4:02 a.m. UTC | #4
On 12/7/23 7:59 PM, Yonghong Song wrote:
>
> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>> When running `./test_progs -j` in my local vm with latest kernel,
>>> I once hit a kasan error like below:
>>>
SNIP
>>> We cannot take a reference in the btf_parse() case since at that moment
>>> the btf is still in the middle of self-validation and its initial reference
>>> (refcount_set(&btf->refcnt, 1)) has not been set yet.
>>
>> Thanks for the detailed analysis and clear explanation. It helps a lot.
>>
>> Sorry for jumping in late.
>>
>> I am trying to avoid making a special case for "bool has_btf_ref;" 
>> and "bool from_map_check". It seems a bit too much to deal with 
>> the error path for btf_parse().
>>
>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?
>
> No, it does not. The core reason is what Hou mentioned in
> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
> We simply cannot take a btf reference if called from btf_parse().
> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
> so that we take a ref on btf during btf_parse_fields(); then we have
>      btf_put <=== expects refcount == 0 to start the destruction process
>        ...
>          btf_record_free <=== in which, for a graph_root, a btf reference
> will be held
> so btf_put will never be able to actually free the btf data.
> Yes, the kasan problem would be resolved, but we would leak memory.
Let me send another version with a better commit message.

[...]
Martin KaFai Lau Dec. 8, 2023, 8:16 a.m. UTC | #5
On 12/7/23 7:59 PM, Yonghong Song wrote:
>>
>> I am trying to avoid making a special case for "bool has_btf_ref;" and "bool 
>> from_map_check". It seems a bit too much to deal with the error path for 
>> btf_parse().
>>
>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?
> 
> No, it does not. The core reason is what Hou mentioned in
> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
> We simply cannot take a btf reference if called from btf_parse().
> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
> so that we take a ref on btf during btf_parse_fields(); then we have
>       btf_put <=== expects refcount == 0 to start the destruction process
>         ...
>           btf_record_free <=== in which, for a graph_root, a btf reference will be held
> so btf_put will never be able to actually free the btf data.

ah. There is a loop like btf->struct_meta_tab->...btf.

> Yes, the kasan problem would be resolved, but we would leak memory.
> 
>>
>>> It is also unnecessary to take a reference since the value_rec is
>>> referring to a record in struct_meta_tab.
>>
>> If we optimize for not taking a refcnt, how about not taking a refcnt for all 
>> cases and postponing the btf_put(), instead of taking a refcnt in one case but not 
>> another, like your fix in v1? The failed selftest can be changed or even removed 
>> if it does not make sense anymore.
> 
> After a couple of iterations, I think the approach of taking the necessary 
> references sounds better, and it is consistent with how kptr is handled. 
> For kptr, btf_parse will ignore it.

Got it. That is why kptr.btf got away without the loop.

On the other hand, am I reading it correctly that kptr.btf only needs to take 
the refcnt for btf that is btf_is_kernel()?

Hou Tao Dec. 8, 2023, 8:30 a.m. UTC | #6
Hi,

On 12/8/2023 12:02 PM, Yonghong Song wrote:
>
> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>
>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>> I once hit a kasan error like below:
>>>>

SNIP
>>>> We cannot take a reference in the btf_parse() case since at that moment
>>>> the btf is still in the middle of self-validation and its initial reference
>>>> (refcount_set(&btf->refcnt, 1)) has not been set yet.
>>>
>>> Thanks for the detailed analysis and clear explanation. It helps a lot.
>>>
>>> Sorry for jumping in late.
>>>
>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>> and "bool from_map_check". It seems a bit too much to deal with
>>> the error path for btf_parse().

Maybe we could move the common btf used by kptr and graph_root into
btf_record and let the callers of btf_parse_fields() and
btf_record_free() decide the life cycle of the btf in btf_record, so
the change will be less intrusive and there will be fewer special cases.
The following is a code snippet for the idea (only lightly tested):

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1ad0ed623f50..a0c4d696a231 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -230,6 +230,7 @@ struct btf_record {
        int spin_lock_off;
        int timer_off;
        int refcount_off;
+       struct btf *btf;
        struct btf_field fields[];
 };

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 966dace27fae..08b4a2784ee4 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3598,7 +3598,6 @@ static int btf_parse_kptr(struct btf *btf, struct btf_field *field,
                field->kptr.dtor = NULL;
                id = info->kptr.type_id;
                kptr_btf = btf;
-               btf_get(kptr_btf);
                goto found_dtor;
        }
        if (id < 0)
@@ -3753,6 +3752,7 @@ struct btf_record *btf_parse_fields(struct btf *btf, const struct btf_type *t,
        if (!rec)
                return ERR_PTR(-ENOMEM);

+       rec->btf = btf;
        rec->spin_lock_off = -EINVAL;
        rec->timer_off = -EINVAL;
        rec->refcount_off = -EINVAL;
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 8ef269e66ba5..d9f4198158b6 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -56,6 +56,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
                ret = PTR_ERR(inner_map_meta->record);
                goto free;
        }
+       if (inner_map_meta->record)
+               btf_get(inner_map_meta->record->btf);
+
        /* Note: We must use the same BTF, as we also used btf_record_dup above
         * which relies on BTF being same for both maps, as some members like
         * record->fields.list_head have pointers like value_rec pointing into
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 30ed7756fc71..d2641e51a1a7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -516,7 +516,8 @@ void btf_record_free(struct btf_record *rec)
                case BPF_KPTR_PERCPU:
                        if (rec->fields[i].kptr.module)
                                module_put(rec->fields[i].kptr.module);
-                       btf_put(rec->fields[i].kptr.btf);
+                       if (rec->fields[i].kptr.btf != rec->btf)
+                               btf_put(rec->fields[i].kptr.btf);
                        break;
                case BPF_LIST_HEAD:
                case BPF_LIST_NODE:
@@ -537,8 +538,13 @@ void btf_record_free(struct btf_record *rec)

 void bpf_map_free_record(struct bpf_map *map)
 {
+       struct btf *btf = NULL;
+
+       if (!IS_ERR_OR_NULL(map->record))
+               btf = map->record->btf;
        btf_record_free(map->record);
        map->record = NULL;
+       btf_put(btf);
 }

 struct btf_record *btf_record_dup(const struct btf_record *rec)
@@ -561,7 +567,8 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
                case BPF_KPTR_UNREF:
                case BPF_KPTR_REF:
                case BPF_KPTR_PERCPU:
-                       btf_get(fields[i].kptr.btf);
+                       if (fields[i].kptr.btf != rec->btf)
+                               btf_get(fields[i].kptr.btf);
                        if (fields[i].kptr.module && !try_module_get(fields[i].kptr.module)) {
                                ret = -ENXIO;
                                goto free;
@@ -692,7 +699,10 @@ static void bpf_map_free_deferred(struct work_struct *work)
 {
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;
+       struct btf *btf = NULL;

+       if (!IS_ERR_OR_NULL(rec))
+               btf = rec->btf;
        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */
@@ -707,6 +717,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
         * template bpf_map struct used during verification.
         */
        btf_record_free(rec);
+       btf_put(btf);
 }

 static void bpf_map_put_uref(struct bpf_map *map)
@@ -1036,6 +1047,8 @@ static int map_check_btf(struct bpf_map *map, struct btf *btf,
        if (!IS_ERR_OR_NULL(map->record)) {
                int i;

+               btf_get(map->record->btf);
+
                if (!bpf_capable()) {
                        ret = -EPERM;
                        goto free_map_tab;


WDYT ?

Yonghong Song Dec. 8, 2023, 4:45 p.m. UTC | #7
On 12/8/23 12:16 AM, Martin KaFai Lau wrote:
> On 12/7/23 7:59 PM, Yonghong Song wrote:
SNIP
>>
>> After a couple of iterations, I think the approach of taking the necessary 
>> references sounds better, and it is consistent with how kptr is handled. 
>> For kptr, btf_parse will ignore it.
>
> Got it. That is why kptr.btf got away without the loop.
>
> On the other hand, am I reading it correctly that kptr.btf only needs 
> to take the refcnt for btf that is btf_is_kernel()?

No. Besides vmlinux and module btf, it also takes a reference for prog btf, see

static int btf_parse_kptr(const struct btf *btf, struct btf_field *field,
                           struct btf_field_info *info)
{
...
         if (id == -ENOENT) {
                 /* btf_parse_kptr should only be called w/ btf = program BTF */
                 WARN_ON_ONCE(btf_is_kernel(btf));
                 
                 /* Type exists only in program BTF. Assume that it's a MEM_ALLOC
                  * kptr allocated via bpf_obj_new
                  */
                 field->kptr.dtor = NULL;
                 id = info->kptr.type_id;
                 kptr_btf = (struct btf *)btf;
                 btf_get(kptr_btf);
                 goto found_dtor;
         }
...
}
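
The matching put side is in btf_record_free(), which drops the btf
reference together with the module reference:

         case BPF_KPTR_UNREF:
         case BPF_KPTR_REF:
         case BPF_KPTR_PERCPU:
                 if (rec->fields[i].kptr.module)
                         module_put(rec->fields[i].kptr.module);
                 btf_put(rec->fields[i].kptr.btf);
                 break;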

>
>> Unfortunately, for graph_root (list_head, rb_root), btf_parse and
>> map_check will both process it, and that adds a little bit of
>> complexity.
>> Alexei also suggested the same taking reference approach:
>> https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/ 
>>
>
Yonghong Song Dec. 8, 2023, 5:07 p.m. UTC | #8
On 12/8/23 12:30 AM, Hou Tao wrote:
> Hi,
>
> On 12/8/2023 12:02 PM, Yonghong Song wrote:
>> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>> I once hit a kasan error like below:
>>>>>
>>>>>   
> SNIP
>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>>> the following code:
>>>>>
>>>>>     meta = btf_find_struct_meta(btf, btf_id);
>>>>>     if (!meta)
>>>>>       return -EFAULT;
>>>>>     rec->fields[i].graph_root.value_rec = meta->record;
>>>>>
>>>>> So basically, 'value_rec' is a pointer to the record in
>>>>> struct_metas_tab.
>>>>> And it is possible that this particular record has been freed by
>>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>>
>>>>> Actually it is very hard to reproduce the failure with current
>>>>> bpf/bpf-next
>>>>> code; I only got the above error once. To increase reproducibility,
>>>>> I added
>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>>> which
>>>>> significantly increased reproducibility.
>>>>>
>>>>>     diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>>     index 5e43ddd1b83f..aae5b5213e93 100644
>>>>>     --- a/kernel/bpf/syscall.c
>>>>>     +++ b/kernel/bpf/syscall.c
>>>>>     @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>>>> work_struct *work)
>>>>>           struct bpf_map *map = container_of(work, struct bpf_map,
>>>>> work);
>>>>>           struct btf_record *rec = map->record;
>>>>>
>>>>>     +     mdelay(100);
>>>>>           security_bpf_map_free(map);
>>>>>           bpf_map_release_memcg(map);
>>>>>           /* implementation dependent freeing */
>>>>>
>>>>> To fix the problem, we need to have a reference on btf in order to
>>>>> safeguard accessing field->graph_root.value_rec in
>>>>> map->ops->map_free().
>>>>> The function btf_parse_graph_root() is the place to get a btf
>>>>> reference.
>>>>> The following are rough call stacks reaching btf_parse_graph_root():
>>>>>
>>>>>      btf_parse
>>>>>        ...
>>>>>          btf_parse_fields
>>>>>            ...
>>>>>              btf_parse_graph_root
>>>>>
>>>>>      map_check_btf
>>>>>        btf_parse_fields
>>>>>          ...
>>>>>            btf_parse_graph_root
>>>>>
>>>>> Looking at the above call stacks, btf_parse_graph_root() is
>>>>> indirectly called from btf_parse() or map_check_btf().
>>>>>
>>>>> We cannot take a reference in the btf_parse() case since at that moment,
>>>>> btf is still in the middle of self-validation and the initial reference
>>>>> (refcount_set(&btf->refcnt, 1)) has not been taken yet.
>>>> Thanks for the detailed analysis and clear explanation. It helps a lot.
>>>>
>>>> Sorry for jumping in late.
>>>>
>>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>>> and "bool from_map_check". It seems a bit too much to deal with
>>>> the error path for btf_parse().
> Maybe we could move the common btf used by kptr and graph_root into
> btf_record and let the callers of btf_parse_fields() and
> btf_record_free() decide the life cycle of btf in btf_record, so
> it will be less intrusive with fewer special cases. The following is the

I didn't fully check the code, but it looks like we take a
btf reference at map_check_btf() and free it at the end
of bpf_map_free_deferred(). This is similar to my v1 patch:
not exactly the same, but similar in that both do
btf_put() at the end of bpf_map_free_deferred().

Through discussion, the on-demand btf_get()/btf_put()
approach, similar to the kptr approach, seems more favored.
It also has the advantage of freeing btf at the earliest
possible point.
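
Concretely, the on-demand pairing looks roughly like this (a sketch
distilled from the v4 patch at the bottom of this page):

    /* btf_parse_graph_root(), when reached via map_check_btf() */
    if (from_map_check)
            btf_get((struct btf *)btf);
    field->graph_root.has_btf_ref = from_map_check;

    /* and the matching put in btf_record_free() */
    case BPF_LIST_HEAD:
    case BPF_RB_ROOT:
            if (rec->fields[i].graph_root.has_btf_ref)
                    btf_put(rec->fields[i].graph_root.btf);
            break;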

> code snippets for the idea (only lightly tested):
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1ad0ed623f50..a0c4d696a231 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -230,6 +230,7 @@ struct btf_record {
>          int spin_lock_off;
>          int timer_off;
>          int refcount_off;
> +       struct btf *btf;
>          struct btf_field fields[];
>   };
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 966dace27fae..08b4a2784ee4 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3598,7 +3598,6 @@ static int btf_parse_kptr(struct btf *btf, struct
> btf_field *field,
>                  field->kptr.dtor = NULL;
>                  id = info->kptr.type_id;
>                  kptr_btf = btf;
> -               btf_get(kptr_btf);
>                  goto found_dtor;
>          }
>          if (id < 0)
> @@ -3753,6 +3752,7 @@ struct btf_record *btf_parse_fields(struct btf
> *btf, const struct btf_type *t,
>          if (!rec)
>                  return ERR_PTR(-ENOMEM);
>
> +       rec->btf = btf;
>          rec->spin_lock_off = -EINVAL;
>          rec->timer_off = -EINVAL;
>          rec->refcount_off = -EINVAL;
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 8ef269e66ba5..d9f4198158b6 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -56,6 +56,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>                  ret = PTR_ERR(inner_map_meta->record);
>                  goto free;
>          }
> +       if (inner_map_meta->record)
> +               btf_get(inner_map_meta->record->btf);
> +
>          /* Note: We must use the same BTF, as we also used
> btf_record_dup above
>           * which relies on BTF being same for both maps, as some members
> like
>           * record->fields.list_head have pointers like value_rec
> pointing into
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 30ed7756fc71..d2641e51a1a7 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -516,7 +516,8 @@ void btf_record_free(struct btf_record *rec)
>                  case BPF_KPTR_PERCPU:
>                          if (rec->fields[i].kptr.module)
>                                  module_put(rec->fields[i].kptr.module);
> -                       btf_put(rec->fields[i].kptr.btf);
> +                       if (rec->fields[i].kptr.btf != rec->btf)
> +                               btf_put(rec->fields[i].kptr.btf);
>                          break;
>                  case BPF_LIST_HEAD:
>                  case BPF_LIST_NODE:
> @@ -537,8 +538,13 @@ void btf_record_free(struct btf_record *rec)
>
>   void bpf_map_free_record(struct bpf_map *map)
>   {
> +       struct btf *btf = NULL;
> +
> +       if (!IS_ERR_OR_NULL(map->record))
> +               btf = map->record->btf;
>          btf_record_free(map->record);
>          map->record = NULL;
> +       btf_put(btf);
>   }
>
>   struct btf_record *btf_record_dup(const struct btf_record *rec)
> @@ -561,7 +567,8 @@ struct btf_record *btf_record_dup(const struct
> btf_record *rec)
>                  case BPF_KPTR_UNREF:
>                  case BPF_KPTR_REF:
>                  case BPF_KPTR_PERCPU:
> -                       btf_get(fields[i].kptr.btf);
> +                       if (fields[i].kptr.btf != rec->btf)
> +                               btf_get(fields[i].kptr.btf);
>                          if (fields[i].kptr.module &&
> !try_module_get(fields[i].kptr.module)) {
>                                  ret = -ENXIO;
>                                  goto free;
> @@ -692,7 +699,10 @@ static void bpf_map_free_deferred(struct
> work_struct *work)
>   {
>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>          struct btf_record *rec = map->record;
> +       struct btf *btf = NULL;
>
> +       if (!IS_ERR_OR_NULL(rec))
> +               btf = rec->btf;
>          security_bpf_map_free(map);
>          bpf_map_release_memcg(map);
>          /* implementation dependent freeing */
> @@ -707,6 +717,7 @@ static void bpf_map_free_deferred(struct work_struct
> *work)
>           * template bpf_map struct used during verification.
>           */
>          btf_record_free(rec);
> +       btf_put(btf);
>   }
>
>   static void bpf_map_put_uref(struct bpf_map *map)
> @@ -1036,6 +1047,8 @@ static int map_check_btf(struct bpf_map *map,
> struct btf *btf,
>          if (!IS_ERR_OR_NULL(map->record)) {
>                  int i;
>
> +               btf_get(map->record->btf);
> +
>                  if (!bpf_capable()) {
>                          ret = -EPERM;
>                          goto free_map_tab;
>
>
>
> WDYT?
>
>>>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse
>>>> help?
>>> No, it does not. The core reason is what Hou mentioned in
>>> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
>>>
>>> We simply cannot take btf reference if called from btf_parse().
>>> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
>>> so we take ref for btf during btf_parse_fields(), then we have
>>>       btf_put <=== expect refcount == 0 to start the destruction process
>>>         ...
>>>           btf_record_free <=== in which, if graph_root is present, a btf
>>> reference will be held
>>> so btf_put will never be able to actually free btf data.
>>> Yes, the kasan problem will be resolved but we leak memory.
>> Let me send another version with a better commit message.
>
>> [...]
Martin KaFai Lau Dec. 8, 2023, 6:26 p.m. UTC | #9
On 12/8/23 8:45 AM, Yonghong Song wrote:
> 
> On 12/8/23 12:16 AM, Martin KaFai Lau wrote:
>> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>>>
>>>> I am trying to avoid making a special case for "bool has_btf_ref;" and "bool 
>>>> from_map_check". It seems a bit too much to deal with the error path for
>>>> btf_parse().
>>>>
>>>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?
>>>
>>> No, it does not. The core reason is what Hou mentioned in
>>> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
>>> We simply cannot take btf reference if called from btf_parse().
>>> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
>>> so we take ref for btf during btf_parse_fields(), then we have
>>>       btf_put <=== expect refcount == 0 to start the destruction process
>>>         ...
>>>           btf_record_free <=== in which, if graph_root is present, a btf
>>> reference will be held
>>> so btf_put will never be able to actually free btf data.
>>
>> ah. There is a loop like btf->struct_meta_tab->...btf.
>>
>>> Yes, the kasan problem will be resolved but we leak memory.
>>>
>>>>
>>>>> It is also unnecessary to take a reference since the value_rec is
>>>>> referring to a record in struct_meta_tab.
>>>>
>>>> If we optimize for not taking a refcnt, how about not taking a refcnt for
>>>> all cases and postponing the btf_put(), instead of taking a refcnt in one
>>>> case but not another? Like your fix in v1. The failed selftest can be
>>>> changed or even removed if it does not make sense anymore.
>>>
>>> After a couple of iterations, I think the take-necessary-reference
>>> approach sounds better, and it will be consistent with how kptr is
>>> handled. For kptr, btf_parse will ignore it.
>>
>> Got it. It is why kptr.btf got away with the loop.
>>
>> On the other hand, am I reading it correctly that kptr.btf only needs to take 
>> the refcnt for btf that is btf_is_kernel()?
> 
> No. Besides vmlinux and module btf, it also takes a reference for prog btf, see
> 
> static int btf_parse_kptr(const struct btf *btf, struct btf_field *field,
>                            struct btf_field_info *info)
> {
> ...
>          if (id == -ENOENT) {
>                  /* btf_parse_kptr should only be called w/ btf = program BTF */
>                  WARN_ON_ONCE(btf_is_kernel(btf));
>                  /* Type exists only in program BTF. Assume that it's a MEM_ALLOC
>                   * kptr allocated via bpf_obj_new
>                   */
>                  field->kptr.dtor = NULL;
>                  id = info->kptr.type_id;
>                  kptr_btf = (struct btf *)btf;
>                  btf_get(kptr_btf);

I meant only kernel/module btf needs to take the refcnt, so there is no need to 
take the refcnt here for the (it)self btf. Sorry that I was not clear in my 
earlier comment.

The record is capturing something either in the self btf or in the
kernel btf. field->kptr.btf is the one that may point to either a kernel
or the self btf, so it should be the only case that needs the following
check in btf_record_free():

	if (btf_is_kernel(rec->fields[i].kptr.btf))
		btf_put(rec->fields[i].kptr.btf);

In all other cases the record has a self btf (including field->graph_root.btf).
The owner (the map here) needs to ensure the self btf is freed after the
record is freed.
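
Symmetrically, btf_record_dup() would then only need something like
(hypothetical sketch, untested):

	/* only a kernel/module btf needs an extra reference */
	if (btf_is_kernel(fields[i].kptr.btf))
		btf_get(fields[i].kptr.btf);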

I was wondering if it could avoid doing different things based on where
btf_parse_fields() is called, by separating which types of btf always
need a refcnt and which do not. I agree the approach in this patch also
fixes the issue, and I have acked v5. Thanks for the fix.

>                  goto found_dtor;
>          }
> ...
> }
> 
>>
>>> Unfortunately, for graph_root (list_head, rb_root), btf_parse and map_check
>>> will both process it, and that adds a little bit of complexity.
>>> Alexei also suggested the same taking reference approach:
>>> https://lore.kernel.org/bpf/CAADnVQL+uc6VV65_Ezgzw3WH=ME9z1Fdy8Pd6xd0oOq8rgwh7g@mail.gmail.com/
>>
Alexei Starovoitov Dec. 14, 2023, 4:17 a.m. UTC | #10
On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 12/8/23 12:30 AM, Hou Tao wrote:
> > Hi,
> >
> > On 12/8/2023 12:02 PM, Yonghong Song wrote:
> >> On 12/7/23 7:59 PM, Yonghong Song wrote:
> >>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
> >>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
> >>>>> When running `./test_progs -j` in my local vm with latest kernel,
> >>>>> I once hit a kasan error like below:
> >>>>>
> >>>>>
> > SNIP
> >>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
> >>>>> the following code:
> >>>>>
> >>>>>     meta = btf_find_struct_meta(btf, btf_id);
> >>>>>     if (!meta)
> >>>>>       return -EFAULT;
> >>>>>     rec->fields[i].graph_root.value_rec = meta->record;
> >>>>>
> >>>>> So basically, 'value_rec' is a pointer to the record in
> >>>>> struct_metas_tab.
> >>>>> And it is possible that this particular record has been freed by
> >>>>> btf_struct_metas_free() and hence we have a kasan error here.
> >>>>>
> >>>>> Actually it is very hard to reproduce the failure with current
> >>>>> bpf/bpf-next
> >>>>> code; I only got the above error once. To increase reproducibility,
> >>>>> I added
> >>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
> >>>>> which
> >>>>> significantly increased reproducibility.
> >>>>>
> >>>>>     diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >>>>>     index 5e43ddd1b83f..aae5b5213e93 100644
> >>>>>     --- a/kernel/bpf/syscall.c
> >>>>>     +++ b/kernel/bpf/syscall.c
> >>>>>     @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
> >>>>> work_struct *work)
> >>>>>           struct bpf_map *map = container_of(work, struct bpf_map,
> >>>>> work);
> >>>>>           struct btf_record *rec = map->record;
> >>>>>
> >>>>>     +     mdelay(100);
> >>>>>           security_bpf_map_free(map);
> >>>>>           bpf_map_release_memcg(map);
> >>>>>           /* implementation dependent freeing */
> >>>>>
> >>>>> To fix the problem, we need to have a reference on btf in order to
> >>>>> safeguard accessing field->graph_root.value_rec in
> >>>>> map->ops->map_free().
> >>>>> The function btf_parse_graph_root() is the place to get a btf
> >>>>> reference.
> >>>>> The following are rough call stacks reaching btf_parse_graph_root():
> >>>>>
> >>>>>      btf_parse
> >>>>>        ...
> >>>>>          btf_parse_fields
> >>>>>            ...
> >>>>>              btf_parse_graph_root
> >>>>>
> >>>>>      map_check_btf
> >>>>>        btf_parse_fields
> >>>>>          ...
> >>>>>            btf_parse_graph_root
> >>>>>
> >>>>> Looking at the above call stacks, btf_parse_graph_root() is
> >>>>> indirectly called from btf_parse() or map_check_btf().
> >>>>>
> >>>>> We cannot take a reference in the btf_parse() case since at that moment,
> >>>>> btf is still in the middle of self-validation and the initial reference
> >>>>> (refcount_set(&btf->refcnt, 1)) has not been taken yet.
> >>>> Thanks for the detailed analysis and clear explanation. It helps a lot.
> >>>>
> >>>> Sorry for jumping in late.
> >>>>
> >>>> I am trying to avoid making a special case for "bool has_btf_ref;"
> >>>> and "bool from_map_check". It seems a bit too much to deal with
> >>>> the error path for btf_parse().
> > Maybe we could move the common btf used by kptr and graph_root into
> > btf_record and let the callers of btf_parse_fields() and
> > btf_record_free() decide the life cycle of btf in btf_record, so
> > it will be less intrusive with fewer special cases. The following is the
>
> I didn't fully check the code, but it looks like we take a
> btf reference at map_check_btf() and free it at the end
> of bpf_map_free_deferred(). This is similar to my v1 patch:
> not exactly the same, but similar in that both do
> btf_put() at the end of bpf_map_free_deferred().
>
> Through discussion, the on-demand btf_get()/btf_put()
> approach, similar to the kptr approach, seems more favored.
> It also has the advantage of freeing btf at the earliest
> possible point.

Sorry. Looks like I recommended the wrong path.

The approach of btf_parse_fields(... false | true)
depending on where it's called and whether the returned struct btf_record *
will be kept within a type or within a map
is pushing complexity too far.
A year from now we'll forget these subtle details.
There is an advantage to doing btf_put() early in bpf_map_put(),
but in the common case it would only be delayed until just after
queue_work, which is a minor time delay.
For free_after_mult_rcu_gp it would be much longer,
but the savings from freeing btf early are minor compared to the map itself.

I think it's cleaner to go back to v1 and simply move btf_put
to bpf_map_free_deferred().
A lot fewer things to worry about,
especially considering that BPF_RB_ROOT may not be the last such special
record-keeping type, and every new type would need to think
hard about whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like.
v1 avoids this future complexity.
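
Roughly, the v1 direction looks like this (a sketch, not the exact v1
diff):

static void bpf_map_free_deferred(struct work_struct *work)
{
	struct bpf_map *map = container_of(work, struct bpf_map, work);
	struct btf_record *rec = map->record;
	struct btf *btf = map->btf;

	security_bpf_map_free(map);
	bpf_map_release_memcg(map);
	/* implementation dependent freeing; may still dereference
	 * rec->fields[i].graph_root.value_rec
	 */
	map->ops->map_free(map);
	btf_record_free(rec);
	/* btf_put() moved here from bpf_map_put(), so the btf (and the
	 * struct_meta_tab that value_rec points into) stays alive until
	 * map_free() has finished
	 */
	btf_put(btf);
}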
Yonghong Song Dec. 14, 2023, 6:30 a.m. UTC | #11
On 12/13/23 8:17 PM, Alexei Starovoitov wrote:
> On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 12/8/23 12:30 AM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/8/2023 12:02 PM, Yonghong Song wrote:
>>>> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>>>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>>>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>>>> I once hit a kasan error like below:
>>>>>>>
>>>>>>>
>>> SNIP
>>>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>>>>> the following code:
>>>>>>>
>>>>>>>      meta = btf_find_struct_meta(btf, btf_id);
>>>>>>>      if (!meta)
>>>>>>>        return -EFAULT;
>>>>>>>      rec->fields[i].graph_root.value_rec = meta->record;
>>>>>>>
>>>>>>> So basically, 'value_rec' is a pointer to the record in
>>>>>>> struct_metas_tab.
>>>>>>> And it is possible that this particular record has been freed by
>>>>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>>>>
>>>>>>> Actually it is very hard to reproduce the failure with current
>>>>>>> bpf/bpf-next
>>>>>>> code; I only got the above error once. To increase reproducibility,
>>>>>>> I added
>>>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>>>>> which
>>>>>>> significantly increased reproducibility.
>>>>>>>
>>>>>>>      diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>>>>      index 5e43ddd1b83f..aae5b5213e93 100644
>>>>>>>      --- a/kernel/bpf/syscall.c
>>>>>>>      +++ b/kernel/bpf/syscall.c
>>>>>>>      @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>>>>>> work_struct *work)
>>>>>>>            struct bpf_map *map = container_of(work, struct bpf_map,
>>>>>>> work);
>>>>>>>            struct btf_record *rec = map->record;
>>>>>>>
>>>>>>>      +     mdelay(100);
>>>>>>>            security_bpf_map_free(map);
>>>>>>>            bpf_map_release_memcg(map);
>>>>>>>            /* implementation dependent freeing */
>>>>>>>
>>>>>>> To fix the problem, we need to have a reference on btf in order to
>>>>>>> safeguard accessing field->graph_root.value_rec in
>>>>>>> map->ops->map_free().
>>>>>>> The function btf_parse_graph_root() is the place to get a btf
>>>>>>> reference.
>>>>>>> The following are rough call stacks reaching btf_parse_graph_root():
>>>>>>>
>>>>>>>       btf_parse
>>>>>>>         ...
>>>>>>>           btf_parse_fields
>>>>>>>             ...
>>>>>>>               btf_parse_graph_root
>>>>>>>
>>>>>>>       map_check_btf
>>>>>>>         btf_parse_fields
>>>>>>>           ...
>>>>>>>             btf_parse_graph_root
>>>>>>>
>>>>>>> Looking at the above call stacks, btf_parse_graph_root() is
>>>>>>> indirectly called from btf_parse() or map_check_btf().
>>>>>>>
>>>>>>> We cannot take a reference in the btf_parse() case since at that moment,
>>>>>>> btf is still in the middle of self-validation and the initial reference
>>>>>>> (refcount_set(&btf->refcnt, 1)) has not been taken yet.
>>>>>> Thanks for the detailed analysis and clear explanation. It helps a lot.
>>>>>>
>>>>>> Sorry for jumping in late.
>>>>>>
>>>>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>>>>> and "bool from_map_check". It seems a bit too much to deal with
>>>>>> the error path for btf_parse().
>>> Maybe we could move the common btf used by kptr and graph_root into
>>> btf_record and let the callers of btf_parse_fields() and
>>> btf_record_free() decide the life cycle of btf in btf_record, so
>>> it will be less intrusive with fewer special cases. The following is the
>> I didn't fully check the code, but it looks like we take a
>> btf reference at map_check_btf() and free it at the end
>> of bpf_map_free_deferred(). This is similar to my v1 patch:
>> not exactly the same, but similar in that both do
>> btf_put() at the end of bpf_map_free_deferred().
>>
>> Through discussion, the on-demand btf_get()/btf_put()
>> approach, similar to the kptr approach, seems more favored.
>> It also has the advantage of freeing btf at the earliest
>> possible point.
> Sorry. Looks like I recommended the wrong path.
>
> The approach of btf_parse_fields(... false | true)
> depending on where it's called and whether the returned struct btf_record *
> will be kept within a type or within a map
> is pushing complexity too far.
> A year from now we'll forget these subtle details.
> There is an advantage to doing btf_put() early in bpf_map_put(),
> but in the common case it would only be delayed until just after
> queue_work, which is a minor time delay.
> For free_after_mult_rcu_gp it would be much longer,
> but the savings from freeing btf early are minor compared to the map itself.
>
> I think it's cleaner to go back to v1 and simply move btf_put
> to bpf_map_free_deferred().
> A lot fewer things to worry about,
> especially considering that BPF_RB_ROOT may not be the last such special
> record-keeping type, and every new type would need to think
> hard about whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like.
> v1 avoids this future complexity.

No problem. Will send v6 tomorrow based on the v1 approach.
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 10e5e4d8a00f..6689b1b15993 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@  struct btf_field_graph_root {
 	u32 value_btf_id;
 	u32 node_offset;
 	struct btf_record *value_rec;
+	bool has_btf_ref;
 };
 
 struct btf_field {
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 59d404e22814..f7d4f6594308 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -217,7 +217,7 @@  bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
-				    u32 field_mask, u32 value_size);
+				    u32 field_mask, u32 value_size, bool from_map_check);
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 63cf4128fc05..f23d6a7b8b93 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3665,7 +3665,8 @@  static int btf_parse_graph_root(const struct btf *btf,
 				struct btf_field *field,
 				struct btf_field_info *info,
 				const char *node_type_name,
-				size_t node_type_align)
+				size_t node_type_align,
+				bool from_map_check)
 {
 	const struct btf_type *t, *n = NULL;
 	const struct btf_member *member;
@@ -3696,6 +3697,9 @@  static int btf_parse_graph_root(const struct btf *btf,
 		if (offset % node_type_align)
 			return -EINVAL;
 
+		if (from_map_check)
+			btf_get((struct btf *)btf);
+		field->graph_root.has_btf_ref = from_map_check;
 		field->graph_root.btf = (struct btf *)btf;
 		field->graph_root.value_btf_id = info->graph_root.value_btf_id;
 		field->graph_root.node_offset = offset;
@@ -3706,17 +3710,19 @@  static int btf_parse_graph_root(const struct btf *btf,
 }
 
 static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
-			       struct btf_field_info *info)
+			       struct btf_field_info *info, bool from_map_check)
 {
 	return btf_parse_graph_root(btf, field, info, "bpf_list_node",
-					    __alignof__(struct bpf_list_node));
+				    __alignof__(struct bpf_list_node),
+				    from_map_check);
 }
 
 static int btf_parse_rb_root(const struct btf *btf, struct btf_field *field,
-			     struct btf_field_info *info)
+			     struct btf_field_info *info, bool from_map_check)
 {
 	return btf_parse_graph_root(btf, field, info, "bpf_rb_node",
-					    __alignof__(struct bpf_rb_node));
+				    __alignof__(struct bpf_rb_node),
+				    from_map_check);
 }
 
 static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
@@ -3732,7 +3738,7 @@  static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
 }
 
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
-				    u32 field_mask, u32 value_size)
+				    u32 field_mask, u32 value_size, bool from_map_check)
 {
 	struct btf_field_info info_arr[BTF_FIELDS_MAX];
 	u32 next_off = 0, field_type_size;
@@ -3798,12 +3804,14 @@  struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 				goto end;
 			break;
 		case BPF_LIST_HEAD:
-			ret = btf_parse_list_head(btf, &rec->fields[i], &info_arr[i]);
+			ret = btf_parse_list_head(btf, &rec->fields[i], &info_arr[i],
+						  from_map_check);
 			if (ret < 0)
 				goto end;
 			break;
 		case BPF_RB_ROOT:
-			ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i]);
+			ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i],
+						from_map_check);
 			if (ret < 0)
 				goto end;
 			break;
@@ -5390,7 +5398,8 @@  btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
 		type = &tab->types[tab->cnt];
 		type->btf_id = i;
 		record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
-						  BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT, t->size);
+						  BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT, t->size,
+					  false);
 		/* The record cannot be unset, treat it as an error if so */
 		if (IS_ERR_OR_NULL(record)) {
 			ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ebaccf77d56e..33eaa3cd32e4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -521,8 +521,11 @@  void btf_record_free(struct btf_record *rec)
 			btf_put(rec->fields[i].kptr.btf);
 			break;
 		case BPF_LIST_HEAD:
-		case BPF_LIST_NODE:
 		case BPF_RB_ROOT:
+			if (rec->fields[i].graph_root.has_btf_ref)
+				btf_put(rec->fields[i].graph_root.btf);
+			break;
+		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
 		case BPF_TIMER:
@@ -570,8 +573,11 @@  struct btf_record *btf_record_dup(const struct btf_record *rec)
 			}
 			break;
 		case BPF_LIST_HEAD:
-		case BPF_LIST_NODE:
 		case BPF_RB_ROOT:
+			if (fields[i].graph_root.has_btf_ref)
+				btf_get(fields[i].graph_root.btf);
+			break;
+		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 		case BPF_SPIN_LOCK:
 		case BPF_TIMER:
@@ -1034,7 +1040,7 @@  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
 				       BPF_RB_ROOT | BPF_REFCOUNT,
-				       map->value_size);
+				       map->value_size, true);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;