| Message ID | 20221206231000.3180914-2-davemarchevsky@fb.com (mailing list archive) |
|---|---|
| State | Superseded |
| Commit | d8939cb0a03ce7e4e69f65bbd31b79fe42f7d5e6 |
| Delegated to | BPF |
| Series | BPF rbtree next-gen datastructure |
On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: > btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. > There, a BTF record is created for any type containing a spin_lock or > any next-gen datastructure node/head. > > Currently, for non-MAP_VALUE types, reg_btf_record will only search for > a record using struct_meta_tab if the reg->type exactly matches > (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an > "allocated obj" type - returned from bpf_obj_new - might pick up other > flags while working its way through the program. > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. Any other flag combination (the only one possible is PTR_UNTRUSTED right now) cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED is to make then unpassable to helpers. > Loosen the check to be exact for base_type and just use MEM_ALLOC mask > for type_flag. > > This patch is marked Fixes as the original intent of reg_btf_record was > unlikely to have been to fail finding btf_record for valid alloc obj > types with additional flags, some of which (e.g. PTR_UNTRUSTED) > are valid register type states for alloc obj independent of this series. That was the actual intent, same as how check_ptr_to_btf_access uses the exact reg->type to allow the BPF_WRITE case. I think this series is the one introducing this case, passing bpf_rbtree_first's result to bpf_rbtree_remove, which I think is not possible to make safe in the first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> bpf_list_del due to this exact issue. More in [0]. [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com > However, I didn't find a specific broken repro case outside of this > series' added functionality, so it's possible that nothing was > triggering this logic error before. > > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> > Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects") > --- > kernel/bpf/verifier.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 1d51bd9596da..67a13110bc22 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type) > type == PTR_TO_SOCK_COMMON; > } > > +static bool type_is_ptr_alloc_obj(u32 type) > +{ > + return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC; > +} > + > static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg) > { > struct btf_record *rec = NULL; > @@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg) > > if (reg->type == PTR_TO_MAP_VALUE) { > rec = reg->map_ptr->record; > - } else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) { > + } else if (type_is_ptr_alloc_obj(reg->type)) { > meta = btf_find_struct_meta(reg->btf, reg->btf_id); > if (meta) > rec = meta->record; > -- > 2.30.2 >
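For quick reference, the two checks under debate in this exchange, side by side. This is a minimal sketch, not a complete function; base_type() and type_flag() are the verifier's existing helpers for splitting a register type into its base type and modifier flags.

```c
/* Exact match (pre-patch): rejects an alloc obj that has picked up any
 * extra modifier flag, e.g. PTR_UNTRUSTED. */
bool exact = reg->type == (PTR_TO_BTF_ID | MEM_ALLOC);

/* Masked match (the patch): only the base type and the MEM_ALLOC bit must
 * match; other modifier flags on the register are ignored. */
bool masked = base_type(reg->type) == PTR_TO_BTF_ID &&
	      (type_flag(reg->type) & MEM_ALLOC);
```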
On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote: > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. >> There, a BTF record is created for any type containing a spin_lock or >> any next-gen datastructure node/head. >> >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for >> a record using struct_meta_tab if the reg->type exactly matches >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an >> "allocated obj" type - returned from bpf_obj_new - might pick up other >> flags while working its way through the program. >> > > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. > Any other flag combination (the only one possible is PTR_UNTRUSTED right now) > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED > is to make then unpassable to helpers. > I see what you mean. If reg_btf_record is only used on regs which are args, then the exact match helps enforce PTR_UNTRUSTED not being an acceptable type flag for an arg. Most uses of reg_btf_record seem to be on arg regs, but then we have its use in reg_may_point_to_spin_lock, which is itself used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not sure that it's only used on arg regs currently. Regardless, if the intended use is on arg regs only, it should be renamed to arg_reg_btf_record or similar to make that clear, as current name sounds like it should be applicable to any reg, and thus not enforce constraints particular to arg regs. But I think it's better to leave it general and enforce those constraints elsewhere. For kfuncs this is already happening in check_kfunc_args, where the big switch statements for KF_ARG_* are doing exact type matching. >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask >> for type_flag. >> >> This patch is marked Fixes as the original intent of reg_btf_record was >> unlikely to have been to fail finding btf_record for valid alloc obj >> types with additional flags, some of which (e.g. PTR_UNTRUSTED) >> are valid register type states for alloc obj independent of this series. > > That was the actual intent, same as how check_ptr_to_btf_access uses the exact > reg->type to allow the BPF_WRITE case. > > I think this series is the one introducing this case, passing bpf_rbtree_first's > result to bpf_rbtree_remove, which I think is not possible to make safe in the > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> > bpf_list_del due to this exact issue. More in [0]. > > [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com > Thanks for the link, I better understand what Alexei meant in his comment on patch 9 of this series. For the helpers added in this series, we can make bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock refs after the rbtree_remove in same manner as they're invalidated after spin_unlock currently. Logic for why this is safe: * If we have two non-owning refs to nodes in a tree, e.g. from bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after, we have no way of knowing if they're aliases of same node. 
* If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree, it might be removing a node that's already been removed, e.g.: n = bpf_obj_new(...); bpf_spin_lock(&lock); bpf_rbtree_add(&tree, &n->node); // n is now non-owning ref to node which was added res = bpf_rbtree_first(); if (!m) {} m = container_of(res, struct node_data, node); // m is now non-owning ref to the same node bpf_rbtree_remove(&tree, &n->node); bpf_rbtree_remove(&tree, &m->node); // BAD bpf_spin_unlock(&lock); * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk of pointing to something that was already removed _only_ after a rbtree_remove, so if we invalidate them all after rbtree_remove they can't be inputs to subsequent remove()s This does conflate current "release non-owning refs because it's not safe to read from them" reasoning with new "release non-owning refs so they can't be passed to remove()". Ideally we could add some new tag to these refs that prevents them from being passed to remove()-type fns, but does allow them to be read, e.g.: n = bpf_obj_new(...); bpf_spin_lock(&lock); bpf_rbtree_add(&tree, &n->node); // n is now non-owning ref to node which was added res = bpf_rbtree_first(); if (!m) {} m = container_of(res, struct node_data, node); // m is now non-owning ref to the same node n = bpf_rbtree_remove(&tree, &n->node); // n is now owning ref again, m is non-owning ref to same node x = m->key; // this should be safe since we're still in CS bpf_rbtree_remove(&tree, &m->node); // But this should be prevented bpf_spin_unlock(&lock); But this would introduce too much addt'l complexity for now IMO. The proposal of just invalidating all non-owning refs prevents both the unsafe second remove() and the safe x = m->key. I will give it a shot, if it doesn't work can change rbtree_remove to rbtree_remove_first w/o node param. But per that linked convo such logic should be tackled eventually, might as well chip away at it now. >> However, I didn't find a specific broken repro case outside of this >> series' added functionality, so it's possible that nothing was >> triggering this logic error before. >> >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> >> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> >> Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects") >> --- >> kernel/bpf/verifier.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 1d51bd9596da..67a13110bc22 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type) >> type == PTR_TO_SOCK_COMMON; >> } >> >> +static bool type_is_ptr_alloc_obj(u32 type) >> +{ >> + return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC; >> +} >> + >> static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg) >> { >> struct btf_record *rec = NULL; >> @@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg) >> >> if (reg->type == PTR_TO_MAP_VALUE) { >> rec = reg->map_ptr->record; >> - } else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) { >> + } else if (type_is_ptr_alloc_obj(reg->type)) { >> meta = btf_find_struct_meta(reg->btf, reg->btf_id); >> if (meta) >> rec = meta->record; >> -- >> 2.30.2 >>
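A cleaned-up sketch of the aliasing scenario described above. It assumes the selftest-style boilerplate quoted later in the thread (a struct node_data with an embedded bpf_rb_node, a groot rb_root protected by glock, and a less() comparator); the NULL check is done on res rather than the inline example's `if (!m)`.

```c
SEC("tc")
long rbtree_double_remove(void *ctx)
{
	struct node_data *n, *m;
	struct bpf_rb_node *res;

	n = bpf_obj_new(typeof(*n));
	if (!n)
		return 1;

	bpf_spin_lock(&glock);
	bpf_rbtree_add(&groot, &n->node, less);		/* n is now a non-owning ref */

	res = bpf_rbtree_first(&groot);
	if (!res) {
		bpf_spin_unlock(&glock);
		return 2;
	}
	m = container_of(res, struct node_data, node);	/* may alias n */

	bpf_rbtree_remove(&groot, &n->node);
	bpf_rbtree_remove(&groot, &m->node);		/* BAD: same node removed twice */
	bpf_spin_unlock(&glock);
	return 0;
}
```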
On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote: > On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote: > > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: > >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. > >> There, a BTF record is created for any type containing a spin_lock or > >> any next-gen datastructure node/head. > >> > >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for > >> a record using struct_meta_tab if the reg->type exactly matches > >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an > >> "allocated obj" type - returned from bpf_obj_new - might pick up other > >> flags while working its way through the program. > >> > > > > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be > > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. > > Any other flag combination (the only one possible is PTR_UNTRUSTED right now) > > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED > > is to make then unpassable to helpers. > > > > I see what you mean. If reg_btf_record is only used on regs which are args, > then the exact match helps enforce PTR_UNTRUSTED not being an acceptable > type flag for an arg. Most uses of reg_btf_record seem to be on arg regs, > but then we have its use in reg_may_point_to_spin_lock, which is itself > used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not > sure that it's only used on arg regs currently. > > Regardless, if the intended use is on arg regs only, it should be renamed to > arg_reg_btf_record or similar to make that clear, as current name sounds like > it should be applicable to any reg, and thus not enforce constraints particular > to arg regs. > > But I think it's better to leave it general and enforce those constraints > elsewhere. For kfuncs this is already happening in check_kfunc_args, where the > big switch statements for KF_ARG_* are doing exact type matching. > > >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask > >> for type_flag. > >> > >> This patch is marked Fixes as the original intent of reg_btf_record was > >> unlikely to have been to fail finding btf_record for valid alloc obj > >> types with additional flags, some of which (e.g. PTR_UNTRUSTED) > >> are valid register type states for alloc obj independent of this series. > > > > That was the actual intent, same as how check_ptr_to_btf_access uses the exact > > reg->type to allow the BPF_WRITE case. > > > > I think this series is the one introducing this case, passing bpf_rbtree_first's > > result to bpf_rbtree_remove, which I think is not possible to make safe in the > > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> > > bpf_list_del due to this exact issue. More in [0]. > > > > [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com > > > > Thanks for the link, I better understand what Alexei meant in his comment on > patch 9 of this series. For the helpers added in this series, we can make > bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock > refs after the rbtree_remove in same manner as they're invalidated after > spin_unlock currently. > > Logic for why this is safe: > > * If we have two non-owning refs to nodes in a tree, e.g. 
from > bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after, > we have no way of knowing if they're aliases of same node. > > * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree, > it might be removing a node that's already been removed, e.g.: > > n = bpf_obj_new(...); > bpf_spin_lock(&lock); > > bpf_rbtree_add(&tree, &n->node); > // n is now non-owning ref to node which was added > res = bpf_rbtree_first(); > if (!m) {} > m = container_of(res, struct node_data, node); > // m is now non-owning ref to the same node > bpf_rbtree_remove(&tree, &n->node); > bpf_rbtree_remove(&tree, &m->node); // BAD Let me clarify my previous email: Above doesn't have to be 'BAD'. Instead of if (WARN_ON_ONCE(RB_EMPTY_NODE(n))) we can drop WARN and simply return. If node is not part of the tree -> nop. Same for bpf_rbtree_add. If it's already added -> nop. Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics. We do all these checks under the same rbtree root lock, so it's safe. > bpf_spin_unlock(&lock); > > * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk > of pointing to something that was already removed _only_ after a > rbtree_remove, so if we invalidate them all after rbtree_remove they can't > be inputs to subsequent remove()s With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add can have release semantics. No need for special release_on_unlock hacks. > This does conflate current "release non-owning refs because it's not safe to > read from them" reasoning with new "release non-owning refs so they can't be > passed to remove()". Ideally we could add some new tag to these refs that > prevents them from being passed to remove()-type fns, but does allow them to > be read, e.g.: > > n = bpf_obj_new(...); 'n' is acquired. > bpf_spin_lock(&lock); > > bpf_rbtree_add(&tree, &n->node); > // n is now non-owning ref to node which was added since bpf_rbtree_add does release on 'n'... > res = bpf_rbtree_first(); > if (!m) {} > m = container_of(res, struct node_data, node); > // m is now non-owning ref to the same node ... below is not allowed by the verifier. > n = bpf_rbtree_remove(&tree, &n->node); I'm not sure what's an idea to return 'n' from remove... Maybe it should be simple bool ? > // n is now owning ref again, m is non-owning ref to same node > x = m->key; // this should be safe since we're still in CS below works because 'm' cames from bpf_rbtree_first that acquired 'res'. > bpf_rbtree_remove(&tree, &m->node); // But this should be prevented > > bpf_spin_unlock(&lock); >
On Thu, Dec 08, 2022 at 12:04:44AM IST, Dave Marchevsky wrote: > On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote: > > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: > >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. > >> There, a BTF record is created for any type containing a spin_lock or > >> any next-gen datastructure node/head. > >> > >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for > >> a record using struct_meta_tab if the reg->type exactly matches > >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an > >> "allocated obj" type - returned from bpf_obj_new - might pick up other > >> flags while working its way through the program. > >> > > > > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be > > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. > > Any other flag combination (the only one possible is PTR_UNTRUSTED right now) > > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED > > is to make then unpassable to helpers. > > > > I see what you mean. If reg_btf_record is only used on regs which are args, > then the exact match helps enforce PTR_UNTRUSTED not being an acceptable > type flag for an arg. Most uses of reg_btf_record seem to be on arg regs, > but then we have its use in reg_may_point_to_spin_lock, which is itself > used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not > sure that it's only used on arg regs currently. > > Regardless, if the intended use is on arg regs only, it should be renamed to > arg_reg_btf_record or similar to make that clear, as current name sounds like > it should be applicable to any reg, and thus not enforce constraints particular > to arg regs. > > But I think it's better to leave it general and enforce those constraints > elsewhere. For kfuncs this is already happening in check_kfunc_args, where the > big switch statements for KF_ARG_* are doing exact type matching. > > >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask > >> for type_flag. > >> > >> This patch is marked Fixes as the original intent of reg_btf_record was > >> unlikely to have been to fail finding btf_record for valid alloc obj > >> types with additional flags, some of which (e.g. PTR_UNTRUSTED) > >> are valid register type states for alloc obj independent of this series. > > > > That was the actual intent, same as how check_ptr_to_btf_access uses the exact > > reg->type to allow the BPF_WRITE case. > > > > I think this series is the one introducing this case, passing bpf_rbtree_first's > > result to bpf_rbtree_remove, which I think is not possible to make safe in the > > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> > > bpf_list_del due to this exact issue. More in [0]. > > > > [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com > > > > Thanks for the link, I better understand what Alexei meant in his comment on > patch 9 of this series. For the helpers added in this series, we can make > bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock > refs after the rbtree_remove in same manner as they're invalidated after > spin_unlock currently. > Rather than doing that, you'll cut down on a lot of complexity and confusion regarding PTR_UNTRUSTED's use in this set by removing bpf_rbtree_first and bpf_rbtree_remove, and simply exposing bpf_rbtree_pop_front. 
> Logic for why this is safe: > > * If we have two non-owning refs to nodes in a tree, e.g. from > bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after, > we have no way of knowing if they're aliases of same node. > > * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree, > it might be removing a node that's already been removed, e.g.: > > n = bpf_obj_new(...); > bpf_spin_lock(&lock); > > bpf_rbtree_add(&tree, &n->node); > // n is now non-owning ref to node which was added > res = bpf_rbtree_first(); > if (!m) {} > m = container_of(res, struct node_data, node); > // m is now non-owning ref to the same node > bpf_rbtree_remove(&tree, &n->node); > bpf_rbtree_remove(&tree, &m->node); // BAD > > bpf_spin_unlock(&lock); > > * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk > of pointing to something that was already removed _only_ after a > rbtree_remove, so if we invalidate them all after rbtree_remove they can't > be inputs to subsequent remove()s > > This does conflate current "release non-owning refs because it's not safe to > read from them" reasoning with new "release non-owning refs so they can't be > passed to remove()". Ideally we could add some new tag to these refs that > prevents them from being passed to remove()-type fns, but does allow them to > be read, e.g.: > > n = bpf_obj_new(...); > bpf_spin_lock(&lock); > > bpf_rbtree_add(&tree, &n->node); > // n is now non-owning ref to node which was added > res = bpf_rbtree_first(); > if (!m) {} > m = container_of(res, struct node_data, node); > // m is now non-owning ref to the same node > n = bpf_rbtree_remove(&tree, &n->node); > // n is now owning ref again, m is non-owning ref to same node > x = m->key; // this should be safe since we're still in CS > bpf_rbtree_remove(&tree, &m->node); // But this should be prevented > > bpf_spin_unlock(&lock); > > But this would introduce too much addt'l complexity for now IMO. The proposal > of just invalidating all non-owning refs prevents both the unsafe second > remove() and the safe x = m->key. > > I will give it a shot, if it doesn't work can change rbtree_remove to > rbtree_remove_first w/o node param. But per that linked convo such logic > should be tackled eventually, might as well chip away at it now. > I sympathise with your goal to make it as close to kernel programming style as possible. I was exploring the same option (as you saw in that link). But based on multiple discussions so far and trying different approaches, I'm convinced the additional complexity in the verifier is not worth it. Both bpf_list_del and bpf_rbtree_remove are useful and should be added, but should work on e.g. 'current' node in iteration callback. In that context verifier knows that the node is part of the list/rbtree. Introducing more than one node in that same context introduces potential aliasing which hinders the verifier's ability to reason about safety. Then, it has to be pessimistic like your case and invalidate everything to prevent invalid use, so that double list_del and double rbtree_remove is not possible. You will avoid all the problems with PTR_UNTRUSTED being passed to helpers if you adopt such an approach. The code will become much simpler, while allowing people to do the same thing without any loss of usability.
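For contrast, a sketch of the program-side usage under the pop-only API suggested above. bpf_rbtree_pop_front is hypothetical here, named by analogy with bpf_list_pop_front; it would return an owning reference or NULL, so no untrusted node argument is ever passed back in.

```c
SEC("tc")
long rbtree_pop_front_usage(void *ctx)
{
	struct bpf_rb_node *res;
	struct node_data *n;

	bpf_spin_lock(&glock);
	res = bpf_rbtree_pop_front(&groot);	/* hypothetical: owning ref or NULL */
	bpf_spin_unlock(&glock);
	if (!res)
		return 0;

	n = container_of(res, struct node_data, node);
	bpf_obj_drop(n);			/* caller owns the popped node */
	return 0;
}
```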
On 12/7/22 1:59 PM, Alexei Starovoitov wrote: > On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote: >> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote: >>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: >>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. >>>> There, a BTF record is created for any type containing a spin_lock or >>>> any next-gen datastructure node/head. >>>> >>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for >>>> a record using struct_meta_tab if the reg->type exactly matches >>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an >>>> "allocated obj" type - returned from bpf_obj_new - might pick up other >>>> flags while working its way through the program. >>>> >>> >>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be >>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. >>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now) >>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED >>> is to make then unpassable to helpers. >>> >> >> I see what you mean. If reg_btf_record is only used on regs which are args, >> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable >> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs, >> but then we have its use in reg_may_point_to_spin_lock, which is itself >> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not >> sure that it's only used on arg regs currently. >> >> Regardless, if the intended use is on arg regs only, it should be renamed to >> arg_reg_btf_record or similar to make that clear, as current name sounds like >> it should be applicable to any reg, and thus not enforce constraints particular >> to arg regs. >> >> But I think it's better to leave it general and enforce those constraints >> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the >> big switch statements for KF_ARG_* are doing exact type matching. >> >>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask >>>> for type_flag. >>>> >>>> This patch is marked Fixes as the original intent of reg_btf_record was >>>> unlikely to have been to fail finding btf_record for valid alloc obj >>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED) >>>> are valid register type states for alloc obj independent of this series. >>> >>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact >>> reg->type to allow the BPF_WRITE case. >>> >>> I think this series is the one introducing this case, passing bpf_rbtree_first's >>> result to bpf_rbtree_remove, which I think is not possible to make safe in the >>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> >>> bpf_list_del due to this exact issue. More in [0]. >>> >>> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com >>> >> >> Thanks for the link, I better understand what Alexei meant in his comment on >> patch 9 of this series. For the helpers added in this series, we can make >> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock >> refs after the rbtree_remove in same manner as they're invalidated after >> spin_unlock currently. >> >> Logic for why this is safe: >> >> * If we have two non-owning refs to nodes in a tree, e.g. 
from >> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after, >> we have no way of knowing if they're aliases of same node. >> >> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree, >> it might be removing a node that's already been removed, e.g.: >> >> n = bpf_obj_new(...); >> bpf_spin_lock(&lock); >> >> bpf_rbtree_add(&tree, &n->node); >> // n is now non-owning ref to node which was added >> res = bpf_rbtree_first(); >> if (!m) {} >> m = container_of(res, struct node_data, node); >> // m is now non-owning ref to the same node >> bpf_rbtree_remove(&tree, &n->node); >> bpf_rbtree_remove(&tree, &m->node); // BAD > > Let me clarify my previous email: > > Above doesn't have to be 'BAD'. > Instead of > if (WARN_ON_ONCE(RB_EMPTY_NODE(n))) > > we can drop WARN and simply return. > If node is not part of the tree -> nop. > > Same for bpf_rbtree_add. > If it's already added -> nop. > These runtime checks can certainly be done, but if we can guarantee via verifier type system that a particular ptr-to-node is guaranteed to be in / not be in a tree, that's better, no? Feels like a similar train of thought to "fail verification when correct rbtree lock isn't held" vs "just check if lock is held in every rbtree API kfunc". > Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics. > We do all these checks under the same rbtree root lock, so it's safe. > I'll comment on PTR_TRUSTED in our discussion on patch 10. >> bpf_spin_unlock(&lock); >> >> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk >> of pointing to something that was already removed _only_ after a >> rbtree_remove, so if we invalidate them all after rbtree_remove they can't >> be inputs to subsequent remove()s > > With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add > can have release semantics. > No need for special release_on_unlock hacks. > If we want to be able to interact w/ nodes after they've been added to the rbtree, but before critical section ends, we need to support non-owning refs, which are currently implemented using special release_on_unlock logic. If we go with the runtime check suggestion from above, we'd need to implement 'conditional release' similarly to earlier "rbtree map" attempt: https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ . If rbtree_add has release semantics for its node arg, but the node is already in some tree and runtime check fails, the reference should not be released as rbtree_add() was a nop. Similarly, if rbtree_remove has release semantics for its node arg and acquire semantics for its return value, runtime check failing should result in the node arg not being released. Acquire semantics for the retval are already conditional - if retval == NULL, mark_ptr_or_null regs will release the acquired ref before it can be used. So no issue with failing rbtree_remove messing up acquire. For this reason rbtree_remove and rbtree_first are tagged KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be refactored into a similar flag, KF_RELEASE_NON_OWN or similar. >> This does conflate current "release non-owning refs because it's not safe to >> read from them" reasoning with new "release non-owning refs so they can't be >> passed to remove()". Ideally we could add some new tag to these refs that >> prevents them from being passed to remove()-type fns, but does allow them to >> be read, e.g.: >> >> n = bpf_obj_new(...); > > 'n' is acquired. 
> >> bpf_spin_lock(&lock); >> >> bpf_rbtree_add(&tree, &n->node); >> // n is now non-owning ref to node which was added > > since bpf_rbtree_add does release on 'n'... > >> res = bpf_rbtree_first(); >> if (!m) {} >> m = container_of(res, struct node_data, node); >> // m is now non-owning ref to the same node > > ... below is not allowed by the verifier. >> n = bpf_rbtree_remove(&tree, &n->node); > > I'm not sure what's an idea to return 'n' from remove... > Maybe it should be simple bool ? > I agree that returning node from rbtree_remove is not strictly necessary, since rbtree_remove can be thought of turning its non-owning ref argument into an owning ref, instead of taking non-owning ref and returning owning ref. But such an operation isn't really an 'acquire' by current verifier logic, since only retvals can be 'acquired'. So we'd need to add some logic to enable acquire semantics for args. Furthermore it's not really 'acquiring' a new ref, rather changing properties of node arg ref. However, if rbtree_remove can fail, such a "turn non-owning into owning" operation will need to be able to fail as well, and the program will need to be able to check for failure. Returning 'acquire' result in retval makes this simple - just check for NULL. For your "return bool" proposal, we'd have to add verifier logic which turns the 'acquired' owning ref back into non-owning based on check of the bool, which will add some verifier complexity. IIRC when doing experimentation with "rbtree map" implementation, I did something like this and decided that the additional complexity wasn't worth it when retval can just be used. >> // n is now owning ref again, m is non-owning ref to same node >> x = m->key; // this should be safe since we're still in CS > > below works because 'm' cames from bpf_rbtree_first that acquired 'res'. > >> bpf_rbtree_remove(&tree, &m->node); // But this should be prevented >> >> bpf_spin_unlock(&lock); >>
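The failure-handling pattern being argued for, sketched with the names used earlier in the thread: with KF_ACQUIRE | KF_RET_NULL semantics on the retval, "did the remove succeed" is an ordinary NULL test.

```c
	bpf_spin_lock(&glock);
	res = bpf_rbtree_remove(&groot, &n->node);
	if (!res) {
		/* remove failed: nothing was acquired, just unlock and bail */
		bpf_spin_unlock(&glock);
		return 0;
	}
	n = container_of(res, struct node_data, node);	/* owning ref again */
	bpf_spin_unlock(&glock);
	bpf_obj_drop(n);	/* release the owning ref outside the critical section */
```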
On Wed, Dec 07, 2022 at 03:38:55PM -0500, Dave Marchevsky wrote: > On 12/7/22 1:59 PM, Alexei Starovoitov wrote: > > On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote: > >> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote: > >>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: > >>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. > >>>> There, a BTF record is created for any type containing a spin_lock or > >>>> any next-gen datastructure node/head. > >>>> > >>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for > >>>> a record using struct_meta_tab if the reg->type exactly matches > >>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an > >>>> "allocated obj" type - returned from bpf_obj_new - might pick up other > >>>> flags while working its way through the program. > >>>> > >>> > >>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be > >>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. > >>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now) > >>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED > >>> is to make then unpassable to helpers. > >>> > >> > >> I see what you mean. If reg_btf_record is only used on regs which are args, > >> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable > >> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs, > >> but then we have its use in reg_may_point_to_spin_lock, which is itself > >> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not > >> sure that it's only used on arg regs currently. > >> > >> Regardless, if the intended use is on arg regs only, it should be renamed to > >> arg_reg_btf_record or similar to make that clear, as current name sounds like > >> it should be applicable to any reg, and thus not enforce constraints particular > >> to arg regs. > >> > >> But I think it's better to leave it general and enforce those constraints > >> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the > >> big switch statements for KF_ARG_* are doing exact type matching. > >> > >>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask > >>>> for type_flag. > >>>> > >>>> This patch is marked Fixes as the original intent of reg_btf_record was > >>>> unlikely to have been to fail finding btf_record for valid alloc obj > >>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED) > >>>> are valid register type states for alloc obj independent of this series. > >>> > >>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact > >>> reg->type to allow the BPF_WRITE case. > >>> > >>> I think this series is the one introducing this case, passing bpf_rbtree_first's > >>> result to bpf_rbtree_remove, which I think is not possible to make safe in the > >>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> > >>> bpf_list_del due to this exact issue. More in [0]. > >>> > >>> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com > >>> > >> > >> Thanks for the link, I better understand what Alexei meant in his comment on > >> patch 9 of this series. 
For the helpers added in this series, we can make > >> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock > >> refs after the rbtree_remove in same manner as they're invalidated after > >> spin_unlock currently. > >> > >> Logic for why this is safe: > >> > >> * If we have two non-owning refs to nodes in a tree, e.g. from > >> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after, > >> we have no way of knowing if they're aliases of same node. > >> > >> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree, > >> it might be removing a node that's already been removed, e.g.: > >> > >> n = bpf_obj_new(...); > >> bpf_spin_lock(&lock); > >> > >> bpf_rbtree_add(&tree, &n->node); > >> // n is now non-owning ref to node which was added > >> res = bpf_rbtree_first(); > >> if (!m) {} > >> m = container_of(res, struct node_data, node); > >> // m is now non-owning ref to the same node > >> bpf_rbtree_remove(&tree, &n->node); > >> bpf_rbtree_remove(&tree, &m->node); // BAD > > > > Let me clarify my previous email: > > > > Above doesn't have to be 'BAD'. > > Instead of > > if (WARN_ON_ONCE(RB_EMPTY_NODE(n))) > > > > we can drop WARN and simply return. > > If node is not part of the tree -> nop. > > > > Same for bpf_rbtree_add. > > If it's already added -> nop. > > > > These runtime checks can certainly be done, but if we can guarantee via > verifier type system that a particular ptr-to-node is guaranteed to be in / > not be in a tree, that's better, no? > > Feels like a similar train of thought to "fail verification when correct rbtree > lock isn't held" vs "just check if lock is held in every rbtree API kfunc". > > > Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics. > > We do all these checks under the same rbtree root lock, so it's safe. > > > > I'll comment on PTR_TRUSTED in our discussion on patch 10. > > >> bpf_spin_unlock(&lock); > >> > >> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk > >> of pointing to something that was already removed _only_ after a > >> rbtree_remove, so if we invalidate them all after rbtree_remove they can't > >> be inputs to subsequent remove()s > > > > With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add > > can have release semantics. > > No need for special release_on_unlock hacks. > > > > If we want to be able to interact w/ nodes after they've been added to the > rbtree, but before critical section ends, we need to support non-owning refs, > which are currently implemented using special release_on_unlock logic. > > If we go with the runtime check suggestion from above, we'd need to implement > 'conditional release' similarly to earlier "rbtree map" attempt: > https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ . > > If rbtree_add has release semantics for its node arg, but the node is already > in some tree and runtime check fails, the reference should not be released as > rbtree_add() was a nop. Got it. The conditional release is tricky. We should probably avoid it for now. I think we can either go with Kumar's proposal and do bpf_rbtree_pop_front() instead of bpf_rbtree_first() that avoids all these issues... 
but considering that we'll have inline iterators soon and should be able to do: struct bpf_rbtree_iter it; struct bpf_rb_node * node; bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree while ((node = bpf_rbtree_iter_next(&it)) { if (node->field == condition) { struct bpf_rb_node *n; n = bpf_rbtree_remove(rb_root, node); bpf_spin_lock(another_rb_root); bpf_rbtree_add(another_rb_root, n); bpf_spin_unlock(another_rb_root); break; } } bpf_rbtree_iter_destroy(&it); We can treat the 'node' returned from bpf_rbtree_iter_next() the same way as return from bpf_rbtree_first() -> PTR_TRUSTED | MAYBE_NULL, but not acquired (ref_obj_id == 0). bpf_rbtree_add -> KF_RELEASE so we cannot pass not acquired pointers into it. We should probably remove release_on_unlock logic as Kumar suggesting and make bpf_list_push_front/back to be KF_RELEASE. Then bpf_list_pop_front/back stay KF_ACQUIRE | KF_RET_NULL and bpf_rbtree_remove is also KF_ACQUIRE | KF_RET_NULL. The difference is bpf_list_pop has only 'head' while bpf_rbtree_remove has 'root' and 'node' where 'node' has to be PTR_TRUSTED (but not acquired). bpf_rbtree_add will always succeed. bpf_rbtree_remove will conditionally fail if 'node' is not linked. Similarly we can extend link list with n = bpf_list_remove(node) which will have KF_ACQUIRE | KF_RET_NULL semantics. Then everything is nicely uniform. We'll be able to iterate rbtree and iterate link lists. There are downsides, of course. Like the following from your test case: + bpf_spin_lock(&glock); + bpf_rbtree_add(&groot, &n->node, less); + bpf_rbtree_add(&groot, &m->node, less); + res = bpf_rbtree_remove(&groot, &n->node); + bpf_spin_unlock(&glock); will not work. Since bpf_rbtree_add() releases 'n' and it becomes UNTRUSTED. (assuming release_on_unlock is removed). I think it's fine for now. I have to agree with Kumar that it's hard to come up with realistic use case where 'n' should be accessed after it was added to link list or rbtree. Above test case doesn't look real. This part of your test case: + bpf_spin_lock(&glock); + bpf_rbtree_add(&groot, &n->node, less); + bpf_rbtree_add(&groot, &m->node, less); + bpf_rbtree_add(&groot, &o->node, less); + + res = bpf_rbtree_first(&groot); + if (!res) { + bpf_spin_unlock(&glock); + return 2; + } + + o = container_of(res, struct node_data, node); + res = bpf_rbtree_remove(&groot, &o->node); + bpf_spin_unlock(&glock); will work, because bpf_rbtree_first returns PTR_TRUSTED | MAYBE_NULL. > Similarly, if rbtree_remove has release semantics for its node arg and acquire > semantics for its return value, runtime check failing should result in the > node arg not being released. Acquire semantics for the retval are already > conditional - if retval == NULL, mark_ptr_or_null regs will release the > acquired ref before it can be used. So no issue with failing rbtree_remove > messing up acquire. > > For this reason rbtree_remove and rbtree_first are tagged > KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be > refactored into a similar flag, KF_RELEASE_NON_OWN or similar. I guess what I'm propsing above is sort-of KF_RELEASE_NON_OWN idea, but from a different angle. I'd like to avoid introducing new flags. I think PTR_TRUSTED is enough. > > I'm not sure what's an idea to return 'n' from remove... > > Maybe it should be simple bool ? 
> > > > I agree that returning node from rbtree_remove is not strictly necessary, since > rbtree_remove can be thought of turning its non-owning ref argument into an > owning ref, instead of taking non-owning ref and returning owning ref. But such > an operation isn't really an 'acquire' by current verifier logic, since only > retvals can be 'acquired'. So we'd need to add some logic to enable acquire > semantics for args. Furthermore it's not really 'acquiring' a new ref, rather > changing properties of node arg ref. > > However, if rbtree_remove can fail, such a "turn non-owning into owning" > operation will need to be able to fail as well, and the program will need to > be able to check for failure. Returning 'acquire' result in retval makes > this simple - just check for NULL. For your "return bool" proposal, we'd have > to add verifier logic which turns the 'acquired' owning ref back into non-owning > based on check of the bool, which will add some verifier complexity. > > IIRC when doing experimentation with "rbtree map" implementation, I did > something like this and decided that the additional complexity wasn't worth > it when retval can just be used. Agree. Forget 'bool' idea.
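A sketch of how the flag scheme outlined above could be expressed as kfunc registrations. KF_RELEASE, KF_ACQUIRE and KF_RET_NULL are the existing kfunc flags; the set name and the exact entries here are illustrative assumptions, not taken from the posted series.

```c
BTF_SET8_START(bpf_rbtree_list_kfunc_ids)
BTF_ID_FLAGS(func, bpf_list_push_front, KF_RELEASE)	/* instead of release_on_unlock */
BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_rbtree_add, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)	/* trusted but not acquired */
BTF_SET8_END(bpf_rbtree_list_kfunc_ids)
```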
On 12/7/22 5:46 PM, Alexei Starovoitov wrote: > On Wed, Dec 07, 2022 at 03:38:55PM -0500, Dave Marchevsky wrote: >> On 12/7/22 1:59 PM, Alexei Starovoitov wrote: >>> On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote: >>>> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote: >>>>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote: >>>>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. >>>>>> There, a BTF record is created for any type containing a spin_lock or >>>>>> any next-gen datastructure node/head. >>>>>> >>>>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for >>>>>> a record using struct_meta_tab if the reg->type exactly matches >>>>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an >>>>>> "allocated obj" type - returned from bpf_obj_new - might pick up other >>>>>> flags while working its way through the program. >>>>>> >>>>> >>>>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be >>>>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record. >>>>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now) >>>>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED >>>>> is to make then unpassable to helpers. >>>>> >>>> >>>> I see what you mean. If reg_btf_record is only used on regs which are args, >>>> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable >>>> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs, >>>> but then we have its use in reg_may_point_to_spin_lock, which is itself >>>> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not >>>> sure that it's only used on arg regs currently. >>>> >>>> Regardless, if the intended use is on arg regs only, it should be renamed to >>>> arg_reg_btf_record or similar to make that clear, as current name sounds like >>>> it should be applicable to any reg, and thus not enforce constraints particular >>>> to arg regs. >>>> >>>> But I think it's better to leave it general and enforce those constraints >>>> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the >>>> big switch statements for KF_ARG_* are doing exact type matching. >>>> >>>>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask >>>>>> for type_flag. >>>>>> >>>>>> This patch is marked Fixes as the original intent of reg_btf_record was >>>>>> unlikely to have been to fail finding btf_record for valid alloc obj >>>>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED) >>>>>> are valid register type states for alloc obj independent of this series. >>>>> >>>>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact >>>>> reg->type to allow the BPF_WRITE case. >>>>> >>>>> I think this series is the one introducing this case, passing bpf_rbtree_first's >>>>> result to bpf_rbtree_remove, which I think is not possible to make safe in the >>>>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry -> >>>>> bpf_list_del due to this exact issue. More in [0]. >>>>> >>>>> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com >>>>> >>>> >>>> Thanks for the link, I better understand what Alexei meant in his comment on >>>> patch 9 of this series. 
For the helpers added in this series, we can make >>>> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock >>>> refs after the rbtree_remove in same manner as they're invalidated after >>>> spin_unlock currently. >>>> >>>> Logic for why this is safe: >>>> >>>> * If we have two non-owning refs to nodes in a tree, e.g. from >>>> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after, >>>> we have no way of knowing if they're aliases of same node. >>>> >>>> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree, >>>> it might be removing a node that's already been removed, e.g.: >>>> >>>> n = bpf_obj_new(...); >>>> bpf_spin_lock(&lock); >>>> >>>> bpf_rbtree_add(&tree, &n->node); >>>> // n is now non-owning ref to node which was added >>>> res = bpf_rbtree_first(); >>>> if (!m) {} >>>> m = container_of(res, struct node_data, node); >>>> // m is now non-owning ref to the same node >>>> bpf_rbtree_remove(&tree, &n->node); >>>> bpf_rbtree_remove(&tree, &m->node); // BAD >>> >>> Let me clarify my previous email: >>> >>> Above doesn't have to be 'BAD'. >>> Instead of >>> if (WARN_ON_ONCE(RB_EMPTY_NODE(n))) >>> >>> we can drop WARN and simply return. >>> If node is not part of the tree -> nop. >>> >>> Same for bpf_rbtree_add. >>> If it's already added -> nop. >>> >> >> These runtime checks can certainly be done, but if we can guarantee via >> verifier type system that a particular ptr-to-node is guaranteed to be in / >> not be in a tree, that's better, no? >> >> Feels like a similar train of thought to "fail verification when correct rbtree >> lock isn't held" vs "just check if lock is held in every rbtree API kfunc". >> >>> Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics. >>> We do all these checks under the same rbtree root lock, so it's safe. >>> >> >> I'll comment on PTR_TRUSTED in our discussion on patch 10. >> >>>> bpf_spin_unlock(&lock); >>>> >>>> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk >>>> of pointing to something that was already removed _only_ after a >>>> rbtree_remove, so if we invalidate them all after rbtree_remove they can't >>>> be inputs to subsequent remove()s >>> >>> With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add >>> can have release semantics. >>> No need for special release_on_unlock hacks. >>> >> >> If we want to be able to interact w/ nodes after they've been added to the >> rbtree, but before critical section ends, we need to support non-owning refs, >> which are currently implemented using special release_on_unlock logic. >> >> If we go with the runtime check suggestion from above, we'd need to implement >> 'conditional release' similarly to earlier "rbtree map" attempt: >> https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ . >> >> If rbtree_add has release semantics for its node arg, but the node is already >> in some tree and runtime check fails, the reference should not be released as >> rbtree_add() was a nop. > > Got it. > The conditional release is tricky. We should probably avoid it for now. > > I think we can either go with Kumar's proposal and do > bpf_rbtree_pop_front() instead of bpf_rbtree_first() > that avoids all these issues... 
> > but considering that we'll have inline iterators soon and should be able to do: > > struct bpf_rbtree_iter it; > struct bpf_rb_node * node; > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree > while ((node = bpf_rbtree_iter_next(&it)) { > if (node->field == condition) { > struct bpf_rb_node *n; > > n = bpf_rbtree_remove(rb_root, node); > bpf_spin_lock(another_rb_root); > bpf_rbtree_add(another_rb_root, n); > bpf_spin_unlock(another_rb_root); > break; > } > } > bpf_rbtree_iter_destroy(&it); > > We can treat the 'node' returned from bpf_rbtree_iter_next() the same way > as return from bpf_rbtree_first() -> PTR_TRUSTED | MAYBE_NULL, > but not acquired (ref_obj_id == 0). > > bpf_rbtree_add -> KF_RELEASE > so we cannot pass not acquired pointers into it. > > We should probably remove release_on_unlock logic as Kumar suggesting and > make bpf_list_push_front/back to be KF_RELEASE. > > Then > bpf_list_pop_front/back stay KF_ACQUIRE | KF_RET_NULL > and > bpf_rbtree_remove is also KF_ACQUIRE | KF_RET_NULL. > > The difference is bpf_list_pop has only 'head' > while bpf_rbtree_remove has 'root' and 'node' where 'node' has to be PTR_TRUSTED > (but not acquired). > > bpf_rbtree_add will always succeed. > bpf_rbtree_remove will conditionally fail if 'node' is not linked. > > Similarly we can extend link list with > n = bpf_list_remove(node) > which will have KF_ACQUIRE | KF_RET_NULL semantics. > > Then everything is nicely uniform. > We'll be able to iterate rbtree and iterate link lists. > > There are downsides, of course. > Like the following from your test case: > + bpf_spin_lock(&glock); > + bpf_rbtree_add(&groot, &n->node, less); > + bpf_rbtree_add(&groot, &m->node, less); > + res = bpf_rbtree_remove(&groot, &n->node); > + bpf_spin_unlock(&glock); > will not work. > Since bpf_rbtree_add() releases 'n' and it becomes UNTRUSTED. > (assuming release_on_unlock is removed). > > I think it's fine for now. I have to agree with Kumar that it's hard to come up > with realistic use case where 'n' should be accessed after it was added to link > list or rbtree. Above test case doesn't look real. > > This part of your test case: > + bpf_spin_lock(&glock); > + bpf_rbtree_add(&groot, &n->node, less); > + bpf_rbtree_add(&groot, &m->node, less); > + bpf_rbtree_add(&groot, &o->node, less); > + > + res = bpf_rbtree_first(&groot); > + if (!res) { > + bpf_spin_unlock(&glock); > + return 2; > + } > + > + o = container_of(res, struct node_data, node); > + res = bpf_rbtree_remove(&groot, &o->node); > + bpf_spin_unlock(&glock); > > will work, because bpf_rbtree_first returns PTR_TRUSTED | MAYBE_NULL. > >> Similarly, if rbtree_remove has release semantics for its node arg and acquire >> semantics for its return value, runtime check failing should result in the >> node arg not being released. Acquire semantics for the retval are already >> conditional - if retval == NULL, mark_ptr_or_null regs will release the >> acquired ref before it can be used. So no issue with failing rbtree_remove >> messing up acquire. >> >> For this reason rbtree_remove and rbtree_first are tagged >> KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be >> refactored into a similar flag, KF_RELEASE_NON_OWN or similar. > > I guess what I'm propsing above is sort-of KF_RELEASE_NON_OWN idea, > but from a different angle. > I'd like to avoid introducing new flags. > I think PTR_TRUSTED is enough. > >>> I'm not sure what's an idea to return 'n' from remove... >>> Maybe it should be simple bool ? 
>>> >> >> I agree that returning node from rbtree_remove is not strictly necessary, since >> rbtree_remove can be thought of turning its non-owning ref argument into an >> owning ref, instead of taking non-owning ref and returning owning ref. But such >> an operation isn't really an 'acquire' by current verifier logic, since only >> retvals can be 'acquired'. So we'd need to add some logic to enable acquire >> semantics for args. Furthermore it's not really 'acquiring' a new ref, rather >> changing properties of node arg ref. >> >> However, if rbtree_remove can fail, such a "turn non-owning into owning" >> operation will need to be able to fail as well, and the program will need to >> be able to check for failure. Returning 'acquire' result in retval makes >> this simple - just check for NULL. For your "return bool" proposal, we'd have >> to add verifier logic which turns the 'acquired' owning ref back into non-owning >> based on check of the bool, which will add some verifier complexity. >> >> IIRC when doing experimentation with "rbtree map" implementation, I did >> something like this and decided that the additional complexity wasn't worth >> it when retval can just be used. > > Agree. Forget 'bool' idea. We will merge this convo w/ similar one in the cover letter's thread, and continue w/ replies there.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d51bd9596da..67a13110bc22 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type)
 	       type == PTR_TO_SOCK_COMMON;
 }
 
+static bool type_is_ptr_alloc_obj(u32 type)
+{
+	return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
+}
+
 static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
 {
 	struct btf_record *rec = NULL;
@@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
 
 	if (reg->type == PTR_TO_MAP_VALUE) {
 		rec = reg->map_ptr->record;
-	} else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
+	} else if (type_is_ptr_alloc_obj(reg->type)) {
 		meta = btf_find_struct_meta(reg->btf, reg->btf_id);
 		if (meta)
 			rec = meta->record;
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.

Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.

This patch is marked Fixes as the original intent of reg_btf_record was
unlikely to have been to fail finding btf_record for valid alloc obj
types with additional flags, some of which (e.g. PTR_UNTRUSTED)
are valid register type states for alloc obj independent of this series.

However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
---
 kernel/bpf/verifier.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)