| Message ID | 20230821193311.3290257-6-davemarchevsky@fb.com (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | 0816b8c6bf7fc87cec4273dc199e8f0764b9e7b1 |
| Delegated to | BPF |
| Series | BPF Refcount followups 3: bpf_mem_free_rcu refcounted nodes |
On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> An earlier patch in the series ensures that the underlying memory of
> nodes with bpf_refcount - which can have multiple owners - is not reused
> until RCU grace period has elapsed. This prevents use-after-free with
> non-owning references that may point to recently-freed memory. While RCU
> read lock is held, it's safe to dereference such a non-owning ref, as by
> definition RCU GP couldn't have elapsed and therefore underlying memory
> couldn't have been reused.
>
> From the perspective of verifier "trustedness" non-owning refs to
> refcounted nodes are now trusted only in RCU CS and therefore should no
> longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them
> MEM_RCU in order to reflect this new state.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>
> [...]
>
> @@ -10491,6 +10493,9 @@ static int ref_set_non_owning(struct bpf_verifier_env *env, struct bpf_reg_state
>  	}
>
>  	reg->type |= NON_OWN_REF;
> +	if (rec->refcount_off >= 0)
> +		reg->type |= MEM_RCU;

Should the above MEM_RCU marking only be done when the reg access is in an
RCU critical section?

I think we still have issues for state resetting with bpf_spin_unlock() and
bpf_rcu_read_unlock(), both of which will try to convert the reg state to
PTR_UNTRUSTED.

Let us say the reg state is
  PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU

(1). If hitting bpf_spin_unlock(), since MEM_RCU is in the reg state, the
     state should become
       PTR_TO_BTF_ID | MEM_ALLOC | MEM_RCU
     Some additional code might be needed so we won't have verifier
     complaints about ref_obj_id == 0.

(2). If hitting bpf_rcu_read_unlock(), the state should become
       PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
     since register access is still in the bpf_spin_lock() region.

Does this make sense?

> +
>  	return 0;
>  }
>
> [...]
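To make scenario (1) concrete for readers skimming the archive, below is a minimal BPF-side sketch of the pattern under discussion. It is modeled on the refcounted-rbtree programs in the kernel's BPF selftests; only the bpf_experimental.h kfunc/helper names (bpf_obj_new, bpf_rbtree_add, bpf_spin_lock, struct bpf_refcount) are real, while the struct, map, and function names here are illustrative.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include "bpf_experimental.h"

/* Same helper macros the selftests use to place the lock and rb_root in one map. */
#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
#ifndef container_of
#define container_of(ptr, type, member) \
	((type *)((void *)(ptr) - __builtin_offsetof(type, member)))
#endif

struct node_data {
	long key;
	struct bpf_refcount ref;	/* refcounted: memory reused only after an RCU GP */
	struct bpf_rb_node r;
};

private(A) struct bpf_spin_lock glock;
private(A) struct bpf_rb_root groot __contains(node_data, r);

static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
	struct node_data *na = container_of(a, struct node_data, r);
	struct node_data *nb = container_of(b, struct node_data, r);

	return na->key < nb->key;
}

SEC("tc")
int add_and_read(void *ctx)
{
	struct node_data *n;
	long key = 0;

	n = bpf_obj_new(typeof(*n));
	if (!n)
		return 0;

	bpf_spin_lock(&glock);
	bpf_rbtree_add(&groot, &n->r, less);
	/* n is now a non-owning ref; with this patch its reg state is
	 * PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU.
	 * Plain reads through it are allowed here.
	 */
	key = n->key;
	bpf_spin_unlock(&glock);
	/* Scenario (1) above: what should n's reg state be at this point?
	 * Today the verifier invalidates it; the suggestion is that it could
	 * instead drop NON_OWN_REF and keep MEM_RCU.
	 */

	return key != 0;
}

char _license[] SEC("license") = "GPL";
```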
On 8/21/23 7:37 PM, Yonghong Song wrote:
> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
>> [...]
>
> [...]
>
> (2). If hitting bpf_rcu_read_unlock(), the state should become
>        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
>      since register access is still in the bpf_spin_lock() region.
>
> Does this make sense?

Okay, it seems scenario (2) is not possible, as bpf_rcu_read_unlock() is not
allowed in a bpf_spin_lock region.
On 8/21/23 10:37 PM, Yonghong Song wrote:
> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
>> [...]
>>
>>  	reg->type |= NON_OWN_REF;
>> +	if (rec->refcount_off >= 0)
>> +		reg->type |= MEM_RCU;
>
> Should the above MEM_RCU marking only be done when the reg access is in
> an RCU critical section?

I think it is fine, since non-owning references currently exist only within
the spin_lock CS. Based on Alexei's comments on v1 of this series [0],
preemption disabled + spin_lock CS should imply RCU CS.

[0]: https://lore.kernel.org/bpf/20230802230715.3ltalexaczbomvbu@MacBook-Pro-8.local/

> I think we still have issues for state resetting with bpf_spin_unlock()
> and bpf_rcu_read_unlock(), both of which will try to convert the reg
> state to PTR_UNTRUSTED.
>
> Let us say the reg state is
>   PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
>
> (1). If hitting bpf_spin_unlock(), since MEM_RCU is in the reg state, the
>      state should become
>        PTR_TO_BTF_ID | MEM_ALLOC | MEM_RCU
>      Some additional code might be needed so we won't have verifier
>      complaints about ref_obj_id == 0.
>
> (2). If hitting bpf_rcu_read_unlock(), the state should become
>        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
>      since register access is still in the bpf_spin_lock() region.

I agree with your comment in the side reply stating that this case isn't
possible, since bpf_rcu_read_{lock,unlock} in a spin_lock CS is currently
not allowed.

> Does this make sense?

IIUC, the specific reg state flow you're recommending is based on the
conversations we've had over the past few weeks about getting rid of the
special non-owning ref lifetime rules and instead using RCU as much as
possible. Specifically, this recommended change would remove non-owning ref
clobbering, instead just removing the NON_OWN_REF flag on bpf_spin_unlock so
that such nodes can no longer be passed to collection kfuncs
(refcount_acquire, etc.).

I agree that in general we'll be able to loosen the lifetime logic for
non-owning refs, and your specific suggestion sounds reasonable. IMO it's
better to ship that as a separate series, though, as this series was meant
to be the minimum changes necessary to re-enable bpf_refcount_acquire, and
it's expanded a bit past that already. It's easier to reason about the rest
of this series' changes without having to account for clobbering changes.
On 8/21/23 10:47 PM, David Marchevsky wrote:
> On 8/21/23 10:37 PM, Yonghong Song wrote:
>> [...]
>>
>> Should the above MEM_RCU marking only be done when the reg access is in
>> an RCU critical section?
>
> I think it is fine, since non-owning references currently exist only
> within the spin_lock CS. Based on Alexei's comments on v1 of this series
> [0], preemption disabled + spin_lock CS should imply RCU CS.
>
> [0]: https://lore.kernel.org/bpf/20230802230715.3ltalexaczbomvbu@MacBook-Pro-8.local/

Okay, I see. Thanks for the pointer. If MEM_RCU is marked even outside an
explicit RCU CS, it does make sense to clobber everything when
bpf_spin_unlock is hit. But see below.

> [...]
>
> I agree that in general we'll be able to loosen the lifetime logic for
> non-owning refs, and your specific suggestion sounds reasonable. IMO it's
> better to ship that as a separate series, though, as this series was meant
> to be the minimum changes necessary to re-enable bpf_refcount_acquire, and
> it's expanded a bit past that already. It's easier to reason about the
> rest of this series' changes without having to account for clobbering
> changes.

I removed

>>> +	case PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU:

and

>>> +	if (rec->refcount_off >= 0)
>>> +		reg->type |= MEM_RCU;

and all 'refcount' selftests seem to work fine. If this is indeed
unnecessary for this patch set, maybe we can delay this RCU marking for a
future patch set?
On Tue, Aug 22, 2023 at 01:47:01AM -0400, David Marchevsky wrote:
> On 8/21/23 10:37 PM, Yonghong Song wrote:
>> [...]
>>
>> (1). If hitting bpf_spin_unlock(), since MEM_RCU is in the reg state,
>>      the state should become
>>        PTR_TO_BTF_ID | MEM_ALLOC | MEM_RCU
>>      Some additional code might be needed so we won't have verifier
>>      complaints about ref_obj_id == 0.
>>
>> (2). If hitting bpf_rcu_read_unlock(), the state should become
>>        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
>>      since register access is still in the bpf_spin_lock() region.
>
> [...]
>
> IIUC, the specific reg state flow you're recommending is based on the
> conversations we've had over the past few weeks about getting rid of the
> special non-owning ref lifetime rules and instead using RCU as much as
> possible. Specifically, this recommended change would remove non-owning
> ref clobbering, instead just removing the NON_OWN_REF flag on
> bpf_spin_unlock so that such nodes can no longer be passed to collection
> kfuncs (refcount_acquire, etc.).

Overall the patch set makes sense to me, but I want to clarify the above.
My understanding is that after the patch set is applied, bpf_spin_unlock()
will invalidate_non_owning_refs(), so what Yonghong is saying in (1) is not
correct. Instead, PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU will get
mark_reg_invalid().

Re: (2), even if/when bpf_rcu_read_unlock() is allowed inside the spinlocked
region, it will convert
  PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
to
  PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | PTR_UNTRUSTED
which is a buggy combination that we would need to address if rcu_unlock is
allowed eventually.

Did I get it right? If so, I think the whole set is good to go.
On 8/22/23 4:45 PM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 01:47:01AM -0400, David Marchevsky wrote:
>> [...]
>
> Overall the patch set makes sense to me, but I want to clarify the above.
> My understanding is that after the patch set is applied, bpf_spin_unlock()
> will invalidate_non_owning_refs(), so what Yonghong is saying in (1) is
> not correct. Instead, PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
> will get mark_reg_invalid().

I said it 'should become ...', but you are right: right now, it will do
mark_reg_invalid(). So it is correct, just maybe a little conservative.

> Re: (2), even if/when bpf_rcu_read_unlock() is allowed inside the
> spinlocked region, it will convert
>   PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
> to
>   PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | PTR_UNTRUSTED
> which is a buggy combination that we would need to address if rcu_unlock
> is allowed eventually.
>
> Did I get it right? If so, I think the whole set is good to go.
On Tue, Aug 22, 2023 at 5:18 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> On 8/22/23 4:45 PM, Alexei Starovoitov wrote:
>> [...]
>>
>> Overall the patch set makes sense to me, but I want to clarify the above.
>> My understanding is that after the patch set is applied, bpf_spin_unlock()
>> will invalidate_non_owning_refs(), so what Yonghong is saying in (1) is
>> not correct. Instead, PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
>> will get mark_reg_invalid().
>
> I said it 'should become ...', but you are right: right now, it will do
> mark_reg_invalid(). So it is correct, just maybe a little conservative.

Ahh, you mean that it should be fixed to do that. Got it. A non_own_ref
after spin_unlock should become a pure MEM_RCU pointer. Need to think it
through. Probably correct.
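For concreteness, the alternative Yonghong describes for (1) could be sketched roughly as below. This is hypothetical pseudocode for the idea only, not part of the applied patch; it borrows names already mentioned in this thread (mark_reg_invalid and the register-walk macro used by invalidate_non_owning_refs), and the function name is made up.

```c
/* Hypothetical: at bpf_spin_unlock(), downgrade refcounted non-owning refs
 * to plain RCU-protected pointers instead of invalidating them. The applied
 * patch keeps the current behavior, i.e. mark_reg_invalid() for every
 * non-owning ref.
 */
static void downgrade_or_invalidate_non_owning_refs(struct bpf_verifier_env *env)
{
	struct bpf_func_state *unused;
	struct bpf_reg_state *reg;

	bpf_for_each_reg_in_vstate(env->cur_state, unused, reg, ({
		if (!(reg->type & NON_OWN_REF))
			continue;
		if (reg->type & MEM_RCU)
			/* becomes PTR_TO_BTF_ID | MEM_ALLOC | MEM_RCU: still
			 * readable until the RCU CS ends, but no longer
			 * accepted by collection kfuncs
			 */
			reg->type &= ~NON_OWN_REF;
		else
			mark_reg_invalid(env, reg);	/* current behavior */
	}));
}
```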
```diff
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index eced6400f778..12596af59c00 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -653,7 +653,8 @@ enum bpf_type_flag {
 	MEM_RCU			= BIT(13 + BPF_BASE_TYPE_BITS),
 
 	/* Used to tag PTR_TO_BTF_ID | MEM_ALLOC references which are non-owning.
-	 * Currently only valid for linked-list and rbtree nodes.
+	 * Currently only valid for linked-list and rbtree nodes. If the nodes
+	 * have a bpf_refcount_field, they must be tagged MEM_RCU as well.
 	 */
 	NON_OWN_REF		= BIT(14 + BPF_BASE_TYPE_BITS),
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8db0afa5985c..55607ab30522 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8013,6 +8013,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	case PTR_TO_BTF_ID | PTR_TRUSTED:
 	case PTR_TO_BTF_ID | MEM_RCU:
 	case PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF:
+	case PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU:
 		/* When referenced PTR_TO_BTF_ID is passed to release function,
 		 * its fixed offset must be 0. In the other cases, fixed offset
 		 * can be non-zero. This was already checked above. So pass
@@ -10479,6 +10480,7 @@ static int process_kf_arg_ptr_to_btf_id(struct bpf_verifier_env *env,
 static int ref_set_non_owning(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
 {
 	struct bpf_verifier_state *state = env->cur_state;
+	struct btf_record *rec = reg_btf_record(reg);
 
 	if (!state->active_lock.ptr) {
 		verbose(env, "verifier internal error: ref_set_non_owning w/o active lock\n");
@@ -10491,6 +10493,9 @@ static int ref_set_non_owning(struct bpf_verifier_env *env, struct bpf_reg_state
 	}
 
 	reg->type |= NON_OWN_REF;
+	if (rec->refcount_off >= 0)
+		reg->type |= MEM_RCU;
+
 	return 0;
 }
 
@@ -11328,6 +11333,11 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		struct bpf_func_state *state;
 		struct bpf_reg_state *reg;
 
+		if (in_rbtree_lock_required_cb(env) && (rcu_lock || rcu_unlock)) {
+			verbose(env, "Calling bpf_rcu_read_{lock,unlock} in unnecessary rbtree callback\n");
+			return -EACCES;
+		}
+
 		if (rcu_lock) {
 			verbose(env, "nested rcu read lock (kernel function %s)\n", func_name);
 			return -EINVAL;
@@ -16689,7 +16699,8 @@ static int do_check(struct bpf_verifier_env *env)
 				return -EINVAL;
 			}
 
-			if (env->cur_state->active_rcu_lock) {
+			if (env->cur_state->active_rcu_lock &&
+			    !in_rbtree_lock_required_cb(env)) {
 				verbose(env, "bpf_rcu_read_unlock is missing\n");
 				return -EINVAL;
 			}
```
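As an illustration of what the check_kfunc_call() and do_check() hunks permit and forbid, here is a hedged sketch that reuses the node_data/glock/groot/less declarations from the sketch earlier on this page and wraps the locked rbtree update in an explicit RCU read-side critical section; again, only the kfunc/helper names are real, the rest is illustrative.

```c
/* Kfunc declarations, in case the header in use does not already provide them. */
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

SEC("tc")
int add_under_explicit_rcu(void *ctx)
{
	struct node_data *n = bpf_obj_new(typeof(*n));

	if (!n)
		return 0;

	bpf_rcu_read_lock();
	bpf_spin_lock(&glock);
	/* less() is an rbtree-lock-required callback and is verified while
	 * active_rcu_lock is still set. Without the do_check() hunk above,
	 * the callback's BPF_EXIT would be rejected with
	 * "bpf_rcu_read_unlock is missing". The check_kfunc_call() hunk in
	 * turn rejects any bpf_rcu_read_{lock,unlock}() call made from
	 * inside such a callback.
	 */
	bpf_rbtree_add(&groot, &n->r, less);
	bpf_spin_unlock(&glock);
	bpf_rcu_read_unlock();

	return 0;
}
```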
An earlier patch in the series ensures that the underlying memory of nodes
with bpf_refcount - which can have multiple owners - is not reused until RCU
grace period has elapsed. This prevents use-after-free with non-owning
references that may point to recently-freed memory. While RCU read lock is
held, it's safe to dereference such a non-owning ref, as by definition RCU
GP couldn't have elapsed and therefore underlying memory couldn't have been
reused.

From the perspective of verifier "trustedness" non-owning refs to refcounted
nodes are now trusted only in RCU CS and therefore should no longer pass
is_trusted_reg, but rather is_rcu_reg. Let's mark them MEM_RCU in order to
reflect this new state.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 include/linux/bpf.h   |  3 ++-
 kernel/bpf/verifier.c | 13 ++++++++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)