Message ID: 20221206231000.3180914-1-davemarchevsky@fb.com (mailing list archive)
Series: BPF rbtree next-gen datastructure
Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 6 Dec 2022 15:09:47 -0800 you wrote:
> This series adds a rbtree datastructure following the "next-gen
> datastructure" precedent set by recently-added linked-list [0]. This is
> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> instead of adding a new map type. This series adds a smaller set of API
> functions than that RFC - just the minimum needed to support current
> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>
> [...]

Here is the summary with links:
  - [bpf-next,01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
    https://git.kernel.org/bpf/bpf-next/c/d8939cb0a03c
  - [bpf-next,02/13] bpf: map_check_btf should fail if btf_parse_fields fails
    (no matching commit)
  - [bpf-next,03/13] bpf: Minor refactor of ref_set_release_on_unlock
    (no matching commit)
  - [bpf-next,04/13] bpf: rename list_head -> datastructure_head in field info types
    (no matching commit)
  - [bpf-next,05/13] bpf: Add basic bpf_rb_{root,node} support
    (no matching commit)
  - [bpf-next,06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs
    (no matching commit)
  - [bpf-next,07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
    (no matching commit)
  - [bpf-next,08/13] bpf: Add callback validation to kfunc verifier logic
    (no matching commit)
  - [bpf-next,09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first}
    (no matching commit)
  - [bpf-next,10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
    (no matching commit)
  - [bpf-next,11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
    (no matching commit)
  - [bpf-next,12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type
    (no matching commit)
  - [bpf-next,13/13] selftests/bpf: Add rbtree selftests
    (no matching commit)

You are awesome, thank you!
On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
> This series adds a rbtree datastructure following the "next-gen
> datastructure" precedent set by recently-added linked-list [0]. This is
> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> instead of adding a new map type. This series adds a smaller set of API
> functions than that RFC - just the minimum needed to support current
> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>
> bpf_rbtree_add
> bpf_rbtree_remove
> bpf_rbtree_first
>
> [...]
>
> Future work:
> Enabling writes to release_on_unlock refs should be done before the
> functionality of BPF rbtree can truly be considered complete.
> Implementing this proved more complex than expected so it's been
> pushed off to a future patch.
>

TBH, I think we need to revisit whether there's a strong need for this. I would
even argue that we should simply make the release semantics of rbtree_add,
list_push helpers stronger and remove release_on_unlock logic entirely,
releasing the node immediately. I don't see why it is so critical to have read,
and more importantly, write access to nodes after losing their ownership. And
that too is only available until the lock is unlocked.

I think this relaxed release logic and write support is the wrong direction to
take, as it has a direct bearing on what can be done with a node inside the
critical section. There's already the problem with not being able to do
bpf_obj_drop easily inside the critical section with this. That might be useful
for draining operations while holding the lock.

Semantically in other languages, once you move an object, accessing it is
usually a bug, and in most of the cases it is sufficient to prepare it before
insertion. We are certainly in the same territory here with these APIs.
Can you elaborate on actual use cases where immediate release or not having write support makes it hard or impossible to support a certain use case, so that it is easier to understand the requirements and design things accordingly?
On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>> This series adds a rbtree datastructure following the "next-gen
>> datastructure" precedent set by recently-added linked-list [0]. This is
>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>> instead of adding a new map type. This series adds a smaller set of API
>> functions than that RFC - just the minimum needed to support current
>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>
>> bpf_rbtree_add
>> bpf_rbtree_remove
>> bpf_rbtree_first
>>
>> [...]
>>
>> Future work:
>> Enabling writes to release_on_unlock refs should be done before the
>> functionality of BPF rbtree can truly be considered complete.
>> Implementing this proved more complex than expected so it's been
>> pushed off to a future patch.
>>
>
> TBH, I think we need to revisit whether there's a strong need for this. I would
> even argue that we should simply make the release semantics of rbtree_add,
> list_push helpers stronger and remove release_on_unlock logic entirely,
> releasing the node immediately. I don't see why it is so critical to have read,
> and more importantly, write access to nodes after losing their ownership. And
> that too is only available until the lock is unlocked.
>

Moved the next paragraph here to ease reply, it was the last paragraph
in your response.

>
> Can you elaborate on actual use cases where immediate release or not having
> write support makes it hard or impossible to support a certain use case, so that
> it is easier to understand the requirements and design things accordingly?
>

Sure, the main usecase and impetus behind this for me is the sched_ext work
Tejun and others are doing (https://lwn.net/Articles/916291/). One of the
things they'd like to be able to do is implement a CFS-like scheduler using
rbtree entirely in BPF.
This would prove that sched_ext + BPF can be used to implement complicated
scheduling logic.

If we can implement such complicated scheduling logic, but it has so much
BPF-specific twisting of program logic that it's incomprehensible to scheduler
folks, that's not great. The overlap between "BPF experts" and "scheduler
experts" is small, and we want the latter group to be able to read BPF
scheduling logic without too much struggle. A lower learning curve makes folks
more likely to experiment with sched_ext.

When 'rbtree map' was in brainstorming / prototyping, non-owning reference
semantics were called out as moving BPF datastructures closer to their kernel
equivalents from a UX perspective. If the "it makes BPF code better resemble
normal kernel code" argument was the only reason to do this I wouldn't feel so
strongly, but there are practical concerns as well:

If we could only read / write from a rbtree node if it isn't in a tree, the
common operation of "find this node and update its data" would require
removing and re-adding it. For rbtree, these unnecessary remove and add
operations could result in unnecessary rebalancing. Going back to the
sched_ext usecase, if we have a rbtree with task or cgroup stats that need to
be updated often, unnecessary rebalancing would make this update slower than
if non-owning refs allowed in-place read/write of node data.

Also, we eventually want to be able to have a node that's part of both a list
and a rbtree. Likely adding such a node to both would require calling a kfunc
for adding to the list, and a separate kfunc call for adding to the rbtree.
Once the node has been added to the list, we need some way to represent a
reference to that node so that we can pass it to the rbtree add kfunc. Sounds
like a non-owning reference to me, albeit with different semantics than
current release_on_unlock.
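[Editorial note: the rebalancing argument above can be made concrete with a
toy Python model. A sorted list stands in for the rbtree, and `Node`,
`update_in_place`, and `update_via_remove_add` are made-up illustrative names,
not kernel API.]

```python
import bisect

class Node:
    """Toy stand-in for a rbtree node: ordered by 'key', carrying mutable stats."""
    def __init__(self, key, stats):
        self.key = key
        self.stats = stats

def update_in_place(nodes, key, new_stats):
    # What non-owning refs permit: locate the node and write its payload
    # directly. The structure's ordering is untouched, so no rebalancing.
    i = bisect.bisect_left([n.key for n in nodes], key)
    nodes[i].stats = new_stats

def update_via_remove_add(nodes, key, new_stats):
    # What "write only via owning ref" would force: remove the node,
    # mutate it, then re-insert -- two structural operations (each of
    # which would trigger rebalancing work in a real rbtree).
    i = bisect.bisect_left([n.key for n in nodes], key)
    node = nodes.pop(i)
    node.stats = new_stats
    j = bisect.bisect_left([n.key for n in nodes], node.key)
    nodes.insert(j, node)
```

Both paths reach the same final state; the difference is purely the structural
churn, which is the cost the in-place pattern avoids.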
> I think this relaxed release logic and write support is the wrong direction to
> take, as it has a direct bearing on what can be done with a node inside the
> critical section. There's already the problem with not being able to do
> bpf_obj_drop easily inside the critical section with this. That might be useful
> for draining operations while holding the lock.
>

The bpf_obj_drop case is similar to your "can't pass non-owning reference
to bpf_rbtree_remove" concern from patch 1's thread. If we have:

  n = bpf_obj_new(...); // n is owning ref
  bpf_rbtree_add(&tree, &n->node); // n is non-owning ref

  res = bpf_rbtree_first(&tree);
  if (!res) {...}
  m = container_of(res, struct node_data, node); // m is non-owning ref

  res = bpf_rbtree_remove(&tree, &n->node);
  n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory

  bpf_obj_drop(n); // Not safe to use m anymore

Datastructures which support bpf_obj_drop in the critical section can do the
same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
references after bpf_obj_drop. Then there's no potential use-after-free.
(For the above example, pretend bpf_rbtree_remove didn't already invalidate
'm', or that there's some other way to obtain a non-owning ref to 'n''s node
after rbtree_remove.)

I think that, in practice, operations where the BPF program wants to remove /
delete nodes will be distinct from operations where the program just wants to
obtain some non-owning refs and do read / write. At least for the sched_ext
usecase this is true. So all the additional clobbers won't require the program
writer to do special workarounds to deal with the verifier in the common case.

> Semantically in other languages, once you move an object, accessing it is
> usually a bug, and in most of the cases it is sufficient to prepare it before
> insertion. We are certainly in the same territory here with these APIs.

Sure, but 'add'/'remove' for these intrusive linked datastructures is _not_ a
'move'. Obscuring this from the user and forcing them to use less performant
patterns for the sake of some verifier complexity, or a desire to mimic the
semantics of languages without reference stability, doesn't make sense to me.

If we were to add some datastructures without reference stability, sure, let's
not do non-owning references for those. So let's make this non-owning
reference stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar
flags, which will coincidentally make it very easy to remove if we later
decide that the complexity isn't worth it.
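[Editorial note: the "invalidate all non-owning refs after a destructive
operation" rule discussed above can be modeled in a few lines of Python. This
is a toy illustration of the proposed verifier bookkeeping, not kernel code;
all names are made up.]

```python
class Ref:
    """Models a verifier register pointing at an allocated node."""
    def __init__(self, owning):
        self.owning = owning
        self.valid = True

class CriticalSection:
    """Toy verifier state for one locked region."""
    def __init__(self):
        self.non_owning = []

    def track(self, ref):
        if not ref.owning:
            self.non_owning.append(ref)
        return ref

    def destructive_op(self):
        # bpf_rbtree_remove / bpf_obj_drop: any non-owning ref may alias
        # the affected node, so conservatively invalidate all of them.
        for r in self.non_owning:
            r.valid = False
        self.non_owning.clear()

cs = CriticalSection()
n = cs.track(Ref(owning=True))      # n = bpf_obj_new(...)
n.owning = False                    # bpf_rbtree_add(&tree, &n->node)
cs.track(n)
m = cs.track(Ref(owning=False))     # m = container_of(bpf_rbtree_first(...), ...)
cs.destructive_op()                 # bpf_rbtree_remove / bpf_obj_drop
owner = Ref(owning=True)            # rbtree_remove's retval is an owning ref
```

After the destructive op, later uses of `n` or `m` would be rejected (their
`valid` flag is cleared), which is exactly how the potential use-after-free in
the example above is ruled out.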
On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote: > On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote: > > On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote: > >> This series adds a rbtree datastructure following the "next-gen > >> datastructure" precedent set by recently-added linked-list [0]. This is > >> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr > >> instead of adding a new map type. This series adds a smaller set of API > >> functions than that RFC - just the minimum needed to support current > >> cgfifo example scheduler in ongoing sched_ext effort [2], namely: > >> > >> bpf_rbtree_add > >> bpf_rbtree_remove > >> bpf_rbtree_first > >> > >> [...] > >> > >> Future work: > >> Enabling writes to release_on_unlock refs should be done before the > >> functionality of BPF rbtree can truly be considered complete. > >> Implementing this proved more complex than expected so it's been > >> pushed off to a future patch. > >> > > > > > TBH, I think we need to revisit whether there's a strong need for this. I would > > even argue that we should simply make the release semantics of rbtree_add, > > list_push helpers stronger and remove release_on_unlock logic entirely, > > releasing the node immediately. I don't see why it is so critical to have read, > > and more importantly, write access to nodes after losing their ownership. And > > that too is only available until the lock is unlocked. > > > > Moved the next paragraph here to ease reply, it was the last paragraph > in your response. > > > > > Can you elaborate on actual use cases where immediate release or not having > > write support makes it hard or impossible to support a certain use case, so that > > it is easier to understand the requirements and design things accordingly? > > > > Sure, the main usecase and impetus behind this for me is the sched_ext work > Tejun and others are doing (https://lwn.net/Articles/916291/). 
One of the
> things they'd like to be able to do is implement a CFS-like scheduler using
> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
> implement complicated scheduling logic.
>
> If we can implement such complicated scheduling logic, but it has so much
> BPF-specific twisting of program logic that it's incomprehensible to scheduler
> folks, that's not great. The overlap between "BPF experts" and "scheduler
> experts" is small, and we want the latter group to be able to read BPF
> scheduling logic without too much struggle. Lower learning curve makes folks
> more likely to experiment with sched_ext.
>
> When 'rbtree map' was in brainstorming / prototyping, non-owning reference
> semantics were called out as moving BPF datastructures closer to their kernel
> equivalents from a UX perspective.

Our emails crossed. See my previous email.
Agree on the above.

> If the "it makes BPF code better resemble normal kernel code" argument was the
> only reason to do this I wouldn't feel so strongly, but there are practical
> concerns as well:
>
> If we could only read / write from rbtree node if it isn't in a tree, the common
> operation of "find this node and update its data" would require removing and
> re-adding it. For rbtree, these unnecessary remove and add operations could

Not really. See my previous email.

> result in unnecessary rebalancing. Going back to the sched_ext usecase,
> if we have a rbtree with task or cgroup stats that need to be updated often,
> unnecessary rebalancing would make this update slower than if non-owning refs
> allowed in-place read/write of node data.

Agree. Read/write from non-owning refs is necessary.

In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0
(your non-owning ref) should not be mixed with release_on_unlock logic.

KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0.
>
> Also, we eventually want to be able to have a node that's part of both a
> list and rbtree. Likely adding such a node to both would require calling
> kfunc for adding to list, and separate kfunc call for adding to rbtree.
> Once the node has been added to list, we need some way to represent a reference
> to that node so that we can pass it to rbtree add kfunc. Sounds like a
> non-owning reference to me, albeit with different semantics than current
> release_on_unlock.

A node with both link list and rbtree would be a new concept.
We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing.
That's a future discussion.

> > I think this relaxed release logic and write support is the wrong direction to
> > take, as it has a direct bearing on what can be done with a node inside the
> > critical section. There's already the problem with not being able to do
> > bpf_obj_drop easily inside the critical section with this. That might be useful
> > for draining operations while holding the lock.
> >
>
> The bpf_obj_drop case is similar to your "can't pass non-owning reference
> to bpf_rbtree_remove" concern from patch 1's thread. If we have:
>
> n = bpf_obj_new(...); // n is owning ref
> bpf_rbtree_add(&tree, &n->node); // n is non-owning ref

what I proposed in the other email...
n should be untrusted here.
That's != 'n is non-owning ref'

> res = bpf_rbtree_first(&tree);
> if (!res) {...}
> m = container_of(res, struct node_data, node); // m is non-owning ref

agree. m == PTR_TRUSTED with ref_obj_id == 0.

> res = bpf_rbtree_remove(&tree, &n->node);

a typo here? Did you mean 'm->node' ?

and after 'if (res)' ...

> n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory

agree. n -> ref_obj_id > 0

> bpf_obj_drop(n);

above is ok to do.
'n' becomes UNTRUSTED or invalid.

> // Not safe to use m anymore

'm' should have become UNTRUSTED after bpf_rbtree_remove.
> Datastructures which support bpf_obj_drop in the critical section can
> do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
> references after bpf_obj_drop.

'invalidate all' sounds suspicious.
I don't think we need to do a sweeping search after bpf_obj_drop.

> Then there's no potential use-after-free.
> (For the above example, pretend bpf_rbtree_remove didn't already invalidate
> 'm', or that there's some other way to obtain non-owning ref to 'n''s node
> after rbtree_remove)
>
> I think that, in practice, operations where the BPF program wants to remove
> / delete nodes will be distinct from operations where program just wants to
> obtain some non-owning refs and do read / write. At least for sched_ext usecase
> this is true. So all the additional clobbers won't require program writer
> to do special workarounds to deal with verifier in the common case.
>
> > Semantically in other languages, once you move an object, accessing it is
> > usually a bug, and in most of the cases it is sufficient to prepare it before
> > insertion. We are certainly in the same territory here with these APIs.
>
> Sure, but 'add'/'remove' for these intrusive linked datastructures is
> _not_ a 'move'. Obscuring this from the user and forcing them to use
> less performant patterns for the sake of some verifier complexity, or desire
> to mimic semantics of languages w/o reference stability, doesn't make sense to
> me.

I agree, but everything we discuss in the above looks orthogonal
to release_on_unlock that myself and Kumar are proposing to drop.
So it accepts a PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED?
If so then I agree. The 'release' part of the name was confusing.
It's also not clear which arg it applies to.
bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
I wouldn't introduce a new flag for this just yet.
We can hard-code bpf_rbtree_remove, bpf_list_pop for now
or use our name suffix hack.
On 12/7/22 6:06 PM, Alexei Starovoitov wrote: > On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote: >> On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote: >>> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote: >>>> This series adds a rbtree datastructure following the "next-gen >>>> datastructure" precedent set by recently-added linked-list [0]. This is >>>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr >>>> instead of adding a new map type. This series adds a smaller set of API >>>> functions than that RFC - just the minimum needed to support current >>>> cgfifo example scheduler in ongoing sched_ext effort [2], namely: >>>> >>>> bpf_rbtree_add >>>> bpf_rbtree_remove >>>> bpf_rbtree_first >>>> >>>> [...] >>>> >>>> Future work: >>>> Enabling writes to release_on_unlock refs should be done before the >>>> functionality of BPF rbtree can truly be considered complete. >>>> Implementing this proved more complex than expected so it's been >>>> pushed off to a future patch. >>>> >> >>> >>> TBH, I think we need to revisit whether there's a strong need for this. I would >>> even argue that we should simply make the release semantics of rbtree_add, >>> list_push helpers stronger and remove release_on_unlock logic entirely, >>> releasing the node immediately. I don't see why it is so critical to have read, >>> and more importantly, write access to nodes after losing their ownership. And >>> that too is only available until the lock is unlocked. >>> >> >> Moved the next paragraph here to ease reply, it was the last paragraph >> in your response. >> >>> >>> Can you elaborate on actual use cases where immediate release or not having >>> write support makes it hard or impossible to support a certain use case, so that >>> it is easier to understand the requirements and design things accordingly? 
>>> >> >> Sure, the main usecase and impetus behind this for me is the sched_ext work >> Tejun and others are doing (https://lwn.net/Articles/916291/ ). One of the >> things they'd like to be able to do is implement a CFS-like scheduler using >> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to >> implement complicated scheduling logic. >> >> If we can implement such complicated scheduling logic, but it has so much >> BPF-specific twisting of program logic that it's incomprehensible to scheduler >> folks, that's not great. The overlap between "BPF experts" and "scheduler >> experts" is small, and we want the latter group to be able to read BPF >> scheduling logic without too much struggle. Lower learning curve makes folks >> more likely to experiment with sched_ext. >> >> When 'rbtree map' was in brainstorming / prototyping, non-owning reference >> semantics were called out as moving BPF datastructures closer to their kernel >> equivalents from a UX perspective. > > Our emails crossed. See my previous email. > Agree on the above. > >> If the "it makes BPF code better resemble normal kernel code" argumentwas the >> only reason to do this I wouldn't feel so strongly, but there are practical >> concerns as well: >> >> If we could only read / write from rbtree node if it isn't in a tree, the common >> operation of "find this node and update its data" would require removing and >> re-adding it. For rbtree, these unnecessary remove and add operations could > > Not really. See my previous email. > >> result in unnecessary rebalancing. Going back to the sched_ext usecase, >> if we have a rbtree with task or cgroup stats that need to be updated often, >> unnecessary rebalancing would make this update slower than if non-owning refs >> allowed in-place read/write of node data. > > Agree. Read/write from non-owning refs is necessary. 
> In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0 > (your non-owning ref) should not be mixed with release_on_unlock logic. > > KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0. > >> >> Also, we eventually want to be able to have a node that's part of both a >> list and rbtree. Likely adding such a node to both would require calling >> kfunc for adding to list, and separate kfunc call for adding to rbtree. >> Once the node has been added to list, we need some way to represent a reference >> to that node so that we can pass it to rbtree add kfunc. Sounds like a >> non-owning reference to me, albeit with different semantics than current >> release_on_unlock. > > A node with both link list and rbtree would be a new concept. > We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing. > That's a future discussion. > >> >>> I think this relaxed release logic and write support is the wrong direction to >>> take, as it has a direct bearing on what can be done with a node inside the >>> critical section. There's already the problem with not being able to do >>> bpf_obj_drop easily inside the critical section with this. That might be useful >>> for draining operations while holding the lock. >>> >> >> The bpf_obj_drop case is similar to your "can't pass non-owning reference >> to bpf_rbtree_remove" concern from patch 1's thread. If we have: >> >> n = bpf_obj_new(...); // n is owning ref >> bpf_rbtree_add(&tree, &n->node); // n is non-owning ref > > what I proposed in the other email... > n should be untrusted here. > That's != 'n is non-owning ref' > >> res = bpf_rbtree_first(&tree); >> if (!res) {...} >> m = container_of(res, struct node_data, node); // m is non-owning ref > > agree. m == PTR_TRUSTED with ref_obj_id == 0. > >> res = bpf_rbtree_remove(&tree, &n->node); > > a typo here? Did you mean 'm->node' ? > > and after 'if (res)' ... 
>> n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory > > agree. n -> ref_obj_id > 0 > >> bpf_obj_drop(n); > > above is ok to do. > 'n' becomes UNTRUSTED or invalid. > >> // Not safe to use m anymore > > 'm' should have become UNTRUSTED after bpf_rbtree_remove. > >> Datastructures which support bpf_obj_drop in the critical section can >> do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning >> references after bpf_obj_drop. > > 'invalidate all' sounds suspicious. > I don't think we need to do sweaping search after bpf_obj_drop. > >> Then there's no potential use-after-free. >> (For the above example, pretend bpf_rbtree_remove didn't already invalidate >> 'm', or that there's some other way to obtain non-owning ref to 'n''s node >> after rbtree_remove) >> >> I think that, in practice, operations where the BPF program wants to remove >> / delete nodes will be distinct from operations where program just wants to >> obtain some non-owning refs and do read / write. At least for sched_ext usecase >> this is true. So all the additional clobbers won't require program writer >> to do special workarounds to deal with verifier in the common case. >> >>> Semantically in other languages, once you move an object, accessing it is >>> usually a bug, and in most of the cases it is sufficient to prepare it before >>> insertion. We are certainly in the same territory here with these APIs. >> >> Sure, but 'add'/'remove' for these intrusive linked datastructures is >> _not_ a 'move'. Obscuring this from the user and forcing them to use >> less performant patterns for the sake of some verifier complexity, or desire >> to mimic semantics of languages w/o reference stability, doesn't make sense to >> me. > > I agree, but everything we discuss in the above looks orthogonal > to release_on_unlock that myself and Kumar are proposing to drop. 
>
>> If we were to add some datastructures without reference stability, sure, let's
>> not do non-owning references for those. So let's make this non-owning reference
>> stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
>> which will coincidentally make it very easy to remove if we later decide that
>> the complexity isn't worth it.
>
> You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ?
> So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ?
> If so then I agree. The 'release' part of the name was confusing.
> It's also not clear which arg it applies to.
> bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
> I wouldn't introduce a new flag for this just yet.
> We can hard code bpf_rbtree_remove, bpf_list_pop for now
> or use our name suffix hack.

Before replying to specific things in this email, I think it would be useful
to have a subthread clearing up definitions and semantics, as I think we're
talking past each other a bit.

On a conceptual level I've still been using "owning reference" and "non-owning
reference" to understand rbtree operations. I'll use those here and try to map
them to actual verifier concepts later.

owning reference

  * This reference controls the lifetime of the pointee
  * Ownership of pointee must be 'released' by passing it to some rbtree
    API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
  * If not released before program ends, verifier considers prog invalid
  * Access to the memory ref is pointing at will not page fault

non-owning reference

  * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
    to bpf_obj_drop
  * No control of lifetime, but can infer memory safety based on context
    (see explanation below)
  * Access to the memory ref is pointing at will not page fault
    (see explanation below)

2) From verifier's perspective non-owning references can only exist
between spin_lock and spin_unlock. Why?
After spin_unlock another program can do arbitrary operations on the rbtree
like removing and free-ing via bpf_obj_drop. A non-owning ref to some chunk of
memory that was remove'd, free'd, and reused via bpf_obj_new would point to an
entirely different thing. Or the memory could go away.

To prevent this logic violation all non-owning references are invalidated by
the verifier after the critical section ends. This is necessary to ensure the
"will not page fault" property of non-owning references. So if the verifier
hasn't invalidated a non-owning ref, accessing it will not page fault.

Currently bpf_obj_drop is not allowed in the critical section, so similarly,
if there's a valid non-owning ref, we must be in the critical section, and can
conclude that the ref's memory hasn't been dropped-and-free'd or
dropped-and-reused.

1) Any reference to a node that is in a rbtree _must_ be non-owning, since
the tree has control of the pointee's lifetime. Similarly, any ref to a node
that isn't in a rbtree _must_ be owning. (let's ignore raw read from
kptr_xchg'd node in map_val for now)

Moving on to the rbtree API:

bpf_rbtree_add(&tree, &node);
  'node' is an owning ref, becomes a non-owning ref.

bpf_rbtree_first(&tree);
  retval is a non-owning ref, since the first() node is still in the tree

bpf_rbtree_remove(&tree, &node);
  'node' is a non-owning ref, retval is an owning ref

All of the above can only be called when the rbtree's lock is held, so
invalidation of all non-owning refs on spin_unlock is fine for rbtree_remove.

A nice property of the paragraph marked with 1) above is the ability to use
the type system to prevent rbtree_add of a node that's already in a rbtree and
rbtree_remove of a node that's not in one. So we can forego runtime checking
of "already in tree", "already not in tree".

But, as you and Kumar talked about in the past and referenced in patch 1's
thread, non-owning refs may alias each other, or an owning ref, and we have no
way of knowing whether this is the case.
So if X and Y are two non-owning refs that alias each other, and
bpf_rbtree_remove(tree, X) is called, a subsequent call to
bpf_rbtree_remove(tree, Y) would be removing a node from a tree which already
isn't in any tree (since the prog has an owning ref to it). But the verifier
doesn't know X and Y alias each other. So the previous paragraph's "forego
runtime checks" statement can only hold if we invalidate all non-owning refs
after a 'destructive' rbtree_remove operation.

It doesn't matter to me which combination of type flags, ref_obj_id, other reg
state stuff, and special-casing is used to implement owning and non-owning
refs. The specific ones chosen in this series for a rbtree node:

owning ref:
  PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
  ref_obj_id > 0

non-owning ref:
  PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
  PTR_UNTRUSTED
    - used for "can't pass ownership", not PROBE_MEM
    - this is why I mentioned "decomposing UNTRUSTED into more granular
      reg traits" in another thread
  ref_obj_id > 0
  release_on_unlock = true
    - used due to paragraphs starting with 2) above

Any other combination of type and reg state that gives me the semantics
defined above works for me.

Based on this reply and others from today, I think you're saying that these
concepts should be implemented using:

owning ref:
  PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
  PTR_TRUSTED
  ref_obj_id > 0

non-owning ref:
  PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
  PTR_TRUSTED
  ref_obj_id == 0
    - used for "can't pass ownership", since funcs that expect owning ref
      need ref_obj_id > 0

And you're also adding 'untrusted' here, mainly as a result of
bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
instead of becoming a non-owning ref. 'untrusted' would have state like:

  PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
  PTR_UNTRUSTED
  ref_obj_id == 0?

I think your "non-owning ref" definition also differs from mine, specifically
yours doesn't seem to have "will not page fault".
For this reason, you don't see the need for release_on_unlock logic, since that's used to prevent refs escaping critical section and potentially referring to free'd memory.

This is where I start to get confused. Some questions:

  * If we get rid of release_on_unlock, and with mass invalidation of
    non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?

  * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
    in this scheme, since non-owning ref can escape spin_unlock b/c no mass
    invalidation? PTR_UNTRUSTED isn't sufficient here

  * If non-owning ref can live past spin_unlock, do we expect read from
    such ref after _unlock to go through bpf_probe_read()? Otherwise direct
    read might fault and silently write 0.

  * For your 'untrusted', but not non-owning ref concept, I'm not sure
    what this gives us that's better than just invalidating the ref which
    gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)

I'm also not sure if you agree with my paragraph marked 1) above. But IMO the release_on_unlock difference, and the perhaps-differing non-owning ref concept, are where we're really talking past each other.
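The owning/non-owning lifecycle described in the message above can be restated as a small userspace model. This is purely illustrative — the names below are hypothetical, and the real verifier tracks this state via reg type flags and ref_obj_id, not an enum:

```c
/* Userspace model of the verifier's view of a node reference.
 * REF_OWNING    ~ PTR_TO_BTF_ID | MEM_ALLOC, ref_obj_id > 0
 * REF_NON_OWNING~ same type, but ownership lives with the tree
 * REF_INVALID   ~ reg invalidated, any access rejected */
enum ref_state { REF_OWNING, REF_NON_OWNING, REF_INVALID };

struct ref { enum ref_state st; };

/* bpf_obj_new: caller starts with an owning ref */
static void model_obj_new(struct ref *r) { r->st = REF_OWNING; }

/* bpf_rbtree_add(tree, node): owning ref becomes non-owning */
static int model_rbtree_add(struct ref *node)
{
    if (node->st != REF_OWNING)
        return -1;             /* verifier would reject the prog */
    node->st = REF_NON_OWNING; /* tree now owns the node */
    return 0;
}

/* bpf_rbtree_first(tree): retval is a non-owning ref */
static void model_rbtree_first(struct ref *ret) { ret->st = REF_NON_OWNING; }

/* bpf_rbtree_remove(tree, node): non-owning arg, owning retval */
static int model_rbtree_remove(struct ref *node, struct ref *ret)
{
    if (node->st != REF_NON_OWNING)
        return -1;
    ret->st = REF_OWNING;
    return 0;
}

/* bpf_spin_unlock: every non-owning ref is invalidated;
 * owning refs survive and must still be released later */
static void model_unlock(struct ref **refs, int n)
{
    for (int i = 0; i < n; i++)
        if (refs[i]->st == REF_NON_OWNING)
            refs[i]->st = REF_INVALID;
}
```

Walking a new node through add, first, remove, and unlock with this model reproduces the transitions listed in the API section above.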
On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
>
> Before replying to specific things in this email, I think it would be useful
> to have a subthread clearing up definitions and semantics, as I think we're
> talking past each other a bit.

Yeah. We were not on the same page.
The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
and I thought we agreed that both are not necessary and with that
I assumed that anything 'non-owning' as a concept is gone too.
So the only thing left (in my mind) was the 'owning' concept.
Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.

Please have this detailed explanation in the commit log next time to
avoid this back and forth.
Now to the fun part...

> On a conceptual level I've still been using "owning reference" and "non-owning
> reference" to understand rbtree operations. I'll use those here and try to map
> them to actual verifier concepts later.
>
> owning reference
>
> * This reference controls the lifetime of the pointee
> * Ownership of pointee must be 'released' by passing it to some rbtree
>   API kfunc - rbtree_add in our case - or via bpf_obj_drop, which free's
> * If not released before program ends, verifier considers prog invalid
> * Access to the memory ref is pointing at will not page fault

agree.

> non-owning reference
>
> * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
>   to bpf_obj_drop
> * No control of lifetime, but can infer memory safety based on context
>   (see explanation below)
> * Access to the memory ref is pointing at will not page fault
>   (see explanation below)

agree with addition that both read and write should be allowed into this
'non-owning' ptr.
Which breaks if you map this to something that ORs with PTR_UNTRUSTED.

> 2) From verifier's perspective non-owning references can only exist
> between spin_lock and spin_unlock. Why?
> After spin_unlock another program
> can do arbitrary operations on the rbtree like removing and free-ing
> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
> free'd, and reused via bpf_obj_new would point to an entirely different thing.
> Or the memory could go away.

agree that spin_unlock needs to clean up 'non-owning'.

> To prevent this logic violation all non-owning references are invalidated by
> verifier after critical section ends. This is necessary to ensure "will
> not page fault" property of non-owning reference. So if verifier hasn't
> invalidated a non-owning ref, accessing it will not page fault.
>
> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
> if there's a valid non-owning ref, we must be in critical section, and can
> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
> and-reused.

I don't understand why that is a problem.

> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
> the tree has control of pointee lifetime. Similarly, any ref to a node
> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
> node in map_val for now)

Also not clear why such a restriction is necessary.

> Moving on to rbtree API:
>
> bpf_rbtree_add(&tree, &node);
>   'node' is an owning ref, becomes a non-owning ref.
>
> bpf_rbtree_first(&tree);
>   retval is a non-owning ref, since first() node is still in tree
>
> bpf_rbtree_remove(&tree, &node);
>   'node' is a non-owning ref, retval is an owning ref

agree on the above definition.

> All of the above can only be called when rbtree's lock is held, so invalidation
> of all non-owning refs on spin_unlock is fine for rbtree_remove.
>
> Nice property of paragraph marked with 1) above is the ability to use the
> type system to prevent rbtree_add of node that's already in rbtree and
> rbtree_remove of node that's not in one. So we can forego runtime
> checking of "already in tree", "already not in tree".
I think it's easier to add runtime check inside bpf_rbtree_remove()
since it already returns MAYBE_NULL. No 'conditional release' necessary.
And with that we don't need to worry about aliases.

> But, as you and Kumar talked about in the past and referenced in patch 1's
> thread, non-owning refs may alias each other, or an owning ref, and have no
> way of knowing whether this is the case. So if X and Y are two non-owning refs
> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
> already isn't in any tree (since prog has an owning ref to it). But verifier
> doesn't know X and Y alias each other. So previous paragraph's "forego
> runtime checks" statement can only hold if we invalidate all non-owning refs
> after 'destructive' rbtree_remove operation.

right. we either invalidate all non-owning after bpf_rbtree_remove
or do run-time check in bpf_rbtree_remove.
Consider the following:

  bpf_spin_lock
  n = bpf_rbtree_first(root);
  m = bpf_rbtree_first(root);
  x = bpf_rbtree_remove(root, n)
  y = bpf_rbtree_remove(root, m)
  bpf_spin_unlock
  if (x)
      bpf_obj_drop(x)
  if (y)
      bpf_obj_drop(y)

If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
If we do run-time check the above will be accepted and will work without crashing.

The problem with release_on_unlock is that it marks 'n' after 1st remove
as UNTRUSTED which means 'no write' and 'read via probe_read'.
That's not good imo.

> It doesn't matter to me which combination of type flags, ref_obj_id, other
> reg state stuff, and special-casing is used to implement owning and non-owning
> refs.
> Specific ones chosen in this series for rbtree node:
>
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>             ref_obj_id > 0
>
> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>                 PTR_UNTRUSTED
>                   - used for "can't pass ownership", not PROBE_MEM
>                   - this is why I mentioned "decomposing UNTRUSTED into more
>                     granular reg traits" in another thread

Now I understand, but that was very hard to grasp.
UNTRUSTED means 'no write' and 'read via probe_read'.
ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
pointing out below:

>                 ref_obj_id > 0
>                 release_on_unlock = true
>                   - used due to paragraphs starting with 2) above

but the problem with ref_set_release_on_unlock() is that it mixes real ref-d
pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
And the latter is a quite confusing combination in my mind,
since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.

> Any other combination of type and reg state that gives me the semantics def'd
> above works4me.
>
> Based on this reply and others from today, I think you're saying that these
> concepts should be implemented using:
>
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>             PTR_TRUSTED
>             ref_obj_id > 0

Almost. I propose:

  PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id > 0

See the definition of is_trusted_reg(). It's:

  ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)

I was saying 'trusted' because of is_trusted_reg() definition.
Sorry for confusion.

> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>                 PTR_TRUSTED
>                 ref_obj_id == 0
>                   - used for "can't pass ownership", since funcs that expect
>                     owning ref need ref_obj_id > 0

I propose:

  PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.

And we will be able to pass 'non-owning' under spin_lock into other kfuncs
and owning outside of spin_lock into other kfuncs.
Which is a good thing.
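The is_trusted_reg() predicate quoted above can be made concrete with a small userspace restatement. The flag values and mock_* names below are illustrative assumptions, not the kernel's actual bit assignments; only the shape of the check — ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED) — mirrors what is quoted:

```c
/* Illustrative flag bits; the kernel's enum bpf_type_flag uses
 * different values, this just mirrors the predicate's shape. */
#define MOCK_MEM_ALLOC     (1u << 0)
#define MOCK_PTR_TRUSTED   (1u << 1)
#define MOCK_PTR_UNTRUSTED (1u << 2)

struct mock_reg {
    unsigned int flags;      /* type modifier flags */
    unsigned int ref_obj_id; /* > 0 means an acquired (owning) ref */
};

/* Mirrors the quoted is_trusted_reg() logic: trusted if acquired,
 * or if the flags are exactly MEM_ALLOC | PTR_TRUSTED. */
static int mock_is_trusted(const struct mock_reg *reg)
{
    return reg->ref_obj_id > 0 ||
           reg->flags == (MOCK_MEM_ALLOC | MOCK_PTR_TRUSTED);
}
```

Note that under this predicate alone, the proposed non-owning combo (MEM_ALLOC with ref_obj_id == 0) does not test as trusted, which is why the proposal above implies the non-owning case gets its own handling in the kfunc arg-checking path.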
> And you're also adding 'untrusted' here, mainly as a result of
> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> instead of becoming a non-owning ref. 'untrusted' would have state like:
>
> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> PTR_UNTRUSTED
> ref_obj_id == 0?

I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
or doing 'non-owning' is enough.
If it's full untrusted it will be:

  PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0

tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.

> I think your "non-owning ref" definition also differs from mine, specifically
> yours doesn't seem to have "will not page fault". For this reason, you don't
> see the need for release_on_unlock logic, since that's used to prevent refs
> escaping critical section and potentially referring to free'd memory.

Not quite.
We should be able to read/write directly through

  PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
the way release_reference() is doing.
I'm just not happy with using acquire_reference/release_reference() logic
(as release_on_unlock is doing) for cleaning after unlock.
Since we need to clean 'non-owning' ptrs in unlock it's confusing
to call the process 'release'.
I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
every reg where

  reg->id == cur_state->active_lock.id &&
  flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

By deleting release_on_unlock I meant delete the release_on_unlock flag
and remove ref_set_release_on_unlock.

> This is where I start to get confused. Some questions:
>
> * If we get rid of release_on_unlock, and with mass invalidation of
>   non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?

Since we'll be cleaning all

  PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
The verifier already enforces that bpf_spin_unlock will be present
at the right place in bpf prog.
When the verifier sees it, it will clean all non-owning refs with this spinlock 'id'.
So no concerns of leaking 'non-owning' outside.

While processing bpf_rbtree_first we need to:

  regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
  regs[BPF_REG_0].id = active_lock.id;
  regs[BPF_REG_0].ref_obj_id = 0;

> * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
>   in this scheme, since non-owning ref can escape spin_unlock b/c no mass
>   invalidation? PTR_UNTRUSTED isn't sufficient here

run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
should address it, no?

> * If non-owning ref can live past spin_unlock, do we expect read from
>   such ref after _unlock to go through bpf_probe_read()? Otherwise direct
>   read might fault and silently write 0.

unlock has to clean them.

> * For your 'untrusted', but not non-owning ref concept, I'm not sure
>   what this gives us that's better than just invalidating the ref which
>   gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)

Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
Untrusted will allow prog to do read-only access (via probe_read) into the node
but might hide bugs.
The cleanup after bpf_spin_unlock of non-owning and cleanup after
bpf_rbtree_add() does not have to be the same.
Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
and non-owning after bpf_rbtree_add.
Walking the example from previous email:

  struct bpf_rbtree_iter it;
  struct bpf_rb_node *node;
  struct bpf_rb_node *n, *m;

  bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree, works as bpf_spin_lock
  while ((node = bpf_rbtree_iter_next(&it))) {
      // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
      if (node && node->field == condition) {

          n = bpf_rbtree_remove(rb_root, node);
          if (!n) ...;
          // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X

          m = bpf_rbtree_remove(rb_root, node); // ok, but fails at run-time
          if (!m) ...;
          // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y

          // node is still:
          // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id

          // assume we allow double locks one day
          bpf_spin_lock(another_rb_root);
          bpf_rbtree_add(another_rb_root, n);
          // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
          bpf_spin_unlock(another_rb_root);
          // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
          break;
      }
  }
  // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
  bpf_rbtree_iter_destroy(&it); // does unlock
  // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
  // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
  // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
  bpf_obj_drop(m);
On 12/7/22 10:51 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
>>
>> Before replying to specific things in this email, I think it would be useful
>> to have a subthread clearing up definitions and semantics, as I think we're
>> talking past each other a bit.
>
> Yeah. We were not on the same page.
> The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
> I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
> and I thought we agreed that both are not necessary and with that
> I assumed that anything 'non-owning' as a concept is gone too.
> So the only thing left (in my mind) was the 'owning' concept.
> Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.

Whereas in my mind the release_on_unlock logic was specifically added to implement the mass invalidation part of non-owning reference semantics, and it being accepted implied that we weren't getting rid of the concept :).

> Please have this detailed explanation in the commit log next time to
> avoid this back and forth.
> Now to the fun part...

I will add a documentation commit explaining 'owning' and 'non-owning' ref as they pertain to these datastructures, after we agree about the semantics. Speaking of which, although I have a few questions / clarifications, I think we're more in agreement after your reply. After one more round of clarification I will summarize conclusions to see if we agree on enough to move forward.

>> On a conceptual level I've still been using "owning reference" and "non-owning
>> reference" to understand rbtree operations. I'll use those here and try to map
>> them to actual verifier concepts later.
>>
>> owning reference
>>
>> * This reference controls the lifetime of the pointee
>> * Ownership of pointee must be 'released' by passing it to some rbtree
>>   API kfunc - rbtree_add in our case - or via bpf_obj_drop, which free's
>> * If not released before program ends, verifier considers prog invalid
>> * Access to the memory ref is pointing at will not page fault
>
> agree.
>
>> non-owning reference
>>
>> * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
>>   to bpf_obj_drop
>> * No control of lifetime, but can infer memory safety based on context
>>   (see explanation below)
>> * Access to the memory ref is pointing at will not page fault
>>   (see explanation below)
>
> agree with addition that both read and write should be allowed into this
> 'non-owning' ptr.
> Which breaks if you map this to something that ORs with PTR_UNTRUSTED.

Agree re: read/write allowed. PTR_UNTRUSTED was an implementation detail. Sounds like we agree on the general purpose of owning and non-owning, and that we're in agreement about the above semantics.

>> 2) From verifier's perspective non-owning references can only exist
>> between spin_lock and spin_unlock. Why? After spin_unlock another program
>> can do arbitrary operations on the rbtree like removing and free-ing
>> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
>> free'd, and reused via bpf_obj_new would point to an entirely different thing.
>> Or the memory could go away.
>
> agree that spin_unlock needs to clean up 'non-owning'.

Another point of agreement.

>> To prevent this logic violation all non-owning references are invalidated by
>> verifier after critical section ends. This is necessary to ensure "will
>> not page fault" property of non-owning reference. So if verifier hasn't
>> invalidated a non-owning ref, accessing it will not page fault.
>>
>> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
>> if there's a valid non-owning ref, we must be in critical section, and can
>> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
>> and-reused.
>
> I don't understand why that is a problem.
>
>> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
>> the tree has control of pointee lifetime. Similarly, any ref to a node
>> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
>> node in map_val for now)
>
> Also not clear why such a restriction is necessary.

If we have this restriction, and bpf_rbtree_release also mass invalidates non-owning refs, the type system will ensure that only nodes that are in a tree will be passed to bpf_rbtree_release, and we can avoid the runtime check. But below you mention preferring the runtime check; mostly noting this here to refer back to when continuing the reply below.

>> Moving on to rbtree API:
>>
>> bpf_rbtree_add(&tree, &node);
>>   'node' is an owning ref, becomes a non-owning ref.
>>
>> bpf_rbtree_first(&tree);
>>   retval is a non-owning ref, since first() node is still in tree
>>
>> bpf_rbtree_remove(&tree, &node);
>>   'node' is a non-owning ref, retval is an owning ref
>
> agree on the above definition.
>
>> All of the above can only be called when rbtree's lock is held, so invalidation
>> of all non-owning refs on spin_unlock is fine for rbtree_remove.
>>
>> Nice property of paragraph marked with 1) above is the ability to use the
>> type system to prevent rbtree_add of node that's already in rbtree and
>> rbtree_remove of node that's not in one. So we can forego runtime
>> checking of "already in tree", "already not in tree".
>
> I think it's easier to add runtime check inside bpf_rbtree_remove()
> since it already returns MAYBE_NULL. No 'conditional release' necessary.
> And with that we don't need to worry about aliases.
>

To clarify: you're proposing that we don't worry about solving the aliasing problem at verification time; instead rbtree_{add,remove} will deal with it at runtime. A corollary of this is that my restriction tagged 1) above ("ref to node in tree _must_ be non-owning, to node not in tree must be owning") isn't something we're guaranteeing, due to the possibility of aliasing. So bpf_rbtree_remove might get a node that's not in a tree, and bpf_rbtree_add might get a node that's already in a tree. The runtime behavior of both should be a 'nop'.

If that is an accurate restatement of your proposal, the verifier logic will need to be changed:

For bpf_rbtree_remove(&tree, &node), if the node is already not in a tree, retval will be NULL, effectively not acquiring an owning ref due to mark_ptr_or_null_reg's logic. In this case, do we want to invalidate arg 'node' as well? Or just leave it as a non-owning ref that points to a node not in a tree? I think the latter requires fewer verifier changes, but I can see the argument for the former if we want restriction 1) to mostly be true, unless aliasing. The above scenario is the only case where bpf_rbtree_remove fails and returns NULL. (In this series it can fail and RET_NULL for this reason, but my earlier comment about type system + invalidate all non-owning after remove as discussed below was my original intent. So I shouldn't have been allowing RET_NULL for my version of these semantics.)

For bpf_rbtree_add(&tree, &node, less), if the arg is already in a tree, then 'node' isn't really an owning ref, and we need to tag it as non-owning, and the program then won't need to bpf_obj_drop it before exiting. If the node wasn't already in a tree and rbtree_add actually added it, 'node' would also be tagged as non-owning, since the tree now owns it. Do we need some way to indicate whether the 'already in tree' case happened? If so, we would need to change retval from void to bool or struct bpf_rb_node *. The above scenario is the only case where bpf_rbtree_add fails and returns NULL / false.
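The "runtime behavior of both should be a 'nop'" restatement above can be modeled with the self-pointing-sentinel trick the kernel uses for RB_EMPTY_NODE(). A userspace sketch follows, assuming hypothetical toy_* names and a single-node container standing in for a real rbtree — the point is only the membership check, not the tree structure:

```c
#include <stddef.h>

/* Simplified stand-in for struct rb_node: parent pointing at the
 * node itself marks "not in any tree", mirroring how the kernel's
 * RB_INIT/RB_EMPTY_NODE sentinel works. */
struct toy_node {
    struct toy_node *parent;
};

static void toy_node_init(struct toy_node *n) { n->parent = n; }
static int toy_node_empty(const struct toy_node *n) { return n->parent == n; }

struct toy_root { struct toy_node *node; };

/* Runtime nop if the node is already in a tree; returns 1 on
 * actual insertion, 0 otherwise (the bool retval discussed above). */
static int toy_add(struct toy_root *root, struct toy_node *n)
{
    if (!toy_node_empty(n))
        return 0;          /* already in a tree: nop */
    n->parent = NULL;      /* linked into our single-node "tree" */
    root->node = n;
    return 1;
}

/* Returns NULL if the node isn't in the tree, so a second remove
 * through an aliased non-owning ref fails safely at runtime. */
static struct toy_node *toy_remove(struct toy_root *root, struct toy_node *n)
{
    if (toy_node_empty(n))
        return NULL;       /* already removed, e.g. via an alias */
    root->node = NULL;
    toy_node_init(n);      /* mark unlinked again */
    return n;
}
```

With this shape, Alexei's n/m double-remove example earlier in the thread executes without crashing: the first remove succeeds and the second returns NULL.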
>> But, as you and Kumar talked about in the past and referenced in patch 1's
>> thread, non-owning refs may alias each other, or an owning ref, and have no
>> way of knowing whether this is the case. So if X and Y are two non-owning refs
>> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
>> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
>> already isn't in any tree (since prog has an owning ref to it). But verifier
>> doesn't know X and Y alias each other. So previous paragraph's "forego
>> runtime checks" statement can only hold if we invalidate all non-owning refs
>> after 'destructive' rbtree_remove operation.
>
> right. we either invalidate all non-owning after bpf_rbtree_remove
> or do run-time check in bpf_rbtree_remove.
> Consider the following:
>
>   bpf_spin_lock
>   n = bpf_rbtree_first(root);
>   m = bpf_rbtree_first(root);
>   x = bpf_rbtree_remove(root, n)
>   y = bpf_rbtree_remove(root, m)
>   bpf_spin_unlock
>   if (x)
>       bpf_obj_drop(x)
>   if (y)
>       bpf_obj_drop(y)
>
> If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
> If we do run-time check the above will be accepted and will work without crashing.

Agreed, although the above example's invalid double-remove of the same node is the kind of thing I'd like to be prevented at verification time instead of runtime. Regardless, continuing with your runtime check idea.

> The problem with release_on_unlock is that it marks 'n' after 1st remove
> as UNTRUSTED which means 'no write' and 'read via probe_read'.
> That's not good imo.

Based on your response to the paragraph below this one, I think we're in agreement that using PTR_UNTRUSTED for non-owning refs gives a non-owning ref a bunch of traits it doesn't need, when I just wanted "can't pass ownership". So agreed that PTR_UNTRUSTED is too blunt an instrument here.
Regarding "marks 'n' after 1st remove": the series isn't currently doing this. I proposed it as a way to prevent the aliasing problem, but I think your proposal is explicitly not trying to prevent the aliasing problem at verification time. So for your semantics we would only have non-owning cleanup after spin_unlock. And such cleanup might just mark refs PTR_UNTRUSTED instead of invalidating entirely.

>>
>> It doesn't matter to me which combination of type flags, ref_obj_id, other
>> reg state stuff, and special-casing is used to implement owning and non-owning
>> refs. Specific ones chosen in this series for rbtree node:
>>
>> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>>             ref_obj_id > 0
>>
>> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>>                 PTR_UNTRUSTED
>>                   - used for "can't pass ownership", not PROBE_MEM
>>                   - this is why I mentioned "decomposing UNTRUSTED into more
>>                     granular reg traits" in another thread
>
> Now I understand, but that was very hard to grasp.
> UNTRUSTED means 'no write' and 'read via probe_read'.
> ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
> pointing out below:
>
>>                 ref_obj_id > 0
>>                 release_on_unlock = true
>>                   - used due to paragraphs starting with 2) above
>
> but the problem with ref_set_release_on_unlock() is that it mixes real ref-d
> pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
> And the latter is a quite confusing combination in my mind,
> since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.

I think I understand your desire to get rid of release_on_unlock now. It's not due to disliking the concept of "clean up non-owning refs after spin_unlock", which you earlier agreed was necessary, but rather the specifics of the release_on_unlock mechanism used to achieve this. If so, I think I agree with your reasoning for why the mechanism is bad in light of how you want owning/non-owning implemented.
To summarize your statements about the release_on_unlock mechanism from the rest of your reply:

  * 'ref_obj_id > 0' already has a specific meaning wrt. is_trusted_reg, and
    we may want to support both TRUSTED and UNTRUSTED non-owning refs.
      * My comment: Currently is_trusted_reg is only used for
        KF_ARG_PTR_TO_BTF_ID, while rbtree and list types are assigned
        special KF_ARGs. So hypothetically they could have different
        'is_trusted_reg' logic. I don't actually think that's a good idea,
        though, especially since rbtree / list types are really
        specializations of PTR_TO_BTF_ID anyways. So agreed.
  * Instead of using 'acquire' and (modified) 'release', we can achieve
    "clean up non-owning after spin_unlock" by associating non-owning refs
    with active_lock.id when they're created. We can store this in reg.id,
    which is currently unused for PTR_TO_BTF_ID (afaict).
  * This will solve the issue raised by the previous point, allowing us to
    have non-owning refs which are truly 'untrusted' according to
    is_trusted_reg.
      * My comment: This all sounds reasonable. On spin_unlock we have
        active_lock.id, so we can do bpf_for_each_reg_in_vstate to look for
        PTR_TO_BTF_IDs matching the id and do 'cleanup' for them.

>> Any other combination of type and reg state that gives me the semantics def'd
>> above works4me.
>>
>> Based on this reply and others from today, I think you're saying that these
>> concepts should be implemented using:
>>
>> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>             PTR_TRUSTED
>>             ref_obj_id > 0
>
> Almost. I propose:
>
>   PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id > 0
>
> See the definition of is_trusted_reg(). It's:
>
>   ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)
>
> I was saying 'trusted' because of is_trusted_reg() definition.
> Sorry for confusion.

I see. Sounds reasonable.
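The sweep summarized in the bullets above — tag each non-owning ref with active_lock.id at creation, then walk all register states at unlock — can be sketched in userspace. All toy_* names and flag values are hypothetical; the real verifier walks every frame, register, and stack slot via bpf_for_each_reg_in_vstate rather than a flat array:

```c
#include <stddef.h>

#define TOY_MEM_ALLOC (1u << 0)

enum toy_reg_type { TOY_PTR_TO_BTF_ID, TOY_SCALAR_VALUE };

struct toy_reg {
    enum toy_reg_type type;
    unsigned int flags;
    unsigned int id;          /* lock id the non-owning ref belongs to */
    unsigned int ref_obj_id;  /* > 0: owning ref, survives the sweep */
};

/* On bpf_spin_unlock: invalidate every non-owning ref created under
 * this lock, i.e. PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 &&
 * id == active_lock_id.  Owning refs are left untouched. */
static void toy_unlock_sweep(struct toy_reg *regs, size_t n,
                             unsigned int active_lock_id)
{
    for (size_t i = 0; i < n; i++) {
        struct toy_reg *r = &regs[i];

        if (r->type == TOY_PTR_TO_BTF_ID &&
            (r->flags & TOY_MEM_ALLOC) &&
            r->ref_obj_id == 0 &&
            r->id == active_lock_id) {
            r->type = TOY_SCALAR_VALUE;  /* __mark_reg_unknown() stand-in */
            r->flags = 0;
            r->id = 0;
        }
    }
}
```

Because the match requires ref_obj_id == 0 and a matching lock id, owning refs from bpf_obj_new and non-owning refs belonging to a different lock both survive the sweep, as the proposal requires.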
>> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>                 PTR_TRUSTED
>>                 ref_obj_id == 0
>>                   - used for "can't pass ownership", since funcs that expect
>>                     owning ref need ref_obj_id > 0
>
> I propose:
>
>   PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

Also sounds reasonable, perhaps with the addition of id > 0 to account for your desired changes to the release_on_unlock mechanism?

> Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.
>
> And we will be able to pass 'non-owning' under spin_lock into other kfuncs
> and owning outside of spin_lock into other kfuncs.
> Which is a good thing.

Allowing passing of an owning ref outside of spin_lock sounds reasonable to me. 'non-owning' under spinlock will have the same "what if this touches __key" issue I brought up in another thread. But you mentioned not preventing that, and I don't necessarily disagree, so just noting it here.

>> And you're also adding 'untrusted' here, mainly as a result of
>> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
>> instead of becoming a non-owning ref. 'untrusted' would have state like:
>>
>> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>> PTR_UNTRUSTED
>> ref_obj_id == 0?
>
> I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> or doing 'non-owning' is enough.
> If it's full untrusted it will be:
>
>   PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0

Yeah, I don't see what this "full untrusted" is giving us either. Let's have "cleanup non-owning refs on spin_unlock" just invalidate the regs for now, instead of converting to "full untrusted"? Adding "full untrusted" later won't make any valid programs written with "just invalidate the regs" in mind fail the verifier. So it's painless to add later.

> tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.

I think such a type combo was only added to implement non-owning refs.
If it's rewritten to use your type combos, I don't think there will be any uses of MEM_ALLOC | PTR_UNTRUSTED remaining.

>> I think your "non-owning ref" definition also differs from mine, specifically
>> yours doesn't seem to have "will not page fault". For this reason, you don't
>> see the need for release_on_unlock logic, since that's used to prevent refs
>> escaping critical section and potentially referring to free'd memory.
>
> Not quite.
> We should be able to read/write directly through
>
>   PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
>
> and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
> the way release_reference() is doing.
> I'm just not happy with using acquire_reference/release_reference() logic
> (as release_on_unlock is doing) for cleaning after unlock.
> Since we need to clean 'non-owning' ptrs in unlock it's confusing
> to call the process 'release'.
> I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
> every reg where
>
>   reg->id == cur_state->active_lock.id &&
>   flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
>
> By deleting release_on_unlock I meant delete the release_on_unlock flag
> and remove ref_set_release_on_unlock.

Summarized above, but: agreed, and thanks for clarifying what you meant by "delete release_on_unlock".

>> This is where I start to get confused. Some questions:
>>
>> * If we get rid of release_on_unlock, and with mass invalidation of
>>   non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
>
> Since we'll be cleaning all
>
>   PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
>
> it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
>
> The verifier already enforces that bpf_spin_unlock will be present
> at the right place in bpf prog.
> When the verifier sees it, it will clean all non-owning refs with this spinlock 'id'.
> So no concerns of leaking 'non-owning' outside.
>

Sounds like we don't want "full untrusted" or any PTR_UNTRUSTED non-owning refs.

> While processing bpf_rbtree_first we need to:
>
>   regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
>   regs[BPF_REG_0].id = active_lock.id;
>   regs[BPF_REG_0].ref_obj_id = 0;

Agreed.

>> * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
>>   in this scheme, since non-owning ref can escape spin_unlock b/c no mass
>>   invalidation? PTR_UNTRUSTED isn't sufficient here
>
> run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
> should address it, no?

If we don't do "full untrusted" and clean up non-owning refs by invalidating, _and_ don't allow bpf_obj_{new,drop} in the critical section, then I don't think this is an issue. But to elaborate on the issue, if we instead cleaned up non-owning by marking untrusted:

  struct node_data *n = bpf_obj_new(typeof(*n));
  struct node_data *m, *o;
  struct some_other_type *t;

  bpf_spin_lock(&lock);
  bpf_rbtree_add(&tree, n);
  m = bpf_rbtree_first();
  o = bpf_rbtree_first();          // m and o are non-owning, point to same node
  m = bpf_rbtree_remove(&tree, m); // m is owning
  bpf_spin_unlock(&lock);
  // o is "full untrusted", marked PTR_UNTRUSTED

  bpf_obj_drop(m);
  t = bpf_obj_new(typeof(*t)); // pretend that exact chunk of memory that was
                               // dropped in previous statement is returned here
  data = o->some_data_field;   // PROBE_MEM, but no page fault, so load will
                               // succeed, but will read garbage from another type
                               // while verifier thinks it's reading from node_data

If we clean up by invalidating, but eventually enable bpf_obj_{new,drop} inside the critical section, we'll have a similar issue. It's not necessarily "crash the kernel" dangerous, but it may anger program writers since they can't be sure they're not reading garbage in this scenario.

>> * If non-owning ref can live past spin_unlock, do we expect read from
>>   such ref after _unlock to go through bpf_probe_read()? Otherwise direct
>>   read might fault and silently write 0.
> > unlock has to clean them. > Ack. >> * For your 'untrusted', but not non-owning ref concept, I'm not sure >> what this gives us that's better than just invalidating the ref which >> gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node) > > Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one. > Untrusted will allow prog to do read only access (via probe_read) into the node > but might hide bugs. > The cleanup after bpf_spin_unlock of non-owning and clean up after > bpf_rbtree_add() does not have to be the same. This is a good point. > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock > and non-owning after bpf_rbtree_add. > > Walking the example from previous email: > > struct bpf_rbtree_iter it; > struct bpf_rb_node * node; > struct bpf_rb_node *n, *m; > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock > while ((node = bpf_rbtree_iter_next(&it)) { > // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0 > if (node && node->field == condition) { > > n = bpf_rbtree_remove(rb_root, node); > if (!n) ...; > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X > m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time > if (!m) ...; > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y > > // node is still: > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id > > // assume we allow double locks one day > bpf_spin_lock(another_rb_root); > bpf_rbtree_add(another_rb_root, n); > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id > bpf_spin_unlock(another_rb_root); > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0 > break; > } > } > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id > bpf_rbtree_iter_destroy(&it); // does unlock > // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED > // m -> 
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y > bpf_obj_drop(m); This seems like a departure from other statements in your reply, where you're leaning towards "non-owning and trusted" -> "full untrusted" after unlock being unnecessary. I think the combo of reference aliases + bpf_obj_drop-and- reuse make everything hard to reason about. Regardless, your comments annotating reg state look correct to me.
On Thu, Dec 08, 2022 at 01:58:44PM IST, Dave Marchevsky wrote: > On 12/7/22 10:51 PM, Alexei Starovoitov wrote: > > On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote: > >> > >> Before replying to specific things in this email, I think it would be useful > >> to have a subthread clearing up definitions and semantics, as I think we're > >> talking past each other a bit. > > > > Yeah. We were not on the same page. > > The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me. > > I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago > > and I thought we agreed that both are not necessary and with that > > I assumed that anything 'non-owning' as a concept is gone too. > > So the only thing left (in my mind) was the 'owning' concept. > > Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'. > > > > Whereas in my mind the release_on_unlock logic was specifically added to > implement the mass invalidation part of non-owning reference semantics, and it > being accepted implied that we weren't getting rid of the concept :). > > > Please have this detailed explanation in the commit log next time to > > avoid this back and forth. > > Now to the fun part... > > > > I will add a documentation commit explaining 'owning' and 'non-owning' ref > as they pertain to these datastructures, after we agree about the semantics. > > Speaking of which, although I have a few questions / clarifications, I think > we're more in agreement after your reply. After one more round of clarification > I will summarize conclusions to see if we agree on enough to move forward. > > >> > >> On a conceptual level I've still been using "owning reference" and "non-owning > >> reference" to understand rbtree operations. I'll use those here and try to map > >> them to actual verifier concepts later. 
> >> > >> owning reference > >> > >> * This reference controls the lifetime of the pointee > >> * Ownership of pointee must be 'released' by passing it to some rbtree > >> API kfunc - rbtree_add in our case - or via bpf_obj_drop, which free's > >> * If not released before program ends, verifier considers prog invalid > >> * Access to the memory ref is pointing at will not page fault > > > > agree. > > > >> non-owning reference > >> > >> * No ownership of pointee so can't pass ownership via rbtree_add, not allowed > >> to bpf_obj_drop > >> * No control of lifetime, but can infer memory safety based on context > >> (see explanation below) > >> * Access to the memory ref is pointing at will not page fault > >> (see explanation below) > > > > agree with addition that both read and write should be allowed into this > > 'non-owning' ptr. > > Which breaks if you map this to something that ORs with PTR_UNTRUSTED. > > > > Agree re: read/write allowed. PTR_UNTRUSTED was an implementation detail. > Sounds like we agree on general purpose of owning, non-owning. Looks like > we're in agreement about above semantics. > Yes, PTR_UNTRUSTED is not appropriate for this. My opposition was also more to the idea of mapping PTR_UNTRUSTED to non-owning references. If we do PTR_TO_BTF_ID | MEM_ALLOC for them with ref_obj_id == 0, it SGTM. > >> 2) From verifier's perspective non-owning references can only exist > >> between spin_lock and spin_unlock. Why? After spin_unlock another program > >> can do arbitrary operations on the rbtree like removing and free-ing > >> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, > >> free'd, and reused via bpf_obj_new would point to an entirely different thing. > >> Or the memory could go away. > > > > agree that spin_unlock needs to clean up 'non-owning'. > > Another point of agreement. > +1 > > > >> To prevent this logic violation all non-owning references are invalidated by > >> verifier after critical section ends. 
This is necessary to ensure "will > >> not page fault" property of non-owning reference. So if verifier hasn't > >> invalidated a non-owning ref, accessing it will not page fault. > >> > >> Currently bpf_obj_drop is not allowed in the critical section, so similarly, > >> if there's a valid non-owning ref, we must be in critical section, and can > >> conclude that the ref's memory hasn't been dropped-and-free'd or dropped- > >> and-reused. > > > > I don't understand why is that a problem. > > > >> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since > >> the tree has control of pointee lifetime. Similarly, any ref to a node > >> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd > >> node in map_val for now) The last case is going to be marked PTR_UNTRUSTED. > > > > Also not clear why such restriction is necessary. > > > > If we have this restriction and bpf_rbtree_release also mass invalidates > non-owning refs, the type system will ensure that only nodes that are in a tree > will be passed to bpf_rbtree_release, and we can avoid the runtime check. > I like this property. This was also how I proposed implementing it for lists. e.g. Any bpf_list_del would invalidate the result of prior bpf_list_first_entry and bpf_list_last_entry to ensure safety. It's a bit similar to aliasing XOR mutability guarantees that Rust has. We're trying to implement a simple borrow checking mechanism. Once the collection is mutated, any prior non-owning references become invalidated. It can be further refined (e.g. bpf_rbtree_add won't do invalidation on mutation) based on the properties of the data structure. > But below you mention preferring the runtime check, mostly noting here to > refer back when continuing reply below. > > >> Moving on to rbtree API: > >> > >> bpf_rbtree_add(&tree, &node); > >> 'node' is an owning ref, becomes a non-owning ref. 
> >> > >> bpf_rbtree_first(&tree); > >> retval is a non-owning ref, since first() node is still in tree > >> > >> bpf_rbtree_remove(&tree, &node); > >> 'node' is a non-owning ref, retval is an owning ref > > > > agree on the above definition. > > >> All of the above can only be called when rbtree's lock is held, so invalidation > >> of all non-owning refs on spin_unlock is fine for rbtree_remove. > >> > >> Nice property of paragraph marked with 1) above is the ability to use the > >> type system to prevent rbtree_add of node that's already in rbtree and > >> rbtree_remove of node that's not in one. So we can forego runtime > >> checking of "already in tree", "already not in tree". > > > > I think it's easier to add runtime check inside bpf_rbtree_remove() > > since it already returns MAYBE_NULL. No 'conditional release' necessary. > > And with that we don't need to worry about aliases. > > > > To clarify: You're proposing that we don't worry about solving the aliasing > problem at verification time. Instead rbtree_{add,remove} will deal with it > at runtime. Corollary of this is that my restriction tagged 1) above ("ref > to node in tree _must_ be non-owning, to node not in tree must be owning") > isn't something we're guaranteeing, due to possibility of aliasing. > > So bpf_rbtree_remove might get a node that's not in tree, and > bpf_rbtree_add might get a node that's already in tree. Runtime behavior > of both should be 'nop'. > > > If that is an accurate restatement of your proposal, the verifier > logic will need to be changed: > > For bpf_rbtree_remove(&tree, &node), if node is already not in a tree, > retval will be NULL, effectively not acquiring an owning ref due to > mark_ptr_or_null_reg's logic. > > In this case, do we want to invalidate > arg 'node' as well? Or just leave it as a non-owning ref that points > to node not in tree? 
I think the latter requires fewer verifier changes, > but can see the argument for the former if we want restriction 1) to > mostly be true, unless aliasing. > > The above scenario is the only case where bpf_rbtree_remove fails and > returns NULL. > > (In this series it can fail and RET_NULL for this reason, but my earlier comment > about type system + invalidate all-non owning after remove as discussed below > was my original intent. So I shouldn't have been allowing RET_NULL for my > version of these semantics.) > I agree with Dave to rely on the invariant that non-owning refs to nodes are part of the collection. Then bpf_rbtree_remove is simply KF_ACQUIRE. > > For bpf_rbtree_add(&tree, &node, less), if arg is already in tree, then > 'node' isn't really an owning ref, and we need to tag it as non-owning, > and program then won't need to bpf_obj_drop it before exiting. If node > wasn't already in tree and rbtree_add actually added it, 'node' would > also be tagged as non-owning, since tree now owns it. > > Do we need some way to indicate whether 'already in tree' case happened? > If so, would need to change retval from void to bool or struct bpf_rb_node *. > > Above scenario is only case where bpf_rbtree_add fails and returns > NULL / false. > Why should we allow node that is not acquired to be passed to bpf_rbtree_add? > >> But, as you and Kumar talked about in the past and referenced in patch 1's > >> thread, non-owning refs may alias each other, or an owning ref, and have no > >> way of knowing whether this is the case. So if X and Y are two non-owning refs > >> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent > >> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which > >> already isn't in any tree (since prog has an owning ref to it). But verifier > >> doesn't know X and Y alias each other. 
So previous paragraph's "forego > >> runtime checks" statement can only hold if we invalidate all non-owning refs > >> after 'destructive' rbtree_remove operation. > > > > right. we either invalidate all non-owning after bpf_rbtree_remove > > or do run-time check in bpf_rbtree_remove. > > Consider the following: > > bpf_spin_lock > > n = bpf_rbtree_first(root); > > m = bpf_rbtree_first(root); > > x = bpf_rbtree_remove(root, n) > > y = bpf_rbtree_remove(root, m) > > bpf_spin_unlock > > if (x) > > bpf_obj_drop(x) > > if (y) > > bpf_obj_drop(y) > > > > If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier. > > If we do run-time check the above will be accepted and will work without crashing. > > > > Agreed, although the above example's invalid double-remove of same node is > the kind of thing I'd like to be prevented at verification time instead of > runtime. Regardless, continuing with your runtime check idea. > I agree with Dave, it seems better to invalidate non-owning refs after first remove rather than allowing this to work. > > The problem with release_on_unlock is that it marks 'n' after 1st remove > > as UNTRUSTED which means 'no write' and 'read via probe_read'. > > That's not good imo. > > > > Based on your response to paragraph below this one, I think we're in agreement > that using PTR_UNTRUSTED for non-owning ref gives non-owning ref bunch of traits > it doesn't need, when I just wanted "can't pass ownership". So agreed that > PTR_UNTRUSTED is too blunt an instrument here. > I think this is the part of the confusion which has left me wondering so far. The discussion in this thread is making things more clear. PTR_UNTRUSTED was never meant to be the kind of non-owning reference you want to be returned from bpf_rbtree_first. PTR_TO_BTF_ID | MEM_ALLOC with ref_obj_id == 0 is the right choice. 
> Regarding "marks 'n' after 1st remove", the series isn't currently doing this, > I proposed it as a way to prevent the aliasing problem, but I think your proposal > is explicitly not trying to prevent the aliasing problem at verification time. So > for your semantics we would only have non-owning cleanup after spin_unlock. > And such cleanup might just mark refs PTR_UNTRUSTED instead of invalidating > entirely. > I would prefer proper invalidation using mark_reg_unknown. > >> > >> It doesn't matter to me which combination of type flags, ref_obj_id, other > >> reg state stuff, and special-casing is used to implement owning and non-owning > >> refs. Specific ones chosen in this series for rbtree node: > >> > >> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node) > >> ref_obj_id > 0 > >> > >> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node) > >> PTR_UNTRUSTED > >> - used for "can't pass ownership", not PROBE_MEM > >> - this is why I mentioned "decomposing UNTRUSTED into more > >> granular reg traits" in another thread > > > > Now I understand, but that was very hard to grasp. > > UNTRUSTED means 'no write' and 'read via probe_read'. > > ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly > > pointing out below: > >> ref_obj_id > 0 > >> release_on_unlock = true > >> - used due to paragraphs starting with 2) above > > > > but the problem with ref_set_release_on_unlock() is that it mixes real ref-d > > pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0. > > And the latter is a quite confusing combination in my mind, > > since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS. > > > > I think I understand your desire to get rid of release_on_unlock now. It's not > due to disliking the concept of "clean up non-owning refs after spin_unlock", > which you earlier agreed was necessary, but rather the specifics of > the release_on_unlock mechanism used to achieve this. 
> If so, I think I agree with your reasoning for why the mechanism is bad in > light of how you want owning/non-owning implemented. To summarize your > statements about the release_on_unlock mechanism from the rest of your reply: > > * 'ref_obj_id > 0' already has a specific meaning wrt. is_trusted_reg, > and we may want to support both TRUSTED and UNTRUSTED non-owning refs > > * My comment: Currently is_trusted_reg is only used for > KF_ARG_PTR_TO_BTF_ID, while rbtree and list types are assigned special > KF_ARGs. So hypothetically could have different 'is_trusted_reg' logic. > I don't actually think that's a good idea, though, especially since > rbtree / list types are really specializations of PTR_TO_BTF_ID anyways. > So agreed. > > * Instead of using 'acquire' and (modified) 'release', we can achieve > "clean-up non-owning after spin_unlock" by associating non-owning > refs with active_lock.id when they're created. We can store this in > reg.id, which is currently unused for PTR_TO_BTF_ID (afaict). > I don't mind using active_lock.id for invalidation, but using reg->id to associate it with reg is a bad idea IMO, it's already preserved and set when the object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock with that non-owning ref if it has a spin lock, essentially unlocking a different spin lock if the reg->btf of the already locked spin lock reg is the same due to same active_lock.id. Even if you prevent it somehow it's more confusing to overload reg->id again for this purpose. It makes more sense to introduce a new nonref_obj_id instead, dedicated to this purpose, to associate it back to the reg->id of the collection it is coming from. Also, there are two cases of invalidation, one is on remove from rbtree, which should only invalidate non-owning references into the rbtree, and one is on unlock, which should invalidate all non-owning references. 
bpf_rbtree_remove shouldn't invalidate non-owning refs into a list protected by the same lock, but unlocking should do it for both rbtree and list non-owning refs it is protecting. So it seems you will have to maintain two IDs for non-owning references, one for the collection it comes from, and one for the lock region it is obtained in. > * This will solve the issue raised by the previous point, allowing us to have > non-owning refs which are truly 'untrusted' according to is_trusted_reg. > > * My comment: This all sounds reasonable. On spin_unlock we have > active_lock.id, so can do bpf_for_each_reg_in_vstate to look for > PTR_TO_BTF_IDs matching the id and do 'cleanup' for them. > > >> Any other combination of type and reg state that gives me the semantics def'd > >> above works4me. > >> > >> > >> Based on this reply and others from today, I think you're saying that these > >> concepts should be implemented using: > >> > >> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type) > >> PTR_TRUSTED > >> ref_obj_id > 0 > > > > Almost. > > I propose: > > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id > 0 > > > > See the definition of is_trusted_reg(). > > It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED) > > > > I was saying 'trusted' because of is_trusted_reg() definition. > > Sorry for confusion. > > > > I see. Sounds reasonable. > > >> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type) > >> PTR_TRUSTED > >> ref_obj_id == 0 > >> - used for "can't pass ownership", since funcs that expect > >> owning ref need ref_obj_id > 0 > > > > I propose: > > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 > > > > Also sounds reasonable, perhaps with the addition of id > 0 to account for > your desired changes to release_on_unlock mechanism? > > > Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs. > > > > And we will be able to pass 'non-owning' under spin_lock into other kfuncs > > and owning outside of spin_lock into other kfuncs. > > Which is a good thing. 
> > > > Allowing passing of owning ref outside of spin_lock sounds reasonable to me. > 'non-owning' under spinlock will have the same "what if this touches __key" > issue I brought up in another thread. But you mentioned not preventing that > and I don't necessarily disagree, so just noting here. > Yeah, I agree with Alexei that writing to key is a non-issue. 'Less' cb may not actually do the correct thing at all, so in that sense writing to key is a small issue. In any case violating the 'sorted' property is not something we should be trying to prevent. > >> And you're also adding 'untrusted' here, mainly as a result of > >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added, > >> instead of becoming a non-owning ref. 'untrusted' would have state like: > >> > >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type) > >> PTR_UNTRUSTED > >> ref_obj_id == 0? > > > > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add() > > or doing 'non-owning' is enough. > > If it's full untrusted it will be: > > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0 > > > > Yeah, I don't see what this "full untrusted" is giving us either. Let's have > "cleanup non-owning refs on spin_unlock" just invalidate the regs for now, > instead of converting to "full untrusted"? > +1, I prefer invalidating completely on unlock. > Adding "full untrusted" later won't make any valid programs written with > "just invalidate the regs" in mind fail the verifier. So painless to add later. > +1 > > tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'. > > Eventually it will also be used for alloc obj kptr loaded from maps. > > I think such type combo was only added to implement non-owning refs. If it's > rewritten to use your type combos I don't think there'll be any uses of > MEM_ALLOC | PTR_UNTRUSTED remaining. > To be clear I was not intending to use PTR_UNTRUSTED to do such non-owning refs. 
> >> I think your "non-owning ref" definition also differs from mine, specifically > >> yours doesn't seem to have "will not page fault". For this reason, you don't > >> see the need for release_on_unlock logic, since that's used to prevent refs > >> escaping critical section and potentially referring to free'd memory. > > > > Not quite. > > We should be able to read/write directly through > > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 > > and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock > > the way release_reference() is doing. > > I'm just not happy with using acquire_reference/release_reference() logic > > (as release_on_unlock is doing) for cleaning after unlock. > > Since we need to clean 'non-owning' ptrs in unlock it's confusing > > to call the process 'release'. > > I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED) > > every reg where > > reg->id == cur_state->active_lock.id && > > flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 > > > > By deleting relase_on_unlock I meant delete release_on_unlock flag > > and remove ref_set_release_on_unlock. > > > > Summarized above, but: agreed, and thanks for clarifying what you meant by > "delete release_on_unlock". > > >> This is where I start to get confused. Some questions: > >> > >> * If we get rid of release_on_unlock, and with mass invalidation of > >> non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED? > > > > Since we'll be cleaning all > > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 > > it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new. > > > > The verifier already enforces that bpf_spin_unlock will be present > > at the right place in bpf prog. > > When the verifier sees it it will clean all non-owning refs with this spinlock 'id'. > > So no concerns of leaking 'non-owning' outside. > > > > Sounds like we don't want "full untrusted" or any PTR_UNTRUSTED non-owning ref. 
> > > While processing bpf_rbtree_first we need to: > > regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC; > > regs[BPF_REG_0].id = active_lock.id; > > regs[BPF_REG_0].ref_obj_id = 0; > > > > Agreed. > I'm a bit concerned about putting active_lock.id in reg->id. I don't object to the idea, but to the implementation, since we take PTR_TO_BTF_ID | MEM_ALLOC in bpf_spin_lock/bpf_spin_unlock. It will lead to confusion. Currently this exact reg->type never has reg->ref_obj_id as 0. Maybe that needs to be checked for those helper calls. Just thinking out loud, maybe it's fine but we need to be careful, since reg->id changes meaning with ref_obj_id == 0. > >> * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse > >> in this scheme, since non-owning ref can escape spin_unlock b/c no mass > >> invalidation? PTR_UNTRUSTED isn't sufficient here > > > > run-time check in bpf_rbtree_remove (and in the future bpf_list_remove) > > should address it, no? > > > > If we don't do "full untrusted" and cleanup non-owning refs by invalidating, > _and_ don't allow bpf_obj_{new,drop} in critical section, then I don't think > this is an issue. > bpf_obj_drop if/when enabled can also do invalidation. But let's table that discussion until we introduce it. We most likely may not need it inside the CS. 
> But to elaborate on the issue, if we instead cleaned up non-owning by marking > untrusted: > > struct node_data *n = bpf_obj_new(typeof(*n)); > struct node_data *m, *o; > struct some_other_type *t; > > bpf_spin_lock(&lock); > > bpf_rbtree_add(&tree, n); > m = bpf_rbtree_first(); > o = bpf_rbtree_first(); // m and o are non-owning, point to same node > > m = bpf_rbtree_remove(&tree, m); // m is owning > > bpf_spin_unlock(&lock); // o is "full untrusted", marked PTR_UNTRUSTED > > bpf_obj_drop(m); > t = bpf_obj_new(typeof(*t)); // pretend that exact chunk of memory that was > // dropped in previous statement is returned here > > data = o->some_data_field; // PROBE_MEM, but no page fault, so load will > // succeed, but will read garbage from another type > // while verifier thinks it's reading from node_data > > > If we clean up by invalidating, but eventually enable bpf_obj_{new,drop} inside > critical section, we'll have similar issue. > > It's not necessarily "crash the kernel" dangerous, but it may anger program > writers since they can't be sure they're not reading garbage in this scenario. > I think it's better to clean by invalidating. We have better tools to form untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs such an escape hatch for some reason. It's also easier to review where an untrusted pointer is being used in a program, and has zero cost at runtime. > >> * If non-owning ref can live past spin_unlock, do we expect read from > >> such ref after _unlock to go through bpf_probe_read()? Otherwise direct > >> read might fault and silently write 0. > > > > unlock has to clean them. > > > > Ack. > > >> * For your 'untrusted', but not non-owning ref concept, I'm not sure > >> what this gives us that's better than just invalidating the ref which > >> gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node) > > > > Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one. 
> > Untrusted will allow prog to do read only access (via probe_read) into the node > > but might hide bugs. > > The cleanup after bpf_spin_unlock of non-owning and clean up after > > bpf_rbtree_add() does not have to be the same. > > This is a good point. > So far I'm leaning towards: bpf_rbtree_add(node) : node becomes non-owned ref bpf_spin_unlock(lock) : node is invalidated > > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock > > and non-owning after bpf_rbtree_add. > > > > Walking the example from previous email: > > > > struct bpf_rbtree_iter it; > > struct bpf_rb_node * node; > > struct bpf_rb_node *n, *m; > > > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock > > while ((node = bpf_rbtree_iter_next(&it)) { > > // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0 > > if (node && node->field == condition) { > > > > n = bpf_rbtree_remove(rb_root, node); > > if (!n) ...; > > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X > > m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time > > if (!m) ...; > > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y > > This second remove I would simply disallow as Dave is suggesting during verification, by invalidating non-owning refs for rb_root. 
> > // node is still: > > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id > > > > // assume we allow double locks one day > > bpf_spin_lock(another_rb_root); > > bpf_rbtree_add(another_rb_root, n); > > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id > > bpf_spin_unlock(another_rb_root); > > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0 > > break; > > } > > } > > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id > > bpf_rbtree_iter_destroy(&it); // does unlock > > // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED > > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED > > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y > > bpf_obj_drop(m); > > This seems like a departure from other statements in your reply, where you're > leaning towards "non-owning and trusted" -> "full untrusted" after unlock > being unnecessary. I think the combo of reference aliases + bpf_obj_drop-and- > reuse make everything hard to reason about. > > Regardless, your comments annotating reg state look correct to me. I think it's much more clear in this thread wrt what you wanted to do. It would be good after the thread concludes to eventually summarize how you're going to finally implement all this before respinning.
On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote: > > I don't mind using active_lock.id for invalidation, but using reg->id to > associate it with reg is a bad idea IMO, it's already preserved and set when the > object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock > with that non-owning ref if it has a spin lock, essentially unlocking a different > spin lock if the reg->btf of the already locked spin lock reg is the same due to same > active_lock.id. Right. Overwriting reg->id was a bad idea. > Even if you prevent it somehow it's more confusing to overload reg->id again for > this purpose. > > It makes more sense to introduce a new nonref_obj_id instead, dedicated to this > purpose, to associate it back to the reg->id of the collection it is coming from. The nonref_obj_id name sounds too generic and I'm not sure that it shouldn't be connected to reg->id the way we do it for ref_obj_id. > Also, there are two cases of invalidation, one is on remove from rbtree, which > should only invalidate non-owning references into the rbtree, and one is on > unlock, which should invalidate all non-owning references. Two cases only if we're going to do invalidation on rbtree_remove. > bpf_rbtree_remove shouldn't invalidate non-owning refs into a list protected by > the same lock, but unlocking should do it for both rbtree and list non-owning refs it is > protecting. > > So it seems you will have to maintain two IDs for non-owning references, one for > the collection it comes from, and one for the lock region it is obtained in. Right. Like this? collection_id = rbroot->reg->id; // to track the collection it came from active_lock_id = cur_state->active_lock.id // to track the lock region But before we proceed let me demonstrate an example where cleanup on rbtree_remove is not user friendly: bpf_spin_lock x = bpf_list_first(); if (!x) .. y = bpf_list_last(); if (!y) .. n = bpf_list_remove(x); if (!n) .. 
bpf_list_add_after(n, y); // we should allow this bpf_spin_unlock We don't have such APIs right now. The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will destroy all regs that point somewhere in the collection. This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability. x and y could be pointing to the same thing. In such a case bpf_list_add_after() should fail in runtime after discovering that 'y' is unlinked. Similarly with bpf_rbtree_add(). Currently it cannot fail. It takes an owning ref and will release it. We can mark it as KF_RELEASE and no extra verifier changes are necessary. But in the future we might have failing add/insert operations on lists and rbtree. If they're failing we'd need to struggle with 'conditional release' verifier additions, the bpf prog would need to check the return value, etc. I think we better deal with it at run-time. The verifier could supply bpf_list_add_after() with two hidden args: - container_of offset (delta between rb_node and beginning of prog's struct) - struct btf_struct_meta *meta Then inside bpf_list_add_after or any failing KF_RELEASE kfunc it can call bpf_obj_drop_impl() on that element. Then from the verifier pov the KF_RELEASE function did the release and 'owning ref' became 'non-owning ref'. > > >> And you're also adding 'untrusted' here, mainly as a result of > > >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added, > > >> instead of becoming a non-owning ref. 'untrusted' would have state like: > > >> > > >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type) > > >> PTR_UNTRUSTED > > >> ref_obj_id == 0? > > > > > > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add() > > > or doing 'non-owning' is enough. > > > If it's full untrusted it will be: > > > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0 > > > > > > > Yeah, I don't see what this "full untrusted" is giving us either. 
Let's have > > "cleanup non-owning refs on spin_unlock" just invalidate the regs for now, > > instead of converting to "full untrusted"? > > > > +1, I prefer invalidating completely on unlock. fine by me. > > I think it's better to clean by invalidating. We have better tools to form > untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs > such an escape hatch for some reason. It's also easier to review where an > untrusted pointer is being used in a program, and has zero cost at runtime. ok. Since it's more strict we can relax to untrusted later if necessary. > So far I'm leaning towards: > > bpf_rbtree_add(node) : node becomes non-owned ref > bpf_spin_unlock(lock) : node is invalidated ok > > > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock > > > and non-owning after bpf_rbtree_add. > > > > > > Walking the example from previous email: > > > > > > struct bpf_rbtree_iter it; > > > struct bpf_rb_node * node; > > > struct bpf_rb_node *n, *m; > > > > > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock > > > while ((node = bpf_rbtree_iter_next(&it)) { > > > // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0 > > > if (node && node->field == condition) { > > > > > > n = bpf_rbtree_remove(rb_root, node); > > > if (!n) ...; > > > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X > > > m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time > > > if (!m) ...; > > > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y > > > > > This second remove I would simply disallow as Dave is suggesting during > verification, by invalidating non-owning refs for rb_root. Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is our only remaining disagreement. I feel run-time checks will be fast enough and will improve usabililty. Also it feels that not doing cleanup on rbtree_remove is simpler to implement and reason about. 
Here is the proposal with one new field 'active_lock_id': first = bpf_rbtree_first(root) KF_RET_NULL check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0; R0->active_lock_id = root->reg->id R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog. same way we can add rb_find, rb_find_first, but not rb_next, rb_prev, since they don't have 'root' argument. bpf_rbtree_add(root, node, cb); KF_RELEASE. needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0 check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id calls release_reference(node->ref_obj_id) converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0; node->active_lock_id = root->reg->id 'node' is equivalent to 'first'. They both point to some element inside rbtree and valid inside spin_locked region. It's ok to read|write to both under lock. removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and usual check_reg_allocation_locked(root) R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL R0->ref_obj_id = R0->id = acquire_reference_state(); R0->active_lock_id should stay 0 mark_reg_unknown(node) bpf_spin_unlock(lock); checks lock->id == cur->active_lock.id for all regs in state if (reg->active_lock_id == lock->id) mark_reg_unknown(reg)
On 12/8/22 3:36 PM, Alexei Starovoitov wrote:
> [...]
> I feel run-time checks will be fast enough and will improve usability.
>
> Also it feels that not doing cleanup on rbtree_remove is simpler to
> implement and reason about.
>
> Here is the proposal with one new field 'active_lock_id':
>
> first = bpf_rbtree_first(root) KF_RET_NULL
>   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
>   R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
>   R0->active_lock_id = root->reg->id
>   R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
>
> same way we can add rb_find, rb_find_first,
> but not rb_next, rb_prev, since they don't have 'root' argument.
>
> bpf_rbtree_add(root, node, cb); KF_RELEASE.
>   needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
>   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
>   calls release_reference(node->ref_obj_id)
>   converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
>   node->active_lock_id = root->reg->id
>
> 'node' is equivalent to 'first'. They both point to some element
> inside rbtree and valid inside spin_locked region.
> It's ok to read|write to both under lock.
>
> removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
>   need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and
>   usual check_reg_allocation_locked(root)
>   R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
>   R0->ref_obj_id = R0->id = acquire_reference_state();
>   R0->active_lock_id should stay 0
>   mark_reg_unknown(node)
>
> bpf_spin_unlock(lock);
>   checks lock->id == cur->active_lock.id
>   for all regs in state
>     if (reg->active_lock_id == lock->id)
>       mark_reg_unknown(reg)

OK, so sounds like a few more points of agreement, regardless of whether
we go the runtime checking route or the other one:

* We're tossing 'full untrusted' for now. Non-owning references will not be
  allowed to escape the critical section. They'll be clobbered with
  mark_reg_unknown.

* No pressing need to make bpf_obj_drop callable from the critical section.
  As a result no owning or non-owning ref access can page fault.

* When spin_lock is unlocked, the verifier needs to know about all non-owning
  references so that it can clobber them. The current implementation -
  ref_obj_id + release_on_unlock - is bad for a number of reasons and should
  be replaced with something that doesn't use ref_obj_id or reg->id.
  * A specific better approach was proposed above: a new field + keeping track
    of lock and datastructure identity.

Differences in the proposed approaches:

"Type system checks + invalidation on 'destructive' rbtree ops"

* This approach tries to prevent aliasing problems by invalidating
  non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
  in addition to invalidation on spin_unlock.

* The type system guarantees invariants:
  * "if it's an owning ref, the node is guaranteed to not be in an rbtree"
  * "if it's a non-owning ref, the node is guaranteed to be in an rbtree"

* Downside: mass non-owning ref invalidation on rbtree_remove means some
  programs that logically don't have an aliasing problem will be rejected
  by the verifier. This will affect usability depending on how bad it is.

"Runtime checks + spin_unlock invalidation only"

* This approach allows for the possibility of the aliasing problem. As a
  result the invariants guaranteed in point 2 above don't necessarily hold.
* Helpers that add or remove need to account for the possibility that the
  node they're operating on has already been added / removed. They need to
  check this at runtime and nop if so.
* Non-owning refs are only invalidated on spin_unlock.
  * As a result, the usability issues of the previous approach don't happen
    here.
* Downside: need to do runtime checks, plus some additional verifier
  complexity to deal with the "runtime check failed" case due to the
  previous approach's invariant not holding.

Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
or remove) can be added to either approach (maybe - at least it was
specifically discussed for "runtime checks"). Such untrusted refs, by virtue
of being PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove}
as input. For the "type system" approach this might ameliorate some of the
usability issues. For the "runtime checks" approach it would only be useful
to let such refs escape spin_unlock. But we're not going to do
non-owning -> 'untrusted' for now, just listing it for completeness.

The distance between what I have now and the "type system" approach is
smaller than for the "runtime checks" approach. To get from "type system" to
"runtime checks" I'd need to:

* Remove the 'destructive op' invalidation points
* Add runtime checks to rbtree_{add,remove}
* Add verifier handling of the runtime-check-failure possibility

Of these, only the first point is getting rid of something added for the
"type system" approach, and it won't be much work relative to all the
refactoring and other improvements that are common between the two
approaches.

So for V2 I will do the "type system + invalidation on 'destructive' ops"
approach, as it'll take less time. This'll get eyes on the common
improvements faster. Then I can do a "runtime checks" v3 and we can compare
the usability of both on the same base.
On Thu, Dec 08, 2022 at 06:35:24PM -0500, Dave Marchevsky wrote:
> >
> > Here is the proposal with one new field 'active_lock_id':
> > [...]
> > bpf_spin_unlock(lock);
> >   checks lock->id == cur->active_lock.id
> >   for all regs in state
> >     if (reg->active_lock_id == lock->id)
> >       mark_reg_unknown(reg)
>
> OK, so sounds like a few more points of agreement, regardless of whether
> we go the runtime checking route or the other one:
>
> * We're tossing 'full untrusted' for now. non-owning references will not be
>   allowed to escape critical section. They'll be clobbered w/
>   mark_reg_unknown.

agree

> * No pressing need to make bpf_obj_drop callable from critical section.
>   As a result no owning or non-owning ref access can page fault.

agree

> * When spin_lock is unlocked, verifier needs to know about all non-owning
>   references so that it can clobber them. Current implementation -
>   ref_obj_id + release_on_unlock - is bad for a number of reasons, should
>   be replaced with something that doesn't use ref_obj_id or reg->id.
>   * Specific better approach was proposed above: new field + keep track
>     of lock and datastructure identity.

yes

> Differences in proposed approaches:
>
> "Type System checks + invalidation on 'destructive' rbtree ops"
>
> * This approach tries to prevent aliasing problems by invalidating
>   non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
>   in addition to invalidation on spin_unlock
>
> * Type system guarantees invariants:
>   * "if it's an owning ref, the node is guaranteed to not be in an rbtree"
>   * "if it's a non-owning ref, the node is guaranteed to be in an rbtree"
>
> * Downside: mass non-owning ref invalidation on rbtree_remove will make some
>   programs that logically don't have an aliasing problem be rejected by the
>   verifier. Will affect usability depending on how bad this is.

yes.

> "Runtime checks + spin_unlock invalidation only"
>
> * This approach allows for the possibility of aliasing problem. As a result
>   the invariants guaranteed in point 2 above don't necessarily hold.
> * Helpers that add or remove need to account for possibility that the node
>   they're operating on has already been added / removed. Need to check this
>   at runtime and nop if so.

Only 'remove' needs to check. 'add' is operating on an 'owning ref'. It cannot
fail. Some future 'add_here(root, owning_node_to_add, nonowning_location)' may
need to fail.

> * non-owning refs are only invalidated on spin_unlock.
>   * As a result, usability issues of previous approach don't happen here.
>
> * Downside: Need to do runtime checks, some additional verifier complexity
>   to deal with "runtime check failed" case due to prev approach's invariant
>   not holding
>
> Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
> or remove) can be added to either approach (maybe - at least it was
> specifically discussed for "runtime checks"). Such untrusted refs, by virtue
> of being PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove}
> as input.

correct.

> For the "type system" approach this might ameliorate some of the usability
> issues. For the "runtime checks" approach it would only be useful to let
> such refs escape spin_unlock.

The prog can do bpf_rdonly_cast() even after mark_unknown.

> But we're not going to do non-owning -> 'untrusted' for now, just listing for
> completeness.

Right, because of bpf_rdonly_cast availability.

> The distance between what I have now and "type system" approach is smaller
> than "runtime checks" approach. And to get from "type system" to "runtime
> checks" I'd need to:
>
> * Remove 'destructive op' invalidation points
> * Add runtime checks to rbtree_{add,remove}
> * Add verifier handling of runtime check failure possibility
>
> Of which only the first point is getting rid of something added for the
> "type system" approach, and won't be much work relative to all the refactoring
> and other improvements that are common between the two approaches.
>
> So for V2 I will do the "type system + invalidation on 'destructive' ops"
> approach as it'll take less time. This'll get eyes on common improvements
> faster. Then can do a "runtime checks" v3 and we can compare usability of both
> on same base.

Sure, if you think cleanup on rbtree_remove is faster to implement, then
definitely go for it. I was imagining the other way around, but it's fine.
Happy to be wrong. I'm not seeing, though, how you're going to do that cleanup.
Another id-like field?

Before doing all the coding, could you post a proposal in the format that I used
above? IMO it's much easier to think through in that form than by analyzing the
source code.