[RFC,bpf-next,00/11] bpf: Introduce rbtree map

Message ID 20220722183438.3319790-1-davemarchevsky@fb.com

Message

Dave Marchevsky July 22, 2022, 6:34 p.m. UTC
Introduce bpf_rbtree map data structure. As the name implies, rbtree map
allows bpf programs to use red-black trees similarly to kernel code.
Programs interact with rbtree maps in a much more open-coded way than
with classic map implementations. Some example code to demonstrate:

  node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
  if (!node)
    return 0;

  node->one = calls;
  node->two = 6;
  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (!ret) {
    bpf_rbtree_free_node(&rbtree, node);
    goto unlock_ret;
  }

unlock_ret:
  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
  return 0;
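For reference, a sketch of what the 'less' comparator used above might
look like; the callback signature and the struct node_data layout shown
here are assumptions, with the real definitions living in the
implementation and selftest patches:

  struct node_data {
    struct rb_node node;
    __u32 one;
    __u32 two;
  };

  /* Comparator callback passed to bpf_rbtree_add(); returns true if
   * 'a' should sort before 'b', mirroring kernel rb-tree less() cbs.
   */
  static bool less(struct rb_node *a, const struct rb_node *b)
  {
    struct node_data *node_a = container_of(a, struct node_data, node);
    struct node_data *node_b = container_of(b, struct node_data, node);

    return node_a->one < node_b->one;
  }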


This series is in a heavy RFC state, with some added verifier semantics
needing improvement before they can be considered safe. I am sending it
early to gather feedback on the approach:

  * Does the API seem reasonable and might it be useful for others?

  * Do new verifier semantics added in this series make logical sense?
    Are there any glaring safety holes aside from those called out in
    individual patches?

Please see individual patches for more in-depth explanation. A quick
summary of patches follows:


Patches 1-3 extend verifier and BTF searching logic in minor ways to
prepare for the rbtree implementation patch.
  bpf: Pull repeated reg access bounds check into helper fn
  bpf: Add verifier support for custom callback return range
  bpf: Add rb_node_off to bpf_map


Patch 4 adds basic rbtree map implementation.
  bpf: Add rbtree map

Note that the 'complete' implementation requires concepts and changes
introduced in later patches in the series. The series is currently
arranged this way to ease RFC review.
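As a rough sketch, a BPF program might declare such a map as below;
the exact definition here is an assumption, with the actual one in the
selftests patch:

  struct {
    __uint(type, BPF_MAP_TYPE_RBTREE);
    __type(value, struct node_data);
  } rbtree SEC(".maps");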


Patches 5-7 add a spinlock to the rbtree map, with semantics that
differ somewhat from existing verifier spinlock handling.
  bpf: Add bpf_spin_lock member to rbtree
  bpf: Add bpf_rbtree_{lock,unlock} helpers
  bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}

Notably, rbtree's bpf_spin_lock must be held while manipulating the tree
via helpers, while existing verifier spinlock logic prevents any helper
calls while the lock is held. In the current state this is worked around
by not having the verifier treat rbtree's lock specially in any way. This
needs to be improved before leaving RFC state, as it's unsafe.
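To illustrate the conflict, a sketch of the critical section rbtree
requires, which existing bpf_spin_lock verifier logic would reject:

  bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
  /* Existing verifier logic forbids any helper call between
   * bpf_spin_lock() and bpf_spin_unlock(), but tree manipulation
   * helpers must run with the lock held:
   */
  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));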


Patch 8 adds the concept of non-owning references, firming up the
semantics of helpers that return a pointer to a node owned by the
rbtree. See patch 4's summary for additional discussion of node
ownership.
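As a sketch of the intended semantics (the exact rules are in patch 8),
a successful add hands ownership of the node to the tree, and the
returned pointer becomes a non-owning reference:

  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (ret) {
    /* 'ret' is a non-owning reference: field access is fine while
     * the lock is held, but the program may not free the node
     * through it since the tree now owns the node.
     */
    ret->two = 42;
  }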


Patch 9 adds a 'conditional release' concept: helpers which release a
resource but may fail to do so, requiring the verifier to enforce that
the BPF program handles such a failure appropriately, namely by freeing
the resource another way.
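bpf_rbtree_add is the motivating case, as in this sketch echoing the
example above:

  ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
  if (!ret) {
    /* Add failed, so 'node' was not released to the tree; the
     * verifier requires the program to free it another way.
     */
    bpf_rbtree_free_node(&rbtree, node);
  }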


Patch 10 adds 'iter' type flags which teach the verifier to understand
open-coded iteration of a data structure. Specifically, with such flags
the verifier can understand that this loop eventually ends:

  struct node_data *iter = (struct node_data *)bpf_rbtree_first(&rbtree);

  while (iter) {
    node_ct++;
    iter = (struct node_data *)bpf_rbtree_next(&rbtree, iter);
  }

NOTE: Patch 10's logic is currently very unsafe and it's unclear whether
there's a safe path forward that isn't too complex. It's the most RFC-ey
of all the patches.


Patch 11 adds tests. Best to start here to see BPF programs using the
rbtree map as intended.


This series is based on top of the "bpf: Cleanup check_refcount_ok" patch,
which was submitted separately [0] and is therefore not included here. That
patch is likely to be applied before this series leaves RFC state, so this
will just be rebased onto newer bpf-next/master.

  [0]: lore.kernel.org/bpf/20220719215536.2787530-1-davemarchevsky@fb.com/

Dave Marchevsky (11):
  bpf: Pull repeated reg access bounds check into helper fn
  bpf: Add verifier support for custom callback return range
  bpf: Add rb_node_off to bpf_map
  bpf: Add rbtree map
  bpf: Add bpf_spin_lock member to rbtree
  bpf: Add bpf_rbtree_{lock,unlock} helpers
  bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
  bpf: Add OBJ_NON_OWNING_REF type flag
  bpf: Add CONDITIONAL_RELEASE type flag
  bpf: Introduce PTR_ITER and PTR_ITER_END type flags
  selftests/bpf: Add rbtree map tests

 include/linux/bpf.h                           |  13 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |   2 +
 include/linux/btf.h                           |   1 +
 include/uapi/linux/bpf.h                      | 121 ++++++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/btf.c                              |  21 +
 kernel/bpf/helpers.c                          |  42 +-
 kernel/bpf/rbtree.c                           | 401 ++++++++++++++++++
 kernel/bpf/syscall.c                          |   3 +
 kernel/bpf/verifier.c                         | 382 +++++++++++++++--
 tools/include/uapi/linux/bpf.h                | 121 ++++++
 .../selftests/bpf/prog_tests/rbtree_map.c     | 164 +++++++
 .../testing/selftests/bpf/progs/rbtree_map.c  | 108 +++++
 .../selftests/bpf/progs/rbtree_map_fail.c     | 236 +++++++++++
 .../bpf/progs/rbtree_map_load_fail.c          |  24 ++
 16 files changed, 1605 insertions(+), 37 deletions(-)
 create mode 100644 kernel/bpf/rbtree.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_fail.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_map_load_fail.c

Comments

Yonghong Song July 28, 2022, 7:04 a.m. UTC | #1
On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
> allows bpf programs to use red-black trees similarly to kernel code.
> Programs interact with rbtree maps in a much more open-coded way than
> more classic map implementations. Some example code to demonstrate:
> 
>    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>    if (!node)
>      return 0;
> 
>    node->one = calls;
>    node->two = 6;
>    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));

Can we just do
      bpf_rbtree_lock(&rbtree)
      bpf_rbtree_unlock(&rbtree)
? Looks like the only places bpf_rbtree_get_lock() is used are
as arguments to bpf_rbtree_lock/unlock or bpf_spin_lock/unlock?

> 
>    ret = (struct node_data *)bpf_rbtree_add(&rbtree, node, less);
>    if (!ret) {
>      bpf_rbtree_free_node(&rbtree, node);
>      goto unlock_ret;
>    }
> 
> unlock_ret:
>    bpf_rbtree_unlock(bpf_rbtree_get_lock(&rbtree));
>    return 0;
> 
> 
> This series is in a heavy RFC state, with some added verifier semantics
> needing improvement before they can be considered safe. I am sending
> early to gather feedback on approach:
> 
>    * Does the API seem reasonable and might it be useful for others?
> 
>    * Do new verifier semantics added in this series make logical sense?
>      Are there any glaring safety holes aside from those called out in
>      individual patches?
> 
> Please see individual patches for more in-depth explanation. A quick
> summary of patches follows:
> 
> 
> Patches 1-3 extend verifier and BTF searching logic in minor ways to
> prepare for rbtree implementation patch.
>    bpf: Pull repeated reg access bounds check into helper fn
>    bpf: Add verifier support for custom callback return range
>    bpf: Add rb_node_off to bpf_map
> 
> 
> Patch 4 adds basic rbtree map implementation.
>    bpf: Add rbtree map
> 
> Note that 'complete' implementation requires concepts and changes
> introduced in further patches in the series. The series is currently
> arranged in this way to ease RFC review.
> 
> 
> Patches 5-7 add a spinlock to the rbtree map, with some differing
> semantics from existing verifier spinlock handling.
>    bpf: Add bpf_spin_lock member to rbtree
>    bpf: Add bpf_rbtree_{lock,unlock} helpers
>    bpf: Enforce spinlock hold for bpf_rbtree_{add,remove,find}
> 
> Notably, rbtree's bpf_spin_lock must be held while manipulating the tree
> via helpers, while existing spinlock verifier logic prevents any helper
> calls while lock is held. In current state this is worked around by not
> having the verifier treat rbtree's lock specially in any way. This
> needs to be improved before leaving RFC state as it's unsafe.
> 
[...]
Alexei Starovoitov Aug. 1, 2022, 9:27 p.m. UTC | #2
On 7/22/22 11:34 AM, Dave Marchevsky wrote:
> [...]
> This series is in a heavy RFC state, with some added verifier semantics
> needing improvement before they can be considered safe. I am sending
> early to gather feedback on approach:
> 
>    * Does the API seem reasonable and might it be useful for others?

Overall it seems to be on the right track.
The two main issue to resolve are locks and iterators.
The rest needs work too, but it's getting close :)
Will reply in individual patches.
Andrii Nakryiko Aug. 2, 2022, 10:02 p.m. UTC | #3
On Fri, Jul 22, 2022 at 11:34 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
> [...]

I just skimmed through commit descriptions and wanted to ask few questions.

1. It's not exactly clear from the descriptions what the advantages of
having struct node_data are vs. sticking to the more typical BPF map
interface with bpf_map_lookup_elem() and so on. Am I right that the
biggest advantage is that we can move a node from one RB tree to
another? But if that's the only reason, why can't we have a
specialized bpf_rbtree_move(rb1, rb2, value) (where rb1 and rb2 are
the trees involved in the move, and value is whatever is returned by
bpf_map_lookup_elem, which internally will also keep the rbtree_node
hidden in front of the value pointer, just like we do with ringbuf)?

2. As for an rbtree node being pre-allocated in a more permissive mode
and then inserted into the RB tree later on in a more restrictive
(e.g., NMI) mode: wouldn't this problem be mostly solved by Alexei's
BPF-specific memory allocator? And right now you are requiring the
insert (or free) to happen in the same mode the node was allocated in,
right? Is that just a current limitation you are planning to lift? If
yes, what would be the mechanism to "temporarily" store the
rbtree_node somewhere? kptr? dynptr? In both cases it's not exactly
clear how type information is preserved.

3. As for locking. Assuming we add bpf_rbtree_move_node() helper, you
can pass lock as an argument to it instead of storing lock inside the
RB tree itself. Preventing deadlocks and stuff like that is still hard
and tricky, but at least you won't have to worry about storing locks
inside RB trees themselves.


Just some questions that I still remember from a brief reading of
this. I think it would be extremely helpful for me and others to have
a clear "why we need the open-coded rbtree_node approach" question
answered, with a comparison to a possible alternative using a more
conventional BPF approach.

Overall, I'm still trying to wrap my head around this. The approach
and stated goals (e.g., moving nodes between two trees, all the
locking story, etc.) look very complicated, so it will take some time
to internalize this new approach and convince myself it's doable.
Dave Marchevsky Aug. 10, 2022, 5:54 p.m. UTC | #4
On 7/28/22 3:04 AM, Yonghong Song wrote:   
> 
> 
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>> Introduce bpf_rbtree map data structure. As the name implies, rbtree map
>> allows bpf programs to use red-black trees similarly to kernel code.
>> Programs interact with rbtree maps in a much more open-coded way than
>> more classic map implementations. Some example code to demonstrate:
>>
>>    node = bpf_rbtree_alloc_node(&rbtree, sizeof(struct node_data));
>>    if (!node)
>>      return 0;
>>
>>    node->one = calls;
>>    node->two = 6;
>>    bpf_rbtree_lock(bpf_rbtree_get_lock(&rbtree));
> 
> Can we just do
>      bpf_rbtree_lock(&rbtree)
>      bpf_rbtree_unlock(&rbtree)
> ? Looks like the only places bpf_rbtree_get_lock() used are
> as arguments of bpf_rbtree_lock/unlock or bpf_spin_lock/unlock?
> 

Summarizing our VC convo: the intent here is to have the lock live
separately from the tree, meaning it'd be separately initialized and
passed into the tree instead of the current state where it's a
bpf_rbtree field.

If the tree still keeps a pointer to the lock - ideally passed in when
the rbtree map is instantiated - it'll be possible to move to a cleaner
API as you describe. The very explicit way it's done in this series is
not the end state I'd like either.

>> [...]
Dave Marchevsky Aug. 10, 2022, 6:11 p.m. UTC | #5
On 8/1/22 5:27 PM, Alexei Starovoitov wrote:   
> On 7/22/22 11:34 AM, Dave Marchevsky wrote:
>> [...]
>> This series is in a heavy RFC state, with some added verifier semantics
>> needing improvement before they can be considered safe. I am sending
>> early to gather feedback on approach:
>>
>>    * Does the API seem reasonable and might it be useful for others?
> 
> Overall it seems to be on the right track.
> The two main issue to resolve are locks and iterators.
> The rest needs work too, but it's getting close :)
> Will reply in individual patches.

I generally agree with the comments on individual patches. We talked about the
whole series over VC quite a bit, I will summarize the larger points below.

* It's worth experimenting with statically verifying correct locking instead
of the current dynamic approach, where helpers that really shouldn't fail -
e.g. adding / removing from the tree - can fail if the lock isn't held. If
it's possible to get that working we can get rid of the CONDITIONAL_RELEASE
concept, since a failing add operation is the only thing that needs it.

* You're not sold on the need for the OBJ_NON_OWNING_REF flag in the
verifier, and feel that it makes the API more confusing to work with. It's
worth experimenting with a 'valid use-after-free' (really use-after-release
here) concept that doesn't require a new type flag, or just accepting
"racy, but safe".

* Open-coded iteration will take a long time to do correctly and
should be dropped for now.