
[RFC,v6,1/5] bpf: Introduce rbtree map

Message ID 20221005171709.150520-2-xiyou.wangcong@gmail.com (mailing list archive)
State RFC
Delegated to: BPF
Series net_sched: introduce eBPF based Qdisc

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 pending Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 1691 this patch: 1693
netdev/cc_maintainers warning 10 maintainers not CCed: john.fastabend@gmail.com andrii@kernel.org yhs@fb.com ast@kernel.org haoluo@google.com jolsa@kernel.org kpsingh@kernel.org song@kernel.org daniel@iogearbox.net martin.lau@linux.dev
netdev/build_clang success Errors and warnings before: 171 this patch: 171
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 1683 this patch: 1685
netdev/checkpatch fail CHECK: Alignment should match open parenthesis CHECK: Comparison to NULL could be written "e" ERROR: that open brace { should be on the previous line WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Cong Wang Oct. 5, 2022, 5:17 p.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

Insert:
bpf_map_update(&map, &key, &val, flag);

Delete a specific key-val pair:
bpf_map_delete_elem(&map, &key);

Pop the minimum one:
bpf_map_pop(&map, &val);

Lookup:
val = bpf_map_lookup_elem(&map, &key);

Iterator:
bpf_for_each_map_elem(&map, callback, key, val);
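
For illustration, a minimal sketch of a BPF program driving such a map through
the standard *_elem helpers (map layout, names and attach point are made up for
this example, and it assumes the series wires the generic helpers up for the
new map type):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct val {
  	__u64 bytes;
  };

  struct {
  	__uint(type, BPF_MAP_TYPE_RBTREE);
  	__uint(max_entries, 1024);
  	__type(key, __u64);
  	__type(value, struct val);
  } rank_map SEC(".maps");

  /* bpf_for_each_map_elem() callback: return 0 to continue, 1 to stop */
  static __u64 dump_elem(struct bpf_map *map, __u64 *key, struct val *v, void *ctx)
  {
  	bpf_printk("rank %llu bytes %llu", *key, v->bytes);
  	return 0;
  }

  SEC("tc")
  int rbtree_example(struct __sk_buff *skb)
  {
  	__u64 key = skb->priority;
  	struct val v = { .bytes = skb->len }, min;

  	bpf_map_update_elem(&rank_map, &key, &v, BPF_ANY);	/* insert */
  	bpf_map_lookup_elem(&rank_map, &key);			/* lookup */
  	bpf_for_each_map_elem(&rank_map, dump_elem, NULL, 0);	/* iterate in key order */
  	bpf_map_pop_elem(&rank_map, &min);			/* pop the minimum key */
  	bpf_map_delete_elem(&rank_map, &key);			/* delete by key */
  	return 0;
  }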

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/Makefile       |   2 +-
 kernel/bpf/rbtree.c       | 445 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 448 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/rbtree.c

Comments

Kumar Kartikeya Dwivedi Oct. 11, 2022, 3:19 p.m. UTC | #1
On Wed, 5 Oct 2022 at 22:53, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cong.wang@bytedance.com>
>
> Insert:
> bpf_map_update(&map, &key, &val, flag);
>
> Delete a specific key-val pair:
> bpf_map_delete_elem(&map, &key);
>
> Pop the minimum one:
> bpf_map_pop(&map, &val);
>
> Lookup:
> val = bpf_map_lookup_elem(&map, &key);
>
> Iterator:
> bpf_for_each_map_elem(&map, callback, key, val);
>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Instead of a dedicated BPF map and using kptr inside the map value, we
should probably lift Dave's series [0] adding the rbtree, and allow
linking sk_buff ctx directly into it. It would require recognising the
rb_node in sk_buff (or __sk_buff shadow struct) as a valid bpf_rb_node
similar to those in user allocated types. Overall it would be a much
better approach IMO and avoid having different rbtree implementations.
We would probably follow a similar approach for xdp_frame as well.

It can also be a union of bpf_rb_node, bpf_list_node, etc. Since the
type can only be in one collection at a time, this would allow it to
be linked into different kinds of structures without wasting any
space.

[0]: https://lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
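
As a rough illustration of that layout (the node type names here follow the
series under discussion and are not a settled API; the struct and fields are
hypothetical):

  struct flow_node {
  	__u64 rank;
  	union {
  		struct bpf_rb_node rb;		/* when linked into an rbtree */
  		struct bpf_list_node list;	/* when linked into a linked list */
  	};	/* one collection at a time, so the union costs no extra space */
  };
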
Dave Marchevsky Oct. 11, 2022, 5:05 p.m. UTC | #2
Hi Cong,

On 10/5/22 1:17 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Insert:
> bpf_map_update(&map, &key, &val, flag);
> 
> Delete a specific key-val pair:
> bpf_map_delete_elem(&map, &key);
> 
> Pop the minimum one:
> bpf_map_pop(&map, &val);
> 
> Lookup:
> val = bpf_map_lookup_elem(&map, &key);
> 
> Iterator:
> bpf_for_each_map_elem(&map, callback, key, val);
> 
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/linux/bpf_types.h |   1 +
>  include/uapi/linux/bpf.h  |   1 +
>  kernel/bpf/Makefile       |   2 +-
>  kernel/bpf/rbtree.c       | 445 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 448 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/rbtree.c
> 

I've also been working on a rbtree wrapper [0], with the goal of supporting
scheduler-like usecases as you mentioned in your cover letter. This morphed into
a larger discussion around "next-generation BPF datastructures" with Kumar,
Alexei, and others. Many of the points we reached consensus on are relevant to
the rbtree patches in this series, I'll try to link context where appropriate.

For next-gen BPF datastructures, we'd like to move away from exposing them as
BPF_MAPs with standard bpf helpers to access / manipulate. Instead, they should
be exposed as kptrs and manipulated by unstable kfuncs. The kptr representing the
datastructure can live inside the map_value of an existing BPF_MAP - e.g. inside an
ARRAY map_value as a global value. This means that next-gen datastructures can
be added - and their API changed post-addition - without UAPI changes. For
rbtree this is particularly desired as a few other folks also want a BPF rbtree
and we'd like to make sure as many usecases as possible are supported.
Discussion around this started in the initial rbtree RFC [1] (specifically patch
5's thread), and Alexei mentioned it during his LPC 2022 talk (slides [2], video
[3]; slides 22 to the end are particularly relevant).

This next-gen style also allows us to move locks and locking behavior out of
helpers and into the BPF program itself, with the verifier checking that the
correct lock is held when the datastructure is manipulated. This discussion started
in [1] and continued in Kumar's [4] and my [0], both of which implement such
behavior. David Vernet's LWN article [5] touches on this in "Plans for the
future" section as well.

Kumar's recently-submitted evolution of [4], [6], implements the above as
well as other groundwork necessary for next-gen datastructures and a next-gen
linked list implementation. I've been modifying my rbtree implementation to
meet next-gen datastructure expectations - specifically kptr and kfunc, as it's
still using BPF_MAP/helpers like this series - and should be able to submit it
soon.

I'd like to make sure my impl works for you as well, so there are some questions
about the approach below. I didn't focus too much on coding details, as requested
in the cover letter.

  [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
  [1]: lore.kernel.org/bpf/20220722183438.3319790-1-davemarchevsky@fb.com
  [2]: lpc.events/event/16/contributions/1346/attachments/1021/1966/bpf_LPC_2022.pdf
  [3]: https://www.youtube.com/watch?v=andvNRUAAs0&t=91s
  [4]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
  [5]: lwn.net/Articles/909095/
  [6]: lore.kernel.org/bpf/20221011012240.3149-1-memxor@gmail.com

> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 2c6a4f2562a7..c53ba6de1613 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -127,6 +127,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops)
>  
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..9492cd3af701 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -935,6 +935,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_TASK_STORAGE,
>  	BPF_MAP_TYPE_BLOOM_FILTER,
>  	BPF_MAP_TYPE_USER_RINGBUF,
> +	BPF_MAP_TYPE_RBTREE,
>  };
>  
>  /* Note that tracing related programs such as
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 341c94f208f4..e60249258c74 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -7,7 +7,7 @@ endif
>  CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>  
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o rbtree.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
> diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c
> new file mode 100644
> index 000000000000..f1a9b1c40b8b
> --- /dev/null
> +++ b/kernel/bpf/rbtree.c
> @@ -0,0 +1,445 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * rbtree.c: eBPF rbtree map
> + *
> + * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
> + */
> +#include <linux/bpf.h>
> +#include <linux/slab.h>
> +#include <linux/capability.h>
> +#include <linux/rbtree.h>
> +#include <linux/btf_ids.h>
> +#include <linux/bpf_mem_alloc.h>
> +#include <linux/math.h>
> +#include <linux/seq_file.h>
> +
> +#define RBTREE_CREATE_FLAG_MASK \
> +	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
> +
> +/* each rbtree element is struct rbtree_elem + key + value */
> +struct rbtree_elem {
> +	struct rb_node rbnode;
> +	char key[] __aligned(8);

My impl takes a different approach to keys, instead allowing the user to do

  struct node_data {
	  struct rb_node node;
	  __u32 key;
	  __u32 val;
  };

in their BPF program, similar to normal kernel style. This has the downside
of requiring custom comparator callbacks for any operation requiring traversal,
which prevents the common lookup/update/delete helpers from being used. But the
next-gen kptr / kfunc datastructure style won't use the common helpers anyway.
Will something like this work for your usecase?
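
For example, a less()-style callback in the spirit of the kernel's rb_add()
convention (the exact callback signature the series ends up with may differ;
this only shows the shape):

  static bool node_less(struct rb_node *a, const struct rb_node *b)
  {
  	struct node_data *na = container_of(a, struct node_data, node);
  	const struct node_data *nb = container_of(b, struct node_data, node);

  	return na->key < nb->key;
  }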

> +};
> +
> +struct rbtree_map {
> +	struct bpf_map map;
> +	struct bpf_mem_alloc ma;
> +	raw_spinlock_t lock;
> +	struct rb_root root;

In my RFCv1, Alexei suggested having the underlying rb_root be the "_cached"
version as it's likely that _cached behavior will be better for most usecases
by default. Do you need the raw rb_root?
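
For reference, a sketch of what the insert path becomes if rb->root is switched
to a struct rb_root_cached; rb_first_cached() then makes the pop-the-minimum
case O(1), and rb_erase_cached() keeps the leftmost pointer up to date:

  static void rbtree_map_insert(struct rbtree_map *rb, struct rbtree_elem *e)
  {
  	struct rb_node **p = &rb->root.rb_root.rb_node;
  	struct rb_node *parent = NULL;
  	bool leftmost = true;

  	while (*p) {
  		parent = *p;
  		if (memcmp(e->key, rb_to_elem(parent)->key, rb->map.key_size) < 0) {
  			p = &parent->rb_left;
  		} else {
  			p = &parent->rb_right;
  			leftmost = false;
  		}
  	}
  	rb_link_node(&e->rbnode, parent, p);
  	rb_insert_color_cached(&e->rbnode, &rb->root, leftmost);
  }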


> +	atomic_t nr_entries;

Is nr_entries used anywhere? I could only find incr/decr in this series.

> +};
> +
> +#define rb_to_elem(rb) rb_entry_safe(rb, struct rbtree_elem, rbnode)
> +#define elem_rb_first(root) rb_to_elem(rb_first(root))
> +#define elem_rb_last(root)  rb_to_elem(rb_last(root))
> +#define elem_rb_next(e)   rb_to_elem(rb_next(&(e)->rbnode))
> +#define rbtree_walk_safe(e, tmp, root)					\
> +		for (e = elem_rb_first(root);				\
> +		     tmp = e ? elem_rb_next(e) : NULL, (e != NULL);	\
> +		     e = tmp)
> +
> +static struct rbtree_map *rbtree_map(struct bpf_map *map)
> +{
> +	return container_of(map, struct rbtree_map, map);
> +}
> +
> +/* Called from syscall */
> +static int rbtree_map_alloc_check(union bpf_attr *attr)
> +{
> +	if (!bpf_capable())
> +		return -EPERM;
> +
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 ||
> +	    attr->map_flags & ~RBTREE_CREATE_FLAG_MASK ||
> +	    !bpf_map_flags_access_ok(attr->map_flags))
> +		return -EINVAL;
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		/* if value_size is bigger, the user space won't be able to
> +		 * access the elements.
> +		 */
> +		return -E2BIG;
> +
> +	return 0;
> +}
> +
> +static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr)
> +{
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	struct rbtree_map *rb;
> +	u32 elem_size;
> +	int err;
> +
> +	rb = bpf_map_area_alloc(sizeof(*rb), numa_node);
> +	if (!rb)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memset(rb, 0, sizeof(*rb));
> +	bpf_map_init_from_attr(&rb->map, attr);
> +	raw_spin_lock_init(&rb->lock);
> +	rb->root = RB_ROOT;
> +	atomic_set(&rb->nr_entries, 0);
> +
> +	elem_size = sizeof(struct rbtree_elem) +
> +			  round_up(rb->map.key_size, 8);
> +	elem_size += round_up(rb->map.value_size, 8);

Will your usecase's rbtree values always have the same size?

> +	err = bpf_mem_alloc_init(&rb->ma, elem_size, false);

Although my most recently-sent rbtree patchset doesn't use the allocator Alexei
added, the next version will.

> +	if (err) {
> +		bpf_map_area_free(rb);
> +		return ERR_PTR(err);
> +	}
> +	return &rb->map;
> +}
> +
> +static void check_and_free_fields(struct rbtree_map *rb,
> +				  struct rbtree_elem *elem)
> +{
> +	void *map_value = elem->key + round_up(rb->map.key_size, 8);
> +
> +	if (map_value_has_kptrs(&rb->map))
> +		bpf_map_free_kptrs(&rb->map, map_value);
> +}
> +
> +static void rbtree_map_purge(struct bpf_map *map)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e, *tmp;
> +
> +	rbtree_walk_safe(e, tmp, &rb->root) {
> +		rb_erase(&e->rbnode, &rb->root);
> +		check_and_free_fields(rb, e);
> +		bpf_mem_cache_free(&rb->ma, e);
> +	}
> +}
> +
> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
> +static void rbtree_map_free(struct bpf_map *map)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&rb->lock, flags);
> +	rbtree_map_purge(map);
> +	raw_spin_unlock_irqrestore(&rb->lock, flags);
> +	bpf_mem_alloc_destroy(&rb->ma);
> +	bpf_map_area_free(rb);
> +}
> +
> +static struct rbtree_elem *bpf_rbtree_find(struct rb_root *root, void *key, int size)
> +{
> +	struct rb_node **p = &root->rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rbtree_elem *e;
> +
> +	while (*p) {
> +		int ret;
> +
> +		parent = *p;
> +		e = rb_to_elem(parent);
> +		ret = memcmp(key, e->key, size);
> +		if (ret < 0)
> +			p = &parent->rb_left;
> +		else if (ret > 0)
> +			p = &parent->rb_right;
> +		else
> +			return e;
> +	}
> +	return NULL;
> +}
> +
> +/* Called from eBPF program or syscall */
> +static void *rbtree_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e;
> +
> +	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
> +	if (!e)
> +		return NULL;
> +	return e->key + round_up(rb->map.key_size, 8);
> +}
> +
> +static int check_flags(struct rbtree_elem *old, u64 map_flags)
> +{
> +	if (old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
> +		/* elem already exists */
> +		return -EEXIST;
> +
> +	if (!old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
> +		/* elem doesn't exist, cannot update it */
> +		return -ENOENT;
> +
> +	return 0;
> +}
> +
> +static void rbtree_map_insert(struct rbtree_map *rb, struct rbtree_elem *e)
> +{
> +	struct rb_root *root = &rb->root;
> +	struct rb_node **p = &root->rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rbtree_elem *e1;
> +
> +	while (*p) {
> +		parent = *p;
> +		e1 = rb_to_elem(parent);
> +		if (memcmp(e->key, e1->key, rb->map.key_size) < 0)
> +			p = &parent->rb_left;
> +		else
> +			p = &parent->rb_right;
> +	}
> +	rb_link_node(&e->rbnode, parent, p);
> +	rb_insert_color(&e->rbnode, root);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int rbtree_map_update_elem(struct bpf_map *map, void *key, void *value,
> +			       u64 map_flags)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	void *val = rbtree_map_lookup_elem(map, key);
> +	int ret;
> +
> +	ret = check_flags(val, map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (!val) {
> +		struct rbtree_elem *e_new;
> +		unsigned long flags;
> +
> +		e_new = bpf_mem_cache_alloc(&rb->ma);
> +		if (!e_new)
> +			return -ENOMEM;
> +		val = e_new->key + round_up(rb->map.key_size, 8);
> +		check_and_init_map_value(&rb->map, val);
> +		memcpy(e_new->key, key, rb->map.key_size);
> +		raw_spin_lock_irqsave(&rb->lock, flags);
> +		rbtree_map_insert(rb, e_new);
> +		raw_spin_unlock_irqrestore(&rb->lock, flags);
> +		atomic_inc(&rb->nr_entries);
> +	}
> +
> +	if (map_flags & BPF_F_LOCK)
> +		copy_map_value_locked(map, val, value, false);
> +	else
> +		copy_map_value(map, val, value);
> +	return 0;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int rbtree_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&rb->lock, flags);
> +	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
> +	if (!e) {
> +		raw_spin_unlock_irqrestore(&rb->lock, flags);
> +		return -ENOENT;
> +	}
> +	rb_erase(&e->rbnode, &rb->root);
> +	raw_spin_unlock_irqrestore(&rb->lock, flags);
> +	check_and_free_fields(rb, e);
> +	bpf_mem_cache_free(&rb->ma, e);
> +	atomic_dec(&rb->nr_entries);
> +	return 0;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int rbtree_map_pop_elem(struct bpf_map *map, void *value)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e = elem_rb_first(&rb->root);
> +	unsigned long flags;
> +	void *val;
> +
> +	if (!e)
> +		return -ENOENT;
> +	raw_spin_lock_irqsave(&rb->lock, flags);
> +	rb_erase(&e->rbnode, &rb->root);
> +	raw_spin_unlock_irqrestore(&rb->lock, flags);
> +	val = e->key + round_up(rb->map.key_size, 8);
> +	copy_map_value(map, value, val);
> +	check_and_free_fields(rb, e);
> +	bpf_mem_cache_free(&rb->ma, e);
> +	atomic_dec(&rb->nr_entries);
> +	return 0;
> +}
> +
> +/* Called from syscall */
> +static int rbtree_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e;
> +
> +	if (!key) {
> +		e = elem_rb_first(&rb->root);
> +		if (!e)
> +			return -ENOENT;
> +		goto found;
> +	}
> +	e = bpf_rbtree_find(&rb->root, key, rb->map.key_size);
> +	if (!e)
> +		return -ENOENT;
> +	e = elem_rb_next(e);
> +	if (!e)
> +		return 0;
> +found:
> +	memcpy(next_key, e->key, map->key_size);
> +	return 0;
> +}
> +
> +static int bpf_for_each_rbtree_map(struct bpf_map *map,
> +				   bpf_callback_t callback_fn,
> +				   void *callback_ctx, u64 flags)
> +{
> +	struct rbtree_map *rb = rbtree_map(map);
> +	struct rbtree_elem *e, *tmp;
> +	void *key, *value;
> +	u32 num_elems = 0;
> +	u64 ret = 0;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	rbtree_walk_safe(e, tmp, &rb->root) {
> +		num_elems++;
> +		key = e->key;
> +		value = key + round_up(rb->map.key_size, 8);
> +		ret = callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value,
> +				  (u64)(long)callback_ctx, 0);
> +		/* return value: 0 - continue, 1 - stop and return */
> +		if (ret)
> +			break;
> +	}
> +
> +	return num_elems;
> +}
> +
> +struct rbtree_map_seq_info {
> +	struct bpf_map *map;
> +	struct rbtree_map *rb;
> +};
> +
> +static void *rbtree_map_seq_find_next(struct rbtree_map_seq_info *info,
> +				      struct rbtree_elem *prev_elem)
> +{
> +	const struct rbtree_map *rb = info->rb;
> +	struct rbtree_elem *elem;
> +
> +	/* try to find next elem in the same bucket */
> +	if (prev_elem) {
> +		elem = elem_rb_next(prev_elem);
> +		if (elem)
> +			return elem;
> +		return NULL;
> +	}
> +
> +	return elem_rb_first(&rb->root);
> +}
> +
> +static void *rbtree_map_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct rbtree_map_seq_info *info = seq->private;
> +
> +	if (*pos == 0)
> +		++*pos;
> +
> +	/* pairs with rbtree_map_seq_stop */
> +	rcu_read_lock();
> +	return rbtree_map_seq_find_next(info, NULL);
> +}
> +
> +static void *rbtree_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct rbtree_map_seq_info *info = seq->private;
> +
> +	++*pos;
> +	return rbtree_map_seq_find_next(info, v);
> +}
> +
> +static int rbtree_map_seq_show(struct seq_file *seq, void *v)
> +{
> +	struct rbtree_map_seq_info *info = seq->private;
> +	struct bpf_iter__bpf_map_elem ctx = {};
> +	struct rbtree_elem *elem = v;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, !elem);
> +	if (!prog)
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.map = info->map;
> +	if (elem) {
> +		ctx.key = elem->key;
> +		ctx.value = elem->key + round_up(info->map->key_size, 8);
> +	}
> +
> +	return bpf_iter_run_prog(prog, &ctx);
> +}
> +
> +static void rbtree_map_seq_stop(struct seq_file *seq, void *v)
> +{
> +	if (!v)
> +		(void)rbtree_map_seq_show(seq, NULL);
> +
> +	/* pairs with rbtree_map_seq_start */
> +	rcu_read_unlock();
> +}
> +
> +static const struct seq_operations rbtree_map_seq_ops = {
> +	.start	= rbtree_map_seq_start,
> +	.next	= rbtree_map_seq_next,
> +	.stop	= rbtree_map_seq_stop,
> +	.show	= rbtree_map_seq_show,
> +};
> +
> +static int rbtree_map_init_seq_private(void *priv_data,
> +				       struct bpf_iter_aux_info *aux)
> +{
> +	struct rbtree_map_seq_info *info = priv_data;
> +
> +	bpf_map_inc_with_uref(aux->map);
> +	info->map = aux->map;
> +	info->rb = rbtree_map(info->map);
> +	return 0;
> +}
> +
> +static void rbtree_map_fini_seq_private(void *priv_data)
> +{
> +	struct rbtree_map_seq_info *info = priv_data;
> +
> +	bpf_map_put_with_uref(info->map);
> +}
> +
> +static const struct bpf_iter_seq_info rbtree_map_iter_seq_info = {
> +	.seq_ops		= &rbtree_map_seq_ops,
> +	.init_seq_private	= rbtree_map_init_seq_private,
> +	.fini_seq_private	= rbtree_map_fini_seq_private,
> +	.seq_priv_size		= sizeof(struct rbtree_map_seq_info),
> +};
> +
> +BTF_ID_LIST_SINGLE(rbtree_map_btf_ids, struct, rbtree_map)
> +const struct bpf_map_ops rbtree_map_ops = {
> +	.map_meta_equal = bpf_map_meta_equal,
> +	.map_alloc_check = rbtree_map_alloc_check,
> +	.map_alloc = rbtree_map_alloc,
> +	.map_free = rbtree_map_free,
> +	.map_lookup_elem = rbtree_map_lookup_elem,
> +	.map_update_elem = rbtree_map_update_elem,
> +	.map_delete_elem = rbtree_map_delete_elem,
> +	.map_pop_elem = rbtree_map_pop_elem,
> +	.map_get_next_key = rbtree_map_get_next_key,
> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
> +	.map_for_each_callback = bpf_for_each_rbtree_map,
> +	.map_btf_id = &rbtree_map_btf_ids[0],
> +	.iter_seq_info = &rbtree_map_iter_seq_info,
> +};
> +