| Message ID | 20231215001209.3252729-1-yonghong.song@linux.dev (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to: | BPF |
| Series | bpf: Reduce memory usage for bpf_global_percpu_ma |
On 12/14/23 4:12 PM, Yonghong Song wrote:
> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
> added support for non-fix-size percpu memory allocation. Such allocation
> allocates percpu memory for all buckets on all cpus, and the memory
> consumption is quadratic in the number of cpus.

[...]

> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
> +{
> +	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};

Sorry, an oversight here. The above line should be removed, since the
patch already moves the sizes array to file scope. Will fix in the next
revision.

[...]
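For reference, the intended calling convention of the two new helpers,
condensed from the core.c and verifier.c hunks of this patch (a sketch
rather than the verbatim kernel code; error handling trimmed):

	/* Boot (late_initcall): set up an empty percpu allocator once. */
	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
	bpf_global_percpu_ma_set = !ret;

	/* Verification: when a program calls bpf_percpu_obj_new() for a
	 * type of ret_t->size bytes, fill only the matching bucket.
	 */
	mutex_lock(&bpf_percpu_ma_lock);
	err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
	mutex_unlock(&bpf_percpu_ma_lock);
	if (err)
		return err;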
On 12/15/2023 8:12 AM, Yonghong Song wrote:
> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
> added support for non-fix-size percpu memory allocation.

SNIP

> +static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};

Is it better to make it const?

> +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
> +{
> +	struct bpf_mem_caches __percpu *pcc;
> +
> +	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
> +	if (!pcc)
> +		return -ENOMEM;

__GFP_ZERO is not needed. __alloc_percpu_gfp() will zero the returned
area by default.

> +		if (cpu == 0) {
> +			err = check_obj_size(c, i);
> +			if (err) {
> +				bpf_mem_alloc_destroy_cache(c);

It seems drain_mem_cache() will be enough. Have you considered setting
low_watermark to 0 to prevent a potential refill in unit_alloc() if the
initialization of the current unit fails?

> +				goto out;
> +			}
> +		}
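Taken together, the two small cleanups suggested above might look like
the following sketch (an assumed shape for the next revision, not
posted code):

	static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256,
					      512, 1024, 2048, 4096};

	int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
	{
		struct bpf_mem_caches __percpu *pcc;

		/* __alloc_percpu_gfp() already returns zeroed memory, so
		 * __GFP_ZERO is dropped.
		 */
		pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL);
		if (!pcc)
			return -ENOMEM;

		ma->caches = pcc;
		ma->percpu = true;
		return 0;
	}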
On 12/14/23 7:19 PM, Hou Tao wrote:
>> +static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
> Is it better to make it const?

Right. We can make it const.

>> +	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
>> +	if (!pcc)
>> +		return -ENOMEM;
> __GFP_ZERO is not needed. __alloc_percpu_gfp() will zero the returned
> area by default.

Thanks. Checked the comments in __alloc_percpu_gfp() and indeed, the
returned buffer has been zeroed.

>> +			err = check_obj_size(c, i);
>> +			if (err) {
>> +				bpf_mem_alloc_destroy_cache(c);
> It seems drain_mem_cache() will be enough.

At the prefill stage, the following looks sufficient:
    free_all(__llist_del_all(&c->free_llist), percpu);
But I agree that drain_mem_cache() is simpler and easier for potential
future code changes.

> Have you considered setting low_watermark to 0 to prevent a potential
> refill in unit_alloc() if the initialization of the current unit fails?

I think it does make sense. For non-fix-size non-percpu prefill, if
check_obj_size() fails, the whole prefill fails, including all buckets.
In this case, if it fails for a particular bucket, we should make sure
that bucket always returns a NULL ptr, so setting low_watermark to 0
does make sense.
On 12/14/23 10:50 PM, Yonghong Song wrote:
>> Have you considered setting low_watermark to 0 to prevent a potential
>> refill in unit_alloc() if the initialization of the current unit fails?
>
> I think it does make sense. [...] so setting low_watermark to 0 does
> make sense.

Thinking again. If the initialization of the current unit fails, the
verification will fail and the corresponding bpf program will not be
able to do memory alloc, so we should be fine.

But it is totally possible that some prog later may call
bpf_mem_alloc_percpu_unit_init() again with the same size/bucket. So we
should simply reset the bpf_mem_cache to 0 in the previously failed
bpf_mem_alloc_percpu_unit_init() call. Is it possible that
check_obj_size() may initially return an error, but sometime later
something in the kernel changes and check_obj_size() with the same size
could succeed?
Hi,

On 12/15/2023 3:27 PM, Yonghong Song wrote:
> But it is totally possible that some prog later may call
> bpf_mem_alloc_percpu_unit_init() again with the same size/bucket. So we
> should simply reset the bpf_mem_cache to 0 in the previously failed
> bpf_mem_alloc_percpu_unit_init() call. Is it possible that
> check_obj_size() may initially return an error, but sometime later
> something in the kernel changes and check_obj_size() with the same size
> could succeed?

Resetting the bpf_mem_cache to 0 is much simpler and easier to
understand than resetting low_watermark to 0. For per-cpu allocation,
the return value of pcpu_alloc_size() is stable, and I don't think it
will change the way ksize() does, so it is not possible that a previous
check_obj_size() failed but a new check_obj_size() for the same
unit_size succeeds.
On 12/14/23 11:40 PM, Hou Tao wrote:
> Resetting the bpf_mem_cache to 0 is much simpler and easier to
> understand than resetting low_watermark to 0. For per-cpu allocation,
> the return value of pcpu_alloc_size() is stable, and I don't think it
> will change the way ksize() does, so it is not possible that a previous
> check_obj_size() failed but a new check_obj_size() for the same
> unit_size succeeds.

Thanks for the clarification. Let me just reset the bpf_mem_cache to 0
then.
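The agreed error path might then take roughly this shape inside the
for_each_possible_cpu() loop of bpf_mem_alloc_percpu_unit_init() (a
sketch assuming the next revision keeps the posted loop structure):

	if (cpu == 0) {
		err = check_obj_size(c, i);
		if (err) {
			/* Free the prefilled objects, then zero the cache
			 * so a later bpf_mem_alloc_percpu_unit_init() with
			 * the same size starts from a clean state.
			 */
			drain_mem_cache(c);
			memset(c, 0, sizeof(*c));
			goto out;
		}
	}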
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c87c608a3689..f1f16449fbc4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -60,7 +60,7 @@ extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
 extern struct kobject *btf_kobj;
 extern struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
-extern bool bpf_global_ma_set;
+extern bool bpf_global_ma_set, bpf_global_percpu_ma_set;
 
 typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64);
 typedef int (*bpf_iter_init_seq_priv_t)(void *private_data,
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index bb1223b21308..43e635c67150 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -21,8 +21,15 @@ struct bpf_mem_alloc {
  * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
  * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
  * the returned object is given by the size argument of bpf_mem_alloc().
+ * If percpu equals true, error will be returned in order to avoid
+ * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
+ * should be used to do on-demand per-cpu allocation for each size.
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
+/* Initialize a non-fix-size percpu memory allocator */
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
+/* The percpu allocation with a specific unit size. */
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c34513d645c4..4a9177770f93 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -64,8 +64,8 @@
 #define OFF	insn->off
 #define IMM	insn->imm
 
-struct bpf_mem_alloc bpf_global_ma;
-bool bpf_global_ma_set;
+struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
+bool bpf_global_ma_set, bpf_global_percpu_ma_set;
 
 /* No hurry in this branch
  *
@@ -2938,7 +2938,9 @@ static int __init bpf_global_ma_init(void)
 
 	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
 	bpf_global_ma_set = !ret;
-	return ret;
+	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
+	bpf_global_percpu_ma_set = !ret;
+	return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
 }
 late_initcall(bpf_global_ma_init);
 #endif
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 472158f1fb08..aea4cd07c7b6 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -121,6 +121,8 @@ struct bpf_mem_caches {
 	struct bpf_mem_cache cache[NUM_CACHES];
 };
 
+static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
-	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	int cpu, i, err, unit_size, percpu_size = 0;
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
 	struct obj_cgroup *objcg = NULL;
 
+	if (percpu && size == 0)
+		return -EINVAL;
+
 	/* room for llist_node and per-cpu pointer */
 	if (percpu)
 		percpu_size = LLIST_NODE_SZ + sizeof(void *);
@@ -625,6 +629,68 @@ static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c)
 	drain_mem_cache(c);
 }
 
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches __percpu *pcc;
+
+	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL | __GFP_ZERO);
+	if (!pcc)
+		return -ENOMEM;
+
+	ma->caches = pcc;
+	ma->percpu = true;
+	return 0;
+}
+
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
+{
+	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+	int cpu, i, err, unit_size, percpu_size = 0;
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct obj_cgroup *objcg = NULL;
+	struct bpf_mem_cache *c;
+
+	/* room for llist_node and per-cpu pointer */
+	percpu_size = LLIST_NODE_SZ + sizeof(void *);
+
+	i = bpf_mem_cache_idx(size);
+	if (i < 0)
+		return -EINVAL;
+
+	err = 0;
+	pcc = ma->caches;
+	unit_size = sizes[i];
+
+#ifdef CONFIG_MEMCG_KMEM
+	objcg = get_obj_cgroup_from_current();
+#endif
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		c = &cc->cache[i];
+		if (cpu == 0 && c->unit_size)
+			goto out;
+
+		c->unit_size = unit_size;
+		c->objcg = objcg;
+		c->percpu_size = percpu_size;
+		c->tgt = c;
+
+		init_refill_work(c);
+		prefill_mem_cache(c, cpu);
+
+		if (cpu == 0) {
+			err = check_obj_size(c, i);
+			if (err) {
+				bpf_mem_alloc_destroy_cache(c);
+				goto out;
+			}
+		}
+	}
+
+out:
+	return err;
+}
+
 static void check_mem_cache(struct bpf_mem_cache *c)
 {
 	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1863826a4ac3..ce62ee0cc8f6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -42,9 +42,6 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
 #undef BPF_LINK_TYPE
 };
 
-struct bpf_mem_alloc bpf_global_percpu_ma;
-static bool bpf_global_percpu_ma_set;
-
 /* bpf_check() is a static code analyzer that walks eBPF program
  * instruction by instruction and updates register/stack state.
  * All paths of conditional branches are analyzed until 'bpf_exit' insn.
@@ -12062,20 +12059,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
 		return -ENOMEM;
 
-	if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
-		if (!bpf_global_percpu_ma_set) {
-			mutex_lock(&bpf_percpu_ma_lock);
-			if (!bpf_global_percpu_ma_set) {
-				err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
-				if (!err)
-					bpf_global_percpu_ma_set = true;
-			}
-			mutex_unlock(&bpf_percpu_ma_lock);
-			if (err)
-				return err;
-		}
-	}
-
 	if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
 		verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
 		return -EINVAL;
@@ -12096,6 +12079,17 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		return -EINVAL;
 	}
 
+	if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
+		if (!bpf_global_percpu_ma_set)
+			return -ENOMEM;
+
+		mutex_lock(&bpf_percpu_ma_lock);
+		err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
+		mutex_unlock(&bpf_percpu_ma_lock);
+		if (err)
+			return err;
+	}
+
 	struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
 	if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
 		if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
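As background, the verifier path changed above is reached when a bpf
program uses the bpf_percpu_obj_new() kfunc wrapper. A selftest-style
sketch of such a trigger (the struct and section name are illustrative;
the macros are assumed to come from the selftests' bpf_experimental.h):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include "bpf_experimental.h"

	struct val_t {
		long cnt;
	};

	SEC("tc")
	int percpu_obj_new_demo(void *ctx)
	{
		struct val_t __percpu_kptr *p;

		/* Verifying this call now fills only the bucket matching
		 * sizeof(struct val_t), via bpf_mem_alloc_percpu_unit_init().
		 */
		p = bpf_percpu_obj_new(struct val_t);
		if (!p)
			return 0;
		bpf_percpu_obj_drop(p);
		return 0;
	}

	char _license[] SEC("license") = "GPL";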
Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation. Such allocation
allocates percpu memory for all buckets on all cpus, so the memory
consumption is quadratic in the number of cpus. For example, with 4 cpus
and a unit size of 16 bytes, each cpu holds 16 * 4 = 64 bytes and the
total is 64 * 4 = 256 bytes. With 8 cpus and the same unit size, each cpu
holds 16 * 8 = 128 bytes and the total is 128 * 8 = 1024 bytes. So if the
number of cpus doubles, the memory consumption becomes 4 times as large,
and for a system with a large number of cpus the memory consumption grows
quickly. For example, for a 4KB percpu allocation with 128 cpus, the total
memory consumption is 4KB * 128 * 128 = 64MB. Things become worse if the
number of cpus is bigger (e.g., 512, 1024, etc.)

In commit 41a5db8d8161, the non-fix-size percpu memory allocation is done
at boot time, so for a system with a large number of cpus the initial
percpu memory consumption is very visible. For example, for a 128-cpu
system, the total percpu memory allocation is at least
(16 + 32 + 64 + 96 + 128 + 192 + 256 + 512 + 1024 + 2048 + 4096)
* 128 * 128 = ~138MB,
which is pretty big. It will be even bigger for a larger number of cpus.

Note that the current prefill also allocates 4 entries if the unit size is
no more than 256 bytes. So on top of the 138MB memory consumption, this
adds roughly
3 * (16 + 32 + 64 + 96 + 128 + 192 + 256) * 128 * 128 = ~38MB.
The next patch will try to reduce this memory consumption.

Later on, commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init
stage") moved the non-fix-size percpu memory allocation to the bpf
verification stage. Once a particular bpf_percpu_obj_new() is called by a
bpf program, the memory allocator will try to fill in the cache with all
sizes, causing the same amount of percpu memory consumption as in the boot
stage.

To reduce the initial percpu memory consumption for non-fix-size percpu
memory allocation, instead of filling the cache with all supported
allocation sizes, this patch fills the cache only for the requested size.
As users typically will not use large percpu data structures, this can
save memory significantly. For example, for an allocation size of 64 bytes
with 128 cpus, the total percpu memory amount is 64 * 128 * 128 = 1MB,
much less than the previous 138MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h           |  2 +-
 include/linux/bpf_mem_alloc.h |  7 ++++
 kernel/bpf/core.c             |  8 +++--
 kernel/bpf/memalloc.c         | 68 ++++++++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c         | 28 ++++++---------
 5 files changed, 91 insertions(+), 22 deletions(-)
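The memory figures in the log can be re-derived with a small userspace
program (hypothetical, only reproducing the arithmetic above):

	#include <stdio.h>

	int main(void)
	{
		/* Bucket sizes used by the bpf memory allocator. */
		const unsigned int sizes[] = {16, 32, 64, 96, 128, 192, 256,
					      512, 1024, 2048, 4096};
		const unsigned long ncpus = 128;
		unsigned long all = 0;

		for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
			all += (unsigned long)sizes[i] * ncpus * ncpus;

		/* Filling every bucket on every cpu vs. only the 64-byte bucket. */
		printf("all buckets: ~%lu MB\n", all / 1000000);                   /* ~138 */
		printf("64-byte bucket: ~%lu MB\n", 64 * ncpus * ncpus / 1000000); /* ~1 */
		return 0;
	}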