Message ID | 20231216023015.3741144-1-yonghong.song@linux.dev
---|---
State | Superseded
Delegated to: | BPF
Series | bpf: Reduce memory usage for bpf_global_percpu_ma
Hi,

On 12/16/2023 10:30 AM, Yonghong Song wrote:
> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
> added support for non-fix-size percpu memory allocation.
> Such allocation will allocate percpu memory for all buckets on all
> cpus and the memory consumption is on the order of quadratic.
> For example, let us say, 4 cpus, unit size 16 bytes, so each
> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
> Then let us say, 8 cpus with the same unit size, each cpu
> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
> So if the number of cpus doubles, the memory consumption
> will be 4 times. So for a system with a large number of cpus, the
> memory consumption goes up quickly with quadratic order.
> For example, for a 4KB percpu allocation with 128 cpus, the total memory
> consumption will be 4KB * 128 * 128 = 64MB. Things will become
> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)

SNIP

> +__init int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
> +{
> +	struct bpf_mem_caches __percpu *pcc;
> +
> +	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL);
> +	if (!pcc)
> +		return -ENOMEM;
> +
> +	ma->caches = pcc;
> +	ma->percpu = true;
> +	return 0;
> +}
> +
> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
> +{
> +	int cpu, i, err = 0, unit_size, percpu_size;
> +	struct bpf_mem_caches *cc, __percpu *pcc;
> +	struct obj_cgroup *objcg;
> +	struct bpf_mem_cache *c;
> +
> +	i = bpf_mem_cache_idx(size);
> +	if (i < 0)
> +		return -EINVAL;
> +
> +	/* room for llist_node and per-cpu pointer */
> +	percpu_size = LLIST_NODE_SZ + sizeof(void *);
> +
> +	pcc = ma->caches;
> +	unit_size = sizes[i];
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +	objcg = get_obj_cgroup_from_current();
> +#endif

For bpf_global_percpu_ma, we also need to account the allocated memory
to the root memory cgroup just like bpf_global_ma does, don't we? So it
seems that we need to initialize c->objcg early in
bpf_mem_alloc_percpu_init().

> +	for_each_possible_cpu(cpu) {
> +		cc = per_cpu_ptr(pcc, cpu);
> +		c = &cc->cache[i];
> +		if (cpu == 0 && c->unit_size)
> +			goto out;
> +
> +		c->unit_size = unit_size;
> +		c->objcg = objcg;
> +		c->percpu_size = percpu_size;
> +		c->tgt = c;
> +
> +		init_refill_work(c);
> +		prefill_mem_cache(c, cpu);
> +
> +		if (cpu == 0) {
> +			err = check_obj_size(c, i);
> +			if (err) {
> +				drain_mem_cache(c);
> +				memset(c, 0, sizeof(*c));

I also forgot about c->objcg: objcg may be leaked if we do memset() here.

> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	return err;
> +}
> +
Hi Yonghong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yonghong-Song/bpf-Avoid-unnecessary-extra-percpu-memory-allocation/20231216-103310
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20231216023015.3741144-1-yonghong.song%40linux.dev
patch subject: [PATCH bpf-next v3 2/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
config: x86_64-randconfig-003-20231216 (https://download.01.org/0day-ci/archive/20231216/202312162351.UuoFmjJk-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312162351.UuoFmjJk-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312162351.UuoFmjJk-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/bpf/memalloc.c:665:14: warning: variable 'objcg' is uninitialized when used here [-Wuninitialized]
                   c->objcg = objcg;
                              ^~~~~
   kernel/bpf/memalloc.c:642:26: note: initialize the variable 'objcg' to silence this warning
           struct obj_cgroup *objcg;
                                    ^
                                     = NULL
   1 warning generated.


vim +/objcg +665 kernel/bpf/memalloc.c

   637	
   638	int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
   639	{
   640		int cpu, i, err = 0, unit_size, percpu_size;
   641		struct bpf_mem_caches *cc, __percpu *pcc;
   642		struct obj_cgroup *objcg;
   643		struct bpf_mem_cache *c;
   644	
   645		i = bpf_mem_cache_idx(size);
   646		if (i < 0)
   647			return -EINVAL;
   648	
   649		/* room for llist_node and per-cpu pointer */
   650		percpu_size = LLIST_NODE_SZ + sizeof(void *);
   651	
   652		pcc = ma->caches;
   653		unit_size = sizes[i];
   654	
   655	#ifdef CONFIG_MEMCG_KMEM
   656		objcg = get_obj_cgroup_from_current();
   657	#endif
   658		for_each_possible_cpu(cpu) {
   659			cc = per_cpu_ptr(pcc, cpu);
   660			c = &cc->cache[i];
   661			if (cpu == 0 && c->unit_size)
   662				goto out;
   663	
   664			c->unit_size = unit_size;
 > 665			c->objcg = objcg;
   666			c->percpu_size = percpu_size;
   667			c->tgt = c;
   668	
   669			init_refill_work(c);
   670			prefill_mem_cache(c, cpu);
   671	
   672			if (cpu == 0) {
   673				err = check_obj_size(c, i);
   674				if (err) {
   675					drain_mem_cache(c);
   676					memset(c, 0, sizeof(*c));
   677					goto out;
   678				}
   679			}
   680		}
   681	
   682	out:
   683		return err;
   684	}
   685	
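[Editor's note: the narrow fix the robot's diagnostic suggests is to give objcg a definite initial value when CONFIG_MEMCG_KMEM is off. A minimal sketch of that fix follows; the thread below settles on a different approach, taking the reference in bpf_mem_alloc_percpu_init() instead.]

	/* Sketch: start from NULL so the !CONFIG_MEMCG_KMEM build never
	 * reads an uninitialized value at the c->objcg = objcg assignment
	 * (line 665 above).
	 */
	struct obj_cgroup *objcg = NULL;

#ifdef CONFIG_MEMCG_KMEM
	objcg = get_obj_cgroup_from_current();
#endif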
On 12/15/23 7:12 PM, Hou Tao wrote:
> Hi,
>
> On 12/16/2023 10:30 AM, Yonghong Song wrote:
>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
>> added support for non-fix-size percpu memory allocation.

SNIP

>> +	pcc = ma->caches;
>> +	unit_size = sizes[i];
>> +
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	objcg = get_obj_cgroup_from_current();
>> +#endif
> For bpf_global_percpu_ma, we also need to account the allocated memory
> to the root memory cgroup just like bpf_global_ma does, don't we? So it
> seems that we need to initialize c->objcg early in
> bpf_mem_alloc_percpu_init().

Good point. Agree. The original behavior of percpu non-fix-size mem
allocation is to do get_obj_cgroup_from_current() at the init stage
and charge to the root memory cgroup, and we indeed should move
the above into bpf_mem_alloc_percpu_init().

>> +		if (cpu == 0) {
>> +			err = check_obj_size(c, i);
>> +			if (err) {
>> +				drain_mem_cache(c);
>> +				memset(c, 0, sizeof(*c));
> I also forgot about c->objcg: objcg may be leaked if we do memset() here.

The objcg gets a reference at the bpf_mem_alloc_init() stage
and is released at bpf_mem_alloc_destroy(). For bpf_global_ma,
if there is a failure, indeed bpf_mem_alloc_destroy() will be
called and the reference on c->objcg will be released.

So if we move get_obj_cgroup_from_current() to the
bpf_mem_alloc_percpu_init() stage, we should be okay here.

BTW, is check_obj_size() really necessary here? My answer is no
since, as you mentioned, the size->cache_index mapping is pretty stable,
so check_obj_size() should not return an error in such cases.
What do you think?
On 12/16/23 11:11 PM, Yonghong Song wrote:
>
> On 12/15/23 7:12 PM, Hou Tao wrote:
>> Hi,
>>
>> On 12/16/2023 10:30 AM, Yonghong Song wrote:
>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>> allocation")
>>> added support for non-fix-size percpu memory allocation.

SNIP

>> For bpf_global_percpu_ma, we also need to account the allocated memory
>> to the root memory cgroup just like bpf_global_ma does, don't we? So it
>> seems that we need to initialize c->objcg early in
>> bpf_mem_alloc_percpu_init().
>
> Good point. Agree.

SNIP

> So if we move get_obj_cgroup_from_current() to the
> bpf_mem_alloc_percpu_init() stage, we should be okay here.
>
> BTW, is check_obj_size() really necessary here? My answer is no
> since, as you mentioned, the size->cache_index mapping is pretty stable,
> so check_obj_size() should not return an error in such cases.
> What do you think?

How about the following change on top of this patch?

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 43e635c67150..d1403204379e 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -11,6 +11,7 @@ struct bpf_mem_caches;
 struct bpf_mem_alloc {
 	struct bpf_mem_caches __percpu *caches;
 	struct bpf_mem_cache __percpu *cache;
+	struct obj_cgroup *objcg;
 	bool percpu;
 	struct work_struct work;
 };
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 5cf2738c20a9..6486da4ba097 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -553,6 +553,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 	if (memcg_bpf_enabled())
 		objcg = get_obj_cgroup_from_current();
 #endif
+	ma->objcg = objcg;
+
 	for_each_possible_cpu(cpu) {
 		c = per_cpu_ptr(pc, cpu);
 		c->unit_size = unit_size;
@@ -573,6 +575,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 #ifdef CONFIG_MEMCG_KMEM
 	objcg = get_obj_cgroup_from_current();
 #endif
+	ma->objcg = objcg;
 	for_each_possible_cpu(cpu) {
 		cc = per_cpu_ptr(pcc, cpu);
 		for (i = 0; i < NUM_CACHES; i++) {
@@ -637,6 +640,12 @@ __init int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
 
 	ma->caches = pcc;
 	ma->percpu = true;
+
+#ifdef CONFIG_MEMCG_KMEM
+	ma->objcg = get_obj_cgroup_from_current();
+#else
+	ma->objcg = NULL;
+#endif
 	return 0;
 }
 
@@ -656,10 +665,8 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
 
 	pcc = ma->caches;
 	unit_size = sizes[i];
+	objcg = ma->objcg;
 
-#ifdef CONFIG_MEMCG_KMEM
-	objcg = get_obj_cgroup_from_current();
-#endif
 	for_each_possible_cpu(cpu) {
 		cc = per_cpu_ptr(pcc, cpu);
 		c = &cc->cache[i];
@@ -799,9 +806,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
-		/* objcg is the same across cpus */
-		if (c->objcg)
-			obj_cgroup_put(c->objcg);
+		if (ma->objcg)
+			obj_cgroup_put(ma->objcg);
 		destroy_mem_alloc(ma, rcu_in_progress);
 	}
 	if (ma->caches) {
@@ -817,8 +823,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
 		}
-		if (c->objcg)
-			obj_cgroup_put(c->objcg);
+		if (ma->objcg)
+			obj_cgroup_put(ma->objcg);
 		destroy_mem_alloc(ma, rcu_in_progress);
 	}
 }

I still think check_obj_size() for percpu allocation is not needed.
But I guess we can address that issue later on.
Hi,

On 12/18/2023 1:21 AM, Yonghong Song wrote:
>
> On 12/16/23 11:11 PM, Yonghong Song wrote:
>>
>> On 12/15/23 7:12 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/16/2023 10:30 AM, Yonghong Song wrote:
>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem
>>>> allocation")
>>>> added support for non-fix-size percpu memory allocation.

SNIP

>> BTW, is check_obj_size() really necessary here? My answer is no
>> since, as you mentioned, the size->cache_index mapping is pretty stable,
>> so check_obj_size() should not return an error in such cases.
>> What do you think?
>
> How about the following change on top of this patch?

I think the patch below is fine. Before the change, objcg was already in
effect a per-bpf_mem_alloc object, but the implementation didn't make that
explicit. The change below makes objcg explicitly per-bpf_mem_alloc.

> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> index 43e635c67150..d1403204379e 100644
> --- a/include/linux/bpf_mem_alloc.h
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -11,6 +11,7 @@ struct bpf_mem_caches;
>  struct bpf_mem_alloc {
>  	struct bpf_mem_caches __percpu *caches;
>  	struct bpf_mem_cache __percpu *cache;
> +	struct obj_cgroup *objcg;
>  	bool percpu;
>  	struct work_struct work;
>  };

SNIP

> I still think check_obj_size() for percpu allocation is not needed.
> But I guess we can address that issue later on.

You are right. check_obj_size() is not needed for per-cpu allocation, so
it is OK to just remove it. I also removed check_obj_size() for kmalloc
allocation in [1].

[1]: https://lore.kernel.org/bpf/20231216131052.27621-1-houtao@huaweicloud.com/
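[Editor's note: the objcg ownership rule the thread converges on can be summarized as a minimal sketch, assuming the follow-up diff above is applied; the call sequence below is illustrative, not a real call site.]

	struct bpf_mem_alloc ma = {};

	/* takes the obj_cgroup reference once and stores it in ma->objcg */
	bpf_mem_alloc_percpu_init(&ma);
	/* copies ma->objcg into each per-cpu c->objcg without taking a new
	 * reference, so memset() of a cache on the error path cannot leak it
	 */
	bpf_mem_alloc_percpu_unit_init(&ma, 64);
	/* puts ma->objcg exactly once */
	bpf_mem_alloc_destroy(&ma);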
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5e694934cf37..bd32274561e3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -61,7 +61,7 @@ extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
 extern struct kobject *btf_kobj;
 extern struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
-extern bool bpf_global_ma_set;
+extern bool bpf_global_ma_set, bpf_global_percpu_ma_set;
 
 typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64);
 typedef int (*bpf_iter_init_seq_priv_t)(void *private_data,
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index bb1223b21308..43e635c67150 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -21,8 +21,15 @@ struct bpf_mem_alloc {
  * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects.
  * Alloc and free are done with bpf_mem_{alloc,free}() and the size of
  * the returned object is given by the size argument of bpf_mem_alloc().
+ * If percpu equals true, error will be returned in order to avoid
+ * large memory consumption and the below bpf_mem_alloc_percpu_unit_init()
+ * should be used to do on-demand per-cpu allocation for each size.
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
+/* Initialize a non-fix-size percpu memory allocator */
+int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma);
+/* The percpu allocation with a specific unit size. */
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size);
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 
 /* kmalloc/kfree equivalent: */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5aa6863ac33b..bc93eb7e00c7 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -64,8 +64,8 @@
 #define OFF	insn->off
 #define IMM	insn->imm
 
-struct bpf_mem_alloc bpf_global_ma;
-bool bpf_global_ma_set;
+struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma;
+bool bpf_global_ma_set, bpf_global_percpu_ma_set;
 
 /* No hurry in this branch
  *
@@ -2963,7 +2963,9 @@ static int __init bpf_global_ma_init(void)
 
 	ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false);
 	bpf_global_ma_set = !ret;
-	return ret;
+	ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma);
+	bpf_global_percpu_ma_set = !ret;
+	return !bpf_global_ma_set || !bpf_global_percpu_ma_set;
 }
 late_initcall(bpf_global_ma_init);
 #endif
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 00e101c2a68b..30e347fccc6a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -121,6 +121,8 @@ struct bpf_mem_caches {
 	struct bpf_mem_cache cache[NUM_CACHES];
 };
 
+static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
  */
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 {
-	static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
 	int cpu, i, err, unit_size, percpu_size = 0;
 	struct bpf_mem_caches *cc, __percpu *pcc;
 	struct bpf_mem_cache *c, __percpu *pc;
 	struct obj_cgroup *objcg = NULL;
 
+	if (percpu && size == 0)
+		return -EINVAL;
+
 	/* room for llist_node and per-cpu pointer */
 	if (percpu)
 		percpu_size = LLIST_NODE_SZ + sizeof(void *);
@@ -618,6 +622,67 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	free_all(llist_del_all(&c->waiting_for_gp), percpu);
 }
 
+__init int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
+{
+	struct bpf_mem_caches __percpu *pcc;
+
+	pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL);
+	if (!pcc)
+		return -ENOMEM;
+
+	ma->caches = pcc;
+	ma->percpu = true;
+	return 0;
+}
+
+int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
+{
+	int cpu, i, err = 0, unit_size, percpu_size;
+	struct bpf_mem_caches *cc, __percpu *pcc;
+	struct obj_cgroup *objcg;
+	struct bpf_mem_cache *c;
+
+	i = bpf_mem_cache_idx(size);
+	if (i < 0)
+		return -EINVAL;
+
+	/* room for llist_node and per-cpu pointer */
+	percpu_size = LLIST_NODE_SZ + sizeof(void *);
+
+	pcc = ma->caches;
+	unit_size = sizes[i];
+
+#ifdef CONFIG_MEMCG_KMEM
+	objcg = get_obj_cgroup_from_current();
+#endif
+	for_each_possible_cpu(cpu) {
+		cc = per_cpu_ptr(pcc, cpu);
+		c = &cc->cache[i];
+		if (cpu == 0 && c->unit_size)
+			goto out;
+
+		c->unit_size = unit_size;
+		c->objcg = objcg;
+		c->percpu_size = percpu_size;
+		c->tgt = c;
+
+		init_refill_work(c);
+		prefill_mem_cache(c, cpu);
+
+		if (cpu == 0) {
+			err = check_obj_size(c, i);
+			if (err) {
+				drain_mem_cache(c);
+				memset(c, 0, sizeof(*c));
+				goto out;
+			}
+		}
+	}
+
+out:
+	return err;
+}
+
 static void check_mem_cache(struct bpf_mem_cache *c)
 {
 	WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1863826a4ac3..ce62ee0cc8f6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -42,9 +42,6 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
 #undef BPF_LINK_TYPE
 };
 
-struct bpf_mem_alloc bpf_global_percpu_ma;
-static bool bpf_global_percpu_ma_set;
-
 /* bpf_check() is a static code analyzer that walks eBPF program
  * instruction by instruction and updates register/stack state.
  * All paths of conditional branches are analyzed until 'bpf_exit' insn.
@@ -12062,20 +12059,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set)
 			return -ENOMEM;
 
-		if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
-			if (!bpf_global_percpu_ma_set) {
-				mutex_lock(&bpf_percpu_ma_lock);
-				if (!bpf_global_percpu_ma_set) {
-					err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true);
-					if (!err)
-						bpf_global_percpu_ma_set = true;
-				}
-				mutex_unlock(&bpf_percpu_ma_lock);
-				if (err)
-					return err;
-			}
-		}
-
 		if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
 			verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
 			return -EINVAL;
@@ -12096,6 +12079,17 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			return -EINVAL;
 		}
 
+		if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
+			if (!bpf_global_percpu_ma_set)
+				return -ENOMEM;
+
+			mutex_lock(&bpf_percpu_ma_lock);
+			err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size);
+			mutex_unlock(&bpf_percpu_ma_lock);
+			if (err)
+				return err;
+		}
+
 		struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id);
 		if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
 			if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
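[Editor's note: for context on when the new on-demand path runs, the verifier now calls bpf_mem_alloc_percpu_unit_init() the first time it checks a bpf_percpu_obj_new() call for a given BTF type. Below is a hypothetical selftest-style program; the struct, section name, and bpf_experimental.h wiring are illustrative.]

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

/* 64-byte object: with this patch, loading the program prefills only the
 * 64-byte bucket (64 * ncpus * ncpus bytes in total) instead of all
 * NUM_CACHES buckets.
 */
struct val {
	u64 data[8];
};

SEC("tc")
int percpu_unit_prefill(struct __sk_buff *skb)
{
	struct val __percpu_kptr *v;

	/* verification of this call triggers
	 * bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma,
	 * sizeof(struct val))
	 */
	v = bpf_percpu_obj_new(struct val);
	if (!v)
		return 0;
	bpf_percpu_obj_drop(v);
	return 0;
}

char _license[] SEC("license") = "GPL";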
Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation. Such allocation
will allocate percpu memory for all buckets on all cpus, and the memory
consumption grows quadratically with the number of cpus. For example, with
4 cpus and a unit size of 16 bytes, each cpu holds 16 * 4 = 64 bytes, so
the total is 64 * 4 = 256 bytes. With 8 cpus and the same unit size, each
cpu holds 16 * 8 = 128 bytes, so the total is 128 * 8 = 1024 bytes. If the
number of cpus doubles, the memory consumption quadruples, so for a system
with a large number of cpus the memory consumption goes up quickly. For
example, for a 4KB percpu allocation with 128 cpus, the total memory
consumption will be 4KB * 128 * 128 = 64MB. Things become worse if the
number of cpus is bigger (e.g., 512, 1024, etc.)

In commit 41a5db8d8161, the non-fix-size percpu memory allocation was done
at boot time, so for a system with a large number of cpus the initial
percpu memory consumption is very visible. For example, for a 128 cpu
system, the total percpu memory allocation is at least
(16 + 32 + 64 + 96 + 128 + 192 + 256 + 512 + 1024 + 2048 + 4096)
* 128 * 128 = ~138MB, which is pretty big. It will be even bigger for a
larger number of cpus. Note that the current prefill also allocates 4
entries if the unit size is less than 256, so on top of the 138MB memory
consumption, this adds roughly
3 * (16 + 32 + 64 + 96 + 128 + 192 + 256) * 128 * 128 = ~38MB more.
The next patch will try to reduce this memory consumption.

Later, commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init
stage") moved the non-fix-size percpu memory allocation to the bpf
verification stage. Once a particular bpf_percpu_obj_new() is called by a
bpf program, the memory allocator will try to fill in the cache with all
sizes, causing the same amount of percpu memory consumption as in the boot
stage.

To reduce the initial percpu memory consumption for non-fix-size percpu
memory allocation, instead of filling the cache with all supported
allocation sizes, this patch intends to fill the cache only for the
requested size. Since users typically will not use large percpu data
structures, this can save memory significantly. For example, if the
allocation size is 64 bytes with 128 cpus, the total percpu memory amount
will be 64 * 128 * 128 = 1MB, much less than the previous 138MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h           |  2 +-
 include/linux/bpf_mem_alloc.h |  7 ++++
 kernel/bpf/core.c             |  8 +++--
 kernel/bpf/memalloc.c         | 67 ++++++++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c         | 28 ++++++---------
 5 files changed, 90 insertions(+), 22 deletions(-)
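[Editor's note: a quick back-of-the-envelope check of the numbers above, as plain userspace C; the sizes table mirrors the one added to memalloc.c, and MB here means 10^6 bytes, matching the commit message.]

#include <stdio.h>

int main(void)
{
	const long sizes[] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096};
	const long ncpus = 128;
	long all = 0, one = 64;
	int i;

	/* every bucket prefilled on every cpu; each unit is itself a percpu
	 * allocation, hence the second factor of ncpus
	 */
	for (i = 0; i < 11; i++)
		all += sizes[i] * ncpus * ncpus;

	printf("all buckets:    %ld bytes (~%ldMB)\n", all, all / 1000000);
	printf("64-byte bucket: %ld bytes (~%ldMB)\n",
	       one * ncpus * ncpus, one * ncpus * ncpus / 1000000);
	return 0;
}

This prints ~138MB for the prefill-everything behavior and ~1MB for the
single 64-byte bucket, matching the figures in the commit message.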