From patchwork Mon Dec 18 06:30:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496250 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 66-220-155-179.mail-mxout.facebook.com (66-220-155-179.mail-mxout.facebook.com [66.220.155.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 872746FB2 for ; Mon, 18 Dec 2023 06:30:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id A409C2BB1F6CD; Sun, 17 Dec 2023 22:30:36 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau , Hou Tao Subject: [PATCH bpf-next v4 1/7] bpf: Avoid unnecessary extra percpu memory allocation Date: Sun, 17 Dec 2023 22:30:36 -0800 Message-Id: <20231218063036.3039329-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net Currently, for percpu memory allocation, say if the user requests allocation size to be 32 bytes, the actually calculated size will be 40 bytes and it further rounds to 64 bytes, and eventually 64 bytes are allocated, wasting 32-byte memory. Change bpf_mem_alloc() to calculate the cache index based on the user-provided allocation size so unnecessary extra memory can be avoided. Suggested-by: Hou Tao Acked-by: Hou Tao Signed-off-by: Yonghong Song --- kernel/bpf/memalloc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 6a51cfe4c2d6..00e101c2a68b 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -871,7 +871,9 @@ void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size) if (!size) return ZERO_SIZE_PTR; - idx = bpf_mem_cache_idx(size + LLIST_NODE_SZ); + if (!ma->percpu) + size += LLIST_NODE_SZ; + idx = bpf_mem_cache_idx(size); if (idx < 0) return NULL; From patchwork Mon Dec 18 06:30:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496249 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 66-220-155-179.mail-mxout.facebook.com (66-220-155-179.mail-mxout.facebook.com [66.220.155.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F2CDC6FA3 for ; Mon, 18 Dec 2023 06:30:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id 1CA462BB1F6E3; Sun, 17 Dec 2023 22:30:41 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau Subject: [PATCH bpf-next v4 2/7] bpf: Add objcg to bpf_mem_alloc Date: Sun, 17 Dec 2023 22:30:41 -0800 Message-Id: <20231218063041.3039560-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's are with the same objcg. This patch made such a property explicit. The next patch will use this property to save and restore objcg for percpu unit allocator. Signed-off-by: Yonghong Song Acked-by: Hou Tao --- include/linux/bpf_mem_alloc.h | 1 + kernel/bpf/memalloc.c | 11 ++++++----- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index bb1223b21308..acef8c808599 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -11,6 +11,7 @@ struct bpf_mem_caches; struct bpf_mem_alloc { struct bpf_mem_caches __percpu *caches; struct bpf_mem_cache __percpu *cache; + struct obj_cgroup *objcg; bool percpu; struct work_struct work; }; diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 00e101c2a68b..dfde9d9a3e1d 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -544,6 +544,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) if (memcg_bpf_enabled()) objcg = get_obj_cgroup_from_current(); #endif + ma->objcg = objcg; for_each_possible_cpu(cpu) { c = per_cpu_ptr(pc, cpu); c->unit_size = unit_size; @@ -564,6 +565,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) #ifdef CONFIG_MEMCG_KMEM objcg = get_obj_cgroup_from_current(); #endif + ma->objcg = objcg; for_each_possible_cpu(cpu) { cc = per_cpu_ptr(pcc, cpu); for (i = 0; i < NUM_CACHES; i++) { @@ -729,9 +731,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } - /* objcg is the same across cpus */ - if (c->objcg) - obj_cgroup_put(c->objcg); + if (ma->objcg) + obj_cgroup_put(ma->objcg); destroy_mem_alloc(ma, rcu_in_progress); } if (ma->caches) { @@ -747,8 +748,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } } - if (c->objcg) - obj_cgroup_put(c->objcg); + if (ma->objcg) + obj_cgroup_put(ma->objcg); destroy_mem_alloc(ma, rcu_in_progress); } } From patchwork Mon Dec 18 06:30:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496251 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 69-171-232-180.mail-mxout.facebook.com (69-171-232-180.mail-mxout.facebook.com [69.171.232.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6A3115690 for ; Mon, 18 Dec 2023 06:30:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id 431212BB1F716; Sun, 17 Dec 2023 22:30:47 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau Subject: [PATCH bpf-next v4 3/7] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Date: Sun, 17 Dec 2023 22:30:47 -0800 Message-Id: <20231218063047.3040611-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") added support for non-fix-size percpu memory allocation. Such allocation will allocate percpu memory for all buckets on all cpus and the memory consumption is in the order to quadratic. For example, let us say, 4 cpus, unit size 16 bytes, so each cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. Then let us say, 8 cpus with the same unit size, each cpu has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. So if the number of cpus doubles, the number of memory consumption will be 4 times. So for a system with large number of cpus, the memory consumption goes up quickly with quadratic order. For example, for 4KB percpu allocation, 128 cpus. The total memory consumption will 4KB * 128 * 128 = 64MB. Things will become worse if the number of cpus is bigger (e.g., 512, 1024, etc.) In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is done in boot time, so for system with large number of cpus, the initial percpu memory consumption is very visible. For example, for 128 cpu system, the total percpu memory allocation will be at least (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) * 128 * 128 = ~138MB. which is pretty big. It will be even bigger for larger number of cpus. Note that the current prefill also allocates 4 entries if the unit size is less than 256. So on top of 138MB memory consumption, this will add more consumption with 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB. Next patch will try to reduce this memory consumption. Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init stage") moved the non-fix-size percpu memory allocation to bpf verificaiton stage. Once a particular bpf_percpu_obj_new() is called by bpf program, the memory allocator will try to fill in the cache with all sizes, causing the same amount of percpu memory consumption as in the boot stage. To reduce the initial percpu memory consumption for non-fix-size percpu memory allocation, instead of filling the cache with all supported allocation sizes, this patch intends to fill the cache only for the requested size. As typically users will not use large percpu data structure, this can save memory significantly. For example, the allocation size is 64 bytes with 128 cpus. Then total percpu memory amount will be 64 * 128 * 128 = 1MB, much less than previous 138MB. Signed-off-by: Yonghong Song Acked-by: Hou Tao --- include/linux/bpf.h | 2 +- include/linux/bpf_mem_alloc.h | 7 ++++ kernel/bpf/core.c | 8 +++-- kernel/bpf/memalloc.c | 62 ++++++++++++++++++++++++++++++++++- kernel/bpf/verifier.c | 28 +++++++--------- 5 files changed, 85 insertions(+), 22 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5e694934cf37..bd32274561e3 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -61,7 +61,7 @@ extern struct idr btf_idr; extern spinlock_t btf_idr_lock; extern struct kobject *btf_kobj; extern struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma; -extern bool bpf_global_ma_set; +extern bool bpf_global_ma_set, bpf_global_percpu_ma_set; typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64); typedef int (*bpf_iter_init_seq_priv_t)(void *private_data, diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index acef8c808599..d1403204379e 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -22,8 +22,15 @@ struct bpf_mem_alloc { * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects. * Alloc and free are done with bpf_mem_{alloc,free}() and the size of * the returned object is given by the size argument of bpf_mem_alloc(). + * If percpu equals true, error will be returned in order to avoid + * large memory consumption and the below bpf_mem_alloc_percpu_unit_init() + * should be used to do on-demand per-cpu allocation for each size. */ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu); +/* Initialize a non-fix-size percpu memory allocator */ +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma); +/* The percpu allocation with a specific unit size. */ +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); /* kmalloc/kfree equivalent: */ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 5aa6863ac33b..bc93eb7e00c7 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -64,8 +64,8 @@ #define OFF insn->off #define IMM insn->imm -struct bpf_mem_alloc bpf_global_ma; -bool bpf_global_ma_set; +struct bpf_mem_alloc bpf_global_ma, bpf_global_percpu_ma; +bool bpf_global_ma_set, bpf_global_percpu_ma_set; /* No hurry in this branch * @@ -2963,7 +2963,9 @@ static int __init bpf_global_ma_init(void) ret = bpf_mem_alloc_init(&bpf_global_ma, 0, false); bpf_global_ma_set = !ret; - return ret; + ret = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); + bpf_global_percpu_ma_set = !ret; + return !bpf_global_ma_set || !bpf_global_percpu_ma_set; } late_initcall(bpf_global_ma_init); #endif diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index dfde9d9a3e1d..50ab2fecc005 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -121,6 +121,8 @@ struct bpf_mem_caches { struct bpf_mem_cache cache[NUM_CACHES]; }; +static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; + static struct llist_node notrace *__llist_del_first(struct llist_head *head) { struct llist_node *entry, *next; @@ -520,12 +522,14 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx) */ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) { - static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; int cpu, i, err, unit_size, percpu_size = 0; struct bpf_mem_caches *cc, __percpu *pcc; struct bpf_mem_cache *c, __percpu *pc; struct obj_cgroup *objcg = NULL; + if (percpu && size == 0) + return -EINVAL; + /* room for llist_node and per-cpu pointer */ if (percpu) percpu_size = LLIST_NODE_SZ + sizeof(void *); @@ -545,6 +549,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) objcg = get_obj_cgroup_from_current(); #endif ma->objcg = objcg; + for_each_possible_cpu(cpu) { c = per_cpu_ptr(pc, cpu); c->unit_size = unit_size; @@ -600,6 +605,61 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) return err; } +__init int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma) +{ + struct bpf_mem_caches __percpu *pcc; + + pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL); + if (!pcc) + return -ENOMEM; + + ma->caches = pcc; + ma->percpu = true; + +#ifdef CONFIG_MEMCG_KMEM + ma->objcg = get_obj_cgroup_from_current(); +#else + ma->objcg = NULL; +#endif + return 0; +} + +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) +{ + struct bpf_mem_caches *cc, __percpu *pcc; + int cpu, i, unit_size, percpu_size; + struct obj_cgroup *objcg; + struct bpf_mem_cache *c; + + i = bpf_mem_cache_idx(size); + if (i < 0) + return -EINVAL; + + /* room for llist_node and per-cpu pointer */ + percpu_size = LLIST_NODE_SZ + sizeof(void *); + + unit_size = sizes[i]; + objcg = ma->objcg; + pcc = ma->caches; + + for_each_possible_cpu(cpu) { + cc = per_cpu_ptr(pcc, cpu); + c = &cc->cache[i]; + if (cpu == 0 && c->unit_size) + break; + + c->unit_size = unit_size; + c->objcg = objcg; + c->percpu_size = percpu_size; + c->tgt = c; + + init_refill_work(c); + prefill_mem_cache(c, cpu); + } + + return 0; +} + static void drain_mem_cache(struct bpf_mem_cache *c) { bool percpu = !!c->percpu_size; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1863826a4ac3..ce62ee0cc8f6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -42,9 +42,6 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = { #undef BPF_LINK_TYPE }; -struct bpf_mem_alloc bpf_global_percpu_ma; -static bool bpf_global_percpu_ma_set; - /* bpf_check() is a static code analyzer that walks eBPF program * instruction by instruction and updates register/stack state. * All paths of conditional branches are analyzed until 'bpf_exit' insn. @@ -12062,20 +12059,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) return -ENOMEM; - if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { - if (!bpf_global_percpu_ma_set) { - mutex_lock(&bpf_percpu_ma_lock); - if (!bpf_global_percpu_ma_set) { - err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); - if (!err) - bpf_global_percpu_ma_set = true; - } - mutex_unlock(&bpf_percpu_ma_lock); - if (err) - return err; - } - } - if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) { verbose(env, "local type ID argument must be in range [0, U32_MAX]\n"); return -EINVAL; @@ -12096,6 +12079,17 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, return -EINVAL; } + if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { + if (!bpf_global_percpu_ma_set) + return -ENOMEM; + + mutex_lock(&bpf_percpu_ma_lock); + err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); + mutex_unlock(&bpf_percpu_ma_lock); + if (err) + return err; + } + struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id); if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) { From patchwork Mon Dec 18 06:30:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496252 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 69-171-232-180.mail-mxout.facebook.com (69-171-232-180.mail-mxout.facebook.com [69.171.232.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6A2E93C16 for ; Mon, 18 Dec 2023 06:30:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id 6B7FB2BB1F72B; Sun, 17 Dec 2023 22:30:52 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau , Hou Tao Subject: [PATCH bpf-next v4 4/7] bpf: Refill only one percpu element in memalloc Date: Sun, 17 Dec 2023 22:30:52 -0800 Message-Id: <20231218063052.3040932-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net Typically for percpu map element or data structure, once allocated, most operations are lookup or in-place update. Deletion are really rare. Currently, for percpu data strcture, 4 elements will be refilled if the size is <= 256. Let us just do with one element for percpu data. For example, for size 256 and 128 cpus, the potential saving will be 3 * 256 * 128 * 128 = 12MB. Acked-by: Hou Tao Signed-off-by: Yonghong Song --- kernel/bpf/memalloc.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 50ab2fecc005..f37998662146 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -485,11 +485,16 @@ static void init_refill_work(struct bpf_mem_cache *c) static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) { - /* To avoid consuming memory assume that 1st run of bpf - * prog won't be doing more than 4 map_update_elem from - * irq disabled region + int cnt = 1; + + /* To avoid consuming memory, for non-percpu allocation, assume that + * 1st run of bpf prog won't be doing more than 4 map_update_elem from + * irq disabled region if unit size is less than or equal to 256. + * For all other cases, let us just do one allocation. */ - alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false); + if (!c->percpu_size && c->unit_size <= 256) + cnt = 4; + alloc_bulk(c, cnt, cpu_to_node(cpu), false); } static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx) From patchwork Mon Dec 18 06:30:57 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496253 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 66-220-155-178.mail-mxout.facebook.com (66-220-155-178.mail-mxout.facebook.com [66.220.155.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E2A906FBD for ; Mon, 18 Dec 2023 06:31:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id 910C92BB1F751; Sun, 17 Dec 2023 22:30:57 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau , Hou Tao Subject: [PATCH bpf-next v4 5/7] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Date: Sun, 17 Dec 2023 22:30:57 -0800 Message-Id: <20231218063057.3041405-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net For percpu data structure allocation with bpf_global_percpu_ma, the maximum data size is 4K. But for a system with large number of cpus, bigger data size (e.g., 2K, 4K) might consume a lot of memory. For example, the percpu memory consumption with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB memory. We should discourage such usage. Let us limit the maximum data size to be 512 for bpf_global_percpu_ma allocation. Acked-by: Hou Tao Signed-off-by: Yonghong Song --- kernel/bpf/verifier.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ce62ee0cc8f6..039d699a425d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -192,6 +192,8 @@ struct bpf_verifier_stack_elem { POISON_POINTER_DELTA)) #define BPF_MAP_PTR(X) ((struct bpf_map *)((X) & ~BPF_MAP_PTR_UNPRIV)) +#define BPF_GLOBAL_PERCPU_MA_MAX_SIZE 512 + static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx); static int release_reference(struct bpf_verifier_env *env, int ref_obj_id); static void invalidate_non_owning_refs(struct bpf_verifier_env *env); @@ -12083,6 +12085,12 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (!bpf_global_percpu_ma_set) return -ENOMEM; + if (ret_t->size > BPF_GLOBAL_PERCPU_MA_MAX_SIZE) { + verbose(env, "bpf_percpu_obj_new type size (%d) is greater than %d\n", + ret_t->size, BPF_GLOBAL_PERCPU_MA_MAX_SIZE); + return -EINVAL; + } + mutex_lock(&bpf_percpu_ma_lock); err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); mutex_unlock(&bpf_percpu_ma_lock); From patchwork Mon Dec 18 06:31:02 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496254 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 66-220-155-178.mail-mxout.facebook.com (66-220-155-178.mail-mxout.facebook.com [66.220.155.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A53286FC7 for ; Mon, 18 Dec 2023 06:31:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id B94332BB1F774; Sun, 17 Dec 2023 22:31:02 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau , Hou Tao Subject: [PATCH bpf-next v4 6/7] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Date: Sun, 17 Dec 2023 22:31:02 -0800 Message-Id: <20231218063102.3042299-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net In the previous patch, the maximum data size for bpf_global_percpu_ma is 512 bytes. This breaks selftest test_bpf_ma. The test is adjusted in two aspects: - Since the maximum allowed data size for bpf_global_percpu_ma is 512, remove all tests beyond that, names sizes 1024, 2048 and 4096. - Previously the percpu data size is bucket_size - 8 in order to avoid percpu allocation into the next bucket. This patch removed such data size adjustment thanks to Patch 1. Also, a better way to generate BTF type is used than adding a member to the value struct. Acked-by: Hou Tao Signed-off-by: Yonghong Song --- .../selftests/bpf/prog_tests/test_bpf_ma.c | 20 +++++++---- .../testing/selftests/bpf/progs/test_bpf_ma.c | 34 +++++++++---------- 2 files changed, 30 insertions(+), 24 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c b/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c index d3491a84b3b9..ccae0b31ac6c 100644 --- a/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c +++ b/tools/testing/selftests/bpf/prog_tests/test_bpf_ma.c @@ -14,7 +14,8 @@ static void do_bpf_ma_test(const char *name) struct test_bpf_ma *skel; struct bpf_program *prog; struct btf *btf; - int i, err; + int i, err, id; + char tname[32]; skel = test_bpf_ma__open(); if (!ASSERT_OK_PTR(skel, "open")) @@ -25,16 +26,21 @@ static void do_bpf_ma_test(const char *name) goto out; for (i = 0; i < ARRAY_SIZE(skel->rodata->data_sizes); i++) { - char name[32]; - int id; - - snprintf(name, sizeof(name), "bin_data_%u", skel->rodata->data_sizes[i]); - id = btf__find_by_name_kind(btf, name, BTF_KIND_STRUCT); - if (!ASSERT_GT(id, 0, "bin_data")) + snprintf(tname, sizeof(tname), "bin_data_%u", skel->rodata->data_sizes[i]); + id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT); + if (!ASSERT_GT(id, 0, tname)) goto out; skel->rodata->data_btf_ids[i] = id; } + for (i = 0; i < ARRAY_SIZE(skel->rodata->percpu_data_sizes); i++) { + snprintf(tname, sizeof(tname), "percpu_bin_data_%u", skel->rodata->percpu_data_sizes[i]); + id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT); + if (!ASSERT_GT(id, 0, tname)) + goto out; + skel->rodata->percpu_data_btf_ids[i] = id; + } + prog = bpf_object__find_program_by_name(skel->obj, name); if (!ASSERT_OK_PTR(prog, "invalid prog name")) goto out; diff --git a/tools/testing/selftests/bpf/progs/test_bpf_ma.c b/tools/testing/selftests/bpf/progs/test_bpf_ma.c index b685a4aba6bd..e453a9392e5a 100644 --- a/tools/testing/selftests/bpf/progs/test_bpf_ma.c +++ b/tools/testing/selftests/bpf/progs/test_bpf_ma.c @@ -20,6 +20,9 @@ char _license[] SEC("license") = "GPL"; const unsigned int data_sizes[] = {8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096}; const volatile unsigned int data_btf_ids[ARRAY_SIZE(data_sizes)] = {}; +const unsigned int percpu_data_sizes[] = {8, 16, 32, 64, 96, 128, 192, 256, 512}; +const volatile unsigned int percpu_data_btf_ids[ARRAY_SIZE(data_sizes)] = {}; + int err = 0; int pid = 0; @@ -27,10 +30,10 @@ int pid = 0; struct bin_data_##_size { \ char data[_size - sizeof(void *)]; \ }; \ + /* See Commit 5d8d6634ccc, force btf generation for type bin_data_##_size */ \ + struct bin_data_##_size *__bin_data_##_size; \ struct map_value_##_size { \ struct bin_data_##_size __kptr * data; \ - /* To emit BTF info for bin_data_xx */ \ - struct bin_data_##_size not_used; \ }; \ struct { \ __uint(type, BPF_MAP_TYPE_ARRAY); \ @@ -40,8 +43,12 @@ int pid = 0; } array_##_size SEC(".maps") #define DEFINE_ARRAY_WITH_PERCPU_KPTR(_size) \ + struct percpu_bin_data_##_size { \ + char data[_size]; \ + }; \ + struct percpu_bin_data_##_size *__percpu_bin_data_##_size; \ struct map_value_percpu_##_size { \ - struct bin_data_##_size __percpu_kptr * data; \ + struct percpu_bin_data_##_size __percpu_kptr * data; \ }; \ struct { \ __uint(type, BPF_MAP_TYPE_ARRAY); \ @@ -114,7 +121,7 @@ static __always_inline void batch_percpu_alloc(struct bpf_map *map, unsigned int return; } /* per-cpu allocator may not be able to refill in time */ - new = bpf_percpu_obj_new_impl(data_btf_ids[idx], NULL); + new = bpf_percpu_obj_new_impl(percpu_data_btf_ids[idx], NULL); if (!new) continue; @@ -179,7 +186,7 @@ DEFINE_ARRAY_WITH_KPTR(1024); DEFINE_ARRAY_WITH_KPTR(2048); DEFINE_ARRAY_WITH_KPTR(4096); -/* per-cpu kptr doesn't support bin_data_8 which is a zero-sized array */ +DEFINE_ARRAY_WITH_PERCPU_KPTR(8); DEFINE_ARRAY_WITH_PERCPU_KPTR(16); DEFINE_ARRAY_WITH_PERCPU_KPTR(32); DEFINE_ARRAY_WITH_PERCPU_KPTR(64); @@ -188,9 +195,6 @@ DEFINE_ARRAY_WITH_PERCPU_KPTR(128); DEFINE_ARRAY_WITH_PERCPU_KPTR(192); DEFINE_ARRAY_WITH_PERCPU_KPTR(256); DEFINE_ARRAY_WITH_PERCPU_KPTR(512); -DEFINE_ARRAY_WITH_PERCPU_KPTR(1024); -DEFINE_ARRAY_WITH_PERCPU_KPTR(2048); -DEFINE_ARRAY_WITH_PERCPU_KPTR(4096); SEC("?fentry/" SYS_PREFIX "sys_nanosleep") int test_batch_alloc_free(void *ctx) @@ -248,9 +252,10 @@ int test_batch_percpu_alloc_free(void *ctx) if ((u32)bpf_get_current_pid_tgid() != pid) return 0; - /* Alloc 128 16-bytes per-cpu objects in batch to trigger refilling, - * then free 128 16-bytes per-cpu objects in batch to trigger freeing. + /* Alloc 128 8-bytes per-cpu objects in batch to trigger refilling, + * then free 128 8-bytes per-cpu objects in batch to trigger freeing. */ + CALL_BATCH_PERCPU_ALLOC_FREE(8, 128, 0); CALL_BATCH_PERCPU_ALLOC_FREE(16, 128, 1); CALL_BATCH_PERCPU_ALLOC_FREE(32, 128, 2); CALL_BATCH_PERCPU_ALLOC_FREE(64, 128, 3); @@ -259,9 +264,6 @@ int test_batch_percpu_alloc_free(void *ctx) CALL_BATCH_PERCPU_ALLOC_FREE(192, 128, 6); CALL_BATCH_PERCPU_ALLOC_FREE(256, 128, 7); CALL_BATCH_PERCPU_ALLOC_FREE(512, 64, 8); - CALL_BATCH_PERCPU_ALLOC_FREE(1024, 32, 9); - CALL_BATCH_PERCPU_ALLOC_FREE(2048, 16, 10); - CALL_BATCH_PERCPU_ALLOC_FREE(4096, 8, 11); return 0; } @@ -272,9 +274,10 @@ int test_percpu_free_through_map_free(void *ctx) if ((u32)bpf_get_current_pid_tgid() != pid) return 0; - /* Alloc 128 16-bytes per-cpu objects in batch to trigger refilling, + /* Alloc 128 8-bytes per-cpu objects in batch to trigger refilling, * then free these object through map free. */ + CALL_BATCH_PERCPU_ALLOC(8, 128, 0); CALL_BATCH_PERCPU_ALLOC(16, 128, 1); CALL_BATCH_PERCPU_ALLOC(32, 128, 2); CALL_BATCH_PERCPU_ALLOC(64, 128, 3); @@ -283,9 +286,6 @@ int test_percpu_free_through_map_free(void *ctx) CALL_BATCH_PERCPU_ALLOC(192, 128, 6); CALL_BATCH_PERCPU_ALLOC(256, 128, 7); CALL_BATCH_PERCPU_ALLOC(512, 64, 8); - CALL_BATCH_PERCPU_ALLOC(1024, 32, 9); - CALL_BATCH_PERCPU_ALLOC(2048, 16, 10); - CALL_BATCH_PERCPU_ALLOC(4096, 8, 11); return 0; } From patchwork Mon Dec 18 06:31:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yonghong Song X-Patchwork-Id: 13496255 X-Patchwork-Delegate: bpf@iogearbox.net Received: from 66-220-155-179.mail-mxout.facebook.com (66-220-155-179.mail-mxout.facebook.com [66.220.155.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F40263C16 for ; Mon, 18 Dec 2023 06:31:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=linux.dev Received: by devbig309.ftw3.facebook.com (Postfix, from userid 128203) id DFE912BB1F7AA; Sun, 17 Dec 2023 22:31:07 -0800 (PST) From: Yonghong Song To: bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau , Hou Tao Subject: [PATCH bpf-next v4 7/7] selftests/bpf: Add a selftest with > 512-byte percpu allocation size Date: Sun, 17 Dec 2023 22:31:07 -0800 Message-Id: <20231218063107.3043154-1-yonghong.song@linux.dev> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231218063031.3037929-1-yonghong.song@linux.dev> References: <20231218063031.3037929-1-yonghong.song@linux.dev> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: bpf@iogearbox.net Add a selftest to capture the verification failure when the allocation size is greater than 512. Acked-by: Hou Tao Signed-off-by: Yonghong Song --- .../selftests/bpf/progs/percpu_alloc_fail.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c index 1a891d30f1fe..f2b8eb2ff76f 100644 --- a/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c +++ b/tools/testing/selftests/bpf/progs/percpu_alloc_fail.c @@ -17,6 +17,10 @@ struct val_with_rb_root_t { struct bpf_spin_lock lock; }; +struct val_600b_t { + char b[600]; +}; + struct elem { long sum; struct val_t __percpu_kptr *pc; @@ -161,4 +165,18 @@ int BPF_PROG(test_array_map_7) return 0; } +SEC("?fentry.s/bpf_fentry_test1") +__failure __msg("bpf_percpu_obj_new type size (600) is greater than 512") +int BPF_PROG(test_array_map_8) +{ + struct val_600b_t __percpu_kptr *p; + + p = bpf_percpu_obj_new(struct val_600b_t); + if (!p) + return 0; + + bpf_percpu_obj_drop(p); + return 0; +} + char _license[] SEC("license") = "GPL";