
[bpf-next,v4,4/7] bpf: Refill only one percpu element in memalloc

Message ID 20231218063052.3040932-1-yonghong.song@linux.dev (mailing list archive)
State Superseded
Delegated to: BPF
Series bpf: Reduce memory usage for bpf_global_percpu_ma

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1122 this patch: 1122
netdev/cc_maintainers warning 7 maintainers not CCed: kpsingh@kernel.org haoluo@google.com martin.lau@linux.dev jolsa@kernel.org sdf@google.com song@kernel.org john.fastabend@gmail.com
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1149 this patch: 1149
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc

Commit Message

Yonghong Song Dec. 18, 2023, 6:30 a.m. UTC
Typically for a percpu map element or data structure, once allocated,
most operations are lookups or in-place updates. Deletions are really
rare. Currently, for percpu data structures, 4 elements will be
refilled if the size is <= 256. Let us just refill one element
for percpu data. For example, for a unit size of 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.
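
For reference, a rough breakdown of the quoted saving (assuming one
bpf_mem_cache prefilled per possible CPU, each now prefilling 3 fewer
objects, and each 256-byte percpu object carrying one 256-byte copy per
possible CPU):

   3 objects * 256 bytes * 128 CPU copies * 128 per-CPU caches
     = 12,582,912 bytes = 12MB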

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/memalloc.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Comments

Hou Tao Dec. 19, 2023, 11:31 a.m. UTC | #1
Hi,

On 12/18/2023 2:30 PM, Yonghong Song wrote:
> Typically for a percpu map element or data structure, once allocated,
> most operations are lookups or in-place updates. Deletions are really
> rare. Currently, for percpu data structures, 4 elements will be
> refilled if the size is <= 256. Let us just refill one element
> for percpu data. For example, for a unit size of 256 and 128 cpus, the
> potential saving will be 3 * 256 * 128 * 128 = 12MB.
>
> Acked-by: Hou Tao <houtao1@huawei.com>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>  kernel/bpf/memalloc.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 50ab2fecc005..f37998662146 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -485,11 +485,16 @@ static void init_refill_work(struct bpf_mem_cache *c)
>  
>  static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
>  {
> -	/* To avoid consuming memory assume that 1st run of bpf
> -	 * prog won't be doing more than 4 map_update_elem from
> -	 * irq disabled region
> +	int cnt = 1;
> +
> +	/* To avoid consuming memory, for non-percpu allocation, assume that
> +	 * 1st run of bpf prog won't be doing more than 4 map_update_elem from
> +	 * irq disabled region if unit size is less than or equal to 256.
> +	 * For all other cases, let us just do one allocation.
>  	 */
> -	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
> +	if (!c->percpu_size && c->unit_size <= 256)
> +		cnt = 4;
> +	alloc_bulk(c, cnt, cpu_to_node(cpu), false);
>  }

Another thought about this patch. When the prefilled element is
consumed by an invocation of bpf_percpu_obj_new(), the refill will
trigger and this time it will allocate c->batch elements. For a
256-byte unit_size, c->batch will be 64, so the maximal memory
consumption on a 128-CPU host will be: 64 * 256 * 128 * 128 = 256MB
when there is one allocation of bpf_percpu_obj_new() on each CPU. My
question is: should we adjust low_watermark and high_watermark
accordingly for per-cpu allocation to reduce the memory consumption?
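
For reference, the arithmetic behind the 256MB figure above, taking
c->batch = 64 as stated (the reply below arrives at a smaller batch of 48):

   64 objects * 256 bytes * 128 CPU copies * 128 per-CPU caches
     = 268,435,456 bytes = 256MB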
>  
>  static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
Yonghong Song Dec. 20, 2023, 7:55 p.m. UTC | #2
On 12/19/23 3:31 AM, Hou Tao wrote:
> Hi,
>
> On 12/18/2023 2:30 PM, Yonghong Song wrote:
>> Typically for a percpu map element or data structure, once allocated,
>> most operations are lookups or in-place updates. Deletions are really
>> rare. Currently, for percpu data structures, 4 elements will be
>> refilled if the size is <= 256. Let us just refill one element
>> for percpu data. For example, for a unit size of 256 and 128 cpus, the
>> potential saving will be 3 * 256 * 128 * 128 = 12MB.
>>
>> Acked-by: Hou Tao <houtao1@huawei.com>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>   kernel/bpf/memalloc.c | 13 +++++++++----
>>   1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 50ab2fecc005..f37998662146 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -485,11 +485,16 @@ static void init_refill_work(struct bpf_mem_cache *c)
>>   
>>   static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
>>   {
>> -	/* To avoid consuming memory assume that 1st run of bpf
>> -	 * prog won't be doing more than 4 map_update_elem from
>> -	 * irq disabled region
>> +	int cnt = 1;
>> +
>> +	/* To avoid consuming memory, for non-percpu allocation, assume that
>> +	 * 1st run of bpf prog won't be doing more than 4 map_update_elem from
>> +	 * irq disabled region if unit size is less than or equal to 256.
>> +	 * For all other cases, let us just do one allocation.
>>   	 */
>> -	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
>> +	if (!c->percpu_size && c->unit_size <= 256)
>> +		cnt = 4;
>> +	alloc_bulk(c, cnt, cpu_to_node(cpu), false);
>>   }
> Another thought about this patch. When the prefilled element is
> consumed by an invocation of bpf_percpu_obj_new(), the refill will
> trigger and this time it will allocate c->batch elements. For a
> 256-byte unit_size, c->batch will be 64, so the maximal memory
> consumption on a 128-CPU host will be: 64 * 256 * 128 * 128 = 256MB

Actually, it should be 48 * 256 * 128 * 128 = 192MB in the worst case, due to
   c->batch = max((c->high_watermark - c->low_watermark) / 4 * 3, 1);

But in reality, for percpu allocation, we probably won't have allocations
on all 128 cpus.

Your point is taken, though: for percpu allocation we will have far fewer
allocations, so we should not use the current low_watermark/high_watermark
values used for non-percpu allocations. So for percpu allocation, I suggest:
   low_watermark = 1;
   high_watermark = 3;
   c->batch = 1;

Thanks!
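
For context on the 48 above: assuming the existing non-percpu defaults of
low_watermark = 32 and high_watermark = 96 for unit_size <= 256, the quoted
formula gives (96 - 32) / 4 * 3 = 48, hence the 48 * 256 * 128 * 128 = 192MB
worst case. Below is a minimal, self-contained sketch of the tuning suggested
above; it is illustrative only, and the 32/96 defaults and the helper names
are assumptions rather than the exact kernel code:

  struct cache_tuning {
          int low_watermark;
          int high_watermark;
          int batch;
  };

  /* A minimal sketch of the suggested tuning. Only the percpu and the
   * small (<= 256 byte) non-percpu cases are modeled.
   */
  static struct cache_tuning pick_tuning(int unit_size, int percpu_size)
  {
          struct cache_tuning t;

          if (percpu_size) {
                  /* suggested percpu tuning: keep the cache tiny */
                  t.low_watermark = 1;
                  t.high_watermark = 3;
          } else {
                  /* assumed existing defaults for unit_size <= 256 */
                  t.low_watermark = 32;
                  t.high_watermark = 96;
          }
          /* same formula as quoted above: (96 - 32) / 4 * 3 = 48,
           * while (3 - 1) / 4 * 3 = 0 is clamped to 1
           */
          t.batch = (t.high_watermark - t.low_watermark) / 4 * 3;
          if (t.batch < 1)
                  t.batch = 1;
          return t;
  }

With low_watermark = 1 and high_watermark = 3, the quoted formula already
yields a batch of 1, so a refill for a 256-byte percpu object would allocate
a single element per cache, roughly 1 * 256 * 128 * 128 = 4MB in the worst
case above instead of 192MB.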

> when there is one allocation of bpf_percpu_obj_new() on each CPU. My
> question is: should we adjust low_watermark and high_watermark
> accordingly for per-cpu allocation to reduce the memory consumption?
>>   
>>   static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 50ab2fecc005..f37998662146 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -485,11 +485,16 @@  static void init_refill_work(struct bpf_mem_cache *c)
 
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
-	/* To avoid consuming memory assume that 1st run of bpf
-	 * prog won't be doing more than 4 map_update_elem from
-	 * irq disabled region
+	int cnt = 1;
+
+	/* To avoid consuming memory, for non-percpu allocation, assume that
+	 * 1st run of bpf prog won't be doing more than 4 map_update_elem from
+	 * irq disabled region if unit size is less than or equal to 256.
+	 * For all other cases, let us just do one allocation.
 	 */
-	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
+	if (!c->percpu_size && c->unit_size <= 256)
+		cnt = 4;
+	alloc_bulk(c, cnt, cpu_to_node(cpu), false);
 }
 
 static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx)
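
For readers less familiar with the allocator, here is a standalone sketch of
the prefill-count decision this hunk introduces, plus a rough model of the
total prefill footprint across per-CPU caches. prefill_count() mirrors the new
logic in prefill_mem_cache() above; prefill_bytes(), its parameters, and the
assumption that one cache is prefilled per possible CPU are illustrative, not
taken from the kernel code shown here:

  /* Mirrors the decision added to prefill_mem_cache() by this patch:
   * keep the historical batch of 4 only for small non-percpu objects,
   * otherwise prefill a single element.
   */
  static int prefill_count(int percpu_size, int unit_size)
  {
          if (!percpu_size && unit_size <= 256)
                  return 4;
          return 1;
  }

  /* Rough model (an assumption, not kernel code): one cache is prefilled
   * per possible CPU, and a percpu object of unit_size bytes occupies
   * unit_size bytes on every possible CPU.
   */
  static long prefill_bytes(int percpu_size, int unit_size, int num_possible)
  {
          long per_object = (long)unit_size * (percpu_size ? num_possible : 1);

          return (long)prefill_count(percpu_size, unit_size) *
                 per_object * num_possible;
  }

Plugging in unit_size = 256 and 128 possible CPUs gives a percpu prefill
footprint of 16MB before this patch (count 4) versus 4MB after (count 1),
i.e. the 12MB saving quoted in the commit message.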