[bpf-next,v8] selftests/bpf: Add benchmark for bpf memory allocator

From: Hou Tao <houtao1@huawei.com>

The benchmark can be used to compare the performance of hash map
operations and the memory usage between different flavors of the bpf
memory allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It
can also be used to check the performance improvement or the memory
saving provided by an optimization.

By default, the benchmark creates a non-preallocated hash map which uses
the bpf memory allocator, and it shows the operation performance and the
memory usage of the hash map under different use cases:
(1) overwrite
Each CPU overwrites a non-overlapping part of the hash map. Each time a
CPU completes overwriting 64 elements in the hash map, it increases the
op_count once.
(2) batch_add_batch_del
Each CPU adds and then deletes a non-overlapping part of the hash map in
batches. Each time a CPU adds and then deletes 64 elements in the hash
map, it increases the op_count twice.
(3) add_del_on_diff_cpu
Each pair of CPUs cooperatively adds and deletes a non-overlapping part
of the hash map. Each time a CPU adds or deletes 64 elements in the hash
map, it increases the op_count once.
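The op_count accounting for the three use cases can be modelled in plain user-space C. This is a hypothetical sketch, not the code in progs/htab_mem_bench.c. A bool array stands in for the hash map. MAX_ENTRIES (8192) and OP_BATCH (64) follow the fixed 8k map size and the 64-element limit mentioned in the change log. The strided key layout in key_of() is an assumption chosen only to keep each CPU's keys disjoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed sizes: 8K-entry map, 64 elements touched per program call. */
#define MAX_ENTRIES 8192
#define OP_BATCH 64

/* Stand-in for the hash map: present[k] is true when key k is in the map. */
static bool present[MAX_ENTRIES];

static void map_clear(void)
{
	memset(present, 0, sizeof(present));
}

static unsigned int map_size(void)
{
	unsigned int n = 0;

	for (unsigned int k = 0; k < MAX_ENTRIES; k++)
		n += present[k];
	return n;
}

/* Hypothetical layout: slot `id` of `nr` owns keys id, id + nr, ... */
static unsigned int key_of(unsigned int id, unsigned int nr, unsigned int i)
{
	unsigned int key = id + i * nr;

	assert(key < MAX_ENTRIES); /* holds for nr <= 128 */
	return key;
}

/* (1) overwrite: update 64 owned keys, then bump op_count once. */
static void overwrite(unsigned int cpu, unsigned int nr, unsigned long *op_count)
{
	for (unsigned int i = 0; i < OP_BATCH; i++)
		present[key_of(cpu, nr, i)] = true;
	(*op_count)++;
}

/* (2) batch_add_batch_del: add 64 owned keys, then delete them;
 * op_count is bumped after each batch, i.e. twice per call. */
static void batch_add_batch_del(unsigned int cpu, unsigned int nr,
				unsigned long *op_count)
{
	for (unsigned int i = 0; i < OP_BATCH; i++)
		present[key_of(cpu, nr, i)] = true;
	(*op_count)++;
	for (unsigned int i = 0; i < OP_BATCH; i++)
		present[key_of(cpu, nr, i)] = false;
	(*op_count)++;
}

/* (3) add_del_on_diff_cpu: CPUs 2k and 2k+1 share pair k's keys; the even
 * CPU adds them, the odd CPU deletes them, each bumps its own op_count. */
static void add_del_on_diff_cpu(unsigned int cpu, unsigned int nr,
				unsigned long *op_count)
{
	bool add = (cpu % 2) == 0;

	assert(nr % 2 == 0);
	for (unsigned int i = 0; i < OP_BATCH; i++)
		present[key_of(cpu / 2, nr / 2, i)] = add;
	(*op_count)++;
}
```

For example, calling overwrite() twice on the same CPU bumps its op_count to 2 without growing the map. Calling add_del_on_diff_cpu() on CPU 2 and then on CPU 3 adds and then removes the same 64 keys.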

The following are the benchmark results comparing different flavors of
the bpf memory allocator. These tests were conducted on a KVM guest with
8 CPUs and 16 GB of memory. The following command line was used for all
of the benchmarks:

  ./bench htab-mem --use-case $name ${OPTS} -w3 -d10 -a -p8

These results show that the preallocated hash map has both better
performance and a smaller memory footprint.

(1) non-preallocated + no bpf memory allocator (v6.0.19)
uses kmalloc() + call_rcu()

overwrite            per-prod-op: 11.24 ± 0.07k/s, avg mem: 82.64 ± 26.32MiB, peak mem: 119.18MiB
batch_add_batch_del  per-prod-op: 18.45 ± 0.10k/s, avg mem: 50.47 ± 14.51MiB, peak mem: 94.96MiB
add_del_on_diff_cpu  per-prod-op: 14.50 ± 0.03k/s, avg mem: 4.64 ± 0.73MiB, peak mem: 7.20MiB

(2) preallocated
OPTS=--preallocated

overwrite            per-prod-op: 191.92 ± 0.07k/s, avg mem: 1.23 ± 0.00MiB, peak mem: 1.49MiB
batch_add_batch_del  per-prod-op: 218.10 ± 0.25k/s, avg mem: 1.23 ± 0.00MiB, peak mem: 1.49MiB
add_del_on_diff_cpu  per-prod-op: 39.59 ± 0.41k/s, avg mem: 1.48 ± 0.11MiB, peak mem: 1.74MiB

(3) normal bpf memory allocator

overwrite            per-prod-op: 134.81 ± 0.22k/s, avg mem: 1.67 ± 0.12MiB, peak mem: 2.74MiB
batch_add_batch_del  per-prod-op: 90.44 ± 0.34k/s, avg mem: 2.27 ± 0.00MiB, peak mem: 2.74MiB
add_del_on_diff_cpu  per-prod-op: 28.20 ± 0.15k/s, avg mem: 1.73 ± 0.17MiB, peak mem: 2.06MiB
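To quantify the comparison, you can compute ratios directly from the three tables above. The sketch below is plain user-space C. The arrays only copy the reported per-prod-op and peak-memory figures, and the helper name print_ratios() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

enum { OVERWRITE, BATCH_ADD_BATCH_DEL, ADD_DEL_ON_DIFF_CPU, NR_CASES };

static const char *const case_name[NR_CASES] = {
	"overwrite", "batch_add_batch_del", "add_del_on_diff_cpu",
};

/* per-prod-op throughput in k/s, copied from the tables above */
static const double no_ma[NR_CASES]    = { 11.24, 18.45, 14.50 };
static const double prealloc[NR_CASES] = { 191.92, 218.10, 39.59 };
static const double bpf_ma[NR_CASES]   = { 134.81, 90.44, 28.20 };

/* peak memory in MiB, copied from the same tables */
static const double peak_no_ma[NR_CASES]    = { 119.18, 94.96, 7.20 };
static const double peak_prealloc[NR_CASES] = { 1.49, 1.49, 1.74 };
static const double peak_bpf_ma[NR_CASES]   = { 2.74, 2.74, 2.06 };

/* Print speedup of bpf ma over no ma, of prealloc over bpf ma, and the
 * peak-memory reduction of bpf ma relative to no ma. */
static void print_ratios(void)
{
	for (int c = 0; c < NR_CASES; c++) {
		assert(no_ma[c] > 0 && bpf_ma[c] > 0 && peak_bpf_ma[c] > 0);
		printf("%-20s bpf ma/no ma: %5.2fx, prealloc/bpf ma: %4.2fx, "
		       "peak mem no ma/bpf ma: %5.1fx\n",
		       case_name[c], bpf_ma[c] / no_ma[c],
		       prealloc[c] / bpf_ma[c], peak_no_ma[c] / peak_bpf_ma[c]);
	}
}
```

From these numbers, the bpf memory allocator is roughly 12x, 4.9x and 1.9x faster than the kmalloc() path for the three use cases. The preallocated map is a further 1.42x, 2.41x and 1.40x faster than the bpf memory allocator.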

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
Hi,

After hacking htab to use bpf_mem_cache_free_rcu() [0], the htab-mem
benchmark results are as follows:

htab-mem:

overwrite            per-prod-op: 69.37 ± 0.83k/s, avg mem: 137.00 ± 30.10MiB, peak mem: 201.45MiB
batch_add_batch_del  per-prod-op: 76.26 ± 0.54k/s, avg mem: 45.88 ± 2.19MiB, peak mem: 56.27MiB
add_del_on_diff_cpu  per-prod-op: 30.58 ± 0.13k/s, avg mem: 20.06 ± 2.31MiB, peak mem: 28.54MiB
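The cost of the bpf_mem_cache_free_rcu() hack relative to the normal bpf memory allocator can be expressed as a relative change. The following C sketch takes its inputs from the two result sets in this mail and nowhere else. The helper pct_change() and the array names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double after, double before)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Order: overwrite, batch_add_batch_del, add_del_on_diff_cpu.
 * "ma" is the normal bpf memory allocator, "rcu" is the
 * bpf_mem_cache_free_rcu() hack. */
static const double ops_ma[3]   = { 134.81, 90.44, 28.20 };  /* k/s */
static const double ops_rcu[3]  = { 69.37, 76.26, 30.58 };   /* k/s */
static const double peak_ma[3]  = { 2.74, 2.74, 2.06 };      /* MiB */
static const double peak_rcu[3] = { 201.45, 56.27, 28.54 };  /* MiB */

static void print_rcu_delta(void)
{
	for (int c = 0; c < 3; c++)
		printf("case %d: per-prod-op %+.1f%%, peak mem %.1fx\n", c,
		       pct_change(ops_rcu[c], ops_ma[c]),
		       peak_rcu[c] / peak_ma[c]);
}
```

Compared with the normal bpf memory allocator, the hack lowers per-producer throughput by about 49% for overwrite and 16% for batch_add_batch_del. It raises throughput by about 8% for add_del_on_diff_cpu. Peak memory grows roughly 73x, 21x and 14x respectively.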

For reference, hash_map_perf benchmark results are shown below. The
command line for the benchmark is "./map_perf_test 4 8 16384".

htab-with-bpf_mem_cache_free()
2:hash_map_perf kmalloc 496369 events per sec
0:hash_map_perf kmalloc 250241 events per sec
1:hash_map_perf kmalloc 248366 events per sec
3:hash_map_perf kmalloc 240521 events per sec
6:hash_map_perf kmalloc 250032 events per sec
7:hash_map_perf kmalloc 250798 events per sec
4:hash_map_perf kmalloc 152847 events per sec
5:hash_map_perf kmalloc 150083 events per sec

htab-with-bpf_mem_cache_free_rcu()
2:hash_map_perf kmalloc 294190 events per sec
3:hash_map_perf kmalloc 172039 events per sec
0:hash_map_perf kmalloc 170092 events per sec
5:hash_map_perf kmalloc 170396 events per sec
6:hash_map_perf kmalloc 170030 events per sec
7:hash_map_perf kmalloc 167629 events per sec
1:hash_map_perf kmalloc 162975 events per sec
4:hash_map_perf kmalloc 162673 events per sec

[0]: https://lore.kernel.org/bpf/20230628015634.33193-1-alexei.starovoitov@gmail.com/

Change Log:

v8:
 * use MAX() from sys/param.h instead of redefining it
 * remove unnecessary patch which adds min() & max() in bpf_util.h

v7: https://lore.kernel.org/rcu/20230628115910.3817966-1-houtao@huaweicloud.com/
 * Rename the producer threads to avoid confusion
 * Make the comments in the producer threads clearer
 * Remove the unnecessary check of ctx->from in the bpf program
 * Split the add_del_on_diff bpf program into two bpf programs for clarity

v6: https://lore.kernel.org/bpf/20230613080921.1623219-1-houtao@huaweicloud.com/
 * add fix patches for the benchmark framework
 * updates for the htab-mem benchmark (most of the updates were suggested by Alexei)
   * remove --full and --max-entries and use a fixed 8k size for htab
   * remove op_factor and increase op_cnt correctly
   * use -a instead of --prod-affinity in run_bench_htab_mem.sh
   * use $RUN_BENCH in run_bench_htab_mem.sh
   * call cleanup_cgroup_environment() at the end of htab_mem_report_final()

v5: https://lore.kernel.org/bpf/ff4b2396-48aa-28f1-c91b-7c8a4b9510bb@huaweicloud.com/
 * send the benchmark patch alone (suggested by Alexei)
 * limit the max number of touched elements per-bpf-program call to 64 (from Alexei)
 * show per-producer performance (from Alexei)
 * handle the return value of read() (from BPF CI)
 * do cleanup_cgroup_environment() in htab_mem_report_final()

v4: https://lore.kernel.org/bpf/20230606035310.4026145-1-houtao@huaweicloud.com/

 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 346 ++++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  40 ++
 .../selftests/bpf/progs/htab_mem_bench.c      | 105 ++++++
 5 files changed, 498 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

Message ID	20230703141332.3319271-1-houtao@huaweicloud.com
State	Superseded
Headers
From: Hou Tao <houtao@huaweicloud.com>
To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>, Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>, Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>, Daniel Borkmann <daniel@iogearbox.net>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>, John Fastabend <john.fastabend@gmail.com>, "Paul E . McKenney" <paulmck@kernel.org>, rcu@vger.kernel.org, houtao1@huawei.com
Subject: [PATCH bpf-next v8] selftests/bpf: Add benchmark for bpf memory allocator
Date: Mon, 3 Jul 2023 22:13:32 +0800
Message-Id: <20230703141332.3319271-1-houtao@huaweicloud.com>