
[v3,bpf-next] selftests/bpf: Add benchmark for local_storage get

Message ID 20220521045958.3405148-1-davemarchevsky@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Series [v3,bpf-next] selftests/bpf: Add benchmark for local_storage get

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline fail Detected static functions without inline keyword in header files: 1
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 8 maintainers not CCed: joannekoong@fb.com kpsingh@kernel.org john.fastabend@gmail.com yhs@fb.com songliubraving@fb.com netdev@vger.kernel.org linux-kselftest@vger.kernel.org shuah@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning CHECK: Macro argument 'interleave' may be better as '(interleave)' to avoid precedence issues WARNING: Macros with flow control statements should be avoided WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 106 exceeds 80 columns WARNING: line length of 112 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline fail Was 0 now: 1
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15

Commit Message

Dave Marchevsky May 21, 2022, 4:59 a.m. UTC
Add benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond the current
local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.
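
In rough outline, each invocation of the benchmark's BPF program runs a
loop like the following (summarized from progs/local_storage_bench.h in
this patch; 'interleave' is a compile-time parameter set only for the
"interleaved get" variant):

  task = bpf_get_current_task_btf();
  for (i = 0; i < 1000; i++) {
          if (do_lookup(i, task))          /* stops once map i doesn't exist */
                  return 0;
          if (interleave && i % 3 == 0)    /* extra get of important map 0 */
                  do_lookup(0, task);
  }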

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage.

The addition of this benchmark is inspired by a conversation with Alexei
in a previous patchset's thread [0], which highlighted the need for such
a benchmark to motivate and validate improvements to the local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected,
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so the effect
of future work on both is clear.

Regarding the benchmark results, gathered on a powerful system (Skylake,
20 cores, 256GB RAM):

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s

Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that the
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" creates maps 0..n in order and
then does bpf_task_storage_get calls in the same order, the benchmark
effectively ensures that a map will not be in cache when the program
tries to access it.
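
For reference, the lookup path being stressed looks conceptually like
the following (a simplified paraphrase of the kernel's bpf_local_storage
lookup; not the exact code, RCU and locking omitted):

  /* fast path: check the map's assigned slot in the 16-entry cache */
  sdata = local_storage->cache[smap->cache_idx];
  if (sdata && sdata->smap == smap)
          return sdata;

  /* slow path: linear walk over every storage attached to this task */
  hlist_for_each_entry(selem, &local_storage->list, snode)
          if (SDATA(selem)->smap == smap)
                  return SDATA(selem);
  return NULL;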

For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.

As evidenced by the unintuitive-looking results for smaller num_maps
benchmark runs, overhead which is amortized across larger num_maps runs
dominates when there are fewer maps. To get a sense of the overhead, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.h and ran the benchmark on the same host as the
'real' run. Results:

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 4.135 ± 0.069 M ops/s, hits latency: 241.831 ns/op, important_hits throughput: 4.135 ± 0.069 M ops/s
local_storage cache interleaved get:  hits throughput: 7.693 ± 0.039 M ops/s, hits latency: 129.982 ns/op, important_hits throughput: 7.693 ± 0.039 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 33.044 ± 0.232 M ops/s, hits latency: 30.262 ns/op, important_hits throughput: 3.304 ± 0.023 M ops/s
local_storage cache interleaved get:  hits throughput: 36.525 ± 1.545 M ops/s, hits latency: 27.378 ns/op, important_hits throughput: 13.045 ± 0.552 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 45.502 ± 1.429 M ops/s, hits latency: 21.977 ns/op, important_hits throughput: 2.844 ± 0.089 M ops/s
local_storage cache interleaved get:  hits throughput: 47.741 ± 1.115 M ops/s, hits latency: 20.946 ns/op, important_hits throughput: 15.190 ± 0.355 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 47.177 ± 0.617 M ops/s, hits latency: 21.197 ns/op, important_hits throughput: 2.775 ± 0.036 M ops/s
local_storage cache interleaved get:  hits throughput: 50.005 ± 0.463 M ops/s, hits latency: 19.998 ns/op, important_hits throughput: 15.219 ± 0.141 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 58.076 ± 0.507 M ops/s, hits latency: 17.219 ns/op, important_hits throughput: 2.420 ± 0.021 M ops/s
local_storage cache interleaved get:  hits throughput: 57.731 ± 0.500 M ops/s, hits latency: 17.322 ns/op, important_hits throughput: 16.237 ± 0.141 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 68.266 ± 0.234 M ops/s, hits latency: 14.649 ns/op, important_hits throughput: 2.133 ± 0.007 M ops/s
local_storage cache interleaved get:  hits throughput: 62.138 ± 2.695 M ops/s, hits latency: 16.093 ns/op, important_hits throughput: 17.341 ± 0.752 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 103.735 ± 2.874 M ops/s, hits latency: 9.640 ns/op, important_hits throughput: 1.037 ± 0.029 M ops/s
local_storage cache interleaved get:  hits throughput: 85.950 ± 1.619 M ops/s, hits latency: 11.635 ns/op, important_hits throughput: 22.450 ± 0.423 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 133.551 ± 1.915 M ops/s, hits latency: 7.488 ns/op, important_hits throughput: 0.134 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 97.579 ± 1.415 M ops/s, hits latency: 10.248 ns/op, important_hits throughput: 24.505 ± 0.355 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

hashmap_control:     ~6.6ns
sequential_get_1:    ~17.9ns
sequential_get_10:   ~18.9ns
sequential_get_16:   ~19.0ns
sequential_get_17:   ~20.2ns
sequential_get_24:   ~42.2ns
sequential_get_32:   ~68.7ns
sequential_get_100:  ~163.3ns
sequential_get_1000: ~2200ns

Clearly demonstrating a cliff.

When running the benchmarks it may be necessary to bump 'open files'
ulimit for a successful run.

  [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
Changelog:

v2 -> v3:
  * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
    so just use 1 program (Alexei)

v1 -> v2:
  * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
    configurable # of maps (Andrii)
  * Add hashmap benchmark (Alexei)
  * Add discussion of overhead

 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/bench.c           |  57 +++
 tools/testing/selftests/bpf/bench.h           |   5 +
 .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
 .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
 .../selftests/bpf/benchs/run_common.sh        |  17 +
 .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
 .../bpf/progs/local_storage_bench__get_int.c  |  12 +
 .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
 .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
 10 files changed, 537 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c

Comments

Andrii Nakryiko May 23, 2022, 9:11 p.m. UTC | #1
On Fri, May 20, 2022 at 10:00 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> Add a benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to normal 'hits' count of all gets. Goal here is
> to mimic scenario where a particular program using one map - the
> important one - is running on a system where many other local_storage
> maps exist and are accessed often.
>
> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
> bpf_task_storage_gets for the important map for every 10 map gets. This
> is meant to highlight performance differences when important map is
> accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark is inspired by conversation with Alexei in a
> previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so effect of
> future work on both is clear.
>
> Regarding the benchmark results. On a powerful system (Skylake, 20
> cores, 256gb ram):
>
> Local Storage
> =============
>         Hashmap Control w/ 500 maps
> hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s
>
>         num_maps: 1
> local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
> local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s

this huge drop in performance at low num_maps is due to syscall and
fentry overhead, is that right? How about making each syscall
invocation do (roughly) the same number of map/storage lookups to
neutralize this overhead? Something like:


const volatile int map_cnt;
const volatile int iter_cnt;

...


for (i = 0; i < iter_cnt; i++) {
    int map_idx = i % map_cnt;

    do_lookup(map_idx, task);

...

}

User-space can calculate iter_cnt to be the closest exact multiple of
map_cnt, or you can just hard-code iter_cnt to a fixed number (something
like 10000 or some high enough value) and live with a slightly uneven
pattern for the last round of looping.


But this way you make syscall/fentry overhead essentially fixed, which
will avoid these counter-intuitive numbers.
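
A minimal user-space sketch of that calculation (illustrative only;
map_cnt/iter_cnt as in the pseudo-code above, written to the skeleton's
.rodata before load):

  /* keep total lookups per program invocation roughly constant */
  skel->rodata->map_cnt = map_cnt;
  skel->rodata->iter_cnt = 10000 / map_cnt * map_cnt; /* closest multiple */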


>
>         num_maps: 10
> local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
> local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s
>
>         num_maps: 16
> local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
> local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s
>
>         num_maps: 17
> local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
> local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s
>
>         num_maps: 24
> local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
> local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s
>
>         num_maps: 32
> local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
> local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s
>
>         num_maps: 100
> local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
> local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s
>
>         num_maps: 1000
> local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
> local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s
>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> current local_storage implementation assigns a cache_idx to maps as they
> are created. Since "sequential get" is creating maps 0..n in order and
> then doing bpf_task_storage_get calls in the same order, the benchmark
> is effectively ensuring that a map will not be in cache when the program
> tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still reduces as #
> maps increases, though.
>
> As evidenced by the unintuitive-looking results for smaller num_maps
> benchmark runs, overhead which is amortized across larger num_maps runs
> dominates when there are fewer maps. To get a sense of the overhead, I
> commented out bpf_task_storage_get/bpf_map_lookup_elem in
> local_storage_bench.h and ran the benchmark on the same host as the
> 'real' run. Results:
>
> Local Storage
> =============
>         Hashmap Control w/ 500 maps
> hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s
>

[...]

>
> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
>
> hashmap_control:     ~6.6ns
> sequential_get_1:    ~17.9ns
> sequential_get_10:   ~18.9ns
> sequential_get_16:   ~19.0ns
> sequential_get_17:   ~20.2ns
> sequential_get_24:   ~42.2ns
> sequential_get_32:   ~68.7ns
> sequential_get_100:  ~163.3ns
> sequential_get_1000: ~2200ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump 'open files'
> ulimit for a successful run.
>
>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> Changelog:
>
> v2 -> v3:
>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>           so just use 1 program (Alexei)
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>         * Add discussion of overhead
>
>  tools/testing/selftests/bpf/Makefile          |   6 +-
>  tools/testing/selftests/bpf/bench.c           |  57 +++
>  tools/testing/selftests/bpf/bench.h           |   5 +
>  .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>  .../selftests/bpf/benchs/run_common.sh        |  17 +
>  .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
>  .../bpf/progs/local_storage_bench__get_int.c  |  12 +
>  .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
>  .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
>  10 files changed, 537 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 4030dd6cbc34..6095f6af2ad1 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
>  $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
>  $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
>  $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
> +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
> +                                 $(OUTPUT)/local_storage_bench__get_int.skel.h \
> +                                 $(OUTPUT)/local_storage_bench__hashmap.skel.h

You really don't need 3 skeletons for this; you can parameterize
everything with 2-3 .rodata variables and have fixed code and a single
skeleton header. It will also simplify your setup code, as you won't
need those callbacks that abstract the specific skeleton away. Much
cleaner and simpler, IMO.

Please, try to simplify this and make it easier to maintain.
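
A minimal sketch of that parameterization (variable names are
illustrative, not from the patch):

  /* BPF side: fixed code, single skeleton */
  const volatile bool use_hashmap;   /* hashmap control vs local_storage */
  const volatile bool interleave;    /* interleaved vs sequential gets */

  /* user-space side, set before skeleton load */
  skel->rodata->use_hashmap = false;
  skel->rodata->interleave = true;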


>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>  $(OUTPUT)/bench: LDLIBS += -lm
>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>                  $(OUTPUT)/bench_ringbufs.o \
>                  $(OUTPUT)/bench_bloom_filter_map.o \
>                  $(OUTPUT)/bench_bpf_loop.o \
> -                $(OUTPUT)/bench_strncmp.o
> +                $(OUTPUT)/bench_strncmp.o \
> +                $(OUTPUT)/bench_local_storage.o
>         $(call msg,BINARY,,$@)
>         $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
>
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index f061cc20e776..71271062f68d 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt)
>         printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
>  }
>
> +void local_storage_report_progress(int iter, struct bench_res *res,
> +                                  long delta_ns)
> +{
> +       double important_hits_per_sec, hits_per_sec;
> +       double delta_sec = delta_ns / 1000000000.0;
> +
> +       hits_per_sec = res->hits / 1000000.0 / delta_sec;
> +       important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
> +
> +       printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
> +
> +       printf("hits %8.3lfM/s ", hits_per_sec);
> +       printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
> +}
> +
> +void local_storage_report_final(struct bench_res res[], int res_cnt)
> +{
> +       double important_hits_mean = 0.0, important_hits_stddev = 0.0;
> +       double hits_mean = 0.0, hits_stddev = 0.0;
> +       int i;
> +
> +       for (i = 0; i < res_cnt; i++) {
> +               hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
> +               important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
> +       }
> +
> +       if (res_cnt > 1)  {
> +               for (i = 0; i < res_cnt; i++) {
> +                       hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
> +                                      (hits_mean - res[i].hits / 1000000.0) /
> +                                      (res_cnt - 1.0);
> +                       important_hits_stddev +=
> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) *
> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) /
> +                                      (res_cnt - 1.0);
> +               }
> +
> +               hits_stddev = sqrt(hits_stddev);
> +               important_hits_stddev = sqrt(important_hits_stddev);
> +       }
> +       printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
> +              hits_mean, hits_stddev);
> +       printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
> +       printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
> +              important_hits_mean, important_hits_stddev);
> +}
> +

We have hits_drops_report_progress/hits_drops_report_final, which use
"hit" and "drop" terminology (admittedly confusing for this set of
benchmarks), but if you ignore the "drop" part, it's exactly what you
need - tracking two independent values (in your case hit and important
hit). You'll get rid of a good chunk of repetitive code with some
statistics in it. Your post-processing scripts will further hide this
detail.

>  const char *argp_program_version = "benchmark";
>  const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
>  const char argp_program_doc[] =
> @@ -188,12 +235,14 @@ static const struct argp_option opts[] = {
>  extern struct argp bench_ringbufs_argp;
>  extern struct argp bench_bloom_map_argp;
>  extern struct argp bench_bpf_loop_argp;
> +extern struct argp bench_local_storage_argp;
>  extern struct argp bench_strncmp_argp;
>

[...]

> +
> +static int setup_inner_map_and_load(int inner_fd)
> +{
> +       int err, mim_fd;
> +
> +       err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
> +       if (err)
> +               return -1;
> +
> +       err = ctx.load_skel(ctx.skel);
> +       if (err)
> +               return -1;
> +
> +       mim_fd = bpf_map__fd(ctx.array_of_maps);
> +       if (mim_fd < 0)
> +               return -1;
> +
> +       return mim_fd;
> +}
> +
> +static int load_btf(void)
> +{
> +       static const char btf_str_sec[] = "\0";
> +       __u32 btf_raw_types[] = {
> +               /* int */
> +               BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
> +       };
> +       struct btf_header btf_hdr = {
> +               .magic = BTF_MAGIC,
> +               .version = BTF_VERSION,
> +               .hdr_len = sizeof(struct btf_header),
> +               .type_len = sizeof(btf_raw_types),
> +               .str_off = sizeof(btf_raw_types),
> +               .str_len = sizeof(btf_str_sec),
> +       };
> +       __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
> +                               sizeof(btf_str_sec)];
> +
> +       memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
> +       memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
> +       memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
> +              btf_str_sec, sizeof(btf_str_sec));
> +
> +       return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
> +}
> +

please try using a declarative map-in-map definition; hopefully it
doesn't influence benchmark results. It will let you avoid this
low-level setup code completely.
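
Something along these lines should work (an illustrative sketch of a
declarative outer map with a task storage inner-map template, not the
exact v4 code):

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
          __uint(max_entries, 1000);
          __uint(key_size, sizeof(int));
          __uint(value_size, sizeof(int));
          __array(values, struct {
                  __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
                  __uint(map_flags, BPF_F_NO_PREALLOC);
                  __type(key, int);
                  __type(value, int);
          });
  } array_of_maps SEC(".maps");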

> +static void __setup(struct bpf_program *prog, bool hashmap)
> +{
> +       int i, fd, mim_fd, err;
> +       int btf_fd = 0;
> +
> +       LIBBPF_OPTS(bpf_map_create_opts, create_opts);
> +

[...]

> +
> +static void measure(struct bench_res *res)
> +{
> +       if (ctx.hits)
> +               res->hits = atomic_swap(ctx.hits, 0);
> +       if (ctx.important_hits)

why these ifs? just swap, measure is called once a second, there is no
need to optimize this

> +               res->important_hits = atomic_swap(ctx.important_hits, 0);
> +}
> +
> +static inline void trigger_bpf_program(void)
> +{
> +       syscall(__NR_getpgid);
> +}
> +

[...]

> +#ifdef LOOKUP_HASHMAP
> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +       void *map;
> +       int zero = 0;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_map_lookup_elem(map, &zero);

shouldn't you use elem here as well to make it a bit more in line with
bpf_task_storage_get()? This fixed zero is too optimistic and
minimizes CPU cache usage, skewing results towards the hashmap. It's
cheaper to access the same location in the hashmap over and over again
vs. randomly jumping over N elements.

> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}
> +#else
> +static int do_lookup(unsigned int elem, struct task_struct *task)
> +{
> +       void *map;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}
> +#endif /* LOOKUP_HASHMAP */
> +
> +#define TASK_STORAGE_GET_LOOP_PROG(interleave)                 \
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
> +int get_local(void *ctx)                                       \
> +{                                                              \
> +       struct task_struct *task;                               \
> +       unsigned int i;                                         \
> +       void *map;                                              \
> +                                                               \
> +       task = bpf_get_current_task_btf();                      \
> +       for (i = 0; i < 1000; i++) {                            \
> +               if (do_lookup(i, task))                         \
> +                       return 0;                               \
> +               if (interleave && i % 3 == 0)                   \
> +                       do_lookup(0, task);                     \
> +       }                                                       \
> +       return 0;                                               \
> +}

I think


const volatile use_local_storage; /* set from user-space */


if (use_local_storage) {
    do_lookup_storage()
} else {
    do_lookup_hashmap()
}

is as clear as (if not clearer than) having three separate skeletons
built from a single #include header parameterized by extra #defines.

> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
> new file mode 100644

[...]
Dave Marchevsky May 30, 2022, 8:41 p.m. UTC | #2
On 5/23/22 5:11 PM, Andrii Nakryiko wrote:   
> On Fri, May 20, 2022 at 10:00 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>
>> Add a benchmarks to demonstrate the performance cliff for local_storage
>> get as the number of local_storage maps increases beyond current
>> local_storage implementation's cache size.
>>
>> "sequential get" and "interleaved get" benchmarks are added, both of
>> which do many bpf_task_storage_get calls on sets of task local_storage
>> maps of various counts, while considering a single specific map to be
>> 'important' and counting task_storage_gets to the important map
>> separately in addition to normal 'hits' count of all gets. Goal here is
>> to mimic scenario where a particular program using one map - the
>> important one - is running on a system where many other local_storage
>> maps exist and are accessed often.
>>
>> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
>> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
>> bpf_task_storage_gets for the important map for every 10 map gets. This
>> is meant to highlight performance differences when important map is
>> accessed far more frequently than non-important maps.
>>
>> A "hashmap control" benchmark is also included for easy comparison of
>> standard bpf hashmap lookup vs local_storage get. The benchmark is
>> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
>> instead of local storage.
>>
>> Addition of this benchmark is inspired by conversation with Alexei in a
>> previous patchset's thread [0], which highlighted the need for such a
>> benchmark to motivate and validate improvements to local_storage
>> implementation. My approach in that series focused on improving
>> performance for explicitly-marked 'important' maps and was rejected
>> with feedback to make more generally-applicable improvements while
>> avoiding explicitly marking maps as important. Thus the benchmark
>> reports both general and important-map-focused metrics, so effect of
>> future work on both is clear.
>>
>> Regarding the benchmark results. On a powerful system (Skylake, 20
>> cores, 256gb ram):
>>
>> Local Storage
>> =============
>>         Hashmap Control w/ 500 maps
>> hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s
>>
>>         num_maps: 1
>> local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
>> local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s
> 
> this is huge drop in performance for num_maps is due to syscall and
> fentry overhead, is that right? How about making each syscall
> invocation do (roughly) the same amount of map/storage lookups per
> invocation to neutralize this overhead? Something like:
> 
> 
> const volatile int map_cnt;
> const volatile int iter_cnt;
> 
> ...
> 
> 
> for (i = 0; i < iter_cnt; i++) {
>     int map_idx = i % map_cnt;
> 
>     do_lookup(map_idx, task);
> 
> ...
> 
> }
> 
> User-space can calculate iter_cnt to be closest exact multiple of
> map_cnt or you can just hard-code iter_cnt to fixed number (something
> like 10000 or some high enough value) and just leave with slightly
> uneven pattern for last round of looping.
> 
> 
> But this way you make syscall/fentry overhead essentially fixed, which
> will avoid these counter-intuitive numbers.
> 

This worked! But I had to use bpf_loop, as the verifier didn't like various
attempts to loop 10k times. Overhead is much more consistent now (see v4 for
numbers).
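
Roughly what the bpf_loop version looks like on the BPF side (an
illustrative sketch, not the exact v4 code; map_cnt is a const volatile
set from user-space, map_cnt > 0, and do_lookup is the same helper as
before):

  static int lookup_cb(__u32 i, void *unused)
  {
          struct task_struct *task = bpf_get_current_task_btf();

          do_lookup(i % map_cnt, task);
          return 0;   /* 0 == keep looping */
  }

  SEC("fentry/" SYS_PREFIX "sys_getpgid")
  int get_local(void *ctx)
  {
          bpf_loop(10000, lookup_cb, NULL, 0);
          return 0;
  }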

>>
>>         num_maps: 10
>> local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
>> local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s
>>
>>         num_maps: 16
>> local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
>> local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s
>>
>>         num_maps: 17
>> local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
>> local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s
>>
>>         num_maps: 24
>> local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
>> local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s
>>
>>         num_maps: 32
>> local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
>> local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s
>>
>>         num_maps: 100
>> local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
>> local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s
>>
>>         num_maps: 1000
>> local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
>> local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s
>>
>> Looking at the "sequential get" results, it's clear that as the
>> number of task local_storage maps grows beyond the current cache size
>> (16), there's a significant reduction in hits throughput. Note that
>> current local_storage implementation assigns a cache_idx to maps as they
>> are created. Since "sequential get" is creating maps 0..n in order and
>> then doing bpf_task_storage_get calls in the same order, the benchmark
>> is effectively ensuring that a map will not be in cache when the program
>> tries to access it.
>>
>> For "interleaved get" results, important-map hits throughput is greatly
>> increased as the important map is more likely to be in cache by virtue
>> of being accessed far more frequently. Throughput still reduces as #
>> maps increases, though.
>>
>> As evidenced by the unintuitive-looking results for smaller num_maps
>> benchmark runs, overhead which is amortized across larger num_maps runs
>> dominates when there are fewer maps. To get a sense of the overhead, I
>> commented out bpf_task_storage_get/bpf_map_lookup_elem in
>> local_storage_bench.h and ran the benchmark on the same host as the
>> 'real' run. Results:
>>
>> Local Storage
>> =============
>>         Hashmap Control w/ 500 maps
>> hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s
>>
> 
> [...]
> 
>>
>> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
>>
>> hashmap_control:     ~6.6ns
>> sequential_get_1:    ~17.9ns
>> sequential_get_10:   ~18.9ns
>> sequential_get_16:   ~19.0ns
>> sequential_get_17:   ~20.2ns
>> sequential_get_24:   ~42.2ns
>> sequential_get_32:   ~68.7ns
>> sequential_get_100:  ~163.3ns
>> sequential_get_1000: ~2200ns
>>
>> Clearly demonstrating a cliff.
>>
>> When running the benchmarks it may be necessary to bump 'open files'
>> ulimit for a successful run.
>>
>>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> Changelog:
>>
>> v2 -> v3:
>>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>>           so just use 1 program (Alexei)
>>
>> v1 -> v2:
>>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>>     configurable # of maps (Andrii)
>>   * Add hashmap benchmark (Alexei)
>>         * Add discussion of overhead
>>
>>  tools/testing/selftests/bpf/Makefile          |   6 +-
>>  tools/testing/selftests/bpf/bench.c           |  57 +++
>>  tools/testing/selftests/bpf/bench.h           |   5 +
>>  .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
>>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>>  .../selftests/bpf/benchs/run_common.sh        |  17 +
>>  .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
>>  .../bpf/progs/local_storage_bench__get_int.c  |  12 +
>>  .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
>>  .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
>>  10 files changed, 537 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
>>
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index 4030dd6cbc34..6095f6af2ad1 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
>>  $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
>>  $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
>>  $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
>> +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
>> +                                 $(OUTPUT)/local_storage_bench__get_int.skel.h \
>> +                                 $(OUTPUT)/local_storage_bench__hashmap.skel.h
> 
> You really don't need 3 skeletons for this, you can parameterize
> everything with 2-3 .rodata variables and have fixed code and single
> skeleton header. It will also simplify your setup code, you won't need
> need those callbacks that abstract specific skeleton away. Much
> cleaner and simpler, IMO.
> 
> Please, try to simplify this and make it easier to maintain.
> 
> 
>>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>>  $(OUTPUT)/bench: LDLIBS += -lm
>>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>> @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>>                  $(OUTPUT)/bench_ringbufs.o \
>>                  $(OUTPUT)/bench_bloom_filter_map.o \
>>                  $(OUTPUT)/bench_bpf_loop.o \
>> -                $(OUTPUT)/bench_strncmp.o
>> +                $(OUTPUT)/bench_strncmp.o \
>> +                $(OUTPUT)/bench_local_storage.o
>>         $(call msg,BINARY,,$@)
>>         $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
>>
>> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
>> index f061cc20e776..71271062f68d 100644
>> --- a/tools/testing/selftests/bpf/bench.c
>> +++ b/tools/testing/selftests/bpf/bench.c
>> @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt)
>>         printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
>>  }
>>
>> +void local_storage_report_progress(int iter, struct bench_res *res,
>> +                                  long delta_ns)
>> +{
>> +       double important_hits_per_sec, hits_per_sec;
>> +       double delta_sec = delta_ns / 1000000000.0;
>> +
>> +       hits_per_sec = res->hits / 1000000.0 / delta_sec;
>> +       important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
>> +
>> +       printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
>> +
>> +       printf("hits %8.3lfM/s ", hits_per_sec);
>> +       printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
>> +}
>> +
>> +void local_storage_report_final(struct bench_res res[], int res_cnt)
>> +{
>> +       double important_hits_mean = 0.0, important_hits_stddev = 0.0;
>> +       double hits_mean = 0.0, hits_stddev = 0.0;
>> +       int i;
>> +
>> +       for (i = 0; i < res_cnt; i++) {
>> +               hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
>> +               important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
>> +       }
>> +
>> +       if (res_cnt > 1)  {
>> +               for (i = 0; i < res_cnt; i++) {
>> +                       hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
>> +                                      (hits_mean - res[i].hits / 1000000.0) /
>> +                                      (res_cnt - 1.0);
>> +                       important_hits_stddev +=
>> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) *
>> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) /
>> +                                      (res_cnt - 1.0);
>> +               }
>> +
>> +               hits_stddev = sqrt(hits_stddev);
>> +               important_hits_stddev = sqrt(important_hits_stddev);
>> +       }
>> +       printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
>> +              hits_mean, hits_stddev);
>> +       printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
>> +       printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
>> +              important_hits_mean, important_hits_stddev);
>> +}
>> +
> 
> We have hits_drops_report_progress/hits_drops_report_final which uses
> "hit" and "drop" terminology (admittedly confusing for this set of
> benchmarks), but if you ignore the "drop" part, it's exactly what you
> need - to track two independent values (in your case hit and important
> hit). You'll get rid of a good chunk of repetitive code with some
> statistics in it. You post-processing scripts will further hide this
> detail.
> 

I considered doing this when working on v1 of the patch, but decided against it
because 'drop' and 'hit' are independent, while 'important_hit' and 'hit' are
not: every important_hit is also a hit.

The repetitive mean / stddev calculations are annoying, though. I added a 2nd
commit to v4 which refactors them.

>>  const char *argp_program_version = "benchmark";
>>  const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
>>  const char argp_program_doc[] =
>> @@ -188,12 +235,14 @@ static const struct argp_option opts[] = {
>>  extern struct argp bench_ringbufs_argp;
>>  extern struct argp bench_bloom_map_argp;
>>  extern struct argp bench_bpf_loop_argp;
>> +extern struct argp bench_local_storage_argp;
>>  extern struct argp bench_strncmp_argp;
>>
> 
> [...]
> 
>> +
>> +static int setup_inner_map_and_load(int inner_fd)
>> +{
>> +       int err, mim_fd;
>> +
>> +       err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
>> +       if (err)
>> +               return -1;
>> +
>> +       err = ctx.load_skel(ctx.skel);
>> +       if (err)
>> +               return -1;
>> +
>> +       mim_fd = bpf_map__fd(ctx.array_of_maps);
>> +       if (mim_fd < 0)
>> +               return -1;
>> +
>> +       return mim_fd;
>> +}
>> +
>> +static int load_btf(void)
>> +{
>> +       static const char btf_str_sec[] = "\0";
>> +       __u32 btf_raw_types[] = {
>> +               /* int */
>> +               BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
>> +       };
>> +       struct btf_header btf_hdr = {
>> +               .magic = BTF_MAGIC,
>> +               .version = BTF_VERSION,
>> +               .hdr_len = sizeof(struct btf_header),
>> +               .type_len = sizeof(btf_raw_types),
>> +               .str_off = sizeof(btf_raw_types),
>> +               .str_len = sizeof(btf_str_sec),
>> +       };
>> +       __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
>> +                               sizeof(btf_str_sec)];
>> +
>> +       memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
>> +       memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
>> +       memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
>> +              btf_str_sec, sizeof(btf_str_sec));
>> +
>> +       return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
>> +}
>> +
> 
> please try using declarative map-in-map definition, hopefully it
> doesn't influence benchmark results. It will allow to avoid this
> low-level setup code completely.
> 

I forgot to call this out in the v4 changelog, but I made this change.

It was more frustrating than expected, though, as it was still necessary to
grab btf_key_type_id / btf_value_type_id from the inner map, and the helpers
would only return a valid type_id if called before prog load.
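
For reference, the pre-load dance ends up looking something like this
(illustrative; the libbpf getters are real API, but the surrounding code
is a sketch, not the v4 patch):

  /* must run after open but before load, while the declarative inner map
   * template still exists; the ids feed the bpf_map_create_opts used for
   * the dynamically created inner maps */
  inner = bpf_map__inner_map(skel->maps.array_of_maps);
  create_opts.btf_key_type_id = bpf_map__btf_key_type_id(inner);
  create_opts.btf_value_type_id = bpf_map__btf_value_type_id(inner);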

>> +static void __setup(struct bpf_program *prog, bool hashmap)
>> +{
>> +       int i, fd, mim_fd, err;
>> +       int btf_fd = 0;
>> +
>> +       LIBBPF_OPTS(bpf_map_create_opts, create_opts);
>> +
> 
> [...]
> 
>> +
>> +static void measure(struct bench_res *res)
>> +{
>> +       if (ctx.hits)
>> +               res->hits = atomic_swap(ctx.hits, 0);
>> +       if (ctx.important_hits)
> 
> why these ifs? just swap, measure is called once a second, there is no
> need to optimize this
> 
>> +               res->important_hits = atomic_swap(ctx.important_hits, 0);
>> +}
>> +
>> +static inline void trigger_bpf_program(void)
>> +{
>> +       syscall(__NR_getpgid);
>> +}
>> +
> 
> [...]
> 
>> +#ifdef LOOKUP_HASHMAP
>> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
>> +{
>> +       void *map;
>> +       int zero = 0;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_map_lookup_elem(map, &zero);
> 
> shouldn't you use elem here as well to make it a bit more in line with
> bpf_task_storage_get()? This fixed zero is too optimistic and
> minimizes CPU cache usage, skewing results towards hashmap. It's
> cheaper to go access same location in hashmap over and over again, vs
> randomly jumping over N elements
> 

Both the hashmap and local_storage benchmarks always grab key 0 from the
map; v4 makes this clearer. The intent is to measure how long it takes to
access the local_storage map, not a random element within it.

Every other comment here that wasn't responded to was addressed in v4.

>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
>> +#else
>> +static int do_lookup(unsigned int elem, struct task_struct *task)
>> +{
>> +       void *map;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
>> +#endif /* LOOKUP_HASHMAP */
>> +
>> +#define TASK_STORAGE_GET_LOOP_PROG(interleave)                 \
>> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
>> +int get_local(void *ctx)                                       \
>> +{                                                              \
>> +       struct task_struct *task;                               \
>> +       unsigned int i;                                         \
>> +       void *map;                                              \
>> +                                                               \
>> +       task = bpf_get_current_task_btf();                      \
>> +       for (i = 0; i < 1000; i++) {                            \
>> +               if (do_lookup(i, task))                         \
>> +                       return 0;                               \
>> +               if (interleave && i % 3 == 0)                   \
>> +                       do_lookup(0, task);                     \
>> +       }                                                       \
>> +       return 0;                                               \
>> +}
> 
> I think
> 
> 
> const volatile use_local_storage; /* set from user-space */
> 
> 
> if (use_local_storage) {
>     do_lookup_storage()
> } else {
>     do_lookup_hashmap()
> }
> 
> is as clear (if not clearer) than having three separate skeletons
> built from single #include header parameterized by extra #defines.
> 
>> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>> new file mode 100644
> 
> [...]

Patch

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4030dd6cbc34..6095f6af2ad1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -560,6 +560,9 @@  $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
+				  $(OUTPUT)/local_storage_bench__get_int.skel.h \
+				  $(OUTPUT)/local_storage_bench__hashmap.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -571,7 +574,8 @@  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_ringbufs.o \
 		 $(OUTPUT)/bench_bloom_filter_map.o \
 		 $(OUTPUT)/bench_bpf_loop.o \
-		 $(OUTPUT)/bench_strncmp.o
+		 $(OUTPUT)/bench_strncmp.o \
+		 $(OUTPUT)/bench_local_storage.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index f061cc20e776..71271062f68d 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -150,6 +150,53 @@  void ops_report_final(struct bench_res res[], int res_cnt)
 	printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
 }
 
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns)
+{
+	double important_hits_per_sec, hits_per_sec;
+	double delta_sec = delta_ns / 1000000000.0;
+
+	hits_per_sec = res->hits / 1000000.0 / delta_sec;
+	important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
+
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s ", hits_per_sec);
+	printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
+}
+
+void local_storage_report_final(struct bench_res res[], int res_cnt)
+{
+	double important_hits_mean = 0.0, important_hits_stddev = 0.0;
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			important_hits_stddev +=
+				       (important_hits_mean - res[i].important_hits / 1000000.0) *
+				       (important_hits_mean - res[i].important_hits / 1000000.0) /
+				       (res_cnt - 1.0);
+		}
+
+		hits_stddev = sqrt(hits_stddev);
+		important_hits_stddev = sqrt(important_hits_stddev);
+	}
+	printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
+	       hits_mean, hits_stddev);
+	printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
+	printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
+	       important_hits_mean, important_hits_stddev);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
 const char argp_program_doc[] =
@@ -188,12 +235,14 @@  static const struct argp_option opts[] = {
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
 extern struct argp bench_bpf_loop_argp;
+extern struct argp bench_local_storage_argp;
 extern struct argp bench_strncmp_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
+	{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{},
 };
@@ -396,6 +445,9 @@  extern const struct bench bench_hashmap_with_bloom;
 extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_local_storage_cache_seq_get;
+extern const struct bench bench_local_storage_cache_interleaved_get;
+extern const struct bench bench_local_storage_cache_hashmap_control;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -430,6 +482,9 @@  static const struct bench *benchs[] = {
 	&bench_bpf_loop,
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
+	&bench_local_storage_cache_seq_get,
+	&bench_local_storage_cache_interleaved_get,
+	&bench_local_storage_cache_hashmap_control,
 };
 
 static void setup_benchmark()
@@ -547,5 +602,7 @@  int main(int argc, char **argv)
 		bench->report_final(state.results + env.warmup_sec,
 				    state.res_cnt - env.warmup_sec);
 
+	if (bench->teardown)
+		bench->teardown();
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index fb3e213df3dc..0a137eedc959 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -34,12 +34,14 @@  struct bench_res {
 	long hits;
 	long drops;
 	long false_hits;
+	long important_hits;
 };
 
 struct bench {
 	const char *name;
 	void (*validate)(void);
 	void (*setup)(void);
+	void (*teardown)(void);
 	void *(*producer_thread)(void *ctx);
 	void *(*consumer_thread)(void *ctx);
 	void (*measure)(struct bench_res* res);
@@ -61,6 +63,9 @@  void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
 void false_hits_report_final(struct bench_res res[], int res_cnt);
 void ops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void ops_report_final(struct bench_res res[], int res_cnt);
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns);
+void local_storage_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns(void)
 {
diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage.c b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
new file mode 100644
index 000000000000..96bc91f1f994
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
@@ -0,0 +1,332 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <argp.h>
+#include <linux/btf.h>
+
+#include "local_storage_bench__get_int.skel.h"
+#include "local_storage_bench__get_seq.skel.h"
+#include "local_storage_bench__hashmap.skel.h"
+#include "bench.h"
+
+#include <test_btf.h>
+
+static struct {
+	__u32 nr_maps;
+} args = {
+	.nr_maps = 100,
+};
+
+enum {
+	ARG_NR_MAPS = 6000,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0,
+		"Set number of local_storage maps"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long ret;
+
+	switch (key) {
+	case ARG_NR_MAPS:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "invalid nr_maps");
+			argp_usage(state);
+		}
+		args.nr_maps = ret;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_local_storage_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.producer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
+		exit(1);
+	}
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (args.nr_maps > 1000) {
+		fprintf(stderr, "nr_maps must be <= 1000\n");
+		exit(1);
+	}
+}
+
+/* Keep in sync w/ array of maps in bpf */
+#define MAX_NR_MAPS 1000
+
+static struct {
+	void (*destroy_skel)(void *obj);
+	int (*load_skel)(void *obj);
+	long *important_hits;
+	long *hits;
+	void *progs;
+	void *skel;
+	struct bpf_map *array_of_maps;
+	struct bpf_link *attached_prog;
+	int created_maps[MAX_NR_MAPS];
+} ctx;
+
+static void teardown(void)
+{
+	int i;
+
+	bpf_link__detach(ctx.attached_prog);
+
+	if (ctx.destroy_skel && ctx.skel)
+		ctx.destroy_skel(ctx.skel);
+
+	for (i = 0; i < MAX_NR_MAPS; i++) {
+		if (!ctx.created_maps[i])
+			break;
+		close(ctx.created_maps[i]);
+	}
+}
+
+static int setup_inner_map_and_load(int inner_fd)
+{
+	int err, mim_fd;
+
+	err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
+	if (err)
+		return -1;
+
+	err = ctx.load_skel(ctx.skel);
+	if (err)
+		return -1;
+
+	mim_fd = bpf_map__fd(ctx.array_of_maps);
+	if (mim_fd < 0)
+		return -1;
+
+	return mim_fd;
+}
+
+static int load_btf(void)
+{
+	static const char btf_str_sec[] = "\0";
+	__u32 btf_raw_types[] = {
+		/* int */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+	};
+	struct btf_header btf_hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	__u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
+				sizeof(btf_str_sec)];
+
+	memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
+	memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
+	memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
+	       btf_str_sec, sizeof(btf_str_sec));
+
+	return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
+}
+
+static void __setup(struct bpf_program *prog, bool hashmap)
+{
+	int i, fd, mim_fd, err;
+	int btf_fd = 0;
+
+	LIBBPF_OPTS(bpf_map_create_opts, create_opts);
+
+	memset(&ctx.created_maps, 0, MAX_NR_MAPS * sizeof(int));
+
+	btf_fd = load_btf();
+	create_opts.btf_fd = btf_fd;
+	create_opts.btf_key_type_id = 1;
+	create_opts.btf_value_type_id = 1;
+	if (!hashmap)
+		create_opts.map_flags = BPF_F_NO_PREALLOC;
+
+	mim_fd = 0;
+	for (i = 0; i < args.nr_maps; i++) {
+		if (hashmap)
+			fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(int),
+					    sizeof(int), 65536, &create_opts);
+		else
+			fd = bpf_map_create(BPF_MAP_TYPE_TASK_STORAGE, NULL, sizeof(int),
+					    sizeof(int), 0, &create_opts);
+		if (fd < 0) {
+			fprintf(stderr, "Error creating map %d\n", i);
+			goto err_out;
+		}
+
+		if (i == 0) {
+			mim_fd = setup_inner_map_and_load(fd);
+			if (mim_fd < 0) {
+				fprintf(stderr, "Error doing setup_inner_map_and_load\n");
+				goto err_out;
+			}
+		}
+
+		err = bpf_map_update_elem(mim_fd, &i, &fd, 0);
+		if (err) {
+			fprintf(stderr, "Error updating array-of-maps w/ map %d\n", i);
+			goto err_out;
+		}
+		ctx.created_maps[i] = fd;
+	}
+	close(btf_fd);
+
+	ctx.attached_prog = bpf_program__attach(prog);
+	if (!ctx.attached_prog) {
+		fprintf(stderr, "Error attaching bpf program\n");
+		goto err_out;
+	}
+
+	return;
+err_out:
+	if (btf_fd)
+		close(btf_fd);
+	teardown();
+	exit(1);
+}
+
+static void hashmap_setup(void)
+{
+	struct local_storage_bench__hashmap *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__hashmap__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__hashmap__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__hashmap__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(skel->progs.get_local, true);
+}
+
+static void local_storage_cache_get_setup(void)
+{
+	struct local_storage_bench__get_seq *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__get_seq__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__get_seq__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_seq__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(skel->progs.get_local, false);
+}
+
+static void local_storage_cache_get_interleaved_setup(void)
+{
+	struct local_storage_bench__get_int *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__get_int__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__get_int__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_int__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(skel->progs.get_local, false);
+}
+
+static void measure(struct bench_res *res)
+{
+	if (ctx.hits)
+		res->hits = atomic_swap(ctx.hits, 0);
+	if (ctx.important_hits)
+		res->important_hits = atomic_swap(ctx.important_hits, 0);
+}
+
+static inline void trigger_bpf_program(void)
+{
+	syscall(__NR_getpgid);
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		trigger_bpf_program();
+
+	return NULL;
+}
+
+/* The cache sequential and interleaved get benchs test local_storage get
+ * performance; specifically, they demonstrate the performance cliff of the
+ * current list-plus-cache local_storage model.
+ *
+ * cache sequential get: call bpf_task_storage_get on n maps in order
+ * cache interleaved get: like "sequential get", but interleave 4 calls to the
+ *	'important' map (idx 0 in array_of_maps) for every 10 calls. The goal
+ *	is to mimic an environment where many progs access their local_storage
+ *	maps, with 'our' prog needing to access its map more often than others.
+ */
+const struct bench bench_local_storage_cache_seq_get = {
+	.name = "local-storage-cache-seq-get",
+	.validate = validate,
+	.setup = local_storage_cache_get_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
+
+const struct bench bench_local_storage_cache_interleaved_get = {
+	.name = "local-storage-cache-int-get",
+	.validate = validate,
+	.setup = local_storage_cache_get_interleaved_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
+
+const struct bench bench_local_storage_cache_hashmap_control = {
+	.name = "local-storage-cache-hashmap-control",
+	.validate = validate,
+	.setup = hashmap_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
new file mode 100755
index 000000000000..479096c47c93
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
@@ -0,0 +1,21 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+header "Local Storage"
+subtitle "Hashmap Control w/ 500 maps"
+	summarize_local_storage "hashmap (control) sequential    get: "\
+		"$(./bench --nr_maps 500 local-storage-cache-hashmap-control)"
+	printf "\n"
+
+for i in 1 10 16 17 24 32 100 1000; do
+subtitle "num_maps: $i"
+	summarize_local_storage "local_storage cache sequential  get: "\
+		"$(./bench --nr_maps $i local-storage-cache-seq-get)"
+	summarize_local_storage "local_storage cache interleaved get: "\
+		"$(./bench --nr_maps $i local-storage-cache-int-get)"
+	printf "\n"
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
index 6c5e6023a69f..d9f40af82006 100644
--- a/tools/testing/selftests/bpf/benchs/run_common.sh
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -41,6 +41,16 @@  function ops()
 	echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
 }
 
+function local_storage()
+{
+	echo -n "hits throughput: "
+	echo -n "$*" | sed -E "s/.* hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+	echo -n -e ", hits latency: "
+	echo -n "$*" | sed -E "s/.* hits latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
+	echo -n ", important_hits throughput: "
+	echo "$*" | sed -E "s/.*important_hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+}
+
 function total()
 {
 	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
@@ -67,6 +77,13 @@  function summarize_ops()
 	printf "%-20s %s\n" "$bench" "$(ops $summary)"
 }
 
+function summarize_local_storage()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(local_storage $summary)"
+}
+
 function summarize_total()
 {
 	bench="$1"
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
new file mode 100644
index 000000000000..88ccfe178641
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
@@ -0,0 +1,63 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, int);
+} array_of_maps SEC(".maps");
+
+long important_hits;
+long hits;
+
+#ifdef LOOKUP_HASHMAP
+static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
+{
+	void *map;
+	int zero = 0;
+
+	map = bpf_map_lookup_elem(&array_of_maps, &elem);
+	if (!map)
+		return -1;
+
+	bpf_map_lookup_elem(map, &zero);
+	__sync_add_and_fetch(&hits, 1);
+	if (!elem)
+		__sync_add_and_fetch(&important_hits, 1);
+	return 0;
+}
+#else
+static int do_lookup(unsigned int elem, struct task_struct *task)
+{
+	void *map;
+
+	map = bpf_map_lookup_elem(&array_of_maps, &elem);
+	if (!map)
+		return -1;
+
+	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
+	__sync_add_and_fetch(&hits, 1);
+	if (!elem)
+		__sync_add_and_fetch(&important_hits, 1);
+	return 0;
+}
+#endif /* LOOKUP_HASHMAP */
+
+#define TASK_STORAGE_GET_LOOP_PROG(interleave)			\
+SEC("fentry/" SYS_PREFIX "sys_getpgid")			\
+int get_local(void *ctx)					\
+{								\
+	struct task_struct *task;				\
+	unsigned int i;						\
+	void *map;						\
+								\
+	task = bpf_get_current_task_btf();			\
+	for (i = 0; i < 1000; i++) {				\
+		if (do_lookup(i, task))				\
+			return 0;				\
+		if (interleave && i % 3 == 0)			\
+			do_lookup(0, task);			\
+	}							\
+	return 0;						\
+}
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
new file mode 100644
index 000000000000..c45da01c026d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
@@ -0,0 +1,12 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG(true);
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
new file mode 100644
index 000000000000..076d54e5dbdf
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
@@ -0,0 +1,12 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG(false);
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
new file mode 100644
index 000000000000..4e199479bc9c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
@@ -0,0 +1,13 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#define LOOKUP_HASHMAP
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG(false);
+
+char _license[] SEC("license") = "GPL";
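
On the user-space side, the single-skeleton suggestion from the review would
also collapse the three per-skeleton setup functions in bench_local_storage.c
into one. A rough sketch, assuming a merged skeleton named local_storage_bench
that exposes the use_local_storage rodata variable (both names are
illustrative, not part of the posted patch):

static void single_skel_setup(bool use_local_storage)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	if (!skel) {
		fprintf(stderr, "Error opening skeleton\n");
		exit(1);
	}

	/* const volatile globals live in .rodata and must be set before
	 * load; loading itself still happens later, once the inner map fd
	 * has been plugged into array_of_maps (setup_inner_map_and_load).
	 */
	skel->rodata->use_local_storage = use_local_storage;

	ctx.skel = skel;
	ctx.hits = &skel->bss->hits;
	ctx.important_hits = &skel->bss->important_hits;
	ctx.load_skel = (int (*)(void *))local_storage_bench__load;
	ctx.destroy_skel = (void (*)(void *))local_storage_bench__destroy;
	ctx.array_of_maps = skel->maps.array_of_maps;

	__setup(skel->progs.get_local, !use_local_storage);
}

Each bench's setup callback would then reduce to calling this with the right
flag value, with the seq/interleaved distinction set the same way through a
second rodata variable, instead of duplicating the field assignments three
times.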