[v4,bpf-next,1/2] selftests/bpf: Add benchmark for local_storage get

Add a benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond current
local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage.

Addition of this benchmark is inspired by conversation with Alexei in a
previous patchset's thread [0], which highlighted the need for such a
benchmark to motivate and validate improvements to local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so effect of
future work on both is clear.

Regarding the benchmark results. On a powerful system (Skylake, 20
cores, 256gb ram):

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 48.338 ± 2.366 M ops/s, hits latency: 20.688 ns/op, important_hits throughput: 0.097 ± 0.005 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 44.503 ± 1.080 M ops/s, hits latency: 22.470 ns/op, important_hits throughput: 44.503 ± 1.080 M ops/s
local_storage cache interleaved get:  hits throughput: 54.963 ± 0.586 M ops/s, hits latency: 18.194 ns/op, important_hits throughput: 54.963 ± 0.586 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 43.743 ± 0.418 M ops/s, hits latency: 22.861 ns/op, important_hits throughput: 4.374 ± 0.042 M ops/s
local_storage cache interleaved get:  hits throughput: 50.073 ± 0.609 M ops/s, hits latency: 19.971 ns/op, important_hits throughput: 17.883 ± 0.217 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 43.962 ± 0.525 M ops/s, hits latency: 22.747 ns/op, important_hits throughput: 2.748 ± 0.033 M ops/s
local_storage cache interleaved get:  hits throughput: 48.166 ± 0.825 M ops/s, hits latency: 20.761 ns/op, important_hits throughput: 15.326 ± 0.263 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 33.207 ± 0.461 M ops/s, hits latency: 30.114 ns/op, important_hits throughput: 1.956 ± 0.027 M ops/s
local_storage cache interleaved get:  hits throughput: 43.540 ± 0.265 M ops/s, hits latency: 22.968 ns/op, important_hits throughput: 13.255 ± 0.081 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 19.402 ± 0.348 M ops/s, hits latency: 51.542 ns/op, important_hits throughput: 0.809 ± 0.015 M ops/s
local_storage cache interleaved get:  hits throughput: 22.981 ± 0.487 M ops/s, hits latency: 43.514 ns/op, important_hits throughput: 6.465 ± 0.137 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 13.378 ± 0.220 M ops/s, hits latency: 74.748 ns/op, important_hits throughput: 0.419 ± 0.007 M ops/s
local_storage cache interleaved get:  hits throughput: 16.894 ± 0.172 M ops/s, hits latency: 59.193 ns/op, important_hits throughput: 4.716 ± 0.048 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 6.070 ± 0.140 M ops/s, hits latency: 164.745 ns/op, important_hits throughput: 0.061 ± 0.001 M ops/s
local_storage cache interleaved get:  hits throughput: 7.323 ± 0.149 M ops/s, hits latency: 136.554 ns/op, important_hits throughput: 1.913 ± 0.039 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 0.438 ± 0.012 M ops/s, hits latency: 2281.369 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get:  hits throughput: 0.522 ± 0.010 M ops/s, hits latency: 1913.937 ns/op, important_hits throughput: 0.131 ± 0.003 M ops/s

Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" is creating maps 0..n in order and
then doing bpf_task_storage_get calls in the same order, the benchmark
is effectively ensuring that a map will not be in cache when the program
tries to access it.

For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.

To get a sense of the overhead of the benchmark program, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.c and ran the benchmark on the same host as the
'real' run. Results:

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 96.965 ± 1.346 M ops/s, hits latency: 10.313 ns/op, important_hits throughput: 0.194 ± 0.003 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 105.792 ± 1.860 M ops/s, hits latency: 9.453 ns/op, important_hits throughput: 105.792 ± 1.860 M ops/s
local_storage cache interleaved get:  hits throughput: 185.847 ± 4.014 M ops/s, hits latency: 5.381 ns/op, important_hits throughput: 185.847 ± 4.014 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 109.867 ± 1.358 M ops/s, hits latency: 9.102 ns/op, important_hits throughput: 10.987 ± 0.136 M ops/s
local_storage cache interleaved get:  hits throughput: 144.165 ± 1.256 M ops/s, hits latency: 6.936 ns/op, important_hits throughput: 51.487 ± 0.449 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 109.258 ± 1.902 M ops/s, hits latency: 9.153 ns/op, important_hits throughput: 6.829 ± 0.119 M ops/s
local_storage cache interleaved get:  hits throughput: 140.248 ± 1.836 M ops/s, hits latency: 7.130 ns/op, important_hits throughput: 44.624 ± 0.584 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 116.397 ± 7.679 M ops/s, hits latency: 8.591 ns/op, important_hits throughput: 6.856 ± 0.452 M ops/s
local_storage cache interleaved get:  hits throughput: 128.411 ± 4.927 M ops/s, hits latency: 7.787 ns/op, important_hits throughput: 39.093 ± 1.500 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 110.890 ± 0.976 M ops/s, hits latency: 9.018 ns/op, important_hits throughput: 4.624 ± 0.041 M ops/s
local_storage cache interleaved get:  hits throughput: 133.316 ± 1.889 M ops/s, hits latency: 7.501 ns/op, important_hits throughput: 37.503 ± 0.531 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 112.900 ± 1.171 M ops/s, hits latency: 8.857 ns/op, important_hits throughput: 3.534 ± 0.037 M ops/s
local_storage cache interleaved get:  hits throughput: 132.844 ± 1.207 M ops/s, hits latency: 7.528 ns/op, important_hits throughput: 37.081 ± 0.337 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 110.025 ± 4.714 M ops/s, hits latency: 9.089 ns/op, important_hits throughput: 1.100 ± 0.047 M ops/s
local_storage cache interleaved get:  hits throughput: 131.979 ± 5.013 M ops/s, hits latency: 7.577 ns/op, important_hits throughput: 34.472 ± 1.309 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 117.850 ± 2.423 M ops/s, hits latency: 8.485 ns/op, important_hits throughput: 0.118 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 141.268 ± 9.658 M ops/s, hits latency: 7.079 ns/op, important_hits throughput: 35.476 ± 2.425 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

hashmap_control:     ~10.4ns
sequential_get_1:    ~13.0ns
sequential_get_10:   ~13.8ns
sequential_get_16:   ~13.6ns
sequential_get_17:   ~21.5ns
sequential_get_24:   ~42.5ns
sequential_get_32:   ~65.9ns
sequential_get_100:  ~155.7ns
sequential_get_1000: ~2270ns

Clearly demonstrating a cliff.

When running the benchmarks it may be necessary to bump 'open files'
ulimit for a successful run.

  [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
Changelog:

v3 -> v4:
	* Remove ifs guarding increments in measure fn (Andrii)
	* Refactor to use 1 bpf prog for all 3 benchmarks w/ global vars set
	  from userspace before load to control behavior (Andrii)
	* Greatly reduce variance in overhead by having benchmark bpf prog
	  loop 10k times regardless of map count (Andrii)
		* Also, move sync_fetch_and_incr out of do_lookup as the guaranteed
		  second sync_fetch_and_incr call for num_maps = 1 was adding
		  overhead
	* Add second patch refactoring bench.c's mean/stddev calculations
	  in reporting helper fns

v2 -> v3:
  * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
    so just use 1 program (Alexei)

v1 -> v2:
  * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
    configurable # of maps (Andrii)
  * Add hashmap benchmark (Alexei)
  * Add discussion of overhead

 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  55 ++++
 tools/testing/selftests/bpf/bench.h           |   4 +
 .../bpf/benchs/bench_local_storage.c          | 250 ++++++++++++++++++
 .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
 .../selftests/bpf/benchs/run_common.sh        |  17 ++
 .../selftests/bpf/progs/local_storage_bench.c |  99 +++++++
 7 files changed, 449 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.c

Message ID	20220530202711.2594486-1-davemarchevsky@fb.com (mailing list archive)
State	Superseded
Delegated to:	BPF
Headers	show Return-Path: <bpf-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6E836C433FE for <bpf@archiver.kernel.org>; Mon, 30 May 2022 20:27:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231893AbiE3U1a (ORCPT <rfc822;bpf@archiver.kernel.org>); Mon, 30 May 2022 16:27:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35172 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232989AbiE3U12 (ORCPT <rfc822;bpf@vger.kernel.org>); Mon, 30 May 2022 16:27:28 -0400 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0050C71DB4 for <bpf@vger.kernel.org>; Mon, 30 May 2022 13:27:25 -0700 (PDT) Received: from pps.filterd (m0044010.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 24U3g7ch014378 for <bpf@vger.kernel.org>; Mon, 30 May 2022 13:27:25 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : content-type : content-transfer-encoding : mime-version; s=facebook; bh=UxEOjqn1Rx5yDl5kxyGJwUpkshHFf6fsnuX+jwONbPA=; b=DbICciPo0GOUWJFCHMvdTdyjyz/bUi4uWu2LkQEYJcjscI6YOm+ruTTT38+3TAewcowQ xJbm9XzSzDlXg4He8oxRjPJ99ym4oZhExiKvpWr4V0QFglO8jcOxYs9w857VLsOjjsUQ BoJrDAoc99lWTucO7zjOwyd8gOrf72fgJCw= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3gbf2jjab9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for <bpf@vger.kernel.org>; Mon, 30 May 2022 13:27:25 -0700 Received: from twshared14577.08.ash8.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.28; Mon, 30 May 2022 13:27:23 -0700 Received: by devbig077.ldc1.facebook.com (Postfix, from userid 158236) id 1C57386006F0; Mon, 30 May 2022 13:27:12 -0700 (PDT) From: Dave Marchevsky <davemarchevsky@fb.com> To: <bpf@vger.kernel.org> CC: Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <kafai@fb.com>, Kernel Team <kernel-team@fb.com>, Dave Marchevsky <davemarchevsky@fb.com> Subject: [PATCH v4 bpf-next 1/2] selftests/bpf: Add benchmark for local_storage get Date: Mon, 30 May 2022 13:27:10 -0700 Message-ID: <20220530202711.2594486-1-davemarchevsky@fb.com> X-Mailer: git-send-email 2.30.2 Content-Type: text/plain; charset="UTF-8" X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: cWcuzxDs4xQTqSuh4YpyL92ktdrUgE_i X-Proofpoint-GUID: cWcuzxDs4xQTqSuh4YpyL92ktdrUgE_i Content-Transfer-Encoding: quoted-printable X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-05-30_09,2022-05-30_03,2022-02-23_01 Precedence: bulk List-ID: <bpf.vger.kernel.org> X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net
Series	[v4,bpf-next,1/2] selftests/bpf: Add benchmark for local_storage get \| expand [v4,bpf-next,1/2] selftests/bpf: Add benchmark for local_storage get [v4,bpf-next,2/2] selftests/bpf: refactor bench reporting functions

Context	Check	Description
netdev/tree_selection	success	Clearly marked for bpf-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers	warning	8 maintainers not CCed: netdev@vger.kernel.org songliubraving@fb.com linux-kselftest@vger.kernel.org yhs@fb.com joannekoong@fb.com john.fastabend@gmail.com shuah@kernel.org kpsingh@kernel.org
netdev/build_clang	success	Errors and warnings before: 0 this patch: 0
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/check_selftest	success	net selftest script(s) already in Makefile
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 0 this patch: 0
netdev/checkpatch	warning	WARNING: Use of volatile is usually wrong: see Documentation/process/volatile-considered-harmful.rst WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 106 exceeds 80 columns WARNING: line length of 112 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	fail	Was 0 now: 1
bpf/vmtest-bpf-next-PR	success	PR summary
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for Kernel LATEST on z15 with gcc

[v4,bpf-next,1/2] selftests/bpf: Add benchmark for local_storage get

Checks

Commit Message

Comments

Patch