
[v3,bpf-next] selftests/bpf: Add benchmark for local_storage get

Message ID 20220521045958.3405148-1-davemarchevsky@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Series [v3,bpf-next] selftests/bpf: Add benchmark for local_storage get

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline fail Detected static functions without inline keyword in header files: 1
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 8 maintainers not CCed: joannekoong@fb.com kpsingh@kernel.org john.fastabend@gmail.com yhs@fb.com songliubraving@fb.com netdev@vger.kernel.org linux-kselftest@vger.kernel.org shuah@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning CHECK: Macro argument 'interleave' may be better as '(interleave)' to avoid precedence issues WARNING: Macros with flow control statements should be avoided WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 106 exceeds 80 columns WARNING: line length of 112 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline fail Was 0 now: 1
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15

Commit Message

Dave Marchevsky May 21, 2022, 4:59 a.m. UTC
Add benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond the current
local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.
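
In rough outline, each invocation of the benchmark's BPF program runs a
loop like the following (summarized from progs/local_storage_bench.h in
this patch; 'interleave' is a compile-time parameter set only for the
"interleaved get" variant):

  task = bpf_get_current_task_btf();
  for (i = 0; i < 1000; i++) {
          if (do_lookup(i, task))          /* stops once map i doesn't exist */
                  return 0;
          if (interleave && i % 3 == 0)    /* extra get of important map 0 */
                  do_lookup(0, task);
  }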

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage.

The addition of this benchmark is inspired by a conversation with Alexei
in a previous patchset's thread [0], which highlighted the need for such
a benchmark to motivate and validate improvements to the local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected,
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so the effect
of future work on both is clear.

Regarding the benchmark results, gathered on a powerful system (Skylake,
20 cores, 256GB RAM):

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s

Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that the
current local_storage implementation assigns a cache_idx to maps as they
are created. Since "sequential get" creates maps 0..n in order and
then does bpf_task_storage_get calls in the same order, the benchmark
effectively ensures that a map will not be in cache when the program
tries to access it.
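
For reference, the lookup path being stressed looks conceptually like
the following (a simplified paraphrase of the kernel's bpf_local_storage
lookup; not the exact code, RCU and locking omitted):

  /* fast path: check the map's assigned slot in the 16-entry cache */
  sdata = local_storage->cache[smap->cache_idx];
  if (sdata && sdata->smap == smap)
          return sdata;

  /* slow path: linear walk over every storage attached to this task */
  hlist_for_each_entry(selem, &local_storage->list, snode)
          if (SDATA(selem)->smap == smap)
                  return SDATA(selem);
  return NULL;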

For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.

As evidenced by the unintuitive-looking results for smaller num_maps
benchmark runs, overhead which is amortized across larger num_maps runs
dominates when there are fewer maps. To get a sense of the overhead, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.h and ran the benchmark on the same host as the
'real' run. Results:

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 4.135 ± 0.069 M ops/s, hits latency: 241.831 ns/op, important_hits throughput: 4.135 ± 0.069 M ops/s
local_storage cache interleaved get:  hits throughput: 7.693 ± 0.039 M ops/s, hits latency: 129.982 ns/op, important_hits throughput: 7.693 ± 0.039 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 33.044 ± 0.232 M ops/s, hits latency: 30.262 ns/op, important_hits throughput: 3.304 ± 0.023 M ops/s
local_storage cache interleaved get:  hits throughput: 36.525 ± 1.545 M ops/s, hits latency: 27.378 ns/op, important_hits throughput: 13.045 ± 0.552 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 45.502 ± 1.429 M ops/s, hits latency: 21.977 ns/op, important_hits throughput: 2.844 ± 0.089 M ops/s
local_storage cache interleaved get:  hits throughput: 47.741 ± 1.115 M ops/s, hits latency: 20.946 ns/op, important_hits throughput: 15.190 ± 0.355 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 47.177 ± 0.617 M ops/s, hits latency: 21.197 ns/op, important_hits throughput: 2.775 ± 0.036 M ops/s
local_storage cache interleaved get:  hits throughput: 50.005 ± 0.463 M ops/s, hits latency: 19.998 ns/op, important_hits throughput: 15.219 ± 0.141 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 58.076 ± 0.507 M ops/s, hits latency: 17.219 ns/op, important_hits throughput: 2.420 ± 0.021 M ops/s
local_storage cache interleaved get:  hits throughput: 57.731 ± 0.500 M ops/s, hits latency: 17.322 ns/op, important_hits throughput: 16.237 ± 0.141 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 68.266 ± 0.234 M ops/s, hits latency: 14.649 ns/op, important_hits throughput: 2.133 ± 0.007 M ops/s
local_storage cache interleaved get:  hits throughput: 62.138 ± 2.695 M ops/s, hits latency: 16.093 ns/op, important_hits throughput: 17.341 ± 0.752 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 103.735 ± 2.874 M ops/s, hits latency: 9.640 ns/op, important_hits throughput: 1.037 ± 0.029 M ops/s
local_storage cache interleaved get:  hits throughput: 85.950 ± 1.619 M ops/s, hits latency: 11.635 ns/op, important_hits throughput: 22.450 ± 0.423 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 133.551 ± 1.915 M ops/s, hits latency: 7.488 ns/op, important_hits throughput: 0.134 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 97.579 ± 1.415 M ops/s, hits latency: 10.248 ns/op, important_hits throughput: 24.505 ± 0.355 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

hashmap_control:     ~6.6ns
sequential_get_1:    ~17.9ns
sequential_get_10:   ~18.9ns
sequential_get_16:   ~19.0ns
sequential_get_17:   ~20.2ns
sequential_get_24:   ~42.2ns
sequential_get_32:   ~68.7ns
sequential_get_100:  ~163.3ns
sequential_get_1000: ~2200ns

Clearly demonstrating a cliff.

When running the benchmarks it may be necessary to bump 'open files'
ulimit for a successful run.

  [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
Changelog:

v2 -> v3:
  * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
    so just use 1 program (Alexei)

v1 -> v2:
  * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
    configurable # of maps (Andrii)
  * Add hashmap benchmark (Alexei)
  * Add discussion of overhead

 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/bench.c           |  57 +++
 tools/testing/selftests/bpf/bench.h           |   5 +
 .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
 .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
 .../selftests/bpf/benchs/run_common.sh        |  17 +
 .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
 .../bpf/progs/local_storage_bench__get_int.c  |  12 +
 .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
 .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
 10 files changed, 537 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c

Comments

Andrii Nakryiko May 23, 2022, 9:11 p.m. UTC | #1
On Fri, May 20, 2022 at 10:00 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> Add a benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to normal 'hits' count of all gets. Goal here is
> to mimic scenario where a particular program using one map - the
> important one - is running on a system where many other local_storage
> maps exist and are accessed often.
>
> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
> bpf_task_storage_gets for the important map for every 10 map gets. This
> is meant to highlight performance differences when important map is
> accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark is inspired by conversation with Alexei in a
> previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so effect of
> future work on both is clear.
>
> Regarding the benchmark results. On a powerful system (Skylake, 20
> cores, 256gb ram):
>
> Local Storage
> =============
>         Hashmap Control w/ 500 maps
> hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s
>
>         num_maps: 1
> local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
> local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s

this huge drop in performance at low num_maps is due to syscall and
fentry overhead, is that right? How about making each syscall
invocation do (roughly) the same number of map/storage lookups to
neutralize this overhead? Something like:


const volatile int map_cnt;
const volatile int iter_cnt;

...


for (i = 0; i < iter_cnt; i++) {
    int map_idx = i % map_cnt;

    do_lookup(map_idx, task);

...

}

User-space can calculate iter_cnt to be the closest exact multiple of
map_cnt, or you can just hard-code iter_cnt to a fixed number (something
like 10000 or some high enough value) and live with a slightly uneven
pattern for the last round of looping.


But this way you make syscall/fentry overhead essentially fixed, which
will avoid these counter-intuitive numbers.
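
A minimal user-space sketch of that calculation (illustrative only;
map_cnt/iter_cnt as in the pseudo-code above, written to the skeleton's
.rodata before load):

  /* keep total lookups per program invocation roughly constant */
  skel->rodata->map_cnt = map_cnt;
  skel->rodata->iter_cnt = 10000 / map_cnt * map_cnt; /* closest multiple */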


>
>         num_maps: 10
> local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
> local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s
>
>         num_maps: 16
> local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
> local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s
>
>         num_maps: 17
> local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
> local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s
>
>         num_maps: 24
> local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
> local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s
>
>         num_maps: 32
> local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
> local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s
>
>         num_maps: 100
> local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
> local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s
>
>         num_maps: 1000
> local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
> local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s
>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> current local_storage implementation assigns a cache_idx to maps as they
> are created. Since "sequential get" is creating maps 0..n in order and
> then doing bpf_task_storage_get calls in the same order, the benchmark
> is effectively ensuring that a map will not be in cache when the program
> tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still reduces as #
> maps increases, though.
>
> As evidenced by the unintuitive-looking results for smaller num_maps
> benchmark runs, overhead which is amortized across larger num_maps runs
> dominates when there are fewer maps. To get a sense of the overhead, I
> commented out bpf_task_storage_get/bpf_map_lookup_elem in
> local_storage_bench.h and ran the benchmark on the same host as the
> 'real' run. Results:
>
> Local Storage
> =============
>         Hashmap Control w/ 500 maps
> hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s
>

[...]

>
> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
>
> hashmap_control:     ~6.6ns
> sequential_get_1:    ~17.9ns
> sequential_get_10:   ~18.9ns
> sequential_get_16:   ~19.0ns
> sequential_get_17:   ~20.2ns
> sequential_get_24:   ~42.2ns
> sequential_get_32:   ~68.7ns
> sequential_get_100:  ~163.3ns
> sequential_get_1000: ~2200ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump 'open files'
> ulimit for a successful run.
>
>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> Changelog:
>
> v2 -> v3:
>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>           so just use 1 program (Alexei)
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>         * Add discussion of overhead
>
>  tools/testing/selftests/bpf/Makefile          |   6 +-
>  tools/testing/selftests/bpf/bench.c           |  57 +++
>  tools/testing/selftests/bpf/bench.h           |   5 +
>  .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>  .../selftests/bpf/benchs/run_common.sh        |  17 +
>  .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
>  .../bpf/progs/local_storage_bench__get_int.c  |  12 +
>  .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
>  .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
>  10 files changed, 537 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 4030dd6cbc34..6095f6af2ad1 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
>  $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
>  $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
>  $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
> +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
> +                                 $(OUTPUT)/local_storage_bench__get_int.skel.h \
> +                                 $(OUTPUT)/local_storage_bench__hashmap.skel.h

You really don't need 3 skeletons for this; you can parameterize
everything with 2-3 .rodata variables and have fixed code and a single
skeleton header. It will also simplify your setup code, as you won't
need those callbacks that abstract the specific skeleton away. Much
cleaner and simpler, IMO.

Please, try to simplify this and make it easier to maintain.
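
A minimal sketch of that parameterization (variable names are
illustrative, not from the patch):

  /* BPF side: fixed code, single skeleton */
  const volatile bool use_hashmap;   /* hashmap control vs local_storage */
  const volatile bool interleave;    /* interleaved vs sequential gets */

  /* user-space side, set before skeleton load */
  skel->rodata->use_hashmap = false;
  skel->rodata->interleave = true;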


>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>  $(OUTPUT)/bench: LDLIBS += -lm
>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>                  $(OUTPUT)/bench_ringbufs.o \
>                  $(OUTPUT)/bench_bloom_filter_map.o \
>                  $(OUTPUT)/bench_bpf_loop.o \
> -                $(OUTPUT)/bench_strncmp.o
> +                $(OUTPUT)/bench_strncmp.o \
> +                $(OUTPUT)/bench_local_storage.o
>         $(call msg,BINARY,,$@)
>         $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
>
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index f061cc20e776..71271062f68d 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt)
>         printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
>  }
>
> +void local_storage_report_progress(int iter, struct bench_res *res,
> +                                  long delta_ns)
> +{
> +       double important_hits_per_sec, hits_per_sec;
> +       double delta_sec = delta_ns / 1000000000.0;
> +
> +       hits_per_sec = res->hits / 1000000.0 / delta_sec;
> +       important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
> +
> +       printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
> +
> +       printf("hits %8.3lfM/s ", hits_per_sec);
> +       printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
> +}
> +
> +void local_storage_report_final(struct bench_res res[], int res_cnt)
> +{
> +       double important_hits_mean = 0.0, important_hits_stddev = 0.0;
> +       double hits_mean = 0.0, hits_stddev = 0.0;
> +       int i;
> +
> +       for (i = 0; i < res_cnt; i++) {
> +               hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
> +               important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
> +       }
> +
> +       if (res_cnt > 1)  {
> +               for (i = 0; i < res_cnt; i++) {
> +                       hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
> +                                      (hits_mean - res[i].hits / 1000000.0) /
> +                                      (res_cnt - 1.0);
> +                       important_hits_stddev +=
> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) *
> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) /
> +                                      (res_cnt - 1.0);
> +               }
> +
> +               hits_stddev = sqrt(hits_stddev);
> +               important_hits_stddev = sqrt(important_hits_stddev);
> +       }
> +       printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
> +              hits_mean, hits_stddev);
> +       printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
> +       printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
> +              important_hits_mean, important_hits_stddev);
> +}
> +

We have hits_drops_report_progress/hits_drops_report_final, which use
"hit" and "drop" terminology (admittedly confusing for this set of
benchmarks), but if you ignore the "drop" part, it's exactly what you
need - tracking two independent values (in your case hit and important
hit). You'll get rid of a good chunk of repetitive code with some
statistics in it. Your post-processing scripts will further hide this
detail.

>  const char *argp_program_version = "benchmark";
>  const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
>  const char argp_program_doc[] =
> @@ -188,12 +235,14 @@ static const struct argp_option opts[] = {
>  extern struct argp bench_ringbufs_argp;
>  extern struct argp bench_bloom_map_argp;
>  extern struct argp bench_bpf_loop_argp;
> +extern struct argp bench_local_storage_argp;
>  extern struct argp bench_strncmp_argp;
>

[...]

> +
> +static int setup_inner_map_and_load(int inner_fd)
> +{
> +       int err, mim_fd;
> +
> +       err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
> +       if (err)
> +               return -1;
> +
> +       err = ctx.load_skel(ctx.skel);
> +       if (err)
> +               return -1;
> +
> +       mim_fd = bpf_map__fd(ctx.array_of_maps);
> +       if (mim_fd < 0)
> +               return -1;
> +
> +       return mim_fd;
> +}
> +
> +static int load_btf(void)
> +{
> +       static const char btf_str_sec[] = "\0";
> +       __u32 btf_raw_types[] = {
> +               /* int */
> +               BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
> +       };
> +       struct btf_header btf_hdr = {
> +               .magic = BTF_MAGIC,
> +               .version = BTF_VERSION,
> +               .hdr_len = sizeof(struct btf_header),
> +               .type_len = sizeof(btf_raw_types),
> +               .str_off = sizeof(btf_raw_types),
> +               .str_len = sizeof(btf_str_sec),
> +       };
> +       __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
> +                               sizeof(btf_str_sec)];
> +
> +       memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
> +       memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
> +       memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
> +              btf_str_sec, sizeof(btf_str_sec));
> +
> +       return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
> +}
> +

please try using a declarative map-in-map definition; hopefully it
doesn't influence benchmark results. It will let you avoid this
low-level setup code completely.
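
Something along these lines should work (an illustrative sketch of a
declarative outer map with a task storage inner-map template, not the
exact v4 code):

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
          __uint(max_entries, 1000);
          __uint(key_size, sizeof(int));
          __uint(value_size, sizeof(int));
          __array(values, struct {
                  __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
                  __uint(map_flags, BPF_F_NO_PREALLOC);
                  __type(key, int);
                  __type(value, int);
          });
  } array_of_maps SEC(".maps");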

> +static void __setup(struct bpf_program *prog, bool hashmap)
> +{
> +       int i, fd, mim_fd, err;
> +       int btf_fd = 0;
> +
> +       LIBBPF_OPTS(bpf_map_create_opts, create_opts);
> +

[...]

> +
> +static void measure(struct bench_res *res)
> +{
> +       if (ctx.hits)
> +               res->hits = atomic_swap(ctx.hits, 0);
> +       if (ctx.important_hits)

why these ifs? just swap, measure is called once a second, there is no
need to optimize this

> +               res->important_hits = atomic_swap(ctx.important_hits, 0);
> +}
> +
> +static inline void trigger_bpf_program(void)
> +{
> +       syscall(__NR_getpgid);
> +}
> +

[...]

> +#ifdef LOOKUP_HASHMAP
> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +       void *map;
> +       int zero = 0;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_map_lookup_elem(map, &zero);

shouldn't you use elem here as well to make it a bit more in line with
bpf_task_storage_get()? This fixed zero is too optimistic and
minimizes CPU cache usage, skewing results towards the hashmap. It's
cheaper to access the same location in the hashmap over and over again
vs. randomly jumping over N elements.

> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}
> +#else
> +static int do_lookup(unsigned int elem, struct task_struct *task)
> +{
> +       void *map;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}
> +#endif /* LOOKUP_HASHMAP */
> +
> +#define TASK_STORAGE_GET_LOOP_PROG(interleave)                 \
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
> +int get_local(void *ctx)                                       \
> +{                                                              \
> +       struct task_struct *task;                               \
> +       unsigned int i;                                         \
> +       void *map;                                              \
> +                                                               \
> +       task = bpf_get_current_task_btf();                      \
> +       for (i = 0; i < 1000; i++) {                            \
> +               if (do_lookup(i, task))                         \
> +                       return 0;                               \
> +               if (interleave && i % 3 == 0)                   \
> +                       do_lookup(0, task);                     \
> +       }                                                       \
> +       return 0;                                               \
> +}

I think


const volatile use_local_storage; /* set from user-space */


if (use_local_storage) {
    do_lookup_storage()
} else {
    do_lookup_hashmap()
}

is as clear as (if not clearer than) having three separate skeletons
built from a single #include header parameterized by extra #defines.

> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
> new file mode 100644

[...]
Dave Marchevsky May 30, 2022, 8:41 p.m. UTC | #2
On 5/23/22 5:11 PM, Andrii Nakryiko wrote:   
> On Fri, May 20, 2022 at 10:00 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>
>> Add a benchmarks to demonstrate the performance cliff for local_storage
>> get as the number of local_storage maps increases beyond current
>> local_storage implementation's cache size.
>>
>> "sequential get" and "interleaved get" benchmarks are added, both of
>> which do many bpf_task_storage_get calls on sets of task local_storage
>> maps of various counts, while considering a single specific map to be
>> 'important' and counting task_storage_gets to the important map
>> separately in addition to normal 'hits' count of all gets. Goal here is
>> to mimic scenario where a particular program using one map - the
>> important one - is running on a system where many other local_storage
>> maps exist and are accessed often.
>>
>> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
>> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
>> bpf_task_storage_gets for the important map for every 10 map gets. This
>> is meant to highlight performance differences when important map is
>> accessed far more frequently than non-important maps.
>>
>> A "hashmap control" benchmark is also included for easy comparison of
>> standard bpf hashmap lookup vs local_storage get. The benchmark is
>> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
>> instead of local storage.
>>
>> Addition of this benchmark is inspired by conversation with Alexei in a
>> previous patchset's thread [0], which highlighted the need for such a
>> benchmark to motivate and validate improvements to local_storage
>> implementation. My approach in that series focused on improving
>> performance for explicitly-marked 'important' maps and was rejected
>> with feedback to make more generally-applicable improvements while
>> avoiding explicitly marking maps as important. Thus the benchmark
>> reports both general and important-map-focused metrics, so effect of
>> future work on both is clear.
>>
>> Regarding the benchmark results. On a powerful system (Skylake, 20
>> cores, 256gb ram):
>>
>> Local Storage
>> =============
>>         Hashmap Control w/ 500 maps
>> hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s
>>
>>         num_maps: 1
>> local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
>> local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s
> 
> this is huge drop in performance for num_maps is due to syscall and
> fentry overhead, is that right? How about making each syscall
> invocation do (roughly) the same amount of map/storage lookups per
> invocation to neutralize this overhead? Something like:
> 
> 
> const volatile int map_cnt;
> const volatile int iter_cnt;
> 
> ...
> 
> 
> for (i = 0; i < iter_cnt; i++) {
>     int map_idx = i % map_cnt;
> 
>     do_lookup(map_idx, task);
> 
> ...
> 
> }
> 
> User-space can calculate iter_cnt to be closest exact multiple of
> map_cnt or you can just hard-code iter_cnt to fixed number (something
> like 10000 or some high enough value) and just leave with slightly
> uneven pattern for last round of looping.
> 
> 
> But this way you make syscall/fentry overhead essentially fixed, which
> will avoid these counter-intuitive numbers.
> 

This worked! But I had to use bpf_loop, as the verifier didn't like various
attempts to loop 10k times. Overhead is much more consistent now (see v4 for
numbers).
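
Roughly what the bpf_loop version looks like on the BPF side (an
illustrative sketch, not the exact v4 code; map_cnt is a const volatile
set from user-space, map_cnt > 0, and do_lookup is the same helper as
before):

  static int lookup_cb(__u32 i, void *unused)
  {
          struct task_struct *task = bpf_get_current_task_btf();

          do_lookup(i % map_cnt, task);
          return 0;   /* 0 == keep looping */
  }

  SEC("fentry/" SYS_PREFIX "sys_getpgid")
  int get_local(void *ctx)
  {
          bpf_loop(10000, lookup_cb, NULL, 0);
          return 0;
  }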

>>
>>         num_maps: 10
>> local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
>> local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s
>>
>>         num_maps: 16
>> local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
>> local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s
>>
>>         num_maps: 17
>> local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
>> local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s
>>
>>         num_maps: 24
>> local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
>> local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s
>>
>>         num_maps: 32
>> local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
>> local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s
>>
>>         num_maps: 100
>> local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
>> local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s
>>
>>         num_maps: 1000
>> local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
>> local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s
>>
>> Looking at the "sequential get" results, it's clear that as the
>> number of task local_storage maps grows beyond the current cache size
>> (16), there's a significant reduction in hits throughput. Note that
>> current local_storage implementation assigns a cache_idx to maps as they
>> are created. Since "sequential get" is creating maps 0..n in order and
>> then doing bpf_task_storage_get calls in the same order, the benchmark
>> is effectively ensuring that a map will not be in cache when the program
>> tries to access it.
>>
>> For "interleaved get" results, important-map hits throughput is greatly
>> increased as the important map is more likely to be in cache by virtue
>> of being accessed far more frequently. Throughput still reduces as #
>> maps increases, though.
>>
>> As evidenced by the unintuitive-looking results for smaller num_maps
>> benchmark runs, overhead which is amortized across larger num_maps runs
>> dominates when there are fewer maps. To get a sense of the overhead, I
>> commented out bpf_task_storage_get/bpf_map_lookup_elem in
>> local_storage_bench.h and ran the benchmark on the same host as the
>> 'real' run. Results:
>>
>> Local Storage
>> =============
>>         Hashmap Control w/ 500 maps
>> hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s
>>
> 
> [...]
> 
>>
>> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
>>
>> hashmap_control:     ~6.6ns
>> sequential_get_1:    ~17.9ns
>> sequential_get_10:   ~18.9ns
>> sequential_get_16:   ~19.0ns
>> sequential_get_17:   ~20.2ns
>> sequential_get_24:   ~42.2ns
>> sequential_get_32:   ~68.7ns
>> sequential_get_100:  ~163.3ns
>> sequential_get_1000: ~2200ns
>>
>> Clearly demonstrating a cliff.
>>
>> When running the benchmarks it may be necessary to bump 'open files'
>> ulimit for a successful run.
>>
>>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> Changelog:
>>
>> v2 -> v3:
>>   * Accessing 1k maps in ARRAY_OF_MAPS doesn't hit MAX_USED_MAPS limit,
>>           so just use 1 program (Alexei)
>>
>> v1 -> v2:
>>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>>     configurable # of maps (Andrii)
>>   * Add hashmap benchmark (Alexei)
>>         * Add discussion of overhead
>>
>>  tools/testing/selftests/bpf/Makefile          |   6 +-
>>  tools/testing/selftests/bpf/bench.c           |  57 +++
>>  tools/testing/selftests/bpf/bench.h           |   5 +
>>  .../bpf/benchs/bench_local_storage.c          | 332 ++++++++++++++++++
>>  .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
>>  .../selftests/bpf/benchs/run_common.sh        |  17 +
>>  .../selftests/bpf/progs/local_storage_bench.h |  63 ++++
>>  .../bpf/progs/local_storage_bench__get_int.c  |  12 +
>>  .../bpf/progs/local_storage_bench__get_seq.c  |  12 +
>>  .../bpf/progs/local_storage_bench__hashmap.c  |  13 +
>>  10 files changed, 537 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
>>
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index 4030dd6cbc34..6095f6af2ad1 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -560,6 +560,9 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
>>  $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
>>  $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
>>  $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
>> +$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
>> +                                 $(OUTPUT)/local_storage_bench__get_int.skel.h \
>> +                                 $(OUTPUT)/local_storage_bench__hashmap.skel.h
> 
> You really don't need 3 skeletons for this, you can parameterize
> everything with 2-3 .rodata variables and have fixed code and single
> skeleton header. It will also simplify your setup code, you won't need
> need those callbacks that abstract specific skeleton away. Much
> cleaner and simpler, IMO.
> 
> Please, try to simplify this and make it easier to maintain.
> 
> 
>>  $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
>>  $(OUTPUT)/bench: LDLIBS += -lm
>>  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>> @@ -571,7 +574,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
>>                  $(OUTPUT)/bench_ringbufs.o \
>>                  $(OUTPUT)/bench_bloom_filter_map.o \
>>                  $(OUTPUT)/bench_bpf_loop.o \
>> -                $(OUTPUT)/bench_strncmp.o
>> +                $(OUTPUT)/bench_strncmp.o \
>> +                $(OUTPUT)/bench_local_storage.o
>>         $(call msg,BINARY,,$@)
>>         $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
>>
>> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
>> index f061cc20e776..71271062f68d 100644
>> --- a/tools/testing/selftests/bpf/bench.c
>> +++ b/tools/testing/selftests/bpf/bench.c
>> @@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt)
>>         printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
>>  }
>>
>> +void local_storage_report_progress(int iter, struct bench_res *res,
>> +                                  long delta_ns)
>> +{
>> +       double important_hits_per_sec, hits_per_sec;
>> +       double delta_sec = delta_ns / 1000000000.0;
>> +
>> +       hits_per_sec = res->hits / 1000000.0 / delta_sec;
>> +       important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
>> +
>> +       printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
>> +
>> +       printf("hits %8.3lfM/s ", hits_per_sec);
>> +       printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
>> +}
>> +
>> +void local_storage_report_final(struct bench_res res[], int res_cnt)
>> +{
>> +       double important_hits_mean = 0.0, important_hits_stddev = 0.0;
>> +       double hits_mean = 0.0, hits_stddev = 0.0;
>> +       int i;
>> +
>> +       for (i = 0; i < res_cnt; i++) {
>> +               hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
>> +               important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
>> +       }
>> +
>> +       if (res_cnt > 1)  {
>> +               for (i = 0; i < res_cnt; i++) {
>> +                       hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
>> +                                      (hits_mean - res[i].hits / 1000000.0) /
>> +                                      (res_cnt - 1.0);
>> +                       important_hits_stddev +=
>> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) *
>> +                                      (important_hits_mean - res[i].important_hits / 1000000.0) /
>> +                                      (res_cnt - 1.0);
>> +               }
>> +
>> +               hits_stddev = sqrt(hits_stddev);
>> +               important_hits_stddev = sqrt(important_hits_stddev);
>> +       }
>> +       printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
>> +              hits_mean, hits_stddev);
>> +       printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
>> +       printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
>> +              important_hits_mean, important_hits_stddev);
>> +}
>> +
> 
> We have hits_drops_report_progress/hits_drops_report_final which uses
> "hit" and "drop" terminology (admittedly confusing for this set of
> benchmarks), but if you ignore the "drop" part, it's exactly what you
> need - to track two independent values (in your case hit and important
> hit). You'll get rid of a good chunk of repetitive code with some
> statistics in it. You post-processing scripts will further hide this
> detail.
> 

I considered doing this when working on v1 of the patch, but decided against it
because 'drop' and 'hit' are independent, while 'important_hit' and 'hit' are
not: every important_hit is also a hit.

The repetitive mean / stddev calculations are annoying, though. I added a 2nd
commit to v4 which refactors them.

>>  const char *argp_program_version = "benchmark";
>>  const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
>>  const char argp_program_doc[] =
>> @@ -188,12 +235,14 @@ static const struct argp_option opts[] = {
>>  extern struct argp bench_ringbufs_argp;
>>  extern struct argp bench_bloom_map_argp;
>>  extern struct argp bench_bpf_loop_argp;
>> +extern struct argp bench_local_storage_argp;
>>  extern struct argp bench_strncmp_argp;
>>
> 
> [...]
> 
>> +
>> +static int setup_inner_map_and_load(int inner_fd)
>> +{
>> +       int err, mim_fd;
>> +
>> +       err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
>> +       if (err)
>> +               return -1;
>> +
>> +       err = ctx.load_skel(ctx.skel);
>> +       if (err)
>> +               return -1;
>> +
>> +       mim_fd = bpf_map__fd(ctx.array_of_maps);
>> +       if (mim_fd < 0)
>> +               return -1;
>> +
>> +       return mim_fd;
>> +}
>> +
>> +static int load_btf(void)
>> +{
>> +       static const char btf_str_sec[] = "\0";
>> +       __u32 btf_raw_types[] = {
>> +               /* int */
>> +               BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
>> +       };
>> +       struct btf_header btf_hdr = {
>> +               .magic = BTF_MAGIC,
>> +               .version = BTF_VERSION,
>> +               .hdr_len = sizeof(struct btf_header),
>> +               .type_len = sizeof(btf_raw_types),
>> +               .str_off = sizeof(btf_raw_types),
>> +               .str_len = sizeof(btf_str_sec),
>> +       };
>> +       __u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
>> +                               sizeof(btf_str_sec)];
>> +
>> +       memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
>> +       memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
>> +       memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
>> +              btf_str_sec, sizeof(btf_str_sec));
>> +
>> +       return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
>> +}
>> +
> 
> please try using declarative map-in-map definition, hopefully it
> doesn't influence benchmark results. It will allow to avoid this
> low-level setup code completely.
> 

I forgot to call this out in the v4 changelog, but I made this change.

It was more frustrating than expected, though, as it was still necessary to
grab btf_key_type_id / btf_value_type_id from the inner map, and the helpers
would only return a valid type_id if called before prog load.
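
For reference, the pre-load dance ends up looking something like this
(illustrative; the libbpf getters are real API, but the surrounding code
is a sketch, not the v4 patch):

  /* must run after open but before load, while the declarative inner map
   * template still exists; the ids feed the bpf_map_create_opts used for
   * the dynamically created inner maps */
  inner = bpf_map__inner_map(skel->maps.array_of_maps);
  create_opts.btf_key_type_id = bpf_map__btf_key_type_id(inner);
  create_opts.btf_value_type_id = bpf_map__btf_value_type_id(inner);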

>> +static void __setup(struct bpf_program *prog, bool hashmap)
>> +{
>> +       int i, fd, mim_fd, err;
>> +       int btf_fd = 0;
>> +
>> +       LIBBPF_OPTS(bpf_map_create_opts, create_opts);
>> +
> 
> [...]
> 
>> +
>> +static void measure(struct bench_res *res)
>> +{
>> +       if (ctx.hits)
>> +               res->hits = atomic_swap(ctx.hits, 0);
>> +       if (ctx.important_hits)
> 
> why these ifs? just swap, measure is called once a second, there is no
> need to optimize this
> 
>> +               res->important_hits = atomic_swap(ctx.important_hits, 0);
>> +}
>> +
>> +static inline void trigger_bpf_program(void)
>> +{
>> +       syscall(__NR_getpgid);
>> +}
>> +
> 
> [...]
> 
>> +#ifdef LOOKUP_HASHMAP
>> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
>> +{
>> +       void *map;
>> +       int zero = 0;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_map_lookup_elem(map, &zero);
> 
> shouldn't you use elem here as well to make it a bit more in line with
> bpf_task_storage_get()? This fixed zero is too optimistic and
> minimizes CPU cache usage, skewing results towards hashmap. It's
> cheaper to go access same location in hashmap over and over again, vs
> randomly jumping over N elements
> 

Both the hashmap and local_storage benchmarks always grab key 0 from the
map; v4 makes this clearer. The intent is to measure how long it takes to
access the local_storage map, not a random element within it.

Every other comment here that wasn't responded to was addressed in v4.

>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
>> +#else
>> +static int do_lookup(unsigned int elem, struct task_struct *task)
>> +{
>> +       void *map;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
>> +#endif /* LOOKUP_HASHMAP */
>> +
>> +#define TASK_STORAGE_GET_LOOP_PROG(interleave)                 \
>> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
>> +int get_local(void *ctx)                                       \
>> +{                                                              \
>> +       struct task_struct *task;                               \
>> +       unsigned int i;                                         \
>> +       void *map;                                              \
>> +                                                               \
>> +       task = bpf_get_current_task_btf();                      \
>> +       for (i = 0; i < 1000; i++) {                            \
>> +               if (do_lookup(i, task))                         \
>> +                       return 0;                               \
>> +               if (interleave && i % 3 == 0)                   \
>> +                       do_lookup(0, task);                     \
>> +       }                                                       \
>> +       return 0;                                               \
>> +}
> 
> I think
> 
> 
> const volatile use_local_storage; /* set from user-space */
> 
> 
> if (use_local_storage) {
>     do_lookup_storage()
> } else {
>     do_lookup_hashmap()
> }
> 
> is as clear (if not clearer) than having three separate skeletons
> built from single #include header parameterized by extra #defines.
> 
>> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
>> new file mode 100644
> 
> [...]

Patch

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4030dd6cbc34..6095f6af2ad1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -560,6 +560,9 @@  $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
+				  $(OUTPUT)/local_storage_bench__get_int.skel.h \
+				  $(OUTPUT)/local_storage_bench__hashmap.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -571,7 +574,8 @@  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_ringbufs.o \
 		 $(OUTPUT)/bench_bloom_filter_map.o \
 		 $(OUTPUT)/bench_bpf_loop.o \
-		 $(OUTPUT)/bench_strncmp.o
+		 $(OUTPUT)/bench_strncmp.o \
+		 $(OUTPUT)/bench_local_storage.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index f061cc20e776..71271062f68d 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -150,6 +150,53 @@  void ops_report_final(struct bench_res res[], int res_cnt)
 	printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
 }
 
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns)
+{
+	double important_hits_per_sec, hits_per_sec;
+	double delta_sec = delta_ns / 1000000000.0;
+
+	hits_per_sec = res->hits / 1000000.0 / delta_sec;
+	important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
+
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s ", hits_per_sec);
+	printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
+}
+
+void local_storage_report_final(struct bench_res res[], int res_cnt)
+{
+	double important_hits_mean = 0.0, important_hits_stddev = 0.0;
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			important_hits_stddev +=
+				       (important_hits_mean - res[i].important_hits / 1000000.0) *
+				       (important_hits_mean - res[i].important_hits / 1000000.0) /
+				       (res_cnt - 1.0);
+		}
+
+		hits_stddev = sqrt(hits_stddev);
+		important_hits_stddev = sqrt(important_hits_stddev);
+	}
+	printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
+	       hits_mean, hits_stddev);
+	printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
+	printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
+	       important_hits_mean, important_hits_stddev);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
 const char argp_program_doc[] =
@@ -188,12 +235,14 @@  static const struct argp_option opts[] = {
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
 extern struct argp bench_bpf_loop_argp;
+extern struct argp bench_local_storage_argp;
 extern struct argp bench_strncmp_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
+	{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{},
 };
@@ -396,6 +445,9 @@  extern const struct bench bench_hashmap_with_bloom;
 extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_local_storage_cache_seq_get;
+extern const struct bench bench_local_storage_cache_interleaved_get;
+extern const struct bench bench_local_storage_cache_hashmap_control;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -430,6 +482,9 @@  static const struct bench *benchs[] = {
 	&bench_bpf_loop,
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
+	&bench_local_storage_cache_seq_get,
+	&bench_local_storage_cache_interleaved_get,
+	&bench_local_storage_cache_hashmap_control,
 };
 
 static void setup_benchmark()
@@ -547,5 +602,7 @@  int main(int argc, char **argv)
 		bench->report_final(state.results + env.warmup_sec,
 				    state.res_cnt - env.warmup_sec);
 
+	if (bench->teardown)
+		bench->teardown();
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index fb3e213df3dc..0a137eedc959 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -34,12 +34,14 @@  struct bench_res {
 	long hits;
 	long drops;
 	long false_hits;
+	long important_hits;
 };
 
 struct bench {
 	const char *name;
 	void (*validate)(void);
 	void (*setup)(void);
+	void (*teardown)(void);
 	void *(*producer_thread)(void *ctx);
 	void *(*consumer_thread)(void *ctx);
 	void (*measure)(struct bench_res* res);
@@ -61,6 +63,9 @@  void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
 void false_hits_report_final(struct bench_res res[], int res_cnt);
 void ops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void ops_report_final(struct bench_res res[], int res_cnt);
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns);
+void local_storage_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns(void)
 {
diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage.c b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
new file mode 100644
index 000000000000..96bc91f1f994
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
@@ -0,0 +1,332 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <argp.h>
+#include <linux/btf.h>
+
+#include "local_storage_bench__get_int.skel.h"
+#include "local_storage_bench__get_seq.skel.h"
+#include "local_storage_bench__hashmap.skel.h"
+#include "bench.h"
+
+#include <test_btf.h>
+
+static struct {
+	__u32 nr_maps;
+} args = {
+	.nr_maps = 100,
+};
+
+enum {
+	ARG_NR_MAPS = 6000,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0,
+		"Set number of local_storage maps"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long ret;
+
+	switch (key) {
+	case ARG_NR_MAPS:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "invalid nr_maps");
+			argp_usage(state);
+		}
+		args.nr_maps = ret;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_local_storage_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.producer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
+		exit(1);
+	}
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (args.nr_maps > 1000) {
+		fprintf(stderr, "nr_maps must be <= 1000\n");
+		exit(1);
+	}
+}
+
+/* Keep in sync w/ array of maps in bpf */
+#define MAX_NR_MAPS 1000
+
+static struct {
+	void (*destroy_skel)(void *obj);
+	int (*load_skel)(void *obj);
+	long *important_hits;
+	long *hits;
+	void *progs;
+	void *skel;
+	struct bpf_map *array_of_maps;
+	struct bpf_link *attached_prog;
+	int created_maps[MAX_NR_MAPS];
+} ctx;
+
+static void teardown(void)
+{
+	int i;
+
+	bpf_link__detach(ctx.attached_prog);
+
+	if (ctx.destroy_skel && ctx.skel)
+		ctx.destroy_skel(ctx.skel);
+
+	for (i = 0; i < MAX_NR_MAPS; i++) {
+		if (!ctx.created_maps[i])
+			break;
+		close(ctx.created_maps[i]);
+	}
+}
+
+static int setup_inner_map_and_load(int inner_fd)
+{
+	int err, mim_fd;
+
+	err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
+	if (err)
+		return -1;
+
+	err = ctx.load_skel(ctx.skel);
+	if (err)
+		return -1;
+
+	mim_fd = bpf_map__fd(ctx.array_of_maps);
+	if (mim_fd < 0)
+		return -1;
+
+	return mim_fd;
+}
+
+static int load_btf(void)
+{
+	static const char btf_str_sec[] = "\0";
+	__u32 btf_raw_types[] = {
+		/* int */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+	};
+	struct btf_header btf_hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	__u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
+				sizeof(btf_str_sec)];
+
+	memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
+	memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
+	memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
+	       btf_str_sec, sizeof(btf_str_sec));
+
+	return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
+}
+
+static void __setup(struct bpf_program *prog, bool hashmap)
+{
+	int i, fd, mim_fd, err;
+	int btf_fd = 0;
+
+	LIBBPF_OPTS(bpf_map_create_opts, create_opts);
+
+	memset(&ctx.created_maps, 0, MAX_NR_MAPS * sizeof(int));
+
+	btf_fd = load_btf();
+	create_opts.btf_fd = btf_fd;
+	create_opts.btf_key_type_id = 1;
+	create_opts.btf_value_type_id = 1;
+	if (!hashmap)
+		create_opts.map_flags = BPF_F_NO_PREALLOC;
+
+	mim_fd = 0;
+	for (i = 0; i < args.nr_maps; i++) {
+		if (hashmap)
+			fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(int),
+					    sizeof(int), 65536, &create_opts);
+		else
+			fd = bpf_map_create(BPF_MAP_TYPE_TASK_STORAGE, NULL, sizeof(int),
+					    sizeof(int), 0, &create_opts);
+		if (fd < 0) {
+			fprintf(stderr, "Error creating map %d\n", i);
+			goto err_out;
+		}
+
+		if (i == 0) {
+			mim_fd = setup_inner_map_and_load(fd);
+			if (mim_fd < 0) {
+				fprintf(stderr, "Error doing setup_inner_map_and_load\n");
+				goto err_out;
+			}
+		}
+
+		err = bpf_map_update_elem(mim_fd, &i, &fd, 0);
+		if (err) {
+			fprintf(stderr, "Error updating array-of-maps w/ map %d\n", i);
+			goto err_out;
+		}
+		ctx.created_maps[i] = fd;
+	}
+	close(btf_fd);
+
+	ctx.attached_prog = bpf_program__attach(prog);
+	if (!ctx.attached_prog) {
+		fprintf(stderr, "Error attaching bpf program\n");
+		goto err_out;
+	}
+
+	return;
+err_out:
+	if (btf_fd)
+		close(btf_fd);
+	teardown();
+	exit(1);
+}
+
+static void hashmap_setup(void)
+{
+	struct local_storage_bench__hashmap *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__hashmap__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__hashmap__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__hashmap__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(skel->progs.get_local, true);
+}
+
+static void local_storage_cache_get_setup(void)
+{
+	struct local_storage_bench__get_seq *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__get_seq__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__get_seq__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_seq__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(skel->progs.get_local, false);
+}
+
+static void local_storage_cache_get_interleaved_setup(void)
+{
+	struct local_storage_bench__get_int *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__get_int__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__get_int__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_int__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(skel->progs.get_local, false);
+}
+
+static void measure(struct bench_res *res)
+{
+	if (ctx.hits)
+		res->hits = atomic_swap(ctx.hits, 0);
+	if (ctx.important_hits)
+		res->important_hits = atomic_swap(ctx.important_hits, 0);
+}
+
+static inline void trigger_bpf_program(void)
+{
+	syscall(__NR_getpgid);
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		trigger_bpf_program();
+
+	return NULL;
+}
+
+/* The cache sequential and interleaved get benchs test local_storage get
+ * performance; specifically, they demonstrate the performance cliff of the
+ * current list-plus-cache local_storage model.
+ *
+ * cache sequential get: call bpf_task_storage_get on n maps in order
+ * cache interleaved get: like "sequential get", but interleave 4 calls to the
+ *	'important' map (idx 0 in array_of_maps) for every 10 calls. The goal
+ *	is to mimic an environment where many progs access their local_storage
+ *	maps, with 'our' prog needing to access its map more often than others.
+ */
+const struct bench bench_local_storage_cache_seq_get = {
+	.name = "local-storage-cache-seq-get",
+	.validate = validate,
+	.setup = local_storage_cache_get_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
+
+const struct bench bench_local_storage_cache_interleaved_get = {
+	.name = "local-storage-cache-int-get",
+	.validate = validate,
+	.setup = local_storage_cache_get_interleaved_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
+
+const struct bench bench_local_storage_cache_hashmap_control = {
+	.name = "local-storage-cache-hashmap-control",
+	.validate = validate,
+	.setup = hashmap_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
new file mode 100755
index 000000000000..479096c47c93
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
@@ -0,0 +1,21 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+header "Local Storage"
+subtitle "Hashmap Control w/ 500 maps"
+	summarize_local_storage "hashmap (control) sequential    get: "\
+		"$(./bench --nr_maps 500 local-storage-cache-hashmap-control)"
+	printf "\n"
+
+for i in 1 10 16 17 24 32 100 1000; do
+subtitle "num_maps: $i"
+	summarize_local_storage "local_storage cache sequential  get: "\
+		"$(./bench --nr_maps $i local-storage-cache-seq-get)"
+	summarize_local_storage "local_storage cache interleaved get: "\
+		"$(./bench --nr_maps $i local-storage-cache-int-get)"
+	printf "\n"
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
index 6c5e6023a69f..d9f40af82006 100644
--- a/tools/testing/selftests/bpf/benchs/run_common.sh
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -41,6 +41,16 @@  function ops()
 	echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
 }
 
+function local_storage()
+{
+	echo -n "hits throughput: "
+	echo -n "$*" | sed -E "s/.* hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+	echo -n -e ", hits latency: "
+	echo -n "$*" | sed -E "s/.* hits latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
+	echo -n ", important_hits throughput: "
+	echo "$*" | sed -E "s/.*important_hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+}
+
 function total()
 {
 	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
@@ -67,6 +77,13 @@  function summarize_ops()
 	printf "%-20s %s\n" "$bench" "$(ops $summary)"
 }
 
+function summarize_local_storage()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(local_storage $summary)"
+}
+
 function summarize_total()
 {
 	bench="$1"
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
new file mode 100644
index 000000000000..88ccfe178641
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
@@ -0,0 +1,63 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, int);
+} array_of_maps SEC(".maps");
+
+long important_hits;
+long hits;
+
+#ifdef LOOKUP_HASHMAP
+static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
+{
+	void *map;
+	int zero = 0;
+
+	map = bpf_map_lookup_elem(&array_of_maps, &elem);
+	if (!map)
+		return -1;
+
+	bpf_map_lookup_elem(map, &zero);
+	__sync_add_and_fetch(&hits, 1);
+	if (!elem)
+		__sync_add_and_fetch(&important_hits, 1);
+	return 0;
+}
+#else
+static int do_lookup(unsigned int elem, struct task_struct *task)
+{
+	void *map;
+
+	map = bpf_map_lookup_elem(&array_of_maps, &elem);
+	if (!map)
+		return -1;
+
+	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
+	__sync_add_and_fetch(&hits, 1);
+	if (!elem)
+		__sync_add_and_fetch(&important_hits, 1);
+	return 0;
+}
+#endif /* LOOKUP_HASHMAP */
+
+#define TASK_STORAGE_GET_LOOP_PROG(interleave)			\
+SEC("fentry/" SYS_PREFIX "sys_getpgid")			\
+int get_local(void *ctx)					\
+{								\
+	struct task_struct *task;				\
+	unsigned int i;						\
+	void *map;						\
+								\
+	task = bpf_get_current_task_btf();			\
+	for (i = 0; i < 1000; i++) {				\
+		if (do_lookup(i, task))				\
+			return 0;				\
+		if (interleave && i % 3 == 0)			\
+			do_lookup(0, task);			\
+	}							\
+	return 0;						\
+}
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
new file mode 100644
index 000000000000..c45da01c026d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
@@ -0,0 +1,12 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG(true);
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
new file mode 100644
index 000000000000..076d54e5dbdf
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
@@ -0,0 +1,12 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG(false);
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
new file mode 100644
index 000000000000..4e199479bc9c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
@@ -0,0 +1,13 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#define LOOKUP_HASHMAP
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG(false);
+
+char _license[] SEC("license") = "GPL";
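
On the user-space side, the single-skeleton suggestion from the review would
also collapse the three per-skeleton setup functions in bench_local_storage.c
into one. A rough sketch, assuming a merged skeleton named local_storage_bench
that exposes the use_local_storage rodata variable (both names are
illustrative, not part of the posted patch):

static void single_skel_setup(bool use_local_storage)
{
	struct local_storage_bench *skel;

	setup_libbpf();

	skel = local_storage_bench__open();
	if (!skel) {
		fprintf(stderr, "Error opening skeleton\n");
		exit(1);
	}

	/* const volatile globals live in .rodata and must be set before
	 * load; loading itself still happens later, once the inner map fd
	 * has been plugged into array_of_maps (setup_inner_map_and_load).
	 */
	skel->rodata->use_local_storage = use_local_storage;

	ctx.skel = skel;
	ctx.hits = &skel->bss->hits;
	ctx.important_hits = &skel->bss->important_hits;
	ctx.load_skel = (int (*)(void *))local_storage_bench__load;
	ctx.destroy_skel = (void (*)(void *))local_storage_bench__destroy;
	ctx.array_of_maps = skel->maps.array_of_maps;

	__setup(skel->progs.get_local, !use_local_storage);
}

Each bench's setup callback would then reduce to calling this with the right
flag value, with the seq/interleaved distinction set the same way through a
second rodata variable, instead of duplicating the field assignments three
times.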