diff mbox series

[v2,bpf-next] selftests/bpf: Add benchmark for local_storage get

Message ID 20220518035131.725193-1-davemarchevsky@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series [v2,bpf-next] selftests/bpf: Add benchmark for local_storage get | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline fail Detected static functions without inline keyword in header files: 1
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 8 maintainers not CCed: netdev@vger.kernel.org john.fastabend@gmail.com shuah@kernel.org yhs@fb.com kpsingh@kernel.org linux-kselftest@vger.kernel.org songliubraving@fb.com joannekoong@fb.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning CHECK: Alignment should match open parenthesis CHECK: Macro argument 'interleave' may be better as '(interleave)' to avoid precedence issues CHECK: Macro argument 'start' may be better as '(start)' to avoid precedence issues WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 106 exceeds 80 columns WARNING: line length of 112 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline fail Was 0 now: 1
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15

Commit Message

Dave Marchevsky May 18, 2022, 3:51 a.m. UTC
Add benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond the current
local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.
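
For reference, the core loop of each generated BPF program (condensed
from the __TASK_STORAGE_GET_LOOP_PROG macro added below; each program
covers 50 consecutive map indices, and do_lookup() resolves the inner
map and does the actual bpf_task_storage_get / bpf_map_lookup_elem):

	for (i = 0; i < 50; i++) {
		elem = start + i;               /* next map in the set */
		if (do_lookup(elem, task))      /* counted in 'hits' */
			return 0;
		if (interleave && i % 3 == 0)   /* i = 0, 3, 6, 9, ... gives */
			do_lookup(0, task);     /* 4 important gets per 10 */
	}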

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage.

Addition of this benchmark is inspired by a conversation with Alexei in
a previous patchset's thread [0], which highlighted the need for such a
benchmark to motivate and validate improvements to the local_storage
implementation. My approach in that series focused on improving
performance for explicitly-marked 'important' maps and was rejected
with feedback to make more generally-applicable improvements while
avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so the effect
of future work on both is clear.

Regarding the benchmark results: the numbers below are from a powerful
system (Skylake, 20 cores, 256 GB RAM):

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 64.689 ± 2.806 M ops/s, hits latency: 15.459 ns/op, important_hits throughput: 0.129 ± 0.006 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 3.793 ± 0.101 M ops/s, hits latency: 263.623 ns/op, important_hits throughput: 3.793 ± 0.101 M ops/s
local_storage cache interleaved get:  hits throughput: 6.539 ± 0.209 M ops/s, hits latency: 152.938 ns/op, important_hits throughput: 6.539 ± 0.209 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 20.237 ± 0.439 M ops/s, hits latency: 49.415 ns/op, important_hits throughput: 2.024 ± 0.044 M ops/s
local_storage cache interleaved get:  hits throughput: 22.421 ± 0.874 M ops/s, hits latency: 44.601 ns/op, important_hits throughput: 8.007 ± 0.312 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 25.582 ± 0.346 M ops/s, hits latency: 39.090 ns/op, important_hits throughput: 1.599 ± 0.022 M ops/s
local_storage cache interleaved get:  hits throughput: 26.615 ± 0.601 M ops/s, hits latency: 37.573 ns/op, important_hits throughput: 8.468 ± 0.191 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 22.932 ± 0.436 M ops/s, hits latency: 43.606 ns/op, important_hits throughput: 1.349 ± 0.026 M ops/s
local_storage cache interleaved get:  hits throughput: 24.005 ± 0.462 M ops/s, hits latency: 41.658 ns/op, important_hits throughput: 7.306 ± 0.140 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 16.025 ± 0.821 M ops/s, hits latency: 62.402 ns/op, important_hits throughput: 0.668 ± 0.034 M ops/s
local_storage cache interleaved get:  hits throughput: 17.691 ± 0.744 M ops/s, hits latency: 56.526 ns/op, important_hits throughput: 4.976 ± 0.209 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 11.865 ± 0.180 M ops/s, hits latency: 84.279 ns/op, important_hits throughput: 0.371 ± 0.006 M ops/s
local_storage cache interleaved get:  hits throughput: 14.383 ± 0.108 M ops/s, hits latency: 69.525 ns/op, important_hits throughput: 4.014 ± 0.030 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 6.105 ± 0.190 M ops/s, hits latency: 163.798 ns/op, important_hits throughput: 0.061 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 7.055 ± 0.129 M ops/s, hits latency: 141.746 ns/op, important_hits throughput: 1.843 ± 0.034 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 0.433 ± 0.010 M ops/s, hits latency: 2309.469 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get:  hits throughput: 0.499 ± 0.026 M ops/s, hits latency: 2002.510 ns/op, important_hits throughput: 0.127 ± 0.007 M ops/s

Looking at the "sequential get" results, it's clear that as the
number of task local_storage maps grows beyond the current cache size
(16), there's a significant reduction in hits throughput. Note that
the current local_storage implementation assigns a cache_idx to maps as
are created. Since "sequential get" is creating maps 0..n in order and
then doing bpf_task_storage_get calls in the same order, the benchmark
is effectively ensuring that a map will not be in cache when the program
tries to access it.
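
To illustrate why (a toy model of the behavior described above, not the
kernel implementation; it assumes cache_idx is handed out in creation
order modulo the 16-entry cache, which holds when no other
local_storage maps exist on the system):

#define CACHE_SIZE 16

/* Count cache hits over repeated sequential passes through num_maps
 * maps, where each get caches its map in slot i % CACHE_SIZE and
 * evicts the previous occupant.
 */
static int sequential_pass_hits(int num_maps, int passes)
{
	int cached[CACHE_SIZE];
	int p, i, idx, hits = 0;

	for (i = 0; i < CACHE_SIZE; i++)
		cached[i] = -1;

	for (p = 0; p < passes; p++) {
		for (i = 0; i < num_maps; i++) {
			idx = i % CACHE_SIZE;
			if (cached[idx] == i)
				hits++;          /* map i survived in its slot */
			cached[idx] = i;         /* this get (re)caches map i */
		}
	}
	/* Maps that share a slot never hit after the first pass: map i
	 * is evicted by map i+16 before its next get. Once num_maps
	 * reaches 32, every slot is shared, so every get misses.
	 */
	return hits;
}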

For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.

Note that the test programs need to split task_storage_get calls across
multiple programs to work around the verifier's MAX_USED_MAPS
limitations. As evidenced by the unintuitive-looking results for runs
with smaller num_maps, overhead - which is amortized across more gets
per run when num_maps is larger - dominates when there are fewer maps.
To get a sense of the overhead, I commented out
bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
ran the benchmark on the same host as the 'real' run. Results:

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 115.812 ± 2.513 M ops/s, hits latency: 8.635 ns/op, important_hits throughput: 0.232 ± 0.005 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 4.031 ± 0.033 M ops/s, hits latency: 248.094 ns/op, important_hits throughput: 4.031 ± 0.033 M ops/s
local_storage cache interleaved get:  hits throughput: 7.997 ± 0.088 M ops/s, hits latency: 125.046 ns/op, important_hits throughput: 7.997 ± 0.088 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 34.000 ± 0.077 M ops/s, hits latency: 29.412 ns/op, important_hits throughput: 3.400 ± 0.008 M ops/s
local_storage cache interleaved get:  hits throughput: 37.895 ± 0.670 M ops/s, hits latency: 26.389 ns/op, important_hits throughput: 13.534 ± 0.239 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 46.947 ± 0.283 M ops/s, hits latency: 21.300 ns/op, important_hits throughput: 2.934 ± 0.018 M ops/s
local_storage cache interleaved get:  hits throughput: 47.301 ± 1.027 M ops/s, hits latency: 21.141 ns/op, important_hits throughput: 15.050 ± 0.327 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 45.871 ± 0.414 M ops/s, hits latency: 21.800 ns/op, important_hits throughput: 2.698 ± 0.024 M ops/s
local_storage cache interleaved get:  hits throughput: 46.591 ± 1.969 M ops/s, hits latency: 21.463 ns/op, important_hits throughput: 14.180 ± 0.599 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 58.053 ± 1.043 M ops/s, hits latency: 17.226 ns/op, important_hits throughput: 2.419 ± 0.043 M ops/s
local_storage cache interleaved get:  hits throughput: 58.115 ± 0.377 M ops/s, hits latency: 17.207 ns/op, important_hits throughput: 16.345 ± 0.106 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 68.548 ± 0.820 M ops/s, hits latency: 14.588 ns/op, important_hits throughput: 2.142 ± 0.026 M ops/s
local_storage cache interleaved get:  hits throughput: 63.015 ± 0.963 M ops/s, hits latency: 15.869 ns/op, important_hits throughput: 17.586 ± 0.269 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 95.375 ± 1.286 M ops/s, hits latency: 10.485 ns/op, important_hits throughput: 0.954 ± 0.013 M ops/s
local_storage cache interleaved get:  hits throughput: 76.996 ± 2.614 M ops/s, hits latency: 12.988 ns/op, important_hits throughput: 20.111 ± 0.683 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 119.965 ± 1.386 M ops/s, hits latency: 8.336 ns/op, important_hits throughput: 0.120 ± 0.001 M ops/s
local_storage cache interleaved get:  hits throughput: 92.665 ± 0.788 M ops/s, hits latency: 10.792 ns/op, important_hits throughput: 23.581 ± 0.200 M ops/s

Adjusting for overhead (subtracting the overhead-only latency from the
'real' latency for each configuration; e.g. 62.402 - 17.226 gives the
~45.2ns for sequential_get_24 below), latency numbers for "hashmap
control" and "sequential get" are:

hashmap_control:     ~6.8ns
sequential_get_1:    ~15.5ns
sequential_get_10:   ~20ns
sequential_get_16:   ~17.8ns
sequential_get_17:   ~21.8ns
sequential_get_24:   ~45.2ns
sequential_get_32:   ~69.7ns
sequential_get_100:  ~153.3ns
sequential_get_1000: ~2300ns

Clearly demonstrating a cliff.

When running the benchmarks it may be necessary to bump the 'open
files' ulimit for a successful run.

  [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
Changelog:

v1 -> v2:
  * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
    configurable # of maps (Andrii)
  * Add hashmap benchmark (Alexei)
  * Add discussion of overhead

 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/bench.c           |  57 +++
 tools/testing/selftests/bpf/bench.h           |   5 +
 .../bpf/benchs/bench_local_storage.c          | 345 ++++++++++++++++++
 .../bpf/benchs/run_bench_local_storage.sh     |  21 ++
 .../selftests/bpf/benchs/run_common.sh        |  17 +
 .../selftests/bpf/progs/local_storage_bench.h |  69 ++++
 .../bpf/progs/local_storage_bench__get_int.c  |  31 ++
 .../bpf/progs/local_storage_bench__get_seq.c  |  31 ++
 .../bpf/progs/local_storage_bench__hashmap.c  |  32 ++
 10 files changed, 613 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_local_storage.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench.h
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
 create mode 100644 tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c

Comments

Alexei Starovoitov May 20, 2022, 11:46 p.m. UTC | #1
On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
> maps increases, though.
>
> Note that the test programs need to split task_storage_get calls across
> multiple programs to work around the verifier's MAX_USED_MAPS
> limitations.
...
> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> +       __uint(max_entries, 1000);
> +       __type(key, int);
> +       __type(value, int);
> +} array_of_maps SEC(".maps");
> +
> +long important_hits;
> +long hits;
> +
> +#ifdef LOOKUP_HASHMAP
> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +       void *map;
> +       int zero = 0;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_map_lookup_elem(map, &zero);
> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}

This prog accesses only two maps: array_of_maps and global data
(hidden array map).

Where do you see it's reaching MAX_USED_MAPS limit ?
Dave Marchevsky May 21, 2022, 12:37 a.m. UTC | #2
On 5/20/22 7:46 PM, Alexei Starovoitov wrote:   
> On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>> maps increases, though.
>>
>> Note that the test programs need to split task_storage_get calls across
>> multiple programs to work around the verifier's MAX_USED_MAPS
>> limitations.
> ...
>> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
>> @@ -0,0 +1,69 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +struct {
>> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
>> +       __uint(max_entries, 1000);
>> +       __type(key, int);
>> +       __type(value, int);
>> +} array_of_maps SEC(".maps");
>> +
>> +long important_hits;
>> +long hits;
>> +
>> +#ifdef LOOKUP_HASHMAP
>> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
>> +{
>> +       void *map;
>> +       int zero = 0;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_map_lookup_elem(map, &zero);
>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
> 
> This prog accesses only two maps: array_of_maps and global data
> (hidden array map).
> 
> Where do you see it's reaching MAX_USED_MAPS limit ?

You're right, that's a holdover from v1.

Will fix and send v3 w/ new benchmark results.
Andrii Nakryiko May 21, 2022, 12:41 a.m. UTC | #3
On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> Add a benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to normal 'hits' count of all gets. Goal here is
> to mimic scenario where a particular program using one map - the
> important one - is running on a system where many other local_storage
> maps exist and are accessed often.
>
> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
> bpf_task_storage_gets for the important map for every 10 map gets. This
> is meant to highlight performance differences when important map is
> accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage.
>
> Addition of this benchmark is inspired by conversation with Alexei in a
> previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so effect of
> future work on both is clear.
>
> Regarding the benchmark results. On a powerful system (Skylake, 20
> cores, 256gb ram):
>
> Local Storage
> =============

[...]

>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> current local_storage implementation assigns a cache_idx to maps as they
> are created. Since "sequential get" is creating maps 0..n in order and
> then doing bpf_task_storage_get calls in the same order, the benchmark
> is effectively ensuring that a map will not be in cache when the program
> tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still reduces as #
> maps increases, though.
>
> Note that the test programs need to split task_storage_get calls across
> multiple programs to work around the verifier's MAX_USED_MAPS
> limitations. As evidenced by the unintuitive-looking results for smaller
> num_maps benchmark runs, overhead which is amortized across larger
> num_maps in other runs dominates when there are fewer maps. To get a
> sense of the overhead, I commented out
> bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
> ran the benchmark on the same host as 'real' run. Results:
>
> Local Storage
> =============

[...]

>
> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
>
> hashmap_control:     ~6.8ns
> sequential_get_1:    ~15.5ns
> sequential_get_10:   ~20ns
> sequential_get_16:   ~17.8ns
> sequential_get_17:   ~21.8ns
> sequential_get_24:   ~45.2ns
> sequential_get_32:   ~69.7ns
> sequential_get_100:  ~153.3ns
> sequential_get_1000: ~2300ns
>
> Clearly demonstrating a cliff.
>
> When running the benchmarks it may be necessary to bump 'open files'
> ulimit for a successful run.
>
>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> Changelog:
>
> v1 -> v2:
>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>     configurable # of maps (Andrii)
>   * Add hashmap benchmark (Alexei)
>         * Add discussion of overhead
>

[...]

> +
> +/* Keep in sync w/ array of maps in bpf */
> +#define MAX_NR_MAPS 1000
> +/* Keep in sync w/ number of progs in bpf app */
> +#define MAX_NR_PROGS 20
> +
> +static struct {
> +       void (*destroy_skel)(void *obj);
> +       int (*load_skel)(void *obj);
> +       long *important_hits;
> +       long *hits;
> +       void *progs;
> +       void *skel;
> +       struct bpf_map *array_of_maps;
> +} ctx;
> +
> +int created_maps[MAX_NR_MAPS];
> +struct bpf_link *attached_links[MAX_NR_PROGS];
> +

static?


> +static void teardown(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < MAX_NR_PROGS; i++) {
> +               if (!attached_links[i])
> +                       break;
> +               bpf_link__detach(attached_links[i]);
> +       }
> +
> +       if (ctx.destroy_skel && ctx.skel)
> +               ctx.destroy_skel(ctx.skel);
> +
> +       for (i = 0; i < MAX_NR_MAPS; i++) {
> +               if (!created_maps[i])
> +                       break;
> +               close(created_maps[i]);
> +       }
> +}
> +

Wouldn't all this be cleaned up on bench exit anyway?.. We've been
less strict about proper clean up for bench to keep code simpler.


[...]

> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> new file mode 100644
> index 000000000000..b5e358dee245
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> +       __uint(max_entries, 1000);
> +       __type(key, int);
> +       __type(value, int);
> +} array_of_maps SEC(".maps");

you don't need setup_inner_map_and_load and load_btf, you can just
declaratively have two ARRAY_OF_MAPS, one using inner hashmap and
another using inner task_local_storage. Grep for __array in selftests
to see how to declaratively define inner map prototype, e.g., see
test_ringbuf_multi.c. With the below suggestion one of do_lookup
flavors will use array_of_hashes and another will use
array_of_storages explicitly. From user-space you can create and setup
as many inner maps as needed. If you need btf_id for value_type_id for
inner map, see if bpf_map__inner_map() would be useful.

> +
> +long important_hits;
> +long hits;
> +
> +#ifdef LOOKUP_HASHMAP

why #ifdef'ing if you can have do_lookup_hashmap and
do_lookup_task_storage and choose which one to call using read-only
variable:

const volatile bool use_hashmap;

just set it before load and verifier will know that one of do_lookup
flavors is dead code
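
For illustration, a minimal sketch of that shape (reusing the do_lookup
flavor names above; userspace would set skel->rodata->use_hashmap
before load):

const volatile bool use_hashmap;

static int do_lookup(unsigned int elem, struct task_struct *task)
{
	/* use_hashmap is read-only after load, so the verifier treats
	 * the untaken branch as dead code.
	 */
	if (use_hashmap)
		return do_lookup_hashmap(elem, task);
	return do_lookup_task_storage(elem, task);
}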

> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> +{
> +       void *map;
> +       int zero = 0;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_map_lookup_elem(map, &zero);
> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}
> +#else
> +static int do_lookup(unsigned int elem, struct task_struct *task)
> +{
> +       void *map;
> +
> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> +       if (!map)
> +               return -1;
> +
> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> +       __sync_add_and_fetch(&hits, 1);
> +       if (!elem)
> +               __sync_add_and_fetch(&important_hits, 1);
> +       return 0;
> +}
> +#endif /* LOOKUP_HASHMAP */
> +
> +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave) \
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
> +int get_local_ ## start(void *ctx)                             \
> +{                                                              \
> +       struct task_struct *task;                               \
> +       unsigned int i, elem;                                   \
> +       void *map;                                              \
> +                                                               \
> +       task = bpf_get_current_task_btf();                      \
> +       for (i = 0; i < 50; i++) {                              \

I'm trying to understand why you didn't just do


for (i = 0; i < 1000; i++) { ... }

and avoid all the macro stuff? what didn't work?


> +               elem = start + i;                               \
> +               if (do_lookup(elem, task))                      \
> +                       return 0;                               \
> +               if (interleave && i % 3 == 0)                   \

nit % 3 will be slow(-ish), why not pick some power of 2?

> +                       do_lookup(0, task);                     \
> +       }                                                       \
> +       return 0;                                               \
> +}
> +
> +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
> +       __TASK_STORAGE_GET_LOOP_PROG(array, start, false)
> +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
> +       __TASK_STORAGE_GET_LOOP_PROG(array, start, true)

[...]

> +
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
> +

all these macros make me sad, I'd like to understand why it has to be
done this way

> +char _license[] SEC("license") = "GPL";
> --
> 2.30.2
>
Dave Marchevsky May 21, 2022, 5:22 a.m. UTC | #4
On 5/20/22 8:41 PM, Andrii Nakryiko wrote:   
> On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>
>> Add a benchmarks to demonstrate the performance cliff for local_storage
>> get as the number of local_storage maps increases beyond current
>> local_storage implementation's cache size.
>>
>> "sequential get" and "interleaved get" benchmarks are added, both of
>> which do many bpf_task_storage_get calls on sets of task local_storage
>> maps of various counts, while considering a single specific map to be
>> 'important' and counting task_storage_gets to the important map
>> separately in addition to normal 'hits' count of all gets. Goal here is
>> to mimic scenario where a particular program using one map - the
>> important one - is running on a system where many other local_storage
>> maps exist and are accessed often.
>>
>> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
>> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
>> bpf_task_storage_gets for the important map for every 10 map gets. This
>> is meant to highlight performance differences when important map is
>> accessed far more frequently than non-important maps.
>>
>> A "hashmap control" benchmark is also included for easy comparison of
>> standard bpf hashmap lookup vs local_storage get. The benchmark is
>> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
>> instead of local storage.
>>
>> Addition of this benchmark is inspired by conversation with Alexei in a
>> previous patchset's thread [0], which highlighted the need for such a
>> benchmark to motivate and validate improvements to local_storage
>> implementation. My approach in that series focused on improving
>> performance for explicitly-marked 'important' maps and was rejected
>> with feedback to make more generally-applicable improvements while
>> avoiding explicitly marking maps as important. Thus the benchmark
>> reports both general and important-map-focused metrics, so effect of
>> future work on both is clear.
>>
>> Regarding the benchmark results. On a powerful system (Skylake, 20
>> cores, 256gb ram):
>>
>> Local Storage
>> =============
> 
> [...]
> 
>>
>> Looking at the "sequential get" results, it's clear that as the
>> number of task local_storage maps grows beyond the current cache size
>> (16), there's a significant reduction in hits throughput. Note that
>> current local_storage implementation assigns a cache_idx to maps as they
>> are created. Since "sequential get" is creating maps 0..n in order and
>> then doing bpf_task_storage_get calls in the same order, the benchmark
>> is effectively ensuring that a map will not be in cache when the program
>> tries to access it.
>>
>> For "interleaved get" results, important-map hits throughput is greatly
>> increased as the important map is more likely to be in cache by virtue
>> of being accessed far more frequently. Throughput still reduces as #
>> maps increases, though.
>>
>> Note that the test programs need to split task_storage_get calls across
>> multiple programs to work around the verifier's MAX_USED_MAPS
>> limitations. As evidenced by the unintuitive-looking results for smaller
>> num_maps benchmark runs, overhead which is amortized across larger
>> num_maps in other runs dominates when there are fewer maps. To get a
>> sense of the overhead, I commented out
>> bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
>> ran the benchmark on the same host as 'real' run. Results:
>>
>> Local Storage
>> =============
> 
> [...]
> 
>>
>> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
>>
>> hashmap_control:     ~6.8ns
>> sequential_get_1:    ~15.5ns
>> sequential_get_10:   ~20ns
>> sequential_get_16:   ~17.8ns
>> sequential_get_17:   ~21.8ns
>> sequential_get_24:   ~45.2ns
>> sequential_get_32:   ~69.7ns
>> sequential_get_100:  ~153.3ns
>> sequential_get_1000: ~2300ns
>>
>> Clearly demonstrating a cliff.
>>
>> When running the benchmarks it may be necessary to bump 'open files'
>> ulimit for a successful run.
>>
>>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> Changelog:
>>
>> v1 -> v2:
>>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
>>     configurable # of maps (Andrii)
>>   * Add hashmap benchmark (Alexei)
>>         * Add discussion of overhead
>>
> 
> [...]
> 
>> +
>> +/* Keep in sync w/ array of maps in bpf */
>> +#define MAX_NR_MAPS 1000
>> +/* Keep in sync w/ number of progs in bpf app */
>> +#define MAX_NR_PROGS 20
>> +
>> +static struct {
>> +       void (*destroy_skel)(void *obj);
>> +       int (*load_skel)(void *obj);
>> +       long *important_hits;
>> +       long *hits;
>> +       void *progs;
>> +       void *skel;
>> +       struct bpf_map *array_of_maps;
>> +} ctx;
>> +
>> +int created_maps[MAX_NR_MAPS];
>> +struct bpf_link *attached_links[MAX_NR_PROGS];
>> +
> 
> static?

I sent v3 before seeing this email, but luckily most of your comments are
addressed already. 

attached_links is removed in v3, created_maps is moved to ctx

> 
> 
>> +static void teardown(void)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < MAX_NR_PROGS; i++) {
>> +               if (!attached_links[i])
>> +                       break;
>> +               bpf_link__detach(attached_links[i]);
>> +       }
>> +
>> +       if (ctx.destroy_skel && ctx.skel)
>> +               ctx.destroy_skel(ctx.skel);
>> +
>> +       for (i = 0; i < MAX_NR_MAPS; i++) {
>> +               if (!created_maps[i])
>> +                       break;
>> +               close(created_maps[i]);
>> +       }
>> +}
>> +
> 
> Wouldn't all this be cleaned up on bench exit anyway?.. We've been
> less strict about proper clean up for bench to keep code simpler.

It's important to explicitly clean up created_maps because local_storage maps
are assigned a cache slot when the map is created, and a count of "how many maps
are assigned to this cache slot" is incr'd. On map free the count is decr'd.

So the cache behavior of subsequently alloc'd maps can be affected if these are
kept around.

Not a big deal now since just 1 bench is run and the process exits, but if that
changes in the future I don't want the benchmark to silently give odd results.
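
(For reference, a rough sketch of the kind of accounting being described -
assuming the slot with the lowest usage count is chosen at creation time;
simplified and illustrative only, not the actual kernel code:)

#include <limits.h>

#define CACHE_SIZE 16

static unsigned int idx_usage_counts[CACHE_SIZE];

/* On map creation: hand out the least-used cache slot and bump its count. */
static int cache_idx_get(void)
{
	unsigned int min_usage = UINT_MAX;
	int i, res = 0;

	for (i = 0; i < CACHE_SIZE; i++) {
		if (idx_usage_counts[i] < min_usage) {
			min_usage = idx_usage_counts[i];
			res = i;
		}
	}
	idx_usage_counts[res]++;
	return res;
}

/* On map free: drop the count so a later map can reuse the slot. */
static void cache_idx_free(int idx)
{
	idx_usage_counts[idx]--;
}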

> 
> 
> [...]
> 
>> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
>> new file mode 100644
>> index 000000000000..b5e358dee245
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
>> @@ -0,0 +1,69 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +struct {
>> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
>> +       __uint(max_entries, 1000);
>> +       __type(key, int);
>> +       __type(value, int);
>> +} array_of_maps SEC(".maps");
> 
> you don't need setup_inner_map_and_load and load_btf, you can just
> declaratively have two ARRAY_OF_MAPS, one using inner hashmap and
> another using inner task_local_storage. Grep for __array in selftests
> to see how to declaratively define inner map prototype, e.g., see
> test_ringbuf_multi.c. With the below suggestion one of do_lookup
> flavors will use array_of_hashes and another will use
> array_of_storages explicitly. From user-space you can create and setup
> as many inner maps as needed. If you need btf_id for value_type_id for
> inner map, see if bpf_map__inner_map() would be useful.
> 

Declaratively specifying an inner task_local_storage map will result in libbpf
creating such a map to pass as the inner_map_fd, no? This will have the same
effect on cache_idx assignment as my previous comment.

>> +
>> +long important_hits;
>> +long hits;
>> +
>> +#ifdef LOOKUP_HASHMAP
> 
> why #ifdef'ing if you can have do_lookup_hashmap and
> do_lookup_task_storage and choose which one to call using read-only
> variable:
> 
> const volatile bool use_hashmap;
> 
> just set it before load and verifier will know that one of do_lookup
> flavors is dead code

I want it to be obvious that the hashmap part of the benchmark is not a flavor
of local_storage; the distinct separation of do_lookups here makes this easier
to notice.

Can do a v4 addressing this if you feel strongly about it.

> 
>> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
>> +{
>> +       void *map;
>> +       int zero = 0;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_map_lookup_elem(map, &zero);
>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
>> +#else
>> +static int do_lookup(unsigned int elem, struct task_struct *task)
>> +{
>> +       void *map;
>> +
>> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
>> +       if (!map)
>> +               return -1;
>> +
>> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +       __sync_add_and_fetch(&hits, 1);
>> +       if (!elem)
>> +               __sync_add_and_fetch(&important_hits, 1);
>> +       return 0;
>> +}
>> +#endif /* LOOKUP_HASHMAP */
>> +
>> +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave) \
>> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
>> +int get_local_ ## start(void *ctx)                             \
>> +{                                                              \
>> +       struct task_struct *task;                               \
>> +       unsigned int i, elem;                                   \
>> +       void *map;                                              \
>> +                                                               \
>> +       task = bpf_get_current_task_btf();                      \
>> +       for (i = 0; i < 50; i++) {                              \
> 
> I'm trying to understand why you didn't just do
> 
> 
> for (i = 0; i < 1000; i++) { ... }
> 
> and avoid all the macro stuff? what didn't work?
> 
> 
>> +               elem = start + i;                               \
>> +               if (do_lookup(elem, task))                      \
>> +                       return 0;                               \
>> +               if (interleave && i % 3 == 0)                   \
> 
> nit % 3 will be slow(-ish), why not pick some power of 2?

Few reasons:

1) This results in a ratio of "get important map" to "get any other map" closest
to v1 of this patchset, and a similar interleaving pattern. This made it easy to
compare results after the big refactor and be reasonably sure I didn't break the
benchmark.

2) The current local_storage cache has 16 entries and the current
cache slot assignment algorithm will always give the important map cache_idx
0 since it's created first. The second created map - idx 1 in the ARRAY_OF_MAPS -
will get cache_idx 1, etc, so for this benchmark
cache_idx = map_of_maps_idx % 16. (Assuming no other task_local_storage maps
have been created on the system).

We want to pick a small number because in the 'important map' scenario the
important map is accessed significantly more often than other maps. So < 16.
Considering the current implementation with fixed cache_idx, if we pick a power
of 2 that's < 16, there will always be the same 'gap' between an important
map get and other map gets with the same cache_idx. We care about these specifically
since they'll be knocking each other out of the cache slot.

If an odd number is used the gaps will vary, resulting in a benchmark that more
closely mimics "a bunch of unrelated progs accessing maps in arbitrary order,
with one important prog accessing its map very frequently".

Probably moving to bpf_get_prandom or some userspace nondeterminism to feed a
list of indices to interleave gets of the important map is the best solution,
but I'm hoping to keep things simple for now.

> 
>> +                       do_lookup(0, task);                     \
>> +       }                                                       \
>> +       return 0;                                               \
>> +}
>> +
>> +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
>> +       __TASK_STORAGE_GET_LOOP_PROG(array, start, false)
>> +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
>> +       __TASK_STORAGE_GET_LOOP_PROG(array, start, true)
> 
> [...]
> 
>> +
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
>> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
>> +
> 
> all these macro make me sad, I'd like to understand why it has to be
> done this way
> 

Macro and "why not i < 1000" are addressed in v3. v1 of this patch - before your
ARRAY_OF_MAPS suggestion - was hitting MAX_USED_MAPS limit and failing
verification when trying to access all 1k maps in a single program. I assumed
that accessing maps from within an ARRAY_OF_MAPS would trigger similar limit,
but this is not the case.

>> +char _license[] SEC("license") = "GPL";
>> --
>> 2.30.2
>>
Andrii Nakryiko May 23, 2022, 8:39 p.m. UTC | #5
On Fri, May 20, 2022 at 10:22 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> On 5/20/22 8:41 PM, Andrii Nakryiko wrote:
> > On Tue, May 17, 2022 at 8:51 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
> >>
> >> Add a benchmarks to demonstrate the performance cliff for local_storage
> >> get as the number of local_storage maps increases beyond current
> >> local_storage implementation's cache size.
> >>
> >> "sequential get" and "interleaved get" benchmarks are added, both of
> >> which do many bpf_task_storage_get calls on sets of task local_storage
> >> maps of various counts, while considering a single specific map to be
> >> 'important' and counting task_storage_gets to the important map
> >> separately in addition to normal 'hits' count of all gets. Goal here is
> >> to mimic scenario where a particular program using one map - the
> >> important one - is running on a system where many other local_storage
> >> maps exist and are accessed often.
> >>
> >> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
> >> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
> >> bpf_task_storage_gets for the important map for every 10 map gets. This
> >> is meant to highlight performance differences when important map is
> >> accessed far more frequently than non-important maps.
> >>
> >> A "hashmap control" benchmark is also included for easy comparison of
> >> standard bpf hashmap lookup vs local_storage get. The benchmark is
> >> identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> >> instead of local storage.
> >>
> >> Addition of this benchmark is inspired by conversation with Alexei in a
> >> previous patchset's thread [0], which highlighted the need for such a
> >> benchmark to motivate and validate improvements to local_storage
> >> implementation. My approach in that series focused on improving
> >> performance for explicitly-marked 'important' maps and was rejected
> >> with feedback to make more generally-applicable improvements while
> >> avoiding explicitly marking maps as important. Thus the benchmark
> >> reports both general and important-map-focused metrics, so effect of
> >> future work on both is clear.
> >>
> >> Regarding the benchmark results. On a powerful system (Skylake, 20
> >> cores, 256gb ram):
> >>
> >> Local Storage
> >> =============
> >
> > [...]
> >
> >>
> >> Looking at the "sequential get" results, it's clear that as the
> >> number of task local_storage maps grows beyond the current cache size
> >> (16), there's a significant reduction in hits throughput. Note that
> >> current local_storage implementation assigns a cache_idx to maps as they
> >> are created. Since "sequential get" is creating maps 0..n in order and
> >> then doing bpf_task_storage_get calls in the same order, the benchmark
> >> is effectively ensuring that a map will not be in cache when the program
> >> tries to access it.
> >>
> >> For "interleaved get" results, important-map hits throughput is greatly
> >> increased as the important map is more likely to be in cache by virtue
> >> of being accessed far more frequently. Throughput still reduces as #
> >> maps increases, though.
> >>
> >> Note that the test programs need to split task_storage_get calls across
> >> multiple programs to work around the verifier's MAX_USED_MAPS
> >> limitations. As evidenced by the unintuitive-looking results for smaller
> >> num_maps benchmark runs, overhead which is amortized across larger
> >> num_maps in other runs dominates when there are fewer maps. To get a
> >> sense of the overhead, I commented out
> >> bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and
> >> ran the benchmark on the same host as 'real' run. Results:
> >>
> >> Local Storage
> >> =============
> >
> > [...]
> >
> >>
> >> Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:
> >>
> >> hashmap_control:     ~6.8ns
> >> sequential_get_1:    ~15.5ns
> >> sequential_get_10:   ~20ns
> >> sequential_get_16:   ~17.8ns
> >> sequential_get_17:   ~21.8ns
> >> sequential_get_24:   ~45.2ns
> >> sequential_get_32:   ~69.7ns
> >> sequential_get_100:  ~153.3ns
> >> sequential_get_1000: ~2300ns
> >>
> >> Clearly demonstrating a cliff.
> >>
> >> When running the benchmarks it may be necessary to bump 'open files'
> >> ulimit for a successful run.
> >>
> >>   [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
> >>
> >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> >> ---
> >> Changelog:
> >>
> >> v1 -> v2:
> >>   * Adopt ARRAY_OF_MAPS approach for bpf program, allowing truly
> >>     configurable # of maps (Andrii)
> >>   * Add hashmap benchmark (Alexei)
> >>         * Add discussion of overhead
> >>
> >
> > [...]
> >
> >> +
> >> +/* Keep in sync w/ array of maps in bpf */
> >> +#define MAX_NR_MAPS 1000
> >> +/* Keep in sync w/ number of progs in bpf app */
> >> +#define MAX_NR_PROGS 20
> >> +
> >> +static struct {
> >> +       void (*destroy_skel)(void *obj);
> >> +       int (*load_skel)(void *obj);
> >> +       long *important_hits;
> >> +       long *hits;
> >> +       void *progs;
> >> +       void *skel;
> >> +       struct bpf_map *array_of_maps;
> >> +} ctx;
> >> +
> >> +int created_maps[MAX_NR_MAPS];
> >> +struct bpf_link *attached_links[MAX_NR_PROGS];
> >> +
> >
> > static?
>
> I sent v3 before seeing this email, but luckily most of your comments are
> addressed already.
>
> attached_links is removed in v3, created_maps is moved to ctx
>
> >
> >
> >> +static void teardown(void)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < MAX_NR_PROGS; i++) {
> >> +               if (!attached_links[i])
> >> +                       break;
> >> +               bpf_link__detach(attached_links[i]);
> >> +       }
> >> +
> >> +       if (ctx.destroy_skel && ctx.skel)
> >> +               ctx.destroy_skel(ctx.skel);
> >> +
> >> +       for (i = 0; i < MAX_NR_MAPS; i++) {
> >> +               if (!created_maps[i])
> >> +                       break;
> >> +               close(created_maps[i]);
> >> +       }
> >> +}
> >> +
> >
> > Wouldn't all this be cleaned up on bench exit anyway?.. We've been
> > less strict about proper clean up for bench to keep code simpler.
>
> It's important to explicitly clean up created_maps because local_storage maps
> are assigned a cache slot when the map is created, and count of "how many maps
> are assigned to this cache slot" is incr'd. On map free the count is decr'd.
>
> So cache behavior of subsequently alloc'd maps can be affected if these are kept
> around.
>
> Not a big deal now since just 1 bench is run and process exits, but if that
> changes in the future I don't want the benchmark to silently give odd results.
>

Right, I don't see why we'd need to support running multiple
benchmarks within the same bench process. We can always run ./bench
twice. So the exact point is that there is no need to clean up after
benchmark (unless kernel won't clean up after us automatically).
That's the reason why we never had teardown callback in the first
place, not because there is nothing to clean up, but because it will
happen automatically on process exit (and that's ok for ./bench).


> >
> >
> > [...]
> >
> >> diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> >> new file mode 100644
> >> index 000000000000..b5e358dee245
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
> >> @@ -0,0 +1,69 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> >> +
> >> +struct {
> >> +       __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> >> +       __uint(max_entries, 1000);
> >> +       __type(key, int);
> >> +       __type(value, int);
> >> +} array_of_maps SEC(".maps");
> >
> > you don't need setup_inner_map_and_load and load_btf, you can just
> > declaratively have two ARRAY_OF_MAPS, one using inner hashmap and
> > another using inner task_local_storage. Grep for __array in selftests
> > to see how to declaratively define inner map prototype, e.g., see
> > test_ringbuf_multi.c. With the below suggestion one of do_lookup
> > flavors will use array_of_hashes and another will use
> > array_of_storages explicitly. From user-space you can create and setup
> > as many inner maps as needed. If you need btf_id for value_type_id for
> > inner map, see if bpf_map__inner_map() would be useful.
> >
>
> Declaratively specifying an inner task_local_storage map will result in libbpf
> creating such a map to pass as the inner_map_fd, no? This will have same
> effect on cache_idx assignment as my previous comment.

It will create an inner "prototype" map, then pass its FD as inner_map_fd to
create the outer map. But after that it will immediately close the FD and
destroy it. Does that affect the benchmark negatively? The extra inner map
should be gone by the time you start creating the actual inner maps.

>
> >> +
> >> +long important_hits;
> >> +long hits;
> >> +
> >> +#ifdef LOOKUP_HASHMAP
> >
> > why #ifdef'ing if you can have do_lookup_hashmap and
> > do_lookup_task_storage and choose which one to call using read-only
> > variable:
> >
> > const volatile bool use_hashmap;
> >
> > just set it before load and verifier will know that one of do_lookup
> > flavors is dead code
>
> I want it to be obvious that the hashmap part of the benchmark is not a flavor
> of local_storage, the distinct separation of do_lookups here makes this easier
> to notice.
>
> Can do a v4 addressing this if you feel strongly about it.

I generally dislike #if/#endif guards for BPF code, if they can be
avoided, as they go completely against the spirit of CO-RE.

>
> >
> >> +static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
> >> +{
> >> +       void *map;
> >> +       int zero = 0;
> >> +
> >> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> >> +       if (!map)
> >> +               return -1;
> >> +
> >> +       bpf_map_lookup_elem(map, &zero);
> >> +       __sync_add_and_fetch(&hits, 1);
> >> +       if (!elem)
> >> +               __sync_add_and_fetch(&important_hits, 1);
> >> +       return 0;
> >> +}
> >> +#else
> >> +static int do_lookup(unsigned int elem, struct task_struct *task)
> >> +{
> >> +       void *map;
> >> +
> >> +       map = bpf_map_lookup_elem(&array_of_maps, &elem);
> >> +       if (!map)
> >> +               return -1;
> >> +
> >> +       bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> >> +       __sync_add_and_fetch(&hits, 1);
> >> +       if (!elem)
> >> +               __sync_add_and_fetch(&important_hits, 1);
> >> +       return 0;
> >> +}
> >> +#endif /* LOOKUP_HASHMAP */
> >> +
> >> +#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave) \
> >> +SEC("fentry/" SYS_PREFIX "sys_getpgid")                        \
> >> +int get_local_ ## start(void *ctx)                             \
> >> +{                                                              \
> >> +       struct task_struct *task;                               \
> >> +       unsigned int i, elem;                                   \
> >> +       void *map;                                              \
> >> +                                                               \
> >> +       task = bpf_get_current_task_btf();                      \
> >> +       for (i = 0; i < 50; i++) {                              \
> >
> > I'm trying to understand why you didn't just do
> >
> >
> > for (i = 0; i < 1000; i++) { ... }
> >
> > and avoid all the macro stuff? what didn't work?
> >
> >
> >> +               elem = start + i;                               \
> >> +               if (do_lookup(elem, task))                      \
> >> +                       return 0;                               \
> >> +               if (interleave && i % 3 == 0)                   \
> >
> > nit % 3 will be slow(-ish), why not pick some power of 2?
>
> Few reasons:
>
> 1) This results in a ratio of "get important map" to "get any other map" closest
> to v1 of this patchset, and similar interleaving pattern. Made it easy to
> compare results after big refactor and be reasonably sure I didn't break the
> benchmark.
>

ok, np, div and mod are noticeably more expensive than even
multiplication, so if you have a microbenchmark it would be preferable
to avoid division, if possible


> 2) The current local_storage cache has 16 entries and the current
> cache slot assignment algorithm will always give the important map cache_idx
> 0 since it's created first. Second created map - idx 1 in the ARRAY_OF_MAPS -
> will get cache_idx 1, etc, so for this benchmark
> cache_idx = map_of_maps_idx % 16. (Assuming no other task_local_storage maps
> have been created on the system).
>
> We want to pick a small number because in 'important map' scenario the
> important map is accessed significantly more often than other maps. So < 16.
> Considering current implementation with fixed cache_idx, if we pick a power
> of 2 that's < 16, there will always be the same 'gap' between important
> map get and other map gets with same cache_idx. We care about these specifically
> since they'll be knocking each other out of cache slot.
>
> If an odd number is used the gaps will vary, resulting in a benchmark more
> closely mimicing "bunch of unrelated progs accessing maps in arbitrary order,
> with one important prog accessing its map very frequently".
>
> Probably moving to bpf_get_prandom or some userspace nondeterminism to feed
> list of indices to interleave gets of important map is the best solution,
> but I'm hoping to keep things simple for now.
>

yeah, IMO it's fine, as long as it's done consciously

> >
> >> +                       do_lookup(0, task);                     \
> >> +       }                                                       \
> >> +       return 0;                                               \
> >> +}
> >> +
> >> +#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
> >> +       __TASK_STORAGE_GET_LOOP_PROG(array, start, false)
> >> +#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
> >> +       __TASK_STORAGE_GET_LOOP_PROG(array, start, true)
> >
> > [...]
> >
> >> +
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
> >> +TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
> >> +
> >
> > all these macros make me sad, I'd like to understand why it has to be
> > done this way
> >
>
> The macro and "why not i < 1000" are addressed in v3. v1 of this patch - before
> your ARRAY_OF_MAPS suggestion - was hitting the MAX_USED_MAPS limit and failing
> verification when trying to access all 1k maps in a single program. I assumed
> that accessing maps from within an ARRAY_OF_MAPS would trigger a similar limit,
> but this is not the case.

ok, haven't checked v3 yet, will take a look, thanks
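
For context, the map-in-map access pattern being referenced is essentially what
the patch's do_lookup() does; a minimal sketch (assuming the vmlinux.h and
bpf_helpers.h context of the bench progs) is below. Only the outer array is
referenced directly by the program, so the number of inner task local_storage
maps never counts against the per-program map limit:

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 1000);	/* inner maps set from userspace */
	__type(key, int);
	__type(value, int);
} outer SEC(".maps");

static int lookup_inner(unsigned int idx, struct task_struct *task)
{
	void *inner;

	/* only 'outer' shows up in this program's used_maps */
	inner = bpf_map_lookup_elem(&outer, &idx);
	if (!inner)
		return -1;
	bpf_task_storage_get(inner, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	return 0;
}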

>
> >> +char _license[] SEC("license") = "GPL";
> >> --
> >> 2.30.2
> >>
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4030dd6cbc34..6095f6af2ad1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -560,6 +560,9 @@  $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench__get_seq.skel.h \
+				  $(OUTPUT)/local_storage_bench__get_int.skel.h \
+				  $(OUTPUT)/local_storage_bench__hashmap.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -571,7 +574,8 @@  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_ringbufs.o \
 		 $(OUTPUT)/bench_bloom_filter_map.o \
 		 $(OUTPUT)/bench_bpf_loop.o \
-		 $(OUTPUT)/bench_strncmp.o
+		 $(OUTPUT)/bench_strncmp.o \
+		 $(OUTPUT)/bench_local_storage.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index f061cc20e776..71271062f68d 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -150,6 +150,53 @@  void ops_report_final(struct bench_res res[], int res_cnt)
 	printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
 }
 
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns)
+{
+	double important_hits_per_sec, hits_per_sec;
+	double delta_sec = delta_ns / 1000000000.0;
+
+	hits_per_sec = res->hits / 1000000.0 / delta_sec;
+	important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
+
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s ", hits_per_sec);
+	printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
+}
+
+void local_storage_report_final(struct bench_res res[], int res_cnt)
+{
+	double important_hits_mean = 0.0, important_hits_stddev = 0.0;
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			important_hits_stddev +=
+				       (important_hits_mean - res[i].important_hits / 1000000.0) *
+				       (important_hits_mean - res[i].important_hits / 1000000.0) /
+				       (res_cnt - 1.0);
+		}
+
+		hits_stddev = sqrt(hits_stddev);
+		important_hits_stddev = sqrt(important_hits_stddev);
+	}
+	printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
+	       hits_mean, hits_stddev);
+	printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
+	printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
+	       important_hits_mean, important_hits_stddev);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
 const char argp_program_doc[] =
@@ -188,12 +235,14 @@  static const struct argp_option opts[] = {
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
 extern struct argp bench_bpf_loop_argp;
+extern struct argp bench_local_storage_argp;
 extern struct argp bench_strncmp_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
+	{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{},
 };
@@ -396,6 +445,9 @@  extern const struct bench bench_hashmap_with_bloom;
 extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_local_storage_cache_seq_get;
+extern const struct bench bench_local_storage_cache_interleaved_get;
+extern const struct bench bench_local_storage_cache_hashmap_control;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -430,6 +482,9 @@  static const struct bench *benchs[] = {
 	&bench_bpf_loop,
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
+	&bench_local_storage_cache_seq_get,
+	&bench_local_storage_cache_interleaved_get,
+	&bench_local_storage_cache_hashmap_control,
 };
 
 static void setup_benchmark()
@@ -547,5 +602,7 @@  int main(int argc, char **argv)
 		bench->report_final(state.results + env.warmup_sec,
 				    state.res_cnt - env.warmup_sec);
 
+	if (bench->teardown)
+		bench->teardown();
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index fb3e213df3dc..0a137eedc959 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -34,12 +34,14 @@  struct bench_res {
 	long hits;
 	long drops;
 	long false_hits;
+	long important_hits;
 };
 
 struct bench {
 	const char *name;
 	void (*validate)(void);
 	void (*setup)(void);
+	void (*teardown)(void);
 	void *(*producer_thread)(void *ctx);
 	void *(*consumer_thread)(void *ctx);
 	void (*measure)(struct bench_res* res);
@@ -61,6 +63,9 @@  void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
 void false_hits_report_final(struct bench_res res[], int res_cnt);
 void ops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void ops_report_final(struct bench_res res[], int res_cnt);
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns);
+void local_storage_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns(void)
 {
diff --git a/tools/testing/selftests/bpf/benchs/bench_local_storage.c b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
new file mode 100644
index 000000000000..1cf041b9448a
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_local_storage.c
@@ -0,0 +1,345 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <argp.h>
+#include <linux/btf.h>
+
+#include "local_storage_bench__get_int.skel.h"
+#include "local_storage_bench__get_seq.skel.h"
+#include "local_storage_bench__hashmap.skel.h"
+#include "bench.h"
+
+#include <test_btf.h>
+
+static struct {
+	__u32 nr_maps;
+} args = {
+	.nr_maps = 100,
+};
+
+enum {
+	ARG_NR_MAPS = 6000,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_maps", ARG_NR_MAPS, "NR_MAPS", 0,
+		"Set number of local_storage maps"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long ret;
+
+	switch (key) {
+	case ARG_NR_MAPS:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "invalid nr_maps");
+			argp_usage(state);
+		}
+		args.nr_maps = ret;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_local_storage_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.producer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
+		exit(1);
+	}
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+
+	if (args.nr_maps > 1000) {
+		fprintf(stderr, "nr_maps must be <= 1000\n");
+		exit(1);
+	}
+}
+
+/* Keep in sync w/ array of maps in bpf */
+#define MAX_NR_MAPS 1000
+/* Keep in sync w/ number of progs in bpf app */
+#define MAX_NR_PROGS 20
+
+static struct {
+	void (*destroy_skel)(void *obj);
+	int (*load_skel)(void *obj);
+	long *important_hits;
+	long *hits;
+	void *progs;
+	void *skel;
+	struct bpf_map *array_of_maps;
+} ctx;
+
+int created_maps[MAX_NR_MAPS];
+struct bpf_link *attached_links[MAX_NR_PROGS];
+
+static void teardown(void)
+{
+	int i;
+
+	for (i = 0; i < MAX_NR_PROGS; i++) {
+		if (!attached_links[i])
+			break;
+		bpf_link__detach(attached_links[i]);
+	}
+
+	if (ctx.destroy_skel && ctx.skel)
+		ctx.destroy_skel(ctx.skel);
+
+	for (i = 0; i < MAX_NR_MAPS; i++) {
+		if (!created_maps[i])
+			break;
+		close(created_maps[i]);
+	}
+}
+
+static int setup_inner_map_and_load(int inner_fd)
+{
+	int err, mim_fd;
+
+	err = bpf_map__set_inner_map_fd(ctx.array_of_maps, inner_fd);
+	if (err)
+		return -1;
+
+	err = ctx.load_skel(ctx.skel);
+	if (err)
+		return -1;
+
+	mim_fd = bpf_map__fd(ctx.array_of_maps);
+	if (mim_fd < 0)
+		return -1;
+
+	return mim_fd;
+}
+
+static int load_btf(void)
+{
+	static const char btf_str_sec[] = "\0";
+	__u32 btf_raw_types[] = {
+		/* int */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+	};
+	struct btf_header btf_hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	__u8 raw_btf[sizeof(struct btf_header) + sizeof(btf_raw_types) +
+				sizeof(btf_str_sec)];
+
+	memcpy(raw_btf, &btf_hdr, sizeof(btf_hdr));
+	memcpy(raw_btf + sizeof(btf_hdr), btf_raw_types, sizeof(btf_raw_types));
+	memcpy(raw_btf + sizeof(btf_hdr) + sizeof(btf_raw_types),
+	       btf_str_sec, sizeof(btf_str_sec));
+
+	return bpf_btf_load(raw_btf, sizeof(raw_btf), NULL);
+}
+
+static void __setup(bool hashmap)
+{
+	struct bpf_program **prog;
+	uint32_t progs_to_attach;
+	int i, fd, mim_fd, err;
+	int btf_fd = 0;
+
+	LIBBPF_OPTS(bpf_map_create_opts, create_opts);
+
+	memset(&created_maps, 0, MAX_NR_MAPS * sizeof(int));
+	memset(&attached_links, 0, MAX_NR_PROGS * sizeof(void *));
+
+	btf_fd = load_btf();
+	create_opts.btf_fd = btf_fd;
+	create_opts.btf_key_type_id = 1;
+	create_opts.btf_value_type_id = 1;
+	if (!hashmap)
+		create_opts.map_flags = BPF_F_NO_PREALLOC;
+
+	mim_fd = 0;
+	for (i = 0; i < args.nr_maps; i++) {
+		if (hashmap)
+			fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(int),
+							sizeof(int), 65536, &create_opts);
+		else
+			fd = bpf_map_create(BPF_MAP_TYPE_TASK_STORAGE, NULL, sizeof(int),
+							sizeof(int), 0, &create_opts);
+		if (fd < 0) {
+			fprintf(stderr, "Error creating map %d\n", i);
+			goto err_out;
+		}
+
+		if (i == 0) {
+			mim_fd = setup_inner_map_and_load(fd);
+			if (mim_fd < 0) {
+				fprintf(stderr, "Error doing setup_inner_map_and_load\n");
+				goto err_out;
+			}
+		}
+
+		err = bpf_map_update_elem(mim_fd, &i, &fd, 0);
+		if (err) {
+			fprintf(stderr, "Error updating array-of-maps w/ map %d\n", i);
+			goto err_out;
+		}
+		created_maps[i] = fd;
+	}
+	close(btf_fd);
+
+	progs_to_attach = (args.nr_maps / 50);
+	if (args.nr_maps % 50)
+		progs_to_attach++;
+
+	for (i = 0; i < progs_to_attach; i++) {
+		prog = ctx.progs + i * sizeof(void *);
+		attached_links[i] = bpf_program__attach(*prog);
+	}
+
+	return;
+err_out:
+	if (btf_fd)
+		close(btf_fd);
+	teardown();
+	exit(1);
+}
+
+static void hashmap_setup(void)
+{
+	struct local_storage_bench__hashmap *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__hashmap__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__hashmap__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__hashmap__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(true);
+}
+
+static void local_storage_cache_get_setup(void)
+{
+	struct local_storage_bench__get_seq *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__get_seq__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__get_seq__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_seq__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(false);
+}
+
+static void local_storage_cache_get_interleaved_setup(void)
+{
+	struct local_storage_bench__get_int *skel;
+
+	setup_libbpf();
+
+	skel = local_storage_bench__get_int__open();
+	ctx.skel = skel;
+	ctx.hits = &skel->bss->hits;
+	ctx.important_hits = &skel->bss->important_hits;
+	ctx.load_skel = (int (*)(void *))local_storage_bench__get_int__load;
+	ctx.progs = (void *)&skel->progs;
+	ctx.destroy_skel = (void (*)(void *))local_storage_bench__get_int__destroy;
+	ctx.array_of_maps = skel->maps.array_of_maps;
+
+	__setup(false);
+}
+
+static void measure(struct bench_res *res)
+{
+	if (ctx.hits)
+		res->hits = atomic_swap(ctx.hits, 0);
+	if (ctx.important_hits)
+		res->important_hits = atomic_swap(ctx.important_hits, 0);
+}
+
+static inline void trigger_bpf_program(void)
+{
+	syscall(__NR_getpgid);
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		trigger_bpf_program();
+
+	return NULL;
+}
+
+/* cache sequential and interleaved get benchs test local_storage get
+ * performance, specifically they demonstrate performance cliff of
+ * current list-plus-cache local_storage model.
+ *
+ * cache sequential get: call bpf_task_storage_get on n maps in order
+ * cache interleaved get: like "sequential get", but interleave 4 calls to the
+ *	'important' map (idx 0 in array_of_maps) for every 10 calls. Goal
+ *	is to mimic environment where many progs are accessing their local_storage
+ *	maps, with 'our' prog needing to access its map more often than others
+ */
+const struct bench bench_local_storage_cache_seq_get = {
+	.name = "local-storage-cache-seq-get",
+	.validate = validate,
+	.setup = local_storage_cache_get_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
+
+const struct bench bench_local_storage_cache_interleaved_get = {
+	.name = "local-storage-cache-int-get",
+	.validate = validate,
+	.setup = local_storage_cache_get_interleaved_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
+
+const struct bench bench_local_storage_cache_hashmap_control = {
+	.name = "local-storage-cache-hashmap-control",
+	.validate = validate,
+	.setup = hashmap_setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = local_storage_report_progress,
+	.report_final = local_storage_report_final,
+	.teardown = teardown,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
new file mode 100755
index 000000000000..479096c47c93
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_local_storage.sh
@@ -0,0 +1,21 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+header "Local Storage"
+subtitle "Hashmap Control w/ 500 maps"
+	summarize_local_storage "hashmap (control) sequential    get: "\
+		"$(./bench --nr_maps 500 local-storage-cache-hashmap-control)"
+	printf "\n"
+
+for i in 1 10 16 17 24 32 100 1000; do
+subtitle "num_maps: $i"
+	summarize_local_storage "local_storage cache sequential  get: "\
+		"$(./bench --nr_maps $i local-storage-cache-seq-get)"
+	summarize_local_storage "local_storage cache interleaved get: "\
+		"$(./bench --nr_maps $i local-storage-cache-int-get)"
+	printf "\n"
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
index 6c5e6023a69f..d9f40af82006 100644
--- a/tools/testing/selftests/bpf/benchs/run_common.sh
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -41,6 +41,16 @@  function ops()
 	echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
 }
 
+function local_storage()
+{
+	echo -n "hits throughput: "
+	echo -n "$*" | sed -E "s/.* hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+	echo -n -e ", hits latency: "
+	echo -n "$*" | sed -E "s/.* hits latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
+	echo -n ", important_hits throughput: "
+	echo "$*" | sed -E "s/.*important_hits throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+}
+
 function total()
 {
 	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
@@ -67,6 +77,13 @@  function summarize_ops()
 	printf "%-20s %s\n" "$bench" "$(ops $summary)"
 }
 
+function summarize_local_storage()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(local_storage $summary)"
+}
+
 function summarize_total()
 {
 	bench="$1"
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench.h b/tools/testing/selftests/bpf/progs/local_storage_bench.h
new file mode 100644
index 000000000000..b5e358dee245
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench.h
@@ -0,0 +1,69 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, int);
+} array_of_maps SEC(".maps");
+
+long important_hits;
+long hits;
+
+#ifdef LOOKUP_HASHMAP
+static int do_lookup(unsigned int elem, struct task_struct *task /* unused */)
+{
+	void *map;
+	int zero = 0;
+
+	map = bpf_map_lookup_elem(&array_of_maps, &elem);
+	if (!map)
+		return -1;
+
+	bpf_map_lookup_elem(map, &zero);
+	__sync_add_and_fetch(&hits, 1);
+	if (!elem)
+		__sync_add_and_fetch(&important_hits, 1);
+	return 0;
+}
+#else
+static int do_lookup(unsigned int elem, struct task_struct *task)
+{
+	void *map;
+
+	map = bpf_map_lookup_elem(&array_of_maps, &elem);
+	if (!map)
+		return -1;
+
+	bpf_task_storage_get(map, task, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
+	__sync_add_and_fetch(&hits, 1);
+	if (!elem)
+		__sync_add_and_fetch(&important_hits, 1);
+	return 0;
+}
+#endif /* LOOKUP_HASHMAP */
+
+#define __TASK_STORAGE_GET_LOOP_PROG(array, start, interleave) \
+SEC("fentry/" SYS_PREFIX "sys_getpgid")			\
+int get_local_ ## start(void *ctx)				\
+{								\
+	struct task_struct *task;				\
+	unsigned int i, elem;					\
+	void *map;						\
+								\
+	task = bpf_get_current_task_btf();			\
+	for (i = 0; i < 50; i++) {				\
+		elem = start + i;				\
+		if (do_lookup(elem, task))			\
+			return 0;				\
+		if (interleave && i % 3 == 0)			\
+			do_lookup(0, task);			\
+	}							\
+	return 0;						\
+}
+
+#define TASK_STORAGE_GET_LOOP_PROG_SEQ(array, start) \
+	__TASK_STORAGE_GET_LOOP_PROG(array, start, false)
+#define TASK_STORAGE_GET_LOOP_PROG_INT(array, start) \
+	__TASK_STORAGE_GET_LOOP_PROG(array, start, true)
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
new file mode 100644
index 000000000000..498f0c81014b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_int.c
@@ -0,0 +1,31 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 0);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 50);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 100);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 150);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 200);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 250);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 300);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 350);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 400);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 450);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 500);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 550);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 600);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 650);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 700);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 750);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 800);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 850);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 900);
+TASK_STORAGE_GET_LOOP_PROG_INT(array_of_maps, 950);
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
new file mode 100644
index 000000000000..71514576a05d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__get_seq.c
@@ -0,0 +1,31 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
new file mode 100644
index 000000000000..15e4fe13d0b5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/local_storage_bench__hashmap.c
@@ -0,0 +1,32 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#define LOOKUP_HASHMAP
+#include "local_storage_bench.h"
+
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 0);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 50);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 100);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 150);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 200);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 250);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 300);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 350);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 400);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 450);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 500);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 550);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 600);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 650);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 700);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 750);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 800);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 850);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 900);
+TASK_STORAGE_GET_LOOP_PROG_SEQ(array_of_maps, 950);
+
+char _license[] SEC("license") = "GPL";