
[bpf-next,1/2] bpf: add new map ops ->map_pressure

Message ID 20230531110511.64612-2-aspsk@isovalent.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series: add mechanism to report map pressure

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1732 this patch: 1732
netdev/cc_maintainers warning 10 maintainers not CCed: daniel@iogearbox.net yhs@fb.com kpsingh@kernel.org martin.lau@linux.dev song@kernel.org sdf@google.com andrii@kernel.org jolsa@kernel.org haoluo@google.com ast@kernel.org
netdev/build_clang success Errors and warnings before: 182 this patch: 182
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1731 this patch: 1731
netdev/checkpatch warning CHECK: Unbalanced braces around else statement CHECK: braces {} should be used on all arms of this statement WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Anton Protopopov May 31, 2023, 11:05 a.m. UTC
Add a new map op named map_pressure to return a map's "raw pressure". This value
is to be defined per map type, but in most cases it would simply be the number
of elements currently present in the map: the fuller the map gets, the higher
the pressure. (For array-based maps it seems right to set it to 0, i.e.,
"there is no pressure".)

By analogy with the 'max_entries' field, the pressure value is a u32 integer.
The primary API to read it from userspace is the map's fdinfo proc entry, where
it is reported in the "raw_pressure" field, e.g., for an example map we have

    # echo -e `strace -e bpf,openat,read -s 1024 bpftool map show id 202 2>&1 | grep -A1 '/proc/self/fdinfo' | head -2 | tail -1 | cut -f2 -d' '`
    "pos:   0
    flags:  02000002
    mnt_id: 15
    ino:    18958
    map_type:       1
    key_size:       8
    value_size:     8
    max_entries:    1224
    map_flags:      0x1
    map_extra:      0x0
    memlock:        69632
    raw_pressure:   500
    map_id: 202
    frozen: 0
    ",

On old kernels, and when the ->map_pressure map operation is not defined, the
'raw_pressure' field is absent from the list.

The second way to get the raw_pressure is via BPF_OBJ_GET_INFO_BY_FD, where a
previously unused alignment-pad field in struct bpf_map_info is now used to
return this value.
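
For illustration, a minimal userspace sketch of this second path, assuming
libbpf and uapi headers built from this series (error handling trimmed):

    #include <stdio.h>
    #include <string.h>
    #include <bpf/bpf.h>

    static int print_raw_pressure(int map_fd)
    {
            struct bpf_map_info info;
            __u32 len = sizeof(info);
            int err;

            memset(&info, 0, sizeof(info));
            err = bpf_obj_get_info_by_fd(map_fd, &info, &len);
            if (err)
                    return err;

            printf("map id %u: raw_pressure %u of max_entries %u\n",
                   info.id, info.raw_pressure, info.max_entries);
            return 0;
    }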

The patch adds a relatively small amount of logic: for most maps the number of
elements was already computed to support the map memory usage API, it just was
not exported.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |   2 +-
 kernel/bpf/hashtab.c           | 118 ++++++++++++++++++++-------------
 kernel/bpf/lpm_trie.c          |   8 +++
 kernel/bpf/syscall.c           |  15 +++++
 tools/include/uapi/linux/bpf.h |   2 +-
 6 files changed, 99 insertions(+), 47 deletions(-)

Comments

Alexei Starovoitov May 31, 2023, 6:24 p.m. UTC | #1
On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  {
>  	htab_put_fd_value(htab, l);
>  
> +	dec_elem_count(htab);
> +
>  	if (htab_is_prealloc(htab)) {
>  		check_and_free_fields(htab, l);
>  		__pcpu_freelist_push(&htab->freelist, &l->fnode);
>  	} else {
> -		dec_elem_count(htab);
>  		htab_elem_free(htab, l);
>  	}
>  }
> @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>  			if (!l)
>  				return ERR_PTR(-E2BIG);
>  			l_new = container_of(l, struct htab_elem, fnode);
> +			inc_elem_count(htab);

The current use_percpu_counter heuristic is far from perfect. It works for some cases,
but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
Hence, there is a big performance risk doing inc/dec everywhere.
Hence, this is a nack: we cannot decrease the performance of various maps for the
few folks who want to see map stats.

If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
No kernel patches needed.
Then bpf_prog_run such tracing prog and read all internal map info.
It's less convenient than exposing things in uapi, but not being uapi is the point.
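
For reference, a minimal sketch of this map_ptr_kern.c-style access (the field
mirror, program type and names are illustrative; the load needs to be
privileged, and ->count is only maintained for a non-preallocated, i.e.
bpf_mem_alloc, hash map):

    /* SPDX-License-Identifier: GPL-2.0 */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char _license[] SEC("license") = "GPL";

    /* partial mirror of the kernel structs; CO-RE relocates the offsets */
    struct bpf_map {
            __u32 max_entries;
    } __attribute__((preserve_access_index));

    struct bpf_htab {
            struct bpf_map map;
            struct {
                    int counter;
            } count;                        /* atomic_t in the kernel */
    } __attribute__((preserve_access_index));

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __uint(max_entries, 1024);
            __type(key, __u32);
            __type(value, __u32);
    } m_hash SEC(".maps");

    __u32 elem_count;       /* read from the skeleton's .bss after the run */

    SEC("syscall")          /* any prog type runnable via bpf_prog_run() */
    int dump_hash_count(void *ctx)
    {
            struct bpf_htab *htab = (struct bpf_htab *)&m_hash;

            elem_count = htab->count.counter;
            return 0;
    }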
kernel test robot June 1, 2023, 12:44 a.m. UTC | #2
Hi Anton,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Anton-Protopopov/bpf-add-new-map-ops-map_pressure/20230531-190704
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230531110511.64612-2-aspsk%40isovalent.com
patch subject: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20230601/202306010837.mGhA199K-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 12.3.0
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/025cc7c86c6c7e108ba5b9946a0f50e0cc082f9b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Anton-Protopopov/bpf-add-new-map-ops-map_pressure/20230531-190704
        git checkout 025cc7c86c6c7e108ba5b9946a0f50e0cc082f9b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=sh olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=sh SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/hashtab.c: In function 'htab_map_pressure':
>> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
     189 |                 return __percpu_counter_sum(&htab->pcount);
         |                        ^~~~~~~~~~~~~~~~~~~~
         |                        percpu_counter_sum
   cc1: some warnings being treated as errors


vim +189 kernel/bpf/hashtab.c

   183	
   184	static u32 htab_map_pressure(const struct bpf_map *map)
   185	{
   186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
   187	
   188		if (htab->use_percpu_counter)
 > 189			return __percpu_counter_sum(&htab->pcount);
   190		return atomic_read(&htab->count);
   191	}
   192
Anton Protopopov June 1, 2023, 7:31 a.m. UTC | #3
On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote:
> On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
> >  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> >  {
> >  	htab_put_fd_value(htab, l);
> >  
> > +	dec_elem_count(htab);
> > +
> >  	if (htab_is_prealloc(htab)) {
> >  		check_and_free_fields(htab, l);
> >  		__pcpu_freelist_push(&htab->freelist, &l->fnode);
> >  	} else {
> > -		dec_elem_count(htab);
> >  		htab_elem_free(htab, l);
> >  	}
> >  }
> > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
> >  			if (!l)
> >  				return ERR_PTR(-E2BIG);
> >  			l_new = container_of(l, struct htab_elem, fnode);
> > +			inc_elem_count(htab);
> 
> The current use_percpu_counter heuristic is far from perfect. It works for some cases,
> but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
> Hence, there is a big performance risk doing inc/dec everywhere.
> Hence, this is a nack: we cannot decrease performance of various maps for few folks
> who want to see map stats.

This patch adds some inc/dec only for preallocated hashtabs and doesn't change
code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for
preallocated hashtabs we don't need to compare counters, so a raw (non-batch)
percpu counter may be used for this case.

> If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
> use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
> No kernel patches needed.
> Then bpf_prog_run such tracing prog and read all internal map info.
> It's less convenient that exposing things in uapi, but not being uapi is the point.

Thanks for the pointers, this makes sense. However, this doesn't work for LRU,
which is always pre-allocated. Would it be OK if we add a non-batched percpu
counter for the !BPF_F_NO_PREALLOC case and don't expose it directly to userspace?
Anton Protopopov June 1, 2023, 7:50 a.m. UTC | #4
On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote:
> Hi Anton,
> 
> kernel test robot noticed the following build errors:
> 
> [...]
> 
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/

How does this apply to patches? If I send a v2, should I include these tags
there? If this patch gets rejected, do I need to do anything to close the
robot's ticket?

> All errors (new ones prefixed by >>):
> 
>    kernel/bpf/hashtab.c: In function 'htab_map_pressure':
> >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
>      189 |                 return __percpu_counter_sum(&htab->pcount);
>          |                        ^~~~~~~~~~~~~~~~~~~~
>          |                        percpu_counter_sum
>    cc1: some warnings being treated as errors
> 
> 
> vim +189 kernel/bpf/hashtab.c
> 
>    183	
>    184	static u32 htab_map_pressure(const struct bpf_map *map)
>    185	{
>    186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>    187	
>    188		if (htab->use_percpu_counter)
>  > 189			return __percpu_counter_sum(&htab->pcount);
>    190		return atomic_read(&htab->count);
>    191	}
>    192	

(This bug happens in the !SMP case.)

> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
Alexei Starovoitov June 1, 2023, 4:39 p.m. UTC | #5
On Thu, Jun 1, 2023 at 12:30 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote:
> > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
> > >  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> > >  {
> > >     htab_put_fd_value(htab, l);
> > >
> > > +   dec_elem_count(htab);
> > > +
> > >     if (htab_is_prealloc(htab)) {
> > >             check_and_free_fields(htab, l);
> > >             __pcpu_freelist_push(&htab->freelist, &l->fnode);
> > >     } else {
> > > -           dec_elem_count(htab);
> > >             htab_elem_free(htab, l);
> > >     }
> > >  }
> > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
> > >                     if (!l)
> > >                             return ERR_PTR(-E2BIG);
> > >                     l_new = container_of(l, struct htab_elem, fnode);
> > > +                   inc_elem_count(htab);
> >
> > The current use_percpu_counter heuristic is far from perfect. It works for some cases,
> > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
> > Hence, there is a big performance risk doing inc/dec everywhere.
> > Hence, this is a nack: we cannot decrease performance of various maps for few folks
> > who want to see map stats.
>
> This patch adds some inc/dec only for preallocated hashtabs and doesn't change
> code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for
> preallocated hashtabs we don't need to compare counters,

exactly. that's why I don't like to add inc/dec that serves no purpose
other than stats.

> so a raw (non-batch)
> percpu counter may be used for this case.

and you can do it inside your own bpf prog.

> > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
> > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
> > No kernel patches needed.
> > Then bpf_prog_run such tracing prog and read all internal map info.
> > It's less convenient that exposing things in uapi, but not being uapi is the point.
>
> Thanks for the pointers, this makes sense. However, this doesn't work for LRU
> which is always pre-allocated. Would it be ok if we add non-batch percpu
> counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace?

LRU logic doesn't kick in until the map is full.
If your LRU map is not full you shouldn't be using LRU in the first place.
Anton Protopopov June 1, 2023, 6:18 p.m. UTC | #6
On Thu, Jun 01, 2023 at 09:39:43AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 12:30 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
> > > >  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> > > >  {
> > > >     htab_put_fd_value(htab, l);
> > > >
> > > > +   dec_elem_count(htab);
> > > > +
> > > >     if (htab_is_prealloc(htab)) {
> > > >             check_and_free_fields(htab, l);
> > > >             __pcpu_freelist_push(&htab->freelist, &l->fnode);
> > > >     } else {
> > > > -           dec_elem_count(htab);
> > > >             htab_elem_free(htab, l);
> > > >     }
> > > >  }
> > > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
> > > >                     if (!l)
> > > >                             return ERR_PTR(-E2BIG);
> > > >                     l_new = container_of(l, struct htab_elem, fnode);
> > > > +                   inc_elem_count(htab);
> > >
> > > The current use_percpu_counter heuristic is far from perfect. It works for some cases,
> > > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
> > > Hence, there is a big performance risk doing inc/dec everywhere.
> > > Hence, this is a nack: we cannot decrease performance of various maps for few folks
> > > who want to see map stats.
> >
> > This patch adds some inc/dec only for preallocated hashtabs and doesn't change
> > code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for
> > preallocated hashtabs we don't need to compare counters,
> 
> exactly. that's why I don't like to add inc/dec that serves no purpose
> other than stats.
> 
> > so a raw (non-batch)
> > percpu counter may be used for this case.
> 
> and you can do it inside your own bpf prog.
> 
> > > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
> > > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
> > > No kernel patches needed.
> > > Then bpf_prog_run such tracing prog and read all internal map info.
> > > It's less convenient that exposing things in uapi, but not being uapi is the point.
> >
> > Thanks for the pointers, this makes sense. However, this doesn't work for LRU
> > which is always pre-allocated. Would it be ok if we add non-batch percpu
> > counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace?
> 
> LRU logic doesn't kick in until the map is full.

In fact, it can: a reproducible example is in the self-test from this patch
series. In the test, N threads try to insert random values for keys 1..3000
simultaneously. As a result, the map may contain any number of elements,
typically 100 to 1000 (never the full 3000, which is also less than the map size).
So a user can't even roughly estimate the number of elements in the LRU
map based on the number of updates (with unique keys). A per-cpu counter
inc/dec'ed from the kernel side would solve this.

> If your LRU map is not full you shouldn't be using LRU in the first place.

This makes sense, yes, especially since LRU evictions may happen randomly,
without the map being full. I will step back with this patch until we investigate
whether we can replace LRUs with hashes.

Thanks for the comments!
Alexei Starovoitov June 1, 2023, 6:24 p.m. UTC | #7
On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > LRU logic doesn't kick in until the map is full.
>
> In fact, it can: a reproducable example is in the self-test from this patch
> series. In the test N threads try to insert random values for keys 1..3000
> simultaneously. As the result, the map may contain any number of elements,
> typically 100 to 1000 (never full 3000, which is also less than the map size).
> So a user can't really even closely estimate the number of elements in the LRU
> map based on the number of updates (with unique keys). A per-cpu counter
> inc/dec'ed from the kernel side would solve this.

That's odd and unexpected.
Definitely something to investigate and fix in the LRU map.

Pls cc Martin in the future.

> > If your LRU map is not full you shouldn't be using LRU in the first place.
>
> This makes sense, yes, especially that LRU evictions may happen randomly,
> without a map being full. I will step back with this patch until we investigate
> if we can replace LRUs with hashes.
>
> Thanks for the comments!
Alexei Starovoitov June 2, 2023, 12:40 a.m. UTC | #8
On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducable example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As the result, the map may contain any number of elements,
> > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > So a user can't really even closely estimate the number of elements in the LRU
> > map based on the number of updates (with unique keys). A per-cpu counter
> > inc/dec'ed from the kernel side would solve this.
>
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
>
> Pls cc Martin in the future.
>
> > > If your LRU map is not full you shouldn't be using LRU in the first place.
> >
> > This makes sense, yes, especially that LRU evictions may happen randomly,
> > without a map being full. I will step back with this patch until we investigate
> > if we can replace LRUs with hashes.
> >
> > Thanks for the comments!

Thinking about it more...
since you're proposing to use percpu counter unconditionally for prealloc
and percpu_counter_add_batch() logic is batched,
it could actually be acceptable if it's paired with non-api access.
For example, another patch can add a generic kfunc to do __percpu_counter_sum(),
and in a 3rd patch the maps iterator in kernel/bpf/preload/iterators/iterators.bpf.c
can be extended to print the element count, so the user has a convenient
'cat /sys/fs/bpf/maps.debug' way to debug maps.

But additional logic of percpu_counter_add_batch() might get in the way
of debugging eventually.
If we want to have stats then we can have a normal per-cpu u32 in the basic
struct bpf_map that most maps, except array, will inc/dec on update/delete.
A kfunc to iterate over the percpu values is still necessary.
This way we will be able to see not only the number of elements, but also detect
bad usage when one cpu is only adding and another cpu is only deleting elements,
and other cpu imbalance.

but debugging and stats is a slippery slope. These simple stats won't be
enough and people will be tempted to add more and more.
So I agree that there is a need for bpf map observability,
but it is not clear whether hard coded stats is the solution.
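A rough sketch of this direction, with illustrative names only (the per-CPU
counter would live in struct bpf_map, and a kfunc wrapper around the summing
helper would be exposed to tracing/iterator programs):

    #include <linux/types.h>
    #include <linux/cpumask.h>
    #include <linux/percpu.h>

    /* a plain (non-batched) per-CPU element counter, e.g. an
     * 's64 __percpu *elem_count' allocated at map creation time
     */
    static inline void elem_count_inc(s64 __percpu *elem_count)
    {
            this_cpu_inc(*elem_count);
    }

    static inline void elem_count_dec(s64 __percpu *elem_count)
    {
            this_cpu_dec(*elem_count);
    }

    /* the body of a summing kfunc, usable e.g. from the maps iterator
     * behind /sys/fs/bpf/maps.debug
     */
    static inline s64 elem_count_sum(s64 __percpu *elem_count)
    {
            s64 ret = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    ret += *per_cpu_ptr(elem_count, cpu);
            return ret;
    }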
Anton Protopopov June 2, 2023, 2:21 p.m. UTC | #9
On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > >
> > > > LRU logic doesn't kick in until the map is full.
> > >
> > > In fact, it can: a reproducable example is in the self-test from this patch
> > > series. In the test N threads try to insert random values for keys 1..3000
> > > simultaneously. As the result, the map may contain any number of elements,
> > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > So a user can't really even closely estimate the number of elements in the LRU
> > > map based on the number of updates (with unique keys). A per-cpu counter
> > > inc/dec'ed from the kernel side would solve this.
> >
> > That's odd and unexpected.
> > Definitely something to investigate and fix in the LRU map.
> >
> > Pls cc Martin in the future.
> >
> > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > >
> > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > without a map being full. I will step back with this patch until we investigate
> > > if we can replace LRUs with hashes.
> > >
> > > Thanks for the comments!
> 
> Thinking about it more...
> since you're proposing to use percpu counter unconditionally for prealloc
> and percpu_counter_add_batch() logic is batched,
> it could actually be acceptable if it's paired with non-api access.
> Like another patch can add generic kfunc to do __percpu_counter_sum()
> and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> for maps can be extended to print the element count, so the user can have
> convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> 
> But additional logic of percpu_counter_add_batch() might get in the way
> of debugging eventually.
> If we want to have stats then we can have normal per-cpu u32 in basic
> struct bpf_map that most maps, except array, will inc/dec on update/delete.
> kfunc to iterate over percpu is still necessary.
> This way we will be able to see not only number of elements, but detect
> bad usage when one cpu is only adding and another cpu is deleting elements.
> And other cpu misbalance.

This looks to me like two different things: one is a kfunc to get the current
counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
detailed stats (e.g., per-cpu values or more).

My patch, slightly modified, addresses the first goal: most maps of interest
already have a counter in some form (sometimes just atomic_t or u64+lock). If
we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
the new kfunc can get the counter based on the map type.

If/when there's need to provide per-cpu statistics of elements or some more
sophisticated statistics, this can be done without changing the api of the
bpf_map_elements_count() kfunc.

Would this work?

> but debugging and stats is a slippery slope. These simple stats won't be
> enough and people will be tempted to add more and more.
> So I agree that there is a need for bpf map observability,
> but it is not clear whether hard coded stats is the solution.
Alexei Starovoitov June 2, 2023, 4:23 p.m. UTC | #10
On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > >
> > > > > LRU logic doesn't kick in until the map is full.
> > > >
> > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > simultaneously. As the result, the map may contain any number of elements,
> > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > inc/dec'ed from the kernel side would solve this.
> > >
> > > That's odd and unexpected.
> > > Definitely something to investigate and fix in the LRU map.
> > >
> > > Pls cc Martin in the future.
> > >
> > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > >
> > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > without a map being full. I will step back with this patch until we investigate
> > > > if we can replace LRUs with hashes.
> > > >
> > > > Thanks for the comments!
> >
> > Thinking about it more...
> > since you're proposing to use percpu counter unconditionally for prealloc
> > and percpu_counter_add_batch() logic is batched,
> > it could actually be acceptable if it's paired with non-api access.
> > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > for maps can be extended to print the element count, so the user can have
> > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> >
> > But additional logic of percpu_counter_add_batch() might get in the way
> > of debugging eventually.
> > If we want to have stats then we can have normal per-cpu u32 in basic
> > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > kfunc to iterate over percpu is still necessary.
> > This way we will be able to see not only number of elements, but detect
> > bad usage when one cpu is only adding and another cpu is deleting elements.
> > And other cpu misbalance.
>
> This looks for me like two different things: one is a kfunc to get the current
> counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> detailed stats (e.g., per-cpu values or more).
>
> My patch, slightly modified, addresses the first goal: most maps of interest
> already have a counter in some form (sometimes just atomic_t or u64+lock). If
> we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> the new kfunc can get the counter based on the map type.
>
> If/when there's need to provide per-cpu statistics of elements or some more
> sophisticated statistics, this can be done without changing the api of the
> bpf_map_elements_count() kfunc.
>
> Would this work?

No, because bpf_map_elements_count() as a building block is too big
and too specific. Nothing else can be made out of it, but counting
elements.
"for_each_cpu in per-cpu variable" would be generic that is usable beyond
this particular use case of stats collection.
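
For context, the userspace side of "sum a per-CPU value" is already a familiar
pattern for per-CPU maps; a minimal libbpf sketch (assuming a per-CPU hash map
whose value is a __u64 counter) looks like the following, and the point above
is to have a similarly generic building block available to BPF programs:

    #include <stdlib.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    /* sum the per-CPU slots of one BPF_MAP_TYPE_PERCPU_HASH element */
    static long long sum_percpu_value(int map_fd, __u32 key)
    {
            int cpu, ncpus = libbpf_num_possible_cpus();
            long long sum = 0;
            __u64 *vals;

            if (ncpus < 0)
                    return -1;

            /* lookup returns one 8-byte-aligned slot per possible CPU */
            vals = calloc(ncpus, sizeof(*vals));
            if (!vals)
                    return -1;

            if (!bpf_map_lookup_elem(map_fd, &key, vals))
                    for (cpu = 0; cpu < ncpus; cpu++)
                            sum += vals[cpu];

            free(vals);
            return sum;
    }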
Anton Protopopov June 6, 2023, 7:49 a.m. UTC | #11
On Fri, Jun 02, 2023 at 09:23:11AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > >
> > > > > > LRU logic doesn't kick in until the map is full.
> > > > >
> > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > inc/dec'ed from the kernel side would solve this.
> > > >
> > > > That's odd and unexpected.
> > > > Definitely something to investigate and fix in the LRU map.
> > > >
> > > > Pls cc Martin in the future.
> > > >
> > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > >
> > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > without a map being full. I will step back with this patch until we investigate
> > > > > if we can replace LRUs with hashes.
> > > > >
> > > > > Thanks for the comments!
> > >
> > > Thinking about it more...
> > > since you're proposing to use percpu counter unconditionally for prealloc
> > > and percpu_counter_add_batch() logic is batched,
> > > it could actually be acceptable if it's paired with non-api access.
> > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > for maps can be extended to print the element count, so the user can have
> > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > >
> > > But additional logic of percpu_counter_add_batch() might get in the way
> > > of debugging eventually.
> > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > kfunc to iterate over percpu is still necessary.
> > > This way we will be able to see not only number of elements, but detect
> > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > And other cpu misbalance.
> >
> > This looks for me like two different things: one is a kfunc to get the current
> > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > detailed stats (e.g., per-cpu values or more).
> >
> > My patch, slightly modified, addresses the first goal: most maps of interest
> > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > the new kfunc can get the counter based on the map type.
> >
> > If/when there's need to provide per-cpu statistics of elements or some more
> > sophisticated statistics, this can be done without changing the api of the
> > bpf_map_elements_count() kfunc.
> >
> > Would this work?
> 
> No, because bpf_map_elements_count() as a building block is too big
> and too specific. Nothing else can be made out of it, but counting
> elements.
> "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> this particular use case of stats collection.

Thanks. I will prepare a v2 with a "no-uapi percpu" version.
Yujie Liu June 13, 2023, 8:23 a.m. UTC | #12
Hi Anton,

Sorry for the late reply.

On Thu, Jun 01, 2023 at 07:50:00AM +0000, Anton Protopopov wrote:
> On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote:
> > Hi Anton,
> > 
> > kernel test robot noticed the following build errors:
> > 
> > [...]
> > 
> > If you fix the issue, kindly add following tag where applicable
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/
> 
> How does this apply to patches? If I send a v2, should I include these tags
> there?

If a v2 is sent, these tags should not be included.

> If this patch gets rejected, is there need to do anything to close the
> robot's ticket?

No need to close this ticket.

Thanks for raising the above concerns. We have updated the wording in our
reports as below to avoid misinterpretation:

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: ...
| Closes: ...

--
Best Regards,
Yujie

> > All errors (new ones prefixed by >>):
> > 
> >    kernel/bpf/hashtab.c: In function 'htab_map_pressure':
> > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
> >      189 |                 return __percpu_counter_sum(&htab->pcount);
> >          |                        ^~~~~~~~~~~~~~~~~~~~
> >          |                        percpu_counter_sum
> >    cc1: some warnings being treated as errors
> > 
> > 
> > vim +189 kernel/bpf/hashtab.c
> > 
> >    183	
> >    184	static u32 htab_map_pressure(const struct bpf_map *map)
> >    185	{
> >    186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> >    187	
> >    188		if (htab->use_percpu_counter)
> >  > 189			return __percpu_counter_sum(&htab->pcount);
> >    190		return atomic_read(&htab->count);
> >    191	}
> >    192	
> 
> (This bug happens for !SMP case.)
> 
> > -- 
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
>
Anton Protopopov June 13, 2023, 8:30 a.m. UTC | #13
On Tue, Jun 13, 2023 at 04:23:29PM +0800, Yujie Liu wrote:
> Hi Anton,
> 
> Sorry for the late reply.
> 
> On Thu, Jun 01, 2023 at 07:50:00AM +0000, Anton Protopopov wrote:
> > On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote:
> > > Hi Anton,
> > > 
> > > kernel test robot noticed the following build errors:
> > > 
> > > [...]
> > > 
> > > If you fix the issue, kindly add following tag where applicable
> > > | Reported-by: kernel test robot <lkp@intel.com>
> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/
> > 
> > How does this apply to patches? If I send a v2, should I include these tags
> > there?
> 
> If a v2 is sent, these tags should not be included.
> 
> > If this patch gets rejected, is there need to do anything to close the
> > robot's ticket?
> 
> No need to close this ticket.
> 
> Thanks for raising above concerns. We have updated the wording in our
> reports as below to avoid misinterpretation:
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: ...
> | Closes: ...

Great, thanks for the explanations!

> --
> Best Regards,
> Yujie
> 
> > > All errors (new ones prefixed by >>):
> > > 
> > >    kernel/bpf/hashtab.c: In function 'htab_map_pressure':
> > > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
> > >      189 |                 return __percpu_counter_sum(&htab->pcount);
> > >          |                        ^~~~~~~~~~~~~~~~~~~~
> > >          |                        percpu_counter_sum
> > >    cc1: some warnings being treated as errors
> > > 
> > > 
> > > vim +189 kernel/bpf/hashtab.c
> > > 
> > >    183	
> > >    184	static u32 htab_map_pressure(const struct bpf_map *map)
> > >    185	{
> > >    186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > >    187	
> > >    188		if (htab->use_percpu_counter)
> > >  > 189			return __percpu_counter_sum(&htab->pcount);
> > >    190		return atomic_read(&htab->count);
> > >    191	}
> > >    192	
> > 
> > (This bug happens for !SMP case.)
> > 
> > > -- 
> > > 0-DAY CI Kernel Test Service
> > > https://github.com/intel/lkp-tests/wiki
> >
John Fastabend June 24, 2023, midnight UTC | #14
Alexei Starovoitov wrote:
> On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > >
> > > > > > LRU logic doesn't kick in until the map is full.
> > > > >
> > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > inc/dec'ed from the kernel side would solve this.
> > > >
> > > > That's odd and unexpected.
> > > > Definitely something to investigate and fix in the LRU map.
> > > >
> > > > Pls cc Martin in the future.
> > > >
> > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > >
> > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > without a map being full. I will step back with this patch until we investigate
> > > > > if we can replace LRUs with hashes.
> > > > >
> > > > > Thanks for the comments!
> > >
> > > Thinking about it more...
> > > since you're proposing to use percpu counter unconditionally for prealloc
> > > and percpu_counter_add_batch() logic is batched,
> > > it could actually be acceptable if it's paired with non-api access.
> > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > for maps can be extended to print the element count, so the user can have
> > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > >
> > > But additional logic of percpu_counter_add_batch() might get in the way
> > > of debugging eventually.
> > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > kfunc to iterate over percpu is still necessary.
> > > This way we will be able to see not only number of elements, but detect
> > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > And other cpu misbalance.
> >
> > This looks for me like two different things: one is a kfunc to get the current
> > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > detailed stats (e.g., per-cpu values or more).
> >
> > My patch, slightly modified, addresses the first goal: most maps of interest
> > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > the new kfunc can get the counter based on the map type.
> >
> > If/when there's need to provide per-cpu statistics of elements or some more
> > sophisticated statistics, this can be done without changing the api of the
> > bpf_map_elements_count() kfunc.
> >
> > Would this work?
> 
> No, because bpf_map_elements_count() as a building block is too big
> and too specific. Nothing else can be made out of it, but counting
> elements.
> "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> this particular use case of stats collection.

Without much thought, could you hook the eviction logic in LRU to know
when an eviction happens, and get even more details about what was evicted, so
we could debug the random case where we evict something in a conntrack
table and then later it comes back to life and sends some data, like a
long-living UDP session.

For example, in the cases where you build an LRU map because in 99% of
cases no evictions happen and the LRU is just there as a backstop, you
might even generate events to userspace to let it know evictions are in
progress and it should do something about them.

Thanks,
John
Anton Protopopov June 26, 2023, 4:29 p.m. UTC | #15
On Fri, Jun 23, 2023 at 05:00:15PM -0700, John Fastabend wrote:
> Alexei Starovoitov wrote:
> > On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > > >
> > > > > > > LRU logic doesn't kick in until the map is full.
> > > > > >
> > > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > > inc/dec'ed from the kernel side would solve this.
> > > > >
> > > > > That's odd and unexpected.
> > > > > Definitely something to investigate and fix in the LRU map.
> > > > >
> > > > > Pls cc Martin in the future.
> > > > >
> > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > > >
> > > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > > without a map being full. I will step back with this patch until we investigate
> > > > > > if we can replace LRUs with hashes.
> > > > > >
> > > > > > Thanks for the comments!
> > > >
> > > > Thinking about it more...
> > > > since you're proposing to use percpu counter unconditionally for prealloc
> > > > and percpu_counter_add_batch() logic is batched,
> > > > it could actually be acceptable if it's paired with non-api access.
> > > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > > for maps can be extended to print the element count, so the user can have
> > > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > > >
> > > > But additional logic of percpu_counter_add_batch() might get in the way
> > > > of debugging eventually.
> > > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > > kfunc to iterate over percpu is still necessary.
> > > > This way we will be able to see not only number of elements, but detect
> > > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > > And other cpu misbalance.
> > >
> > > This looks for me like two different things: one is a kfunc to get the current
> > > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > > detailed stats (e.g., per-cpu values or more).
> > >
> > > My patch, slightly modified, addresses the first goal: most maps of interest
> > > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > > the new kfunc can get the counter based on the map type.
> > >
> > > If/when there's need to provide per-cpu statistics of elements or some more
> > > sophisticated statistics, this can be done without changing the api of the
> > > bpf_map_elements_count() kfunc.
> > >
> > > Would this work?
> > 
> > No, because bpf_map_elements_count() as a building block is too big
> > and too specific. Nothing else can be made out of it, but counting
> > elements.
> > "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> > this particular use case of stats collection.
> 
> Without much thought, could you hook the eviction logic in LRU to know
> when the evict happens and even more details about what was evicted so
> we could debug the random case where we evict something in a conntrack
> table and then later it comes back to life and sends some data like a
> long living UDP session.
> 
> For example in the cases where you build an LRU map because in 99%
> cases no evictions happen and the LRU is just there as a backstop
> you might even generate events to userspace to let it know evicts
> are in progress and they should do something about it.

Yes, one can trace evictions, as the destructor function for lru_list is
noinline. An example here: https://github.com/aspsk/bcc/tree/aspsk/lrusnoop

One problem with evictions and LRU is that evictions can [currently] happen at
random times even if the map is nearly empty; see the new map test from this series
and also https://lore.kernel.org/bpf/ZHjhBFLLnUcSy9Tt@zh-lab-node-5/ So for LRU
we really need to invest more time. For hashtabs and other maps the
bpf_map_sum_elem_count kfunc is in any case quite useful.

> Thanks,
> John
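
For illustration, a minimal sketch of such an eviction tracer in the spirit of
the lrusnoop tool linked above (htab_lru_map_delete_node() is static but
noinline, so the kprobe relies on the symbol being present in kallsyms; the
details are illustrative):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>

    char _license[] SEC("license") = "GPL";

    SEC("kprobe/htab_lru_map_delete_node")
    int BPF_KPROBE(lru_evict, void *htab)
    {
            /* the first argument points at struct bpf_htab, whose first
             * member is struct bpf_map, so the map id can be read via CO-RE
             */
            struct bpf_map *map = htab;

            bpf_printk("LRU eviction in map id %u", BPF_CORE_READ(map, id));
            return 0;
    }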
Anton Protopopov June 28, 2023, 1:17 p.m. UTC | #16
Hi Alexei, hi Martin,

On Thu, Jun 01, 2023 at 11:24:20AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducable example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As the result, the map may contain any number of elements,
> > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > So a user can't really even closely estimate the number of elements in the LRU
> > map based on the number of updates (with unique keys). A per-cpu counter
> > inc/dec'ed from the kernel side would solve this.
> 
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
> 
> Pls cc Martin in the future.

I've looked into this a bit more and the problem is as follows.

LRU maps allocate MAX_ENTRIES elements and put them in the global free list.
Then each CPU tries to grab memory in 128-element chunks into its own
local free list.

The user expectation is that evictions start when the map is full; however, in
practice we start evicting elements when the occupancy reaches about
(MAX_ENTRIES - NCPUS*128) elements. This happens because when one CPU has used
up its local free list, it goes to the global list. While there could be
another (NCPUS-1)*128 free elements in the local free lists of other CPUs, our
CPU goes to the global free list, finds it empty, and then starts to evict
elements from the active/inactive lists (a 128-element chunk). Then the same
can happen for another active CPU, etc.

This doesn't look like a problem for big maps, where NCPUS*128 is not a big
share of the total map capacity. For smaller maps this may be unexpected (I
first noticed this on a 4K map where, after updating 4K keys, the map contained
only about 200-300 elements).
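
The rough arithmetic behind that observation, with the CPU count being an
assumption (128 is the per-CPU local free-list chunk size mentioned above):

    #include <stdio.h>

    int main(void)
    {
            unsigned int max_entries = 4096;    /* the 4K LRU map above */
            unsigned int ncpus = 30;            /* assumed online CPUs */
            unsigned int chunk = 128;           /* per-CPU local free-list chunk */

            /* elements that can be parked in per-CPU local free lists */
            unsigned int local = ncpus * chunk;             /* 3840 */

            /* evictions may start once the global free list is drained */
            printf("evictions can start at ~%u elements\n",
                   max_entries - local);                    /* ~256 */
            return 0;
    }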

My first attempt to fix this was to just increase the nr_entries allocated for
the map by NCPUS*128, which makes evictions start happening at MAX_ENTRIES. But
I soon realized that this way users can get more than MAX_ENTRIES elements
inside a map, which is unexpected as well (say, when dumping a map into a
buffer of MAX_ENTRIES size or syncing entries with another map of MAX_ENTRIES
capacity).

I also briefly looked into allowing the prealloc_lru_pop() function to be
called under a bucket lock (by passing the currently locked bucket to it, so
that this pointer is passed all the way down to the htab_lru_map_delete_node()
function, which may then skip locking the bucket if it is the same one). This
seems to work, but I didn't have time to understand whether it breaks the LRU
architecture badly or not.

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f58895830ada..4d33fc6ed2ea 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -162,6 +162,7 @@  struct bpf_map_ops {
 				     void *callback_ctx, u64 flags);
 
 	u64 (*map_mem_usage)(const struct bpf_map *map);
+	u32 (*map_pressure)(const struct bpf_map *map);
 
 	/* BTF id of struct allocated by map_alloc */
 	int *map_btf_id;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9273c654743c..99580f2d006b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6363,7 +6363,7 @@  struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__u32 raw_pressure;
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 9901efee4339..331a923e29d5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -133,6 +133,63 @@  static inline bool htab_is_prealloc(const struct bpf_htab *htab)
 	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
 }
 
+/* compute_batch_value() computes batch value as num_online_cpus() * 2
+ * and __percpu_counter_compare() needs
+ * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
+ * for percpu_counter to be faster than atomic_t. In practice the average bpf
+ * hash map size is 10k, which means that a system with 64 cpus will fill
+ * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
+ * define our own batch count as 32 then 10k hash map can be filled up to 80%:
+ * 10k - 8k > 32 _batch_ * 64 _cpus_
+ * and __percpu_counter_compare() will still be fast. At that point hash map
+ * collisions will dominate its performance anyway. Assume that hash map filled
+ * to 50+% isn't going to be O(1) and use the following formula to choose
+ * between percpu_counter and atomic_t.
+ *
+ * For preallocated maps we only increase/decrease counters on adding/removing
+ * an element to be later fetched by htab_map_pressure, so we always enable the
+ * per-cpu version in favor of atomic
+ */
+#define PERCPU_COUNTER_BATCH 32
+static bool htab_use_percpu_counter(union bpf_attr *attr)
+{
+	return (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH ||
+		!(attr->map_flags & BPF_F_NO_PREALLOC));
+}
+
+static bool is_map_full(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
+						PERCPU_COUNTER_BATCH) >= 0;
+	return atomic_read(&htab->count) >= htab->map.max_entries;
+}
+
+static void inc_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_inc(&htab->count);
+}
+
+static void dec_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_dec(&htab->count);
+}
+
+static u32 htab_map_pressure(const struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	if (htab->use_percpu_counter)
+		return __percpu_counter_sum(&htab->pcount);
+	return atomic_read(&htab->count);
+}
+
 static void htab_init_buckets(struct bpf_htab *htab)
 {
 	unsigned int i;
@@ -539,23 +596,7 @@  static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	htab_init_buckets(htab);
 
-/* compute_batch_value() computes batch value as num_online_cpus() * 2
- * and __percpu_counter_compare() needs
- * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
- * for percpu_counter to be faster than atomic_t. In practice the average bpf
- * hash map size is 10k, which means that a system with 64 cpus will fill
- * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
- * define our own batch count as 32 then 10k hash map can be filled up to 80%:
- * 10k - 8k > 32 _batch_ * 64 _cpus_
- * and __percpu_counter_compare() will still be fast. At that point hash map
- * collisions will dominate its performance anyway. Assume that hash map filled
- * to 50+% isn't going to be O(1) and use the following formula to choose
- * between percpu_counter and atomic_t.
- */
-#define PERCPU_COUNTER_BATCH 32
-	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
-		htab->use_percpu_counter = true;
-
+	htab->use_percpu_counter = htab_use_percpu_counter(attr);
 	if (htab->use_percpu_counter) {
 		err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
 		if (err)
@@ -810,6 +851,7 @@  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 		if (l == tgt_l) {
 			hlist_nulls_del_rcu(&l->hash_node);
 			check_and_free_fields(htab, l);
+			dec_elem_count(htab);
 			break;
 		}
 
@@ -896,40 +938,16 @@  static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 	}
 }
 
-static bool is_map_full(struct bpf_htab *htab)
-{
-	if (htab->use_percpu_counter)
-		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
-						PERCPU_COUNTER_BATCH) >= 0;
-	return atomic_read(&htab->count) >= htab->map.max_entries;
-}
-
-static void inc_elem_count(struct bpf_htab *htab)
-{
-	if (htab->use_percpu_counter)
-		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
-	else
-		atomic_inc(&htab->count);
-}
-
-static void dec_elem_count(struct bpf_htab *htab)
-{
-	if (htab->use_percpu_counter)
-		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
-	else
-		atomic_dec(&htab->count);
-}
-
-
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
 
+	dec_elem_count(htab);
+
 	if (htab_is_prealloc(htab)) {
 		check_and_free_fields(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
-		dec_elem_count(htab);
 		htab_elem_free(htab, l);
 	}
 }
@@ -1006,6 +1024,7 @@  static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			if (!l)
 				return ERR_PTR(-E2BIG);
 			l_new = container_of(l, struct htab_elem, fnode);
+			inc_elem_count(htab);
 		}
 	} else {
 		if (is_map_full(htab))
@@ -1227,9 +1246,11 @@  static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
 	 * concurrent search will find it before old elem
 	 */
 	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+	inc_elem_count(htab);
 	if (l_old) {
 		bpf_lru_node_set_ref(&l_new->lru_node);
 		hlist_nulls_del_rcu(&l_old->hash_node);
+		dec_elem_count(htab);
 	}
 	ret = 0;
 
@@ -1357,6 +1378,7 @@  static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
 				value, onallcpus);
 		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+		inc_elem_count(htab);
 		l_new = NULL;
 	}
 	ret = 0;
@@ -1443,9 +1465,10 @@  static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
-	if (l)
+	if (l) {
+		dec_elem_count(htab);
 		hlist_nulls_del_rcu(&l->hash_node);
-	else
+	} else
 		ret = -ENOENT;
 
 	htab_unlock_bucket(htab, b, hash, flags);
@@ -2249,6 +2272,7 @@  const struct bpf_map_ops htab_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2271,6 +2295,7 @@  const struct bpf_map_ops htab_lru_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab_lru),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2423,6 +2448,7 @@  const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2443,6 +2469,7 @@  const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab_lru_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2581,6 +2608,7 @@  const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 };
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index e0d3ddf2037a..24ff5feb07ca 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -730,6 +730,13 @@  static u64 trie_mem_usage(const struct bpf_map *map)
 	return elem_size * READ_ONCE(trie->n_entries);
 }
 
+static u32 trie_map_pressure(const struct bpf_map *map)
+{
+	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+
+	return READ_ONCE(trie->n_entries);
+}
+
 BTF_ID_LIST_SINGLE(trie_map_btf_ids, struct, lpm_trie)
 const struct bpf_map_ops trie_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
@@ -744,5 +751,6 @@  const struct bpf_map_ops trie_map_ops = {
 	.map_delete_batch = generic_map_delete_batch,
 	.map_check_btf = trie_check_btf,
 	.map_mem_usage = trie_mem_usage,
+	.map_pressure = trie_map_pressure,
 	.map_btf_id = &trie_map_btf_ids[0],
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 92a57efc77de..6ea30a24f057 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -794,6 +794,13 @@  static fmode_t map_get_sys_perms(struct bpf_map *map, struct fd f)
 	return mode;
 }
 
+static u32 bpf_map_pressure(const struct bpf_map *map)
+{
+	if (map->ops->map_pressure)
+		return map->ops->map_pressure(map);
+	return 0;
+}
+
 #ifdef CONFIG_PROC_FS
 /* Show the memory usage of a bpf map */
 static u64 bpf_map_memory_usage(const struct bpf_map *map)
@@ -804,6 +811,7 @@  static u64 bpf_map_memory_usage(const struct bpf_map *map)
 static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 {
 	struct bpf_map *map = filp->private_data;
+	char map_pressure_buf[36] = "";
 	u32 type = 0, jited = 0;
 
 	if (map_type_contains_progs(map)) {
@@ -813,6 +821,10 @@  static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		spin_unlock(&map->owner.lock);
 	}
 
+	if (map->ops->map_pressure)
+		snprintf(map_pressure_buf, sizeof(map_pressure_buf),
+			 "raw_pressure:\t%u\n", map->ops->map_pressure(map));
+
 	seq_printf(m,
 		   "map_type:\t%u\n"
 		   "key_size:\t%u\n"
@@ -821,6 +833,7 @@  static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		   "map_flags:\t%#x\n"
 		   "map_extra:\t%#llx\n"
 		   "memlock:\t%llu\n"
+		   "%s"
 		   "map_id:\t%u\n"
 		   "frozen:\t%u\n",
 		   map->map_type,
@@ -830,6 +843,7 @@  static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		   map->map_flags,
 		   (unsigned long long)map->map_extra,
 		   bpf_map_memory_usage(map),
+		   map_pressure_buf,
 		   map->id,
 		   READ_ONCE(map->frozen));
 	if (type) {
@@ -4275,6 +4289,7 @@  static int bpf_map_get_info_by_fd(struct file *file,
 	info.value_size = map->value_size;
 	info.max_entries = map->max_entries;
 	info.map_flags = map->map_flags;
+	info.raw_pressure = bpf_map_pressure(map);
 	info.map_extra = map->map_extra;
 	memcpy(info.name, map->name, sizeof(map->name));
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9273c654743c..99580f2d006b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6363,7 +6363,7 @@  struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__u32 raw_pressure;
 	__u64 map_extra;
 } __attribute__((aligned(8)));