Message ID: 20230531110511.64612-2-aspsk@isovalent.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series: add mechanism to report map pressure
On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  {
>  	htab_put_fd_value(htab, l);
>
> +	dec_elem_count(htab);
> +
>  	if (htab_is_prealloc(htab)) {
>  		check_and_free_fields(htab, l);
>  		__pcpu_freelist_push(&htab->freelist, &l->fnode);
>  	} else {
> -		dec_elem_count(htab);
>  		htab_elem_free(htab, l);
>  	}
>  }
> @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>  		if (!l)
>  			return ERR_PTR(-E2BIG);
>  		l_new = container_of(l, struct htab_elem, fnode);
> +		inc_elem_count(htab);

The current use_percpu_counter heuristic is far from perfect. It works for some cases,
but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
Hence, there is a big performance risk doing inc/dec everywhere.
Hence, this is a nack: we cannot decrease performance of various maps for the few folks
who want to see map stats.

If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
use tracing-style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
No kernel patches needed.
Then bpf_prog_run such a tracing prog and read all internal map info.
It's less convenient than exposing things in uapi, but not being uapi is the point.
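For readers who want to try the approach suggested above, a minimal sketch of the map_ptr_kern.c-style access could look like the following. The struct definitions are partial, CO-RE-relocated mirrors of kernel-internal types (as in tools/testing/selftests/bpf/progs/map_ptr_kern.c); the mirrored field names, the chosen program section, and which internal counter is actually maintained for a given map configuration are assumptions that may need adjusting per kernel version.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Partial mirrors of kernel-internal types, relocated via CO-RE. */
struct bpf_map {
	__u32 max_entries;
} __attribute__((preserve_access_index));

struct bpf_htab {
	struct bpf_map map;
	struct {
		int counter;
	} count;	/* mirrors the kernel's atomic_t; not maintained for prealloc htabs */
} __attribute__((preserve_access_index));

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* i.e. a bpf_mem_alloc-backed htab */
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, __u64);
} m_hash SEC(".maps");

SEC("cgroup_skb/egress")	/* same section the selftest uses; can be exercised via BPF_PROG_TEST_RUN */
int dump_hash_stats(void *ctx)
{
	struct bpf_htab *htab = (struct bpf_htab *)&m_hash;

	bpf_printk("m_hash: count=%d max_entries=%u",
		   htab->count.counter, htab->map.max_entries);
	return 1;	/* cgroup_skb: allow the packet */
}

char LICENSE[] SEC("license") = "GPL";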
Hi Anton,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Anton-Protopopov/bpf-add-new-map-ops-map_pressure/20230531-190704
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230531110511.64612-2-aspsk%40isovalent.com
patch subject: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20230601/202306010837.mGhA199K-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 12.3.0
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/025cc7c86c6c7e108ba5b9946a0f50e0cc082f9b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Anton-Protopopov/bpf-add-new-map-ops-map_pressure/20230531-190704
        git checkout 025cc7c86c6c7e108ba5b9946a0f50e0cc082f9b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=sh olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=sh SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/hashtab.c: In function 'htab_map_pressure':
>> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
     189 |         return __percpu_counter_sum(&htab->pcount);
         |                ^~~~~~~~~~~~~~~~~~~~
         |                percpu_counter_sum
   cc1: some warnings being treated as errors


vim +189 kernel/bpf/hashtab.c

   183	
   184	static u32 htab_map_pressure(const struct bpf_map *map)
   185	{
   186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
   187	
   188		if (htab->use_percpu_counter)
 > 189			return __percpu_counter_sum(&htab->pcount);
   190		return atomic_read(&htab->count);
   191	}
   192	
On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote: > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote: > > static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) > > { > > htab_put_fd_value(htab, l); > > > > + dec_elem_count(htab); > > + > > if (htab_is_prealloc(htab)) { > > check_and_free_fields(htab, l); > > __pcpu_freelist_push(&htab->freelist, &l->fnode); > > } else { > > - dec_elem_count(htab); > > htab_elem_free(htab, l); > > } > > } > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, > > if (!l) > > return ERR_PTR(-E2BIG); > > l_new = container_of(l, struct htab_elem, fnode); > > + inc_elem_count(htab); > > The current use_percpu_counter heuristic is far from perfect. It works for some cases, > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say. > Hence, there is a big performance risk doing inc/dec everywhere. > Hence, this is a nack: we cannot decrease performance of various maps for few folks > who want to see map stats. This patch adds some inc/dec only for preallocated hashtabs and doesn't change code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for preallocated hashtabs we don't need to compare counters, so a raw (non-batch) percpu counter may be used for this case. > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating. > No kernel patches needed. > Then bpf_prog_run such tracing prog and read all internal map info. > It's less convenient that exposing things in uapi, but not being uapi is the point. Thanks for the pointers, this makes sense. However, this doesn't work for LRU which is always pre-allocated. Would it be ok if we add non-batch percpu counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace?
On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote: > Hi Anton, > > kernel test robot noticed the following build errors: > > [...] > > If you fix the issue, kindly add following tag where applicable > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/ How does this apply to patches? If I send a v2, should I include these tags there? If this patch gets rejected, is there need to do anything to close the robot's ticket? > All errors (new ones prefixed by >>): > > kernel/bpf/hashtab.c: In function 'htab_map_pressure': > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration] > 189 | return __percpu_counter_sum(&htab->pcount); > | ^~~~~~~~~~~~~~~~~~~~ > | percpu_counter_sum > cc1: some warnings being treated as errors > > > vim +189 kernel/bpf/hashtab.c > > 183 > 184 static u32 htab_map_pressure(const struct bpf_map *map) > 185 { > 186 struct bpf_htab *htab = container_of(map, struct bpf_htab, map); > 187 > 188 if (htab->use_percpu_counter) > > 189 return __percpu_counter_sum(&htab->pcount); > 190 return atomic_read(&htab->count); > 191 } > 192 (This bug happens for !SMP case.) > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki
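For what it's worth, percpu_counter_sum() is available on both SMP and !SMP builds (the !SMP variant simply returns the embedded counter), while __percpu_counter_sum() only exists under CONFIG_SMP, so one possible way to address the robot's report in a respin would be:

static u32 htab_map_pressure(const struct bpf_map *map)
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);

	if (htab->use_percpu_counter)
		/* defined for both SMP and !SMP, unlike __percpu_counter_sum() */
		return percpu_counter_sum(&htab->pcount);
	return atomic_read(&htab->count);
}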
On Thu, Jun 1, 2023 at 12:30 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote: > > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote: > > > static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) > > > { > > > htab_put_fd_value(htab, l); > > > > > > + dec_elem_count(htab); > > > + > > > if (htab_is_prealloc(htab)) { > > > check_and_free_fields(htab, l); > > > __pcpu_freelist_push(&htab->freelist, &l->fnode); > > > } else { > > > - dec_elem_count(htab); > > > htab_elem_free(htab, l); > > > } > > > } > > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, > > > if (!l) > > > return ERR_PTR(-E2BIG); > > > l_new = container_of(l, struct htab_elem, fnode); > > > + inc_elem_count(htab); > > > > The current use_percpu_counter heuristic is far from perfect. It works for some cases, > > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say. > > Hence, there is a big performance risk doing inc/dec everywhere. > > Hence, this is a nack: we cannot decrease performance of various maps for few folks > > who want to see map stats. > > This patch adds some inc/dec only for preallocated hashtabs and doesn't change > code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for > preallocated hashtabs we don't need to compare counters, exactly. that's why I don't like to add inc/dec that serves no purpose other than stats. > so a raw (non-batch) > percpu counter may be used for this case. and you can do it inside your own bpf prog. > > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and > > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating. > > No kernel patches needed. > > Then bpf_prog_run such tracing prog and read all internal map info. > > It's less convenient that exposing things in uapi, but not being uapi is the point. > > Thanks for the pointers, this makes sense. However, this doesn't work for LRU > which is always pre-allocated. Would it be ok if we add non-batch percpu > counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace? LRU logic doesn't kick in until the map is full. If your LRU map is not full you shouldn't be using LRU in the first place.
On Thu, Jun 01, 2023 at 09:39:43AM -0700, Alexei Starovoitov wrote: > On Thu, Jun 1, 2023 at 12:30 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote: > > > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote: > > > > static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) > > > > { > > > > htab_put_fd_value(htab, l); > > > > > > > > + dec_elem_count(htab); > > > > + > > > > if (htab_is_prealloc(htab)) { > > > > check_and_free_fields(htab, l); > > > > __pcpu_freelist_push(&htab->freelist, &l->fnode); > > > > } else { > > > > - dec_elem_count(htab); > > > > htab_elem_free(htab, l); > > > > } > > > > } > > > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, > > > > if (!l) > > > > return ERR_PTR(-E2BIG); > > > > l_new = container_of(l, struct htab_elem, fnode); > > > > + inc_elem_count(htab); > > > > > > The current use_percpu_counter heuristic is far from perfect. It works for some cases, > > > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say. > > > Hence, there is a big performance risk doing inc/dec everywhere. > > > Hence, this is a nack: we cannot decrease performance of various maps for few folks > > > who want to see map stats. > > > > This patch adds some inc/dec only for preallocated hashtabs and doesn't change > > code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for > > preallocated hashtabs we don't need to compare counters, > > exactly. that's why I don't like to add inc/dec that serves no purpose > other than stats. > > > so a raw (non-batch) > > percpu counter may be used for this case. > > and you can do it inside your own bpf prog. > > > > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and > > > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating. > > > No kernel patches needed. > > > Then bpf_prog_run such tracing prog and read all internal map info. > > > It's less convenient that exposing things in uapi, but not being uapi is the point. > > > > Thanks for the pointers, this makes sense. However, this doesn't work for LRU > > which is always pre-allocated. Would it be ok if we add non-batch percpu > > counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace? > > LRU logic doesn't kick in until the map is full. In fact, it can: a reproducable example is in the self-test from this patch series. In the test N threads try to insert random values for keys 1..3000 simultaneously. As the result, the map may contain any number of elements, typically 100 to 1000 (never full 3000, which is also less than the map size). So a user can't really even closely estimate the number of elements in the LRU map based on the number of updates (with unique keys). A per-cpu counter inc/dec'ed from the kernel side would solve this. > If your LRU map is not full you shouldn't be using LRU in the first place. This makes sense, yes, especially that LRU evictions may happen randomly, without a map being full. I will step back with this patch until we investigate if we can replace LRUs with hashes. Thanks for the comments!
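A simplified userspace sketch of the kind of test described above might look like this; it is not the selftest from the series, and the map size, key range, and thread count are illustrative:

/* Several threads update the same LRU hash with keys 1..3000 and random
 * values, then the surviving elements are counted.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <bpf/bpf.h>

#define NR_THREADS 8
#define NR_KEYS    3000

static int map_fd;

static void *updater(void *arg)
{
	__u32 key;
	__u64 val;

	for (key = 1; key <= NR_KEYS; key++) {
		val = rand();
		bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
	}
	return NULL;
}

int main(void)
{
	__u32 key = 0, next, count = 0;
	pthread_t th[NR_THREADS];
	int i;

	/* the map is bigger than the key range, so naively it should never evict */
	map_fd = bpf_map_create(BPF_MAP_TYPE_LRU_HASH, "lru_repro",
				sizeof(__u32), sizeof(__u64), 4096, NULL);
	if (map_fd < 0) {
		perror("bpf_map_create");
		return 1;
	}

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&th[i], NULL, updater, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(th[i], NULL);

	/* count what actually survived in the map */
	while (bpf_map_get_next_key(map_fd, count ? &key : NULL, &next) == 0) {
		key = next;
		count++;
	}
	printf("updated %d distinct keys, map now holds %u elements\n",
	       NR_KEYS, count);
	return 0;
}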
On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > LRU logic doesn't kick in until the map is full. > > In fact, it can: a reproducable example is in the self-test from this patch > series. In the test N threads try to insert random values for keys 1..3000 > simultaneously. As the result, the map may contain any number of elements, > typically 100 to 1000 (never full 3000, which is also less than the map size). > So a user can't really even closely estimate the number of elements in the LRU > map based on the number of updates (with unique keys). A per-cpu counter > inc/dec'ed from the kernel side would solve this. That's odd and unexpected. Definitely something to investigate and fix in the LRU map. Pls cc Martin in the future. > > If your LRU map is not full you shouldn't be using LRU in the first place. > > This makes sense, yes, especially that LRU evictions may happen randomly, > without a map being full. I will step back with this patch until we investigate > if we can replace LRUs with hashes. > > Thanks for the comments!
On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > LRU logic doesn't kick in until the map is full. > > > > In fact, it can: a reproducable example is in the self-test from this patch > > series. In the test N threads try to insert random values for keys 1..3000 > > simultaneously. As the result, the map may contain any number of elements, > > typically 100 to 1000 (never full 3000, which is also less than the map size). > > So a user can't really even closely estimate the number of elements in the LRU > > map based on the number of updates (with unique keys). A per-cpu counter > > inc/dec'ed from the kernel side would solve this. > > That's odd and unexpected. > Definitely something to investigate and fix in the LRU map. > > Pls cc Martin in the future. > > > > If your LRU map is not full you shouldn't be using LRU in the first place. > > > > This makes sense, yes, especially that LRU evictions may happen randomly, > > without a map being full. I will step back with this patch until we investigate > > if we can replace LRUs with hashes. > > > > Thanks for the comments! Thinking about it more... since you're proposing to use percpu counter unconditionally for prealloc and percpu_counter_add_batch() logic is batched, it could actually be acceptable if it's paired with non-api access. Like another patch can add generic kfunc to do __percpu_counter_sum() and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c for maps can be extended to print the element count, so the user can have convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps. But additional logic of percpu_counter_add_batch() might get in the way of debugging eventually. If we want to have stats then we can have normal per-cpu u32 in basic struct bpf_map that most maps, except array, will inc/dec on update/delete. kfunc to iterate over percpu is still necessary. This way we will be able to see not only number of elements, but detect bad usage when one cpu is only adding and another cpu is deleting elements. And other cpu misbalance. but debugging and stats is a slippery slope. These simple stats won't be enough and people will be tempted to add more and more. So I agree that there is a need for bpf map observability, but it is not clear whether hard coded stats is the solution.
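To make the iterators.bpf.c idea concrete: the existing dump_bpf_map() program in kernel/bpf/preload/iterators/iterators.bpf.c already prints id/name/max_entries for every map behind /sys/fs/bpf/maps.debug, and extending it could look roughly like the sketch below. The bpf_map_sum_elem_count() kfunc is hypothetical at this point in the discussion; the includes and CO-RE struct mirrors follow the existing file and are omitted here.

/* Hypothetical kfunc; not in the kernel at the time of this thread. */
extern __s64 bpf_map_sum_elem_count(struct bpf_map *map) __weak __ksym;

SEC("iter/bpf_map")
int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	__u64 seq_num = ctx->meta->seq_num;
	struct bpf_map *map = ctx->map;

	if (!map)
		return 0;

	if (seq_num == 0)
		BPF_SEQ_PRINTF(seq, "  id name             max_entries  cur_entries\n");

	BPF_SEQ_PRINTF(seq, "%4u %-16s %11u %12lld\n",
		       map->id, map->name, map->max_entries,
		       bpf_map_sum_elem_count(map));
	return 0;
}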
On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote: > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > > > LRU logic doesn't kick in until the map is full. > > > > > > In fact, it can: a reproducable example is in the self-test from this patch > > > series. In the test N threads try to insert random values for keys 1..3000 > > > simultaneously. As the result, the map may contain any number of elements, > > > typically 100 to 1000 (never full 3000, which is also less than the map size). > > > So a user can't really even closely estimate the number of elements in the LRU > > > map based on the number of updates (with unique keys). A per-cpu counter > > > inc/dec'ed from the kernel side would solve this. > > > > That's odd and unexpected. > > Definitely something to investigate and fix in the LRU map. > > > > Pls cc Martin in the future. > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place. > > > > > > This makes sense, yes, especially that LRU evictions may happen randomly, > > > without a map being full. I will step back with this patch until we investigate > > > if we can replace LRUs with hashes. > > > > > > Thanks for the comments! > > Thinking about it more... > since you're proposing to use percpu counter unconditionally for prealloc > and percpu_counter_add_batch() logic is batched, > it could actually be acceptable if it's paired with non-api access. > Like another patch can add generic kfunc to do __percpu_counter_sum() > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c > for maps can be extended to print the element count, so the user can have > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps. > > But additional logic of percpu_counter_add_batch() might get in the way > of debugging eventually. > If we want to have stats then we can have normal per-cpu u32 in basic > struct bpf_map that most maps, except array, will inc/dec on update/delete. > kfunc to iterate over percpu is still necessary. > This way we will be able to see not only number of elements, but detect > bad usage when one cpu is only adding and another cpu is deleting elements. > And other cpu misbalance. This looks for me like two different things: one is a kfunc to get the current counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more detailed stats (e.g., per-cpu values or more). My patch, slightly modified, addresses the first goal: most maps of interest already have a counter in some form (sometimes just atomic_t or u64+lock). If we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done: the new kfunc can get the counter based on the map type. If/when there's need to provide per-cpu statistics of elements or some more sophisticated statistics, this can be done without changing the api of the bpf_map_elements_count() kfunc. Would this work? > but debugging and stats is a slippery slope. These simple stats won't be > enough and people will be tempted to add more and more. > So I agree that there is a need for bpf map observability, > but it is not clear whether hard coded stats is the solution.
On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote: > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > > > > > LRU logic doesn't kick in until the map is full. > > > > > > > > In fact, it can: a reproducable example is in the self-test from this patch > > > > series. In the test N threads try to insert random values for keys 1..3000 > > > > simultaneously. As the result, the map may contain any number of elements, > > > > typically 100 to 1000 (never full 3000, which is also less than the map size). > > > > So a user can't really even closely estimate the number of elements in the LRU > > > > map based on the number of updates (with unique keys). A per-cpu counter > > > > inc/dec'ed from the kernel side would solve this. > > > > > > That's odd and unexpected. > > > Definitely something to investigate and fix in the LRU map. > > > > > > Pls cc Martin in the future. > > > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place. > > > > > > > > This makes sense, yes, especially that LRU evictions may happen randomly, > > > > without a map being full. I will step back with this patch until we investigate > > > > if we can replace LRUs with hashes. > > > > > > > > Thanks for the comments! > > > > Thinking about it more... > > since you're proposing to use percpu counter unconditionally for prealloc > > and percpu_counter_add_batch() logic is batched, > > it could actually be acceptable if it's paired with non-api access. > > Like another patch can add generic kfunc to do __percpu_counter_sum() > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c > > for maps can be extended to print the element count, so the user can have > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps. > > > > But additional logic of percpu_counter_add_batch() might get in the way > > of debugging eventually. > > If we want to have stats then we can have normal per-cpu u32 in basic > > struct bpf_map that most maps, except array, will inc/dec on update/delete. > > kfunc to iterate over percpu is still necessary. > > This way we will be able to see not only number of elements, but detect > > bad usage when one cpu is only adding and another cpu is deleting elements. > > And other cpu misbalance. > > This looks for me like two different things: one is a kfunc to get the current > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more > detailed stats (e.g., per-cpu values or more). > > My patch, slightly modified, addresses the first goal: most maps of interest > already have a counter in some form (sometimes just atomic_t or u64+lock). If > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done: > the new kfunc can get the counter based on the map type. > > If/when there's need to provide per-cpu statistics of elements or some more > sophisticated statistics, this can be done without changing the api of the > bpf_map_elements_count() kfunc. > > Would this work? No, because bpf_map_elements_count() as a building block is too big and too specific. Nothing else can be made out of it, but counting elements. "for_each_cpu in per-cpu variable" would be generic that is usable beyond this particular use case of stats collection.
On Fri, Jun 02, 2023 at 09:23:11AM -0700, Alexei Starovoitov wrote: > On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote: > > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > > > > > > > LRU logic doesn't kick in until the map is full. > > > > > > > > > > In fact, it can: a reproducable example is in the self-test from this patch > > > > > series. In the test N threads try to insert random values for keys 1..3000 > > > > > simultaneously. As the result, the map may contain any number of elements, > > > > > typically 100 to 1000 (never full 3000, which is also less than the map size). > > > > > So a user can't really even closely estimate the number of elements in the LRU > > > > > map based on the number of updates (with unique keys). A per-cpu counter > > > > > inc/dec'ed from the kernel side would solve this. > > > > > > > > That's odd and unexpected. > > > > Definitely something to investigate and fix in the LRU map. > > > > > > > > Pls cc Martin in the future. > > > > > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place. > > > > > > > > > > This makes sense, yes, especially that LRU evictions may happen randomly, > > > > > without a map being full. I will step back with this patch until we investigate > > > > > if we can replace LRUs with hashes. > > > > > > > > > > Thanks for the comments! > > > > > > Thinking about it more... > > > since you're proposing to use percpu counter unconditionally for prealloc > > > and percpu_counter_add_batch() logic is batched, > > > it could actually be acceptable if it's paired with non-api access. > > > Like another patch can add generic kfunc to do __percpu_counter_sum() > > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c > > > for maps can be extended to print the element count, so the user can have > > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps. > > > > > > But additional logic of percpu_counter_add_batch() might get in the way > > > of debugging eventually. > > > If we want to have stats then we can have normal per-cpu u32 in basic > > > struct bpf_map that most maps, except array, will inc/dec on update/delete. > > > kfunc to iterate over percpu is still necessary. > > > This way we will be able to see not only number of elements, but detect > > > bad usage when one cpu is only adding and another cpu is deleting elements. > > > And other cpu misbalance. > > > > This looks for me like two different things: one is a kfunc to get the current > > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more > > detailed stats (e.g., per-cpu values or more). > > > > My patch, slightly modified, addresses the first goal: most maps of interest > > already have a counter in some form (sometimes just atomic_t or u64+lock). If > > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done: > > the new kfunc can get the counter based on the map type. > > > > If/when there's need to provide per-cpu statistics of elements or some more > > sophisticated statistics, this can be done without changing the api of the > > bpf_map_elements_count() kfunc. > > > > Would this work? > > No, because bpf_map_elements_count() as a building block is too big > and too specific. 
> Nothing else can be made out of it, but counting
> elements.
> "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> this particular use case of stats collection.

Thanks. I will prepare a v2 with a "no-uapi percpu" version.
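One shape the planned "no-uapi percpu" v2 could take, following the for_each_cpu suggestion, is a plain per-cpu element counter in the generic struct bpf_map plus a kfunc that folds it. The field name, the kfunc name, and the omitted kfunc registration boilerplate are illustrative only:

/* Assumes a hypothetical 's64 __percpu *elem_count' field added to struct
 * bpf_map, inc/dec'ed by map implementations on update/delete.  Nothing here
 * is exposed through uapi; BPF programs reach it via the kfunc.
 */
__bpf_kfunc s64 bpf_map_sum_elem_count(const struct bpf_map *map)
{
	s64 ret = 0;
	int cpu;

	if (!map || !map->elem_count)
		return -EINVAL;

	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(map->elem_count, cpu);

	return ret;
}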
Hi Anton, Sorry for the late reply. On Thu, Jun 01, 2023 at 07:50:00AM +0000, Anton Protopopov wrote: > On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote: > > Hi Anton, > > > > kernel test robot noticed the following build errors: > > > > [...] > > > > If you fix the issue, kindly add following tag where applicable > > | Reported-by: kernel test robot <lkp@intel.com> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/ > > How does this apply to patches? If I send a v2, should I include these tags > there? If a v2 is sent, these tags should not be included. > If this patch gets rejected, is there need to do anything to close the > robot's ticket? No need to close this ticket. Thanks for raising above concerns. We have updated the wording in our reports as below to avoid misinterpretation: If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: ... | Closes: ... -- Best Regards, Yujie > > All errors (new ones prefixed by >>): > > > > kernel/bpf/hashtab.c: In function 'htab_map_pressure': > > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration] > > 189 | return __percpu_counter_sum(&htab->pcount); > > | ^~~~~~~~~~~~~~~~~~~~ > > | percpu_counter_sum > > cc1: some warnings being treated as errors > > > > > > vim +189 kernel/bpf/hashtab.c > > > > 183 > > 184 static u32 htab_map_pressure(const struct bpf_map *map) > > 185 { > > 186 struct bpf_htab *htab = container_of(map, struct bpf_htab, map); > > 187 > > 188 if (htab->use_percpu_counter) > > > 189 return __percpu_counter_sum(&htab->pcount); > > 190 return atomic_read(&htab->count); > > 191 } > > 192 > > (This bug happens for !SMP case.) > > > -- > > 0-DAY CI Kernel Test Service > > https://github.com/intel/lkp-tests/wiki >
On Tue, Jun 13, 2023 at 04:23:29PM +0800, Yujie Liu wrote: > Hi Anton, > > Sorry for the late reply. > > On Thu, Jun 01, 2023 at 07:50:00AM +0000, Anton Protopopov wrote: > > On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote: > > > Hi Anton, > > > > > > kernel test robot noticed the following build errors: > > > > > > [...] > > > > > > If you fix the issue, kindly add following tag where applicable > > > | Reported-by: kernel test robot <lkp@intel.com> > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/ > > > > How does this apply to patches? If I send a v2, should I include these tags > > there? > > If a v2 is sent, these tags should not be included. > > > If this patch gets rejected, is there need to do anything to close the > > robot's ticket? > > No need to close this ticket. > > Thanks for raising above concerns. We have updated the wording in our > reports as below to avoid misinterpretation: > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: ... > | Closes: ... Great, thanks for the explanations! > -- > Best Regards, > Yujie > > > > All errors (new ones prefixed by >>): > > > > > > kernel/bpf/hashtab.c: In function 'htab_map_pressure': > > > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration] > > > 189 | return __percpu_counter_sum(&htab->pcount); > > > | ^~~~~~~~~~~~~~~~~~~~ > > > | percpu_counter_sum > > > cc1: some warnings being treated as errors > > > > > > > > > vim +189 kernel/bpf/hashtab.c > > > > > > 183 > > > 184 static u32 htab_map_pressure(const struct bpf_map *map) > > > 185 { > > > 186 struct bpf_htab *htab = container_of(map, struct bpf_htab, map); > > > 187 > > > 188 if (htab->use_percpu_counter) > > > > 189 return __percpu_counter_sum(&htab->pcount); > > > 190 return atomic_read(&htab->count); > > > 191 } > > > 192 > > > > (This bug happens for !SMP case.) > > > > > -- > > > 0-DAY CI Kernel Test Service > > > https://github.com/intel/lkp-tests/wiki > >
Alexei Starovoitov wrote: > On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote: > > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > > > > > > > LRU logic doesn't kick in until the map is full. > > > > > > > > > > In fact, it can: a reproducable example is in the self-test from this patch > > > > > series. In the test N threads try to insert random values for keys 1..3000 > > > > > simultaneously. As the result, the map may contain any number of elements, > > > > > typically 100 to 1000 (never full 3000, which is also less than the map size). > > > > > So a user can't really even closely estimate the number of elements in the LRU > > > > > map based on the number of updates (with unique keys). A per-cpu counter > > > > > inc/dec'ed from the kernel side would solve this. > > > > > > > > That's odd and unexpected. > > > > Definitely something to investigate and fix in the LRU map. > > > > > > > > Pls cc Martin in the future. > > > > > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place. > > > > > > > > > > This makes sense, yes, especially that LRU evictions may happen randomly, > > > > > without a map being full. I will step back with this patch until we investigate > > > > > if we can replace LRUs with hashes. > > > > > > > > > > Thanks for the comments! > > > > > > Thinking about it more... > > > since you're proposing to use percpu counter unconditionally for prealloc > > > and percpu_counter_add_batch() logic is batched, > > > it could actually be acceptable if it's paired with non-api access. > > > Like another patch can add generic kfunc to do __percpu_counter_sum() > > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c > > > for maps can be extended to print the element count, so the user can have > > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps. > > > > > > But additional logic of percpu_counter_add_batch() might get in the way > > > of debugging eventually. > > > If we want to have stats then we can have normal per-cpu u32 in basic > > > struct bpf_map that most maps, except array, will inc/dec on update/delete. > > > kfunc to iterate over percpu is still necessary. > > > This way we will be able to see not only number of elements, but detect > > > bad usage when one cpu is only adding and another cpu is deleting elements. > > > And other cpu misbalance. > > > > This looks for me like two different things: one is a kfunc to get the current > > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more > > detailed stats (e.g., per-cpu values or more). > > > > My patch, slightly modified, addresses the first goal: most maps of interest > > already have a counter in some form (sometimes just atomic_t or u64+lock). If > > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done: > > the new kfunc can get the counter based on the map type. > > > > If/when there's need to provide per-cpu statistics of elements or some more > > sophisticated statistics, this can be done without changing the api of the > > bpf_map_elements_count() kfunc. > > > > Would this work? > > No, because bpf_map_elements_count() as a building block is too big > and too specific. Nothing else can be made out of it, but counting > elements. 
> "for_each_cpu in per-cpu variable" would be generic that is usable beyond > this particular use case of stats collection. Without much thought, could you hook the eviction logic in LRU to know when the evict happens and even more details about what was evicted so we could debug the random case where we evict something in a conntrack table and then later it comes back to life and sends some data like a long living UDP session. For example in the cases where you build an LRU map because in 99% cases no evictions happen and the LRU is just there as a backstop you might even generate events to userspace to let it know evicts are in progress and they should do something about it. Thanks, John
On Fri, Jun 23, 2023 at 05:00:15PM -0700, John Fastabend wrote: > Alexei Starovoitov wrote: > > On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote: > > > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote: > > > > > > > > > > > > > > LRU logic doesn't kick in until the map is full. > > > > > > > > > > > > In fact, it can: a reproducable example is in the self-test from this patch > > > > > > series. In the test N threads try to insert random values for keys 1..3000 > > > > > > simultaneously. As the result, the map may contain any number of elements, > > > > > > typically 100 to 1000 (never full 3000, which is also less than the map size). > > > > > > So a user can't really even closely estimate the number of elements in the LRU > > > > > > map based on the number of updates (with unique keys). A per-cpu counter > > > > > > inc/dec'ed from the kernel side would solve this. > > > > > > > > > > That's odd and unexpected. > > > > > Definitely something to investigate and fix in the LRU map. > > > > > > > > > > Pls cc Martin in the future. > > > > > > > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place. > > > > > > > > > > > > This makes sense, yes, especially that LRU evictions may happen randomly, > > > > > > without a map being full. I will step back with this patch until we investigate > > > > > > if we can replace LRUs with hashes. > > > > > > > > > > > > Thanks for the comments! > > > > > > > > Thinking about it more... > > > > since you're proposing to use percpu counter unconditionally for prealloc > > > > and percpu_counter_add_batch() logic is batched, > > > > it could actually be acceptable if it's paired with non-api access. > > > > Like another patch can add generic kfunc to do __percpu_counter_sum() > > > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c > > > > for maps can be extended to print the element count, so the user can have > > > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps. > > > > > > > > But additional logic of percpu_counter_add_batch() might get in the way > > > > of debugging eventually. > > > > If we want to have stats then we can have normal per-cpu u32 in basic > > > > struct bpf_map that most maps, except array, will inc/dec on update/delete. > > > > kfunc to iterate over percpu is still necessary. > > > > This way we will be able to see not only number of elements, but detect > > > > bad usage when one cpu is only adding and another cpu is deleting elements. > > > > And other cpu misbalance. > > > > > > This looks for me like two different things: one is a kfunc to get the current > > > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more > > > detailed stats (e.g., per-cpu values or more). > > > > > > My patch, slightly modified, addresses the first goal: most maps of interest > > > already have a counter in some form (sometimes just atomic_t or u64+lock). If > > > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done: > > > the new kfunc can get the counter based on the map type. > > > > > > If/when there's need to provide per-cpu statistics of elements or some more > > > sophisticated statistics, this can be done without changing the api of the > > > bpf_map_elements_count() kfunc. 
> > > > > > Would this work? > > > > No, because bpf_map_elements_count() as a building block is too big > > and too specific. Nothing else can be made out of it, but counting > > elements. > > "for_each_cpu in per-cpu variable" would be generic that is usable beyond > > this particular use case of stats collection. > > Without much thought, could you hook the eviction logic in LRU to know > when the evict happens and even more details about what was evicted so > we could debug the random case where we evict something in a conntrack > table and then later it comes back to life and sends some data like a > long living UDP session. > > For example in the cases where you build an LRU map because in 99% > cases no evictions happen and the LRU is just there as a backstop > you might even generate events to userspace to let it know evicts > are in progress and they should do something about it. Yes, one can trace evictions, as the destructor function for lru_list is noinline. An example here: https://github.com/aspsk/bcc/tree/aspsk/lrusnoop One problem with evictions and LRU is that evictions can [currently] happen at random times even if map is nearly emptee, see a new map test from this series and also https://lore.kernel.org/bpf/ZHjhBFLLnUcSy9Tt@zh-lab-node-5/ So for LRU we really need to invest more time. For hashtabs and other maps the bpf_map_sum_elem_count is in any case quite useful. > Thanks, > John
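For reference, the kind of eviction tracing Anton points to (and the userspace notification John asks about) can be approximated with a kprobe on the LRU eviction callback. A rough libbpf-style sketch is below; it assumes htab_lru_map_delete_node() is visible in kallsyms on the running kernel (it is a static function, so this depends on the build) and that a BTF-generated vmlinux.h is available:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 16);
} evictions SEC(".maps");

struct evict_event {
	__u32 map_id;
};

/* Fires every time the LRU shrink path evicts an element from a hash map;
 * arg is the struct bpf_htab pointer passed to the LRU layer at map creation.
 */
SEC("kprobe/htab_lru_map_delete_node")
int BPF_KPROBE(on_evict, void *arg)
{
	struct bpf_htab *htab = arg;
	struct evict_event *e;

	e = bpf_ringbuf_reserve(&evictions, sizeof(*e), 0);
	if (!e)
		return 0;

	e->map_id = BPF_CORE_READ(htab, map.id);
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";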
Hi Alexei, hi Martin,

On Thu, Jun 01, 2023 at 11:24:20AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducible example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As the result, the map may contain any number of elements,
> > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > So a user can't really even closely estimate the number of elements in the LRU
> > map based on the number of updates (with unique keys). A per-cpu counter
> > inc/dec'ed from the kernel side would solve this.
>
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
>
> Pls cc Martin in the future.

I've looked into this a bit more, and the problem is as follows.

LRU maps allocate MAX_ENTRIES elements and put them in the global free list.
Each CPU then tries to grab memory in 128-element chunks into its own local
free list. The user expectation is that evictions start when the map is full;
in practice, however, we start evicting elements when capacity reaches about
(MAX_ENTRIES - NCPUS*128) elements. This happens because once one CPU has used
up its local free list, it goes to the global free list. While there could be
another (NCPUS-1)*128 free elements sitting in the local free lists of other
CPUs, our CPU finds the global free list empty and starts to evict elements
from the active/inactive lists (a 128-element chunk). The same can then happen
on another active CPU, and so on.

This is not much of a problem for big maps, where NCPUS*128 is a small
percentage of the total map capacity. For smaller maps it may be unexpected
(I first noticed this on a 4K map where, after updating 4K keys, the map held
only about 200-300 elements).

My first attempt to fix this was to just increase the nr_entries allocated for
the map by NCPUS*128, which makes evictions start at MAX_ENTRIES. But I soon
realized that this way users can get more than MAX_ENTRIES elements inside a
map, which is unexpected as well (say, when dumping a map into a buffer of
MAX_ENTRIES size, or syncing entries with another map of MAX_ENTRIES capacity).

I also briefly looked into allowing prealloc_lru_pop() to be called under a
bucket lock (by passing the currently locked bucket down to it, so that this
pointer reaches htab_lru_map_delete_node(), which can then skip locking the
bucket if it is the same one). This seems to work, but I didn't have time to
understand whether it breaks the LRU architecture badly or not.
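To put numbers on the analysis above (the 128-element chunk size is the one Anton cites): on, say, a 64-CPU machine the per-CPU local free lists can absorb up to 64 * 128 = 8192 elements, which already exceeds a 4096-entry map, so the global free list can be drained long before the map is anywhere near full and evictions may begin almost immediately. For a 1M-entry map the same 8192 elements are well under 1% of capacity, which matches the observation that large maps are barely affected.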
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f58895830ada..4d33fc6ed2ea 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -162,6 +162,7 @@ struct bpf_map_ops { void *callback_ctx, u64 flags); u64 (*map_mem_usage)(const struct bpf_map *map); + u32 (*map_pressure)(const struct bpf_map *map); /* BTF id of struct allocated by map_alloc */ int *map_btf_id; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9273c654743c..99580f2d006b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -6363,7 +6363,7 @@ struct bpf_map_info { __u32 btf_id; __u32 btf_key_type_id; __u32 btf_value_type_id; - __u32 :32; /* alignment pad */ + __u32 raw_pressure; __u64 map_extra; } __attribute__((aligned(8))); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 9901efee4339..331a923e29d5 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -133,6 +133,63 @@ static inline bool htab_is_prealloc(const struct bpf_htab *htab) return !(htab->map.map_flags & BPF_F_NO_PREALLOC); } +/* compute_batch_value() computes batch value as num_online_cpus() * 2 + * and __percpu_counter_compare() needs + * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus() + * for percpu_counter to be faster than atomic_t. In practice the average bpf + * hash map size is 10k, which means that a system with 64 cpus will fill + * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore + * define our own batch count as 32 then 10k hash map can be filled up to 80%: + * 10k - 8k > 32 _batch_ * 64 _cpus_ + * and __percpu_counter_compare() will still be fast. At that point hash map + * collisions will dominate its performance anyway. Assume that hash map filled + * to 50+% isn't going to be O(1) and use the following formula to choose + * between percpu_counter and atomic_t. 
+ * + * For preallocated maps we only increase/decrease counters on adding/removing + * an element to be later fetched by htab_map_pressure, so we always enable the + * per-cpu version in favor of atomic + */ +#define PERCPU_COUNTER_BATCH 32 +static bool htab_use_percpu_counter(union bpf_attr *attr) +{ + return (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH || + !(attr->map_flags & BPF_F_NO_PREALLOC)); +} + +static bool is_map_full(struct bpf_htab *htab) +{ + if (htab->use_percpu_counter) + return __percpu_counter_compare(&htab->pcount, htab->map.max_entries, + PERCPU_COUNTER_BATCH) >= 0; + return atomic_read(&htab->count) >= htab->map.max_entries; +} + +static void inc_elem_count(struct bpf_htab *htab) +{ + if (htab->use_percpu_counter) + percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH); + else + atomic_inc(&htab->count); +} + +static void dec_elem_count(struct bpf_htab *htab) +{ + if (htab->use_percpu_counter) + percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH); + else + atomic_dec(&htab->count); +} + +static u32 htab_map_pressure(const struct bpf_map *map) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + + if (htab->use_percpu_counter) + return __percpu_counter_sum(&htab->pcount); + return atomic_read(&htab->count); +} + static void htab_init_buckets(struct bpf_htab *htab) { unsigned int i; @@ -539,23 +596,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) htab_init_buckets(htab); -/* compute_batch_value() computes batch value as num_online_cpus() * 2 - * and __percpu_counter_compare() needs - * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus() - * for percpu_counter to be faster than atomic_t. In practice the average bpf - * hash map size is 10k, which means that a system with 64 cpus will fill - * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore - * define our own batch count as 32 then 10k hash map can be filled up to 80%: - * 10k - 8k > 32 _batch_ * 64 _cpus_ - * and __percpu_counter_compare() will still be fast. At that point hash map - * collisions will dominate its performance anyway. Assume that hash map filled - * to 50+% isn't going to be O(1) and use the following formula to choose - * between percpu_counter and atomic_t. 
- */ -#define PERCPU_COUNTER_BATCH 32 - if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH) - htab->use_percpu_counter = true; - + htab->use_percpu_counter = htab_use_percpu_counter(attr); if (htab->use_percpu_counter) { err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL); if (err) @@ -810,6 +851,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node) if (l == tgt_l) { hlist_nulls_del_rcu(&l->hash_node); check_and_free_fields(htab, l); + dec_elem_count(htab); break; } @@ -896,40 +938,16 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l) } } -static bool is_map_full(struct bpf_htab *htab) -{ - if (htab->use_percpu_counter) - return __percpu_counter_compare(&htab->pcount, htab->map.max_entries, - PERCPU_COUNTER_BATCH) >= 0; - return atomic_read(&htab->count) >= htab->map.max_entries; -} - -static void inc_elem_count(struct bpf_htab *htab) -{ - if (htab->use_percpu_counter) - percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH); - else - atomic_inc(&htab->count); -} - -static void dec_elem_count(struct bpf_htab *htab) -{ - if (htab->use_percpu_counter) - percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH); - else - atomic_dec(&htab->count); -} - - static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) { htab_put_fd_value(htab, l); + dec_elem_count(htab); + if (htab_is_prealloc(htab)) { check_and_free_fields(htab, l); __pcpu_freelist_push(&htab->freelist, &l->fnode); } else { - dec_elem_count(htab); htab_elem_free(htab, l); } } @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, if (!l) return ERR_PTR(-E2BIG); l_new = container_of(l, struct htab_elem, fnode); + inc_elem_count(htab); } } else { if (is_map_full(htab)) @@ -1227,9 +1246,11 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value * concurrent search will find it before old elem */ hlist_nulls_add_head_rcu(&l_new->hash_node, head); + inc_elem_count(htab); if (l_old) { bpf_lru_node_set_ref(&l_new->lru_node); hlist_nulls_del_rcu(&l_old->hash_node); + dec_elem_count(htab); } ret = 0; @@ -1357,6 +1378,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size), value, onallcpus); hlist_nulls_add_head_rcu(&l_new->hash_node, head); + inc_elem_count(htab); l_new = NULL; } ret = 0; @@ -1443,9 +1465,10 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key) l = lookup_elem_raw(head, hash, key, key_size); - if (l) + if (l) { + dec_elem_count(htab); hlist_nulls_del_rcu(&l->hash_node); - else + } else ret = -ENOENT; htab_unlock_bucket(htab, b, hash, flags); @@ -2249,6 +2272,7 @@ const struct bpf_map_ops htab_map_ops = { .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, .map_mem_usage = htab_map_mem_usage, + .map_pressure = htab_map_pressure, BATCH_OPS(htab), .map_btf_id = &htab_map_btf_ids[0], .iter_seq_info = &iter_seq_info, @@ -2271,6 +2295,7 @@ const struct bpf_map_ops htab_lru_map_ops = { .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, .map_mem_usage = htab_map_mem_usage, + .map_pressure = htab_map_pressure, BATCH_OPS(htab_lru), .map_btf_id = &htab_map_btf_ids[0], .iter_seq_info = &iter_seq_info, @@ -2423,6 +2448,7 @@ const struct bpf_map_ops htab_percpu_map_ops = { .map_set_for_each_callback_args = map_set_for_each_callback_args, 
.map_for_each_callback = bpf_for_each_hash_elem, .map_mem_usage = htab_map_mem_usage, + .map_pressure = htab_map_pressure, BATCH_OPS(htab_percpu), .map_btf_id = &htab_map_btf_ids[0], .iter_seq_info = &iter_seq_info, @@ -2443,6 +2469,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = { .map_set_for_each_callback_args = map_set_for_each_callback_args, .map_for_each_callback = bpf_for_each_hash_elem, .map_mem_usage = htab_map_mem_usage, + .map_pressure = htab_map_pressure, BATCH_OPS(htab_lru_percpu), .map_btf_id = &htab_map_btf_ids[0], .iter_seq_info = &iter_seq_info, @@ -2581,6 +2608,7 @@ const struct bpf_map_ops htab_of_maps_map_ops = { .map_gen_lookup = htab_of_map_gen_lookup, .map_check_btf = map_check_no_btf, .map_mem_usage = htab_map_mem_usage, + .map_pressure = htab_map_pressure, BATCH_OPS(htab), .map_btf_id = &htab_map_btf_ids[0], }; diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index e0d3ddf2037a..24ff5feb07ca 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -730,6 +730,13 @@ static u64 trie_mem_usage(const struct bpf_map *map) return elem_size * READ_ONCE(trie->n_entries); } +static u32 trie_map_pressure(const struct bpf_map *map) +{ + struct lpm_trie *trie = container_of(map, struct lpm_trie, map); + + return READ_ONCE(trie->n_entries); +} + BTF_ID_LIST_SINGLE(trie_map_btf_ids, struct, lpm_trie) const struct bpf_map_ops trie_map_ops = { .map_meta_equal = bpf_map_meta_equal, @@ -744,5 +751,6 @@ const struct bpf_map_ops trie_map_ops = { .map_delete_batch = generic_map_delete_batch, .map_check_btf = trie_check_btf, .map_mem_usage = trie_mem_usage, + .map_pressure = trie_map_pressure, .map_btf_id = &trie_map_btf_ids[0], }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 92a57efc77de..6ea30a24f057 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -794,6 +794,13 @@ static fmode_t map_get_sys_perms(struct bpf_map *map, struct fd f) return mode; } +static u32 bpf_map_pressure(const struct bpf_map *map) +{ + if (map->ops->map_pressure) + return map->ops->map_pressure(map); + return 0; +} + #ifdef CONFIG_PROC_FS /* Show the memory usage of a bpf map */ static u64 bpf_map_memory_usage(const struct bpf_map *map) @@ -804,6 +811,7 @@ static u64 bpf_map_memory_usage(const struct bpf_map *map) static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) { struct bpf_map *map = filp->private_data; + char map_pressure_buf[36] = ""; u32 type = 0, jited = 0; if (map_type_contains_progs(map)) { @@ -813,6 +821,10 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) spin_unlock(&map->owner.lock); } + if (map->ops->map_pressure) + snprintf(map_pressure_buf, sizeof(map_pressure_buf), + "raw_pressure:\t%u\n", map->ops->map_pressure(map)); + seq_printf(m, "map_type:\t%u\n" "key_size:\t%u\n" @@ -821,6 +833,7 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) "map_flags:\t%#x\n" "map_extra:\t%#llx\n" "memlock:\t%llu\n" + "%s" "map_id:\t%u\n" "frozen:\t%u\n", map->map_type, @@ -830,6 +843,7 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) map->map_flags, (unsigned long long)map->map_extra, bpf_map_memory_usage(map), + map_pressure_buf, map->id, READ_ONCE(map->frozen)); if (type) { @@ -4275,6 +4289,7 @@ static int bpf_map_get_info_by_fd(struct file *file, info.value_size = map->value_size; info.max_entries = map->max_entries; info.map_flags = map->map_flags; + info.raw_pressure = bpf_map_pressure(map); info.map_extra = map->map_extra; memcpy(info.name, map->name, 
sizeof(map->name)); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9273c654743c..99580f2d006b 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -6363,7 +6363,7 @@ struct bpf_map_info { __u32 btf_id; __u32 btf_key_type_id; __u32 btf_value_type_id; - __u32 :32; /* alignment pad */ + __u32 raw_pressure; __u64 map_extra; } __attribute__((aligned(8)));
Add a new map ops named map_pressure to return a map's "raw pressure". This
value is to be defined per map type, but in most cases this would be just the
number of elements currently present in the map: the more it grows, the higher
the pressure is. (For array-based maps it seems right to set it to 0, i.e.,
"there's no pressure".)

By analogy with the 'max_entries' field, the pressure value is a u32 integer.
The primary API to read it from userspace is via the map proc entry, where it
is reported in the "raw_pressure" field, e.g., for an example map we have

  # echo -e `strace -e bpf,openat,read -s 1024 bpftool map show id 202 2>&1 | grep -A1 '/proc/self/fdinfo' | head -2 | tail -1 | cut -f2 -d' '`
  "pos:	0
  flags:	02000002
  mnt_id:	15
  ino:	18958
  map_type:	1
  key_size:	8
  value_size:	8
  max_entries:	1224
  map_flags:	0x1
  map_extra:	0x0
  memlock:	69632
  raw_pressure:	500
  map_id:	202
  frozen:	0
  ",

For old kernels and when the ->map_pressure map operation is not defined, the
'raw_pressure' field is absent from the list.

The second way to get the raw_pressure is via BPF_OBJ_GET_INFO_BY_FD, where a
previously unused field in the struct bpf_map_info is now used to return this
value.

The patch adds a relatively small amount of logic due to the fact that for
most maps the number of elements was already computed to provide the map
memory usage API, just not exported.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |   2 +-
 kernel/bpf/hashtab.c           | 118 ++++++++++++++++++++-------------
 kernel/bpf/lpm_trie.c          |   8 +++
 kernel/bpf/syscall.c           |  15 +++++
 tools/include/uapi/linux/bpf.h |   2 +-
 6 files changed, 99 insertions(+), 47 deletions(-)
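For completeness, a minimal userspace sketch of the second interface could look like this; it assumes a kernel and uapi header carrying this patch (raw_pressure replaces an unnamed alignment pad in struct bpf_map_info), and the map id 202 is just the example from above:

#include <stdio.h>
#include <bpf/bpf.h>
#include <linux/bpf.h>

int main(void)
{
	struct bpf_map_info info = {};
	__u32 len = sizeof(info);
	int fd, err;

	fd = bpf_map_get_fd_by_id(202);		/* example map id from above */
	if (fd < 0) {
		perror("bpf_map_get_fd_by_id");
		return 1;
	}

	err = bpf_obj_get_info_by_fd(fd, &info, &len);
	if (err) {
		perror("bpf_obj_get_info_by_fd");
		return 1;
	}

	/* raw_pressure only exists with this patch applied; on older headers
	 * the same bytes are an unnamed alignment pad.
	 */
	printf("%s: max_entries=%u raw_pressure=%u\n",
	       info.name, info.max_entries, info.raw_pressure);
	return 0;
}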