Message ID | 20241002180956.1781008-3-namhyung@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | bpf: Add kmem_cache iterator and kfunc |
On Wed, Oct 02, 2024 at 11:09:55AM -0700, Namhyung Kim wrote:
> The bpf_get_kmem_cache() is to get a slab cache information from a
> virtual address like virt_to_cache(). If the address is a pointer
> to a slab object, it'd return a valid kmem_cache pointer, otherwise
> NULL is returned.
>
> It doesn't grab a reference count of the kmem_cache so the caller is
> responsible to manage the access. The intended use case for now is to
> symbolize locks in slab objects from the lock contention tracepoints.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*)
> Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  kernel/bpf/helpers.c |  1 +
>  mm/slab_common.c     | 19 +++++++++++++++++++
>  2 files changed, 20 insertions(+)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 4053f279ed4cc7ab..3709fb14288105c6 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -3090,6 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW)
>  BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY)
>  BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL)
>  BTF_KFUNCS_END(common_btf_ids)
>
>  static const struct btf_kfunc_id_set common_kfunc_set = {
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 7443244656150325..5484e1cd812f698e 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1322,6 +1322,25 @@ size_t ksize(const void *objp)
>  }
>  EXPORT_SYMBOL(ksize);
>
> +#ifdef CONFIG_BPF_SYSCALL
> +#include <linux/btf.h>
> +
> +__bpf_kfunc_start_defs();
> +
> +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> +{
> +	struct slab *slab;
> +
> +	if (!virt_addr_valid(addr))

Hmm.. 32-bit systems don't like this. Is it ok to change the type of
the parameter (addr) to 'unsigned long'? Or do you want to keep it as
u64 and add a cast here?

Thanks,
Namhyung

> +		return NULL;
> +
> +	slab = virt_to_slab((void *)(long)addr);
> +	return slab ? slab->slab_cache : NULL;
> +}
> +
> +__bpf_kfunc_end_defs();
> +#endif /* CONFIG_BPF_SYSCALL */
> +
>  /* Tracepoints definitions. */
>  EXPORT_TRACEPOINT_SYMBOL(kmalloc);
>  EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
> --
> 2.46.1.824.gd892dcdcdd-goog
>
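
To illustrate the width concern in the message above: on a 32-bit build a pointer is narrower than a u64, so the value has to be narrowed to the native word before it is used as an address. The following is a minimal userspace sketch, not kernel code; `addr_from_u64` is a hypothetical helper name, and `uintptr_t` stands in for the kernel's `unsigned long`.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): convert a 64-bit address
 * value into a native pointer, rejecting values that do not fit.
 */
static void *addr_from_u64(uint64_t addr)
{
	/*
	 * On a 32-bit build uintptr_t is 32 bits wide, so a direct cast
	 * would silently drop the upper half of addr. On a 64-bit build
	 * this check can never fail.
	 */
	if (addr > (uint64_t)UINTPTR_MAX)
		return NULL;

	/* Narrow to the native word first, then convert to a pointer. */
	return (void *)(uintptr_t)addr;
}
```

Whether the kfunc should take 'unsigned long' directly or keep u64 and cast, as asked above, is a design choice this sketch does not settle; it only shows the explicit narrowing step that either option implies.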
On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> The bpf_get_kmem_cache() is to get a slab cache information from a
> virtual address like virt_to_cache(). If the address is a pointer
> to a slab object, it'd return a valid kmem_cache pointer, otherwise
> NULL is returned.
>
> It doesn't grab a reference count of the kmem_cache so the caller is
> responsible to manage the access. The intended use case for now is to
> symbolize locks in slab objects from the lock contention tracepoints.
>
> [...]
>
> +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> +{
> +	struct slab *slab;
> +
> +	if (!virt_addr_valid(addr))
> +		return NULL;
> +
> +	slab = virt_to_slab((void *)(long)addr);
> +	return slab ? slab->slab_cache : NULL;
> +}

Do we need to hold a refcount to the slab_cache? Given
we make this kfunc available everywhere, including
sleepable contexts, I think it is necessary.

Thanks
Song

> [...]
On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote:
> On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > [...]
> > +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> > +{
> > +	struct slab *slab;
> > +
> > +	if (!virt_addr_valid(addr))
> > +		return NULL;
> > +
> > +	slab = virt_to_slab((void *)(long)addr);
> > +	return slab ? slab->slab_cache : NULL;
> > +}
>
> Do we need to hold a refcount to the slab_cache? Given
> we make this kfunc available everywhere, including
> sleepable contexts, I think it is necessary.

It's a really good question.

If the caller somehow owns the slab object, as in the example
provided in the series (current task), it's not necessary.

If a user can pass a random address, you're right, we need to
grab the slab_cache's refcnt. But then we also can't guarantee
that the object still belongs to the same slab_cache, so the
function becomes racy by definition.
On Fri, Oct 4, 2024 at 2:25 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote:
> > On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > [...]
> > > +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> > > +{
> > > +	struct slab *slab;
> > > +
> > > +	if (!virt_addr_valid(addr))
> > > +		return NULL;
> > > +
> > > +	slab = virt_to_slab((void *)(long)addr);
> > > +	return slab ? slab->slab_cache : NULL;
> > > +}
> >
> > Do we need to hold a refcount to the slab_cache? Given
> > we make this kfunc available everywhere, including
> > sleepable contexts, I think it is necessary.
>
> It's a really good question.
>
> If the caller somehow owns the slab object, as in the example
> provided in the series (current task), it's not necessary.
>
> If a user can pass a random address, you're right, we need to
> grab the slab_cache's refcnt. But then we also can't guarantee
> that the object still belongs to the same slab_cache, so the
> function becomes racy by definition.

To be safe, we can limit the kfunc to sleepable context only. Then
we can lock slab_mutex for virt_to_slab, and hold a refcount
to slab_cache. We will need a KF_RELEASE kfunc to release
the refcount later.

IIUC, this limitation (sleepable context only) shouldn't be a problem
for perf use case?

Thanks,
Song
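
The acquire/release pairing proposed above can be pictured with a toy, single-threaded userspace model. Everything here is hypothetical: `struct toy_cache`, `toy_cache_get()` and `toy_cache_put()` are illustrative names, not kernel or BPF APIs, and the sketch leaves out the locking (slab_mutex) that the proposal relies on.

```c
#include <stddef.h>

/* Toy stand-in for struct kmem_cache; only the refcount matters here. */
struct toy_cache {
	const char *name;
	int refcount;	/* 0 means the cache has already been torn down */
};

/* Acquire: take a reference if the cache is still alive, else NULL. */
static struct toy_cache *toy_cache_get(struct toy_cache *c)
{
	if (c == NULL || c->refcount <= 0)
		return NULL;
	c->refcount++;
	return c;
}

/* Release: drop a reference; returns 1 when the last one went away. */
static int toy_cache_put(struct toy_cache *c)
{
	c->refcount--;
	return c->refcount == 0;
}
```

In this model every successful `toy_cache_get()` must be matched by exactly one `toy_cache_put()`, which is the kind of pairing a KF_RELEASE kfunc would let the verifier enforce for BPF programs.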
On Fri, Oct 04, 2024 at 02:36:30PM -0700, Song Liu wrote:
> On Fri, Oct 4, 2024 at 2:25 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote:
> > > On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > > [...]
> > > > +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> > > > +{
> > > > +	struct slab *slab;
> > > > +
> > > > +	if (!virt_addr_valid(addr))
> > > > +		return NULL;
> > > > +
> > > > +	slab = virt_to_slab((void *)(long)addr);
> > > > +	return slab ? slab->slab_cache : NULL;
> > > > +}
> > >
> > > Do we need to hold a refcount to the slab_cache? Given
> > > we make this kfunc available everywhere, including
> > > sleepable contexts, I think it is necessary.
> >
> > It's a really good question.
> >
> > If the caller somehow owns the slab object, as in the example
> > provided in the series (current task), it's not necessary.
> >
> > If a user can pass a random address, you're right, we need to
> > grab the slab_cache's refcnt. But then we also can't guarantee
> > that the object still belongs to the same slab_cache, so the
> > function becomes racy by definition.
>
> To be safe, we can limit the kfunc to sleepable context only. Then
> we can lock slab_mutex for virt_to_slab, and hold a refcount
> to slab_cache. We will need a KF_RELEASE kfunc to release
> the refcount later.

Then it needs to call kmem_cache_destroy() for release which contains
rcu_barrier. :(

>
> IIUC, this limitation (sleepable context only) shouldn't be a problem
> for perf use case?

No, it would be called from the lock contention path including
spinlocks. :(

Can we limit it to non-sleepable ctx and not to pass arbitrary address
somehow (or not to save the result pointer)?

Thanks,
Namhyung
On Fri, Oct 4, 2024 at 2:58 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Fri, Oct 04, 2024 at 02:36:30PM -0700, Song Liu wrote:
> > On Fri, Oct 4, 2024 at 2:25 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote:
> > > > On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > [...]
> > > > > +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> > > > > +{
> > > > > +	struct slab *slab;
> > > > > +
> > > > > +	if (!virt_addr_valid(addr))
> > > > > +		return NULL;
> > > > > +
> > > > > +	slab = virt_to_slab((void *)(long)addr);
> > > > > +	return slab ? slab->slab_cache : NULL;
> > > > > +}
> > > >
> > > > Do we need to hold a refcount to the slab_cache? Given
> > > > we make this kfunc available everywhere, including
> > > > sleepable contexts, I think it is necessary.
> > >
> > > It's a really good question.
> > >
> > > If the caller somehow owns the slab object, as in the example
> > > provided in the series (current task), it's not necessary.
> > >
> > > If a user can pass a random address, you're right, we need to
> > > grab the slab_cache's refcnt. But then we also can't guarantee
> > > that the object still belongs to the same slab_cache, so the
> > > function becomes racy by definition.
> >
> > To be safe, we can limit the kfunc to sleepable context only. Then
> > we can lock slab_mutex for virt_to_slab, and hold a refcount
> > to slab_cache. We will need a KF_RELEASE kfunc to release
> > the refcount later.
>
> Then it needs to call kmem_cache_destroy() for release which contains
> rcu_barrier. :(
>
> >
> > IIUC, this limitation (sleepable context only) shouldn't be a problem
> > for perf use case?
>
> No, it would be called from the lock contention path including
> spinlocks. :(
>
> Can we limit it to non-sleepable ctx and not to pass arbitrary address
> somehow (or not to save the result pointer)?

I hacked something like the following. It is not ideal, because we are
taking spinlock_t pointer instead of void pointer. To use this with void
pointer, we will need some verifier changes.

Thanks,
Song


diff --git i/kernel/bpf/helpers.c w/kernel/bpf/helpers.c
index 3709fb142881..7311a26ecb01 100644
--- i/kernel/bpf/helpers.c
+++ w/kernel/bpf/helpers.c
@@ -3090,7 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW)
 BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY)
 BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE)
-BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL | KF_TRUSTED_ARGS | KF_RCU_PROTECTED)
 BTF_KFUNCS_END(common_btf_ids)

 static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git i/mm/slab_common.c w/mm/slab_common.c
index 5484e1cd812f..3e3e5f172f2e 100644
--- i/mm/slab_common.c
+++ w/mm/slab_common.c
@@ -1327,14 +1327,15 @@ EXPORT_SYMBOL(ksize);

 __bpf_kfunc_start_defs();

-__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
+__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(spinlock_t *addr)
 {
 	struct slab *slab;
+	unsigned long a = (unsigned long)addr;

-	if (!virt_addr_valid(addr))
+	if (!virt_addr_valid(a))
 		return NULL;

-	slab = virt_to_slab((void *)(long)addr);
+	slab = virt_to_slab(addr);
 	return slab ? slab->slab_cache : NULL;
 }

@@ -1346,4 +1347,3 @@ EXPORT_TRACEPOINT_SYMBOL(kmalloc);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
 EXPORT_TRACEPOINT_SYMBOL(kfree);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
-
diff --git i/tools/testing/selftests/bpf/progs/kmem_cache_iter.c w/tools/testing/selftests/bpf/progs/kmem_cache_iter.c
index 3f6ec15a1bf6..8238155a5055 100644
--- i/tools/testing/selftests/bpf/progs/kmem_cache_iter.c
+++ w/tools/testing/selftests/bpf/progs/kmem_cache_iter.c
@@ -16,7 +16,7 @@ struct {
 	__uint(max_entries, 1024);
 } slab_hash SEC(".maps");

-extern struct kmem_cache *bpf_get_kmem_cache(__u64 addr) __ksym;
+extern struct kmem_cache *bpf_get_kmem_cache(spinlock_t *addr) __ksym;

 /* result, will be checked by userspace */
 int found;
@@ -46,21 +46,23 @@ int slab_info_collector(struct bpf_iter__kmem_cache *ctx)
 SEC("raw_tp/bpf_test_finish")
 int BPF_PROG(check_task_struct)
 {
-	__u64 curr = bpf_get_current_task();
+	struct task_struct *curr = bpf_get_current_task_btf();
 	struct kmem_cache *s;
 	char *name;

-	s = bpf_get_kmem_cache(curr);
+	s = bpf_get_kmem_cache(&curr->alloc_lock);
 	if (s == NULL) {
 		found = -1;
 		return 0;
 	}

+	bpf_rcu_read_lock();
 	name = bpf_map_lookup_elem(&slab_hash, &s);
 	if (name && !bpf_strncmp(name, 11, "task_struct"))
 		found = 1;
 	else
 		found = -2;
+	bpf_rcu_read_unlock();

 	return 0;
 }
On Fri, Oct 04, 2024 at 03:57:26PM -0700, Song Liu wrote:
> On Fri, Oct 4, 2024 at 2:58 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Fri, Oct 04, 2024 at 02:36:30PM -0700, Song Liu wrote:
> > > On Fri, Oct 4, 2024 at 2:25 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote:
> > > > > On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > [...]
> > > > >
> > > > > Do we need to hold a refcount to the slab_cache? Given
> > > > > we make this kfunc available everywhere, including
> > > > > sleepable contexts, I think it is necessary.
> > > >
> > > > It's a really good question.
> > > >
> > > > If the caller somehow owns the slab object, as in the example
> > > > provided in the series (current task), it's not necessary.
> > > >
> > > > If a user can pass a random address, you're right, we need to
> > > > grab the slab_cache's refcnt. But then we also can't guarantee
> > > > that the object still belongs to the same slab_cache, so the
> > > > function becomes racy by definition.
> > >
> > > To be safe, we can limit the kfunc to sleepable context only. Then
> > > we can lock slab_mutex for virt_to_slab, and hold a refcount
> > > to slab_cache. We will need a KF_RELEASE kfunc to release
> > > the refcount later.
> >
> > Then it needs to call kmem_cache_destroy() for release which contains
> > rcu_barrier. :(
> >
> > >
> > > IIUC, this limitation (sleepable context only) shouldn't be a problem
> > > for perf use case?
> >
> > No, it would be called from the lock contention path including
> > spinlocks. :(
> >
> > Can we limit it to non-sleepable ctx and not to pass arbitrary address
> > somehow (or not to save the result pointer)?
>
> I hacked something like the following. It is not ideal, because we are
> taking spinlock_t pointer instead of void pointer. To use this with void
> pointer, we will need some verifier changes.

Thanks a lot for doing this!! I'll take a look at the verifier to see
what needs to be done.

Namhyung

> [...]
On Fri, Oct 4, 2024 at 3:57 PM Song Liu <song@kernel.org> wrote: > > On Fri, Oct 4, 2024 at 2:58 PM Namhyung Kim <namhyung@kernel.org> wrote: > > > > On Fri, Oct 04, 2024 at 02:36:30PM -0700, Song Liu wrote: > > > On Fri, Oct 4, 2024 at 2:25 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote: > > > > > On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote: > > > > > > > > > > > > The bpf_get_kmem_cache() is to get a slab cache information from a > > > > > > virtual address like virt_to_cache(). If the address is a pointer > > > > > > to a slab object, it'd return a valid kmem_cache pointer, otherwise > > > > > > NULL is returned. > > > > > > > > > > > > It doesn't grab a reference count of the kmem_cache so the caller is > > > > > > responsible to manage the access. The intended use case for now is to > > > > > > symbolize locks in slab objects from the lock contention tracepoints. > > > > > > > > > > > > Suggested-by: Vlastimil Babka <vbabka@suse.cz> > > > > > > Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) > > > > > > Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab > > > > > > Signed-off-by: Namhyung Kim <namhyung@kernel.org> > > > > > > --- > > > > > > kernel/bpf/helpers.c | 1 + > > > > > > mm/slab_common.c | 19 +++++++++++++++++++ > > > > > > 2 files changed, 20 insertions(+) > > > > > > > > > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > > > > > index 4053f279ed4cc7ab..3709fb14288105c6 100644 > > > > > > --- a/kernel/bpf/helpers.c > > > > > > +++ b/kernel/bpf/helpers.c > > > > > > @@ -3090,6 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW) > > > > > > BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL) > > > > > > BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY) > > > > > > BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE) > > > > > > +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL) 
> > > > > > BTF_KFUNCS_END(common_btf_ids) > > > > > > > > > > > > static const struct btf_kfunc_id_set common_kfunc_set = { > > > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c > > > > > > index 7443244656150325..5484e1cd812f698e 100644 > > > > > > --- a/mm/slab_common.c > > > > > > +++ b/mm/slab_common.c > > > > > > @@ -1322,6 +1322,25 @@ size_t ksize(const void *objp) > > > > > > } > > > > > > EXPORT_SYMBOL(ksize); > > > > > > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > > > +#include <linux/btf.h> > > > > > > + > > > > > > +__bpf_kfunc_start_defs(); > > > > > > + > > > > > > +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr) > > > > > > +{ > > > > > > + struct slab *slab; > > > > > > + > > > > > > + if (!virt_addr_valid(addr)) > > > > > > + return NULL; > > > > > > + > > > > > > + slab = virt_to_slab((void *)(long)addr); > > > > > > + return slab ? slab->slab_cache : NULL; > > > > > > +} > > > > > > > > > > Do we need to hold a refcount to the slab_cache? Given > > > > > we make this kfunc available everywhere, including > > > > > sleepable contexts, I think it is necessary. > > > > > > > > It's a really good question. > > > > > > > > If the callee somehow owns the slab object, as in the example > > > > provided in the series (current task), it's not necessarily. > > > > > > > > If a user can pass a random address, you're right, we need to > > > > grab the slab_cache's refcnt. But then we also can't guarantee > > > > that the object still belongs to the same slab_cache, the > > > > function becomes racy by the definition. > > > > > > To be safe, we can limit the kfunc to sleepable context only. Then > > > we can lock slab_mutex for virt_to_slab, and hold a refcount > > > to slab_cache. We will need a KF_RELEASE kfunc to release > > > the refcount later. > > > > Then it needs to call kmem_cache_destroy() for release which contains > > rcu_barrier. 
:(
> >
> > >
> > > IIUC, this limitation (sleepable context only) shouldn't be a problem
> > > for perf use case?
> >
> > No, it would be called from the lock contention path including
> > spinlocks. :(
> >
> > Can we limit it to non-sleepable ctx and not to pass arbitrary address
> > somehow (or not to save the result pointer)?
>
> I hacked something like the following. It is not ideal, because we are
> taking spinlock_t pointer instead of void pointer. To use this with void
> pointer, we will need some verifier changes.
>
> Thanks,
> Song
>
>
> diff --git i/kernel/bpf/helpers.c w/kernel/bpf/helpers.c
> index 3709fb142881..7311a26ecb01 100644
> --- i/kernel/bpf/helpers.c
> +++ w/kernel/bpf/helpers.c
> @@ -3090,7 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW)
>  BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY)
>  BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE)
> -BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL | KF_TRUSTED_ARGS | KF_RCU_PROTECTED)

I don't think the KF_TRUSTED_ARGS approach would fit here.
Namhyung's use case is tracing, so 'addr' will be some potentially
arbitrary address from somewhere. The chance of seeing a trusted pointer
is probably very low in such a tracing use case.

The verifier change can be mainly the following:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7d9b38ffd220..e09eb108e956 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12834,6 +12834,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	regs[BPF_REG_0].type = PTR_TO_BTF_ID;
 	regs[BPF_REG_0].btf_id = ptr_type_id;
 
+	if (meta.func_id == special_kfunc_list[KF_get_kmem_cache])
+		regs[BPF_REG_0].type |= PTR_UNTRUSTED;
+
 	if (is_iter_next_kfunc(&meta)) {
 		struct bpf_reg_state *cur_iter;

The returned 'struct kmem_cache *' won't be refcnt-ed (acquired).
It will be read-only via the ptr_to_btf_id logic. Accesses such as
  s->flags;
  s->size;
  s->offset;
will be allowed, but the verifier will sanitize them
with an inlined version of probe_read_kernel.
Even KF_RET_NULL can be dropped.
On Fri, Oct 4, 2024 at 4:44 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: [...] > > diff --git i/kernel/bpf/helpers.c w/kernel/bpf/helpers.c > > index 3709fb142881..7311a26ecb01 100644 > > --- i/kernel/bpf/helpers.c > > +++ w/kernel/bpf/helpers.c > > @@ -3090,7 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW) > > BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL) > > BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY) > > BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE) > > -BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL) > > +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL | KF_TRUSTED_ARGS > > | KF_RCU_PROTECTED) > > I don't think KF_TRUSTED_ARGS approach would fit here. > Namhyung's use case is tracing. The 'addr' will be some potentially > arbitrary address from somewhere. The chance to see a trusted pointer > is probably very low in such a tracing use case. I thought the primary use case was to trace lock contention, for example, queued_spin_lock_slowpath(). Of course, a more general solution is better. > > The verifier change can mainly be the following: > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 7d9b38ffd220..e09eb108e956 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -12834,6 +12834,9 @@ static int check_kfunc_call(struct > bpf_verifier_env *env, struct bpf_insn *insn, > regs[BPF_REG_0].type = PTR_TO_BTF_ID; > regs[BPF_REG_0].btf_id = ptr_type_id; > > + if (meta.func_id == > special_kfunc_list[KF_get_kmem_cache]) > + regs[BPF_REG_0].type |= PTR_UNTRUSTED; > + > if (is_iter_next_kfunc(&meta)) { > struct bpf_reg_state *cur_iter; This is easier than I thought. Thanks, Song > The returned 'struct kmem_cache *' won't be refcnt-ed (acquired). > It will be readonly via ptr_to_btf_id logic. > s->flags; > s->size; > s->offset; > access will be allowed but the verifier will sanitize them > with an inlined version of probe_read_kernel. 
> Even KF_RET_NULL can be dropped.
Hello, On Fri, Oct 04, 2024 at 04:56:57PM -0700, Song Liu wrote: > On Fri, Oct 4, 2024 at 4:44 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > [...] > > > diff --git i/kernel/bpf/helpers.c w/kernel/bpf/helpers.c > > > index 3709fb142881..7311a26ecb01 100644 > > > --- i/kernel/bpf/helpers.c > > > +++ w/kernel/bpf/helpers.c > > > @@ -3090,7 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW) > > > BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL) > > > BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY) > > > BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE) > > > -BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL) > > > +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL | KF_TRUSTED_ARGS > > > | KF_RCU_PROTECTED) > > > > I don't think KF_TRUSTED_ARGS approach would fit here. > > Namhyung's use case is tracing. The 'addr' will be some potentially > > arbitrary address from somewhere. The chance to see a trusted pointer > > is probably very low in such a tracing use case. > > I thought the primary use case was to trace lock contention, for > example, queued_spin_lock_slowpath(). Of course, a more > general solution is better. Right, my intended use case is the lock contention profiling so probably it's ok to limit it for trusted pointers if it helps. But as Song said, a general solution should be better. 
:) > > > > > The verifier change can mainly be the following: > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > index 7d9b38ffd220..e09eb108e956 100644 > > --- a/kernel/bpf/verifier.c > > +++ b/kernel/bpf/verifier.c > > @@ -12834,6 +12834,9 @@ static int check_kfunc_call(struct > > bpf_verifier_env *env, struct bpf_insn *insn, > > regs[BPF_REG_0].type = PTR_TO_BTF_ID; > > regs[BPF_REG_0].btf_id = ptr_type_id; > > > > + if (meta.func_id == special_kfunc_list[KF_get_kmem_cache]) > > + regs[BPF_REG_0].type |= PTR_UNTRUSTED; > > + > > if (is_iter_next_kfunc(&meta)) { > > struct bpf_reg_state *cur_iter; > > This is easier than I thought. Indeed! Thanks for providing the code. > > > The returned 'struct kmem_cache *' won't be refcnt-ed (acquired). > > It will be readonly via ptr_to_btf_id logic. > > s->flags; > > s->size; > > s->offset; > > access will be allowed but the verifier will sanitize them > > with an inlined version of probe_read_kernel. > > Even KF_RET_NULL can be dropped. Ok, I'll check this out. By having PTR_UNTRUSTED, are the callers still required to check NULL or is it handled by probe_read_kernel()? Thanks, Namhyung
On 10/4/24 11:25 PM, Roman Gushchin wrote: > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote: >> On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote: >>> >>> The bpf_get_kmem_cache() is to get a slab cache information from a >>> virtual address like virt_to_cache(). If the address is a pointer >>> to a slab object, it'd return a valid kmem_cache pointer, otherwise >>> NULL is returned. >>> >>> It doesn't grab a reference count of the kmem_cache so the caller is >>> responsible to manage the access. The intended use case for now is to >>> symbolize locks in slab objects from the lock contention tracepoints. >>> >>> Suggested-by: Vlastimil Babka <vbabka@suse.cz> >>> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab >>> Signed-off-by: Namhyung Kim <namhyung@kernel.org> So IIRC from our discussions with Namhyung and Arnaldo at LSF/MM I thought the perf use case was: - at the beginning it iterates the kmem caches and stores anything of possible interest in bpf maps or somewhere - hence we have the iterator - during profiling, from object it gets to a cache, but doesn't need to access the cache - just store the kmem_cache address in the perf record - after profiling itself, use the information in the maps from the first step together with cache pointers from the second step to calculate whatever is necessary So at no point it should be necessary to take refcount to a kmem_cache? But maybe "bpf_get_kmem_cache()" is implemented here as too generic given the above use case and it should be implemented in a way that the pointer it returns cannot be used to access anything (which could be unsafe), but only as a bpf map key - so it should return e.g. an unsigned long instead? 
>>> --- >>> kernel/bpf/helpers.c | 1 + >>> mm/slab_common.c | 19 +++++++++++++++++++ >>> 2 files changed, 20 insertions(+) >>> >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >>> index 4053f279ed4cc7ab..3709fb14288105c6 100644 >>> --- a/kernel/bpf/helpers.c >>> +++ b/kernel/bpf/helpers.c >>> @@ -3090,6 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW) >>> BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL) >>> BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY) >>> BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE) >>> +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL) >>> BTF_KFUNCS_END(common_btf_ids) >>> >>> static const struct btf_kfunc_id_set common_kfunc_set = { >>> diff --git a/mm/slab_common.c b/mm/slab_common.c >>> index 7443244656150325..5484e1cd812f698e 100644 >>> --- a/mm/slab_common.c >>> +++ b/mm/slab_common.c >>> @@ -1322,6 +1322,25 @@ size_t ksize(const void *objp) >>> } >>> EXPORT_SYMBOL(ksize); >>> >>> +#ifdef CONFIG_BPF_SYSCALL >>> +#include <linux/btf.h> >>> + >>> +__bpf_kfunc_start_defs(); >>> + >>> +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr) >>> +{ >>> + struct slab *slab; >>> + >>> + if (!virt_addr_valid(addr)) >>> + return NULL; >>> + >>> + slab = virt_to_slab((void *)(long)addr); >>> + return slab ? slab->slab_cache : NULL; >>> +} >> >> Do we need to hold a refcount to the slab_cache? Given >> we make this kfunc available everywhere, including >> sleepable contexts, I think it is necessary. > > It's a really good question. > > If the callee somehow owns the slab object, as in the example > provided in the series (current task), it's not necessarily. > > If a user can pass a random address, you're right, we need to > grab the slab_cache's refcnt. But then we also can't guarantee > that the object still belongs to the same slab_cache, the > function becomes racy by the definition.
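The three-step flow described above (fill a table from the iterator, record only the cache address at each event, resolve afterwards) can be illustrated with a small userspace sketch. This is not BPF code and not what perf does internally: the table is a hypothetical stand-in for a BPF hash map, and the names cache_table, table_add, record_event and resolve_name are made up for illustration. The point it shows is that the cache address is used purely as an opaque key and is never dereferenced, which is why no refcount is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the BPF map filled by the kmem_cache iterator:
 * the key is the cache address, the value is whatever was recorded up front. */
struct cache_info {
	uintptr_t key;        /* kmem_cache address, used only as an opaque key */
	char name[32];        /* cache name captured before profiling */
	unsigned long hits;   /* events attributed to this cache */
};

#define MAX_CACHES 16

struct cache_table {
	struct cache_info slot[MAX_CACHES];
	size_t used;
};

/* Linear lookup by key; a real BPF hash map would do this in the kernel. */
static struct cache_info *table_find(struct cache_table *t, uintptr_t key)
{
	size_t i;

	for (i = 0; i < t->used; i++)
		if (t->slot[i].key == key)
			return &t->slot[i];
	return NULL;
}

/* Step 1: record a cache before profiling starts. Returns 0 on success,
 * -1 if the table is full. */
static int table_add(struct cache_table *t, uintptr_t key, const char *name)
{
	struct cache_info *ci;

	if (t->used == MAX_CACHES)
		return -1;
	ci = &t->slot[t->used++];
	ci->key = key;
	strncpy(ci->name, name, sizeof(ci->name) - 1);
	ci->name[sizeof(ci->name) - 1] = '\0';
	ci->hits = 0;
	return 0;
}

/* Step 2: at an event, only the key is used; the cache is never dereferenced.
 * Keys that were not recorded in step 1 are silently ignored. */
static void record_event(struct cache_table *t, uintptr_t cache_key)
{
	struct cache_info *ci = table_find(t, cache_key);

	if (ci)
		ci->hits++;
}

/* Step 3: after profiling, resolve a recorded key back to a name. */
static const char *resolve_name(struct cache_table *t, uintptr_t cache_key)
{
	struct cache_info *ci = table_find(t, cache_key);

	return ci ? ci->name : "(unknown)";
}
```

A caller would fill the table once with table_add(), call record_event() from the hot path with the address obtained for each object, and call resolve_name() only when producing the report. A cache created after step 1 simply resolves to "(unknown)" in this sketch.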
On Mon, Oct 07, 2024 at 02:57:08PM +0200, Vlastimil Babka wrote: > On 10/4/24 11:25 PM, Roman Gushchin wrote: > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote: > >> On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote: > >>> > >>> The bpf_get_kmem_cache() is to get a slab cache information from a > >>> virtual address like virt_to_cache(). If the address is a pointer > >>> to a slab object, it'd return a valid kmem_cache pointer, otherwise > >>> NULL is returned. > >>> > >>> It doesn't grab a reference count of the kmem_cache so the caller is > >>> responsible to manage the access. The intended use case for now is to > >>> symbolize locks in slab objects from the lock contention tracepoints. > >>> > >>> Suggested-by: Vlastimil Babka <vbabka@suse.cz> > >>> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) > >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab > >>> Signed-off-by: Namhyung Kim <namhyung@kernel.org> > > > So IIRC from our discussions with Namhyung and Arnaldo at LSF/MM I > thought the perf use case was: > > - at the beginning it iterates the kmem caches and stores anything of > possible interest in bpf maps or somewhere - hence we have the iterator > - during profiling, from object it gets to a cache, but doesn't need to > access the cache - just store the kmem_cache address in the perf record > - after profiling itself, use the information in the maps from the first > step together with cache pointers from the second step to calculate > whatever is necessary Correct. > > So at no point it should be necessary to take refcount to a kmem_cache? > > But maybe "bpf_get_kmem_cache()" is implemented here as too generic > given the above use case and it should be implemented in a way that the > pointer it returns cannot be used to access anything (which could be > unsafe), but only as a bpf map key - so it should return e.g. an > unsigned long instead? Yep, this should work for my use case. 
Maybe we don't need the iterator if the bpf_get_kmem_cache() kfunc
returns a valid pointer, since we could get the necessary info at that
moment. But I think that would be less efficient, as more work would
need to be done at the event (lock contention). It'd be better to set up
the necessary info in a map before monitoring (using the iterator), and
just look up the map with the kfunc while monitoring the lock
contention.

Thanks,
Namhyung

>
> >>> ---
> >>> kernel/bpf/helpers.c | 1 +
> >>> mm/slab_common.c | 19 +++++++++++++++++++
> >>> 2 files changed, 20 insertions(+)
> >>>
> >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>> index 4053f279ed4cc7ab..3709fb14288105c6 100644
> >>> --- a/kernel/bpf/helpers.c
> >>> +++ b/kernel/bpf/helpers.c
> >>> @@ -3090,6 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW)
> >>>  BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL)
> >>>  BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY)
> >>>  BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE)
> >>> +BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL)
> >>>  BTF_KFUNCS_END(common_btf_ids)
> >>>
> >>>  static const struct btf_kfunc_id_set common_kfunc_set = {
> >>> diff --git a/mm/slab_common.c b/mm/slab_common.c
> >>> index 7443244656150325..5484e1cd812f698e 100644
> >>> --- a/mm/slab_common.c
> >>> +++ b/mm/slab_common.c
> >>> @@ -1322,6 +1322,25 @@ size_t ksize(const void *objp)
> >>>  }
> >>>  EXPORT_SYMBOL(ksize);
> >>>
> >>> +#ifdef CONFIG_BPF_SYSCALL
> >>> +#include <linux/btf.h>
> >>> +
> >>> +__bpf_kfunc_start_defs();
> >>> +
> >>> +__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
> >>> +{
> >>> +	struct slab *slab;
> >>> +
> >>> +	if (!virt_addr_valid(addr))
> >>> +		return NULL;
> >>> +
> >>> +	slab = virt_to_slab((void *)(long)addr);
> >>> +	return slab ? slab->slab_cache : NULL;
> >>> +}
> >>
> >> Do we need to hold a refcount to the slab_cache?
Given > >> we make this kfunc available everywhere, including > >> sleepable contexts, I think it is necessary. > > > > It's a really good question. > > > > If the callee somehow owns the slab object, as in the example > > provided in the series (current task), it's not necessarily. > > > > If a user can pass a random address, you're right, we need to > > grab the slab_cache's refcnt. But then we also can't guarantee > > that the object still belongs to the same slab_cache, the > > function becomes racy by the definition.
On Wed, Oct 09, 2024 at 12:17:12AM -0700, Namhyung Kim wrote: > On Mon, Oct 07, 2024 at 02:57:08PM +0200, Vlastimil Babka wrote: > > On 10/4/24 11:25 PM, Roman Gushchin wrote: > > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote: > > >> On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote: > > >>> > > >>> The bpf_get_kmem_cache() is to get a slab cache information from a > > >>> virtual address like virt_to_cache(). If the address is a pointer > > >>> to a slab object, it'd return a valid kmem_cache pointer, otherwise > > >>> NULL is returned. > > >>> > > >>> It doesn't grab a reference count of the kmem_cache so the caller is > > >>> responsible to manage the access. The intended use case for now is to > > >>> symbolize locks in slab objects from the lock contention tracepoints. > > >>> > > >>> Suggested-by: Vlastimil Babka <vbabka@suse.cz> > > >>> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) > > >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab > > >>> Signed-off-by: Namhyung Kim <namhyung@kernel.org> > > > > > > So IIRC from our discussions with Namhyung and Arnaldo at LSF/MM I > > thought the perf use case was: > > > > - at the beginning it iterates the kmem caches and stores anything of > > possible interest in bpf maps or somewhere - hence we have the iterator > > - during profiling, from object it gets to a cache, but doesn't need to > > access the cache - just store the kmem_cache address in the perf record > > - after profiling itself, use the information in the maps from the first > > step together with cache pointers from the second step to calculate > > whatever is necessary > > Correct. > > > > > So at no point it should be necessary to take refcount to a kmem_cache? 
> > > > But maybe "bpf_get_kmem_cache()" is implemented here as too generic > > given the above use case and it should be implemented in a way that the > > pointer it returns cannot be used to access anything (which could be > > unsafe), but only as a bpf map key - so it should return e.g. an > > unsigned long instead? > > Yep, this should work for my use case. Maybe we don't need the > iterator when bpf_get_kmem_cache() kfunc returns the valid pointer as > we can get the necessary info at the moment. But I think it'd be less > efficient as more work need to be done at the event (lock contention). > It'd better setting up necessary info in a map before monitoring (using > the iterator), and just looking up the map with the kfunc while > monitoring the lock contention. Maybe it's still better to return a non-refcounted pointer for future use. I'll leave it for v5. Thanks, Namhyung
On Thu, Oct 10, 2024 at 9:46 AM Namhyung Kim <namhyung@kernel.org> wrote: > > On Wed, Oct 09, 2024 at 12:17:12AM -0700, Namhyung Kim wrote: > > On Mon, Oct 07, 2024 at 02:57:08PM +0200, Vlastimil Babka wrote: > > > On 10/4/24 11:25 PM, Roman Gushchin wrote: > > > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote: > > > >> On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote: > > > >>> > > > >>> The bpf_get_kmem_cache() is to get a slab cache information from a > > > >>> virtual address like virt_to_cache(). If the address is a pointer > > > >>> to a slab object, it'd return a valid kmem_cache pointer, otherwise > > > >>> NULL is returned. > > > >>> > > > >>> It doesn't grab a reference count of the kmem_cache so the caller is > > > >>> responsible to manage the access. The intended use case for now is to > > > >>> symbolize locks in slab objects from the lock contention tracepoints. > > > >>> > > > >>> Suggested-by: Vlastimil Babka <vbabka@suse.cz> > > > >>> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) > > > >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab > > > >>> Signed-off-by: Namhyung Kim <namhyung@kernel.org> > > > > > > > > > So IIRC from our discussions with Namhyung and Arnaldo at LSF/MM I > > > thought the perf use case was: > > > > > > - at the beginning it iterates the kmem caches and stores anything of > > > possible interest in bpf maps or somewhere - hence we have the iterator > > > - during profiling, from object it gets to a cache, but doesn't need to > > > access the cache - just store the kmem_cache address in the perf record > > > - after profiling itself, use the information in the maps from the first > > > step together with cache pointers from the second step to calculate > > > whatever is necessary > > > > Correct. > > > > > > > > So at no point it should be necessary to take refcount to a kmem_cache? 
> > > > > > But maybe "bpf_get_kmem_cache()" is implemented here as too generic > > > given the above use case and it should be implemented in a way that the > > > pointer it returns cannot be used to access anything (which could be > > > unsafe), but only as a bpf map key - so it should return e.g. an > > > unsigned long instead? > > > > Yep, this should work for my use case. Maybe we don't need the > > iterator when bpf_get_kmem_cache() kfunc returns the valid pointer as > > we can get the necessary info at the moment. But I think it'd be less > > efficient as more work need to be done at the event (lock contention). > > It'd better setting up necessary info in a map before monitoring (using > > the iterator), and just looking up the map with the kfunc while > > monitoring the lock contention. > > Maybe it's still better to return a non-refcounted pointer for future > use. I'll leave it for v5. Pls keep it as: __bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr) just make sure it's PTR_UNTRUSTED. No need to make it return long or void *. The users can do: bpf_core_cast(any_value, struct kmem_cache); anyway, but it would be an unnecessary step.
On Thu, Oct 10, 2024 at 10:04:24AM -0700, Alexei Starovoitov wrote: > On Thu, Oct 10, 2024 at 9:46 AM Namhyung Kim <namhyung@kernel.org> wrote: > > > > On Wed, Oct 09, 2024 at 12:17:12AM -0700, Namhyung Kim wrote: > > > On Mon, Oct 07, 2024 at 02:57:08PM +0200, Vlastimil Babka wrote: > > > > On 10/4/24 11:25 PM, Roman Gushchin wrote: > > > > > On Fri, Oct 04, 2024 at 01:10:58PM -0700, Song Liu wrote: > > > > >> On Wed, Oct 2, 2024 at 11:10 AM Namhyung Kim <namhyung@kernel.org> wrote: > > > > >>> > > > > >>> The bpf_get_kmem_cache() is to get a slab cache information from a > > > > >>> virtual address like virt_to_cache(). If the address is a pointer > > > > >>> to a slab object, it'd return a valid kmem_cache pointer, otherwise > > > > >>> NULL is returned. > > > > >>> > > > > >>> It doesn't grab a reference count of the kmem_cache so the caller is > > > > >>> responsible to manage the access. The intended use case for now is to > > > > >>> symbolize locks in slab objects from the lock contention tracepoints. > > > > >>> > > > > >>> Suggested-by: Vlastimil Babka <vbabka@suse.cz> > > > > >>> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) > > > > >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab > > > > >>> Signed-off-by: Namhyung Kim <namhyung@kernel.org> > > > > > > > > > > > > So IIRC from our discussions with Namhyung and Arnaldo at LSF/MM I > > > > thought the perf use case was: > > > > > > > > - at the beginning it iterates the kmem caches and stores anything of > > > > possible interest in bpf maps or somewhere - hence we have the iterator > > > > - during profiling, from object it gets to a cache, but doesn't need to > > > > access the cache - just store the kmem_cache address in the perf record > > > > - after profiling itself, use the information in the maps from the first > > > > step together with cache pointers from the second step to calculate > > > > whatever is necessary > > > > > > Correct. 
> > > > > > > > > > > So at no point it should be necessary to take refcount to a kmem_cache? > > > > > > > > But maybe "bpf_get_kmem_cache()" is implemented here as too generic > > > > given the above use case and it should be implemented in a way that the > > > > pointer it returns cannot be used to access anything (which could be > > > > unsafe), but only as a bpf map key - so it should return e.g. an > > > > unsigned long instead? > > > > > > Yep, this should work for my use case. Maybe we don't need the > > > iterator when bpf_get_kmem_cache() kfunc returns the valid pointer as > > > we can get the necessary info at the moment. But I think it'd be less > > > efficient as more work need to be done at the event (lock contention). > > > It'd better setting up necessary info in a map before monitoring (using > > > the iterator), and just looking up the map with the kfunc while > > > monitoring the lock contention. > > > > Maybe it's still better to return a non-refcounted pointer for future > > use. I'll leave it for v5. > > Pls keep it as: > __bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr) > > just make sure it's PTR_UNTRUSTED. Sure, will do. > No need to make it return long or void *. > The users can do: > bpf_core_cast(any_value, struct kmem_cache); > anyway, but it would be an unnecessary step. Yeah I thought there would be a way to do that. Thanks, Namhyung
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4053f279ed4cc7ab..3709fb14288105c6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3090,6 +3090,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW)
 BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY)
 BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_get_kmem_cache, KF_RET_NULL)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7443244656150325..5484e1cd812f698e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1322,6 +1322,25 @@ size_t ksize(const void *objp)
 }
 EXPORT_SYMBOL(ksize);
 
+#ifdef CONFIG_BPF_SYSCALL
+#include <linux/btf.h>
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
+{
+	struct slab *slab;
+
+	if (!virt_addr_valid(addr))
+		return NULL;
+
+	slab = virt_to_slab((void *)(long)addr);
+	return slab ? slab->slab_cache : NULL;
+}
+
+__bpf_kfunc_end_defs();
+#endif /* CONFIG_BPF_SYSCALL */
+
 /* Tracepoints definitions. */
 EXPORT_TRACEPOINT_SYMBOL(kmalloc);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
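The kfunc takes a u64 while the slab helpers work on native pointers, and a reviewer earlier in this thread noted that 32-bit systems do not like the u64 being passed straight to virt_addr_valid(). As a rough userspace illustration only (this is not the kernel fix, and addr_to_ptr is a hypothetical name), a 64-bit address value can be narrowed explicitly to the native pointer width, rejecting values that would not fit:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a 64-bit address value to a native pointer.
 * Returns NULL when the value cannot be represented in a pointer on this
 * platform (only possible where pointers are narrower than 64 bits). */
static void *addr_to_ptr(uint64_t addr)
{
	if (addr > (uint64_t)UINTPTR_MAX)
		return NULL;
	/* Go through uintptr_t so the integer-to-pointer cast is width-exact. */
	return (void *)(uintptr_t)addr;
}
```

On a 64-bit build the range check never triggers and the helper reduces to a plain cast. On a 32-bit build it rejects addresses with any of the upper 32 bits set instead of silently truncating them.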