| Message ID | 20220811085938.2506536-1-imran.f.khan@oracle.com (mailing list archive) |
|---|---|
| State | New |
| Series | [v2] Introduce sysfs interface to disable kfence for selected slabs. |
On 8/11/22 10:59, Imran Khan wrote:
> By default kfence allocation can happen for any slab object, whose size
> is up to PAGE_SIZE, as long as that allocation is the first allocation
> after expiration of the kfence sample interval. But in certain debugging
> scenarios we may be interested in debugging corruptions involving
> some specific slub objects like dentry or ext4_* etc. In such cases
> limiting kfence to allocations involving only specific slub objects
> will increase the probability of catching the issue, since the kfence
> pool will not be consumed by other slab objects.

So you want to enable specific caches for kfence.

> This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence'
> to disable kfence for specific slabs. Having the interface work in this
> way does not impact the current/default behavior of kfence and allows us to
> use kfence for specific slabs (when needed) as well. The decision to
> skip/use kfence is taken depending on whether kmem_cache.flags has the
> (newly introduced) SLAB_SKIP_KFENCE flag set or not.

But this seems to leave everything enabled while you selectively disable.
Isn't that rather impractical?

How about making this cache flag instead denote that KFENCE is enabled (not
skipped), setting it by default only for caches with size <= 1024? Then you
could drop the size check in __kfence_alloc() and rely only on the flag. And
if you need, you could also enable a cache with size > 1024 with the sysfs
interface, to override the limit, which isn't possible now.
(I don't think changing the limit to always act on s->object_size instead of
e.g. the size passed to kmalloc() that it can pick up now will change
anything in practice.)

Then you could also have a kernel boot param that tells kfence to set the
flag on no cache at all, so you can easily enable just the specific caches
you want. Or make a parameter that lets you override the 1024 size limit
globally, where setting it to 0 means no cache is enabled for kfence?
> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> [...]
On Thu, 11 Aug 2022 at 11:31, <vbabka@suse.cz> wrote:
>
> On 8/11/22 10:59, Imran Khan wrote:
> > [...]
>
> So you want to enable specific caches for kfence.
>
> > [...]
>
> But this seems to leave everything enabled while you selectively disable.
> Isn't that rather impractical?

A script just iterates through all the caches that they don't want, and
sets skip_kfence? It doesn't look more complicated.

> How about making this cache flag instead denote that KFENCE is enabled (not
> skipped), setting it by default only for caches with size <= 1024? Then you

Where does 1024 come from? PAGE_SIZE?

The problem with that opt-in vs. opt-out is that it becomes more complex to
maintain opt-in (as the first RFC of this did). With the new flag
SLAB_SKIP_KFENCE, it can also serve a dual purpose, where someone might want
to explicitly opt out by default and pass it to kmem_cache_create() (for
whatever reason; not that we'd encourage that).

I feel that the real use cases for selectively enabling caches for KFENCE
are very narrow, and a design that introduces lots of complexity elsewhere
just to support this feature cannot be justified (which is why I suggested
the simpler design here back in
https://lore.kernel.org/lkml/CANpmjNNmD9z7oRqSaP72m90kWL7jYH+cxNAZEGpJP8oLrDV-vw@mail.gmail.com/ )

> could drop the size check in __kfence_alloc() and rely only on the flag. And
> if you need, you could also enable a cache with size > 1024 with the sysfs
> interface, to override the limit, which isn't possible now.
> (I don't think changing the limit to always act on s->object_size instead of
> e.g. the size passed to kmalloc() that it can pick up now will change
> anything in practice.)
>
> Then you could also have a kernel boot param that tells kfence to set the
> flag on no cache at all, so you can easily enable just the specific caches
> you want. Or make a parameter that lets you override the 1024 size limit
> globally, where setting it to 0 means no cache is enabled for kfence?
> > > Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> > > [...]
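The per-cache script Marco mentions ("just iterates through all the caches that they don't want, and sets skip_kfence") can be sketched as below. This is only an illustration: the cache names are examples, it assumes the skip_kfence attribute from this patch, and the base directory is a parameter purely so the sketch can be exercised outside a real /sys.

```shell
# kfence_limit_caches BASE "NAME1 NAME2 ...": set skip_kfence=1 for every
# cache under BASE except the listed ones, which get skip_kfence=0.
kfence_limit_caches() {
    base=$1
    keep=$2
    for dir in "$base"/*/; do
        name=$(basename "$dir")
        attr="${dir}skip_kfence"
        # Skip caches without a writable attribute (e.g. CONFIG_KFENCE=n).
        [ -w "$attr" ] || continue
        case " $keep " in
            *" $name "*) echo 0 > "$attr" ;;  # keep KFENCE sampling enabled
            *)           echo 1 > "$attr" ;;  # opt this cache out of KFENCE
        esac
    done
}

# On a live CONFIG_KFENCE kernel this would be, e.g.:
#   kfence_limit_caches /sys/kernel/slab "dentry ext4_inode_cache"
```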
On 8/11/22 11:52, Marco Elver wrote:
> On Thu, 11 Aug 2022 at 11:31, <vbabka@suse.cz> wrote:
>>
>> On 8/11/22 10:59, Imran Khan wrote:
>> > [...]
>>
>> But this seems to leave everything enabled while you selectively disable.
>> Isn't that rather impractical?
>
> A script just iterates through all the caches that they don't want,
> and sets skip_kfence? It doesn't look more complicated.

Well, yeah, it's possible.

>> How about making this cache flag instead denote that KFENCE is enabled (not
>> skipped), setting it by default only for caches with size <= 1024? Then you
>
> Where does 1024 come from? PAGE_SIZE?

You're right, the existing check in __kfence_alloc() uses PAGE_SIZE, not
1024, which probably came from lack of coffee :)

> The problem with that opt-in vs. opt-out is that it becomes more
> complex to maintain opt-in (as the first RFC of this did). With the

I see. There was a kfence_global_alloc_enabled and slub_kfence[=slabs] ...
that probably wouldn't be necessary even in an opt-in scenario as I
described.

> new flag SLAB_SKIP_KFENCE, it can also serve a dual purpose, where
> someone might want to explicitly opt out by default and pass it to
> kmem_cache_create() (for whatever reason; not that we'd encourage
> that).

Right, not being able to do that would be a downside (although it should be
possible even with opt-in to add an opt-out cache flag that would just make
sure the opt-in flag is not set even if eligible by global defaults).

> I feel that the real use cases for selectively enabling caches for
> KFENCE are very narrow, and a design that introduces lots of
> complexity elsewhere, just to support this feature, cannot be justified
> (which is why I suggested the simpler design here back in
> https://lore.kernel.org/lkml/CANpmjNNmD9z7oRqSaP72m90kWL7jYH+cxNAZEGpJP8oLrDV-vw@mail.gmail.com/
> )

I don't mind strongly either way, just a suggestion to consider.
On Thu, Aug 11, 2022 at 06:59:38PM +1000, Imran Khan wrote:
> By default kfence allocation can happen for any slab object, whose size
> is up to PAGE_SIZE, as long as that allocation is the first allocation
> after expiration of the kfence sample interval. But in certain debugging
> scenarios we may be interested in debugging corruptions involving
> some specific slub objects like dentry or ext4_* etc. In such cases
> limiting kfence to allocations involving only specific slub objects
> will increase the probability of catching the issue, since the kfence
> pool will not be consumed by other slab objects.
>
> This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence'
> to disable kfence for specific slabs. Having the interface work in this
> way does not impact the current/default behavior of kfence and allows us to
> use kfence for specific slabs (when needed) as well. The decision to
> skip/use kfence is taken depending on whether kmem_cache.flags has the
> (newly introduced) SLAB_SKIP_KFENCE flag set or not.
> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> [...]
>
> base-commit: 40d43a7507e1547dd45cb02af2e40d897c591870
> --
> 2.30.2

No strong opinion on its interface, but from a correctness point of view it
looks good to me.

Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
On Thu, 11 Aug 2022 at 12:07, <vbabka@suse.cz> wrote:
[...]
> > new flag SLAB_SKIP_KFENCE, it can also serve a dual purpose, where
> > someone might want to explicitly opt out by default and pass it to
> > kmem_cache_create() (for whatever reason; not that we'd encourage
> > that).
>
> Right, not being able to do that would be a downside (although it should be
> possible even with opt-in to add an opt-out cache flag that would just make
> sure the opt-in flag is not set even if eligible by global defaults).

True, but I'd avoid all this unnecessary complexity if possible.

> > I feel that the real use cases for selectively enabling caches for
> > KFENCE are very narrow, and a design that introduces lots of
> > complexity elsewhere, just to support this feature, cannot be justified
> > (which is why I suggested the simpler design here back in
> > https://lore.kernel.org/lkml/CANpmjNNmD9z7oRqSaP72m90kWL7jYH+cxNAZEGpJP8oLrDV-vw@mail.gmail.com/
> > )
>
> I don't mind strongly either way, just a suggestion to consider.

While switching the semantics of the flag from opt-out to opt-in is
just as valid, I'm more comfortable with the opt-out flag: the rest of
the logic can stay the same, and we're aware of the fact that changing
cache coverage by KFENCE shouldn't be something that needs to be done
manually.

My main point is that opting out of or in to only a few select caches
should be a rarely used feature, and accordingly it should be as
simple as possible. Honestly, I still don't quite see the point of it,
and my solution would be to just increase the KFENCE pool, increase the
sample rate, or decrease the "skip covered threshold%". But in the
case described by Imran, perhaps a running machine is having trouble,
and limiting the caches to be analyzed by KFENCE might be worthwhile
if a more aggressive configuration doesn't yield anything (and then
there's of course KASAN, but I recognize it's not always possible to
switch kernels and run the same workload with it).

The use case for the proposed change is definitely when an admin or
kernel dev is starting to debug a problem. KFENCE wasn't designed for
that (vs. deployment at scale, discovery of bugs). As such I'm having
a hard time seeing how useful this feature will really be, but
given the current implementation is simple, having it might actually
help a few people.

Imran, just to make sure my assumptions here are right, have you had
success debugging an issue in this way? Can you elaborate on what
"certain debugging scenarios" you mean (admin debugging something, or
a kernel dev, production fleet, or test machine)?

Thanks,
-- Marco
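For reference, the tuning Marco suggests above (bigger pool, higher sample rate, lower covered threshold) maps onto existing KFENCE configuration; the values below are illustrative, not recommendations:

```
kfence.sample_interval=10       # boot/module param, default 100 (ms);
                                # lower means more frequent sampling
kfence.skip_covered_thresh=50   # default 75 (%); a lower value makes KFENCE
                                # deprioritize already-covered allocation
                                # sites sooner
CONFIG_KFENCE_NUM_OBJECTS=255   # Kconfig; pool size is fixed at build time
```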
Hello Marco,

On 11/8/22 11:21 pm, Marco Elver wrote:
[...]
> Imran, just to make sure my assumptions here are right, have you had
> success debugging an issue in this way? Can you elaborate on what
> "certain debugging scenarios" you mean (admin debugging something, or
> a kernel dev, production fleet, or test machine)?

I have not used kfence in this way, because as of now we don't have such
newer kernels in the production fleet, but I can cite a couple of instances
where using slub_debug for a few selected slabs helped me locate an issue on
a production system where KASAN or even full slub_debug were not feasible.
Apologies in advance if I am elaborating more than you asked for :).

In one case a freed struct mutex was being used later on, and by that time
the same address had been given to a kmalloc-32 object. The issue was
appearing more frequently if one enforced some cgroup memory limitation,
resulting in a forked task exiting prematurely. From the vmcore we could see
that the mutex, or more specifically task_struct.futex_exit_mutex, was in
bad shape, and eventually using slub_debug for kmalloc-32 pointed to the
issue.

Another case involved a mem_cgroup corruption which was causing a system
crash but was giving list corruption warnings beforehand. Since the list
corruption warnings were coming from the cgroup subsystem, the corresponding
objects were in doubt. Enabling slub_debug for kmalloc-4k helped in locating
the actual corruption.

Admittedly both of the above issues were the result of backporting mistakes,
but nonetheless they happened on production systems where very few debugging
options were available.

By "certain debugging scenarios" I meant such cases where some initial data
(from the production fleet) like a vmcore or kernel debug messages can give
some pointer towards which slab objects could be wrong, and then we would
use this feature (along with further tuning like increasing the sampling
frequency and pool size, if needed/possible) to pinpoint the actual issue.
The idea is that limiting KFENCE to a few slabs will increase the
probability of catching the issue even if we are not able to tweak the pool
size.

Please let me know if it sounds reasonable or if I missed something from
your query.

Thanks,
-- Imran
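The per-cache slub_debug setup described in these two cases uses the documented slub_debug= boot parameter, which takes a flag string plus a comma-separated cache list. The flag combination below (sanity checks, red zoning, poisoning, user tracking) is a common choice, not necessarily the one used in the incidents above:

```
slub_debug=FZPU,kmalloc-32   # debug only kmalloc-32 (first case)
slub_debug=FZPU,kmalloc-4k   # debug only kmalloc-4k (second case)
```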
On Thu, 11 Aug 2022 at 17:10, Imran Khan <imran.f.khan@oracle.com> wrote:
>
> Hello Marco,
>
> On 11/8/22 11:21 pm, Marco Elver wrote:
> [...]
>
> I have not used kfence in this way, because as of now we don't have such
> newer kernels in the production fleet, but I can cite a couple of instances
> where using slub_debug for a few selected slabs helped me locate an issue
> on a production system where KASAN or even full slub_debug were not
> feasible. Apologies in advance if I am elaborating more than you asked
> for :).

This is very useful to understand the use case.

> [...]
>
> Please let me know if it sounds reasonable or if I missed something from
> your query.

Thanks for the elaboration on the use cases - agreed that in a few scenarios
this feature can help increase the probability of debugging an issue.

Reviewed-by: Marco Elver <elver@google.com>

With minor suggestions:

> +SLAB_ATTR(skip_kfence);
> +

^ Unnecessary blank line between SLAB_ATTR() and #endif.

> +#endif
> +

And the patch title should be something like "kfence: add sysfs interface
to disable kfence for selected slabs" (to follow the format "<subsys>:
<thing changed>").
On 8/11/22 10:59, Imran Khan wrote:
> By default kfence allocation can happen for any slab object, whose size
> is up to PAGE_SIZE, as long as that allocation is the first allocation
> after expiration of the kfence sample interval. But in certain debugging
> scenarios we may be interested in debugging corruptions involving
> some specific slub objects like dentry or ext4_* etc. In such cases
> limiting kfence to allocations involving only specific slub objects
> will increase the probability of catching the issue, since the kfence
> pool will not be consumed by other slab objects.
>
> This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence'
> to disable kfence for specific slabs. Having the interface work in this
> way does not impact the current/default behavior of kfence and allows us to
> use kfence for specific slabs (when needed) as well. The decision to
> skip/use kfence is taken depending on whether kmem_cache.flags has the
> (newly introduced) SLAB_SKIP_KFENCE flag set or not.
>
> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> ---
>
> Changes since v1:
> - Remove RFC tag
>
>  include/linux/slab.h | 6 ++++++
>  mm/kfence/core.c | 7 +++++++
>  mm/slub.c | 27 +++++++++++++++++++++++++++
>  3 files changed, 40 insertions(+)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 0fefdf528e0d..947d912fd08c 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -119,6 +119,12 @@
>   */
>  #define SLAB_NO_USER_FLAGS	((slab_flags_t __force)0x10000000U)
>
> +#ifdef CONFIG_KFENCE
> +#define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
> +#else
> +#define SLAB_SKIP_KFENCE 0
> +#endif

The whitespace here (spaces) differs from the other flags above (tabs).
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..947d912fd08c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -119,6 +119,12 @@
  */
 #define SLAB_NO_USER_FLAGS	((slab_flags_t __force)0x10000000U)
 
+#ifdef CONFIG_KFENCE
+#define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_SKIP_KFENCE 0
+#endif
+
 /* The following flags affect the page allocator grouping pages by mobility */
 /* Objects are reclaimable */
 #define SLAB_RECLAIM_ACCOUNT	((slab_flags_t __force)0x00020000U)
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index c252081b11df..8c08ae2101d7 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -1003,6 +1003,13 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
 		return NULL;
 	}
 
+	/*
+	 * Skip allocations for this slab, if KFENCE has been disabled for
+	 * this slab.
+	 */
+	if (s->flags & SLAB_SKIP_KFENCE)
+		return NULL;
+
 	if (atomic_inc_return(&kfence_allocation_gate) > 1)
 		return NULL;
 #ifdef CONFIG_KFENCE_STATIC_KEYS
diff --git a/mm/slub.c b/mm/slub.c
index 862dbd9af4f5..ee8b48327536 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5745,6 +5745,30 @@ STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
 #endif /* CONFIG_SLUB_STATS */
 
+#ifdef CONFIG_KFENCE
+static ssize_t skip_kfence_show(struct kmem_cache *s, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_SKIP_KFENCE));
+}
+
+static ssize_t skip_kfence_store(struct kmem_cache *s,
+				 const char *buf, size_t length)
+{
+	int ret = length;
+
+	if (buf[0] == '0')
+		s->flags &= ~SLAB_SKIP_KFENCE;
+	else if (buf[0] == '1')
+		s->flags |= SLAB_SKIP_KFENCE;
+	else
+		ret = -EINVAL;
+
+	return ret;
+}
+SLAB_ATTR(skip_kfence);
+
+#endif
+
 static struct attribute *slab_attrs[] = {
 	&slab_size_attr.attr,
 	&object_size_attr.attr,
@@ -5812,6 +5836,9 @@ static struct attribute *slab_attrs[] = {
 	&failslab_attr.attr,
 #endif
 	&usersize_attr.attr,
+#ifdef CONFIG_KFENCE
+	&skip_kfence_attr.attr,
+#endif
 
 	NULL
 };
By default kfence allocation can happen for any slab object, whose size
is up to PAGE_SIZE, as long as that allocation is the first allocation
after expiration of the kfence sample interval. But in certain debugging
scenarios we may be interested in debugging corruptions involving
some specific slub objects like dentry or ext4_* etc. In such cases
limiting kfence to allocations involving only specific slub objects
will increase the probability of catching the issue, since the kfence
pool will not be consumed by other slab objects.

This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence'
to disable kfence for specific slabs. Having the interface work in this
way does not impact the current/default behavior of kfence and allows us to
use kfence for specific slabs (when needed) as well. The decision to
skip/use kfence is taken depending on whether kmem_cache.flags has the
(newly introduced) SLAB_SKIP_KFENCE flag set or not.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---

Changes since v1:
- Remove RFC tag

 include/linux/slab.h | 6 ++++++
 mm/kfence/core.c     | 7 +++++++
 mm/slub.c            | 27 +++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

base-commit: 40d43a7507e1547dd45cb02af2e40d897c591870
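As a usage sketch of the interface proposed above: toggling is a plain sysfs write, and per the store handler only '0' and '1' are accepted (anything else returns -EINVAL). The helper below is illustrative only; the base-directory argument exists so the sketch can be tried outside a real /sys, and the cache name is an example.

```shell
# set_skip_kfence BASE CACHE VAL: write VAL (1 = skip KFENCE for this
# cache, 0 = allow KFENCE sampling) to BASE/CACHE/skip_kfence.
set_skip_kfence() {
    echo "$3" > "$1/$2/skip_kfence"
}

# On a live CONFIG_KFENCE kernel this would be, e.g.:
#   set_skip_kfence /sys/kernel/slab kmalloc-64 1   # opt out
#   set_skip_kfence /sys/kernel/slab kmalloc-64 0   # opt back in
#   cat /sys/kernel/slab/kmalloc-64/skip_kfence     # query current state
```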