mm: slub: Panic if the object corruption is checked.

Message ID	20250120082908.4162780-1-hyesoo.yu@samsung.com (mailing list archive)
State	New

Hyesoo Yu Jan. 20, 2025, 8:28 a.m. UTC

If a slab object is corrupted or an error occurs in its internal
value, continuing after restoration may cause other side effects.
At this point, it is difficult to debug because the problem occurred
in the past. A flag has been added that can cause a panic when there
is a problem with the object.

Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
---
 Documentation/mm/slub.rst |  2 ++
 include/linux/slab.h      |  4 ++++
 mm/slub.c                 | 14 ++++++++++++++
 3 files changed, 20 insertions(+)

Matthew Wilcox Jan. 20, 2025, 3:36 p.m. UTC | #1

On Mon, Jan 20, 2025 at 05:28:21PM +0900, Hyesoo Yu wrote:
> If a slab object is corrupted or an error occurs in its internal
> value, continuing after restoration may cause other side effects.
> At this point, it is difficult to debug because the problem occurred
> in the past. A flag has been added that can cause a panic when there
> is a problem with the object.
> 
> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493

Linux does not use Change IDs.  Please omit these from future patches.

Panicking is a very unfriendly approach.  I think a better approach would
be to freeze the slab where corruption is detected.  That is, no future
objects are allocated from that slab, and attempts to free objects from
that slab become no-ops.  I don't think that should be hard to implement.
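
To make the suggestion above concrete, here is a minimal userspace sketch of the "no further allocations, frees become no-ops" behaviour. It is a toy model under stated assumptions, not SLUB code: `struct toy_slab`, `toy_alloc()`, `toy_free()`, `toy_quarantine()` and the `quarantined` field are all hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_OBJECTS 4

/* Hypothetical userspace model of one slab; not the kernel's struct slab. */
struct toy_slab {
	bool in_use[TOY_OBJECTS];	/* per-object allocation state */
	int storage[TOY_OBJECTS];	/* stand-in for the object memory */
	bool quarantined;		/* set once corruption has been detected */
};

/* Hand out the first free object, or NULL if none is left or the slab is quarantined. */
static int *toy_alloc(struct toy_slab *slab)
{
	if (slab->quarantined)
		return NULL;
	for (size_t i = 0; i < TOY_OBJECTS; i++) {
		if (!slab->in_use[i]) {
			slab->in_use[i] = true;
			return &slab->storage[i];
		}
	}
	return NULL;
}

/* Release an object; returns true only if the object was actually released. */
static bool toy_free(struct toy_slab *slab, const int *obj)
{
	if (slab->quarantined)
		return false;	/* the free is silently discarded */
	for (size_t i = 0; i < TOY_OBJECTS; i++) {
		if (obj == &slab->storage[i] && slab->in_use[i]) {
			slab->in_use[i] = false;
			return true;
		}
	}
	return false;
}

/* Called by the (hypothetical) corruption check instead of panicking. */
static void toy_quarantine(struct toy_slab *slab)
{
	slab->quarantined = true;
}
```

After `toy_quarantine()`, the slab's contents stay exactly as they were when the corruption was found, which keeps the system running but does not stop it at the moment of detection.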

Hyeonggon Yoo Jan. 20, 2025, 3:41 p.m. UTC | #2

On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>
> If a slab object is corrupted or an error occurs in its internal
> value, continuing after restoration may cause other side effects.
> At this point, it is difficult to debug because the problem occurred
> in the past. A flag has been added that can cause a panic when there
> is a problem with the object.

Hi Hyesoo,

I'm concerned about this because it goes against the effort to avoid
introducing new BUG() calls [1].

And I think it would be more appropriate to use the existing panic_on_warn
functionality [2], which causes a panic on WARN(), rather than introducing
a SLUB-specific knob to do the same thing.

However, SLUB does not call WARN() and uses pr_err() instead when
reporting an error. Vlastimil and I talked about changing it to use
WARN() a while ago [3], but neither of us has done that yet.

You may want to look at it, as it also aligns with your purpose.
FYI, if you would like to work on it, please make sure that the WARN()
is suppressed during KUnit tests.

[1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
[2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
[3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz

Best,
Hyeonggon
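
The difference between a pr_err()-only report and a WARN()-based one can be sketched in userspace as follows. This is only an illustrative model, assuming a single global switch that plays the role of panic_on_warn; `toy_panic_on_warn`, `toy_warn()`, `toy_pr_err()` and `enum toy_outcome` are hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>
#include <stdio.h>

/* What the caller should do after the report has been emitted. */
enum toy_outcome { TOY_CONTINUE, TOY_PANIC };

/* Stand-in for the global panic_on_warn setting (assumption: one global knob). */
static bool toy_panic_on_warn;

/* Stand-in for WARN(): always reports, and escalates only if the global knob is set. */
static enum toy_outcome toy_warn(const char *msg)
{
	fprintf(stderr, "WARNING: %s\n", msg);
	return toy_panic_on_warn ? TOY_PANIC : TOY_CONTINUE;
}

/* Stand-in for a pr_err()-only report: it can never escalate to a panic. */
static enum toy_outcome toy_pr_err(const char *msg)
{
	fprintf(stderr, "%s\n", msg);
	return TOY_CONTINUE;
}
```

In this model no allocator-specific flag is needed: whoever enables the global knob while debugging gets the stop-on-corruption behaviour for every subsystem that reports through the warning path.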

> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
> ---
>  Documentation/mm/slub.rst |  2 ++
>  include/linux/slab.h      |  4 ++++
>  mm/slub.c                 | 14 ++++++++++++++
>  3 files changed, 20 insertions(+)
>
> diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
> index 84ca1dc94e5e..ce58525db93d 100644
> --- a/Documentation/mm/slub.rst
> +++ b/Documentation/mm/slub.rst
> @@ -53,6 +53,7 @@ Possible debug options are::
>         U               User tracking (free and alloc)
>         T               Trace (please only use on single slabs)
>         A               Enable failslab filter mark for the cache
> +       C               Panic if object corruption is checked.
>         O               Switch debugging off for caches that would have
>                         caused higher minimum slab orders
>         -               Switch all debugging off (useful if the kernel is
> @@ -113,6 +114,7 @@ options from the ``slab_debug`` parameter translate to the following files::
>         U       store_user
>         T       trace
>         A       failslab
> +       C       corruption_panic
>
>  failslab file is writable, so writing 1 or 0 will enable or disable
>  the option at runtime. Write returns -EINVAL if cache is an alias.
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 10a971c2bde3..4391c30564d6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -31,6 +31,7 @@ enum _slab_flag_bits {
>         _SLAB_CACHE_DMA32,
>         _SLAB_STORE_USER,
>         _SLAB_PANIC,
> +       _SLAB_CORRUPTION_PANIC,
>         _SLAB_TYPESAFE_BY_RCU,
>         _SLAB_TRACE,
>  #ifdef CONFIG_DEBUG_OBJECTS
> @@ -97,6 +98,9 @@ enum _slab_flag_bits {
>  #define SLAB_STORE_USER                __SLAB_FLAG_BIT(_SLAB_STORE_USER)
>  /* Panic if kmem_cache_create() fails */
>  #define SLAB_PANIC             __SLAB_FLAG_BIT(_SLAB_PANIC)
> +/* Panic if object corruption is checked */
> +#define SLAB_CORRUPTION_PANIC  __SLAB_FLAG_BIT(_SLAB_CORRUPTION_PANIC)
> +
>  /**
>   * define SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
>   *
> diff --git a/mm/slub.c b/mm/slub.c
> index 48cefc969480..36a8dabf1349 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1306,6 +1306,8 @@ slab_pad_check(struct kmem_cache *s, struct slab *slab)
>                         fault, end - 1, fault - start);
>         print_section(KERN_ERR, "Padding ", pad, remainder);
>
> +       BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> +
>         restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
>  }
>
> @@ -1389,6 +1391,8 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
>         if (!ret && !slab_in_kunit_test()) {
>                 print_trailer(s, slab, object);
>                 add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> +
> +               BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
>         }
>
>         return ret;
> @@ -1689,6 +1693,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
>                 case 'a':
>                         *flags |= SLAB_FAILSLAB;
>                         break;
> +               case 'c':
> +                       *flags |= SLAB_CORRUPTION_PANIC;
> +                       break;
>                 case 'o':
>                         /*
>                          * Avoid enabling debugging on caches if its minimum
> @@ -6874,6 +6881,12 @@ static ssize_t store_user_show(struct kmem_cache *s, char *buf)
>
>  SLAB_ATTR_RO(store_user);
>
> +static ssize_t corruption_panic_show(struct kmem_cache *s, char *buf)
> +{
> +       return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CORRUPTION_PANIC));
> +}
> +SLAB_ATTR_RO(corruption_panic);
> +
>  static ssize_t validate_show(struct kmem_cache *s, char *buf)
>  {
>         return 0;
> @@ -7092,6 +7105,7 @@ static struct attribute *slab_attrs[] = {
>         &red_zone_attr.attr,
>         &poison_attr.attr,
>         &store_user_attr.attr,
> +       &corruption_panic_attr.attr,
>         &validate_attr.attr,
>  #endif
>  #ifdef CONFIG_ZONE_DMA
> --
> 2.48.0
>

Hyesoo Yu Jan. 21, 2025, 12:40 a.m. UTC | #3

On Mon, Jan 20, 2025 at 03:36:08PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 20, 2025 at 05:28:21PM +0900, Hyesoo Yu wrote:
> > If a slab object is corrupted or an error occurs in its internal
> > value, continuing after restoration may cause other side effects.
> > At this point, it is difficult to debug because the problem occurred
> > in the past. A flag has been added that can cause a panic when there
> > is a problem with the object.
> > 
> > Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> > Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
> 
> Linux does not use Change IDs.  Please omit these from future patches.
> 
> Panicing is a very unfriendly approach.  I think a better approach would
> be to freeze the slab where corruption is detected.  That is, no future
> objects are allocated from that slab, and attempts to free objects from
> that slab become no-ops.  I don't think that should be hard to implement.
>

Thank you for your response. That is my mistake. I will remove the Change-Id.

I agree that freezing is better than recovery or panic for the system's stability.
However, what I want from the patch is not just to keep the system running stably.
I need to trigger a panic immediately so that I can investigate the slab.

I would like to analyze the corrupted data at that moment to check for issues
such as cache problems, user errors, system clock frequency problems and the like,
rather than letting the corruption pass unnoticed.

However, I agree that panic is not a friendly approach.
I will modify the patch to report the problem using WARN() and then rely on
panic_on_warn to trigger the panic.

Thanks,
Regards.

Hyesoo Yu Jan. 21, 2025, 12:54 a.m. UTC | #4

On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
> On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
> >
> > If a slab object is corrupted or an error occurs in its internal
> > value, continuing after restoration may cause other side effects.
> > At this point, it is difficult to debug because the problem occurred
> > in the past. A flag has been added that can cause a panic when there
> > is a problem with the object.
> 
> Hi Hyesoo,
> 
> I'm concerned about this because it goes against the effort to avoid
> introducing new BUG() calls [1].
> 
> And I think it would be more appropriate to use existing panic_on_warn
> functionality [2] which causes
> a panic on WARN(), rather than introducing a SLUB-specific knob to do
> the same thing.
> 
> However SLUB does not call WARN() and uses pr_err() instead when
> reporting an error.
> Vlastimil and I talked about changing it to use WARN() a while ago
> [3], but neither of us
> have done that yet.
> 
> Probably you may want to look at it, as it also aligns with your purpose?
> FYI, if you would like to work on it, please make sure that it WARN()
> is suppressed during kunit test.
> 
> [1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
> [2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
> [3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
> 
> Best,
> Hyeonggon

Thanks for the response.

Using WARN() instead of panic is a great idea.
Thanks for pointing out what I missed.

The next patch will be changed to use WARN().

Thanks.

> 
> > Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
> > Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
> > ---
> >  Documentation/mm/slub.rst |  2 ++
> >  include/linux/slab.h      |  4 ++++
> >  mm/slub.c                 | 14 ++++++++++++++
> >  3 files changed, 20 insertions(+)
> >
> > diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
> > index 84ca1dc94e5e..ce58525db93d 100644
> > --- a/Documentation/mm/slub.rst
> > +++ b/Documentation/mm/slub.rst
> > @@ -53,6 +53,7 @@ Possible debug options are::
> >         U               User tracking (free and alloc)
> >         T               Trace (please only use on single slabs)
> >         A               Enable failslab filter mark for the cache
> > +       C               Panic if object corruption is checked.
> >         O               Switch debugging off for caches that would have
> >                         caused higher minimum slab orders
> >         -               Switch all debugging off (useful if the kernel is
> > @@ -113,6 +114,7 @@ options from the ``slab_debug`` parameter translate to the following files::
> >         U       store_user
> >         T       trace
> >         A       failslab
> > +       C       corruption_panic
> >
> >  failslab file is writable, so writing 1 or 0 will enable or disable
> >  the option at runtime. Write returns -EINVAL if cache is an alias.
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 10a971c2bde3..4391c30564d6 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -31,6 +31,7 @@ enum _slab_flag_bits {
> >         _SLAB_CACHE_DMA32,
> >         _SLAB_STORE_USER,
> >         _SLAB_PANIC,
> > +       _SLAB_CORRUPTION_PANIC,
> >         _SLAB_TYPESAFE_BY_RCU,
> >         _SLAB_TRACE,
> >  #ifdef CONFIG_DEBUG_OBJECTS
> > @@ -97,6 +98,9 @@ enum _slab_flag_bits {
> >  #define SLAB_STORE_USER                __SLAB_FLAG_BIT(_SLAB_STORE_USER)
> >  /* Panic if kmem_cache_create() fails */
> >  #define SLAB_PANIC             __SLAB_FLAG_BIT(_SLAB_PANIC)
> > +/* Panic if object corruption is checked */
> > +#define SLAB_CORRUPTION_PANIC  __SLAB_FLAG_BIT(_SLAB_CORRUPTION_PANIC)
> > +
> >  /**
> >   * define SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
> >   *
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 48cefc969480..36a8dabf1349 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1306,6 +1306,8 @@ slab_pad_check(struct kmem_cache *s, struct slab *slab)
> >                         fault, end - 1, fault - start);
> >         print_section(KERN_ERR, "Padding ", pad, remainder);
> >
> > +       BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> > +
> >         restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
> >  }
> >
> > @@ -1389,6 +1391,8 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> >         if (!ret && !slab_in_kunit_test()) {
> >                 print_trailer(s, slab, object);
> >                 add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> > +
> > +               BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
> >         }
> >
> >         return ret;
> > @@ -1689,6 +1693,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
> >                 case 'a':
> >                         *flags |= SLAB_FAILSLAB;
> >                         break;
> > +               case 'c':
> > +                       *flags |= SLAB_CORRUPTION_PANIC;
> > +                       break;
> >                 case 'o':
> >                         /*
> >                          * Avoid enabling debugging on caches if its minimum
> > @@ -6874,6 +6881,12 @@ static ssize_t store_user_show(struct kmem_cache *s, char *buf)
> >
> >  SLAB_ATTR_RO(store_user);
> >
> > +static ssize_t corruption_panic_show(struct kmem_cache *s, char *buf)
> > +{
> > +       return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CORRUPTION_PANIC));
> > +}
> > +SLAB_ATTR_RO(corruption_panic);
> > +
> >  static ssize_t validate_show(struct kmem_cache *s, char *buf)
> >  {
> >         return 0;
> > @@ -7092,6 +7105,7 @@ static struct attribute *slab_attrs[] = {
> >         &red_zone_attr.attr,
> >         &poison_attr.attr,
> >         &store_user_attr.attr,
> > +       &corruption_panic_attr.attr,
> >         &validate_attr.attr,
> >  #endif
> >  #ifdef CONFIG_ZONE_DMA
> > --
> > 2.48.0
> >
>

Hyeonggon Yoo Jan. 21, 2025, 1:48 a.m. UTC | #5

On 1/21/2025 9:54 AM, Hyesoo Yu wrote:
> On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
>> On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>>>
>>> If a slab object is corrupted or an error occurs in its internal
>>> value, continuing after restoration may cause other side effects.
>>> At this point, it is difficult to debug because the problem occurred
>>> in the past. A flag has been added that can cause a panic when there
>>> is a problem with the object.
>>
>> Hi Hyesoo,
>>
>> I'm concerned about this because it goes against the effort to avoid
>> introducing new BUG() calls [1].
>>
>> And I think it would be more appropriate to use existing panic_on_warn
>> functionality [2] which causes
>> a panic on WARN(), rather than introducing a SLUB-specific knob to do
>> the same thing.
>>
>> However SLUB does not call WARN() and uses pr_err() instead when
>> reporting an error.
>> Vlastimil and I talked about changing it to use WARN() a while ago
>> [3], but neither of us
>> have done that yet.
>>
>> Probably you may want to look at it, as it also aligns with your purpose?
>> FYI, if you would like to work on it, please make sure that it WARN()
>> is suppressed during kunit test.
>>
>> [1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
>> [2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
>> [3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
>>
>> Best,
>> Hyeonggon
> 
> Thanks for response.
> 
> Using warn() instead of panic, is a great idea.
> Thanks for pointing out what I missed.

Just for clarification, I think changing the common error reporting
logic (like, slab_bug()) to use WARN() will be preferable to inserting
new WARN()s at random points, which is what this patch does now.

Best,
Hyeonggon

Hyesoo Yu Jan. 21, 2025, 2:32 a.m. UTC | #6

On Tue, Jan 21, 2025 at 10:48:08AM +0900, Hyeonggon Yoo wrote:
> 
> 
> On 1/21/2025 9:54 AM, Hyesoo Yu wrote:
> > On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
> > > On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
> > > > 
> > > > If a slab object is corrupted or an error occurs in its internal
> > > > value, continuing after restoration may cause other side effects.
> > > > At this point, it is difficult to debug because the problem occurred
> > > > in the past. A flag has been added that can cause a panic when there
> > > > is a problem with the object.
> > > 
> > > Hi Hyesoo,
> > > 
> > > I'm concerned about this because it goes against the effort to avoid
> > > introducing new BUG() calls [1].
> > > 
> > > And I think it would be more appropriate to use existing panic_on_warn
> > > functionality [2] which causes
> > > a panic on WARN(), rather than introducing a SLUB-specific knob to do
> > > the same thing.
> > > 
> > > However SLUB does not call WARN() and uses pr_err() instead when
> > > reporting an error.
> > > Vlastimil and I talked about changing it to use WARN() a while ago
> > > [3], but neither of us
> > > have done that yet.
> > > 
> > > Probably you may want to look at it, as it also aligns with your purpose?
> > > FYI, if you would like to work on it, please make sure that it WARN()
> > > is suppressed during kunit test.
> > > 
> > > [1] https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
> > > [2] https://www.kernel.org/doc/html/v6.9/admin-guide/sysctl/kernel.html#panic-on-warn
> > > [3] https://lore.kernel.org/linux-mm/d4219cd9-32d3-4697-93b9-6a44bf77d50c@suse.cz
> > > 
> > > Best,
> > > Hyeonggon
> > 
> > Thanks for response.
> > 
> > Using warn() instead of panic, is a great idea.
> > Thanks for pointing out what I missed.
> 
> Just for clarification, I think changing the common error reporting
> logic (like, slab_bug()) to use WARN() will be preferable to inserting
> new WARN()s at random points, which is what this patch does now.
> 
> Best,
> Hyeonggon
> 

Thank you for the clarification.

Actually, I considered adding BUG_ON() to slab_bug(). However, if we add BUG_ON() to slab_bug(),
it will prevent many meaningful error logs from being printed afterwards.
As you know, slab_bug() prints the log that usually appears at the beginning of a bug
report in the slab. As a result, it would be difficult to figure out the problems based on
the logs from our large-scale test bed.

Similarly, even if I use WARN() in slab_bug(), we won't be able to obtain the logs
when panic_on_warn is enabled. I don't think it is useful to put WARN() in slab_bug().

Instead, I will implement a solution where WARN() is only used in slab_fix(), before
the slab object is restored. If I add it to slab_fix(), I think the warning is suppressed in
KUnit tests by the slab_add_kunit_errors() handling.

Thanks,
Regards.
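
The ordering argument above (print every diagnostic first, warn last, stay quiet under KUnit) can be sketched with a small userspace model. All names here (`toy_report_corruption()`, `toy_slab_bug()`, `toy_print_trailer()`, `toy_slab_fix()`, `toy_in_kunit_test`, `toy_kunit_errors`) are hypothetical stand-ins chosen to mirror the functions mentioned in this thread; the sketch does not reproduce the real mm/slub.c code.

```c
#include <stdbool.h>

#define TOY_LOG_MAX 8

/* Ordered record of what was reported, so the sequence can be inspected. */
static const char *toy_log[TOY_LOG_MAX];
static int toy_log_len;

static bool toy_in_kunit_test;	/* stand-in for "a KUnit test is running" */
static int toy_kunit_errors;	/* stand-in for the KUnit error counter */

static void toy_emit(const char *event)
{
	if (toy_log_len < TOY_LOG_MAX)
		toy_log[toy_log_len++] = event;
}

/* First line of a report; warning here would cut off everything that follows. */
static void toy_slab_bug(void)
{
	toy_emit("bug header");
}

/* Detailed dump of the damaged object. */
static void toy_print_trailer(void)
{
	toy_emit("object dump");
}

/* Last step before restoring the bytes: count the error in tests, warn otherwise. */
static void toy_slab_fix(void)
{
	if (toy_in_kunit_test) {
		toy_kunit_errors++;
		return;
	}
	toy_emit("fix message");
	toy_emit("WARN");	/* with panic_on_warn set, the system would stop here */
}

static void toy_report_corruption(void)
{
	toy_slab_bug();
	toy_print_trailer();
	toy_slab_fix();
}
```

Because the warning is the final event, a panic triggered from it would still leave the header and the object dump in the log, and in the test case nothing warns at all.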

Vlastimil Babka Jan. 21, 2025, 10:27 a.m. UTC | #7

On 1/21/25 1:40 AM, Hyesoo Yu wrote:
> On Mon, Jan 20, 2025 at 03:36:08PM +0000, Matthew Wilcox wrote:
>> On Mon, Jan 20, 2025 at 05:28:21PM +0900, Hyesoo Yu wrote:
>>> If a slab object is corrupted or an error occurs in its internal
>>> value, continuing after restoration may cause other side effects.
>>> At this point, it is difficult to debug because the problem occurred
>>> in the past. A flag has been added that can cause a panic when there
>>> is a problem with the object.
>>>
>>> Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
>>> Change-Id: I4e7e5e0ec3421a7f6c84d591db052f79d3775493
>>
>> Linux does not use Change IDs.  Please omit these from future patches.
>>
>> Panicing is a very unfriendly approach.  I think a better approach would
>> be to freeze the slab where corruption is detected.  That is, no future
>> objects are allocated from that slab, and attempts to free objects from
>> that slab become no-ops.  I don't think that should be hard to implement.

Freezing of slab is already done in some cases when corruption is
detected - all objects are marked as used, and further freeing attempts
on the slab are discarded. Perhaps not all cases, which could be improved.

> Thanks you for your responce. That is my mistake. I will remove the change ID.
> 
> I agree that freezing is better than recovery or panic for the system's stability.
> However what I want from the patch is not just to make the system run stably.
> I need to immediately trigger a panic to investigate the slub.

IMHO it's a valid goal to panic more quickly when debugging, and
enabling slub_debug means debugging is in progress (as opposed to normal
production when we try to avoid panic).
But making it possible to reuse the general panic_on_warn mechanism
(which can be also expected to be enabled when debugging) is indeed
preferable to introducing a new slab-specific flag.

> I would like to analyze the corrupted data at that moment to check issues
> like cache problem, user errors, system clock frequency and similar problems,
> not just passing by without any issues.
> 
> However I agree that panic is not a friendly approach.
> I will modify it to notify the problem using warn() and then use
> panic_on_warn to trigger panic.
> 
> Thanks,
> Regards.
> 
>

Vlastimil Babka Jan. 21, 2025, 10:28 a.m. UTC | #8

On 1/21/25 3:32 AM, Hyesoo Yu wrote:
> On Tue, Jan 21, 2025 at 10:48:08AM +0900, Hyeonggon Yoo wrote:
>>
>>
>> On 1/21/2025 9:54 AM, Hyesoo Yu wrote:
>>> On Tue, Jan 21, 2025 at 12:41:01AM +0900, Hyeonggon Yoo wrote:
>>>> On Mon, Jan 20, 2025 at 5:30 PM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>>>
>>> Thanks for response.
>>>
>>> Using warn() instead of panic, is a great idea.
>>> Thanks for pointing out what I missed.
>>
>> Just for clarification, I think changing the common error reporting
>> logic (like, slab_bug()) to use WARN() will be preferable to inserting
>> new WARN()s at random points, which is what this patch does now.
>>
>> Best,
>> Hyeonggon
>>
> 
> Thanks you for clarification.
> 
> Actually, I considered adding BUG_ON() to slab_bug. However if we add BUG_ON() to slab_bug,
> it will prevent many meaningful error log from being printed subsequently.
> As you know, slab_bug is the log that usually is printed at the biginning of a bug
> in the slab. As a result, it would be difficult to figure out the problems based on
> the logs during our large-scale test-bed.
> 
> Similary, even if I use WARN() in slab_bug, we won't be able to obtain the logs
> when panic_on_warn is enabled. I don't think it is useful to include WARN in slab_bug.
> 
> Instead, I will implement a solution where WARN is only used in slab_fix before
> slab object is restored. If I add it to slab_fix, I think warning is suppressed on
> kunit test by slab_add_kunit_errors handling.

Right, makes sense to only do the WARN() after printing the debugging
logs. Thanks.

> Thanks,
> Regards.
> 
>

kernel test robot Jan. 21, 2025, 12:46 p.m. UTC | #9

Hi Hyesoo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Hyesoo-Yu/mm-slub-Panic-if-the-object-corruption-is-checked/20250120-163233
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250120082908.4162780-1-hyesoo.yu%40samsung.com
patch subject: [PATCH] mm: slub: Panic if the object corruption is checked.
config: mips-randconfig-r111-20250121 (https://download.01.org/0day-ci/archive/20250121/202501212026.lUnLNhv6-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 14.2.0
reproduce: (https://download.01.org/0day-ci/archive/20250121/202501212026.lUnLNhv6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501212026.lUnLNhv6-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/slub.c:1308:9: sparse: sparse: cast from restricted slab_flags_t
   mm/slub.c:1394:17: sparse: sparse: cast from restricted slab_flags_t
   mm/slub.c:4440:47: sparse: sparse: context imbalance in '__slab_free' - unexpected unlock

vim +1308 mm/slub.c

  1273	
  1274	/* Check the pad bytes at the end of a slab page */
  1275	static pad_check_attributes void
  1276	slab_pad_check(struct kmem_cache *s, struct slab *slab)
  1277	{
  1278		u8 *start;
  1279		u8 *fault;
  1280		u8 *end;
  1281		u8 *pad;
  1282		int length;
  1283		int remainder;
  1284	
  1285		if (!(s->flags & SLAB_POISON))
  1286			return;
  1287	
  1288		start = slab_address(slab);
  1289		length = slab_size(slab);
  1290		end = start + length;
  1291		remainder = length % s->size;
  1292		if (!remainder)
  1293			return;
  1294	
  1295		pad = end - remainder;
  1296		metadata_access_enable();
  1297		fault = memchr_inv(kasan_reset_tag(pad), POISON_INUSE, remainder);
  1298		metadata_access_disable();
  1299		if (!fault)
  1300			return;
  1301		while (end > fault && end[-1] == POISON_INUSE)
  1302			end--;
  1303	
  1304		slab_err(s, slab, "Padding overwritten. 0x%p-0x%p @offset=%tu",
  1305				fault, end - 1, fault - start);
  1306		print_section(KERN_ERR, "Padding ", pad, remainder);
  1307	
> 1308		BUG_ON(s->flags & SLAB_CORRUPTION_PANIC);
  1309	
  1310		restore_bytes(s, "slab padding", POISON_INUSE, fault, end);
  1311	}
  1312
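
The two new sparse warnings point at the lines where a `slab_flags_t` expression is handed directly to BUG_ON(). A plausible reading, offered here as an assumption rather than a confirmed diagnosis, is that the macro on this architecture converts its argument to a plain integer, which sparse rejects for a restricted type. One common way to avoid this kind of warning is to turn the masked value into a boolean before it is used, as the `!!(...)` in the patch's own corruption_panic_show() does. The userspace sketch below shows only the shape of that conversion; `toy_slab_flags_t`, `TOY_SLAB_CORRUPTION_PANIC` (including its bit position) and `toy_wants_panic()` are hypothetical, and an ordinary compiler will not reproduce the sparse check.

```c
#include <stdbool.h>

/* Plain integer stand-in; in the kernel the real type is checked by sparse. */
typedef unsigned int toy_slab_flags_t;

/* Hypothetical bit position, chosen only for this example. */
#define TOY_SLAB_CORRUPTION_PANIC ((toy_slab_flags_t)1u << 5)

/* Reduce the masked flags to a boolean so callers never see the flag type. */
static bool toy_wants_panic(toy_slab_flags_t flags)
{
	return (flags & TOY_SLAB_CORRUPTION_PANIC) != 0;
}
```

A caller would then test `toy_wants_panic(flags)` instead of passing the masked flags value itself to an assertion macro.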
