Message ID | 20210309134720.29052-1-georgi.djakov@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Series | mm/slub: Add slub_debug option to panic on memory corruption |
On Tue, 9 Mar 2021, Georgi Djakov wrote:

> Being able to stop the system immediately when a memory corruption
> is detected is crucial to finding the source of it. This is very
> useful when the memory can be inspected with kdump or other tools.

Hmmm.... ok.

>  static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
>  						void *from, void *to)
>  {
> +	if (slub_debug & SLAB_CORRUPTION_PANIC)
> +		panic("slab: object overwritten\n");
>  	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
>  	memset(from, data, to - from);
>  }

Why panic here? This should only be called late in the bug reporting when
an error has already been printed.

On 3/9/21 2:47 PM, Georgi Djakov wrote:
> Being able to stop the system immediately when a memory corruption
> is detected is crucial to finding the source of it. This is very
> useful when the memory can be inspected with kdump or other tools.

Is this in some testing scenarios where you would also use e.g. panic_on_warn?
We could hook to that. If not, we could introduce a new
panic_on_memory_corruption that would apply also for debug_pagealloc and whatnot?

> Let's add an option to panic the kernel when slab debug catches an
> object or list corruption.
>
> This new option is not enabled by default (yet), so it needs to be
> enabled explicitly (for example by adding "slub_debug=FZPUC" to
> the kernel command line).
>
> Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
> ---
>  Documentation/vm/slub.rst | 1 +
>  include/linux/slab.h      | 3 +++
>  mm/slab.h                 | 2 +-
>  mm/slub.c                 | 9 +++++++++
>  4 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/vm/slub.rst b/Documentation/vm/slub.rst
> index 03f294a638bd..32878c44f3de 100644
> --- a/Documentation/vm/slub.rst
> +++ b/Documentation/vm/slub.rst
> @@ -53,6 +53,7 @@ Possible debug options are::
>  	Z		Red zoning
>  	P		Poisoning (object and padding)
>  	U		User tracking (free and alloc)
> +	C		Panic on object corruption (enables SLAB_CORRUPTION_PANIC)
>  	T		Trace (please only use on single slabs)
>  	A		Enable failslab filter mark for the cache
>  	O		Switch debugging off for caches that would have
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 0c97d788762c..ebff5e704d08 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -39,6 +39,9 @@
>  #define SLAB_STORE_USER		((slab_flags_t __force)0x00010000U)
>  /* Panic if kmem_cache_create() fails */
>  #define SLAB_PANIC		((slab_flags_t __force)0x00040000U)
> +/* Panic if memory corruption is detected */
> +#define SLAB_CORRUPTION_PANIC	((slab_flags_t __force)0x00080000U)
> +
>  /*
>   * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
>   *
> diff --git a/mm/slab.h b/mm/slab.h
> index 120b1d0dfb6d..ae0079017fc6 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -134,7 +134,7 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
>  #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
>  #elif defined(CONFIG_SLUB_DEBUG)
>  #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> -			  SLAB_TRACE | SLAB_CONSISTENCY_CHECKS)
> +			  SLAB_TRACE | SLAB_CONSISTENCY_CHECKS | SLAB_CORRUPTION_PANIC)
>  #else
>  #define SLAB_DEBUG_FLAGS (0)
>  #endif
> diff --git a/mm/slub.c b/mm/slub.c
> index 077a019e4d7a..49351427f701 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -741,6 +741,8 @@ void object_err(struct kmem_cache *s, struct page *page,
>  {
>  	slab_bug(s, "%s", reason);
>  	print_trailer(s, page, object);
> +	if (slub_debug & SLAB_CORRUPTION_PANIC)
> +		panic(reason);
>  }
>
>  static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
> @@ -755,6 +757,8 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
>  	slab_bug(s, "%s", buf);
>  	print_page_info(page);
>  	dump_stack();
> +	if (slub_debug & SLAB_CORRUPTION_PANIC)
> +		panic("slab: slab error\n");
>  }
>
>  static void init_object(struct kmem_cache *s, void *object, u8 val)
> @@ -776,6 +780,8 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
>  static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
>  						void *from, void *to)
>  {
> +	if (slub_debug & SLAB_CORRUPTION_PANIC)
> +		panic("slab: object overwritten\n");
>  	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
>  	memset(from, data, to - from);
>  }
> @@ -1319,6 +1325,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
>  		case 'a':
>  			*flags |= SLAB_FAILSLAB;
>  			break;
> +		case 'c':
> +			*flags |= SLAB_CORRUPTION_PANIC;
> +			break;
>  		case 'o':
>  			/*
>  			 * Avoid enabling debugging on caches if its minimum
>

Hi Christoph,

Thanks for the comments!

On 3/9/21 16:56, Christoph Lameter wrote:
> On Tue, 9 Mar 2021, Georgi Djakov wrote:
>
>> Being able to stop the system immediately when a memory corruption
>> is detected is crucial to finding the source of it. This is very
>> useful when the memory can be inspected with kdump or other tools.
>
> Hmmm.... ok.

The idea is to be able to collect data right after the corruption is
detected; otherwise more data might be corrupted and tracing becomes
more difficult.

>
>>  static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
>>  						void *from, void *to)
>>  {
>> +	if (slub_debug & SLAB_CORRUPTION_PANIC)
>> +		panic("slab: object overwritten\n");
>>  	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
>>  	memset(from, data, to - from);
>>  }
>
> Why panic here? This should only be called late in the bug reporting when
> an error has already been printed.

This is called by both slab_pad_check() and check_bytes_and_report(),
so it seemed like a common place where I could put the panic(). I can
move it to the caller functions instead, if that's preferred.

Thanks,
Georgi

Hi Vlastimil,

Thanks for the comment!

On 3/9/21 17:09, Vlastimil Babka wrote:
> On 3/9/21 2:47 PM, Georgi Djakov wrote:
>> Being able to stop the system immediately when a memory corruption
>> is detected is crucial to finding the source of it. This is very
>> useful when the memory can be inspected with kdump or other tools.
>
> Is this in some testing scenarios where you would also use e.g. panic_on_warn?
> We could hook to that. If not, we could introduce a new
> panic_on_memory_corruption that would apply also for debug_pagealloc and whatnot?

I would prefer that we not tie it with panic_on_warn - there might be lots of
new code in multiple subsystems, so hitting some WARNing while testing is not
something unexpected.

Introducing an additional panic_on_memory_corruption would work, but I noticed
that we already have slub_debug and thought to re-use that. But indeed, adding
an option to panic in for example bad_page() sounds also useful, if that's what
you suggest.

Thanks,
Georgi

On 3/9/21 7:14 PM, Georgi Djakov wrote:
> Hi Vlastimil,
>
> Thanks for the comment!
>
> On 3/9/21 17:09, Vlastimil Babka wrote:
>> On 3/9/21 2:47 PM, Georgi Djakov wrote:
>>> Being able to stop the system immediately when a memory corruption
>>> is detected is crucial to finding the source of it. This is very
>>> useful when the memory can be inspected with kdump or other tools.
>>
>> Is this in some testing scenarios where you would also use e.g. panic_on_warn?
>> We could hook to that. If not, we could introduce a new
>> panic_on_memory_corruption that would apply also for debug_pagealloc and whatnot?
>
> I would prefer that we not tie it with panic_on_warn - there might be lots of
> new code in multiple subsystems, so hitting some WARNing while testing is not
> something unexpected.
>
> Introducing an additional panic_on_memory_corruption would work, but I noticed
> that we already have slub_debug and thought to re-use that. But indeed, adding
> an option to panic in for example bad_page() sounds also useful, if that's what
> you suggest.

Yes, that would be another example.
Also CCing Kees for input, as besides the "kdump ASAP for debugging" case, I can
imagine security hardening folks could be interested in the "somebody might have
just failed to pwn the kernel, better panic than let them continue" angle. But
I'm naive wrt security, so it might be a stupid idea :)

Vlastimil

> Thanks,
> Georgi

On Tue, Mar 09, 2021 at 07:18:32PM +0100, Vlastimil Babka wrote:
> On 3/9/21 7:14 PM, Georgi Djakov wrote:
> > Hi Vlastimil,
> >
> > Thanks for the comment!
> >
> > On 3/9/21 17:09, Vlastimil Babka wrote:
> >> On 3/9/21 2:47 PM, Georgi Djakov wrote:
> >>> Being able to stop the system immediately when a memory corruption
> >>> is detected is crucial to finding the source of it. This is very
> >>> useful when the memory can be inspected with kdump or other tools.
> >>
> >> Is this in some testing scenarios where you would also use e.g. panic_on_warn?
> >> We could hook to that. If not, we could introduce a new
> >> panic_on_memory_corruption that would apply also for debug_pagealloc and whatnot?
> >
> > I would prefer that we not tie it with panic_on_warn - there might be lots of
> > new code in multiple subsystems, so hitting some WARNing while testing is not
> > something unexpected.
> >
> > Introducing an additional panic_on_memory_corruption would work, but I noticed
> > that we already have slub_debug and thought to re-use that. But indeed, adding
> > an option to panic in for example bad_page() sounds also useful, if that's what
> > you suggest.
>
> Yes, that would be another example.
> Also CCing Kees for input, as besides the "kdump ASAP for debugging" case, I can
> imagine security hardening folks could be interested in the "somebody might have
> just failed to pwn the kernel, better panic than let them continue" angle. But
> I'm naive wrt security, so it might be a stupid idea :)

I've really wanted such things, but Linus has been pretty adamant about
not wanting to provide new "panic" paths (or even BUG usage[1]). It
seems that panic_on_warn remains the way to get this behavior,
with the understanding that WARN should only be produced on
expected-to-be-impossible situations[1].

Hitting a WARN while testing should result in either finding and fixing
a real bug, or removing the WARN in favor of pr_warn(). :)

-Kees

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on

On 3/18/21 6:48 AM, Kees Cook wrote:
> On Tue, Mar 09, 2021 at 07:18:32PM +0100, Vlastimil Babka wrote:
>> On 3/9/21 7:14 PM, Georgi Djakov wrote:
>> > Hi Vlastimil,
>> >
>> > Thanks for the comment!
>> >
>> > On 3/9/21 17:09, Vlastimil Babka wrote:
>> >> On 3/9/21 2:47 PM, Georgi Djakov wrote:
>> >>> Being able to stop the system immediately when a memory corruption
>> >>> is detected is crucial to finding the source of it. This is very
>> >>> useful when the memory can be inspected with kdump or other tools.
>> >>
>> >> Is this in some testing scenarios where you would also use e.g. panic_on_warn?
>> >> We could hook to that. If not, we could introduce a new
>> >> panic_on_memory_corruption that would apply also for debug_pagealloc and whatnot?
>> >
>> > I would prefer that we not tie it with panic_on_warn - there might be lots of
>> > new code in multiple subsystems, so hitting some WARNing while testing is not
>> > something unexpected.
>> >
>> > Introducing an additional panic_on_memory_corruption would work, but I noticed
>> > that we already have slub_debug and thought to re-use that. But indeed, adding
>> > an option to panic in for example bad_page() sounds also useful, if that's what
>> > you suggest.
>>
>> Yes, that would be another example.
>> Also CCing Kees for input, as besides the "kdump ASAP for debugging" case, I can
>> imagine security hardening folks could be interested in the "somebody might have
>> just failed to pwn the kernel, better panic than let them continue" angle. But
>> I'm naive wrt security, so it might be a stupid idea :)
>
> I've really wanted such things, but Linus has been pretty adamant about
> not wanting to provide new "panic" paths (or even BUG usage[1]). It
> seems that panic_on_warn remains the way to get this behavior,
> with the understanding that WARN should only be produced on
> expected-to-be-impossible situations[1].
>
> Hitting a WARN while testing should result in either finding and fixing
> a real bug, or removing the WARN in favor of pr_warn(). :)

I was going to suggest adding a panic_on_taint parameter... but turns out it was
already added last year! And various memory corruption detections already use
TAINT_BAD_PAGE, including SLUB.
If anything's missing an add_taint() it can be added, and with the parameter you
should get what you want.

> -Kees
>
> [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
>

On Thu, Mar 18, 2021 at 01:56:05PM +0100, Vlastimil Babka wrote:
> I was going to suggest adding a panic_on_taint parameter... but turns out it was
> already added last year! And various memory corruption detections already use
> TAINT_BAD_PAGE, including SLUB.
> If anything's missing an add_taint() it can be added, and with the parameter you
> should get what you want.

Ah-ha! That works too. I hadn't seen that -- I wonder if I can wire some
other hardening things up to that. (e.g. refactor BUG_ON_CORRUPTION
finally.)

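For readers who want to try the panic_on_taint route discussed above, here is a
rough userspace sketch of how the parameter value could be derived. It assumes
that TAINT_BAD_PAGE is taint bit 5 (as listed in the kernel's tainted-kernels
documentation) and that panic_on_taint takes a hexadecimal bitmask. The helpers
taint_mask() and format_panic_on_taint() are made up for illustration and are
not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: bit number of TAINT_BAD_PAGE in the kernel's taint flags. */
#define TAINT_BAD_PAGE 5

/* Hypothetical helper: bitmask with only the given taint bit set. */
static unsigned long taint_mask(unsigned int bit)
{
	assert(bit < 8 * sizeof(unsigned long));
	return 1UL << bit;
}

/*
 * Hypothetical helper: write "panic_on_taint=0x<mask>" into buf.
 * Returns the number of characters snprintf() would have written.
 */
static int format_panic_on_taint(char *buf, size_t len, unsigned long mask)
{
	return snprintf(buf, len, "panic_on_taint=0x%lx", mask);
}
```

Under those assumptions, format_panic_on_taint(buf, sizeof(buf),
taint_mask(TAINT_BAD_PAGE)) yields the string "panic_on_taint=0x20", which would
be appended to the kernel command line next to the usual slub_debug= options.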
diff --git a/Documentation/vm/slub.rst b/Documentation/vm/slub.rst
index 03f294a638bd..32878c44f3de 100644
--- a/Documentation/vm/slub.rst
+++ b/Documentation/vm/slub.rst
@@ -53,6 +53,7 @@ Possible debug options are::
 	Z		Red zoning
 	P		Poisoning (object and padding)
 	U		User tracking (free and alloc)
+	C		Panic on object corruption (enables SLAB_CORRUPTION_PANIC)
 	T		Trace (please only use on single slabs)
 	A		Enable failslab filter mark for the cache
 	O		Switch debugging off for caches that would have
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0c97d788762c..ebff5e704d08 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -39,6 +39,9 @@
 #define SLAB_STORE_USER		((slab_flags_t __force)0x00010000U)
 /* Panic if kmem_cache_create() fails */
 #define SLAB_PANIC		((slab_flags_t __force)0x00040000U)
+/* Panic if memory corruption is detected */
+#define SLAB_CORRUPTION_PANIC	((slab_flags_t __force)0x00080000U)
+
 /*
  * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
  *
diff --git a/mm/slab.h b/mm/slab.h
index 120b1d0dfb6d..ae0079017fc6 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -134,7 +134,7 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
 #elif defined(CONFIG_SLUB_DEBUG)
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
-			  SLAB_TRACE | SLAB_CONSISTENCY_CHECKS)
+			  SLAB_TRACE | SLAB_CONSISTENCY_CHECKS | SLAB_CORRUPTION_PANIC)
 #else
 #define SLAB_DEBUG_FLAGS (0)
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 077a019e4d7a..49351427f701 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -741,6 +741,8 @@ void object_err(struct kmem_cache *s, struct page *page,
 {
 	slab_bug(s, "%s", reason);
 	print_trailer(s, page, object);
+	if (slub_debug & SLAB_CORRUPTION_PANIC)
+		panic(reason);
 }

 static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
@@ -755,6 +757,8 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
 	slab_bug(s, "%s", buf);
 	print_page_info(page);
 	dump_stack();
+	if (slub_debug & SLAB_CORRUPTION_PANIC)
+		panic("slab: slab error\n");
 }

 static void init_object(struct kmem_cache *s, void *object, u8 val)
@@ -776,6 +780,8 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
 						void *from, void *to)
 {
+	if (slub_debug & SLAB_CORRUPTION_PANIC)
+		panic("slab: object overwritten\n");
 	slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
 	memset(from, data, to - from);
 }
@@ -1319,6 +1325,9 @@ parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init)
 		case 'a':
 			*flags |= SLAB_FAILSLAB;
 			break;
+		case 'c':
+			*flags |= SLAB_CORRUPTION_PANIC;
+			break;
 		case 'o':
 			/*
 			 * Avoid enabling debugging on caches if its minimum

Being able to stop the system immediately when a memory corruption
is detected is crucial to finding the source of it. This is very
useful when the memory can be inspected with kdump or other tools.

Let's add an option to panic the kernel when slab debug catches an
object or list corruption.

This new option is not enabled by default (yet), so it needs to be
enabled explicitly (for example by adding "slub_debug=FZPUC" to
the kernel command line).

Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
---
 Documentation/vm/slub.rst | 1 +
 include/linux/slab.h      | 3 +++
 mm/slab.h                 | 2 +-
 mm/slub.c                 | 9 +++++++++
 4 files changed, 14 insertions(+), 1 deletion(-)
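
To illustrate how a string such as "slub_debug=FZPUC" would map onto flag bits
with the proposed 'C' option, here is a small userspace sketch loosely modelled
on the letter switch in parse_slub_debug_flags(). It is not the kernel
function: parse_debug_letters() is a hypothetical name, the per-cache slab list
syntax is left out, and the values for SLAB_CONSISTENCY_CHECKS, SLAB_RED_ZONE
and SLAB_POISON are assumptions taken from include/linux/slab.h of that era.
SLAB_STORE_USER and SLAB_CORRUPTION_PANIC use the values shown in the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Assumed values (include/linux/slab.h of that era). */
#define SLAB_CONSISTENCY_CHECKS	0x00000100U
#define SLAB_RED_ZONE		0x00000400U
#define SLAB_POISON		0x00000800U
/* Values as they appear in the patch. */
#define SLAB_STORE_USER		0x00010000U
#define SLAB_CORRUPTION_PANIC	0x00080000U

/*
 * Hypothetical helper: translate option letters (case-insensitive) into
 * a flag mask. Returns 0 on success, -1 on an unrecognised letter.
 */
static int parse_debug_letters(const char *str, uint32_t *flags)
{
	assert(str && flags);
	*flags = 0;

	for (; *str; str++) {
		switch (tolower((unsigned char)*str)) {
		case 'f':
			*flags |= SLAB_CONSISTENCY_CHECKS;
			break;
		case 'z':
			*flags |= SLAB_RED_ZONE;
			break;
		case 'p':
			*flags |= SLAB_POISON;
			break;
		case 'u':
			*flags |= SLAB_STORE_USER;
			break;
		case 'c':	/* the option proposed by this patch */
			*flags |= SLAB_CORRUPTION_PANIC;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

Calling parse_debug_letters("FZPUC", &flags) sets all five bits, so a check like
flags & SLAB_CORRUPTION_PANIC, which mirrors the tests the patch adds on
slub_debug in mm/slub.c, becomes true.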