Message ID | 20230130130739.563628-1-arnd@kernel.org |
---|---|
State | New |
Series | mm: extend max struct page size for kmsan |
On Mon 30-01-23 14:07:26, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
>
> After x86 has enabled support for KMSAN, it has become possible
> to have larger 'struct page' than was expected when commit
> 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> architectures") was merged:
>
> include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
>   switch (sizeof(struct page)) {
>
> Extend the maximum accordingly.
>
> Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")
> Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Acked-by: Michal Hocko <mhocko@suse.com>

I haven't really followed KMSAN development but I would have expected
that it would, like other debugging tools, add its metadata to page_ext
rather than page directly.

> ---
> This seems to show up extremely rarely in randconfig builds, but
> enough to trigger my build machine.
>
> I saw a related discussion at [1] about raising MAX_STRUCT_PAGE_SIZE,
> but as I understand it, that needs to be addressed separately.
>
> [1] https://lore.kernel.org/lkml/20220701142310.2188015-11-glider@google.com/
> ---
>  include/linux/mm.h | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b73ba2e5cfd2..aa39d5ddace1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -137,7 +137,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
>   * define their own version of this macro in <asm/pgtable.h>
>   */
>  #if BITS_PER_LONG == 64
> -/* This function must be updated when the size of struct page grows above 80
> +/* This function must be updated when the size of struct page grows above 96
>   * or reduces below 56. The idea that compiler optimizes out switch()
>   * statement, and only leaves move/store instructions. Also the compiler can
>   * combine write statements if they are both assignments and can be reordered,
> @@ -148,12 +148,18 @@ static inline void __mm_zero_struct_page(struct page *page)
>  {
>  	unsigned long *_pp = (void *)page;
>
> -	/* Check that struct page is either 56, 64, 72, or 80 bytes */
> +	/* Check that struct page is either 56, 64, 72, 80, 88 or 96 bytes */
>  	BUILD_BUG_ON(sizeof(struct page) & 7);
>  	BUILD_BUG_ON(sizeof(struct page) < 56);
> -	BUILD_BUG_ON(sizeof(struct page) > 80);
> +	BUILD_BUG_ON(sizeof(struct page) > 96);
>
>  	switch (sizeof(struct page)) {
> +	case 96:
> +		_pp[11] = 0;
> +		fallthrough;
> +	case 88:
> +		_pp[10] = 0;
> +		fallthrough;
>  	case 80:
>  		_pp[9] = 0;
>  		fallthrough;
> --
> 2.39.0
On Mon, Jan 30, 2023 at 5:07 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> After x86 has enabled support for KMSAN, it has become possible
> to have larger 'struct page' than was expected when commit
> 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> architectures") was merged:
>
> include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
>   switch (sizeof(struct page)) {
>
> Extend the maximum accordingly.
>
> Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")
> Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86")

Rather than saying this fixes the code that enables the config flags, I
might be more comfortable with listing the commit that added the two
pointers to the struct:

Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")

It will make it easier to identify where the lines were added that
actually increased the size of the page struct.

> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> This seems to show up extremely rarely in randconfig builds, but
> enough to trigger my build machine.
>
> I saw a related discussion at [1] about raising MAX_STRUCT_PAGE_SIZE,
> but as I understand it, that needs to be addressed separately.
>
> [1] https://lore.kernel.org/lkml/20220701142310.2188015-11-glider@google.com/
> ---
>  include/linux/mm.h | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b73ba2e5cfd2..aa39d5ddace1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -137,7 +137,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
>   * define their own version of this macro in <asm/pgtable.h>
>   */
>  #if BITS_PER_LONG == 64
> -/* This function must be updated when the size of struct page grows above 80
> +/* This function must be updated when the size of struct page grows above 96
>   * or reduces below 56. The idea that compiler optimizes out switch()
>   * statement, and only leaves move/store instructions. Also the compiler can
>   * combine write statements if they are both assignments and can be reordered,
> @@ -148,12 +148,18 @@ static inline void __mm_zero_struct_page(struct page *page)
>  {
>  	unsigned long *_pp = (void *)page;
>
> -	/* Check that struct page is either 56, 64, 72, or 80 bytes */
> +	/* Check that struct page is either 56, 64, 72, 80, 88 or 96 bytes */
>  	BUILD_BUG_ON(sizeof(struct page) & 7);
>  	BUILD_BUG_ON(sizeof(struct page) < 56);
> -	BUILD_BUG_ON(sizeof(struct page) > 80);
> +	BUILD_BUG_ON(sizeof(struct page) > 96);
>
>  	switch (sizeof(struct page)) {
> +	case 96:
> +		_pp[11] = 0;
> +		fallthrough;
> +	case 88:
> +		_pp[10] = 0;
> +		fallthrough;
>  	case 80:
>  		_pp[9] = 0;
>  		fallthrough;

Otherwise the code itself looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
On Mon, Jan 30, 2023 at 02:38:22PM +0100, Michal Hocko wrote:
> On Mon 30-01-23 14:07:26, Arnd Bergmann wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> >
> > After x86 has enabled support for KMSAN, it has become possible
> > to have larger 'struct page' than was expected when commit
> > 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> > architectures") was merged:
> >
> > include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
> >   switch (sizeof(struct page)) {
> >
> > Extend the maximum accordingly.
> >
> > Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")
> > Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> I haven't really followed KMSAN development but I would have expected
> that it would, like other debugging tools, add its metadata to page_ext
> rather than page directly.

Yes, that would have been preferable. Also, I don't understand why we
need an entire page to store whether each "bit" of a page is initialised.
There are no CPUs which have bit-granularity stores; either you initialise
an entire byte or not. So that metadata can shrink from 4096 bytes to 512.
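For reference, the arithmetic behind the 4096-versus-512 figure, spelled out as a minimal sketch (assumes 4 KiB pages; the constants are illustrative and not taken from the KMSAN sources):

  /* Shadow-size arithmetic sketch, not actual KMSAN code. */
  #define DATA_PAGE_SIZE	4096	/* bytes per data page */

  /* One shadow bit per data bit: the shadow is as large as the page. */
  #define SHADOW_BIT_PER_BIT	DATA_PAGE_SIZE		/* 4096 bytes */

  /* One shadow bit per data byte: the shadow shrinks by a factor of 8. */
  #define SHADOW_BIT_PER_BYTE	(DATA_PAGE_SIZE / 8)	/* 512 bytes */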
On Mon, Jan 30, 2023 at 8:07 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> After x86 has enabled support for KMSAN, it has become possible
> to have larger 'struct page' than was expected when commit
> 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> architectures") was merged:
>
> include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
>   switch (sizeof(struct page)) {
>
> Extend the maximum accordingly.
>
> Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")

No need to add the above "Fixes:". The above patch works as expected
everywhere the struct page is 80 bytes or smaller (as specified by the
comment).

I also agree with others that the KMSAN metadata should be part of
page_ext instead of increasing the complexity of "struct page".

Otherwise,
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>

Thanks,
Pasha
On Mon, Jan 30, 2023 at 2:38 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 30-01-23 14:07:26, Arnd Bergmann wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> >
> > After x86 has enabled support for KMSAN, it has become possible
> > to have larger 'struct page' than was expected when commit
> > 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> > architectures") was merged:
> >
> > include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
> >   switch (sizeof(struct page)) {
> >
> > Extend the maximum accordingly.
> >
> > Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")
> > Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> I haven't really followed KMSAN development but I would have expected
> that it would, like other debugging tools, add its metadata to page_ext
> rather than page directly.

Thanks for the comment!
I was considering page_ext at some point, but managed to convince
myself it didn't suit the purpose well enough.

Right now KMSAN allocates its metadata at boot time, when tearing down
memblock. At that point only a handful of memory ranges exist, and it is
pretty easy to carve out some unused pages for the metadata for those
ranges, then divide the rest evenly and return 1/3 to the system,
spending 2/3 to keep the metadata for the returned pages.
I tried allocating the memory lazily (at page_alloc(), for example),
and it turned out to be very tricky because of fragmentation: for an
allocation of a given order, one needs shadow and origin allocations
of the same order [1], and alloc_pages() simply started with ripping
apart the biggest chunk of memory available.

IIRC if we choose to allocate metadata via page_ext, the memory will
be already too fragmented to easily handle it, because it will only
happen once alloc_pages() is available.
We also can't get rid of the shadow/origin pointers in struct page_ext
(storing two 4K-sized arrays in that struct would defeat all the
possible alignments), so we won't save any memory by switching to
page_ext.

[1] - I can go into more details, but the TLDR is that contiguous
pages within the same allocations better have contiguous shadow/origin
pages, otherwise unaligned accesses will corrupt other pages.
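To make the 1/3 vs. 2/3 split concrete: every data page handed back to the buddy allocator needs one shadow page and one origin page of its own, so a contiguous boot-time range can only return roughly a third of its pages. A minimal sketch of that arithmetic (the helper name is hypothetical; this is not KMSAN's boot code):

  /* Illustrative only: a made-up helper, not KMSAN's implementation. */
  static void kmsan_split_range_sketch(unsigned long nr_pages,
  				     unsigned long *nr_usable,
  				     unsigned long *nr_shadow,
  				     unsigned long *nr_origin)
  {
  	/* Each returned page is paired with one shadow and one origin page. */
  	*nr_usable = nr_pages / 3;
  	*nr_shadow = *nr_usable;
  	*nr_origin = *nr_usable;
  	/* Any remainder (at most two pages) simply stays reserved. */
  }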
> > I haven't really followed KMSAN development but I would have expected
> > that it would, like other debugging tools, add its metadata to page_ext
> > rather than page directly.
>
> Yes, that would have been preferable. Also, I don't understand why we
> need an entire page to store whether each "bit" of a page is initialised.
> There are no CPUs which have bit-granularity stores; either you initialise
> an entire byte or not. So that metadata can shrink from 4096 bytes to 512.

It's not about bit-granularity stores, it's about bits being
uninitialized or not. Consider the following struct:

  struct foo {
    char a:4;
    char b:4;
  } f;

If the user initializes f.a and then tries to use f.b, this is still
undefined behavior that KMSAN is able to catch thanks to bit-to-bit
shadow, but it would not have been able to detect it if we only stored
one bit per byte.
Another example is bit flags or bit masks, where you can set a single
bit in an int32, but that wouldn't necessarily mean the rest of that
variable is initialized.

It's worth mentioning that even if we choose to shrink the shadow from
4096 to 512 bytes, there'd still be four-byte origin IDs, which are
allocated for every four bytes of program memory. So a whole page of
origins will still be required in addition to those 512 bytes of shadow.
(Origins are handy when debugging KMSAN reports, because a single uninit
value can be copied or modified multiple times before it is used in a
branch or passed to the userspace. Shrinking origins further would render
them useless for e.g. 32-bit local variables, which is a quite common
use case).
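The bitfield scenario above can be reproduced in userspace with MemorySanitizer, the userspace counterpart of KMSAN, which also keeps bit-precise shadow. A small self-contained example (assumes Clang; the file name is only illustrative):

  /* Build and run with: clang -fsanitize=memory -g bitfield.c && ./a.out */
  #include <stdio.h>

  struct foo {
  	char a:4;
  	char b:4;
  };

  int main(void)
  {
  	struct foo f;

  	f.a = 1;	/* only the low nibble is initialized */
  	if (f.b)	/* MSan reports a use-of-uninitialized-value here */
  		printf("b is set\n");
  	return 0;
  }

With one shadow bit per byte, the partial write to f.a would mark the whole byte as initialized and the report above would be lost, which is exactly the point being made.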
On Mon 30-01-23 18:59:45, Alexander Potapenko wrote:
> On Mon, Jan 30, 2023 at 2:38 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 30-01-23 14:07:26, Arnd Bergmann wrote:
> > > From: Arnd Bergmann <arnd@arndb.de>
> > >
> > > After x86 has enabled support for KMSAN, it has become possible
> > > to have larger 'struct page' than was expected when commit
> > > 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> > > architectures") was merged:
> > >
> > > include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
> > >   switch (sizeof(struct page)) {
> > >
> > > Extend the maximum accordingly.
> > >
> > > Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")
> > > Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86")
> > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > I haven't really followed KMSAN development but I would have expected
> > that it would, like other debugging tools, add its metadata to page_ext
> > rather than page directly.
>
> Thanks for the comment!
> I was considering page_ext at some point, but managed to convince
> myself it didn't suit the purpose well enough.
>
> Right now KMSAN allocates its metadata at boot time, when tearing down
> memblock. At that point only a handful of memory ranges exist, and it is
> pretty easy to carve out some unused pages for the metadata for those
> ranges, then divide the rest evenly and return 1/3 to the system,
> spending 2/3 to keep the metadata for the returned pages.
> I tried allocating the memory lazily (at page_alloc(), for example),
> and it turned out to be very tricky because of fragmentation: for an
> allocation of a given order, one needs shadow and origin allocations
> of the same order [1], and alloc_pages() simply started with ripping
> apart the biggest chunk of memory available.

page_ext allocation happens quite early as well. There shouldn't be any
real fragmentation that early during the boot.

> IIRC if we choose to allocate metadata via page_ext, the memory will
> be already too fragmented to easily handle it, because it will only
> happen once alloc_pages() is available.
> We also can't get rid of the shadow/origin pointers in struct page_ext
> (storing two 4K-sized arrays in that struct would defeat all the
> possible alignments), so we won't save any memory by switching to
> page_ext.

With page_ext you would be able to compile the feature in disabled by
default and enable it at boot time.

> [1] - I can go into more details, but the TLDR is that contiguous
> pages within the same allocations better have contiguous shadow/origin
> pages, otherwise unaligned accesses will corrupt other pages.
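For context on the suggestion: page_ext clients such as page_owner attach their per-page data by registering a struct page_ext_operations, which reserves extra space next to each page's page_ext entry and lets the feature opt out at boot via the .need() hook. A rough sketch of how KMSAN's per-page pointers might be attached that way (hypothetical code loosely modelled on page_owner, not an actual patch):

  #include <linux/page_ext.h>

  /* Hypothetical per-page blob; this is not an existing KMSAN structure. */
  struct kmsan_page_ext {
  	struct page *shadow;
  	struct page *origin;
  };

  static bool need_kmsan_ext(void)
  {
  	/* Could consult a boot parameter instead of being unconditional. */
  	return IS_ENABLED(CONFIG_KMSAN);
  }

  struct page_ext_operations kmsan_page_ext_ops = {
  	.size = sizeof(struct kmsan_page_ext),
  	.need = need_kmsan_ext,
  };

  /* page_ext core fills in .offset when it lays out the per-page data. */
  static struct kmsan_page_ext *kmsan_from_page_ext(struct page_ext *page_ext)
  {
  	return (void *)page_ext + kmsan_page_ext_ops.offset;
  }

Wiring this up would also require adding the ops to the page_ext core's client list, which is part of why it stays a sketch here.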
On Tue, Jan 31, 2023 at 4:14 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 30-01-23 18:59:45, Alexander Potapenko wrote:
> > On Mon, Jan 30, 2023 at 2:38 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 30-01-23 14:07:26, Arnd Bergmann wrote:
> > > > From: Arnd Bergmann <arnd@arndb.de>
> > > >
> > > > After x86 has enabled support for KMSAN, it has become possible
> > > > to have larger 'struct page' than was expected when commit
> > > > 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b
> > > > architectures") was merged:
> > > >
> > > > include/linux/mm.h:156:10: warning: no case matching constant switch condition '96'
> > > >   switch (sizeof(struct page)) {
> > > >
> > > > Extend the maximum accordingly.
> > > >
> > > > Fixes: 5470dea49f53 ("mm: use mm_zero_struct_page from SPARC on all 64b architectures")
> > > > Fixes: 4ca8cc8d1bbe ("x86: kmsan: enable KMSAN builds for x86")
> > > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > >
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > >
> > > I haven't really followed KMSAN development but I would have expected
> > > that it would, like other debugging tools, add its metadata to page_ext
> > > rather than page directly.
> >
> > Thanks for the comment!
> > I was considering page_ext at some point, but managed to convince
> > myself it didn't suit the purpose well enough.
> >
> > Right now KMSAN allocates its metadata at boot time, when tearing down
> > memblock. At that point only a handful of memory ranges exist, and it is
> > pretty easy to carve out some unused pages for the metadata for those
> > ranges, then divide the rest evenly and return 1/3 to the system,
> > spending 2/3 to keep the metadata for the returned pages.
> > I tried allocating the memory lazily (at page_alloc(), for example),
> > and it turned out to be very tricky because of fragmentation: for an
> > allocation of a given order, one needs shadow and origin allocations
> > of the same order [1], and alloc_pages() simply started with ripping
> > apart the biggest chunk of memory available.
>
> page_ext allocation happens quite early as well. There shouldn't be any
> real fragmentation that early during the boot.
>
> > IIRC if we choose to allocate metadata via page_ext, the memory will
> > be already too fragmented to easily handle it, because it will only
> > happen once alloc_pages() is available.
> > We also can't get rid of the shadow/origin pointers in struct page_ext
> > (storing two 4K-sized arrays in that struct would defeat all the
> > possible alignments), so we won't save any memory by switching to
> > page_ext.
>
> With page_ext you would be able to compile the feature in disabled by
> default and enable it at boot time.

This makes little sense to do, because KMSAN requires heavy compile-time
instrumentation to work. One cannot simply enable/disable it at boot
time anyway.
> > Right now KMSAN allocates its metadata at boot time, when tearing down
> > memblock. At that point only a handful of memory ranges exist, and it is
> > pretty easy to carve out some unused pages for the metadata for those
> > ranges, then divide the rest evenly and return 1/3 to the system,
> > spending 2/3 to keep the metadata for the returned pages.
> > I tried allocating the memory lazily (at page_alloc(), for example),
> > and it turned out to be very tricky because of fragmentation: for an
> > allocation of a given order, one needs shadow and origin allocations
> > of the same order [1], and alloc_pages() simply started with ripping
> > apart the biggest chunk of memory available.
>
> page_ext allocation happens quite early as well. There shouldn't be any
> real fragmentation that early during the boot.

Assuming we are talking about the early_page_ext_enabled() case, here
are the init functions that are executed between kmsan_init_shadow()
and page_ext_init():

  stack_depot_early_init();
  mem_init();
  mem_init_print_info();
  kmem_cache_init();
  /*
   * page_owner must be initialized after buddy is ready, and also after
   * slab is ready so that stack_depot_init() works properly
   */
  page_ext_init_flatmem_late();
  kmemleak_init();
  pgtable_init();
  debug_objects_mem_init();
  vmalloc_init();

There's yet another problem besides fragmentation: we need to allocate
shadow for every page that was allocated by these functions. Right now
this is done by kmsan_init_shadow, which walks all the existing memblock
ranges, plus the _data segment and the node data for each node, and
grabs memory from the buddy allocator.
If we delay the metadata allocation to the point where memory caches
exist, we'll have to somehow walk every allocated struct page and
allocate the metadata for each of those. Is there an easy way to do so?

I am unsure if vmalloc_init() creates any virtual mappings (probably
not?), but if it does, we'd also need to call
kmsan_vmap_pages_range_noflush() for them once we set up the metadata.
With the current metadata allocation scheme it's not needed, because the
buddy allocator is torn down before the virtual mappings are created.

In the ideal world, we'd better place KMSAN shadow/origin pages at fixed
addresses, like this is done for KASAN - that would not require storing
pointers in struct page. But reserving big chunks of the address space
is even harder than what's currently being done.
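The KASAN scheme referred to at the end avoids per-page pointers entirely: any kernel address is translated to its shadow byte with fixed arithmetic. A condensed version of that translation (generic KASAN, one shadow byte per eight bytes of memory; the offset constant is architecture- and config-specific and shown here only as a placeholder):

  /*
   * Condensed sketch of KASAN's address-to-shadow mapping; the offset value
   * below is a placeholder, the real one comes from the kernel config.
   */
  #define KASAN_SHADOW_SCALE_SHIFT	3
  #define KASAN_SHADOW_OFFSET		0xdffffc0000000000UL	/* placeholder */

  static inline void *kasan_mem_to_shadow_sketch(const void *addr)
  {
  	return (void *)(((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
  			+ KASAN_SHADOW_OFFSET);
  }

Doing the same for KMSAN would mean reserving two such regions (shadow and origin) covering all of kernel memory up front, which is the address-space cost alluded to above.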
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b73ba2e5cfd2..aa39d5ddace1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -137,7 +137,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
  * define their own version of this macro in <asm/pgtable.h>
  */
 #if BITS_PER_LONG == 64
-/* This function must be updated when the size of struct page grows above 80
+/* This function must be updated when the size of struct page grows above 96
  * or reduces below 56. The idea that compiler optimizes out switch()
  * statement, and only leaves move/store instructions. Also the compiler can
  * combine write statements if they are both assignments and can be reordered,
@@ -148,12 +148,18 @@ static inline void __mm_zero_struct_page(struct page *page)
 {
 	unsigned long *_pp = (void *)page;
 
-	/* Check that struct page is either 56, 64, 72, or 80 bytes */
+	/* Check that struct page is either 56, 64, 72, 80, 88 or 96 bytes */
 	BUILD_BUG_ON(sizeof(struct page) & 7);
 	BUILD_BUG_ON(sizeof(struct page) < 56);
-	BUILD_BUG_ON(sizeof(struct page) > 80);
+	BUILD_BUG_ON(sizeof(struct page) > 96);
 
 	switch (sizeof(struct page)) {
+	case 96:
+		_pp[11] = 0;
+		fallthrough;
+	case 88:
+		_pp[10] = 0;
+		fallthrough;
 	case 80:
 		_pp[9] = 0;
 		fallthrough;
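For experimentation outside the kernel tree, here is a minimal userspace rendition of the same zeroing idiom, extended to 96 bytes as in the patch. The struct, names and test are invented for illustration; the point, as the in-kernel comment explains, is that the switch on a constant sizeof collapses at compile time into straight stores the compiler can combine.

  /* Illustrative userspace sketch; assumes 8-byte unsigned long (64-bit). */
  #include <assert.h>
  #include <string.h>

  struct fake_page { unsigned long w[12]; };	/* 96 bytes on 64-bit */

  static inline void zero_fake_page(struct fake_page *page)
  {
  	unsigned long *_pp = (void *)page;

  	/* The switch is resolved at compile time; only the stores remain. */
  	switch (sizeof(struct fake_page)) {
  	case 96: _pp[11] = 0;	/* fall through */
  	case 88: _pp[10] = 0;	/* fall through */
  	case 80: _pp[9] = 0;	/* fall through */
  	case 72: _pp[8] = 0;	/* fall through */
  	case 64: _pp[7] = 0;	/* fall through */
  	case 56:
  		_pp[6] = 0; _pp[5] = 0; _pp[4] = 0; _pp[3] = 0;
  		_pp[2] = 0; _pp[1] = 0; _pp[0] = 0;
  	}
  }

  int main(void)
  {
  	struct fake_page p;

  	memset(&p, 0xff, sizeof(p));
  	zero_fake_page(&p);
  	for (unsigned int i = 0; i < 12; i++)
  		assert(p.w[i] == 0);
  	return 0;
  }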