[v1,10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

Message ID 20220308141437.144919-11-david@redhat.com (mailing list archive)
State New
Series mm: COW fixes part 2: reliable GUP pins of anonymous pages

Commit Message

David Hildenbrand March 8, 2022, 2:14 p.m. UTC
The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared?

In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
don't grow on trees, so we have to get a little creative for the time
being.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages, so we're left
with reusing PG_slab for our purpose: for PageAnon() pages, PG_slab now
translates to PG_anon_exclusive; teach the in-kernel code that manually
handles PG_slab about that.
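
For reference, the core of the trick from the diff at the end of this
page boils down to (comment added here for clarity):

	PG_anon_exclusive = PG_slab,

static __always_inline int PageSlab(struct page *page)
{
	/* PG_slab only means "slab" when the page is not anonymous. */
	return !PageAnon(page) &&
		test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
}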

Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
documentation will be contained in the code that actually makes use of
PG_anon_exclusive.

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; page freeing code will handle that when
it also invalidates page->mapping so that the page no longer reads as
PageAnon().
Letting information about exclusivity stick around will be an important
property when adding sanity checks to unpinning code.

RFC notes: in-tree tools/cgroup/memcg_slabinfo.py looks like it might need
	   some care. We'd have to look up the head page and check if
	   PageAnon() is set. Similarly, tools living outside the kernel
	   repository, like crash and makedumpfile, might need adaptations.

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/page.c                 |   3 +-
 include/linux/page-flags.h     | 124 ++++++++++++++++++++++++++++++++-
 include/trace/events/mmflags.h |   2 +-
 mm/hugetlb.c                   |   4 ++
 mm/memory-failure.c            |  24 +++++--
 mm/memremap.c                  |  11 +++
 mm/page_alloc.c                |  13 ++++
 7 files changed, 170 insertions(+), 11 deletions(-)

Comments

Matthew Wilcox March 9, 2022, 3:47 p.m. UTC | #1
On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared?

Is this supposed to be for PAGE_SIZE pages as well, or is it only used
on pages > PAGE_SIZE?

> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
> don't grow on trees, so we have to get a little creative for the time
> being.

This feels a little _too_ creative to me.  There's now an implicit
requirement that SL[AOU]B doesn't use the bottom two bits of
->slab_cache, which is probably OK but would need to be documented.
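
For readers less familiar with struct page internals: the constraint
exists because ->slab_cache shares storage with ->mapping in the big
struct page union, and PageAnon() simply tests the low bit of whatever
word sits there. A simplified sketch (v5.17-era layout; treat the exact
field offsets as an assumption):

struct page {
	unsigned long flags;
	union {
		struct {	/* Page cache and anonymous pages */
			struct list_head lru;
			struct address_space *mapping;	/* low bits: PAGE_MAPPING_* */
			...
		};
		struct {	/* slab, slob and slub */
			union {
				struct list_head slab_list;
				...
			};
			struct kmem_cache *slab_cache;	/* aliases ->mapping */
			...
		};
	};
	...
};

/* PageAnon() boils down to: */
((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;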

I have plans to get rid of PageError and PagePrivate, but those are going
to be too late for you.  I don't think mappedtodisk has meaning for anon
pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned
to shift to another bit ... but almost any bit will do for has_hwpoisoned.
Or have I overlooked something?

> @@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
>  
>  __PAGEFLAG(Isolated, isolated, PF_ANY);
>  
> +static __always_inline bool folio_test_slab(struct folio *folio)
> +{
> +	return !folio_test_anon(folio) &&
> +	       test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
> +}
> +
> +static __always_inline int PageSlab(struct page *page)
> +{
> +	return !PageAnon(page) &&
> +		test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
> +}

In case we do end up using this, this would be better implemented as

static __always_inline int PageSlab(struct page *page)
{
	return folio_test_slab(page_folio(page));
}

since PageAnon already has a page_folio() call embedded in it.

> +static __always_inline void __SetPageSlab(struct page *page)
> +{
> +	VM_BUG_ON_PGFLAGS(PageAnon(page), page);
> +	__set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
> +}

There's only one caller of __SetPageSlab() left, in kfence.  And that
code looks ... weird.

        for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
                if (!i || (i % 2))
                        continue;

                /* Verify we do not have a compound head page. */
                if (WARN_ON(compound_head(&pages[i]) != &pages[i]))
                        goto err;

                __SetPageSlab(&pages[i]);

I think the author probably intended WARN_ON(PageCompound(page)) because
they're actually verifying that it's not a tail page, rather than head
page.
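
i.e., presumably something like:

	/* Verify we do not have a compound (head or tail) page. */
	if (WARN_ON(PageCompound(&pages[i])))
		goto err;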

> +static __always_inline void __ClearPageSlab(struct page *page)
> +{
> +	VM_BUG_ON_PGFLAGS(PageAnon(page), page);
> +	__clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
> +}

There are no remaining callers of __ClearPageSlab().  yay.
David Hildenbrand March 9, 2022, 4:57 p.m. UTC | #2
Hi,

On 09.03.22 16:47, Matthew Wilcox wrote:
> On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
>> The basic question we would like to have a reliable and efficient answer
>> to is: is this anonymous page exclusive to a single process or might it
>> be shared?
> 
> Is this supposed to be for PAGE_SIZE pages as well, or is it only used
> on pages > PAGE_SIZE?

As documented, simple/ordinary PAGE_SIZE pages as well (unfortunately,
otherwise we'd have more space :) ).

> 
>> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
>> don't grow on trees, so we have to get a little creative for the time
>> being.
> 
> This feels a little _too_ creative to me.  There's now an implicit

It's making the semantics of PG_slab depend on another bit in the head
page. I agree, it's not perfect, but it's not *too* crazy. As raised in
the cover letter, not proud of this, but I didn't really find an
alternative for the time being.

> requirement that SL[AOU]B doesn't use the bottom two bits of

I think only the last bit (0x1)

> ->slab_cache, which is probably OK but would need to be documented.

We'd already have false positive PageAnon() results if that wasn't the
case. At least stable_page_flags() would already indicate something
wrong, I think (KPF_ANON). We'd need !PageSlab() checks in a couple of
places already if I'm not wrong.

I had a comment in there, but after the PageSlab cleanups came in, I
dropped it because my assumption was that actually nobody is really
allowed to use the lowest mapping bit for something else. We can
document that, of course.

So at least in that regard, I think this is fine.

> 
> I have plans to get rid of PageError and PagePrivate, but those are going
> to be too late for you.  I don't think mappedtodisk has meaning for anon
> pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned

Are you sure it's not used if the page is in the swapcache? I have no
detailed knowledge of how file-backed swap targets are handled in that
regard. So fs experience would be highly appreciated :)

> to shift to another bit ... but almost any bit will do for has_hwpoisoned.
> Or have I overlooked something?

Good question, I assume we could use another bit that's not already
defined/checked on subpages of a compound page.


Overloading PG_reserved would be an alternative; however, that flag can
also indicate that the remainder of the memmap might be mostly garbage,
so it's not that good of a fit IMHO.

> 
>> @@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
>>  
>>  __PAGEFLAG(Isolated, isolated, PF_ANY);
>>  
>> +static __always_inline bool folio_test_slab(struct folio *folio)
>> +{
>> +	return !folio_test_anon(folio) &&
>> +	       test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
>> +}
>> +
>> +static __always_inline int PageSlab(struct page *page)
>> +{
>> +	return !PageAnon(page) &&
>> +		test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
>> +}
> 
> In case we do end up using this, this would be better implemented as
> 
> static __always_inline int PageSlab(struct page *page)
> {
> 	return folio_test_slab(page_folio(page));
> }
> 
> since PageAnon already has a page_folio() call embedded in it.

Agreed, I mainly copied the stubs and extended them.

> 
>> +static __always_inline void __SetPageSlab(struct page *page)
>> +{
>> +	VM_BUG_ON_PGFLAGS(PageAnon(page), page);
>> +	__set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
>> +}
> 
> There's only one caller of __SetPageSlab() left, in kfence.  And that
> code looks ... weird.
> 
>         for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
>                 if (!i || (i % 2))
>                         continue;
> 
>                 /* Verify we do not have a compound head page. */
>                 if (WARN_ON(compound_head(&pages[i]) != &pages[i]))
>                         goto err;
> 
>                 __SetPageSlab(&pages[i]);
> 
> I think the author probably intended WARN_ON(PageCompound(page)) because
> they're actually verifying that it's not a tail page, rather than head
> page.

It's certainly a head-scratcher.

> 
>> +static __always_inline void __ClearPageSlab(struct page *page)
>> +{
>> +	VM_BUG_ON_PGFLAGS(PageAnon(page), page);
>> +	__clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
>> +}
> 
> There are no remaining callers of __ClearPageSlab().  yay.
> 

Indeed, nice.

Thanks!
Matthew Wilcox March 9, 2022, 5:40 p.m. UTC | #3
On Wed, Mar 09, 2022 at 05:57:51PM +0100, David Hildenbrand wrote:
> Hi,
> 
> On 09.03.22 16:47, Matthew Wilcox wrote:
> > On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
> >> The basic question we would like to have a reliable and efficient answer
> >> to is: is this anonymous page exclusive to a single process or might it
> >> be shared?
> > 
> > Is this supposed to be for PAGE_SIZE pages as well, or is it only used
> > on pages > PAGE_SIZE?
> 
> As documented, simple/ordinary PAGE_SIZE pages as well (unfortunately,
> otherwise we'd have more space :) ).

OK.  Documentation doesn't always capture exactly what the author was
thinking, so I appreciate the clarification ;-)

> > This feels a little _too_ creative to me.  There's now an implicit
> 
> It's making the semantics of PG_slab depend on another bit in the head
> page. I agree, it's not perfect, but it's not *too* crazy. As raised in
> the cover letter, not proud of this, but I didn't really find an
> alternative for the time being.
> 
> > requirement that SL[AOU]B doesn't use the bottom two bits of
> 
> I think only the last bit (0x1)

Yeah, OK, they can use three of the four possible combinations of the
bottom two bits ;-)  Still, it's yet more constraints on use of struct
page, which aren't obvious (and are even upside down for the casual
observer).
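
For reference, the encoding of those low ->mapping bits (from
include/linux/page-flags.h):

#define PAGE_MAPPING_ANON	0x1
#define PAGE_MAPPING_MOVABLE	0x2
#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

PageAnon() tests PAGE_MAPPING_ANON, so any ->slab_cache value with bit 0
set would make a slab page read as anonymous.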

> > I have plans to get rid of PageError and PagePrivate, but those are going
> > to be too late for you.  I don't think mappedtodisk has meaning for anon
> > pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned
> 
> Are you sure it's not used if the page is in the swapcache? I have no
> detailed knowledge of how file-backed swap targets are handled in that
> regard. So fs experience would be highly appreciated :)

I have two arguments here.  One is based on grep:

$ git grep mappedtodisk
fs/proc/page.c: u |= kpf_copy_bit(k, KPF_MAPPEDTODISK,  PG_mappedtodisk);
include/linux/page-flags.h:     PG_mappedtodisk,        /* Has blocks allocated on-disk */
include/linux/page-flags.h:     PG_has_hwpoisoned = PG_mappedtodisk,
include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
include/trace/events/mmflags.h: {1UL << PG_mappedtodisk,        "mappedtodisk"  },              \
include/trace/events/pagemap.h: (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \
mm/migrate.c:   if (folio_test_mappedtodisk(folio))
mm/migrate.c:           folio_set_mappedtodisk(newfolio);
mm/truncate.c:  folio_clear_mappedtodisk(folio);
tools/vm/page-types.c:  [KPF_MAPPEDTODISK]      = "d:mappedtodisk",

$ git grep MappedToDisk
fs/buffer.c:            SetPageMappedToDisk(page);
fs/buffer.c:    if (PageMappedToDisk(page))
fs/buffer.c:            SetPageMappedToDisk(page);
fs/ext4/readpage.c:                     SetPageMappedToDisk(page);
fs/f2fs/data.c:         SetPageMappedToDisk(page);
fs/f2fs/file.c: if (PageMappedToDisk(page))
fs/fuse/dev.c:  ClearPageMappedToDisk(newpage);
fs/mpage.c:             SetPageMappedToDisk(page);
fs/nilfs2/file.c:       if (PageMappedToDisk(page))
fs/nilfs2/file.c:                       SetPageMappedToDisk(page);
fs/nilfs2/page.c:       ClearPageMappedToDisk(page);
fs/nilfs2/page.c:               SetPageMappedToDisk(dpage);
fs/nilfs2/page.c:               ClearPageMappedToDisk(dpage);
fs/nilfs2/page.c:       if (PageMappedToDisk(src) && !PageMappedToDisk(dst))
fs/nilfs2/page.c:               SetPageMappedToDisk(dst);
fs/nilfs2/page.c:       else if (!PageMappedToDisk(src) && PageMappedToDisk(dst))
fs/nilfs2/page.c:               ClearPageMappedToDisk(dst);
fs/nilfs2/page.c:       ClearPageMappedToDisk(page);
include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)

so you can see it's _rarely_ used, and only by specific filesystems.

The swap code actually bypasses the filesystem (except for network
filesystems) and submits BIOs directly (see __swap_writepage() and
swap_readpage()).  There's no check for MappedToDisk, and nowhere to
set it/clear it.

The second argument is that MappedToDisk is used for delayed allocation
on writes for filesystems.  But swap is required to be allocated at
swapfile setup (so that the swap code can bypass the filesystem ...)
and so this flag is useless.

Of course, I may have missed something.  This is complex.

> > to shift to another bit ... but almost any bit will do for has_hwpoisoned.
> > Or have I overlooked something?
> 
> Good question, I assume we could use another bit that's not already
> defined/check on subpages of a compound page.
> 
> 
> Overloading PG_reserved would be an alternative, however, that flag can
> also indicate that the remainder of the memmap might be mostly garbage,
> so it's not that good of a fit IMHO.

I wouldn't reuse PG_reserved for anything.  Although I do have a modest
proposal that I should finish writing up ... let me start on that.
David Hildenbrand March 9, 2022, 6 p.m. UTC | #4
>> It's making the semantics of PG_slab depend on another bit in the head
>> page. I agree, it's not perfect, but it's not *too* crazy. As raised in
>> the cover letter, not proud of this, but I didn't really find an
>> alternative for the time being.
>>
>>> requirement that SL[AOU]B doesn't use the bottom two bits of
>>
>> I think only the last bit (0x1)
> 
> Yeah, OK, they can use three of the four possible combinations of the
> bottom two bits ;-)  Still, it's yet more constraints on use of struct
> page, which aren't obvious (and are even upside down for the casual
> observer).

I don't disagree that such constraints are nasty.

Having that said, I'd really like to avoid overloading PG_slab
(especially, such that I don't have to mess with
scripts/crash/makedumpfile). So if we can reuse MappedToDisk, that would
be very nice.

> 
>>> I have plans to get rid of PageError and PagePrivate, but those are going
>>> to be too late for you.  I don't think mappedtodisk has meaning for anon
>>> pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned
>>
>> Are you sure it's not used if the page is in the swapcache? I have no
>> detailed knowledge of how file-backed swap targets are handled in that
>> regard. So fs experience would be highly appreciated :)
> 
> I have two arguments here.  One is based on grep:
> 
> $ git grep mappedtodisk
> fs/proc/page.c: u |= kpf_copy_bit(k, KPF_MAPPEDTODISK,  PG_mappedtodisk);
> include/linux/page-flags.h:     PG_mappedtodisk,        /* Has blocks allocated on-disk */
> include/linux/page-flags.h:     PG_has_hwpoisoned = PG_mappedtodisk,
> include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
> include/trace/events/mmflags.h: {1UL << PG_mappedtodisk,        "mappedtodisk"  },              \
> include/trace/events/pagemap.h: (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \
> mm/migrate.c:   if (folio_test_mappedtodisk(folio))
> mm/migrate.c:           folio_set_mappedtodisk(newfolio);
> mm/truncate.c:  folio_clear_mappedtodisk(folio);
> tools/vm/page-types.c:  [KPF_MAPPEDTODISK]      = "d:mappedtodisk",
> 
> $ git grep MappedToDisk
> fs/buffer.c:            SetPageMappedToDisk(page);
> fs/buffer.c:    if (PageMappedToDisk(page))
> fs/buffer.c:            SetPageMappedToDisk(page);
> fs/ext4/readpage.c:                     SetPageMappedToDisk(page);
> fs/f2fs/data.c:         SetPageMappedToDisk(page);
> fs/f2fs/file.c: if (PageMappedToDisk(page))
> fs/fuse/dev.c:  ClearPageMappedToDisk(newpage);
> fs/mpage.c:             SetPageMappedToDisk(page);
> fs/nilfs2/file.c:       if (PageMappedToDisk(page))
> fs/nilfs2/file.c:                       SetPageMappedToDisk(page);
> fs/nilfs2/page.c:       ClearPageMappedToDisk(page);
> fs/nilfs2/page.c:               SetPageMappedToDisk(dpage);
> fs/nilfs2/page.c:               ClearPageMappedToDisk(dpage);
> fs/nilfs2/page.c:       if (PageMappedToDisk(src) && !PageMappedToDisk(dst))
> fs/nilfs2/page.c:               SetPageMappedToDisk(dst);
> fs/nilfs2/page.c:       else if (!PageMappedToDisk(src) && PageMappedToDisk(dst))
> fs/nilfs2/page.c:               ClearPageMappedToDisk(dst);
> fs/nilfs2/page.c:       ClearPageMappedToDisk(page);
> include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
> 
> so you can see it's _rarely_ used, and only by specific filesystems.

Right, but I spot ext4 and core fs/buffer.c functionality in that list.
That naturally makes me nervous :)

> 
> The swap code actually bypasses the filesystem (except for network
> filesystems) and submits BIOs directly (see __swap_writepage() and
> swap_readpage()).  There's no check for MappedToDisk, and nowhere to
> set it/clear it.
> 
> The second argument is that MappedToDisk is used for delayed allocation
> on writes for filesystems.  But swap is required to be allocated at
> swapfile setup (so that the swap code can bypass the filesystem ...)
> and so this flag is useless.

I have some faint memory that there are corner cases, but maybe
(hopefully) my memory is wrong.

> 
> Of course, I may have missed something.  This is complex.
> 

Yeah, that's why I was very careful. If any FS ended up setting that
flag, we'd be in trouble and would have to blacklist that fs for swapping
or rework it (well, if we're even able to identify such a file system).

I think one sanity check I could add is making sure that
PageAnonExclusive() is never set when getting a page from the swapcache
in do_swap_page(). I think that would take care of the obvious bugs.
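
In code (see the test patch later in this thread), roughly:

	/* A swap pte must never point at a PG_anon_exclusive page. */
	BUG_ON(PageAnon(page) && PageAnonExclusive(page));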
David Hildenbrand March 11, 2022, 6:46 p.m. UTC | #5
On 08.03.22 15:14, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared?
> 
> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
> don't grow on trees, so we have to get a little creative for the time
> being.
> 
> Introduce a way to mark an anonymous page as exclusive, with the
> ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
> GUP pins lose consistency with the pages mapped into the page table,
> resulting in reported memory corruptions.
> 
> Most pageflags already have semantics for anonymous pages, so we're left
> with reusing PG_slab for our purpose: for PageAnon() pages, PG_slab now
> translates to PG_anon_exclusive; teach the in-kernel code that manually
> handles PG_slab about that.
> 
> Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
> documentation will be contained in the code that actually makes use of
> PG_anon_exclusive.
> 
> We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
> zapping) of page table entries; page freeing code will handle that when
> it also invalidates page->mapping so that the page no longer reads as
> PageAnon().
> Letting information about exclusivity stick around will be an important
> property when adding sanity checks to unpinning code.
> 
> RFC notes: in-tree tools/cgroup/memcg_slabinfo.py looks like it might need
> 	   some care. We'd have to look up the head page and check if
> 	   PageAnon() is set. Similarly, tools living outside the kernel
> 	   repository, like crash and makedumpfile, might need adaptations.
> 
> Cc: Roman Gushchin <guro@fb.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

I'm currently testing with the following. My tests with a swapfile on
all different kinds of weird filesystems (excluding networking fs,
though) have revealed no surprises so far:



From 13aeb929722e38d162d03baa75eadb8c2c84359d Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 20 Dec 2021 20:23:51 +0100
Subject: [PATCH] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for
 PageAnon() pages

The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared? We need that information for ordinary/single pages, hugetlb
pages, and possibly each subpage of a THP.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages, however,
PG_mappedtodisk should never apply to pages in the swapcache, so let's
reuse that flag.

As PG_has_hwpoisoned also uses that flag on the second tail page of a
compound page, convert it to PG_waiters instead, which is marked as
PF_ONLY_HEAD and definitely does not apply to any tail pages.
__split_huge_page() properly does a ClearPageHasHWPoisoned() before
continuing with the THP split.

Use custom page flag modification functions such that we can do
additional sanity checks. Add a spoiler on the semantics of
PG_anon_exclusive as documentation. More documentation will be contained
in the code that actually makes use of PG_anon_exclusive.

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; page freeing code will handle that when
it also invalidates page->mapping so that the page no longer reads as
PageAnon().
Letting information about exclusivity stick around will be an important
property when adding sanity checks to unpinning code.

Note that we properly clear the flag in free_pages_prepare() via
PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
so there is no need to manually clear the flag.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 84 +++++++++++++++++++++++++++++++++++++-
 mm/hugetlb.c               |  2 +
 mm/memory.c                | 11 +++++
 mm/memremap.c              |  9 ++++
 mm/swapfile.c              |  4 ++
 tools/vm/page-types.c      |  8 +++-
 6 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..ac68d28b5e1f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -142,6 +142,60 @@ enum pageflags {
 
 	PG_readahead = PG_reclaim,
 
+	/*
+	 * Depending on the way an anonymous folio can be mapped into a page
+	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+	 * THP), PG_anon_exclusive may be set only for the head page or for
+	 * subpages of an anonymous folio.
+	 *
+	 * PG_anon_exclusive is *usually* only expressive in combination with a
+	 * page table entry. Depending on the page table entry type it might
+	 * store the following information:
+	 *
+	 *	Is what's mapped via this page table entry exclusive to the
+	 *	single process and can be mapped writable without further
+	 *	checks? If not, it might be shared and we might have to COW.
+	 *
+	 * For now, we only expect PTE-mapped THPs to make use of
+	 * PG_anon_exclusive in subpages. For other anonymous compound
+	 * folios (i.e., hugetlb), only the head page is logically mapped and
+	 * holds this information.
+	 *
+	 * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
+	 * set on the head page. When replacing the PMD by a page table full
+	 * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
+	 * all tail pages accordingly. Note that converting from a PTE-mapping
+	 * to a PMD mapping using the same compound page is currently not
+	 * possible and consequently doesn't require care.
+	 *
+	 * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
+	 * it should only pin if the relevant PG_anon_bit is set. In that case,
+	 * the pin will be fully reliable and stay consistent with the pages
+	 * mapped into the page table, as the bit cannot get cleared (e.g., by
+	 * fork(), KSM) while the page is pinned. For anonymous pages that
+	 * are mapped R/W, PG_anon_exclusive can be assumed to always be set
+	 * because such pages cannot possibly be shared.
+	 *
+	 * The page table lock protecting the page table entry is the primary
+	 * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
+	 * not take the PT lock needs special care when trying to clear the
+	 * flag.
+	 *
+	 * Page table entry types and PG_anon_exclusive:
+	 * * Present: PG_anon_exclusive applies.
+	 * * Swap: the information is lost. PG_anon_exclusive was cleared.
+	 * * Migration: the entry holds this information instead.
+	 *		PG_anon_exclusive was cleared.
+	 * * Device private: PG_anon_exclusive applies.
+	 * * Device exclusive: PG_anon_exclusive applies.
+	 * * HW Poison: PG_anon_exclusive is stale and not changed.
+	 *
+	 * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
+	 * not allowed and the flag will stick around until the page is freed
+	 * and folio->mapping is cleared.
+	 */
+	PG_anon_exclusive = PG_mappedtodisk,
+
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -176,7 +230,7 @@ enum pageflags {
 	 * Indicates that at least one subpage is hwpoisoned in the
 	 * THP.
 	 */
-	PG_has_hwpoisoned = PG_mappedtodisk,
+	PG_has_hwpoisoned = PG_waiters,
 #endif
 
 	/* non-lru isolated movable page */
@@ -920,6 +974,34 @@ extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static __always_inline int PageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1UL << PG_mlocked)
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fb990d95dab..1ff0b9e1e28e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1669,6 +1669,8 @@ void free_huge_page(struct page *page)
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
 
 	hugetlb_set_page_subpool(page, NULL);
+	if (PageAnon(page))
+		__ClearPageAnonExclusive(page);
 	page->mapping = NULL;
 	restore_reserve = HPageRestoreReserve(page);
 	ClearHPageRestoreReserve(page);
diff --git a/mm/memory.c b/mm/memory.c
index 7b32f422798d..d01fab481134 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3671,6 +3671,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/*
+	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
+	 * must never point at an anonymous page in the swapcache that is
+	 * PG_anon_exclusive. Sanity check that this holds and especially, that
+	 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
+	 * check after taking the PT lock and making sure that nobody
+	 * concurrently faulted in this page and set PG_anon_exclusive.
+	 */
+	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * We're already holding a reference on the page but haven't mapped it
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..160ea92e4e17 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -478,6 +478,15 @@ void free_devmap_managed_page(struct page *page)
 
 	mem_cgroup_uncharge(page_folio(page));
 
+	/*
+	 * Note: we don't expect anonymous compound pages yet. Once supported
+	 * and we could PTE-map them similar to THP, we'd have to clear
+	 * PG_anon_exclusive on all tail pages.
+	 */
+	VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+	if (PageAnon(page))
+		__ClearPageAnonExclusive(page);
+
 	/*
 	 * When a device_private page is freed, the page->mapping field
 	 * may still contain a (stale) mapping value. For example, the
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7edc8e099b22..493acb967b7a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1796,6 +1796,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		goto out;
 	}
 
+	/* See do_swap_page() */
+	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index b1ed76d9a979..381dcc00cb62 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -80,9 +80,10 @@
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41
 
-/* [48-] take some arbitrary free slots for expanding overloaded flags
+/* [47-] take some arbitrary free slots for expanding overloaded flags
  * not part of kernel API
  */
+#define KPF_ANON_EXCLUSIVE	47
 #define KPF_READAHEAD		48
 #define KPF_SLOB_FREE		49
 #define KPF_SLUB_FROZEN		50
@@ -138,6 +139,7 @@ static const char * const page_flag_names[] = {
 	[KPF_SOFTDIRTY]		= "f:softdirty",
 	[KPF_ARCH_2]		= "H:arch_2",
 
+	[KPF_ANON_EXCLUSIVE]	= "d:anon_exclusive",
 	[KPF_READAHEAD]		= "I:readahead",
 	[KPF_SLOB_FREE]		= "P:slob_free",
 	[KPF_SLUB_FROZEN]	= "A:slub_frozen",
@@ -472,6 +474,10 @@ static int bit_mask_ok(uint64_t flags)
 
 static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
 {
+	/* Anonymous pages overload PG_mappedtodisk */
+	if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK)))
+		flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE);
+
 	/* SLOB/SLUB overload several page flags */
 	if (flags & BIT(SLAB)) {
 		if (flags & BIT(PRIVATE))
David Hildenbrand March 11, 2022, 7:36 p.m. UTC | #6
On 11.03.22 20:22, Linus Torvalds wrote:
> On Fri, Mar 11, 2022 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> -       PG_has_hwpoisoned = PG_mappedtodisk,
>> +       PG_has_hwpoisoned = PG_waiters,
> 
> That makes me too nervous for words. PG_waiters is very subtle.

Yes, but PG_has_hwpoisoned is located on the second subpage of a
compound page, not on the head page.

> 
> Not only is it magical in bit location ways (and the special
> clear_bit_unlock_is_negative_byte() macro that *literally* exists only
> for unlocking a page), it just ends up having fairly subtle semantics
> with intentionally racy clearing etc.
> 
> Mixing that up with any hwpoison bits - that aren't used by any normal
> mortals and thus get very little coverage - just sounds horribly
> horribly wrong.

I used PG_error before, but felt like using a bit that is never ever
valid to be set/cleared/checked on a subpage would be an even better fit:

Note the:

  PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)

whereby PF_ONLY_HEAD translates to:

"for compound page, callers only ever operate on the head page."


I can just switch to PG_error, but for the second subpage, PG_waiters
should be just fine (unless I am missing something important).
Matthew Wilcox March 11, 2022, 9:02 p.m. UTC | #7
On Fri, Mar 11, 2022 at 07:46:39PM +0100, David Hildenbrand wrote:
> I'm currently testing with the following. My tests with a swapfile on
> all different kinds of weird filesystems (excluding networking fs,
> though) have revealed no surprises so far:

I like this a lot better than reusing PG_swap.  Thanks!

I'm somewhat reluctant to introduce a new flag that can be set on tail
pages.  Do we lose much if it's always set only on the head page?

> +++ b/include/linux/page-flags.h
> @@ -142,6 +142,60 @@ enum pageflags {
>  
>  	PG_readahead = PG_reclaim,
>  
> +	/*
> +	 * Depending on the way an anonymous folio can be mapped into a page
> +	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
> +	 * THP), PG_anon_exclusive may be set only for the head page or for
> +	 * subpages of an anonymous folio.
> +	 *
> +	 * PG_anon_exclusive is *usually* only expressive in combination with a
> +	 * page table entry. Depending on the page table entry type it might
> +	 * store the following information:
> +	 *
> +	 *	Is what's mapped via this page table entry exclusive to the
> +	 *	single process and can be mapped writable without further
> +	 *	checks? If not, it might be shared and we might have to COW.
> +	 *
> +	 * For now, we only expect PTE-mapped THPs to make use of
> +	 * PG_anon_exclusive in subpages. For other anonymous compound
> +	 * folios (i.e., hugetlb), only the head page is logically mapped and
> +	 * holds this information.
> +	 *
> +	 * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
> +	 * set on the head page. When replacing the PMD by a page table full
> +	 * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
> +	 * all tail pages accordingly. Note that converting from a PTE-mapping
> +	 * to a PMD mapping using the same compound page is currently not
> +	 * possible and consequently doesn't require care.
> +	 *
> +	 * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
> +	 * it should only pin if the relevant PG_anon_bit is set. In that case,
> +	 * the pin will be fully reliable and stay consistent with the pages
> +	 * mapped into the page table, as the bit cannot get cleared (e.g., by
> +	 * fork(), KSM) while the page is pinned. For anonymous pages that
> +	 * are mapped R/W, PG_anon_exclusive can be assumed to always be set
> +	 * because such pages cannot possibly be shared.
> +	 *
> +	 * The page table lock protecting the page table entry is the primary
> +	 * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
> +	 * not take the PT lock needs special care when trying to clear the
> +	 * flag.
> +	 *
> +	 * Page table entry types and PG_anon_exclusive:
> +	 * * Present: PG_anon_exclusive applies.
> +	 * * Swap: the information is lost. PG_anon_exclusive was cleared.
> +	 * * Migration: the entry holds this information instead.
> +	 *		PG_anon_exclusive was cleared.
> +	 * * Device private: PG_anon_exclusive applies.
> +	 * * Device exclusive: PG_anon_exclusive applies.
> +	 * * HW Poison: PG_anon_exclusive is stale and not changed.
> +	 *
> +	 * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
> +	 * not allowed and the flag will stick around until the page is freed
> +	 * and folio->mapping is cleared.
> +	 */

... I also don't think this is the right place for this comment.  Not
sure where it should go.

> +static __always_inline void SetPageAnonExclusive(struct page *page)
> +{
> +	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);

hm.  seems to me like we should have a PageAnonNotKsm which just
does
	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
			PAGE_MAPPING_ANON;
because that's "a bit" inefficient.  OK, that's just a VM_BUG_ON,
but we have other users in real code:

mm/migrate.c:   if (PageAnon(page) && !PageKsm(page))
mm/page_idle.c: need_lock = !PageAnon(page) || PageKsm(page);
mm/rmap.c:      if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
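
A possible shape for such a helper, based on the snippet above (the name
PageAnonNotKsm is hypothetical here, not in-tree at the time of this
thread):

static __always_inline bool PageAnonNotKsm(struct page *page)
{
	unsigned long mapping = (unsigned long)page->mapping;

	/* anon but not KSM, with a single read of ->mapping */
	return (mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON;
}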
Matthew Wilcox March 11, 2022, 9:11 p.m. UTC | #8
On Fri, Mar 11, 2022 at 08:36:37PM +0100, David Hildenbrand wrote:
> I used PG_error before, but felt like using a bit that is never ever
> valid to be set/cleared/checked on a subpage would be an even better fit:
> 
> Note the:
> 
>   PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
> 
> whereby PF_ONLY_HEAD translates to:
> 
> "for compound page, callers only ever operate on the head page."
> 
> 
> I can just switch to PG_error, but for the second subpage, PG_waiters
> should be just fine (unless I am missing something important).

I think you're missing something important that almost everybody misses
when looking at this code (including me).

PF_ANY flags can be set on individual pages.
PF_HEAD means "we automatically redirect all operations to the head page".
PF_ONLY_HEAD means "If you try to call this on a tail page, we BUG".
PF_NO_TAIL means "If you try to read this flag on a tail page, we'll
look at the head page instead, but if you try to set/clear this flag
on a tail page, we BUG"
PF_NO_COMPOUND means "We BUG() if you call this on a compound page"

So really, you can reuse any flag as PF_SECOND that isn't PF_ANY.
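
For comparison, PG_has_hwpoisoned is already declared with the PF_SECOND
policy, which stores the flag in the first tail page; roughly (from
include/linux/page-flags.h):

PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
	TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)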

No, that's not what the documentation currently says.  It should be.
I had a patch to reword it at some point, but I guess it got lost.
The current documentation reads like "We replicate the flag currently
set on the head page to all tail pages", but that just isn't what
the code does.
David Hildenbrand March 12, 2022, 8:11 a.m. UTC | #9
On 11.03.22 22:11, Matthew Wilcox wrote:
> On Fri, Mar 11, 2022 at 08:36:37PM +0100, David Hildenbrand wrote:
>> I used PG_error before, but felt like using a bit that is never ever
>> valid to be set/cleared/checked on a subpage would be an even better fit:
>>
>> Note the:
>>
>>   PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
>>
>> whereby PF_ONLY_HEAD translates to:
>>
>> "for compound page, callers only ever operate on the head page."
>>
>>
>> I can just switch to PG_error, but for the second subpage, PG_waiters
>> should be just fine (unless I am missing something important).
> 
> I think you're missing something important that almost everybody misses
> when looking at this code (including me).
> 
> PF_ANY flags can be set on individual pages.
> PF_HEAD means "we automatically redirect all operations to the head page".
> PF_ONLY_HEAD means "If you try to call this on a tail page, we BUG".
> PF_NO_TAIL means "If you try to read this flag on a tail page, we'll
> look at the head page instead, but if you try to set/clear this flag
> on a tail page, we BUG"

^ that's actually the confusing part for me

"modifications of the page flag must be done on small or head pages,
checks can be done on tail pages too."

Just from reading that I thought checks on the tail page would *not* get
redirected to the head.

I'd have written that as an extended version of PF_HEAD:

"for compound page all operations related to the page flag are applied
to head page. Checks done on a tail page will be redirected to the head
page."

[...]

> 
> So really, you can reuse any flag as PF_SECOND that isn't PF_ANY.
> 

I'll go back to PG_error, thanks!
David Hildenbrand March 12, 2022, 8:26 a.m. UTC | #10
On 11.03.22 22:02, Matthew Wilcox wrote:
> On Fri, Mar 11, 2022 at 07:46:39PM +0100, David Hildenbrand wrote:
>> I'm currently testing with the following. My tests so far with a swapfile on
>> all different kinds of weird filesystems (excluding networking fs, though)
>> revealed no surprises so far:
> 
> I like this a lot better than reusing PG_swap.  Thanks!
> 
> I'm somewhat reluctant to introduce a new flag that can be set on tail
> pages.  Do we lose much if it's always set only on the head page?

After spending one month getting THP to work without PF_ANY, I can say
with confidence that the whole thing won't fly with THP unless we track
it at the minimum-mapping granularity. For a PTE-mapped THP, that's the
subpage level.

The next patch in the series documents some details. Intuitively, if we
could replace the pageflag by a PTE/PMD bit, we'd get roughly the same
result.
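
Concretely, the PMD-to-PTE remap path has to copy the bit out to every
subpage. A sketch of how the series handles that in
__split_huge_pmd_locked() (simplified, not the exact diff):

	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
	...
	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		...
		/* each PTE now tracks exclusivity for its own subpage */
		if (anon_exclusive)
			SetPageAnonExclusive(page + i);
		...
	}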

> 
>> +++ b/include/linux/page-flags.h
>> @@ -142,6 +142,60 @@ enum pageflags {
>>  
>>  	PG_readahead = PG_reclaim,
>>  
>> +	/*
>> +	 * Depending on the way an anonymous folio can be mapped into a page
>> +	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>> +	 * THP), PG_anon_exclusive may be set only for the head page or for
>> +	 * subpages of an anonymous folio.
>> +	 *
>> +	 * PG_anon_exclusive is *usually* only expressive in combination with a
>> +	 * page table entry. Depending on the page table entry type it might
>> +	 * store the following information:
>> +	 *
>> +	 *	Is what's mapped via this page table entry exclusive to the
>> +	 *	single process and can be mapped writable without further
>> +	 *	checks? If not, it might be shared and we might have to COW.
>> +	 *
>> +	 * For now, we only expect PTE-mapped THPs to make use of
>> +	 * PG_anon_exclusive in subpages. For other anonymous compound
>> +	 * folios (i.e., hugetlb), only the head page is logically mapped and
>> +	 * holds this information.
>> +	 *
>> +	 * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
>> +	 * set on the head page. When replacing the PMD by a page table full
>> +	 * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
>> +	 * all tail pages accordingly. Note that converting from a PTE-mapping
>> +	 * to a PMD mapping using the same compound page is currently not
>> +	 * possible and consequently doesn't require care.
>> +	 *
>> +	 * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
>> +	 * it should only pin if the relevant PG_anon_bit is set. In that case,
>> +	 * the pin will be fully reliable and stay consistent with the pages
>> +	 * mapped into the page table, as the bit cannot get cleared (e.g., by
>> +	 * fork(), KSM) while the page is pinned. For anonymous pages that
>> +	 * are mapped R/W, PG_anon_exclusive can be assumed to always be set
>> +	 * because such pages cannot possibly be shared.
>> +	 *
>> +	 * The page table lock protecting the page table entry is the primary
>> +	 * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
>> +	 * not take the PT lock needs special care when trying to clear the
>> +	 * flag.
>> +	 *
>> +	 * Page table entry types and PG_anon_exclusive:
>> +	 * * Present: PG_anon_exclusive applies.
>> +	 * * Swap: the information is lost. PG_anon_exclusive was cleared.
>> +	 * * Migration: the entry holds this information instead.
>> +	 *		PG_anon_exclusive was cleared.
>> +	 * * Device private: PG_anon_exclusive applies.
>> +	 * * Device exclusive: PG_anon_exclusive applies.
>> +	 * * HW Poison: PG_anon_exclusive is stale and not changed.
>> +	 *
>> +	 * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
>> +	 * not allowed and the flag will stick around until the page is freed
>> +	 * and folio->mapping is cleared.
>> +	 */
> 
> ... I also don't think this is the right place for this comment.  Not
> sure where it should go.

I went for "rather have some documentation in a sub-optimal place than
no documentation at all". I'm thinking about writing proper
documentation once everything is in place, and then moving some of these
details into that document.

> 
>> +static __always_inline void SetPageAnonExclusive(struct page *page)
>> +{
>> +	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
> 
> hm.  seems to me like we should have a PageAnonNotKsm which just
> does
> 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
> 			PAGE_MAPPING_ANON;
> because that's "a bit" inefficient.  OK, that's just a VM_BUG_ON,
> but we have other users in real code:
> 
> mm/migrate.c:   if (PageAnon(page) && !PageKsm(page))
> mm/page_idle.c: need_lock = !PageAnon(page) || PageKsm(page);
> mm/rmap.c:      if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
> 

I'm wondering whether the compiler isn't already able to optimize that.
That said, I can look into adding such a helper outside of this series.

Thanks!

Patch

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 9f1077d94cde..61187568e5f7 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -182,8 +182,7 @@  u64 stable_page_flags(struct page *page)
 
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
 
-	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
-	if (PageTail(page) && PageSlab(compound_head(page)))
+	if (PageSlab(page) || PageSlab(compound_head(page)))
 		u |= 1 << KPF_SLAB;
 
 	u |= kpf_copy_bit(k, KPF_ERROR,		PG_error);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..b3c2cf71467e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,7 +107,7 @@  enum pageflags {
 	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
 	PG_error,
-	PG_slab,
+	PG_slab,		/* Slab page if !PageAnon() */
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
 	PG_reserved,
@@ -142,6 +142,63 @@  enum pageflags {
 
 	PG_readahead = PG_reclaim,
 
+	/*
+	 * Depending on the way an anonymous folio can be mapped into a page
+	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+	 * THP), PG_anon_exclusive may be set only for the head page or for
+	 * subpages of an anonymous folio.
+	 *
+	 * PG_anon_exclusive is *usually* only expressive in combination with a
+	 * page table entry. Depending on the page table entry type it might
+	 * store the following information:
+	 *
+	 *	Is what's mapped via this page table entry exclusive to the
+	 *	single process and can be mapped writable without further
+	 *	checks? If not, it might be shared and we might have to COW.
+	 *
+	 * For now, we only expect PTE-mapped THPs to make use of
+	 * PG_anon_exclusive in subpages. For other anonymous compound
+	 * folios (i.e., hugetlb), only the head page is logically mapped and
+	 * holds this information.
+	 *
+	 * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
+	 * set on the head page. When replacing the PMD by a page table full
+	 * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
+	 * all tail pages accordingly. Note that converting from a PTE-mapping
+	 * to a PMD mapping using the same compound page is currently not
+	 * possible and consequently doesn't require care.
+	 *
+	 * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
+	 * it should only pin if the relevant PG_anon_bit is set. In that case,
+	 * the pin will be fully reliable and stay consistent with the pages
+	 * mapped into the page table, as the bit cannot get cleared (e.g., by
+	 * fork(), KSM) while the page is pinned. For anonymous pages that
+	 * are mapped R/W, PG_anon_exclusive can be assumed to always be set
+	 * because such pages cannot possibly be shared.
+	 *
+	 * The page table lock protecting the page table entry is the primary
+	 * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
+	 * not take the PT lock needs special care when trying to clear the
+	 * flag.
+	 *
+	 * Page table entry types and PG_anon_exclusive:
+	 * * Present: PG_anon_exclusive applies.
+	 * * Swap: the information is lost. PG_anon_exclusive was cleared.
+	 * * Migration: the entry holds this information instead.
+	 *		PG_anon_exclusive was cleared.
+	 * * Device private: PG_anon_exclusive applies.
+	 * * Device exclusive: PG_anon_exclusive applies.
+	 * * HW Poison: PG_anon_exclusive is stale and not changed.
+	 *
+	 * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
+	 * not allowed and the flag will stick around until the page is freed
+	 * and folio->mapping is cleared.
+	 *
+	 * Before clearing folio->mapping, PG_anon_exclusive has to be cleared
+	 * to not result in PageSlab() false positives.
+	 */
+	PG_anon_exclusive = PG_slab,
+
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -425,7 +482,6 @@  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
 	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
-__PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
 
@@ -920,6 +976,70 @@  extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static __always_inline bool folio_test_slab(struct folio *folio)
+{
+	return !folio_test_anon(folio) &&
+	       test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline int PageSlab(struct page *page)
+{
+	return !PageAnon(page) &&
+		test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
+}
+
+static __always_inline void __folio_set_slab(struct folio *folio)
+{
+	VM_BUG_ON(folio_test_anon(folio));
+	__set_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline void __SetPageSlab(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageAnon(page), page);
+	__set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
+}
+
+static __always_inline void __folio_clear_slab(struct folio *folio)
+{
+	VM_BUG_ON(folio_test_anon(folio));
+	__clear_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline void __ClearPageSlab(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageAnon(page), page);
+	__clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
+}
+
+static __always_inline int PageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1UL << PG_mlocked)
 #else
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..5662f74e027f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -103,7 +103,7 @@ 
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
 	{1UL << PG_workingset,		"workingset"	},		\
-	{1UL << PG_slab,		"slab"		},		\
+	{1UL << PG_slab,		"slab/anon_exclusive" },	\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
 	{1UL << PG_reserved,		"reserved"	},		\
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fb990d95dab..bf5ce91c623c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1669,6 +1669,10 @@  void free_huge_page(struct page *page)
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
 
 	hugetlb_set_page_subpool(page, NULL);
+	if (PageAnon(page)) {
+		__ClearPageAnonExclusive(page);
+		wmb(); /* avoid PageSlab() false positives */
+	}
 	page->mapping = NULL;
 	restore_reserve = HPageRestoreReserve(page);
 	ClearHPageRestoreReserve(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a9..176fa2fbd699 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1470,17 +1470,29 @@  static int identify_page_state(unsigned long pfn, struct page *p,
 	 * The first check uses the current page flags which may not have any
 	 * relevant information. The second check with the saved page flags is
 	 * carried out only if the first check can't determine the page status.
+	 *
+	 * Note that PG_slab is also used as PG_anon_exclusive for PageAnon()
+	 * pages. Most of these pages should have been handled previously,
+	 * however, let's play safe and verify via PageAnon().
 	 */
-	for (ps = error_states;; ps++)
-		if ((p->flags & ps->mask) == ps->res)
-			break;
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) != ps->res)
+			continue;
+		if ((ps->type == MF_MSG_SLAB) && PageAnon(p))
+			continue;
+		break;
+	}
 
 	page_flags |= (p->flags & (1UL << PG_dirty));
 
 	if (!ps->mask)
-		for (ps = error_states;; ps++)
-			if ((page_flags & ps->mask) == ps->res)
-				break;
+		for (ps = error_states;; ps++) {
+			if ((page_flags & ps->mask) != ps->res)
+				continue;
+			if ((ps->type == MF_MSG_SLAB) && PageAnon(p))
+				continue;
+			break;
+		}
 	return page_action(ps, p, pfn);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..a6b82ae8b3e6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -478,6 +478,17 @@  void free_devmap_managed_page(struct page *page)
 
 	mem_cgroup_uncharge(page_folio(page));
 
+	/*
+	 * Note: we don't expect anonymous compound pages yet. Once supported
+	 * and we could PTE-map them similar to THP, we'd have to clear
+	 * PG_anon_exclusive on all tail pages.
+	 */
+	VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+	if (PageAnon(page)) {
+		__ClearPageAnonExclusive(page);
+		wmb(); /* avoid PageSlab() false positives */
+	}
+
 	/*
 	 * When a device_private page is freed, the page->mapping field
 	 * may still contain a (stale) mapping value. For example, the
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..e59e68a61d1a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1297,6 +1297,7 @@  static __always_inline bool free_pages_prepare(struct page *page,
 {
 	int bad = 0;
 	bool skip_kasan_poison = should_skip_kasan_poison(page, fpi_flags);
+	const bool anon = PageAnon(page);
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -1329,6 +1330,14 @@  static __always_inline bool free_pages_prepare(struct page *page,
 			ClearPageHasHWPoisoned(page);
 		}
 		for (i = 1; i < (1 << order); i++) {
+			/*
+			 * Freeing a previously PTE-mapped THP can have
+			 * PG_anon_exclusive set on tail pages. Clear it
+			 * manually as it's overloaded with PG_slab that we
+			 * want to catch in check_free_page().
+			 */
+			if (anon)
+				__ClearPageAnonExclusive(page + i);
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
 			if (unlikely(check_free_page(page + i))) {
@@ -1338,6 +1347,10 @@  static __always_inline bool free_pages_prepare(struct page *page,
 			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
 	}
+	if (anon) {
+		__ClearPageAnonExclusive(page);
+		wmb(); /* avoid PageSlab() false positives */
+	}
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
 	if (memcg_kmem_enabled() && PageMemcgKmem(page))