Message ID | 20220308141437.144919-11-david@redhat.com
State      | New
Series     | mm: COW fixes part 2: reliable GUP pins of anonymous pages
On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared?

Is this supposed to be for PAGE_SIZE pages as well, or is it only used
on pages > PAGE_SIZE?

> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
> don't grow on trees, so we have to get a little creative for the time
> being.

This feels a little _too_ creative to me.  There's now an implicit
requirement that SL[AOU]B doesn't use the bottom two bits of
->slab_cache, which is probably OK but would need to be documented.

I have plans to get rid of PageError and PagePrivate, but those are going
to be too late for you.  I don't think mappedtodisk has meaning for anon
pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned
to shift to another bit ... but almost any bit will do for
has_hwpoisoned.  Or have I overlooked something?

> @@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
>  
>  __PAGEFLAG(Isolated, isolated, PF_ANY);
>  
> +static __always_inline bool folio_test_slab(struct folio *folio)
> +{
> +        return !folio_test_anon(folio) &&
> +               test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
> +}
> +
> +static __always_inline int PageSlab(struct page *page)
> +{
> +        return !PageAnon(page) &&
> +               test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
> +}

In case we do end up using this, this would be better implemented as

static __always_inline int PageSlab(struct page *page)
{
        return folio_test_slab(page_folio(page));
}

since PageAnon already has a page_folio() call embedded in it.

> +static __always_inline void __SetPageSlab(struct page *page)
> +{
> +        VM_BUG_ON_PGFLAGS(PageAnon(page), page);
> +        __set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
> +}

There's only one caller of __SetPageSlab() left, in kfence.  And that
code looks ... weird.

        for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
                if (!i || (i % 2))
                        continue;

                /* Verify we do not have a compound head page. */
                if (WARN_ON(compound_head(&pages[i]) != &pages[i]))
                        goto err;

                __SetPageSlab(&pages[i]);

I think the author probably intended WARN_ON(PageCompound(page)) because
they're actually verifying that it's not a tail page, rather than a head
page.

> +static __always_inline void __ClearPageSlab(struct page *page)
> +{
> +        VM_BUG_ON_PGFLAGS(PageAnon(page), page);
> +        __clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
> +}

There are no remaining callers of __ClearPageSlab().  yay.
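For context on the ->slab_cache constraint above: anonymity is encoded in
the low bits of page->mapping, and ->slab_cache aliases the same word of
struct page. A rough sketch of the relevant check (paraphrased from
include/linux/page-flags.h of that era; not part of the patch):

#define PAGE_MAPPING_ANON       0x1
#define PAGE_MAPPING_MOVABLE    0x2
#define PAGE_MAPPING_KSM        (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

static __always_inline bool folio_test_anon(struct folio *folio)
{
        /* An anon folio stores its anon_vma in ->mapping with bit 0 set. */
        return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
}

So if slab ever stored a pointer with bit 0 set in ->slab_cache, a slab
page would look PageAnon() and the overloaded PG_slab test above would
silently misfire.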
Hi,

On 09.03.22 16:47, Matthew Wilcox wrote:
> On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
>> The basic question we would like to have a reliable and efficient answer
>> to is: is this anonymous page exclusive to a single process or might it
>> be shared?
> 
> Is this supposed to be for PAGE_SIZE pages as well, or is it only used
> on pages > PAGE_SIZE?

As documented, simple/ordinary PAGE_SIZE pages as well (unfortunately,
otherwise we'd have more space :) ).

>> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
>> don't grow on trees, so we have to get a little creative for the time
>> being.
> 
> This feels a little _too_ creative to me.  There's now an implicit

It's making the semantics of PG_slab depend on another bit in the head
page. I agree, it's not perfect, but it's not *too* crazy. As raised in
the cover letter, I'm not proud of this, but I didn't really find an
alternative for the time being.

> requirement that SL[AOU]B doesn't use the bottom two bits of

I think only the last bit (0x1).

> ->slab_cache, which is probably OK but would need to be documented.

We'd already have false-positive PageAnon() if that wasn't the case. At
least stable_page_flags() would already indicate something wrong, I
think (KPF_ANON). We'd need !PageSlab() checks at a couple of places
already if I'm not wrong.

I had a comment in there, but after the PageSlab cleanups came in, I
dropped it, because my assumption was that actually nobody is really
allowed to use the lowest mapping bit for something else. We can
document that, of course.

So at least in that regard, I think this is fine.

> I have plans to get rid of PageError and PagePrivate, but those are going
> to be too late for you.  I don't think mappedtodisk has meaning for anon
> pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned

Are you sure it's not used if the page is in the swapcache? I have no
detailed knowledge of how file-backed swap targets are handled in that
regard. So fs experience would be highly appreciated :)

> to shift to another bit ... but almost any bit will do for has_hwpoisoned.
> Or have I overlooked something?

Good question. I assume we could use another bit that's not already
defined/checked on subpages of a compound page.

Overloading PG_reserved would be an alternative; however, that flag can
also indicate that the remainder of the memmap might be mostly garbage,
so it's not that good of a fit IMHO.

>> @@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
>>  
>>  __PAGEFLAG(Isolated, isolated, PF_ANY);
>>  
>> +static __always_inline bool folio_test_slab(struct folio *folio)
>> +{
>> +        return !folio_test_anon(folio) &&
>> +               test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
>> +}
>> +
>> +static __always_inline int PageSlab(struct page *page)
>> +{
>> +        return !PageAnon(page) &&
>> +               test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
>> +}
> 
> In case we do end up using this, this would be better implemented as
> 
> static __always_inline int PageSlab(struct page *page)
> {
>         return folio_test_slab(page_folio(page));
> }
> 
> since PageAnon already has a page_folio() call embedded in it.

Agreed, I mainly copied the stubs and extended them.

>> +static __always_inline void __SetPageSlab(struct page *page)
>> +{
>> +        VM_BUG_ON_PGFLAGS(PageAnon(page), page);
>> +        __set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
>> +}
> 
> There's only one caller of __SetPageSlab() left, in kfence.  And that
> code looks ... weird.
> 
>         for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
>                 if (!i || (i % 2))
>                         continue;
> 
>                 /* Verify we do not have a compound head page. */
>                 if (WARN_ON(compound_head(&pages[i]) != &pages[i]))
>                         goto err;
> 
>                 __SetPageSlab(&pages[i]);
> 
> I think the author probably intended WARN_ON(PageCompound(page)) because
> they're actually verifying that it's not a tail page, rather than a head
> page.

It's certainly a head-scratcher.

>> +static __always_inline void __ClearPageSlab(struct page *page)
>> +{
>> +        VM_BUG_ON_PGFLAGS(PageAnon(page), page);
>> +        __clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
>> +}
> 
> There are no remaining callers of __ClearPageSlab().  yay.

Indeed, nice.

Thanks!
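If the kfence check were reworked along the lines Willy suggests, the
loop might read as follows (a sketch of the proposed fix only, not a
patch posted in this thread):

        for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
                if (!i || (i % 2))
                        continue;

                /* Verify the page is neither a compound head nor a tail. */
                if (WARN_ON(PageCompound(&pages[i])))
                        goto err;

                __SetPageSlab(&pages[i]);
        }

PageCompound() is true for both head and tail pages, so this catches
both cases that would make __SetPageSlab() unsafe here.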
On Wed, Mar 09, 2022 at 05:57:51PM +0100, David Hildenbrand wrote:
> Hi,
> 
> On 09.03.22 16:47, Matthew Wilcox wrote:
> > On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
> >> The basic question we would like to have a reliable and efficient answer
> >> to is: is this anonymous page exclusive to a single process or might it
> >> be shared?
> > 
> > Is this supposed to be for PAGE_SIZE pages as well, or is it only used
> > on pages > PAGE_SIZE?
> 
> As documented, simple/ordinary PAGE_SIZE pages as well (unfortunately,
> otherwise we'd have more space :) ).

OK.  Documentation doesn't always capture exactly what the author was
thinking, so I appreciate the clarification ;-)

> > This feels a little _too_ creative to me.  There's now an implicit
> 
> It's making the semantics of PG_slab depend on another bit in the head
> page. I agree, it's not perfect, but it's not *too* crazy. As raised in
> the cover letter, I'm not proud of this, but I didn't really find an
> alternative for the time being.
> 
> > requirement that SL[AOU]B doesn't use the bottom two bits of
> 
> I think only the last bit (0x1).

Yeah, OK, they can use three of the four possible combinations of the
bottom two bits ;-)  Still, it's yet more constraints on the use of
struct page, which aren't obvious (and are even upside down for the
casual observer).

> > I have plans to get rid of PageError and PagePrivate, but those are going
> > to be too late for you.  I don't think mappedtodisk has meaning for anon
> > pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned
> 
> Are you sure it's not used if the page is in the swapcache? I have no
> detailed knowledge of how file-backed swap targets are handled in that
> regard. So fs experience would be highly appreciated :)

I have two arguments here.  One is based on grep:

$ git grep mappedtodisk
fs/proc/page.c: u |= kpf_copy_bit(k, KPF_MAPPEDTODISK, PG_mappedtodisk);
include/linux/page-flags.h:     PG_mappedtodisk,        /* Has blocks allocated on-disk */
include/linux/page-flags.h:     PG_has_hwpoisoned = PG_mappedtodisk,
include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
include/trace/events/mmflags.h: {1UL << PG_mappedtodisk, "mappedtodisk" }, \
include/trace/events/pagemap.h: (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \
mm/migrate.c:   if (folio_test_mappedtodisk(folio))
mm/migrate.c:           folio_set_mappedtodisk(newfolio);
mm/truncate.c:  folio_clear_mappedtodisk(folio);
tools/vm/page-types.c:  [KPF_MAPPEDTODISK]      = "d:mappedtodisk",

$ git grep MappedToDisk
fs/buffer.c:            SetPageMappedToDisk(page);
fs/buffer.c:    if (PageMappedToDisk(page))
fs/buffer.c:            SetPageMappedToDisk(page);
fs/ext4/readpage.c:             SetPageMappedToDisk(page);
fs/f2fs/data.c:         SetPageMappedToDisk(page);
fs/f2fs/file.c: if (PageMappedToDisk(page))
fs/fuse/dev.c:  ClearPageMappedToDisk(newpage);
fs/mpage.c:             SetPageMappedToDisk(page);
fs/nilfs2/file.c:       if (PageMappedToDisk(page))
fs/nilfs2/file.c:       SetPageMappedToDisk(page);
fs/nilfs2/page.c:       ClearPageMappedToDisk(page);
fs/nilfs2/page.c:               SetPageMappedToDisk(dpage);
fs/nilfs2/page.c:               ClearPageMappedToDisk(dpage);
fs/nilfs2/page.c:       if (PageMappedToDisk(src) && !PageMappedToDisk(dst))
fs/nilfs2/page.c:               SetPageMappedToDisk(dst);
fs/nilfs2/page.c:       else if (!PageMappedToDisk(src) && PageMappedToDisk(dst))
fs/nilfs2/page.c:               ClearPageMappedToDisk(dst);
fs/nilfs2/page.c:       ClearPageMappedToDisk(page);
include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)

so you can see it's _rarely_ used, and only by specific filesystems.
The swap code actually bypasses the filesystem (except for network
filesystems) and submits BIOs directly (see __swap_writepage() and
swap_readpage()).  There's no check for MappedToDisk, and nowhere to
set or clear it.

The second argument is that MappedToDisk is used for delayed allocation
on writes for filesystems.  But swap is required to be allocated at
swapfile setup (so that the swap code can bypass the filesystem ...),
and so this flag is useless.

Of course, I may have missed something.  This is complex.

> > to shift to another bit ... but almost any bit will do for has_hwpoisoned.
> > Or have I overlooked something?
> 
> Good question. I assume we could use another bit that's not already
> defined/checked on subpages of a compound page.
> 
> Overloading PG_reserved would be an alternative; however, that flag can
> also indicate that the remainder of the memmap might be mostly garbage,
> so it's not that good of a fit IMHO.

I wouldn't reuse PG_reserved for anything.  Although I do have a modest
proposal that I should finish writing up ... let me start on that.
>> It's making the semantics of PG_slab depend on another bit in the head
>> page. I agree, it's not perfect, but it's not *too* crazy. As raised in
>> the cover letter, I'm not proud of this, but I didn't really find an
>> alternative for the time being.
>>
>>> requirement that SL[AOU]B doesn't use the bottom two bits of
>>
>> I think only the last bit (0x1).
> 
> Yeah, OK, they can use three of the four possible combinations of the
> bottom two bits ;-)  Still, it's yet more constraints on the use of
> struct page, which aren't obvious (and are even upside down for the
> casual observer).

I don't disagree that such constraints are nasty. Having said that, I'd
really like to avoid overloading PG_slab (especially, such that I don't
have to mess with scripts/crash/makedumpfile). So if we can reuse
MappedToDisk, that would be very nice.

>>> I have plans to get rid of PageError and PagePrivate, but those are going
>>> to be too late for you.  I don't think mappedtodisk has meaning for anon
>>> pages, even if they're in the swapcache.  It would need PG_has_hwpoisoned
>>
>> Are you sure it's not used if the page is in the swapcache? I have no
>> detailed knowledge of how file-backed swap targets are handled in that
>> regard. So fs experience would be highly appreciated :)
> 
> I have two arguments here.  One is based on grep:
> 
> $ git grep mappedtodisk
[...]
> $ git grep MappedToDisk
[...]
> 
> so you can see it's _rarely_ used, and only by specific filesystems.

Right, but I spot ext4 and fs/buffer.c core functionality. That
naturally makes me nervous :)

> The swap code actually bypasses the filesystem (except for network
> filesystems) and submits BIOs directly (see __swap_writepage() and
> swap_readpage()).  There's no check for MappedToDisk, and nowhere to
> set or clear it.
> 
> The second argument is that MappedToDisk is used for delayed allocation
> on writes for filesystems.
> But swap is required to be allocated at
> swapfile setup (so that the swap code can bypass the filesystem ...),
> and so this flag is useless.

I have some faint memory that there are corner cases, but maybe
(hopefully) my memory is wrong.

> Of course, I may have missed something.  This is complex.

Yeah, that's why I was very careful. If any FS would end up setting that
flag, we'd be in trouble and would have to blacklist that fs for
swapping or rework it (well, if we're even able to identify such a
filesystem).

I think one sanity check I could add is making sure that
PageAnonExclusive() is never set when getting a page from the swapcache
in do_swap_page(). I think that would take care of the obvious bugs.
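For reference, the sanity check David describes is exactly what the
patch below adds to do_swap_page(); in isolation the two assertions are
(sketch, same logic as the patch):

        /*
         * After taking the PT lock: no filesystem may have set
         * PG_mappedtodisk on a page in the swapcache, and a page coming
         * out of the swapcache must not yet be marked exclusive.
         */
        BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
        BUG_ON(PageAnon(page) && PageAnonExclusive(page));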
On 08.03.22 15:14, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared?
> 
> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
> don't grow on trees, so we have to get a little creative for the time
> being.
> 
> Introduce a way to mark an anonymous page as exclusive, with the
> ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
> GUP pins lose consistency with the pages mapped into the page table,
> resulting in reported memory corruptions.
> 
> Most pageflags already have semantics for anonymous pages, so we're left
> with reusing PG_slab for our purpose: for PageAnon() pages, PG_slab now
> translates to PG_anon_exclusive; teach some in-kernel code that manually
> handles PG_slab about that.
> 
> Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
> documentation will be contained in the code that actually makes use of
> PG_anon_exclusive.
> 
> We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
> zapping) of page table entries; page freeing code will handle that when
> also invalidating page->mapping to not indicate PageAnon() anymore.
> Letting information about exclusivity stick around will be an important
> property when adding sanity checks to unpinning code.
> 
> RFC notes: in-tree tools/cgroup/memcg_slabinfo.py looks like it might need
>            some care. We'd have to look up the head page and check if
>            PageAnon() is set. Similarly, tools living outside the kernel
>            repository like crash and makedumpfile might need adaptions.
> 
> Cc: Roman Gushchin <guro@fb.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

I'm currently testing with the following. My tests so far with a
swapfile on all different kinds of weird filesystems (excluding
networking fs, though) revealed no surprises so far:


From 13aeb929722e38d162d03baa75eadb8c2c84359d Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 20 Dec 2021 20:23:51 +0100
Subject: [PATCH] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive
 for PageAnon() pages

The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared? We need that information for ordinary/single pages, hugetlb
pages, and possibly each subpage of a THP.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages; however,
PG_mappedtodisk should never apply to pages in the swapcache, so let's
reuse that flag.

As PG_has_hwpoisoned also uses that flag on the second subpage of a
compound page, convert it to PG_waiters instead, which is marked as
PF_ONLY_HEAD and definitely does not apply to any tail pages.
__split_huge_page() properly does a ClearPageHasHWPoisoned() before
continuing with the THP split.

Use custom page flag modification functions such that we can do
additional sanity checks.

Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
documentation will be contained in the code that actually makes use of
PG_anon_exclusive.

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; page freeing code will handle that when
also invalidating page->mapping to not indicate PageAnon() anymore.
Letting information about exclusivity stick around will be an important
property when adding sanity checks to unpinning code.

Note that we properly clear the flag in free_pages_prepare() via
PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
so there is no need to manually clear the flag.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 84 +++++++++++++++++++++++++++++++++++++-
 mm/hugetlb.c               |  2 +
 mm/memory.c                | 11 +++++
 mm/memremap.c              |  9 ++++
 mm/swapfile.c              |  4 ++
 tools/vm/page-types.c      |  8 +++-
 6 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..ac68d28b5e1f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -142,6 +142,60 @@ enum pageflags {
 
         PG_readahead = PG_reclaim,
 
+        /*
+         * Depending on the way an anonymous folio can be mapped into a page
+         * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+         * THP), PG_anon_exclusive may be set only for the head page or for
+         * subpages of an anonymous folio.
+         *
+         * PG_anon_exclusive is *usually* only expressive in combination with a
+         * page table entry. Depending on the page table entry type it might
+         * store the following information:
+         *
+         *   Is what's mapped via this page table entry exclusive to the
+         *   single process and can be mapped writable without further
+         *   checks? If not, it might be shared and we might have to COW.
+         *
+         * For now, we only expect PTE-mapped THPs to make use of
+         * PG_anon_exclusive in subpages. For other anonymous compound
+         * folios (i.e., hugetlb), only the head page is logically mapped and
+         * holds this information.
+         *
+         * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
+         * set on the head page. When replacing the PMD by a page table full
+         * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
+         * all tail pages accordingly. Note that converting from a PTE-mapping
+         * to a PMD mapping using the same compound page is currently not
+         * possible and consequently doesn't require care.
+         *
+         * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
+         * it should only pin if the relevant PG_anon_exclusive bit is set. In
+         * that case, the pin will be fully reliable and stay consistent with
+         * the pages mapped into the page table, as the bit cannot get cleared
+         * (e.g., by fork(), KSM) while the page is pinned. For anonymous pages
+         * that are mapped R/W, PG_anon_exclusive can be assumed to always be
+         * set because such pages cannot possibly be shared.
+         *
+         * The page table lock protecting the page table entry is the primary
+         * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
+         * not take the PT lock needs special care when trying to clear the
+         * flag.
+         *
+         * Page table entry types and PG_anon_exclusive:
+         *   * Present: PG_anon_exclusive applies.
+         *   * Swap: the information is lost. PG_anon_exclusive was cleared.
+         *   * Migration: the entry holds this information instead.
+         *                PG_anon_exclusive was cleared.
+         *   * Device private: PG_anon_exclusive applies.
+         *   * Device exclusive: PG_anon_exclusive applies.
+         *   * HW Poison: PG_anon_exclusive is stale and not changed.
+         *
+         * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
+         * not allowed and the flag will stick around until the page is freed
+         * and folio->mapping is cleared.
+         */
+        PG_anon_exclusive = PG_mappedtodisk,
+
         /* Filesystems */
         PG_checked = PG_owner_priv_1,
 
@@ -176,7 +230,7 @@ enum pageflags {
          * Indicates that at least one subpage is hwpoisoned in the
          * THP.
          */
-        PG_has_hwpoisoned = PG_mappedtodisk,
+        PG_has_hwpoisoned = PG_waiters,
 #endif
 
         /* non-lru isolated movable page */
@@ -920,6 +974,34 @@ extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static __always_inline int PageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED            (1UL << PG_mlocked)
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fb990d95dab..1ff0b9e1e28e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1669,6 +1669,8 @@ void free_huge_page(struct page *page)
         VM_BUG_ON_PAGE(page_mapcount(page), page);
 
         hugetlb_set_page_subpool(page, NULL);
+        if (PageAnon(page))
+                __ClearPageAnonExclusive(page);
         page->mapping = NULL;
         restore_reserve = HPageRestoreReserve(page);
         ClearHPageRestoreReserve(page);
diff --git a/mm/memory.c b/mm/memory.c
index 7b32f422798d..d01fab481134 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3671,6 +3671,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                 goto out_nomap;
         }
 
+        /*
+         * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
+         * must never point at an anonymous page in the swapcache that is
+         * PG_anon_exclusive. Sanity check that this holds and especially, that
+         * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
+         * check after taking the PT lock and making sure that nobody
+         * concurrently faulted in this page and set PG_anon_exclusive.
+         */
+        BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+        BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
         /*
          * Remove the swap entry and conditionally try to free up the swapcache.
          * We're already holding a reference on the page but haven't mapped it
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..160ea92e4e17 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -478,6 +478,15 @@ void free_devmap_managed_page(struct page *page)
 
         mem_cgroup_uncharge(page_folio(page));
 
+        /*
+         * Note: we don't expect anonymous compound pages yet. Once supported
+         * and we could PTE-map them similar to THP, we'd have to clear
+         * PG_anon_exclusive on all tail pages.
+         */
+        VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+        if (PageAnon(page))
+                __ClearPageAnonExclusive(page);
+
         /*
          * When a device_private page is freed, the page->mapping field
          * may still contain a (stale) mapping value. For example, the
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7edc8e099b22..493acb967b7a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1796,6 +1796,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
                 goto out;
         }
 
+        /* See do_swap_page() */
+        BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+        BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
         dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
         inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
         get_page(page);
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index b1ed76d9a979..381dcc00cb62 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -80,9 +80,10 @@
 #define KPF_SOFTDIRTY           40
 #define KPF_ARCH_2              41
 
-/* [48-] take some arbitrary free slots for expanding overloaded flags
+/* [47-] take some arbitrary free slots for expanding overloaded flags
  * not part of kernel API
  */
+#define KPF_ANON_EXCLUSIVE      47
 #define KPF_READAHEAD           48
 #define KPF_SLOB_FREE           49
 #define KPF_SLUB_FROZEN         50
@@ -138,6 +139,7 @@ static const char * const page_flag_names[] = {
         [KPF_SOFTDIRTY]         = "f:softdirty",
         [KPF_ARCH_2]            = "H:arch_2",
 
+        [KPF_ANON_EXCLUSIVE]    = "d:anon_exclusive",
         [KPF_READAHEAD]         = "I:readahead",
         [KPF_SLOB_FREE]         = "P:slob_free",
         [KPF_SLUB_FROZEN]       = "A:slub_frozen",
@@ -472,6 +474,10 @@ static int bit_mask_ok(uint64_t flags)
 
 static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
 {
+        /* Anonymous pages overload PG_mappedtodisk */
+        if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK)))
+                flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE);
+
         /* SLOB/SLUB overload several page flags */
         if (flags & BIT(SLAB)) {
                 if (flags & BIT(PRIVATE))
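To make the intended consumer of the flag concrete, here is a purely
illustrative sketch of the GUP-side decision the series builds towards.
The helper name and shape are hypothetical; the real logic lands in
later patches of the series:

/*
 * Hypothetical sketch: may GUP take a reliable FOLL_PIN on this
 * anonymous page mapped by @pte? (Illustrative only; not from this
 * patch.)
 */
static bool anon_page_safe_to_pin(struct page *page, pte_t pte)
{
        /* A writable anon PTE implies the page is exclusive to us. */
        if (pte_write(pte))
                return true;
        /* Otherwise, only pin if the page is marked exclusive. */
        return PageAnonExclusive(page);
}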
On 11.03.22 20:22, Linus Torvalds wrote:
> On Fri, Mar 11, 2022 at 10:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> -        PG_has_hwpoisoned = PG_mappedtodisk,
>> +        PG_has_hwpoisoned = PG_waiters,
> 
> That makes me too nervous for words. PG_waiters is very subtle.

Yes, but PG_has_hwpoisoned is located on the second subpage of a
compound page, not on the head page.

> Not only is it magical in bit location ways (and the special
> clear_bit_unlock_is_negative_byte() macro that *literally* exists only
> for unlocking a page), it just ends up having fairly subtle semantics
> with intentionally racy clearing etc.
> 
> Mixing that up with any hwpoison bits - that aren't used by any normal
> mortals and thus get very little coverage - just sounds horribly
> horribly wrong.

I used PG_error before, but felt like using a bit that is never ever
valid to be set/cleared/checked on a subpage would be an even better
fit. Note the:

PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
        __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)

whereby PF_ONLY_HEAD translates to:

"for compound page, callers only ever operate on the head page."

I can just switch to PG_error, but for the second subpage, PG_waiters
should be just fine (unless I am missing something important).
On Fri, Mar 11, 2022 at 07:46:39PM +0100, David Hildenbrand wrote:
> I'm currently testing with the following. My tests so far with a
> swapfile on all different kinds of weird filesystems (excluding
> networking fs, though) revealed no surprises so far:

I like this a lot better than reusing PG_swap.  Thanks!

I'm somewhat reluctant to introduce a new flag that can be set on tail
pages.  Do we lose much if it's always set only on the head page?

> +++ b/include/linux/page-flags.h
> @@ -142,6 +142,60 @@ enum pageflags {
> 
>          PG_readahead = PG_reclaim,
> 
> +        /*
> +         * Depending on the way an anonymous folio can be mapped into a page
> +         * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
> +         * THP), PG_anon_exclusive may be set only for the head page or for
> +         * subpages of an anonymous folio.
[...]
> +         * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
> +         * not allowed and the flag will stick around until the page is freed
> +         * and folio->mapping is cleared.
> +         */

... I also don't think this is the right place for this comment.  Not
sure where it should go.

> +static __always_inline void SetPageAnonExclusive(struct page *page)
> +{
> +        VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);

Hm.  Seems to me like we should have a PageAnonNotKsm which just does

        return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
                        PAGE_MAPPING_ANON;

because that's "a bit" inefficient.  OK, that's just a VM_BUG_ON, but
we have other users in real code:

mm/migrate.c:   if (PageAnon(page) && !PageKsm(page))
mm/page_idle.c: need_lock = !PageAnon(page) || PageKsm(page);
mm/rmap.c:      if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
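Spelled out as a helper, the suggestion might look like this (a sketch;
PageAnonNotKsm is Willy's proposed name, not an existing API at this
point). KSM pages set both low mapping bits, so one load and one mask
distinguishes "anon but not KSM":

static __always_inline bool PageAnonNotKsm(struct page *page)
{
        unsigned long mapping = (unsigned long)page_folio(page)->mapping;

        /* Anon sets PAGE_MAPPING_ANON; KSM additionally sets MOVABLE. */
        return (mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON;
}

Compared with PageAnon(page) && !PageKsm(page), this avoids resolving
the head page and loading ->mapping twice.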
On Fri, Mar 11, 2022 at 08:36:37PM +0100, David Hildenbrand wrote:
> I used PG_error before, but felt like using a bit that is never ever
> valid to be set/cleared/checked on a subpage would be an even better
> fit. Note the:
> 
> PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
>         __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
> 
> whereby PF_ONLY_HEAD translates to:
> 
> "for compound page, callers only ever operate on the head page."
> 
> I can just switch to PG_error, but for the second subpage, PG_waiters
> should be just fine (unless I am missing something important).

I think you're missing something important that almost everybody misses
when looking at this code (including me).

PF_ANY flags can be set on individual pages.
PF_HEAD means "we automatically redirect all operations to the head page".
PF_ONLY_HEAD means "if you try to call this on a tail page, we BUG".
PF_NO_TAIL means "if you try to read this flag on a tail page, we'll
look at the head page instead, but if you try to set/clear this flag
on a tail page, we BUG".
PF_NO_COMPOUND means "we BUG if you call this on a compound page".

So really, you can reuse any flag as PF_SECOND that isn't PF_ANY.

No, that's not what the documentation currently says.  It should be.
I had a patch to reword it at some point, but I guess it got lost.
The current documentation reads like "we replicate the flag currently
set on the head page to all tail pages", but that just isn't what the
code does.
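For reference, the policies described above correspond to these macros
(paraphrased from include/linux/page-flags.h around v5.17; the exact
VM_BUG_ON conditions may differ in other trees). The `enforce` argument
is 1 for set/clear accessors and 0 for test accessors, which is what
makes PF_NO_TAIL reads forgiving but writes strict:

#define PF_ANY(page, enforce)   PF_POISONED_CHECK(page)
#define PF_HEAD(page, enforce)  PF_POISONED_CHECK(compound_head(page))
#define PF_ONLY_HEAD(page, enforce) ({                                  \
                VM_BUG_ON_PGFLAGS(PageTail(page), page);                \
                PF_POISONED_CHECK(page); })
#define PF_NO_TAIL(page, enforce) ({                                    \
                VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);     \
                PF_POISONED_CHECK(compound_head(page)); })
#define PF_NO_COMPOUND(page, enforce) ({                                \
                VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page); \
                PF_POISONED_CHECK(page); })
#define PF_SECOND(page, enforce) ({                                     \
                VM_BUG_ON_PGFLAGS(!PageHead(page), page);               \
                PF_POISONED_CHECK(&page[1]); })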
On 11.03.22 22:11, Matthew Wilcox wrote:
> On Fri, Mar 11, 2022 at 08:36:37PM +0100, David Hildenbrand wrote:
>> I used PG_error before, but felt like using a bit that is never ever
>> valid to be set/cleared/checked on a subpage would be an even better
>> fit. Note the:
>>
>> PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
>>         __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
>>
>> whereby PF_ONLY_HEAD translates to:
>>
>> "for compound page, callers only ever operate on the head page."
>>
>> I can just switch to PG_error, but for the second subpage, PG_waiters
>> should be just fine (unless I am missing something important).
> 
> I think you're missing something important that almost everybody misses
> when looking at this code (including me).
> 
> PF_ANY flags can be set on individual pages.
> PF_HEAD means "we automatically redirect all operations to the head page".
> PF_ONLY_HEAD means "if you try to call this on a tail page, we BUG".
> PF_NO_TAIL means "if you try to read this flag on a tail page, we'll
> look at the head page instead, but if you try to set/clear this flag
> on a tail page, we BUG"

^ that's actually the confusing part for me:

"modifications of the page flag must be done on small or head pages,
checks can be done on tail pages too."

Just from reading that, I thought checks on a tail page would *not* get
redirected to the head. I'd have written that as an extended version of
PF_HEAD:

"for compound page all operations related to the page flag are applied
to the head page. Checks done on a tail page will be redirected to the
head page."

[...]

> So really, you can reuse any flag as PF_SECOND that isn't PF_ANY.

I'll go back to PG_error, thanks!
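As a concrete PF_SECOND example from mainline of that era (quoted from
memory, so treat as approximate): PG_has_hwpoisoned's accessors take the
head page, and PF_SECOND redirects the bit operation to the first tail
page, i.e., the second subpage:

#ifdef CONFIG_MEMORY_FAILURE
/*
 * Stored on the first tail page: callers pass the head page and
 * PF_SECOND redirects the bit operation to &page[1].
 */
PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
        TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
#endif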
On 11.03.22 22:02, Matthew Wilcox wrote:
> On Fri, Mar 11, 2022 at 07:46:39PM +0100, David Hildenbrand wrote:
>> I'm currently testing with the following. My tests so far with a
>> swapfile on all different kinds of weird filesystems (excluding
>> networking fs, though) revealed no surprises so far:
> 
> I like this a lot better than reusing PG_swap.  Thanks!
> 
> I'm somewhat reluctant to introduce a new flag that can be set on tail
> pages.  Do we lose much if it's always set only on the head page?

After spending one month on getting THP to work without PF_ANY, I can
say with confidence that the whole thing won't fly with THP when not
tracking it at the minimum mapping granularity. For a PTE-mapped THP,
that's the subpage level. The next patch in the series documents some
details.

Intuitively, if we could replace the pageflag by a PTE/PMD bit, we'd get
roughly the same result.

>> +++ b/include/linux/page-flags.h
>> @@ -142,6 +142,60 @@ enum pageflags {
>> 
>>          PG_readahead = PG_reclaim,
>> 
>> +        /*
>> +         * Depending on the way an anonymous folio can be mapped into a page
>> +         * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>> +         * THP), PG_anon_exclusive may be set only for the head page or for
>> +         * subpages of an anonymous folio.
[...]
>> +         * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
>> +         * not allowed and the flag will stick around until the page is freed
>> +         * and folio->mapping is cleared.
>> +         */
> 
> ... I also don't think this is the right place for this comment.  Not
> sure where it should go.

I went for "rather have some documentation at a sub-optimal place than
no documentation at all". I'm thinking about writing proper
documentation once everything is in place, and moving some details from
there into that document then.

>> +static __always_inline void SetPageAnonExclusive(struct page *page)
>> +{
>> +        VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
> 
> Hm.  Seems to me like we should have a PageAnonNotKsm which just does
> 
>         return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
>                         PAGE_MAPPING_ANON;
> 
> because that's "a bit" inefficient.  OK, that's just a VM_BUG_ON, but
> we have other users in real code:
> 
> mm/migrate.c:   if (PageAnon(page) && !PageKsm(page))
> mm/page_idle.c: need_lock = !PageAnon(page) || PageKsm(page);
> mm/rmap.c:      if (!is_locked && (!PageAnon(page) || PageKsm(page))) {

I'm wondering if the compiler won't be able to optimize that. Having
said that, I can look into adding that outside of this series.

Thanks!
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 9f1077d94cde..61187568e5f7 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -182,8 +182,7 @@ u64 stable_page_flags(struct page *page)
 
         u |= kpf_copy_bit(k, KPF_LOCKED,        PG_locked);
 
-        u |= kpf_copy_bit(k, KPF_SLAB,          PG_slab);
-        if (PageTail(page) && PageSlab(compound_head(page)))
+        if (PageSlab(page) || PageSlab(compound_head(page)))
                 u |= 1 << KPF_SLAB;
 
         u |= kpf_copy_bit(k, KPF_ERROR,         PG_error);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..b3c2cf71467e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,7 +107,7 @@ enum pageflags {
         PG_workingset,
         PG_waiters,             /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
         PG_error,
-        PG_slab,
+        PG_slab,                /* Slab page if !PageAnon() */
         PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
         PG_arch_1,
         PG_reserved,
@@ -142,6 +142,63 @@ enum pageflags {
 
         PG_readahead = PG_reclaim,
 
+        /*
+         * Depending on the way an anonymous folio can be mapped into a page
+         * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+         * THP), PG_anon_exclusive may be set only for the head page or for
+         * subpages of an anonymous folio.
+         *
+         * PG_anon_exclusive is *usually* only expressive in combination with a
+         * page table entry. Depending on the page table entry type it might
+         * store the following information:
+         *
+         *   Is what's mapped via this page table entry exclusive to the
+         *   single process and can be mapped writable without further
+         *   checks? If not, it might be shared and we might have to COW.
+         *
+         * For now, we only expect PTE-mapped THPs to make use of
+         * PG_anon_exclusive in subpages. For other anonymous compound
+         * folios (i.e., hugetlb), only the head page is logically mapped and
+         * holds this information.
+         *
+         * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
+         * set on the head page. When replacing the PMD by a page table full
+         * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
+         * all tail pages accordingly. Note that converting from a PTE-mapping
+         * to a PMD mapping using the same compound page is currently not
+         * possible and consequently doesn't require care.
+         *
+         * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
+         * it should only pin if the relevant PG_anon_exclusive bit is set. In
+         * that case, the pin will be fully reliable and stay consistent with
+         * the pages mapped into the page table, as the bit cannot get cleared
+         * (e.g., by fork(), KSM) while the page is pinned. For anonymous pages
+         * that are mapped R/W, PG_anon_exclusive can be assumed to always be
+         * set because such pages cannot possibly be shared.
+         *
+         * The page table lock protecting the page table entry is the primary
+         * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
+         * not take the PT lock needs special care when trying to clear the
+         * flag.
+         *
+         * Page table entry types and PG_anon_exclusive:
+         *   * Present: PG_anon_exclusive applies.
+         *   * Swap: the information is lost. PG_anon_exclusive was cleared.
+         *   * Migration: the entry holds this information instead.
+         *                PG_anon_exclusive was cleared.
+         *   * Device private: PG_anon_exclusive applies.
+         *   * Device exclusive: PG_anon_exclusive applies.
+         *   * HW Poison: PG_anon_exclusive is stale and not changed.
+         *
+         * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
+         * not allowed and the flag will stick around until the page is freed
+         * and folio->mapping is cleared.
+         *
+         * Before clearing folio->mapping, PG_anon_exclusive has to be cleared
+         * to not result in PageSlab() false positives.
+         */
+        PG_anon_exclusive = PG_slab,
+
         /* Filesystems */
         PG_checked = PG_owner_priv_1,
 
@@ -425,7 +482,6 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
         TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
         TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
-__PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)         /* Used by some filesystems */
@@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static __always_inline bool folio_test_slab(struct folio *folio)
+{
+        return !folio_test_anon(folio) &&
+               test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline int PageSlab(struct page *page)
+{
+        return !PageAnon(page) &&
+               test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
+}
+
+static __always_inline void __folio_set_slab(struct folio *folio)
+{
+        VM_BUG_ON(folio_test_anon(folio));
+        __set_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline void __SetPageSlab(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(PageAnon(page), page);
+        __set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
+}
+
+static __always_inline void __folio_clear_slab(struct folio *folio)
+{
+        VM_BUG_ON(folio_test_anon(folio));
+        __clear_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline void __ClearPageSlab(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(PageAnon(page), page);
+        __clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
+}
+
+static __always_inline int PageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+        VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+        VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+        __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED            (1UL << PG_mlocked)
 #else
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..5662f74e027f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -103,7 +103,7 @@
         {1UL << PG_lru,                 "lru"           },              \
         {1UL << PG_active,              "active"        },              \
         {1UL << PG_workingset,          "workingset"    },              \
-        {1UL << PG_slab,                "slab"          },              \
+        {1UL << PG_slab,                "slab/anon_exclusive"   },      \
         {1UL << PG_owner_priv_1,        "owner_priv_1"  },              \
         {1UL << PG_arch_1,              "arch_1"        },              \
         {1UL << PG_reserved,            "reserved"      },              \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fb990d95dab..bf5ce91c623c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1669,6 +1669,10 @@ void free_huge_page(struct page *page)
         VM_BUG_ON_PAGE(page_mapcount(page), page);
 
         hugetlb_set_page_subpool(page, NULL);
+        if (PageAnon(page)) {
+                __ClearPageAnonExclusive(page);
+                wmb(); /* avoid PageSlab() false positives */
+        }
         page->mapping = NULL;
         restore_reserve = HPageRestoreReserve(page);
         ClearHPageRestoreReserve(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a9..176fa2fbd699 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1470,17 +1470,29 @@ static int identify_page_state(unsigned long pfn, struct page *p,
          * The first check uses the current page flags which may not have any
          * relevant information. The second check with the saved page flags is
          * carried out only if the first check can't determine the page status.
+         *
+         * Note that PG_slab is also used as PG_anon_exclusive for PageAnon()
+         * pages. Most of these pages should have been handled previously,
+         * however, let's play safe and verify via PageAnon().
          */
-        for (ps = error_states;; ps++)
-                if ((p->flags & ps->mask) == ps->res)
-                        break;
+        for (ps = error_states;; ps++) {
+                if ((p->flags & ps->mask) != ps->res)
+                        continue;
+                if ((ps->type == MF_MSG_SLAB) && PageAnon(p))
+                        continue;
+                break;
+        }
 
         page_flags |= (p->flags & (1UL << PG_dirty));
 
         if (!ps->mask)
-                for (ps = error_states;; ps++)
-                        if ((page_flags & ps->mask) == ps->res)
-                                break;
+                for (ps = error_states;; ps++) {
+                        if ((page_flags & ps->mask) != ps->res)
+                                continue;
+                        if ((ps->type == MF_MSG_SLAB) && PageAnon(p))
+                                continue;
+                        break;
+                }
         return page_action(ps, p, pfn);
 }
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..a6b82ae8b3e6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -478,6 +478,17 @@ void free_devmap_managed_page(struct page *page)
 
         mem_cgroup_uncharge(page_folio(page));
 
+        /*
+         * Note: we don't expect anonymous compound pages yet. Once supported
+         * and we could PTE-map them similar to THP, we'd have to clear
+         * PG_anon_exclusive on all tail pages.
+         */
+        VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+        if (PageAnon(page)) {
+                __ClearPageAnonExclusive(page);
+                wmb(); /* avoid PageSlab() false positives */
+        }
+
         /*
          * When a device_private page is freed, the page->mapping field
          * may still contain a (stale) mapping value. For example, the
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..e59e68a61d1a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1297,6 +1297,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 {
         int bad = 0;
         bool skip_kasan_poison = should_skip_kasan_poison(page, fpi_flags);
+        const bool anon = PageAnon(page);
 
         VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -1329,6 +1330,14 @@ static __always_inline bool free_pages_prepare(struct page *page,
                         ClearPageHasHWPoisoned(page);
                 }
                 for (i = 1; i < (1 << order); i++) {
+                        /*
+                         * Freeing a previously PTE-mapped THP can have
+                         * PG_anon_exclusive set on tail pages. Clear it
+                         * manually as it's overloaded with PG_slab that we
+                         * want to catch in check_free_page().
+                         */
+                        if (anon)
+                                __ClearPageAnonExclusive(page + i);
                         if (compound)
                                 bad += free_tail_pages_check(page, page + i);
                         if (unlikely(check_free_page(page + i))) {
@@ -1338,6 +1347,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
                         (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
                 }
         }
+        if (anon) {
+                __ClearPageAnonExclusive(page);
+                wmb(); /* avoid PageSlab() false positives */
+        }
         if (PageMappingFlags(page))
                 page->mapping = NULL;
         if (memcg_kmem_enabled() && PageMemcgKmem(page))
The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared?

In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
don't grow on trees, so we have to get a little creative for the time
being.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages, so we're left
with reusing PG_slab for our purpose: for PageAnon() pages, PG_slab now
translates to PG_anon_exclusive; teach some in-kernel code that manually
handles PG_slab about that.

Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
documentation will be contained in the code that actually makes use of
PG_anon_exclusive.

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; page freeing code will handle that when
also invalidating page->mapping to not indicate PageAnon() anymore.
Letting information about exclusivity stick around will be an important
property when adding sanity checks to unpinning code.

RFC notes: in-tree tools/cgroup/memcg_slabinfo.py looks like it might need
           some care. We'd have to look up the head page and check if
           PageAnon() is set. Similarly, tools living outside the kernel
           repository like crash and makedumpfile might need adaptions.

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/page.c                 |   3 +-
 include/linux/page-flags.h     | 124 ++++++++++++++++++++++++++++++++-
 include/trace/events/mmflags.h |   2 +-
 mm/hugetlb.c                   |   4 ++
 mm/memory-failure.c            |  24 +++++--
 mm/memremap.c                  |  11 +++
 mm/page_alloc.c                |  13 ++++
 7 files changed, 170 insertions(+), 11 deletions(-)
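With the tools/vm/page-types.c hunk applied, the overloaded bit could
plausibly be inspected from userspace like this (hypothetical
invocation, untested; the flag name matches the "d:anon_exclusive"
entry the patch adds):

$ cd tools/vm && make page-types
$ sudo ./page-types -b anon_exclusive -l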