Message ID | 20220308141437.144919-1-david@redhat.com |
---|---|
Series | mm: COW fixes part 2: reliable GUP pins of anonymous pages |
On 08.03.22 15:14, David Hildenbrand wrote: > This series is the result of the discussion on the previous approach [2]. > More information on the general COW issues can be found there. It is based > on v5.17-rc7 and [1], which resides in -mm and -next: > [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for > THP and swap > > v1 is located at: > https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v1 > > This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken > on an anonymous page and COW logic fails to detect exclusivity of the page > to then replacing the anonymous page by a copy in the page table: The > GUP pin lost synchronicity with the pages mapped into the page tables. > > This issue, including other related COW issues, has been summarized in [3] > under 3): > " > 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) > > page_maybe_dma_pinned() is used to check if a page may be pinned for > DMA (using FOLL_PIN instead of FOLL_GET). While false positives are > tolerable, false negatives are problematic: pages that are pinned for > DMA must not be added to the swapcache. If it happens, the (now pinned) > page could be faulted back from the swapcache into page tables > read-only. Future write-access would detect the pinning and COW the > page, losing synchronicity. For the interested reader, this is nicely > documented in feb889fb40fa ("mm: don't put pinned pages into the swap > cache"). > > Peter reports [8] that page_maybe_dma_pinned() as used is racy in some > cases and can result in a violation of the documented semantics: > giving false negatives because of the race. > > There are cases where we call it without properly taking a per-process > sequence lock, turning the usage of page_maybe_dma_pinned() racy. While > one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to > handle, there is especially one rmap case (shrink_page_list) that's hard > to fix: in the rmap world, we're not limited to a single process. > > The shrink_page_list() issue is really subtle. If we race with > someone pinning a page, we can trigger the same issue as in the FOLL_GET > case. See the detail section at the end of this mail on a discussion how > bad this can bite us with VFIO or other FOLL_PIN user. > > It's harder to reproduce, but I managed to modify the O_DIRECT > reproducer to use io_uring fixed buffers [15] instead, which ends up > using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can > similarly trigger a loss of synchronicity and consequently a memory > corruption. > > Again, the root issue is that a write-fault on a page that has > additional references results in a COW and thereby a loss of > synchronicity and consequently a memory corruption if two parties > believe they are referencing the same page. > " > > This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, > especially also taking care of concurrent pinning via GUP-fast, > for example, also fully fixing an issue reported regarding NUMA > balancing [4] recently. While doing that, it further reduces "unnecessary > COWs", especially when we don't fork()/KSM and don't swapout, and fixes the > COW security for hugetlb for FOLL_PIN. > > In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped > anonymous page is exclusive. Exclusive anonymous pages that are mapped > R/O can directly be mapped R/W by the COW logic in the write fault handler. 
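The io_uring fixed-buffer reproducer mentioned above is also a convenient way to see where such a long-term pin comes from in the first place. The sketch below is not the reproducer from [15]; it only shows the registration step that, per the cover letter, takes the FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM pin on ordinary anonymous pages. It assumes liburing is installed; the buffer size and file path are arbitrary.

```c
/*
 * Hypothetical demo, not the reproducer from [15]: registering an io_uring
 * fixed buffer is one way userspace ends up with a long-term GUP pin
 * (FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM) on anonymous pages.
 * Build with:  cc demo.c -luring
 */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 4 * 4096;		/* arbitrary buffer size */
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	iov.iov_base = buf;
	iov.iov_len = len;
	/* Registration is where the kernel pins the backing anonymous pages. */
	if (io_uring_register_buffers(&ring, &iov, 1) < 0)
		return 1;

	/* I/O through the fixed buffer goes straight to the pinned pages. */
	fd = open("/proc/version", O_RDONLY);	/* any readable file works */
	if (fd < 0)
		return 1;
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, buf, 4096, 0, 0);
	if (io_uring_submit(&ring) == 1 && io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read %d bytes into the pinned buffer\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	/* Unregistering (or exiting) drops the pin again. */
	io_uring_unregister_buffers(&ring);
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}
```

While such a pin is held, the guarantee this series is after is exactly that the pages the ring does I/O into remain the pages the process sees through its page tables.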
> Exclusive anonymous pages that want to be shared (fork(), KSM) first have > to mark a mapped anonymous page shared -- which will fail if there are > GUP pins on the page. GUP is only allowed to take a pin on anonymous pages > that is exclusive. The PT lock is the primary mechanism to synchronize > modifications of PG_anon_exclusive. GUP-fast is synchronized either via the > src_mm->write_protect_seq or via clear/invalidate+flush of the relevant > page table entry. > > Special care has to be taken about swap, migration, and THPs (whereby a > PMD-mapping can be converted to a PTE mapping and we have to track > information for subpages). Besides these, we let the rmap code handle most > magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE > logic as part of our previous approach [2], however, it's now 100% mapcount > free and I further simplified it a bit. > > #1 is a fix > #3-#9 are mostly rmap preparations for PG_anon_exclusive handling > #10 introduces PG_anon_exclusive > #11 uses PG_anon_exclusive and make R/W pins of anonymous pages > reliable > #12 is a preparation for reliable R/O pins > #13 and #14 is reused/modified GUP-triggered unsharing for R/O GUP pins > make R/O pins of anonymous pages reliable > #15 adds sanity check when (un)pinning anonymous pages > > I'm not proud about patch #10, suggestions welcome. Patch #11 contains > excessive explanations and the main logic for R/W pins. #12 and #13 > resemble what we proposed in the previous approach [2]. I consider the > general approach of #15 very nice and helpful, and I remember Linus even > envisioning something like that for finding BUGs, although we might want to > implement the sanity checks eventually differently > > It passes my growing set of tests for "wrong COW" and "missed COW", > including the ones in [3] -- I'd really appreciate some experienced eyes > to take a close look at corner cases. > > I'm planning on sending a part 3 that will remember PG_anon_exclusive for > ordinary swap entries: this will make FOLL_GET | FOLL_WRITE references > reliable and fix the memory corruptions for O_DIRECT -- as described in > [3] under 2) -- as well, as long as there is no fork(). > > The long term goal should be to convert relevant users of FOLL_GET to > FOLL_PIN, however, with part3 it wouldn't be required to fix the obvious > memory corruptions we are aware of. Once that's in place we can streamline > our COW logic for hugetlb to rely on page_count() as well and fix any > possible COW security issues. > > [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com > [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com > [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com > [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 > > > RFC -> v1: > * Rephrased/extended some patch descriptions+comments > * Tested on aarch64, ppc64 and x86_64 > * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type" > -> Added > * "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()" > -> Added > * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive" > -> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily > dropping the PT lock. 
> -> Use "reuse" label in __do_huge_pmd_anonymous_page() > -> Slightly simplify logic in hugetlb_cow() > -> In remove_migration_pte(), remove unrelated changes around > page_remove_rmap() > * "mm: support GUP-triggered unsharing of anonymous pages" > -> In handle_pte_fault(), trigger pte_mkdirty() only with > FAULT_FLAG_WRITE > -> In __handle_mm_fault(), extend comment regarding anonymous PUDs > * "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared > anonymous page" > -> Added unsharing logic to gup_hugepte() and gup_huge_pud() > -> Changed return logic in __follow_hugetlb_must_fault(), making sure > that "unshare" is always set > * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning" > -> Slightly simplified sanity_check_pinned_pages() > > David Hildenbrand (15): > mm/rmap: fix missing swap_free() in try_to_unmap() after > arch_unmap_one() failed > mm/hugetlb: take src_mm->write_protect_seq in > copy_hugetlb_page_range() > mm/memory: slightly simplify copy_present_pte() > mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and > page_try_dup_anon_rmap() > mm/rmap: convert RMAP flags to a proper distinct rmap_t type > mm/rmap: remove do_page_add_anon_rmap() > mm/rmap: pass rmap flags to hugepage_add_anon_rmap() > mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() > mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() > page exclusively > mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages > mm: remember exclusively mapped anonymous pages with PG_anon_exclusive > mm/gup: disallow follow_page(FOLL_PIN) > mm: support GUP-triggered unsharing of anonymous pages > mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared > anonymous page > mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning > > fs/proc/page.c | 3 +- > include/linux/mm.h | 46 ++++++- > include/linux/mm_types.h | 8 ++ > include/linux/page-flags.h | 124 +++++++++++++++++- > include/linux/rmap.h | 109 ++++++++++++++-- > include/linux/swap.h | 15 ++- > include/linux/swapops.h | 25 ++++ > include/trace/events/mmflags.h | 2 +- > kernel/events/uprobes.c | 2 +- > mm/gup.c | 103 ++++++++++++++- > mm/huge_memory.c | 122 +++++++++++++----- > mm/hugetlb.c | 137 ++++++++++++++------ > mm/khugepaged.c | 2 +- > mm/ksm.c | 15 ++- > mm/memory-failure.c | 24 +++- > mm/memory.c | 221 ++++++++++++++++++++------------- > mm/memremap.c | 11 ++ > mm/migrate.c | 40 +++++- > mm/mprotect.c | 8 +- > mm/page_alloc.c | 13 ++ > mm/rmap.c | 95 ++++++++++---- > mm/swapfile.c | 4 +- > mm/userfaultfd.c | 2 +- > 23 files changed, 904 insertions(+), 227 deletions(-) > Full diff to RFC: diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c4957bac4868..1bc522d28a78 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -158,23 +158,37 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *page_get_anon_vma(struct page *page); -/* bitflags for do_page_add_anon_rmap() */ -#define RMAP_EXCLUSIVE 0x01 -#define RMAP_COMPOUND 0x02 +/* RMAP flags, currently only relevant for some anon rmap operations. */ +typedef int __bitwise rmap_t; + +/* + * No special request: if the page is a subpage of a compound page, it is + * mapped via a PTE. The mapped (sub)page is possibly shared between processes. + */ +#define RMAP_NONE ((__force rmap_t)0) + +/* The (sub)page is exclusive to a single process. 
*/ +#define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0)) + +/* + * The compound page is not mapped via PTEs, but instead via a single PMD and + * should be accounted accordingly. + */ +#define RMAP_COMPOUND ((__force rmap_t)BIT(1)) /* * rmap interfaces called when adding or removing pte of page */ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, int); + unsigned long, rmap_t); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); + unsigned long); void page_add_file_rmap(struct page *, bool); void page_remove_rmap(struct page *, bool); void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, int); + unsigned long, rmap_t); void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6357c3580d07..b6fdb23fb3ea 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) { get_page(new_page); - page_add_new_anon_rmap(new_page, vma, addr, false); + page_add_new_anon_rmap(new_page, vma, addr); lru_cache_add_inactive_or_unevictable(new_page, vma); } else /* no new page, just dec_mm_counter for old_page */ diff --git a/mm/gup.c b/mm/gup.c index df1ae29a1f5f..b36f02f2b720 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -50,30 +50,27 @@ static inline void sanity_check_pinned_pages(struct page **pages, { #ifdef CONFIG_DEBUG_VM /* - * We expect to only succeed in pinning anonymous pages if they are - * exclusive. Once pinned, we can no longer mark them shared and the - * PageAnonExclusive() will stick around until the page is freed - * after it was unmapped. + * We only pin anonymous pages if they are exclusive. Once pinned, we + * can no longer turn them possibly shared and PageAnonExclusive() will + * stick around until the page is freed. * - * We'd like to verify that our pinned anonymous pages are mapped + * We'd like to verify that our pinned anonymous pages are still mapped * exclusively. The issue with anon THP is that we don't know how * they are/were mapped when pinning them. However, for anon * THP we can assume that either the given page (PTE-mapped THP) or * the head page (PMD-mapped THP) should be PageAnonExclusive(). If - * neither is the case there is certainly something wrong. + * neither is the case, there is certainly something wrong. */ for (; npages; npages--, pages++) { struct page *page = *pages; struct page *head = compound_head(page); - if (!PageAnon(page)) + if (!PageAnon(head)) continue; - if (!PageCompound(page)) - VM_BUG_ON_PAGE(!PageAnonExclusive(page), page); - else if (PageHuge(page)) + if (!PageCompound(head) || PageHuge(head)) VM_BUG_ON_PAGE(!PageAnonExclusive(head), page); else - /* Either PTE-mapped or PMD-mapped, we don't know. */ + /* Either a PTE-mapped or a PMD-mapped THP. */ VM_BUG_ON_PAGE(!PageAnonExclusive(head) && !PageAnonExclusive(page), page); } @@ -449,7 +446,7 @@ static void unpin_user_pages_lockless(struct page **pages, unsigned long npages) /* * Don't perform any sanity checks because we might have raced with - * fork() and some anonymous pages might no actually be shared -- + * fork() and some anonymous pages might now actually be shared -- * which is why we're unpinning after all. 
*/ for_each_compound_head(index, pages, npages, head, ntails) @@ -2607,6 +2604,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, return 0; } + if (!pte_write(pte) && gup_must_unshare(flags, head)) { + put_compound_head(head, refs, flags); + return 0; + } + *nr += refs; SetPageReferenced(head); return 1; @@ -2706,6 +2708,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, return 0; } + if (!pud_write(orig) && gup_must_unshare(flags, head)) { + put_compound_head(head, refs, flags); + return 0; + } + *nr += refs; SetPageReferenced(head); return 1; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 72ed42d0ef3c..d9559341a5f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -647,7 +647,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, entry = mk_huge_pmd(page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - page_add_new_anon_rmap(page, vma, haddr, true); + page_add_new_anon_rmap(page, vma, haddr); lru_cache_add_inactive_or_unevictable(page, vma); pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); @@ -1296,22 +1296,9 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page = pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); - if (PageAnonExclusive(page)) { - pmd_t entry; - - /* R/O and exclusive, nothing to do. */ - if (unlikely(unshare)) { - spin_unlock(vmf->ptl); - return 0; - } - - entry = pmd_mkyoung(orig_pmd); - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) - update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - spin_unlock(vmf->ptl); - return VM_FAULT_WRITE; - } + /* Early check when only holding the PT lock. */ + if (PageAnonExclusive(page)) + goto reuse; if (!trylock_page(page)) { get_page(page); @@ -1327,6 +1314,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) put_page(page); } + /* Recheck after temporarily dropping the PT lock. */ + if (PageAnonExclusive(page)) { + unlock_page(page); + goto reuse; + } + /* * See do_wp_page(): we can only reuse the page exclusively if there are * no additional references. Note that we always drain the LRU @@ -1340,8 +1333,9 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) pmd_t entry; page_move_anon_rmap(page, vma); + unlock_page(page); +reuse: if (unlikely(unshare)) { - unlock_page(page); spin_unlock(vmf->ptl); return 0; } @@ -1349,7 +1343,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - unlock_page(page); spin_unlock(vmf->ptl); return VM_FAULT_WRITE; } @@ -2376,8 +2369,8 @@ static void __split_huge_page_tail(struct page *head, int tail, * After successful get_page_unless_zero() might follow flags change, * for example lock_page() which set PG_waiters. * - * Keep PG_anon_exclusive information already maintained for all - * subpages of a compound page untouched. + * Keep PG_anon_exclusive information, already maintained for all + * subpages of a compound page, untouched. 
*/ page_tail->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~PG_anon_exclusive); page_tail->flags |= (head->flags & diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7bac6ed93fea..0d150d100111 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5169,14 +5169,14 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, * owner and can reuse this page. */ if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { - if (unlikely(unshare) && PageAnonExclusive(old_page)) - return 0; - page_move_anon_rmap(old_page, vma); + if (!PageAnonExclusive(old_page)) + page_move_anon_rmap(old_page, vma); if (likely(!unshare)) set_huge_ptep_writable(vma, haddr, ptep); return 0; } - VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), old_page); + VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), + old_page); /* * If the process that created a MAP_PRIVATE mapping is about to @@ -5968,9 +5968,11 @@ static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte, return false; if (flags & FOLL_WRITE) return true; - if (gup_must_unshare(flags, pte_page(pteval))) + if (gup_must_unshare(flags, pte_page(pteval))) { *unshare = true; - return *unshare; + return true; + } + return false; } long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, @@ -5987,7 +5989,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, while (vaddr < vma->vm_end && remainder) { pte_t *pte; spinlock_t *ptl = NULL; - bool unshare; + bool unshare = false; int absent; struct page *page; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a325a646be33..96cc903c4788 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1183,7 +1183,7 @@ static void collapse_huge_page(struct mm_struct *mm, spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); - page_add_new_anon_rmap(new_page, vma, address, true); + page_add_new_anon_rmap(new_page, vma, address); lru_cache_add_inactive_or_unevictable(new_page, vma); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); diff --git a/mm/ksm.c b/mm/ksm.c index c380c36f3ebd..350b46432d65 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1164,7 +1164,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, */ if (!is_zero_pfn(page_to_pfn(kpage))) { get_page(kpage); - page_add_anon_rmap(kpage, vma, addr, 0); + page_add_anon_rmap(kpage, vma, addr, RMAP_NONE); newpte = mk_pte(kpage, vma->vm_page_prot); } else { newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage), diff --git a/mm/memory.c b/mm/memory.c index 33d046c34819..94355bb34d8c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -899,7 +899,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma *prealloc = NULL; copy_user_highpage(new_page, page, addr, src_vma); __SetPageUptodate(new_page); - page_add_new_anon_rmap(new_page, dst_vma, addr, false); + page_add_new_anon_rmap(new_page, dst_vma, addr); lru_cache_add_inactive_or_unevictable(new_page, dst_vma); rss[mm_counter(new_page)]++; @@ -2948,6 +2948,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf) struct page *page = vmf->page; pte_t entry; + VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE)); VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page)); /* @@ -3068,7 +3069,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * some TLBs while the old PTE remains in others. 
*/ ptep_clear_flush_notify(vma, vmf->address, vmf->pte); - page_add_new_anon_rmap(new_page, vma, vmf->address, false); + page_add_new_anon_rmap(new_page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(new_page, vma); /* * We call the notify macro here because, when using secondary @@ -3282,8 +3283,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); if (!vmf->page) { - /* No anonymous page -> nothing to do. */ if (unlikely(unshare)) { + /* No anonymous page -> nothing to do. */ pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } @@ -3565,10 +3566,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct page *page = NULL, *swapcache; struct swap_info_struct *si = NULL; + rmap_t rmap_flags = RMAP_NONE; swp_entry_t entry; pte_t pte; int locked; - int exclusive = 0; vm_fault_t ret = 0; void *shadow = NULL; @@ -3744,7 +3745,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->flags &= ~FAULT_FLAG_WRITE; ret |= VM_FAULT_WRITE; } - exclusive = RMAP_EXCLUSIVE; + rmap_flags |= RMAP_EXCLUSIVE; } flush_icache_page(vma, page); if (pte_swp_soft_dirty(vmf->orig_pte)) @@ -3757,10 +3758,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* ksm created a completely new copy */ if (unlikely(page != swapcache && swapcache)) { - page_add_new_anon_rmap(page, vma, vmf->address, false); + page_add_new_anon_rmap(page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(page, vma); } else { - page_add_anon_rmap(page, vma, vmf->address, exclusive); + page_add_anon_rmap(page, vma, vmf->address, rmap_flags); } VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page))); @@ -3908,7 +3909,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) } inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, vma, vmf->address, false); + page_add_new_anon_rmap(page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(page, vma); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); @@ -4085,7 +4086,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) /* copy-on-write page */ if (write && !(vma->vm_flags & VM_SHARED)) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, vma, addr, false); + page_add_new_anon_rmap(page, vma, addr); lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); @@ -4682,7 +4683,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { if (!pte_write(entry)) return do_wp_page(vmf); - entry = pte_mkdirty(entry); + else if (likely(vmf->flags & FAULT_FLAG_WRITE)) + entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry, @@ -4746,8 +4748,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, barrier(); if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) { - /* NUMA case for anonymous PUDs would go here */ - + /* + * TODO once we support anonymous PUDs: NUMA case and + * FAULT_FLAG_UNSHARE handling. 
+ */ if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) { ret = wp_huge_pud(&vmf, orig_pud); if (!(ret & VM_FAULT_FALLBACK)) diff --git a/mm/migrate.c b/mm/migrate.c index 63a3241041e1..7f440d2103ce 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -188,7 +188,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(PageTail(page), page); while (page_vma_mapped_walk(&pvmw)) { - int rmap_flags = 0; + rmap_t rmap_flags = RMAP_NONE; if (PageKsm(page)) new = page; @@ -2357,12 +2357,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, set_pte_at(mm, addr, ptep, swp_pte); /* - * This is like regular unmap: we remove the - * rmap and drop page refcount. Page won't be - * freed, as we took a reference just above. + * This is like regular unmap: we remove the rmap and + * drop page refcount. Page won't be freed, as we took + * a reference just above. */ - put_page(page); page_remove_rmap(page, false); + put_page(page); if (pte_present(pte)) unmapped++; @@ -2753,7 +2753,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto unlock_abort; inc_mm_counter(mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, vma, addr, false); + page_add_new_anon_rmap(page, vma, addr); if (!is_zone_device_page(page)) lru_cache_add_inactive_or_unevictable(page, vma); get_page(page); diff --git a/mm/rmap.c b/mm/rmap.c index cc54bd1cce61..9d2a7e11e8cc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1141,8 +1141,8 @@ static void __page_check_anon_rmap(struct page *page, * and to ensure that PageAnon is not being upgraded racily to PageKsm * (but PageKsm is never downgraded to PageAnon). */ -void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, - unsigned long address, int flags) +void page_add_anon_rmap(struct page *page, + struct vm_area_struct *vma, unsigned long address, rmap_t flags) { bool compound = flags & RMAP_COMPOUND; bool first; @@ -1185,25 +1185,28 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, /* address might be in next vma when migration races vma_adjust */ if (first) __page_set_anon_rmap(page, vma, address, - flags & RMAP_EXCLUSIVE); + !!(flags & RMAP_EXCLUSIVE)); else __page_check_anon_rmap(page, vma, address); } /** - * page_add_new_anon_rmap - add pte mapping to a new anonymous page + * page_add_new_anon_rmap - add mapping to a new anonymous page * @page: the page to add the mapping to * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped - * @compound: charge the page as compound or small page + * + * If it's a compound page, it is accounted as a compound page. As the page + * is new, it's assume to get mapped exclusively by a single process. * * Same as page_add_anon_rmap but must only be called on *new* pages. * This means the inc-and-test can be bypassed. * Page does not have to be locked. */ void page_add_new_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address, bool compound) + struct vm_area_struct *vma, unsigned long address) { + const bool compound = PageCompound(page); int nr = compound ? 
thp_nr_pages(page) : 1; VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); @@ -1639,6 +1642,16 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + /* + * Note: We *don't* remember yet if the page was mapped + * exclusively in the swap entry, so swapin code has + * to re-determine that manually and might detect the + * page as possibly shared, for example, if there are + * other references on the page or if the page is under + * writeback. We made sure that there are no GUP pins + * on the page that would rely on it, so for GUP pins + * this is fine. + */ if (list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); if (list_empty(&mm->mmlist)) @@ -2450,7 +2463,7 @@ void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc) * RMAP_COMPOUND is ignored. */ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, - unsigned long address, int flags) + unsigned long address, rmap_t flags) { struct anon_vma *anon_vma = vma->anon_vma; int first; @@ -2462,7 +2475,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) - __page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE); + __page_set_anon_rmap(page, vma, address, + flags & RMAP_EXCLUSIVE); } void hugepage_add_new_anon_rmap(struct page *page, diff --git a/mm/swapfile.c b/mm/swapfile.c index eacda7ca7ac9..7edc8e099b22 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1800,9 +1800,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, inc_mm_counter(vma->vm_mm, MM_ANONPAGES); get_page(page); if (page == swapcache) { - page_add_anon_rmap(page, vma, addr, 0); + page_add_anon_rmap(page, vma, addr, RMAP_NONE); } else { /* ksm created a completely new copy */ - page_add_new_anon_rmap(page, vma, addr, false); + page_add_new_anon_rmap(page, vma, addr); lru_cache_add_inactive_or_unevictable(page, vma); } set_pte_at(vma->vm_mm, addr, pte, diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..4ca854ce14f0 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -98,7 +98,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, if (page_in_cache) page_add_file_rmap(page, false); else - page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + page_add_new_anon_rmap(page, dst_vma, dst_addr); /* * Must happen after rmap, as mm_counter() checks mapping (via
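The rmap_t conversion near the top of the diff leans on sparse's __bitwise annotations rather than introducing any runtime change: rmap_t is still a plain int once compiled. The standalone illustration below mirrors what the kernel defines under __CHECKER__ (the real definitions live in include/linux/compiler_types.h); it is an illustration only, not kernel code.

```c
/*
 * Standalone illustration of the __bitwise typedef pattern used for rmap_t
 * above.  gcc compiles it silently; running sparse on it flags the commented
 * call that mixes a plain int with rmap_t, which is the point of the change.
 */
#ifdef __CHECKER__
#define __bitwise	__attribute__((bitwise))
#define __force		__attribute__((force))
#else
#define __bitwise
#define __force
#endif

typedef int __bitwise rmap_t;

#define RMAP_NONE	((__force rmap_t)0)
#define RMAP_EXCLUSIVE	((__force rmap_t)(1 << 0))	/* BIT(0) in the kernel */
#define RMAP_COMPOUND	((__force rmap_t)(1 << 1))	/* BIT(1) in the kernel */

/* Stand-in for page_add_anon_rmap(); only the flags argument matters here. */
static void add_anon_rmap(rmap_t flags)
{
	(void)flags;
}

int main(void)
{
	add_anon_rmap(RMAP_NONE);
	add_anon_rmap(RMAP_EXCLUSIVE | RMAP_COMPOUND);	/* combining flags is fine */
	/* add_anon_rmap(1); -- sparse would warn: expected rmap_t, got int */
	return 0;
}
```

The same mechanism already backs gfp_t and similar flag types, so misuse of rmap_t shows up through the existing `make C=1` sparse workflow rather than needing new tooling.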
On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <david@redhat.com> wrote: > > This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken > on an anonymous page and COW logic fails to detect exclusivity of the page > to then replacing the anonymous page by a copy in the page table [...] From a cursory scan of the patches, this looks sane. I'm not sure what the next step should be, but I really would like the people who do a lot of pinning stuff to give it a good shake-down. Including both looking at the patches, but very much actually running it on whatever test-cases etc you people have. Please? Linus
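The test cases Linus asks for are the "wrong COW"/"missed COW" selftests referenced in the cover letter and are not included here. For readers who just want to watch COW-versus-reuse behaviour from userspace, /proc/self/pagemap exposes the PFN backing a virtual page (to privileged readers), so comparing it before and after a write fault shows whether the page was reused in place or replaced by a copy. The sketch below is a rough illustration of that idea, not one of the selftests in this thread; it assumes 4 KiB pages and must run as root to see real PFNs.

```c
/*
 * Rough sketch (not one of the selftests in this thread): observe whether a
 * write fault reused a page in place or COW'ed it, by comparing the backing
 * PFN from /proc/self/pagemap before and after the write.  PFNs are only
 * visible with CAP_SYS_ADMIN, so run as root.  Assumes 4 KiB pages.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static uint64_t pfn_of(int pagemap_fd, void *addr)
{
	uint64_t entry;
	off_t off = ((uintptr_t)addr / 4096) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return 0;
	return entry & ((1ULL << 55) - 1);	/* bits 0-54: PFN of a present page */
}

int main(void)
{
	int fd = open("/proc/self/pagemap", O_RDONLY);
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pid_t child;

	if (fd < 0 || p == MAP_FAILED)
		return 1;

	p[0] = 1;				/* fault in a fresh anonymous page */
	uint64_t before = pfn_of(fd, p);

	/*
	 * fork() write-protects the page in both processes; with the child
	 * still mapping it, the parent's next write takes a COW fault.
	 */
	child = fork();
	if (child == 0) {
		sleep(1);			/* keep the page shared meanwhile */
		_exit(0);
	}
	p[0] = 2;				/* write fault in the parent */
	uint64_t after = pfn_of(fd, p);

	printf("pfn before=0x%llx after=0x%llx -> %s\n",
	       (unsigned long long)before, (unsigned long long)after,
	       before == after ? "reused in place" : "replaced by a copy (COW)");
	waitpid(child, NULL, 0);
	return 0;
}
```

Whether a given scenario reuses or copies depends on which of these COW fixes the running kernel carries, so the output is only meaningful relative to a specific tree, such as the branch linked later in this thread.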
On 08.03.22 22:22, Linus Torvalds wrote: > On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <david@redhat.com> wrote: >> >> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken >> on an anonymous page and COW logic fails to detect exclusivity of the page >> to then replacing the anonymous page by a copy in the page table [...] > > From a cursory scan of the patches, this looks sane. Thanks for skimming over the patches that quickly! > > I'm not sure what the next step should be, but I really would like the > people who do a lot of pinning stuff to give it a good shake-down. > Including both looking at the patches, but very much actually running > it on whatever test-cases etc you people have. > > Please? My proposal would be to pull it into -next early after we have v5.18-rc1. I expect some minor clashes with folio changes that should go in in the next merge window, so I'll have to rebase+resend either way, and I'm planning on thoroughly testing at least on s390x as well. We'd then have plenty of time to further review+test while in -next until the v5.19 merge window opens up. By that time I should also have my selftests cleaned up and ready, and part 3 ready to improve the situation for FOLL_GET|FOLL_WRITE until we have the full FOLL_GET->FOLL_PIN conversion from John (I'll most probably sent out an early RFC of part 3 soonish). So we *might* be able to have everything fixed in v5.19. Last but not least, tools/cgroup/memcg_slabinfo.py as mentioned in patch #10 still needs care due to the PG_slab reuse, but I consider that a secondary concern (yet, it should be fixed and help from the Authors would be appreciated ;) ).
On Wed, Mar 9, 2022 at 10:00 AM David Hildenbrand <david@redhat.com> wrote: > > On 08.03.22 22:22, Linus Torvalds wrote: > > On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken > >> on an anonymous page and COW logic fails to detect exclusivity of the page > >> to then replacing the anonymous page by a copy in the page table [...] > > > > From a cursory scan of the patches, this looks sane. > > Thanks for skimming over the patches that quickly! > > > > > I'm not sure what the next step should be, but I really would like the > > people who do a lot of pinning stuff to give it a good shake-down. > > Including both looking at the patches, but very much actually running > > it on whatever test-cases etc you people have. > > > > Please? I can take this patch-set and test it in our data-center with all the DL workloads we are running on Gaudi. David, Any chance you can prepare me a branch with your patch-set based on 5.17-rc7 ? I prefer to take a stable kernel and not 5.18-rc1 as this is going to run on hundreds of machines. Thanks, Oded > > My proposal would be to pull it into -next early after we have > v5.18-rc1. I expect some minor clashes with folio changes that should go > in in the next merge window, so I'll have to rebase+resend either way, > and I'm planning on thoroughly testing at least on s390x as well. > > We'd then have plenty of time to further review+test while in -next > until the v5.19 merge window opens up. > > By that time I should also have my selftests cleaned up and ready, and > part 3 ready to improve the situation for FOLL_GET|FOLL_WRITE until we > have the full FOLL_GET->FOLL_PIN conversion from John (I'll most > probably sent out an early RFC of part 3 soonish). So we *might* be able > to have everything fixed in v5.19. > > Last but not least, tools/cgroup/memcg_slabinfo.py as mentioned in patch > #10 still needs care due to the PG_slab reuse, but I consider that a > secondary concern (yet, it should be fixed and help from the Authors > would be appreciated ;) ). > > -- > Thanks, > > David / dhildenb >
On 10.03.22 12:13, Oded Gabbay wrote: > On Wed, Mar 9, 2022 at 10:00 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 08.03.22 22:22, Linus Torvalds wrote: >>> On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken >>>> on an anonymous page and COW logic fails to detect exclusivity of the page >>>> to then replacing the anonymous page by a copy in the page table [...] >>> >>> From a cursory scan of the patches, this looks sane. >> >> Thanks for skimming over the patches that quickly! >> >>> >>> I'm not sure what the next step should be, but I really would like the >>> people who do a lot of pinning stuff to give it a good shake-down. >>> Including both looking at the patches, but very much actually running >>> it on whatever test-cases etc you people have. >>> >>> Please? > > I can take this patch-set and test it in our data-center with all the > DL workloads we are running > on Gaudi. Perfect! > > David, > Any chance you can prepare me a branch with your patch-set based on 5.17-rc7 ? > I prefer to take a stable kernel and not 5.18-rc1 as this is going to > run on hundreds of machines. https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v1 contains the state of this series, based on v5.17-rc7 and part 1. I'll be leaving that branch unmodified when working on newer versions of the patch set.