Message ID | 20220315104741.63071-12-david@redhat.com
---|---
State | New
Series | mm: COW fixes part 2: reliable GUP pins of anonymous pages
On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as exclusive, and use that information to make GUP pins reliable and stay consistent with the page mapped into the page table even if the page table entry gets write-protected.
>
> With that information at hand, we can extend our COW logic to always reuse anonymous pages that are exclusive. For anonymous pages that might be shared, the existing logic applies.
>
> As already documented, PG_anon_exclusive is usually only expressive in combination with a page table entry. Especially PTE vs. PMD-mapped anonymous pages require more thought, some examples: due to mremap() we can easily have a single compound page PTE-mapped into multiple page tables exclusively in a single process -- multiple page table locks apply. Further, due to MADV_WIPEONFORK we might not necessarily write-protect all PTEs, and only some subpages might be pinned. Long story short: once PTE-mapped, we have to track information about exclusivity per sub-page, but until then, we can just track it for the compound page in the head page and not have to update a whole bunch of subpages all of the time for a simple PMD mapping of a THP.
>
> For simplicity, this commit mostly talks about "anonymous pages", while for THP it's actually "the part of an anonymous folio referenced via a page table entry".
>
> To not spill PG_anon_exclusive code all over the mm code-base, we let the anon rmap code handle all PG_anon_exclusive logic it can easily handle.
>
> If a writable, present page table entry points at an anonymous (sub)page, that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page referenced via a present page table entry, it must only pin if PG_anon_exclusive is set for the mapped (sub)page.
>
> This commit doesn't adjust GUP, so this is only implicitly handled for FOLL_WRITE; follow-up commits will teach GUP to also respect it for FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully reliable.
>
> Whenever an anonymous page is to be shared (fork(), KSM), or when temporarily unmapping an anonymous page (swap, migration), the relevant PG_anon_exclusive bit has to be cleared to mark the anonymous page possibly shared. Clearing will fail if there are GUP pins on the page:
> * For fork(), this means having to copy the page and not being able to share it. fork() protects against concurrent GUP using the PT lock and the src_mm->write_protect_seq.
> * For KSM, this means sharing will fail. For swap, this means unmapping will fail. For migration, this means migration will fail early. All three cases protect against concurrent GUP using the PT lock and a proper clear/invalidate+flush of the relevant page table entry.
>
> This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a pinned page gets mapped R/O and the successive write fault ends up replacing the page instead of reusing it. It improves the situation for O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN, if fork() is *not* involved; however, swapout and fork() are still problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP users will fix the issue for them.
>
> I. Details about basic handling
>
> I.1. Fresh anonymous pages
>
> page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the given page exclusive via __page_set_anon_rmap(exclusive=1). As that is the mechanism by which fresh anonymous pages come into life (besides migration code where we copy the page->mapping), all fresh anonymous pages will start out as exclusive.
>
> I.2. COW reuse handling of anonymous pages
>
> When a COW handler stumbles over a (sub)page that's marked exclusive, it simply reuses it. Otherwise, the handler tries harder under page lock to detect if the (sub)page is exclusive and can be reused. If exclusive, page_move_anon_rmap() will mark the given (sub)page exclusive.
>
> Note that hugetlb code does not yet check for PageAnonExclusive(), as it still uses the old COW logic that is prone to the COW security issue, because hugetlb code cannot really tolerate unnecessary/wrong COW as huge pages are a scarce resource.
>
> I.3. Migration handling
>
> try_to_migrate() has to try marking an exclusive anonymous page shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. migrate_vma_collect_pmd() and __split_huge_pmd_locked() are handled similarly.
>
> Writable migration entries implicitly point at shared anonymous pages. For readable migration entries, that information is stored via a new "readable-exclusive" migration entry, specific to anonymous pages.
>
> When restoring a migration entry in remove_migration_pte(), information about exclusivity is detected via the migration entry type, and RMAP_EXCLUSIVE is set accordingly for page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
>
> I.4. Swapout handling
>
> try_to_unmap() has to try marking the mapped page possibly shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. For now, information about exclusivity is lost. In the future, we might want to remember that information in the swap entry in some cases; however, that requires more thought, care, and a way to store the information in swap entries.
>
> I.5. Swapin handling
>
> do_swap_page() will never stumble over exclusive anonymous pages in the swap cache, as try_to_migrate() prohibits that. do_swap_page() always has to detect manually if an anonymous page is exclusive and has to set RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
>
> I.6. THP handling
>
> __split_huge_pmd_locked() has to move the information about exclusivity from the PMD to the PTEs.
>
> a) In case we have a readable-exclusive PMD migration entry, simply insert readable-exclusive PTE migration entries.
>
> b) In case we have a present PMD entry and we don't want to freeze ("convert to migration entries"), simply forward PG_anon_exclusive to all sub-pages; no need to temporarily clear the bit.
>
> c) In case we have a present PMD entry and want to freeze, handle it similar to try_to_migrate(): try marking the page shared first. In case we fail, we ignore the "freeze" instruction and simply split ordinarily. try_to_migrate() will properly fail because the THP is still mapped via PTEs.

How come will try_to_migrate() fail? The afterward pvmw will find those PTEs then convert them to migration entries anyway IIUC.

> When splitting a compound anonymous folio (THP), the information about exclusivity is implicitly handled via the migration entries: no need to replicate PG_anon_exclusive manually.
>
> I.7. fork() handling
>
> fork() handling is relatively easy, because PG_anon_exclusive is only expressive for some page table entry types.
>
> a) Present anonymous pages
>
> page_try_dup_anon_rmap() will mark the given subpage shared -- which will fail if the page is pinned. If it failed, we have to copy (or PTE-map a PMD to handle it on the PTE level).
>
> Note that device exclusive entries are just a pointer at a PageAnon() page. fork() will first convert a device exclusive entry to a present page table entry and handle it just like present anonymous pages.
>
> b) Device private entry
>
> Device private entries point at PageAnon() pages that cannot be mapped directly and, therefore, cannot get pinned.
>
> page_try_dup_anon_rmap() will mark the given subpage shared, which cannot fail because they cannot get pinned.
>
> c) HW poison entries
>
> PG_anon_exclusive will remain untouched and is stale -- the page table entry is just a placeholder after all.
>
> d) Migration entries
>
> Writable and readable-exclusive entries are converted to readable entries: possibly shared.
>
> I.8. mprotect() handling
>
> mprotect() only has to properly handle the new readable-exclusive migration entry:
>
> When write-protecting a migration entry that points at an anonymous page, remember the information about exclusivity via the "readable-exclusive" migration entry type.
>
> II. Migration and GUP-fast
>
> Whenever replacing a present page table entry that maps an exclusive anonymous page by a migration entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible:
>
> 1. try_to_migrate() places a migration entry after checking for GUP pins and marks the page possibly shared.
> 2. GUP-fast pins the page due to lack of synchronization.
> 3. fork() converts the "writable/readable-exclusive" migration entry into a readable migration entry.
> 4. Migration fails due to the GUP pin (failing to freeze the refcount).
> 5. Migration entries are restored. PG_anon_exclusive is lost.
>
> -> We have a pinned page that is not marked exclusive anymore.
>
> Note that we move information about exclusivity from the page to the migration entry as it otherwise highly overcomplicates fork() and PTE-mapping a THP.
>
> III. Swapout and GUP-fast
>
> Whenever replacing a present page table entry that maps an exclusive anonymous page by a swap entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible:
>
> 1. try_to_unmap() places a swap entry after checking for GUP pins and clears exclusivity information on the page.
> 2. GUP-fast pins the page due to lack of synchronization.
>
> -> We have a pinned page that is not marked exclusive anymore.
>
> If we'd ever store information about exclusivity in the swap entry, similar to migration handling, the same considerations as in II would apply. This is future work.
> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > include/linux/rmap.h | 40 +++++++++++++++++++++ > include/linux/swap.h | 15 +++++--- > include/linux/swapops.h | 25 +++++++++++++ > mm/huge_memory.c | 77 +++++++++++++++++++++++++++++++++++++---- > mm/hugetlb.c | 15 +++++--- > mm/ksm.c | 13 ++++++- > mm/memory.c | 33 +++++++++++++----- > mm/migrate.c | 34 ++++++++++++++++-- > mm/mprotect.c | 8 +++-- > mm/rmap.c | 59 ++++++++++++++++++++++++++++--- > 10 files changed, 285 insertions(+), 34 deletions(-) > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 51953bace0a3..1aef834e1d60 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -224,6 +224,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, > { > VM_BUG_ON_PAGE(!PageAnon(page), page); > > + /* > + * No need to check+clear for already shared pages, including KSM > + * pages. > + */ > + if (!PageAnonExclusive(page)) > + goto dup; > + > /* > * If this page may have been pinned by the parent process, > * don't allow to duplicate the mapping but instead require to e.g., > @@ -235,14 +242,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, > unlikely(page_needs_cow_for_dma(vma, page)))) > return -EBUSY; > > + ClearPageAnonExclusive(page); > /* > * It's okay to share the anon page between both processes, mapping > * the page R/O into both processes. > */ > +dup: > __page_dup_rmap(page, compound); > return 0; > } > > +/** > + * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly > + * shared to prepare for KSM or temporary unmapping > + * @page: the exclusive anonymous page to try marking possibly shared > + * > + * The caller needs to hold the PT lock and has to have the page table entry > + * cleared/invalidated+flushed, to properly sync against GUP-fast. > + * > + * This is similar to page_try_dup_anon_rmap(), however, not used during fork() > + * to duplicate a mapping, but instead to prepare for KSM or temporarily > + * unmapping a page (swap, migration) via page_remove_rmap(). > + * > + * Marking the page shared can only fail if the page may be pinned; device > + * private pages cannot get pinned and consequently this function cannot fail. > + * > + * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY > + * otherwise. > + */ > +static inline int page_try_share_anon_rmap(struct page *page) > +{ > + VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page); > + > + /* See page_try_dup_anon_rmap(). */ > + if (likely(!is_device_private_page(page) && > + unlikely(page_maybe_dma_pinned(page)))) > + return -EBUSY; > + > + ClearPageAnonExclusive(page); > + return 0; > +} > + > /* > * Called from mm/vmscan.c to handle paging out > */ > diff --git a/include/linux/swap.h b/include/linux/swap.h > index b546e4bd5c5a..422765d1141c 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -78,12 +78,19 @@ static inline int current_is_kswapd(void) > #endif > > /* > - * NUMA node memory migration support > + * Page migration support. > + * > + * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and > + * indicates that the referenced (part of) an anonymous page is exclusive to > + * a single process. For SWP_MIGRATION_WRITE, that information is implicit: > + * (part of) an anonymous page that are mapped writable are exclusive to a > + * single process. 
> */ > #ifdef CONFIG_MIGRATION > -#define SWP_MIGRATION_NUM 2 > -#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) > -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) > +#define SWP_MIGRATION_NUM 3 > +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) > +#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) > +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2) > #else > #define SWP_MIGRATION_NUM 0 > #endif > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > index d356ab4047f7..06280fc1c99b 100644 > --- a/include/linux/swapops.h > +++ b/include/linux/swapops.h > @@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) > static inline int is_migration_entry(swp_entry_t entry) > { > return unlikely(swp_type(entry) == SWP_MIGRATION_READ || > + swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE || > swp_type(entry) == SWP_MIGRATION_WRITE); > } > > @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry) > return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); > } > > +static inline int is_readable_migration_entry(swp_entry_t entry) > +{ > + return unlikely(swp_type(entry) == SWP_MIGRATION_READ); > +} > + > +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry) > +{ > + return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE); > +} > + > static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) > { > return swp_entry(SWP_MIGRATION_READ, offset); > } > > +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) > +{ > + return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset); > +} > + > static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) > { > return swp_entry(SWP_MIGRATION_WRITE, offset); > @@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) > return swp_entry(0, 0); > } > > +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) > +{ > + return swp_entry(0, 0); > +} > + > static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) > { > return swp_entry(0, 0); > @@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry) > { > return 0; > } > +static inline int is_readable_migration_entry(swp_entry_t entry) > +{ > + return 0; > +} > > #endif > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 0b6fb409b9e4..4872e7120ee1 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > swp_entry_t entry = pmd_to_swp_entry(pmd); > > VM_BUG_ON(!is_pmd_migration_entry(pmd)); > - if (is_writable_migration_entry(entry)) { > + if (!is_readable_migration_entry(entry)) { > entry = make_readable_migration_entry( > swp_offset(entry)); > pmd = swp_entry_to_pmd(entry); > @@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > page = pmd_page(orig_pmd); > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* Early check when only holding the PT lock. */ > + if (PageAnonExclusive(page)) > + goto reuse; > + > if (!trylock_page(page)) { > get_page(page); > spin_unlock(vmf->ptl); > @@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > put_page(page); > } > > + /* Recheck after temporarily dropping the PT lock. 
*/ > + if (PageAnonExclusive(page)) { > + unlock_page(page); > + goto reuse; > + } > + > /* > * See do_wp_page(): we can only map the page writable if there are > * no additional references. Note that we always drain the LRU > @@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > pmd_t entry; > > page_move_anon_rmap(page, vma); > + unlock_page(page); > +reuse: > entry = pmd_mkyoung(orig_pmd); > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) > update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); > - unlock_page(page); > spin_unlock(vmf->ptl); > return VM_FAULT_WRITE; > } > @@ -1741,6 +1752,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION > if (is_swap_pmd(*pmd)) { > swp_entry_t entry = pmd_to_swp_entry(*pmd); > + struct page *page = pfn_swap_entry_to_page(entry); > > VM_BUG_ON(!is_pmd_migration_entry(*pmd)); > if (is_writable_migration_entry(entry)) { > @@ -1749,8 +1761,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > * A protection check is difficult so > * just be safe and disable write > */ > - entry = make_readable_migration_entry( > - swp_offset(entry)); > + if (PageAnon(page)) > + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); > + else > + entry = make_readable_migration_entry(swp_offset(entry)); > newpmd = swp_entry_to_pmd(entry); > if (pmd_swp_soft_dirty(*pmd)) > newpmd = pmd_swp_mksoft_dirty(newpmd); > @@ -1959,6 +1973,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > pgtable_t pgtable; > pmd_t old_pmd, _pmd; > bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; > + bool anon_exclusive = false; > unsigned long addr; > int i; > > @@ -2040,6 +2055,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > entry = pmd_to_swp_entry(old_pmd); > page = pfn_swap_entry_to_page(entry); > write = is_writable_migration_entry(entry); > + if (PageAnon(page)) > + anon_exclusive = is_readable_exclusive_migration_entry(entry); > young = false; > soft_dirty = pmd_swp_soft_dirty(old_pmd); > uffd_wp = pmd_swp_uffd_wp(old_pmd); > @@ -2051,6 +2068,23 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > young = pmd_young(old_pmd); > soft_dirty = pmd_soft_dirty(old_pmd); > uffd_wp = pmd_uffd_wp(old_pmd); > + > + /* > + * Without "freeze", we'll simply split the PMD, propagating the > + * PageAnonExclusive() flag for each PTE by setting it for > + * each subpage -- no need to (temporarily) clear. > + * > + * With "freeze" we want to replace mapped pages by > + * migration entries right away. This is only possible if we > + * managed to clear PageAnonExclusive() -- see > + * set_pmd_migration_entry(). > + * > + * In case we cannot clear PageAnonExclusive(), split the PMD > + * only and let try_to_migrate_one() fail later. 
> + */ > + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > + if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) > + freeze = false; > } > VM_BUG_ON_PAGE(!page_count(page), page); > page_ref_add(page, HPAGE_PMD_NR - 1); > @@ -2074,6 +2108,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > if (write) > swp_entry = make_writable_migration_entry( > page_to_pfn(page + i)); > + else if (anon_exclusive) > + swp_entry = make_readable_exclusive_migration_entry( > + page_to_pfn(page + i)); > else > swp_entry = make_readable_migration_entry( > page_to_pfn(page + i)); > @@ -2085,6 +2122,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > } else { > entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); > entry = maybe_mkwrite(entry, vma); > + if (anon_exclusive) > + SetPageAnonExclusive(page + i); > if (!write) > entry = pte_wrprotect(entry); > if (!young) > @@ -2315,6 +2354,13 @@ static void __split_huge_page_tail(struct page *head, int tail, > * > * After successful get_page_unless_zero() might follow flags change, > * for example lock_page() which set PG_waiters. > + * > + * Note that for mapped sub-pages of an anonymous THP, > + * PG_anon_exclusive has been cleared in unmap_page() and is stored in > + * the migration entry instead from where remap_page() will restore it. > + * We can still have PG_anon_exclusive set on effectively unmapped and > + * unreferenced sub-pages of an anonymous THP: we can simply drop > + * PG_anon_exclusive (-> PG_mappedtodisk) for these here. > */ > page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; > page_tail->flags |= (head->flags & > @@ -3070,6 +3116,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, > struct vm_area_struct *vma = pvmw->vma; > struct mm_struct *mm = vma->vm_mm; > unsigned long address = pvmw->address; > + bool anon_exclusive; > pmd_t pmdval; > swp_entry_t entry; > pmd_t pmdswp; > @@ -3079,10 +3126,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, > > flush_cache_range(vma, address, address + HPAGE_PMD_SIZE); > pmdval = pmdp_invalidate(vma, address, pvmw->pmd); > + > + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > + if (anon_exclusive && page_try_share_anon_rmap(page)) { > + set_pmd_at(mm, address, pvmw->pmd, pmdval); > + return; > + } > + > if (pmd_dirty(pmdval)) > set_page_dirty(page); > if (pmd_write(pmdval)) > entry = make_writable_migration_entry(page_to_pfn(page)); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry(page_to_pfn(page)); > else > entry = make_readable_migration_entry(page_to_pfn(page)); > pmdswp = swp_entry_to_pmd(entry); > @@ -3116,10 +3172,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) > pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); > > flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); > - if (PageAnon(new)) > - page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND); > - else > + if (PageAnon(new)) { > + int rmap_flags = RMAP_COMPOUND; > + > + if (!is_readable_migration_entry(entry)) > + rmap_flags |= RMAP_EXCLUSIVE; > + > + page_add_anon_rmap(new, vma, mmun_start, rmap_flags); > + } else { > page_add_file_rmap(new, true); > + } > + VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new)); > set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); > if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) > mlock_vma_page(new); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1ff0b9e1e28e..c0e19ea5ebb5 
100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4765,7 +4765,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > is_hugetlb_entry_hwpoisoned(entry))) { > swp_entry_t swp_entry = pte_to_swp_entry(entry); > > - if (is_writable_migration_entry(swp_entry) && cow) { > + if (!is_readable_migration_entry(swp_entry) && cow) { > /* > * COW mappings require pages in both > * parent and child to be set to read. > @@ -5165,6 +5165,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, > set_huge_ptep_writable(vma, haddr, ptep); > return 0; > } > + VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), > + old_page); > > /* > * If the process that created a MAP_PRIVATE mapping is about to > @@ -6160,12 +6162,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > } > if (unlikely(is_hugetlb_entry_migration(pte))) { > swp_entry_t entry = pte_to_swp_entry(pte); > + struct page *page = pfn_swap_entry_to_page(entry); > > - if (is_writable_migration_entry(entry)) { > + if (!is_readable_migration_entry(entry)) { > pte_t newpte; > > - entry = make_readable_migration_entry( > - swp_offset(entry)); > + if (PageAnon(page)) > + entry = make_readable_exclusive_migration_entry( > + swp_offset(entry)); > + else > + entry = make_readable_migration_entry( > + swp_offset(entry)); > newpte = swp_entry_to_pte(entry); > set_huge_swap_pte_at(mm, address, ptep, > newpte, huge_page_size(h)); > diff --git a/mm/ksm.c b/mm/ksm.c > index 9ff28097bc0a..350b46432d65 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -866,6 +866,7 @@ static inline struct stable_node *page_stable_node(struct page *page) > static inline void set_page_stable_node(struct page *page, > struct stable_node *stable_node) > { > + VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page); > page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); > } > > @@ -1041,6 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > int swapped; > int err = -EFAULT; > struct mmu_notifier_range range; > + bool anon_exclusive; > > pvmw.address = page_address_in_vma(page, vma); > if (pvmw.address == -EFAULT) > @@ -1058,9 +1060,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) > goto out_unlock; > > + anon_exclusive = PageAnonExclusive(page); > if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || > (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) || > - mm_tlb_flush_pending(mm)) { > + anon_exclusive || mm_tlb_flush_pending(mm)) { > pte_t entry; > > swapped = PageSwapCache(page); > @@ -1088,6 +1091,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > set_pte_at(mm, pvmw.address, pvmw.pte, entry); > goto out_unlock; > } > + > + if (anon_exclusive && page_try_share_anon_rmap(page)) { > + set_pte_at(mm, pvmw.address, pvmw.pte, entry); > + goto out_unlock; > + } > + > if (pte_dirty(entry)) > set_page_dirty(page); > > @@ -1146,6 +1155,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > pte_unmap_unlock(ptep, ptl); > goto out_mn; > } > + VM_BUG_ON_PAGE(PageAnonExclusive(page), page); > + VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage); > > /* > * No need to check ksm_use_zero_pages here: we can only have a > diff --git a/mm/memory.c b/mm/memory.c > index d01fab481134..222aaf277af4 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -720,6 +720,8 @@ static void 
restore_exclusive_pte(struct vm_area_struct *vma, > else if (is_writable_device_exclusive_entry(entry)) > pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > + VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page))); > + > /* > * No need to take a page reference as one was already > * created when the swap entry was made. > @@ -799,11 +801,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > rss[mm_counter(page)]++; > > - if (is_writable_migration_entry(entry) && > + if (!is_readable_migration_entry(entry) && > is_cow_mapping(vm_flags)) { > /* > - * COW mappings require pages in both > - * parent and child to be set to read. > + * COW mappings require pages in both parent and child > + * to be set to read. A previously exclusive entry is > + * now shared. > */ > entry = make_readable_migration_entry( > swp_offset(entry)); > @@ -954,6 +957,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > ptep_set_wrprotect(src_mm, addr, src_pte); > pte = pte_wrprotect(pte); > } > + VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page)); > > /* > * If it's a shared mapping, mark it clean in > @@ -2943,6 +2947,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf) > struct vm_area_struct *vma = vmf->vma; > struct page *page = vmf->page; > pte_t entry; > + > + VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page)); > + > /* > * Clear the pages cpupid information as the existing > * information potentially belongs to a now completely > @@ -3277,6 +3284,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > if (PageAnon(vmf->page)) { > struct page *page = vmf->page; > > + /* > + * If the page is exclusive to this process we must reuse the > + * page without further checks. > + */ > + if (PageAnonExclusive(page)) > + goto reuse; > + > /* > * We have to verify under page lock: these early checks are > * just an optimization to avoid locking the page and freeing > @@ -3309,6 +3323,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > */ > page_move_anon_rmap(page, vma); > unlock_page(page); > +reuse: > wp_page_reuse(vmf); > return VM_FAULT_WRITE; > } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > @@ -3700,11 +3715,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > * that are certainly not shared because we just allocated them without > * exposing them to the swapcache. 
> */ > - if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && > - (page != swapcache || page_count(page) == 1)) { > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > - vmf->flags &= ~FAULT_FLAG_WRITE; > - ret |= VM_FAULT_WRITE; > + if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) { > + if (vmf->flags & FAULT_FLAG_WRITE) { > + pte = maybe_mkwrite(pte_mkdirty(pte), vma); > + vmf->flags &= ~FAULT_FLAG_WRITE; > + ret |= VM_FAULT_WRITE; > + } > rmap_flags |= RMAP_EXCLUSIVE; > } > flush_icache_page(vma, page); > @@ -3724,6 +3740,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > page_add_anon_rmap(page, vma, vmf->address, rmap_flags); > } > > + VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page))); > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > diff --git a/mm/migrate.c b/mm/migrate.c > index fd9eba33b34a..7f440d2103ce 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -188,6 +188,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > > VM_BUG_ON_PAGE(PageTail(page), page); > while (page_vma_mapped_walk(&pvmw)) { > + rmap_t rmap_flags = RMAP_NONE; > + > if (PageKsm(page)) > new = page; > else > @@ -217,6 +219,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > else if (pte_swp_uffd_wp(*pvmw.pte)) > pte = pte_mkuffd_wp(pte); > > + if (PageAnon(new) && !is_readable_migration_entry(entry)) > + rmap_flags |= RMAP_EXCLUSIVE; > + > if (unlikely(is_device_private_page(new))) { > if (pte_write(pte)) > entry = make_writable_device_private_entry( > @@ -239,7 +244,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > pte = arch_make_huge_pte(pte, shift, vma->vm_flags); > if (PageAnon(new)) > hugepage_add_anon_rmap(new, vma, pvmw.address, > - RMAP_NONE); > + rmap_flags); > else > page_dup_file_rmap(new, true); > set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); > @@ -248,7 +253,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > { > if (PageAnon(new)) > page_add_anon_rmap(new, vma, pvmw.address, > - RMAP_NONE); > + rmap_flags); > else > page_add_file_rmap(new, false); > set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); > @@ -379,6 +384,10 @@ int folio_migrate_mapping(struct address_space *mapping, > /* No turning back from here */ > newfolio->index = folio->index; > newfolio->mapping = folio->mapping; > + /* > + * Note: PG_anon_exclusive is always migrated via migration > + * entries. > + */ > if (folio_test_swapbacked(folio)) > __folio_set_swapbacked(newfolio); > > @@ -2302,15 +2311,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > * set up a special migration page table entry now. 
> */ > if (trylock_page(page)) { > + bool anon_exclusive; > pte_t swp_pte; > > + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > + if (anon_exclusive) { > + flush_cache_page(vma, addr, pte_pfn(*ptep)); > + ptep_clear_flush(vma, addr, ptep); > + > + if (page_try_share_anon_rmap(page)) { > + set_pte_at(mm, addr, ptep, pte); > + unlock_page(page); > + put_page(page); > + mpfn = 0; > + goto next; > + } > + } else { > + ptep_get_and_clear(mm, addr, ptep); > + } > + > migrate->cpages++; > - ptep_get_and_clear(mm, addr, ptep); > > /* Setup special migration page table entry */ > if (mpfn & MIGRATE_PFN_WRITE) > entry = make_writable_migration_entry( > page_to_pfn(page)); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry( > + page_to_pfn(page)); > else > entry = make_readable_migration_entry( > page_to_pfn(page)); > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 2887644fd150..9441675e993c 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -141,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > pages++; > } else if (is_swap_pte(oldpte)) { > swp_entry_t entry = pte_to_swp_entry(oldpte); > + struct page *page = pfn_swap_entry_to_page(entry); > pte_t newpte; > > if (is_writable_migration_entry(entry)) { > @@ -148,8 +149,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > * A protection check is difficult so > * just be safe and disable write > */ > - entry = make_readable_migration_entry( > - swp_offset(entry)); > + if (PageAnon(page)) > + entry = make_readable_exclusive_migration_entry( > + swp_offset(entry)); > + else > + entry = make_readable_migration_entry(swp_offset(entry)); > newpte = swp_entry_to_pte(entry); > if (pte_swp_soft_dirty(oldpte)) > newpte = pte_swp_mksoft_dirty(newpte); > diff --git a/mm/rmap.c b/mm/rmap.c > index ebe7140c4493..9d2a7e11e8cc 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1048,6 +1048,7 @@ EXPORT_SYMBOL_GPL(folio_mkclean); > void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) > { > struct anon_vma *anon_vma = vma->anon_vma; > + struct page *subpage = page; > > page = compound_head(page); > > @@ -1061,6 +1062,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) > * PageAnon()) will not see one without the other. > */ > WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); > + SetPageAnonExclusive(subpage); > } > > /** > @@ -1078,7 +1080,7 @@ static void __page_set_anon_rmap(struct page *page, > BUG_ON(!anon_vma); > > if (PageAnon(page)) > - return; > + goto out; > > /* > * If the page isn't exclusively mapped into this vma, > @@ -1097,6 +1099,9 @@ static void __page_set_anon_rmap(struct page *page, > anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; > WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); > page->index = linear_page_index(vma, address); > +out: > + if (exclusive) > + SetPageAnonExclusive(page); > } > > /** > @@ -1156,6 +1161,8 @@ void page_add_anon_rmap(struct page *page, > } else { > first = atomic_inc_and_test(&page->_mapcount); > } > + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); > + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); > > if (first) { > int nr = compound ? 
thp_nr_pages(page) : 1; > @@ -1422,7 +1429,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > }; > pte_t pteval; > struct page *subpage; > - bool ret = true; > + bool anon_exclusive, ret = true; > struct mmu_notifier_range range; > enum ttu_flags flags = (enum ttu_flags)(long)arg; > > @@ -1485,6 +1492,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); > address = pvmw.address; > + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); > > if (PageHuge(page) && !PageAnon(page)) { > /* > @@ -1520,9 +1528,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > } > } > > - /* Nuke the page table entry. */ > + /* > + * Nuke the page table entry. When having to clear > + * PageAnonExclusive(), we always have to flush. > + */ > flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); > - if (should_defer_flush(mm, flags)) { > + if (should_defer_flush(mm, flags) && !anon_exclusive) { > /* > * We clear the PTE but do not flush so potentially > * a remote CPU could still be writing to the page. > @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > page_vma_mapped_walk_done(&pvmw); > break; > } > + if (anon_exclusive && > + page_try_share_anon_rmap(subpage)) { > + swap_free(entry); > + set_pte_at(mm, address, pvmw.pte, pteval); > + ret = false; > + page_vma_mapped_walk_done(&pvmw); > + break; > + } > + /* > + * Note: We *don't* remember yet if the page was mapped > + * exclusively in the swap entry, so swapin code has > + * to re-determine that manually and might detect the > + * page as possibly shared, for example, if there are > + * other references on the page or if the page is under > + * writeback. We made sure that there are no GUP pins > + * on the page that would rely on it, so for GUP pins > + * this is fine. > + */ > if (list_empty(&mm->mmlist)) { > spin_lock(&mmlist_lock); > if (list_empty(&mm->mmlist)) > @@ -1723,7 +1752,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > }; > pte_t pteval; > struct page *subpage; > - bool ret = true; > + bool anon_exclusive, ret = true; > struct mmu_notifier_range range; > enum ttu_flags flags = (enum ttu_flags)(long)arg; > > @@ -1782,6 +1811,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > > subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); > address = pvmw.address; > + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); > > if (PageHuge(page) && !PageAnon(page)) { > /* > @@ -1833,6 +1863,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > swp_entry_t entry; > pte_t swp_pte; > > + if (anon_exclusive) > + BUG_ON(page_try_share_anon_rmap(subpage)); > + > /* > * Store the pfn of the page in a special migration > * pte. 
do_swap_page() will wait until the migration > @@ -1841,6 +1874,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > entry = pte_to_swp_entry(pteval); > if (is_writable_device_private_entry(entry)) > entry = make_writable_migration_entry(pfn); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry(pfn); > else > entry = make_readable_migration_entry(pfn); > swp_pte = swp_entry_to_pte(entry); > @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > page_vma_mapped_walk_done(&pvmw); > break; > } > + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && > + !anon_exclusive, page); > + if (anon_exclusive && > + page_try_share_anon_rmap(subpage)) { > + set_pte_at(mm, address, pvmw.pte, pteval); > + ret = false; > + page_vma_mapped_walk_done(&pvmw); > + break; > + } > > /* > * Store the pfn of the page in a special migration > @@ -1912,6 +1956,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > if (pte_write(pteval)) > entry = make_writable_migration_entry( > page_to_pfn(subpage)); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry( > + page_to_pfn(subpage)); > else > entry = make_readable_migration_entry( > page_to_pfn(subpage)); > @@ -2425,6 +2472,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, > BUG_ON(!anon_vma); > /* address might be in next vma when migration races vma_adjust */ > first = atomic_inc_and_test(compound_mapcount_ptr(page)); > + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); > + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); > if (first) > __page_set_anon_rmap(page, vma, address, > flags & RMAP_EXCLUSIVE); > -- > 2.35.1 >
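For readers skimming the thread, the invariant the patch establishes can be summarized as: PG_anon_exclusive may only be cleared (page_try_share_anon_rmap()) while the page has no GUP pins, and a reliable FOLL_PIN may only be taken while the flag is still set. Below is a minimal user-space sketch of that handshake; the toy_* names are invented for illustration and are not kernel APIs, and the real code additionally relies on the PT lock and a clear/invalidate+flush of the page table entry to fence off GUP-fast.

```c
/*
 * Minimal user-space model of the handshake -- NOT kernel code. The toy_*
 * names loosely mirror PG_anon_exclusive, page_try_share_anon_rmap() and a
 * FOLL_PIN attempt.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_page {
	bool anon_exclusive;	/* models PG_anon_exclusive */
	int pins;		/* models outstanding GUP pins */
};

/* Models page_try_share_anon_rmap(): only succeeds if no pins exist. */
static int toy_try_share(struct toy_page *p)
{
	if (p->pins > 0)
		return -1;	/* like -EBUSY: keep the page exclusive and mapped */
	p->anon_exclusive = false;
	return 0;
}

/* Models a reliable pin: only allowed while the page is still exclusive. */
static int toy_try_pin(struct toy_page *p)
{
	if (!p->anon_exclusive)
		return -1;	/* would have to unshare/copy first */
	p->pins++;
	return 0;
}

int main(void)
{
	struct toy_page page = { .anon_exclusive = true, .pins = 0 };

	printf("pin exclusive page: %d\n", toy_try_pin(&page));   /* 0 */
	printf("share while pinned: %d\n", toy_try_share(&page)); /* -1 */
	page.pins--;                                               /* unpin */
	printf("share after unpin:  %d\n", toy_try_share(&page)); /* 0 */
	printf("pin shared page:    %d\n", toy_try_pin(&page));   /* -1 */
	return 0;
}
```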
On 16.03.22 22:23, Yang Shi wrote:
> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> c) In case we have a present PMD entry and want to freeze, handle it similar to try_to_migrate(): try marking the page shared first. In case we fail, we ignore the "freeze" instruction and simply split ordinarily. try_to_migrate() will properly fail because the THP is still mapped via PTEs.

Hi,

thanks for the review!

> How come will try_to_migrate() fail? The afterward pvmw will find those PTEs then convert them to migration entries anyway IIUC.
It will run into that code:

>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>>                         page_vma_mapped_walk_done(&pvmw);
>>                         break;
>>                 }
>> +               VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) &&
>> +                              !anon_exclusive, page);
>> +               if (anon_exclusive &&
>> +                   page_try_share_anon_rmap(subpage)) {
>> +                       set_pte_at(mm, address, pvmw.pte, pteval);
>> +                       ret = false;
>> +                       page_vma_mapped_walk_done(&pvmw);
>> +                       break;
>> +               }

and similarly fail the page_try_share_anon_rmap(), at which point try_to_migrate() stops and the caller will still observe a "page_mapped() == true".
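To make that bail-out concrete, here is a compressed user-space sketch (invented names, not the kernel implementation): the rmap walk stops at the first mapping it cannot mark shared, so at least one mapping survives and the caller still observes the page as mapped.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_MAPPINGS 4

static bool page_pinned = true;	/* some GUP pin exists on the page */
static int mapcount = NR_MAPPINGS;

/* Models page_try_share_anon_rmap(): fails while the page may be pinned. */
static int toy_try_share_anon_rmap(void)
{
	return page_pinned ? -1 : 0;
}

/* Models the walk over the page's mappings in try_to_migrate_one(). */
static void toy_try_to_migrate(void)
{
	for (int i = 0; i < NR_MAPPINGS; i++) {
		if (toy_try_share_anon_rmap()) {
			/* restore the PTE and stop the walk early */
			return;
		}
		mapcount--;	/* mapping replaced by a migration entry */
	}
}

int main(void)
{
	toy_try_to_migrate();
	/* the caller re-checks and treats "still mapped" as a failed unmap */
	printf("page_mapped(): %s\n", mapcount ? "true" : "false");
	return 0;
}
```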
On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> > How come will try_to_migrate() fail? The afterward pvmw will find those PTEs then convert them to migration entries anyway IIUC.
>
> It will run into that code:
>
> >> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
> >> [...]
> >> +               if (anon_exclusive &&
> >> +                   page_try_share_anon_rmap(subpage)) {
> >> +                       set_pte_at(mm, address, pvmw.pte, pteval);
> >> +                       ret = false;
> >> +                       page_vma_mapped_walk_done(&pvmw);
> >> +                       break;
> >> +               }
>
> and similarly fail the page_try_share_anon_rmap(), at which point try_to_migrate() stops and the caller will still observe a "page_mapped() == true".

Thanks, I missed that. Yes, the page will still be mapped. This should trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make this happen more often, we may consider removing that warning even though it is "once" since seeing a mapped page may become a normal case (once DIO is switched to FOLL_PIN, it may be more often). Anyway we don't have to remove it right now.
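On the FOLL_GET vs. FOLL_PIN point: a small illustrative sketch of why switching DIO to FOLL_PIN would make this path trigger more often. Pins are accounted with a large refcount bias, so a "maybe pinned" check can distinguish them from ordinary references, while a plain FOLL_GET reference stays invisible to it. The names and the bias constant below are stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for the kernel's pin accounting bias. */
#define TOY_PIN_BIAS 1024

/* Models a "page may be DMA-pinned" check based on the biased refcount. */
static bool toy_maybe_pinned(int refcount)
{
	return refcount >= TOY_PIN_BIAS;
}

int main(void)
{
	int base = 1;	/* e.g. the mapping's own reference */

	/* FOLL_GET adds a plain reference: unsharing still succeeds. */
	printf("FOLL_GET blocks unsharing: %d\n", toy_maybe_pinned(base + 1));

	/* FOLL_PIN adds a biased reference: unsharing (and unmap) fails. */
	printf("FOLL_PIN blocks unsharing: %d\n", toy_maybe_pinned(base + TOY_PIN_BIAS));
	return 0;
}
```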
On 18.03.22 21:29, Yang Shi wrote: > On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 16.03.22 22:23, Yang Shi wrote: >>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> [... quoted patch description snipped ...] >>>> >>>> c) In case we have a present PMD entry and want to freeze, handle it >>>> similar to try_to_migrate(): try marking the page shared first.
In case >>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. >>>> try_to_migrate() will properly fail because the THP is still mapped via >>>> PTEs. >> >> Hi, >> >> thanks for the review! >> >>> >>> How come will try_to_migrate() fail? The afterward pvmw will find >>> those PTEs then convert them to migration entries anyway IIUC. >>> >> >> It will run into that code: >> >>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, >>>> page_vma_mapped_walk_done(&pvmw); >>>> break; >>>> } >>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && >>>> + !anon_exclusive, page); >>>> + if (anon_exclusive && >>>> + page_try_share_anon_rmap(subpage)) { >>>> + set_pte_at(mm, address, pvmw.pte, pteval); >>>> + ret = false; >>>> + page_vma_mapped_walk_done(&pvmw); >>>> + break; >>>> + } >> >> and similarly fail the page_try_share_anon_rmap(), at which point >> try_to_migrate() stops and the caller will still observe a >> "page_mapped() == true". > > Thanks, I missed that. Yes, the page will still be mapped. This should > trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make > this happen more often, we may consider removing that warning even > though it is "once" since seeing a mapped page may become a normal > case (once DIO is switched to FOLL_PIN, it may be more often). Anyway > we don't have to remove it right now. Oh, very good catch! I wasn't able to trigger that warning in my testing so far. Interestingly, arch_unmap_one() could theoretically make this fail already and trigger the warning. Apart from that warning, split_huge_page_to_list() should work as expected: freezing the refcount will fail if still mapped and we'll remap. I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks!
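Condensed, the caller-side handling referred to above works out roughly as in the sketch below. This is illustrative only and not the actual split_huge_page_to_list() code: the function name and the explicit extra_pins parameter are simplifications, but the control flow -- unmap, try to freeze the refcount, remap on failure -- matches the behaviour described.

/*
 * Illustrative sketch, not the real kernel code: if unmap_page() leaves
 * PTEs behind because page_try_share_anon_rmap() refused to clear
 * PG_anon_exclusive, the refcount freeze fails and we simply remap.
 */
static int split_mapped_thp_sketch(struct page *head, int extra_pins)
{
	unmap_page(head);		/* may fail to unmap all PTEs */

	if (!page_ref_freeze(head, 1 + extra_pins)) {
		/* still mapped or otherwise referenced: undo and give up */
		remap_page(head, thp_nr_pages(head));
		return -EBUSY;
	}

	/* ... the actual split of the compound page would happen here ... */
	return 0;
}

In other words, a failed unmap is handled like any other extra reference on the page, which is why the warning can go away.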
On 19.03.22 11:21, David Hildenbrand wrote: > On 18.03.22 21:29, Yang Shi wrote: >> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 16.03.22 22:23, Yang Shi wrote: >>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> [... quoted patch description snipped ...]
>>>>> >>>>> c) In case we have a present PMD entry and want to freeze, handle it >>>>> similar to try_to_migrate(): try marking the page shared first. In case >>>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. >>>>> try_to_migrate() will properly fail because the THP is still mapped via >>>>> PTEs. >>> >>> Hi, >>> >>> thanks for the review! >>> >>>> >>>> How come will try_to_migrate() fail? The afterward pvmw will find >>>> those PTEs then convert them to migration entries anyway IIUC. >>>> >>> >>> It will run into that code: >>> >>>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, >>>>> page_vma_mapped_walk_done(&pvmw); >>>>> break; >>>>> } >>>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && >>>>> + !anon_exclusive, page); >>>>> + if (anon_exclusive && >>>>> + page_try_share_anon_rmap(subpage)) { >>>>> + set_pte_at(mm, address, pvmw.pte, pteval); >>>>> + ret = false; >>>>> + page_vma_mapped_walk_done(&pvmw); >>>>> + break; >>>>> + } >>> >>> and similarly fail the page_try_share_anon_rmap(), at which point >>> try_to_migrate() stops and the caller will still observe a >>> "page_mapped() == true". >> >> Thanks, I missed that. Yes, the page will still be mapped. This should >> trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make >> this happen more often, we may consider removing that warning even >> though it is "once" since seeing a mapped page may become a normal >> case (once DIO is switched to FOLL_PIN, it may be more often). Anyway >> we don't have to remove it right now. > > Oh, very good catch! I wasn't able to trigger that warning in my testing > so far. Interestingly, arch_unmap_one() could theoretically make this > fail already and trigger the warning. > > Apart from that warning, split_huge_page_to_list() should work as > expected: freezing the refcount will fail if still mapped and we'll remap. > > I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! > From e6e983d841cd2aa2a9c8dc71779211881cf0d96f Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Sat, 19 Mar 2022 11:49:39 +0100 Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() We can already theoretically fail to unmap (still having page_mapped()) in case arch_unmap_one() fails. Although this applies only to anonymous pages for now, get rid of the VM_WARN_ON_ONCE_PAGE() completely: the caller -- split_huge_page_to_list() -- will fail to freeze the refcount and remap the page via remap_page(). So the caller can handle unmap errors just fine. This is a preparation for making try_to_migrate() fail on anonymous pages with GUP pins. Reported-by: Yang Shi <shy828301@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/huge_memory.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6fb409b9e4..0fe0ab3ec3fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2263,8 +2263,6 @@ static void unmap_page(struct page *page) try_to_migrate(page, ttu_flags); else try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); - - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } static void remap_page(struct page *page, unsigned int nr)
On Sat, Mar 19, 2022 at 3:21 AM David Hildenbrand <david@redhat.com> wrote: > > On 18.03.22 21:29, Yang Shi wrote: > >> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 16.03.22 22:23, Yang Shi wrote: > >>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> [... quoted patch description snipped ...]
> >>>> > >>>> b) In case we have a present PMD entry and we don't want to freeze > >>>> ("convert to migration entries"), simply forward PG_anon_exclusive to > >>>> all sub-pages, no need to temporarily clear the bit. > >>>> > >>>> c) In case we have a present PMD entry and want to freeze, handle it > >>>> similar to try_to_migrate(): try marking the page shared first. In case > >>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. > >>>> try_to_migrate() will properly fail because the THP is still mapped via > >>>> PTEs. > >> > >> Hi, > >> > >> thanks for the review! > >> > >>> > >>> How come will try_to_migrate() fail? The afterward pvmw will find > >>> those PTEs then convert them to migration entries anyway IIUC. > >>> > >> > >> It will run into that code: > >> > >>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > >>>> page_vma_mapped_walk_done(&pvmw); > >>>> break; > >>>> } > >>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && > >>>> + !anon_exclusive, page); > >>>> + if (anon_exclusive && > >>>> + page_try_share_anon_rmap(subpage)) { > >>>> + set_pte_at(mm, address, pvmw.pte, pteval); > >>>> + ret = false; > >>>> + page_vma_mapped_walk_done(&pvmw); > >>>> + break; > >>>> + } > >> > >> and similarly fail the page_try_share_anon_rmap(), at which point > >> try_to_migrate() stops and the caller will still observe a > >> "page_mapped() == true". > > > > Thanks, I missed that. Yes, the page will still be mapped. This should > > trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make > > this happen more often, we may consider removing that warning even > > though it is "once" since seeing a mapped page may become a normal > > case (once DIO is switched to FOLL_PIN, it may be more often). Anyway > > we don't have to remove it right now. > > Oh, very good catch! I wasn't able to trigger that warning in my testing > so far. Interestingly, arch_unmap_one() could theoretically make this > fail already and trigger the warning. It should be very rare to trigger that warning. I just saw one report before. And arch_unmap_one() actually can't fail except for Sparc. So that warning was intended to catch some unusual issues, but this COW patchset may make it much more usual than before. > > Apart from that warning, split_huge_page_to_list() should work as > expected: freezing the refcount will fail if still mapped and we'll remap. Yes, that was why it was changed to a WARN from a BUG, please see commit 504e070dc08f ("mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split") > > I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! Thanks, I'm fine to remove it. > > -- > Thanks, > > David / dhildenb >
On Sat, Mar 19, 2022 at 3:50 AM David Hildenbrand <david@redhat.com> wrote: > > On 19.03.22 11:21, David Hildenbrand wrote: > > On 18.03.22 21:29, Yang Shi wrote: > >> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 16.03.22 22:23, Yang Shi wrote: > >>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> [... quoted patch description snipped ...]
> >>>>> > >>>>> a) In case we have a readable-exclusive PMD migration entry, simply insert > >>>>> readable-exclusive PTE migration entries. > >>>>> > >>>>> b) In case we have a present PMD entry and we don't want to freeze > >>>>> ("convert to migration entries"), simply forward PG_anon_exclusive to > >>>>> all sub-pages, no need to temporarily clear the bit. > >>>>> > >>>>> c) In case we have a present PMD entry and want to freeze, handle it > >>>>> similar to try_to_migrate(): try marking the page shared first. In case > >>>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. > >>>>> try_to_migrate() will properly fail because the THP is still mapped via > >>>>> PTEs. > >>> > >>> Hi, > >>> > >>> thanks for the review! > >>> > >>>> > >>>> How come will try_to_migrate() fail? The afterward pvmw will find > >>>> those PTEs then convert them to migration entries anyway IIUC. > >>>> > >>> > >>> It will run into that code: > >>> > >>>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > >>>>> page_vma_mapped_walk_done(&pvmw); > >>>>> break; > >>>>> } > >>>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && > >>>>> + !anon_exclusive, page); > >>>>> + if (anon_exclusive && > >>>>> + page_try_share_anon_rmap(subpage)) { > >>>>> + set_pte_at(mm, address, pvmw.pte, pteval); > >>>>> + ret = false; > >>>>> + page_vma_mapped_walk_done(&pvmw); > >>>>> + break; > >>>>> + } > >>> > >>> and similarly fail the page_try_share_anon_rmap(), at which point > >>> try_to_migrate() stops and the caller will still observe a > >>> "page_mapped() == true". > >> > >> Thanks, I missed that. Yes, the page will still be mapped. This should > >> trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make > >> this happen more often, we may consider removing that warning even > >> though it is "once" since seeing a mapped page may become a normal > >> case (once DIO is switched to FOLL_PIN, it may be more often). Anyway > >> we don't have to remove it right now. > > > > Oh, very good catch! I wasn't able to trigger that warning in my testing > > so far. Interestingly, arch_unmap_one() could theoretically make this > > fail already and trigger the warning. > > > > Apart from that warning, split_huge_page_to_list() should work as > > expected: freezing the refcount will fail if still mapped and we'll remap. > > > > I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! > > > > From e6e983d841cd2aa2a9c8dc71779211881cf0d96f Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Sat, 19 Mar 2022 11:49:39 +0100 > Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from > unmap_page() > > We can already theoretically fail to unmap (still having page_mapped()) in > case arch_unmap_one() fails. Although this applies only to anonymous pages > for now, get rid of the VM_WARN_ON_ONCE_PAGE() completely: the caller -- > split_huge_page_to_list() -- will fail to freeze the refcount and > remap the page via remap_page(). So the caller can handle unmap errors > just fine. > > This is a preparation for making try_to_migrate() fail on anonymous pages > with GUP pins. Better to mention "making try_to_migrate() fail on pinned anonymous pages may make this happen more usual". > > Reported-by: Yang Shi <shy828301@gmail.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Fine to me. 
Reviewed-by: Yang Shi <shy828301@gmail.com> > --- > mm/huge_memory.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 0b6fb409b9e4..0fe0ab3ec3fc 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2263,8 +2263,6 @@ static void unmap_page(struct page *page) > try_to_migrate(page, ttu_flags); > else > try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); > - > - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); > } > > static void remap_page(struct page *page, unsigned int nr) > -- > 2.35.1 > > > -- > Thanks, > > David / dhildenb >
On 21.03.22 21:56, Yang Shi wrote: > On Sat, Mar 19, 2022 at 3:50 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.03.22 11:21, David Hildenbrand wrote: >>> On 18.03.22 21:29, Yang Shi wrote: >>>> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 16.03.22 22:23, Yang Shi wrote: >>>>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: >>>>>>> >>>>>>> [... quoted patch description snipped ...] >>>>>>> >>>>>>> I.6.
THP handling >>>>>>> >>>>>>> __split_huge_pmd_locked() has to move the information about exclusivity >>>>>>> from the PMD to the PTEs. >>>>>>> >>>>>>> a) In case we have a readable-exclusive PMD migration entry, simply insert >>>>>>> readable-exclusive PTE migration entries. >>>>>>> >>>>>>> b) In case we have a present PMD entry and we don't want to freeze >>>>>>> ("convert to migration entries"), simply forward PG_anon_exclusive to >>>>>>> all sub-pages, no need to temporarily clear the bit. >>>>>>> >>>>>>> c) In case we have a present PMD entry and want to freeze, handle it >>>>>>> similar to try_to_migrate(): try marking the page shared first. In case >>>>>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. >>>>>>> try_to_migrate() will properly fail because the THP is still mapped via >>>>>>> PTEs. >>>>> >>>>> Hi, >>>>> >>>>> thanks for the review! >>>>> >>>>>> >>>>>> How come will try_to_migrate() fail? The afterward pvmw will find >>>>>> those PTEs then convert them to migration entries anyway IIUC. >>>>>> >>>>> >>>>> It will run into that code: >>>>> >>>>>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, >>>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>>> break; >>>>>>> } >>>>>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && >>>>>>> + !anon_exclusive, page); >>>>>>> + if (anon_exclusive && >>>>>>> + page_try_share_anon_rmap(subpage)) { >>>>>>> + set_pte_at(mm, address, pvmw.pte, pteval); >>>>>>> + ret = false; >>>>>>> + page_vma_mapped_walk_done(&pvmw); >>>>>>> + break; >>>>>>> + } >>>>> >>>>> and similarly fail the page_try_share_anon_rmap(), at which point >>>>> try_to_migrate() stops and the caller will still observe a >>>>> "page_mapped() == true". >>>> >>>> Thanks, I missed that. Yes, the page will still be mapped. This should >>>> trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make >>>> this happen more often, we may consider removing that warning even >>>> though it is "once" since seeing a mapped page may become a normal >>>> case (once DIO is switched to FOLL_PIN, it may be more often). Anyway >>>> we don't have to remove it right now. >>> >>> Oh, very good catch! I wasn't able to trigger that warning in my testing >>> so far. Interestingly, arch_unmap_one() could theoretically make this >>> fail already and trigger the warning. >>> >>> Apart from that warning, split_huge_page_to_list() should work as >>> expected: freezing the refcount will fail if still mapped and we'll remap. >>> >>> I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! >>> >> >> From e6e983d841cd2aa2a9c8dc71779211881cf0d96f Mon Sep 17 00:00:00 2001 >> From: David Hildenbrand <david@redhat.com> >> Date: Sat, 19 Mar 2022 11:49:39 +0100 >> Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from >> unmap_page() >> >> We can already theoretically fail to unmap (still having page_mapped()) in >> case arch_unmap_one() fails. Although this applies only to anonymous pages >> for now, get rid of the VM_WARN_ON_ONCE_PAGE() completely: the caller -- >> split_huge_page_to_list() -- will fail to freeze the refcount and >> remap the page via remap_page(). So the caller can handle unmap errors >> just fine. >> >> This is a preparation for making try_to_migrate() fail on anonymous pages >> with GUP pins. > > Better to mention "making try_to_migrate() fail on pinned anonymous > pages may make this happen more usual". 
> >> >> Reported-by: Yang Shi <shy828301@gmail.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > Fine to me. Reviewed-by: Yang Shi <shy828301@gmail.com> Thanks, updated patch: From 4f68aa1067e43caa370a62b930094a4d07492f92 Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Sat, 19 Mar 2022 11:49:39 +0100 Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() We can already theoretically fail to unmap (still having page_mapped()) in case arch_unmap_one() fails, which can happen on sparc. Failures to unmap are handled gracefully, just as if there are other references on the target page: freezing the refcount in split_huge_page_to_list() will fail if still mapped and we'll simply remap. In commit 504e070dc08f ("mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE, let's get rid of it completely now. This is a preparation for making try_to_migrate() fail on anonymous pages with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more frequently. Reported-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/huge_memory.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6fb409b9e4..0fe0ab3ec3fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2263,8 +2263,6 @@ static void unmap_page(struct page *page) try_to_migrate(page, ttu_flags); else try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); - - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } static void remap_page(struct page *page, unsigned int nr)
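Condensed, the exclusivity round-trip that the diff below adds for migration works out roughly as in this sketch. It is illustrative only -- the *_sketch helper names are made up -- but it uses the entry types and the RMAP_EXCLUSIVE flag the same way the patch does: writable migration entries imply an exclusive anonymous page, the new readable-exclusive entry type carries that information for write-protected entries, and it is turned back into RMAP_EXCLUSIVE when the migration entry is removed.

/*
 * Illustrative sketch, not the exact kernel code: how exclusivity is
 * encoded in the migration entry type on unmap and recovered as an
 * rmap flag when the migration entry is removed.
 */
static swp_entry_t migration_entry_for_sketch(struct page *subpage,
					      bool writable, bool anon_exclusive)
{
	if (writable)
		return make_writable_migration_entry(page_to_pfn(subpage));
	if (anon_exclusive)
		return make_readable_exclusive_migration_entry(page_to_pfn(subpage));
	return make_readable_migration_entry(page_to_pfn(subpage));
}

static int rmap_flags_for_sketch(swp_entry_t entry)
{
	/* writable and readable-exclusive entries imply an exclusive anon page */
	return is_readable_migration_entry(entry) ? 0 : RMAP_EXCLUSIVE;
}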
diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 51953bace0a3..1aef834e1d60 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -224,6 +224,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, { VM_BUG_ON_PAGE(!PageAnon(page), page); + /* + * No need to check+clear for already shared pages, including KSM + * pages. + */ + if (!PageAnonExclusive(page)) + goto dup; + /* * If this page may have been pinned by the parent process, * don't allow to duplicate the mapping but instead require to e.g., @@ -235,14 +242,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, unlikely(page_needs_cow_for_dma(vma, page)))) return -EBUSY; + ClearPageAnonExclusive(page); /* * It's okay to share the anon page between both processes, mapping * the page R/O into both processes. */ +dup: __page_dup_rmap(page, compound); return 0; } +/** + * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly + * shared to prepare for KSM or temporary unmapping + * @page: the exclusive anonymous page to try marking possibly shared + * + * The caller needs to hold the PT lock and has to have the page table entry + * cleared/invalidated+flushed, to properly sync against GUP-fast. + * + * This is similar to page_try_dup_anon_rmap(), however, not used during fork() + * to duplicate a mapping, but instead to prepare for KSM or temporarily + * unmapping a page (swap, migration) via page_remove_rmap(). + * + * Marking the page shared can only fail if the page may be pinned; device + * private pages cannot get pinned and consequently this function cannot fail. + * + * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY + * otherwise. + */ +static inline int page_try_share_anon_rmap(struct page *page) +{ + VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page); + + /* See page_try_dup_anon_rmap(). */ + if (likely(!is_device_private_page(page) && + unlikely(page_maybe_dma_pinned(page)))) + return -EBUSY; + + ClearPageAnonExclusive(page); + return 0; +} + /* * Called from mm/vmscan.c to handle paging out */ diff --git a/include/linux/swap.h b/include/linux/swap.h index b546e4bd5c5a..422765d1141c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -78,12 +78,19 @@ static inline int current_is_kswapd(void) #endif /* - * NUMA node memory migration support + * Page migration support. + * + * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and + * indicates that the referenced (part of) an anonymous page is exclusive to + * a single process. For SWP_MIGRATION_WRITE, that information is implicit: + * (part of) an anonymous page that are mapped writable are exclusive to a + * single process. 
*/ #ifdef CONFIG_MIGRATION -#define SWP_MIGRATION_NUM 2 -#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) +#define SWP_MIGRATION_NUM 3 +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) +#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2) #else #define SWP_MIGRATION_NUM 0 #endif diff --git a/include/linux/swapops.h b/include/linux/swapops.h index d356ab4047f7..06280fc1c99b 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) static inline int is_migration_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_MIGRATION_READ || + swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE || swp_type(entry) == SWP_MIGRATION_WRITE); } @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry) return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); } +static inline int is_readable_migration_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_MIGRATION_READ); +} + +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE); +} + static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) { return swp_entry(SWP_MIGRATION_READ, offset); } +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) +{ + return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset); +} + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) { return swp_entry(SWP_MIGRATION_WRITE, offset); @@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) return swp_entry(0, 0); } +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) { return swp_entry(0, 0); @@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry) { return 0; } +static inline int is_readable_migration_entry(swp_entry_t entry) +{ + return 0; +} #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6fb409b9e4..4872e7120ee1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, swp_entry_t entry = pmd_to_swp_entry(pmd); VM_BUG_ON(!is_pmd_migration_entry(pmd)); - if (is_writable_migration_entry(entry)) { + if (!is_readable_migration_entry(entry)) { entry = make_readable_migration_entry( swp_offset(entry)); pmd = swp_entry_to_pmd(entry); @@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page = pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); + /* Early check when only holding the PT lock. */ + if (PageAnonExclusive(page)) + goto reuse; + if (!trylock_page(page)) { get_page(page); spin_unlock(vmf->ptl); @@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) put_page(page); } + /* Recheck after temporarily dropping the PT lock. */ + if (PageAnonExclusive(page)) { + unlock_page(page); + goto reuse; + } + /* * See do_wp_page(): we can only map the page writable if there are * no additional references. 
Note that we always drain the LRU @@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) pmd_t entry; page_move_anon_rmap(page, vma); + unlock_page(page); +reuse: entry = pmd_mkyoung(orig_pmd); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - unlock_page(page); spin_unlock(vmf->ptl); return VM_FAULT_WRITE; } @@ -1741,6 +1752,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (is_swap_pmd(*pmd)) { swp_entry_t entry = pmd_to_swp_entry(*pmd); + struct page *page = pfn_swap_entry_to_page(entry); VM_BUG_ON(!is_pmd_migration_entry(*pmd)); if (is_writable_migration_entry(entry)) { @@ -1749,8 +1761,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, * A protection check is difficult so * just be safe and disable write */ - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); newpmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*pmd)) newpmd = pmd_swp_mksoft_dirty(newpmd); @@ -1959,6 +1973,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pgtable_t pgtable; pmd_t old_pmd, _pmd; bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; + bool anon_exclusive = false; unsigned long addr; int i; @@ -2040,6 +2055,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pmd_to_swp_entry(old_pmd); page = pfn_swap_entry_to_page(entry); write = is_writable_migration_entry(entry); + if (PageAnon(page)) + anon_exclusive = is_readable_exclusive_migration_entry(entry); young = false; soft_dirty = pmd_swp_soft_dirty(old_pmd); uffd_wp = pmd_swp_uffd_wp(old_pmd); @@ -2051,6 +2068,23 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, young = pmd_young(old_pmd); soft_dirty = pmd_soft_dirty(old_pmd); uffd_wp = pmd_uffd_wp(old_pmd); + + /* + * Without "freeze", we'll simply split the PMD, propagating the + * PageAnonExclusive() flag for each PTE by setting it for + * each subpage -- no need to (temporarily) clear. + * + * With "freeze" we want to replace mapped pages by + * migration entries right away. This is only possible if we + * managed to clear PageAnonExclusive() -- see + * set_pmd_migration_entry(). + * + * In case we cannot clear PageAnonExclusive(), split the PMD + * only and let try_to_migrate_one() fail later. 
+ */ + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) + freeze = false; } VM_BUG_ON_PAGE(!page_count(page), page); page_ref_add(page, HPAGE_PMD_NR - 1); @@ -2074,6 +2108,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (write) swp_entry = make_writable_migration_entry( page_to_pfn(page + i)); + else if (anon_exclusive) + swp_entry = make_readable_exclusive_migration_entry( + page_to_pfn(page + i)); else swp_entry = make_readable_migration_entry( page_to_pfn(page + i)); @@ -2085,6 +2122,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } else { entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); entry = maybe_mkwrite(entry, vma); + if (anon_exclusive) + SetPageAnonExclusive(page + i); if (!write) entry = pte_wrprotect(entry); if (!young) @@ -2315,6 +2354,13 @@ static void __split_huge_page_tail(struct page *head, int tail, * * After successful get_page_unless_zero() might follow flags change, * for example lock_page() which set PG_waiters. + * + * Note that for mapped sub-pages of an anonymous THP, + * PG_anon_exclusive has been cleared in unmap_page() and is stored in + * the migration entry instead from where remap_page() will restore it. + * We can still have PG_anon_exclusive set on effectively unmapped and + * unreferenced sub-pages of an anonymous THP: we can simply drop + * PG_anon_exclusive (-> PG_mappedtodisk) for these here. */ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; page_tail->flags |= (head->flags & @@ -3070,6 +3116,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, struct vm_area_struct *vma = pvmw->vma; struct mm_struct *mm = vma->vm_mm; unsigned long address = pvmw->address; + bool anon_exclusive; pmd_t pmdval; swp_entry_t entry; pmd_t pmdswp; @@ -3079,10 +3126,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, flush_cache_range(vma, address, address + HPAGE_PMD_SIZE); pmdval = pmdp_invalidate(vma, address, pvmw->pmd); + + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (anon_exclusive && page_try_share_anon_rmap(page)) { + set_pmd_at(mm, address, pvmw->pmd, pmdval); + return; + } + if (pmd_dirty(pmdval)) set_page_dirty(page); if (pmd_write(pmdval)) entry = make_writable_migration_entry(page_to_pfn(page)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry(page_to_pfn(page)); else entry = make_readable_migration_entry(page_to_pfn(page)); pmdswp = swp_entry_to_pmd(entry); @@ -3116,10 +3172,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); - if (PageAnon(new)) - page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND); - else + if (PageAnon(new)) { + int rmap_flags = RMAP_COMPOUND; + + if (!is_readable_migration_entry(entry)) + rmap_flags |= RMAP_EXCLUSIVE; + + page_add_anon_rmap(new, vma, mmun_start, rmap_flags); + } else { page_add_file_rmap(new, true); + } + VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new)); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) mlock_vma_page(new); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1ff0b9e1e28e..c0e19ea5ebb5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4765,7 +4765,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, is_hugetlb_entry_hwpoisoned(entry))) { 
swp_entry_t swp_entry = pte_to_swp_entry(entry); - if (is_writable_migration_entry(swp_entry) && cow) { + if (!is_readable_migration_entry(swp_entry) && cow) { /* * COW mappings require pages in both * parent and child to be set to read. @@ -5165,6 +5165,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, set_huge_ptep_writable(vma, haddr, ptep); return 0; } + VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), + old_page); /* * If the process that created a MAP_PRIVATE mapping is about to @@ -6160,12 +6162,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, } if (unlikely(is_hugetlb_entry_migration(pte))) { swp_entry_t entry = pte_to_swp_entry(pte); + struct page *page = pfn_swap_entry_to_page(entry); - if (is_writable_migration_entry(entry)) { + if (!is_readable_migration_entry(entry)) { pte_t newpte; - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry( + swp_offset(entry)); newpte = swp_entry_to_pte(entry); set_huge_swap_pte_at(mm, address, ptep, newpte, huge_page_size(h)); diff --git a/mm/ksm.c b/mm/ksm.c index 9ff28097bc0a..350b46432d65 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -866,6 +866,7 @@ static inline struct stable_node *page_stable_node(struct page *page) static inline void set_page_stable_node(struct page *page, struct stable_node *stable_node) { + VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page); page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); } @@ -1041,6 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, int swapped; int err = -EFAULT; struct mmu_notifier_range range; + bool anon_exclusive; pvmw.address = page_address_in_vma(page, vma); if (pvmw.address == -EFAULT) @@ -1058,9 +1060,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) goto out_unlock; + anon_exclusive = PageAnonExclusive(page); if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) || - mm_tlb_flush_pending(mm)) { + anon_exclusive || mm_tlb_flush_pending(mm)) { pte_t entry; swapped = PageSwapCache(page); @@ -1088,6 +1091,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } + + if (anon_exclusive && page_try_share_anon_rmap(page)) { + set_pte_at(mm, pvmw.address, pvmw.pte, entry); + goto out_unlock; + } + if (pte_dirty(entry)) set_page_dirty(page); @@ -1146,6 +1155,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, pte_unmap_unlock(ptep, ptl); goto out_mn; } + VM_BUG_ON_PAGE(PageAnonExclusive(page), page); + VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage); /* * No need to check ksm_use_zero_pages here: we can only have a diff --git a/mm/memory.c b/mm/memory.c index d01fab481134..222aaf277af4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -720,6 +720,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, else if (is_writable_device_exclusive_entry(entry)) pte = maybe_mkwrite(pte_mkdirty(pte), vma); + VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page))); + /* * No need to take a page reference as one was already * created when the swap entry was made. 
@@ -799,11 +801,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, rss[mm_counter(page)]++; - if (is_writable_migration_entry(entry) && + if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) { /* - * COW mappings require pages in both - * parent and child to be set to read. + * COW mappings require pages in both parent and child + * to be set to read. A previously exclusive entry is + * now shared. */ entry = make_readable_migration_entry( swp_offset(entry)); @@ -954,6 +957,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } + VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page)); /* * If it's a shared mapping, mark it clean in @@ -2943,6 +2947,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct page *page = vmf->page; pte_t entry; + + VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page)); + /* * Clear the pages cpupid information as the existing * information potentially belongs to a now completely @@ -3277,6 +3284,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) if (PageAnon(vmf->page)) { struct page *page = vmf->page; + /* + * If the page is exclusive to this process we must reuse the + * page without further checks. + */ + if (PageAnonExclusive(page)) + goto reuse; + /* * We have to verify under page lock: these early checks are * just an optimization to avoid locking the page and freeing @@ -3309,6 +3323,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) */ page_move_anon_rmap(page, vma); unlock_page(page); +reuse: wp_page_reuse(vmf); return VM_FAULT_WRITE; } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == @@ -3700,11 +3715,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * that are certainly not shared because we just allocated them without * exposing them to the swapcache. 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && - (page != swapcache || page_count(page) == 1)) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); - vmf->flags &= ~FAULT_FLAG_WRITE; - ret |= VM_FAULT_WRITE; + if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) { + if (vmf->flags & FAULT_FLAG_WRITE) { + pte = maybe_mkwrite(pte_mkdirty(pte), vma); + vmf->flags &= ~FAULT_FLAG_WRITE; + ret |= VM_FAULT_WRITE; + } rmap_flags |= RMAP_EXCLUSIVE; } flush_icache_page(vma, page); @@ -3724,6 +3740,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) page_add_anon_rmap(page, vma, vmf->address, rmap_flags); } + VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page))); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); diff --git a/mm/migrate.c b/mm/migrate.c index fd9eba33b34a..7f440d2103ce 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -188,6 +188,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(PageTail(page), page); while (page_vma_mapped_walk(&pvmw)) { + rmap_t rmap_flags = RMAP_NONE; + if (PageKsm(page)) new = page; else @@ -217,6 +219,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, else if (pte_swp_uffd_wp(*pvmw.pte)) pte = pte_mkuffd_wp(pte); + if (PageAnon(new) && !is_readable_migration_entry(entry)) + rmap_flags |= RMAP_EXCLUSIVE; + if (unlikely(is_device_private_page(new))) { if (pte_write(pte)) entry = make_writable_device_private_entry( @@ -239,7 +244,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, pte = arch_make_huge_pte(pte, shift, vma->vm_flags); if (PageAnon(new)) hugepage_add_anon_rmap(new, vma, pvmw.address, - RMAP_NONE); + rmap_flags); else page_dup_file_rmap(new, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); @@ -248,7 +253,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, { if (PageAnon(new)) page_add_anon_rmap(new, vma, pvmw.address, - RMAP_NONE); + rmap_flags); else page_add_file_rmap(new, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); @@ -379,6 +384,10 @@ int folio_migrate_mapping(struct address_space *mapping, /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; + /* + * Note: PG_anon_exclusive is always migrated via migration + * entries. + */ if (folio_test_swapbacked(folio)) __folio_set_swapbacked(newfolio); @@ -2302,15 +2311,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, * set up a special migration page table entry now. 
*/ if (trylock_page(page)) { + bool anon_exclusive; pte_t swp_pte; + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (anon_exclusive) { + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + + if (page_try_share_anon_rmap(page)) { + set_pte_at(mm, addr, ptep, pte); + unlock_page(page); + put_page(page); + mpfn = 0; + goto next; + } + } else { + ptep_get_and_clear(mm, addr, ptep); + } + migrate->cpages++; - ptep_get_and_clear(mm, addr, ptep); /* Setup special migration page table entry */ if (mpfn & MIGRATE_PFN_WRITE) entry = make_writable_migration_entry( page_to_pfn(page)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry( + page_to_pfn(page)); else entry = make_readable_migration_entry( page_to_pfn(page)); diff --git a/mm/mprotect.c b/mm/mprotect.c index 2887644fd150..9441675e993c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -141,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, pages++; } else if (is_swap_pte(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); + struct page *page = pfn_swap_entry_to_page(entry); pte_t newpte; if (is_writable_migration_entry(entry)) { @@ -148,8 +149,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, * A protection check is difficult so * just be safe and disable write */ - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); newpte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(oldpte)) newpte = pte_swp_mksoft_dirty(newpte); diff --git a/mm/rmap.c b/mm/rmap.c index ebe7140c4493..9d2a7e11e8cc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1048,6 +1048,7 @@ EXPORT_SYMBOL_GPL(folio_mkclean); void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; + struct page *subpage = page; page = compound_head(page); @@ -1061,6 +1062,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) * PageAnon()) will not see one without the other. */ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); + SetPageAnonExclusive(subpage); } /** @@ -1078,7 +1080,7 @@ static void __page_set_anon_rmap(struct page *page, BUG_ON(!anon_vma); if (PageAnon(page)) - return; + goto out; /* * If the page isn't exclusively mapped into this vma, @@ -1097,6 +1099,9 @@ static void __page_set_anon_rmap(struct page *page, anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); page->index = linear_page_index(vma, address); +out: + if (exclusive) + SetPageAnonExclusive(page); } /** @@ -1156,6 +1161,8 @@ void page_add_anon_rmap(struct page *page, } else { first = atomic_inc_and_test(&page->_mapcount); } + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) { int nr = compound ? 
thp_nr_pages(page) : 1; @@ -1422,7 +1429,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }; pte_t pteval; struct page *subpage; - bool ret = true; + bool anon_exclusive, ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; @@ -1485,6 +1492,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); if (PageHuge(page) && !PageAnon(page)) { /* @@ -1520,9 +1528,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } } - /* Nuke the page table entry. */ + /* + * Nuke the page table entry. When having to clear + * PageAnonExclusive(), we always have to flush. + */ flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); - if (should_defer_flush(mm, flags)) { + if (should_defer_flush(mm, flags) && !anon_exclusive) { /* * We clear the PTE but do not flush so potentially * a remote CPU could still be writing to the page. @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + if (anon_exclusive && + page_try_share_anon_rmap(subpage)) { + swap_free(entry); + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } + /* + * Note: We *don't* remember yet if the page was mapped + * exclusively in the swap entry, so swapin code has + * to re-determine that manually and might detect the + * page as possibly shared, for example, if there are + * other references on the page or if the page is under + * writeback. We made sure that there are no GUP pins + * on the page that would rely on it, so for GUP pins + * this is fine. + */ if (list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); if (list_empty(&mm->mmlist)) @@ -1723,7 +1752,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, }; pte_t pteval; struct page *subpage; - bool ret = true; + bool anon_exclusive, ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; @@ -1782,6 +1811,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); if (PageHuge(page) && !PageAnon(page)) { /* @@ -1833,6 +1863,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, swp_entry_t entry; pte_t swp_pte; + if (anon_exclusive) + BUG_ON(page_try_share_anon_rmap(subpage)); + /* * Store the pfn of the page in a special migration * pte. 
do_swap_page() will wait until the migration @@ -1841,6 +1874,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, entry = pte_to_swp_entry(pteval); if (is_writable_device_private_entry(entry)) entry = make_writable_migration_entry(pfn); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry(pfn); else entry = make_readable_migration_entry(pfn); swp_pte = swp_entry_to_pte(entry); @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && + !anon_exclusive, page); + if (anon_exclusive && + page_try_share_anon_rmap(subpage)) { + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } /* * Store the pfn of the page in a special migration @@ -1912,6 +1956,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, if (pte_write(pteval)) entry = make_writable_migration_entry( page_to_pfn(subpage)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry( + page_to_pfn(subpage)); else entry = make_readable_migration_entry( page_to_pfn(subpage)); @@ -2425,6 +2472,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, BUG_ON(!anon_vma); /* address might be in next vma when migration races vma_adjust */ first = atomic_inc_and_test(compound_mapcount_ptr(page)); + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) __page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE);
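
Condensed from the try_to_migrate_one() hunks above, the core flow for a
PTE-mapped, exclusive anonymous page looks roughly like this. It is a
sketch only, using the variable names from the hunks above; the hugetlb,
device-private, soft-dirty/uffd-wp and remaining error-handling details
are omitted and no code beyond what the hunks show is implied:

/*
 * Rough sketch of the try_to_migrate_one() flow for a PTE-mapped
 * anonymous page. The PTE is cleared and flushed *before* trying to
 * clear PG_anon_exclusive, so GUP-fast cannot grab a new pin
 * concurrently (see II. below).
 */
anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage);

flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
pteval = ptep_clear_flush(vma, address, pvmw.pte);

if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
        /* The page may be pinned: restore the PTE and fail the unmap. */
        set_pte_at(mm, address, pvmw.pte, pteval);
        ret = false;
} else {
        /* Remember exclusivity in the type of the migration entry. */
        if (pte_write(pteval))
                entry = make_writable_migration_entry(page_to_pfn(subpage));
        else if (anon_exclusive)
                entry = make_readable_exclusive_migration_entry(page_to_pfn(subpage));
        else
                entry = make_readable_migration_entry(page_to_pfn(subpage));
        set_pte_at(mm, address, pvmw.pte, swp_entry_to_pte(entry));
}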

Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
exclusive, and use that information to make GUP pins reliable and stay
consistent with the page mapped into the page table even if the page
table entry gets write-protected.

With that information at hand, we can extend our COW logic to always
reuse anonymous pages that are exclusive. For anonymous pages that might
be shared, the existing logic applies.

As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry. Especially PTE vs. PMD-mapped
anonymous pages require more thought; some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page
tables exclusively in a single process -- multiple page table locks
apply. Further, due to MADV_WIPEONFORK we might not necessarily
write-protect all PTEs, and only some subpages might be pinned. Long
story short: once PTE-mapped, we have to track information about
exclusivity per sub-page, but until then, we can just track it for the
compound page in the head page and don't have to update a whole bunch of
subpages all the time for a simple PMD mapping of a THP.

For simplicity, this commit mostly talks about "anonymous pages", while
for THP it's actually "the part of an anonymous folio referenced via a
page table entry".

To not spill PG_anon_exclusive code all over the mm code-base, we let
the anon rmap code handle all PG_anon_exclusive logic it can easily
handle.

If a writable, present page table entry points at an anonymous
(sub)page, that (sub)page must be PG_anon_exclusive. If GUP wants to
take a reliable pin (FOLL_PIN) on an anonymous page referenced via a
present page table entry, it must only pin if PG_anon_exclusive is set
for the mapped (sub)page.

This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE; follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages
fully reliable.

Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared. Clearing will fail if there are GUP pins on the page:

* For fork(), this means having to copy the page and not being able to
  share it. fork() protects against concurrent GUP using the PT lock and
  the src_mm->write_protect_seq.

* For KSM, this means sharing will fail. For swap, this means unmapping
  will fail. For migration, this means migration will fail early. All
  three cases protect against concurrent GUP using the PT lock and a
  proper clear/invalidate+flush of the relevant page table entry.

This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the subsequent write fault ends up
replacing the page instead of reusing it. It improves the situation for
O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN if
fork() is *not* involved; however, swapout and fork() are still
problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP
users will fix the issue for them.

I. Details about basic handling

I.1. Fresh anonymous pages

page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
the mechanism by which fresh anonymous pages come to life (besides the
migration code, where we copy page->mapping), all fresh anonymous pages
will start out as exclusive.
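
The mm/rmap.c hunk above makes this concrete. Roughly -- a simplified
sketch based on that hunk, with the anon_vma selection for the
non-exclusive case omitted, not the complete function --
__page_set_anon_rmap() now ends up as:

/* Simplified sketch of __page_set_anon_rmap() after this patch. */
static void __page_set_anon_rmap(struct page *page,
                                 struct vm_area_struct *vma,
                                 unsigned long address, int exclusive)
{
        struct anon_vma *anon_vma = vma->anon_vma;

        BUG_ON(!anon_vma);

        /* Already PageAnon(): only the exclusive marking remains to do. */
        if (PageAnon(page))
                goto out;

        /* First mapping: set up page->mapping and page->index. */
        anon_vma = (void *)anon_vma + PAGE_MAPPING_ANON;
        WRITE_ONCE(page->mapping, (struct address_space *)anon_vma);
        page->index = linear_page_index(vma, address);
out:
        /* New: remember that this page is currently mapped exclusively. */
        if (exclusive)
                SetPageAnonExclusive(page);
}

Since page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() pass
exclusive=1, every freshly faulted-in anonymous page starts out with
PG_anon_exclusive set.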

I.2. COW reuse handling of anonymous pages

When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it. Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused. If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.

Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as
huge pages are a scarce resource.

I.3. Migration handling

try_to_migrate() has to try marking an exclusive anonymous page shared
via page_try_share_anon_rmap(). If it fails because there are GUP pins
on the page, unmapping fails. migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.

Writable migration entries implicitly point at shared anonymous pages.
For readable migration entries, that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.

When restoring a migration entry in remove_migration_pte(), information
about exclusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
information.

I.4. Swapout handling

try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on
the page, unmapping fails. For now, information about exclusivity is
lost. In the future, we might want to remember that information in the
swap entry in some cases; however, it requires more thought, care, and a
way to store that information in swap entries.
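
Both I.3 and I.4 hinge on the same helper. Its include/linux/rmap.h hunk
is not shown above, but the contract it has to provide can be sketched
roughly as follows. This is an illustration of the described semantics,
not the literal helper from this patch: it assumes the
ClearPageAnonExclusive() helper introduced together with the new page
flag and uses page_maybe_dma_pinned() as the pin check, while the real
implementation additionally has to care about device-private pages and
memory ordering against GUP-fast:

/*
 * Sketch: mark an exclusive anonymous page "possibly shared". This must
 * fail while the page may be pinned (FOLL_PIN), because such pins rely
 * on the page staying exclusive.
 */
static inline int page_try_share_anon_rmap(struct page *page)
{
        VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);

        /* A concurrent or existing GUP pin relies on exclusivity. */
        if (unlikely(page_maybe_dma_pinned(page)))
                return -EBUSY;

        ClearPageAnonExclusive(page);
        return 0;
}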

I.5. Swapin handling

do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that. do_swap_page() always
has to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

I.6. THP handling

__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.

a) In case we have a readable-exclusive PMD migration entry, simply
   insert readable-exclusive PTE migration entries.

b) In case we have a present PMD entry and we don't want to freeze
   ("convert to migration entries"), simply forward PG_anon_exclusive to
   all sub-pages; no need to temporarily clear the bit.

c) In case we have a present PMD entry and want to freeze, handle it
   similarly to try_to_migrate(): try marking the page shared first. In
   case we fail, we ignore the "freeze" instruction and simply split
   ordinarily. try_to_migrate() will properly fail because the THP is
   still mapped via PTEs.

When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.

I.7. fork() handling

fork() handling is relatively easy, because PG_anon_exclusive is only
expressive for some page table entry types.

a) Present anonymous pages

page_try_dup_anon_rmap() will mark the given subpage shared -- which
will fail if the page is pinned. If it failed, we have to copy the page
(or PTE-map a PMD to handle it on the PTE level).

Note that device exclusive entries are just a pointer at a PageAnon()
page. fork() will first convert a device exclusive entry to a present
page table entry and handle it just like present anonymous pages.

b) Device private entries

Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned. page_try_dup_anon_rmap()
will mark the given subpage shared, which cannot fail because they
cannot get pinned.

c) HW poison entries

PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.

d) Migration entries

Writable and readable-exclusive entries are converted to readable
entries: possibly shared.

I.8. mprotect() handling

mprotect() only has to properly handle the new readable-exclusive
migration entry: when write-protecting a migration entry that points at
an anonymous page, remember the information about exclusivity via the
"readable-exclusive" migration entry type.

II. Migration and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper
clear/invalidate+flush of the page table entry, to make the following
scenario impossible:

1. try_to_migrate() places a migration entry after checking for GUP pins
   and marks the page possibly shared.

2. GUP-fast pins the page due to lack of synchronization.

3. fork() converts the "writable/readable-exclusive" migration entry
   into a readable migration entry.

4. Migration fails due to the GUP pin (failing to freeze the refcount).

5. Migration entries are restored. PG_anon_exclusive is lost.

-> We have a pinned page that is not marked exclusive anymore.

Note that we move information about exclusivity from the page to the
migration entry as it otherwise highly overcomplicates fork() and
PTE-mapping a THP.

III. Swapout and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly shared
and synchronize against GUP-fast by a proper clear/invalidate+flush of
the page table entry, to make the following scenario impossible:

1. try_to_unmap() places a swap entry after checking for GUP pins and
   clears exclusivity information on the page.

2. GUP-fast pins the page due to lack of synchronization.

-> We have a pinned page that is not marked exclusive anymore.

If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II. would
apply. This is future work.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h    | 40 +++++++++++++++++++++
 include/linux/swap.h    | 15 +++++---
 include/linux/swapops.h | 25 +++++++++++++
 mm/huge_memory.c        | 77 +++++++++++++++++++++++++++++++++++++----
 mm/hugetlb.c            | 15 +++++---
 mm/ksm.c                | 13 ++++++-
 mm/memory.c             | 33 +++++++++++++-----
 mm/migrate.c            | 34 ++++++++++++++++--
 mm/mprotect.c           |  8 +++--
 mm/rmap.c               | 59 ++++++++++++++++++++++++++++---
 10 files changed, 285 insertions(+), 34 deletions(-)