Message ID | 20220315104741.63071-12-david@redhat.com
---|---
State | New
Series | mm: COW fixes part 2: reliable GUP pins of anonymous pages
On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as exclusive, and use that information to make GUP pins reliable and stay consistent with the page mapped into the page table even if the page table entry gets write-protected.
>
> With that information at hand, we can extend our COW logic to always reuse anonymous pages that are exclusive. For anonymous pages that might be shared, the existing logic applies.
>
> As already documented, PG_anon_exclusive is usually only expressive in combination with a page table entry. Especially PTE vs. PMD-mapped anonymous pages require more thought, some examples: due to mremap() we can easily have a single compound page PTE-mapped into multiple page tables exclusively in a single process -- multiple page table locks apply. Further, due to MADV_WIPEONFORK we might not necessarily write-protect all PTEs, and only some subpages might be pinned. Long story short: once PTE-mapped, we have to track information about exclusivity per sub-page, but until then, we can just track it for the compound page in the head page and not have to update a whole bunch of subpages all of the time for a simple PMD mapping of a THP.
>
> For simplicity, this commit mostly talks about "anonymous pages", while for THP it's actually "the part of an anonymous folio referenced via a page table entry".
>
> To not spill PG_anon_exclusive code all over the mm code-base, we let the anon rmap code handle all PG_anon_exclusive logic it can easily handle.
>
> If a writable, present page table entry points at an anonymous (sub)page, that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page referenced via a present page table entry, it must only pin if PG_anon_exclusive is set for the mapped (sub)page.
>
> This commit doesn't adjust GUP, so this is only implicitly handled for FOLL_WRITE; follow-up commits will teach GUP to also respect it for FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully reliable.
>
> Whenever an anonymous page is to be shared (fork(), KSM), or when temporarily unmapping an anonymous page (swap, migration), the relevant PG_anon_exclusive bit has to be cleared to mark the anonymous page possibly shared. Clearing will fail if there are GUP pins on the page:
> * For fork(), this means having to copy the page and not being able to share it. fork() protects against concurrent GUP using the PT lock and the src_mm->write_protect_seq.
> * For KSM, this means sharing will fail. For swap, this means unmapping will fail. For migration, this means migration will fail early. All three cases protect against concurrent GUP using the PT lock and a proper clear/invalidate+flush of the relevant page table entry.
>
> This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a pinned page gets mapped R/O and the successive write fault ends up replacing the page instead of reusing it. It improves the situation for O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN, if fork() is *not* involved; however, swapout and fork() are still problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP users will fix the issue for them.
>
> I. Details about basic handling
>
> I.1. Fresh anonymous pages
>
> page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the given page exclusive via __page_set_anon_rmap(exclusive=1). As that is the mechanism by which fresh anonymous pages come into life (besides migration code where we copy the page->mapping), all fresh anonymous pages will start out as exclusive.
>
> I.2. COW reuse handling of anonymous pages
>
> When a COW handler stumbles over a (sub)page that's marked exclusive, it simply reuses it. Otherwise, the handler tries harder under page lock to detect if the (sub)page is exclusive and can be reused. If exclusive, page_move_anon_rmap() will mark the given (sub)page exclusive.
>
> Note that hugetlb code does not yet check for PageAnonExclusive(), as it still uses the old COW logic that is prone to the COW security issue, because hugetlb code cannot really tolerate unnecessary/wrong COW as huge pages are a scarce resource.
>
> I.3. Migration handling
>
> try_to_migrate() has to try marking an exclusive anonymous page shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. migrate_vma_collect_pmd() and __split_huge_pmd_locked() are handled similarly.
>
> Writable migration entries implicitly point at shared anonymous pages. For readable migration entries, that information is stored via a new "readable-exclusive" migration entry, specific to anonymous pages.
>
> When restoring a migration entry in remove_migration_pte(), information about exclusivity is detected via the migration entry type, and RMAP_EXCLUSIVE is set accordingly for page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
>
> I.4. Swapout handling
>
> try_to_unmap() has to try marking the mapped page possibly shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. For now, information about exclusivity is lost. In the future, we might want to remember that information in the swap entry in some cases; however, that requires more thought, care, and a way to store the information in swap entries.
>
> I.5. Swapin handling
>
> do_swap_page() will never stumble over exclusive anonymous pages in the swap cache, as try_to_migrate() prohibits that. do_swap_page() always has to detect manually if an anonymous page is exclusive and has to set RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
>
> I.6. THP handling
>
> __split_huge_pmd_locked() has to move the information about exclusivity from the PMD to the PTEs.
>
> a) In case we have a readable-exclusive PMD migration entry, simply insert readable-exclusive PTE migration entries.
>
> b) In case we have a present PMD entry and we don't want to freeze ("convert to migration entries"), simply forward PG_anon_exclusive to all sub-pages; no need to temporarily clear the bit.
>
> c) In case we have a present PMD entry and want to freeze, handle it similar to try_to_migrate(): try marking the page shared first. In case we fail, we ignore the "freeze" instruction and simply split ordinarily. try_to_migrate() will properly fail because the THP is still mapped via PTEs.

How come will try_to_migrate() fail? The afterward pvmw will find those PTEs then convert them to migration entries anyway IIUC.

> When splitting a compound anonymous folio (THP), the information about exclusivity is implicitly handled via the migration entries: no need to replicate PG_anon_exclusive manually.
>
> I.7. fork() handling
>
> fork() handling is relatively easy, because PG_anon_exclusive is only expressive for some page table entry types.
>
> a) Present anonymous pages
>
> page_try_dup_anon_rmap() will mark the given subpage shared -- which will fail if the page is pinned. If it failed, we have to copy (or PTE-map a PMD to handle it on the PTE level).
>
> Note that device exclusive entries are just a pointer at a PageAnon() page. fork() will first convert a device exclusive entry to a present page table entry and handle it just like present anonymous pages.
>
> b) Device private entry
>
> Device private entries point at PageAnon() pages that cannot be mapped directly and, therefore, cannot get pinned.
>
> page_try_dup_anon_rmap() will mark the given subpage shared, which cannot fail because they cannot get pinned.
>
> c) HW poison entries
>
> PG_anon_exclusive will remain untouched and is stale -- the page table entry is just a placeholder after all.
>
> d) Migration entries
>
> Writable and readable-exclusive entries are converted to readable entries: possibly shared.
>
> I.8. mprotect() handling
>
> mprotect() only has to properly handle the new readable-exclusive migration entry:
>
> When write-protecting a migration entry that points at an anonymous page, remember the information about exclusivity via the "readable-exclusive" migration entry type.
>
> II. Migration and GUP-fast
>
> Whenever replacing a present page table entry that maps an exclusive anonymous page by a migration entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible:
>
> 1. try_to_migrate() places a migration entry after checking for GUP pins and marks the page possibly shared.
> 2. GUP-fast pins the page due to lack of synchronization.
> 3. fork() converts the "writable/readable-exclusive" migration entry into a readable migration entry.
> 4. Migration fails due to the GUP pin (failing to freeze the refcount).
> 5. Migration entries are restored. PG_anon_exclusive is lost.
>
> -> We have a pinned page that is not marked exclusive anymore.
>
> Note that we move information about exclusivity from the page to the migration entry as it otherwise highly overcomplicates fork() and PTE-mapping a THP.
>
> III. Swapout and GUP-fast
>
> Whenever replacing a present page table entry that maps an exclusive anonymous page by a swap entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible:
>
> 1. try_to_unmap() places a swap entry after checking for GUP pins and clears exclusivity information on the page.
> 2. GUP-fast pins the page due to lack of synchronization.
>
> -> We have a pinned page that is not marked exclusive anymore.
>
> If we'd ever store information about exclusivity in the swap entry, similar to migration handling, the same considerations as in II would apply. This is future work.
> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > include/linux/rmap.h | 40 +++++++++++++++++++++ > include/linux/swap.h | 15 +++++--- > include/linux/swapops.h | 25 +++++++++++++ > mm/huge_memory.c | 77 +++++++++++++++++++++++++++++++++++++---- > mm/hugetlb.c | 15 +++++--- > mm/ksm.c | 13 ++++++- > mm/memory.c | 33 +++++++++++++----- > mm/migrate.c | 34 ++++++++++++++++-- > mm/mprotect.c | 8 +++-- > mm/rmap.c | 59 ++++++++++++++++++++++++++++--- > 10 files changed, 285 insertions(+), 34 deletions(-) > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 51953bace0a3..1aef834e1d60 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -224,6 +224,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, > { > VM_BUG_ON_PAGE(!PageAnon(page), page); > > + /* > + * No need to check+clear for already shared pages, including KSM > + * pages. > + */ > + if (!PageAnonExclusive(page)) > + goto dup; > + > /* > * If this page may have been pinned by the parent process, > * don't allow to duplicate the mapping but instead require to e.g., > @@ -235,14 +242,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, > unlikely(page_needs_cow_for_dma(vma, page)))) > return -EBUSY; > > + ClearPageAnonExclusive(page); > /* > * It's okay to share the anon page between both processes, mapping > * the page R/O into both processes. > */ > +dup: > __page_dup_rmap(page, compound); > return 0; > } > > +/** > + * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly > + * shared to prepare for KSM or temporary unmapping > + * @page: the exclusive anonymous page to try marking possibly shared > + * > + * The caller needs to hold the PT lock and has to have the page table entry > + * cleared/invalidated+flushed, to properly sync against GUP-fast. > + * > + * This is similar to page_try_dup_anon_rmap(), however, not used during fork() > + * to duplicate a mapping, but instead to prepare for KSM or temporarily > + * unmapping a page (swap, migration) via page_remove_rmap(). > + * > + * Marking the page shared can only fail if the page may be pinned; device > + * private pages cannot get pinned and consequently this function cannot fail. > + * > + * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY > + * otherwise. > + */ > +static inline int page_try_share_anon_rmap(struct page *page) > +{ > + VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page); > + > + /* See page_try_dup_anon_rmap(). */ > + if (likely(!is_device_private_page(page) && > + unlikely(page_maybe_dma_pinned(page)))) > + return -EBUSY; > + > + ClearPageAnonExclusive(page); > + return 0; > +} > + > /* > * Called from mm/vmscan.c to handle paging out > */ > diff --git a/include/linux/swap.h b/include/linux/swap.h > index b546e4bd5c5a..422765d1141c 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -78,12 +78,19 @@ static inline int current_is_kswapd(void) > #endif > > /* > - * NUMA node memory migration support > + * Page migration support. > + * > + * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and > + * indicates that the referenced (part of) an anonymous page is exclusive to > + * a single process. For SWP_MIGRATION_WRITE, that information is implicit: > + * (part of) an anonymous page that are mapped writable are exclusive to a > + * single process. 
> */ > #ifdef CONFIG_MIGRATION > -#define SWP_MIGRATION_NUM 2 > -#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) > -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) > +#define SWP_MIGRATION_NUM 3 > +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) > +#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) > +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2) > #else > #define SWP_MIGRATION_NUM 0 > #endif > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > index d356ab4047f7..06280fc1c99b 100644 > --- a/include/linux/swapops.h > +++ b/include/linux/swapops.h > @@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) > static inline int is_migration_entry(swp_entry_t entry) > { > return unlikely(swp_type(entry) == SWP_MIGRATION_READ || > + swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE || > swp_type(entry) == SWP_MIGRATION_WRITE); > } > > @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry) > return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); > } > > +static inline int is_readable_migration_entry(swp_entry_t entry) > +{ > + return unlikely(swp_type(entry) == SWP_MIGRATION_READ); > +} > + > +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry) > +{ > + return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE); > +} > + > static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) > { > return swp_entry(SWP_MIGRATION_READ, offset); > } > > +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) > +{ > + return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset); > +} > + > static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) > { > return swp_entry(SWP_MIGRATION_WRITE, offset); > @@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) > return swp_entry(0, 0); > } > > +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) > +{ > + return swp_entry(0, 0); > +} > + > static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) > { > return swp_entry(0, 0); > @@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry) > { > return 0; > } > +static inline int is_readable_migration_entry(swp_entry_t entry) > +{ > + return 0; > +} > > #endif > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 0b6fb409b9e4..4872e7120ee1 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > swp_entry_t entry = pmd_to_swp_entry(pmd); > > VM_BUG_ON(!is_pmd_migration_entry(pmd)); > - if (is_writable_migration_entry(entry)) { > + if (!is_readable_migration_entry(entry)) { > entry = make_readable_migration_entry( > swp_offset(entry)); > pmd = swp_entry_to_pmd(entry); > @@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > page = pmd_page(orig_pmd); > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* Early check when only holding the PT lock. */ > + if (PageAnonExclusive(page)) > + goto reuse; > + > if (!trylock_page(page)) { > get_page(page); > spin_unlock(vmf->ptl); > @@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > put_page(page); > } > > + /* Recheck after temporarily dropping the PT lock. 
*/ > + if (PageAnonExclusive(page)) { > + unlock_page(page); > + goto reuse; > + } > + > /* > * See do_wp_page(): we can only map the page writable if there are > * no additional references. Note that we always drain the LRU > @@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > pmd_t entry; > > page_move_anon_rmap(page, vma); > + unlock_page(page); > +reuse: > entry = pmd_mkyoung(orig_pmd); > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) > update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); > - unlock_page(page); > spin_unlock(vmf->ptl); > return VM_FAULT_WRITE; > } > @@ -1741,6 +1752,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION > if (is_swap_pmd(*pmd)) { > swp_entry_t entry = pmd_to_swp_entry(*pmd); > + struct page *page = pfn_swap_entry_to_page(entry); > > VM_BUG_ON(!is_pmd_migration_entry(*pmd)); > if (is_writable_migration_entry(entry)) { > @@ -1749,8 +1761,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > * A protection check is difficult so > * just be safe and disable write > */ > - entry = make_readable_migration_entry( > - swp_offset(entry)); > + if (PageAnon(page)) > + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); > + else > + entry = make_readable_migration_entry(swp_offset(entry)); > newpmd = swp_entry_to_pmd(entry); > if (pmd_swp_soft_dirty(*pmd)) > newpmd = pmd_swp_mksoft_dirty(newpmd); > @@ -1959,6 +1973,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > pgtable_t pgtable; > pmd_t old_pmd, _pmd; > bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; > + bool anon_exclusive = false; > unsigned long addr; > int i; > > @@ -2040,6 +2055,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > entry = pmd_to_swp_entry(old_pmd); > page = pfn_swap_entry_to_page(entry); > write = is_writable_migration_entry(entry); > + if (PageAnon(page)) > + anon_exclusive = is_readable_exclusive_migration_entry(entry); > young = false; > soft_dirty = pmd_swp_soft_dirty(old_pmd); > uffd_wp = pmd_swp_uffd_wp(old_pmd); > @@ -2051,6 +2068,23 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > young = pmd_young(old_pmd); > soft_dirty = pmd_soft_dirty(old_pmd); > uffd_wp = pmd_uffd_wp(old_pmd); > + > + /* > + * Without "freeze", we'll simply split the PMD, propagating the > + * PageAnonExclusive() flag for each PTE by setting it for > + * each subpage -- no need to (temporarily) clear. > + * > + * With "freeze" we want to replace mapped pages by > + * migration entries right away. This is only possible if we > + * managed to clear PageAnonExclusive() -- see > + * set_pmd_migration_entry(). > + * > + * In case we cannot clear PageAnonExclusive(), split the PMD > + * only and let try_to_migrate_one() fail later. 
> + */ > + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > + if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) > + freeze = false; > } > VM_BUG_ON_PAGE(!page_count(page), page); > page_ref_add(page, HPAGE_PMD_NR - 1); > @@ -2074,6 +2108,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > if (write) > swp_entry = make_writable_migration_entry( > page_to_pfn(page + i)); > + else if (anon_exclusive) > + swp_entry = make_readable_exclusive_migration_entry( > + page_to_pfn(page + i)); > else > swp_entry = make_readable_migration_entry( > page_to_pfn(page + i)); > @@ -2085,6 +2122,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, > } else { > entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); > entry = maybe_mkwrite(entry, vma); > + if (anon_exclusive) > + SetPageAnonExclusive(page + i); > if (!write) > entry = pte_wrprotect(entry); > if (!young) > @@ -2315,6 +2354,13 @@ static void __split_huge_page_tail(struct page *head, int tail, > * > * After successful get_page_unless_zero() might follow flags change, > * for example lock_page() which set PG_waiters. > + * > + * Note that for mapped sub-pages of an anonymous THP, > + * PG_anon_exclusive has been cleared in unmap_page() and is stored in > + * the migration entry instead from where remap_page() will restore it. > + * We can still have PG_anon_exclusive set on effectively unmapped and > + * unreferenced sub-pages of an anonymous THP: we can simply drop > + * PG_anon_exclusive (-> PG_mappedtodisk) for these here. > */ > page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; > page_tail->flags |= (head->flags & > @@ -3070,6 +3116,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, > struct vm_area_struct *vma = pvmw->vma; > struct mm_struct *mm = vma->vm_mm; > unsigned long address = pvmw->address; > + bool anon_exclusive; > pmd_t pmdval; > swp_entry_t entry; > pmd_t pmdswp; > @@ -3079,10 +3126,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, > > flush_cache_range(vma, address, address + HPAGE_PMD_SIZE); > pmdval = pmdp_invalidate(vma, address, pvmw->pmd); > + > + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > + if (anon_exclusive && page_try_share_anon_rmap(page)) { > + set_pmd_at(mm, address, pvmw->pmd, pmdval); > + return; > + } > + > if (pmd_dirty(pmdval)) > set_page_dirty(page); > if (pmd_write(pmdval)) > entry = make_writable_migration_entry(page_to_pfn(page)); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry(page_to_pfn(page)); > else > entry = make_readable_migration_entry(page_to_pfn(page)); > pmdswp = swp_entry_to_pmd(entry); > @@ -3116,10 +3172,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) > pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); > > flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); > - if (PageAnon(new)) > - page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND); > - else > + if (PageAnon(new)) { > + int rmap_flags = RMAP_COMPOUND; > + > + if (!is_readable_migration_entry(entry)) > + rmap_flags |= RMAP_EXCLUSIVE; > + > + page_add_anon_rmap(new, vma, mmun_start, rmap_flags); > + } else { > page_add_file_rmap(new, true); > + } > + VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new)); > set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); > if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) > mlock_vma_page(new); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1ff0b9e1e28e..c0e19ea5ebb5 
100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4765,7 +4765,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > is_hugetlb_entry_hwpoisoned(entry))) { > swp_entry_t swp_entry = pte_to_swp_entry(entry); > > - if (is_writable_migration_entry(swp_entry) && cow) { > + if (!is_readable_migration_entry(swp_entry) && cow) { > /* > * COW mappings require pages in both > * parent and child to be set to read. > @@ -5165,6 +5165,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, > set_huge_ptep_writable(vma, haddr, ptep); > return 0; > } > + VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), > + old_page); > > /* > * If the process that created a MAP_PRIVATE mapping is about to > @@ -6160,12 +6162,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > } > if (unlikely(is_hugetlb_entry_migration(pte))) { > swp_entry_t entry = pte_to_swp_entry(pte); > + struct page *page = pfn_swap_entry_to_page(entry); > > - if (is_writable_migration_entry(entry)) { > + if (!is_readable_migration_entry(entry)) { > pte_t newpte; > > - entry = make_readable_migration_entry( > - swp_offset(entry)); > + if (PageAnon(page)) > + entry = make_readable_exclusive_migration_entry( > + swp_offset(entry)); > + else > + entry = make_readable_migration_entry( > + swp_offset(entry)); > newpte = swp_entry_to_pte(entry); > set_huge_swap_pte_at(mm, address, ptep, > newpte, huge_page_size(h)); > diff --git a/mm/ksm.c b/mm/ksm.c > index 9ff28097bc0a..350b46432d65 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -866,6 +866,7 @@ static inline struct stable_node *page_stable_node(struct page *page) > static inline void set_page_stable_node(struct page *page, > struct stable_node *stable_node) > { > + VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page); > page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); > } > > @@ -1041,6 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > int swapped; > int err = -EFAULT; > struct mmu_notifier_range range; > + bool anon_exclusive; > > pvmw.address = page_address_in_vma(page, vma); > if (pvmw.address == -EFAULT) > @@ -1058,9 +1060,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) > goto out_unlock; > > + anon_exclusive = PageAnonExclusive(page); > if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || > (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) || > - mm_tlb_flush_pending(mm)) { > + anon_exclusive || mm_tlb_flush_pending(mm)) { > pte_t entry; > > swapped = PageSwapCache(page); > @@ -1088,6 +1091,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > set_pte_at(mm, pvmw.address, pvmw.pte, entry); > goto out_unlock; > } > + > + if (anon_exclusive && page_try_share_anon_rmap(page)) { > + set_pte_at(mm, pvmw.address, pvmw.pte, entry); > + goto out_unlock; > + } > + > if (pte_dirty(entry)) > set_page_dirty(page); > > @@ -1146,6 +1155,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > pte_unmap_unlock(ptep, ptl); > goto out_mn; > } > + VM_BUG_ON_PAGE(PageAnonExclusive(page), page); > + VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage); > > /* > * No need to check ksm_use_zero_pages here: we can only have a > diff --git a/mm/memory.c b/mm/memory.c > index d01fab481134..222aaf277af4 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -720,6 +720,8 @@ static void 
restore_exclusive_pte(struct vm_area_struct *vma, > else if (is_writable_device_exclusive_entry(entry)) > pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > + VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page))); > + > /* > * No need to take a page reference as one was already > * created when the swap entry was made. > @@ -799,11 +801,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > rss[mm_counter(page)]++; > > - if (is_writable_migration_entry(entry) && > + if (!is_readable_migration_entry(entry) && > is_cow_mapping(vm_flags)) { > /* > - * COW mappings require pages in both > - * parent and child to be set to read. > + * COW mappings require pages in both parent and child > + * to be set to read. A previously exclusive entry is > + * now shared. > */ > entry = make_readable_migration_entry( > swp_offset(entry)); > @@ -954,6 +957,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > ptep_set_wrprotect(src_mm, addr, src_pte); > pte = pte_wrprotect(pte); > } > + VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page)); > > /* > * If it's a shared mapping, mark it clean in > @@ -2943,6 +2947,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf) > struct vm_area_struct *vma = vmf->vma; > struct page *page = vmf->page; > pte_t entry; > + > + VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page)); > + > /* > * Clear the pages cpupid information as the existing > * information potentially belongs to a now completely > @@ -3277,6 +3284,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > if (PageAnon(vmf->page)) { > struct page *page = vmf->page; > > + /* > + * If the page is exclusive to this process we must reuse the > + * page without further checks. > + */ > + if (PageAnonExclusive(page)) > + goto reuse; > + > /* > * We have to verify under page lock: these early checks are > * just an optimization to avoid locking the page and freeing > @@ -3309,6 +3323,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > */ > page_move_anon_rmap(page, vma); > unlock_page(page); > +reuse: > wp_page_reuse(vmf); > return VM_FAULT_WRITE; > } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > @@ -3700,11 +3715,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > * that are certainly not shared because we just allocated them without > * exposing them to the swapcache. 
> */ > - if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && > - (page != swapcache || page_count(page) == 1)) { > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > - vmf->flags &= ~FAULT_FLAG_WRITE; > - ret |= VM_FAULT_WRITE; > + if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) { > + if (vmf->flags & FAULT_FLAG_WRITE) { > + pte = maybe_mkwrite(pte_mkdirty(pte), vma); > + vmf->flags &= ~FAULT_FLAG_WRITE; > + ret |= VM_FAULT_WRITE; > + } > rmap_flags |= RMAP_EXCLUSIVE; > } > flush_icache_page(vma, page); > @@ -3724,6 +3740,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > page_add_anon_rmap(page, vma, vmf->address, rmap_flags); > } > > + VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page))); > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); > arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); > > diff --git a/mm/migrate.c b/mm/migrate.c > index fd9eba33b34a..7f440d2103ce 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -188,6 +188,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > > VM_BUG_ON_PAGE(PageTail(page), page); > while (page_vma_mapped_walk(&pvmw)) { > + rmap_t rmap_flags = RMAP_NONE; > + > if (PageKsm(page)) > new = page; > else > @@ -217,6 +219,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > else if (pte_swp_uffd_wp(*pvmw.pte)) > pte = pte_mkuffd_wp(pte); > > + if (PageAnon(new) && !is_readable_migration_entry(entry)) > + rmap_flags |= RMAP_EXCLUSIVE; > + > if (unlikely(is_device_private_page(new))) { > if (pte_write(pte)) > entry = make_writable_device_private_entry( > @@ -239,7 +244,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > pte = arch_make_huge_pte(pte, shift, vma->vm_flags); > if (PageAnon(new)) > hugepage_add_anon_rmap(new, vma, pvmw.address, > - RMAP_NONE); > + rmap_flags); > else > page_dup_file_rmap(new, true); > set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); > @@ -248,7 +253,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > { > if (PageAnon(new)) > page_add_anon_rmap(new, vma, pvmw.address, > - RMAP_NONE); > + rmap_flags); > else > page_add_file_rmap(new, false); > set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); > @@ -379,6 +384,10 @@ int folio_migrate_mapping(struct address_space *mapping, > /* No turning back from here */ > newfolio->index = folio->index; > newfolio->mapping = folio->mapping; > + /* > + * Note: PG_anon_exclusive is always migrated via migration > + * entries. > + */ > if (folio_test_swapbacked(folio)) > __folio_set_swapbacked(newfolio); > > @@ -2302,15 +2311,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > * set up a special migration page table entry now. 
> */ > if (trylock_page(page)) { > + bool anon_exclusive; > pte_t swp_pte; > > + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); > + if (anon_exclusive) { > + flush_cache_page(vma, addr, pte_pfn(*ptep)); > + ptep_clear_flush(vma, addr, ptep); > + > + if (page_try_share_anon_rmap(page)) { > + set_pte_at(mm, addr, ptep, pte); > + unlock_page(page); > + put_page(page); > + mpfn = 0; > + goto next; > + } > + } else { > + ptep_get_and_clear(mm, addr, ptep); > + } > + > migrate->cpages++; > - ptep_get_and_clear(mm, addr, ptep); > > /* Setup special migration page table entry */ > if (mpfn & MIGRATE_PFN_WRITE) > entry = make_writable_migration_entry( > page_to_pfn(page)); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry( > + page_to_pfn(page)); > else > entry = make_readable_migration_entry( > page_to_pfn(page)); > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 2887644fd150..9441675e993c 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -141,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > pages++; > } else if (is_swap_pte(oldpte)) { > swp_entry_t entry = pte_to_swp_entry(oldpte); > + struct page *page = pfn_swap_entry_to_page(entry); > pte_t newpte; > > if (is_writable_migration_entry(entry)) { > @@ -148,8 +149,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > * A protection check is difficult so > * just be safe and disable write > */ > - entry = make_readable_migration_entry( > - swp_offset(entry)); > + if (PageAnon(page)) > + entry = make_readable_exclusive_migration_entry( > + swp_offset(entry)); > + else > + entry = make_readable_migration_entry(swp_offset(entry)); > newpte = swp_entry_to_pte(entry); > if (pte_swp_soft_dirty(oldpte)) > newpte = pte_swp_mksoft_dirty(newpte); > diff --git a/mm/rmap.c b/mm/rmap.c > index ebe7140c4493..9d2a7e11e8cc 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1048,6 +1048,7 @@ EXPORT_SYMBOL_GPL(folio_mkclean); > void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) > { > struct anon_vma *anon_vma = vma->anon_vma; > + struct page *subpage = page; > > page = compound_head(page); > > @@ -1061,6 +1062,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) > * PageAnon()) will not see one without the other. > */ > WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); > + SetPageAnonExclusive(subpage); > } > > /** > @@ -1078,7 +1080,7 @@ static void __page_set_anon_rmap(struct page *page, > BUG_ON(!anon_vma); > > if (PageAnon(page)) > - return; > + goto out; > > /* > * If the page isn't exclusively mapped into this vma, > @@ -1097,6 +1099,9 @@ static void __page_set_anon_rmap(struct page *page, > anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; > WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); > page->index = linear_page_index(vma, address); > +out: > + if (exclusive) > + SetPageAnonExclusive(page); > } > > /** > @@ -1156,6 +1161,8 @@ void page_add_anon_rmap(struct page *page, > } else { > first = atomic_inc_and_test(&page->_mapcount); > } > + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); > + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); > > if (first) { > int nr = compound ? 
thp_nr_pages(page) : 1; > @@ -1422,7 +1429,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > }; > pte_t pteval; > struct page *subpage; > - bool ret = true; > + bool anon_exclusive, ret = true; > struct mmu_notifier_range range; > enum ttu_flags flags = (enum ttu_flags)(long)arg; > > @@ -1485,6 +1492,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); > address = pvmw.address; > + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); > > if (PageHuge(page) && !PageAnon(page)) { > /* > @@ -1520,9 +1528,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > } > } > > - /* Nuke the page table entry. */ > + /* > + * Nuke the page table entry. When having to clear > + * PageAnonExclusive(), we always have to flush. > + */ > flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); > - if (should_defer_flush(mm, flags)) { > + if (should_defer_flush(mm, flags) && !anon_exclusive) { > /* > * We clear the PTE but do not flush so potentially > * a remote CPU could still be writing to the page. > @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > page_vma_mapped_walk_done(&pvmw); > break; > } > + if (anon_exclusive && > + page_try_share_anon_rmap(subpage)) { > + swap_free(entry); > + set_pte_at(mm, address, pvmw.pte, pteval); > + ret = false; > + page_vma_mapped_walk_done(&pvmw); > + break; > + } > + /* > + * Note: We *don't* remember yet if the page was mapped > + * exclusively in the swap entry, so swapin code has > + * to re-determine that manually and might detect the > + * page as possibly shared, for example, if there are > + * other references on the page or if the page is under > + * writeback. We made sure that there are no GUP pins > + * on the page that would rely on it, so for GUP pins > + * this is fine. > + */ > if (list_empty(&mm->mmlist)) { > spin_lock(&mmlist_lock); > if (list_empty(&mm->mmlist)) > @@ -1723,7 +1752,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > }; > pte_t pteval; > struct page *subpage; > - bool ret = true; > + bool anon_exclusive, ret = true; > struct mmu_notifier_range range; > enum ttu_flags flags = (enum ttu_flags)(long)arg; > > @@ -1782,6 +1811,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > > subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); > address = pvmw.address; > + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); > > if (PageHuge(page) && !PageAnon(page)) { > /* > @@ -1833,6 +1863,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > swp_entry_t entry; > pte_t swp_pte; > > + if (anon_exclusive) > + BUG_ON(page_try_share_anon_rmap(subpage)); > + > /* > * Store the pfn of the page in a special migration > * pte. 
do_swap_page() will wait until the migration > @@ -1841,6 +1874,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > entry = pte_to_swp_entry(pteval); > if (is_writable_device_private_entry(entry)) > entry = make_writable_migration_entry(pfn); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry(pfn); > else > entry = make_readable_migration_entry(pfn); > swp_pte = swp_entry_to_pte(entry); > @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > page_vma_mapped_walk_done(&pvmw); > break; > } > + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && > + !anon_exclusive, page); > + if (anon_exclusive && > + page_try_share_anon_rmap(subpage)) { > + set_pte_at(mm, address, pvmw.pte, pteval); > + ret = false; > + page_vma_mapped_walk_done(&pvmw); > + break; > + } > > /* > * Store the pfn of the page in a special migration > @@ -1912,6 +1956,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > if (pte_write(pteval)) > entry = make_writable_migration_entry( > page_to_pfn(subpage)); > + else if (anon_exclusive) > + entry = make_readable_exclusive_migration_entry( > + page_to_pfn(subpage)); > else > entry = make_readable_migration_entry( > page_to_pfn(subpage)); > @@ -2425,6 +2472,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, > BUG_ON(!anon_vma); > /* address might be in next vma when migration races vma_adjust */ > first = atomic_inc_and_test(compound_mapcount_ptr(page)); > + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); > + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); > if (first) > __page_set_anon_rmap(page, vma, address, > flags & RMAP_EXCLUSIVE); > -- > 2.35.1 >
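For readers skimming the thread, the invariant the patch establishes can be summarized as: PG_anon_exclusive may only be cleared (page_try_share_anon_rmap()) while the page has no GUP pins, and a reliable FOLL_PIN may only be taken while the flag is still set. Below is a minimal user-space sketch of that handshake; the toy_* names are invented for illustration and are not kernel APIs, and the real code additionally relies on the PT lock and a clear/invalidate+flush of the page table entry to fence off GUP-fast.

```c
/*
 * Minimal user-space model of the handshake -- NOT kernel code. The toy_*
 * names loosely mirror PG_anon_exclusive, page_try_share_anon_rmap() and a
 * FOLL_PIN attempt.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_page {
	bool anon_exclusive;	/* models PG_anon_exclusive */
	int pins;		/* models outstanding GUP pins */
};

/* Models page_try_share_anon_rmap(): only succeeds if no pins exist. */
static int toy_try_share(struct toy_page *p)
{
	if (p->pins > 0)
		return -1;	/* like -EBUSY: keep the page exclusive and mapped */
	p->anon_exclusive = false;
	return 0;
}

/* Models a reliable pin: only allowed while the page is still exclusive. */
static int toy_try_pin(struct toy_page *p)
{
	if (!p->anon_exclusive)
		return -1;	/* would have to unshare/copy first */
	p->pins++;
	return 0;
}

int main(void)
{
	struct toy_page page = { .anon_exclusive = true, .pins = 0 };

	printf("pin exclusive page: %d\n", toy_try_pin(&page));   /* 0 */
	printf("share while pinned: %d\n", toy_try_share(&page)); /* -1 */
	page.pins--;                                               /* unpin */
	printf("share after unpin:  %d\n", toy_try_share(&page)); /* 0 */
	printf("pin shared page:    %d\n", toy_try_pin(&page));   /* -1 */
	return 0;
}
```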
On 16.03.22 22:23, Yang Shi wrote:
> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> c) In case we have a present PMD entry and want to freeze, handle it similar to try_to_migrate(): try marking the page shared first. In case we fail, we ignore the "freeze" instruction and simply split ordinarily. try_to_migrate() will properly fail because the THP is still mapped via PTEs.

Hi,

thanks for the review!

> How come will try_to_migrate() fail? The afterward pvmw will find those PTEs then convert them to migration entries anyway IIUC.
It will run into that code:

>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>>                         page_vma_mapped_walk_done(&pvmw);
>>                         break;
>>                 }
>> +               VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) &&
>> +                              !anon_exclusive, page);
>> +               if (anon_exclusive &&
>> +                   page_try_share_anon_rmap(subpage)) {
>> +                       set_pte_at(mm, address, pvmw.pte, pteval);
>> +                       ret = false;
>> +                       page_vma_mapped_walk_done(&pvmw);
>> +                       break;
>> +               }

and similarly fail the page_try_share_anon_rmap(), at which point try_to_migrate() stops and the caller will still observe a "page_mapped() == true".
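To make that bail-out concrete, here is a compressed user-space sketch (invented names, not the kernel implementation): the rmap walk stops at the first mapping it cannot mark shared, so at least one mapping survives and the caller still observes the page as mapped.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_MAPPINGS 4

static bool page_pinned = true;	/* some GUP pin exists on the page */
static int mapcount = NR_MAPPINGS;

/* Models page_try_share_anon_rmap(): fails while the page may be pinned. */
static int toy_try_share_anon_rmap(void)
{
	return page_pinned ? -1 : 0;
}

/* Models the walk over the page's mappings in try_to_migrate_one(). */
static void toy_try_to_migrate(void)
{
	for (int i = 0; i < NR_MAPPINGS; i++) {
		if (toy_try_share_anon_rmap()) {
			/* restore the PTE and stop the walk early */
			return;
		}
		mapcount--;	/* mapping replaced by a migration entry */
	}
}

int main(void)
{
	toy_try_to_migrate();
	/* the caller re-checks and treats "still mapped" as a failed unmap */
	printf("page_mapped(): %s\n", mapcount ? "true" : "false");
	return 0;
}
```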
On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> > How come will try_to_migrate() fail? The afterward pvmw will find those PTEs then convert them to migration entries anyway IIUC.
>
> It will run into that code:
>
> >> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
> >> [...]
> >> +               if (anon_exclusive &&
> >> +                   page_try_share_anon_rmap(subpage)) {
> >> +                       set_pte_at(mm, address, pvmw.pte, pteval);
> >> +                       ret = false;
> >> +                       page_vma_mapped_walk_done(&pvmw);
> >> +                       break;
> >> +               }
>
> and similarly fail the page_try_share_anon_rmap(), at which point try_to_migrate() stops and the caller will still observe a "page_mapped() == true".

Thanks, I missed that. Yes, the page will still be mapped. This should trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make this happen more often, we may consider removing that warning even though it is "once" since seeing a mapped page may become a normal case (once DIO is switched to FOLL_PIN, it may be more often). Anyway we don't have to remove it right now.
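On the FOLL_GET vs. FOLL_PIN point: a small illustrative sketch of why switching DIO to FOLL_PIN would make this path trigger more often. Pins are accounted with a large refcount bias, so a "maybe pinned" check can distinguish them from ordinary references, while a plain FOLL_GET reference stays invisible to it. The names and the bias constant below are stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for the kernel's pin accounting bias. */
#define TOY_PIN_BIAS 1024

/* Models a "page may be DMA-pinned" check based on the biased refcount. */
static bool toy_maybe_pinned(int refcount)
{
	return refcount >= TOY_PIN_BIAS;
}

int main(void)
{
	int base = 1;	/* e.g. the mapping's own reference */

	/* FOLL_GET adds a plain reference: unsharing still succeeds. */
	printf("FOLL_GET blocks unsharing: %d\n", toy_maybe_pinned(base + 1));

	/* FOLL_PIN adds a biased reference: unsharing (and unmap) fails. */
	printf("FOLL_PIN blocks unsharing: %d\n", toy_maybe_pinned(base + TOY_PIN_BIAS));
	return 0;
}
```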
On 18.03.22 21:29, Yang Shi wrote: > On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 16.03.22 22:23, Yang Shi wrote: >>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> [... quoted patch description snipped ...] >>>> >>>> c) In case we have a present PMD entry and want to freeze, handle it >>>> similar to try_to_migrate(): try marking the page shared first.
In case >>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. >>>> try_to_migrate() will properly fail because the THP is still mapped via >>>> PTEs. >> >> Hi, >> >> thanks for the review! >> >>> >>> How come will try_to_migrate() fail? The afterward pvmw will find >>> those PTEs then convert them to migration entries anyway IIUC. >>> >> >> It will run into that code: >> >>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, >>>> page_vma_mapped_walk_done(&pvmw); >>>> break; >>>> } >>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && >>>> + !anon_exclusive, page); >>>> + if (anon_exclusive && >>>> + page_try_share_anon_rmap(subpage)) { >>>> + set_pte_at(mm, address, pvmw.pte, pteval); >>>> + ret = false; >>>> + page_vma_mapped_walk_done(&pvmw); >>>> + break; >>>> + } >> >> and similarly fail the page_try_share_anon_rmap(), at which point >> try_to_migrate() stops and the caller will still observe a >> "page_mapped() == true". > > Thanks, I missed that. Yes, the page will still be mapped. This should > trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make > this happen more often, we may consider removing that warning even > though it is "once" since seeing a mapped page may become a normal > case (once DIO is switched to FOLL_PIN, it may be more often). Anyway > we don't have to remove it right now. Oh, very good catch! I wasn't able to trigger that warning in my testing so far. Interestingly, arch_unmap_one() could theoretically make this fail already and trigger the warning. Apart from that warning, split_huge_page_to_list() should work as expected: freezing the refcount will fail if still mapped and we'll remap. I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks!
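Condensed, the caller-side handling referred to above works out roughly as in the sketch below. This is illustrative only and not the actual split_huge_page_to_list() code: the function name and the explicit extra_pins parameter are simplifications, but the control flow -- unmap, try to freeze the refcount, remap on failure -- matches the behaviour described.

/*
 * Illustrative sketch, not the real kernel code: if unmap_page() leaves
 * PTEs behind because page_try_share_anon_rmap() refused to clear
 * PG_anon_exclusive, the refcount freeze fails and we simply remap.
 */
static int split_mapped_thp_sketch(struct page *head, int extra_pins)
{
	unmap_page(head);		/* may fail to unmap all PTEs */

	if (!page_ref_freeze(head, 1 + extra_pins)) {
		/* still mapped or otherwise referenced: undo and give up */
		remap_page(head, thp_nr_pages(head));
		return -EBUSY;
	}

	/* ... the actual split of the compound page would happen here ... */
	return 0;
}

In other words, a failed unmap is handled like any other extra reference on the page, which is why the warning can go away.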
On 19.03.22 11:21, David Hildenbrand wrote: > On 18.03.22 21:29, Yang Shi wrote: >> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 16.03.22 22:23, Yang Shi wrote: >>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> [... quoted patch description snipped ...]
>>>>> >>>>> c) In case we have a present PMD entry and want to freeze, handle it >>>>> similar to try_to_migrate(): try marking the page shared first. In case >>>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. >>>>> try_to_migrate() will properly fail because the THP is still mapped via >>>>> PTEs. >>> >>> Hi, >>> >>> thanks for the review! >>> >>>> >>>> How come will try_to_migrate() fail? The afterward pvmw will find >>>> those PTEs then convert them to migration entries anyway IIUC. >>>> >>> >>> It will run into that code: >>> >>>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, >>>>> page_vma_mapped_walk_done(&pvmw); >>>>> break; >>>>> } >>>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && >>>>> + !anon_exclusive, page); >>>>> + if (anon_exclusive && >>>>> + page_try_share_anon_rmap(subpage)) { >>>>> + set_pte_at(mm, address, pvmw.pte, pteval); >>>>> + ret = false; >>>>> + page_vma_mapped_walk_done(&pvmw); >>>>> + break; >>>>> + } >>> >>> and similarly fail the page_try_share_anon_rmap(), at which point >>> try_to_migrate() stops and the caller will still observe a >>> "page_mapped() == true". >> >> Thanks, I missed that. Yes, the page will still be mapped. This should >> trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make >> this happen more often, we may consider removing that warning even >> though it is "once" since seeing a mapped page may become a normal >> case (once DIO is switched to FOLL_PIN, it may be more often). Anyway >> we don't have to remove it right now. > > Oh, very good catch! I wasn't able to trigger that warning in my testing > so far. Interestingly, arch_unmap_one() could theoretically make this > fail already and trigger the warning. > > Apart from that warning, split_huge_page_to_list() should work as > expected: freezing the refcount will fail if still mapped and we'll remap. > > I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! > From e6e983d841cd2aa2a9c8dc71779211881cf0d96f Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Sat, 19 Mar 2022 11:49:39 +0100 Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() We can already theoretically fail to unmap (still having page_mapped()) in case arch_unmap_one() fails. Although this applies only to anonymous pages for now, get rid of the VM_WARN_ON_ONCE_PAGE() completely: the caller -- split_huge_page_to_list() -- will fail to freeze the refcount and remap the page via remap_page(). So the caller can handle unmap errors just fine. This is a preparation for making try_to_migrate() fail on anonymous pages with GUP pins. Reported-by: Yang Shi <shy828301@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/huge_memory.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6fb409b9e4..0fe0ab3ec3fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2263,8 +2263,6 @@ static void unmap_page(struct page *page) try_to_migrate(page, ttu_flags); else try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); - - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } static void remap_page(struct page *page, unsigned int nr)
On Sat, Mar 19, 2022 at 3:21 AM David Hildenbrand <david@redhat.com> wrote: > > On 18.03.22 21:29, Yang Shi wrote: > >> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 16.03.22 22:23, Yang Shi wrote: > >>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> [... quoted patch description snipped ...]
> >>>> > >>>> b) In case we have a present PMD entry and we don't want to freeze > >>>> ("convert to migration entries"), simply forward PG_anon_exclusive to > >>>> all sub-pages, no need to temporarily clear the bit. > >>>> > >>>> c) In case we have a present PMD entry and want to freeze, handle it > >>>> similar to try_to_migrate(): try marking the page shared first. In case > >>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. > >>>> try_to_migrate() will properly fail because the THP is still mapped via > >>>> PTEs. > >> > >> Hi, > >> > >> thanks for the review! > >> > >>> > >>> How come will try_to_migrate() fail? The afterward pvmw will find > >>> those PTEs then convert them to migration entries anyway IIUC. > >>> > >> > >> It will run into that code: > >> > >>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > >>>> page_vma_mapped_walk_done(&pvmw); > >>>> break; > >>>> } > >>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && > >>>> + !anon_exclusive, page); > >>>> + if (anon_exclusive && > >>>> + page_try_share_anon_rmap(subpage)) { > >>>> + set_pte_at(mm, address, pvmw.pte, pteval); > >>>> + ret = false; > >>>> + page_vma_mapped_walk_done(&pvmw); > >>>> + break; > >>>> + } > >> > >> and similarly fail the page_try_share_anon_rmap(), at which point > >> try_to_migrate() stops and the caller will still observe a > >> "page_mapped() == true". > > > > Thanks, I missed that. Yes, the page will still be mapped. This should > > trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make > > this happen more often, we may consider removing that warning even > > though it is "once" since seeing a mapped page may become a normal > > case (once DIO is switched to FOLL_PIN, it may be more often). Anyway > > we don't have to remove it right now. > > Oh, very good catch! I wasn't able to trigger that warning in my testing > so far. Interestingly, arch_unmap_one() could theoretically make this > fail already and trigger the warning. It should be very rare to trigger that warning. I just saw one report before. And arch_unmap_one() actually can't fail except for Sparc. So that warning was intended to catch some unusual issues, but this COW patchset may make it much more usual than before. > > Apart from that warning, split_huge_page_to_list() should work as > expected: freezing the refcount will fail if still mapped and we'll remap. Yes, that was why it was changed to a WARN from a BUG, please see commit 504e070dc08f ("mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split") > > I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! Thanks, I'm fine to remove it. > > -- > Thanks, > > David / dhildenb >
On Sat, Mar 19, 2022 at 3:50 AM David Hildenbrand <david@redhat.com> wrote: > > On 19.03.22 11:21, David Hildenbrand wrote: > > On 18.03.22 21:29, Yang Shi wrote: > >> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 16.03.22 22:23, Yang Shi wrote: > >>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> [... quoted patch description snipped ...]
> >>>>> > >>>>> a) In case we have a readable-exclusive PMD migration entry, simply insert > >>>>> readable-exclusive PTE migration entries. > >>>>> > >>>>> b) In case we have a present PMD entry and we don't want to freeze > >>>>> ("convert to migration entries"), simply forward PG_anon_exclusive to > >>>>> all sub-pages, no need to temporarily clear the bit. > >>>>> > >>>>> c) In case we have a present PMD entry and want to freeze, handle it > >>>>> similar to try_to_migrate(): try marking the page shared first. In case > >>>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. > >>>>> try_to_migrate() will properly fail because the THP is still mapped via > >>>>> PTEs. > >>> > >>> Hi, > >>> > >>> thanks for the review! > >>> > >>>> > >>>> How come will try_to_migrate() fail? The afterward pvmw will find > >>>> those PTEs then convert them to migration entries anyway IIUC. > >>>> > >>> > >>> It will run into that code: > >>> > >>>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, > >>>>> page_vma_mapped_walk_done(&pvmw); > >>>>> break; > >>>>> } > >>>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && > >>>>> + !anon_exclusive, page); > >>>>> + if (anon_exclusive && > >>>>> + page_try_share_anon_rmap(subpage)) { > >>>>> + set_pte_at(mm, address, pvmw.pte, pteval); > >>>>> + ret = false; > >>>>> + page_vma_mapped_walk_done(&pvmw); > >>>>> + break; > >>>>> + } > >>> > >>> and similarly fail the page_try_share_anon_rmap(), at which point > >>> try_to_migrate() stops and the caller will still observe a > >>> "page_mapped() == true". > >> > >> Thanks, I missed that. Yes, the page will still be mapped. This should > >> trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make > >> this happen more often, we may consider removing that warning even > >> though it is "once" since seeing a mapped page may become a normal > >> case (once DIO is switched to FOLL_PIN, it may be more often). Anyway > >> we don't have to remove it right now. > > > > Oh, very good catch! I wasn't able to trigger that warning in my testing > > so far. Interestingly, arch_unmap_one() could theoretically make this > > fail already and trigger the warning. > > > > Apart from that warning, split_huge_page_to_list() should work as > > expected: freezing the refcount will fail if still mapped and we'll remap. > > > > I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! > > > > From e6e983d841cd2aa2a9c8dc71779211881cf0d96f Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Sat, 19 Mar 2022 11:49:39 +0100 > Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from > unmap_page() > > We can already theoretically fail to unmap (still having page_mapped()) in > case arch_unmap_one() fails. Although this applies only to anonymous pages > for now, get rid of the VM_WARN_ON_ONCE_PAGE() completely: the caller -- > split_huge_page_to_list() -- will fail to freeze the refcount and > remap the page via remap_page(). So the caller can handle unmap errors > just fine. > > This is a preparation for making try_to_migrate() fail on anonymous pages > with GUP pins. Better to mention "making try_to_migrate() fail on pinned anonymous pages may make this happen more usual". > > Reported-by: Yang Shi <shy828301@gmail.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Fine to me. 
Reviewed-by: Yang Shi <shy828301@gmail.com> > --- > mm/huge_memory.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 0b6fb409b9e4..0fe0ab3ec3fc 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2263,8 +2263,6 @@ static void unmap_page(struct page *page) > try_to_migrate(page, ttu_flags); > else > try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); > - > - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); > } > > static void remap_page(struct page *page, unsigned int nr) > -- > 2.35.1 > > > -- > Thanks, > > David / dhildenb >
On 21.03.22 21:56, Yang Shi wrote: > On Sat, Mar 19, 2022 at 3:50 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.03.22 11:21, David Hildenbrand wrote: >>> On 18.03.22 21:29, Yang Shi wrote: >>>> On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 16.03.22 22:23, Yang Shi wrote: >>>>>> On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote: >>>>>>> >>>>>>> [... quoted patch description snipped ...] >>>>>>> >>>>>>> I.6.
THP handling >>>>>>> >>>>>>> __split_huge_pmd_locked() has to move the information about exclusivity >>>>>>> from the PMD to the PTEs. >>>>>>> >>>>>>> a) In case we have a readable-exclusive PMD migration entry, simply insert >>>>>>> readable-exclusive PTE migration entries. >>>>>>> >>>>>>> b) In case we have a present PMD entry and we don't want to freeze >>>>>>> ("convert to migration entries"), simply forward PG_anon_exclusive to >>>>>>> all sub-pages, no need to temporarily clear the bit. >>>>>>> >>>>>>> c) In case we have a present PMD entry and want to freeze, handle it >>>>>>> similar to try_to_migrate(): try marking the page shared first. In case >>>>>>> we fail, we ignore the "freeze" instruction and simply split ordinarily. >>>>>>> try_to_migrate() will properly fail because the THP is still mapped via >>>>>>> PTEs. >>>>> >>>>> Hi, >>>>> >>>>> thanks for the review! >>>>> >>>>>> >>>>>> How come will try_to_migrate() fail? The afterward pvmw will find >>>>>> those PTEs then convert them to migration entries anyway IIUC. >>>>>> >>>>> >>>>> It will run into that code: >>>>> >>>>>>> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, >>>>>>> page_vma_mapped_walk_done(&pvmw); >>>>>>> break; >>>>>>> } >>>>>>> + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && >>>>>>> + !anon_exclusive, page); >>>>>>> + if (anon_exclusive && >>>>>>> + page_try_share_anon_rmap(subpage)) { >>>>>>> + set_pte_at(mm, address, pvmw.pte, pteval); >>>>>>> + ret = false; >>>>>>> + page_vma_mapped_walk_done(&pvmw); >>>>>>> + break; >>>>>>> + } >>>>> >>>>> and similarly fail the page_try_share_anon_rmap(), at which point >>>>> try_to_migrate() stops and the caller will still observe a >>>>> "page_mapped() == true". >>>> >>>> Thanks, I missed that. Yes, the page will still be mapped. This should >>>> trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make >>>> this happen more often, we may consider removing that warning even >>>> though it is "once" since seeing a mapped page may become a normal >>>> case (once DIO is switched to FOLL_PIN, it may be more often). Anyway >>>> we don't have to remove it right now. >>> >>> Oh, very good catch! I wasn't able to trigger that warning in my testing >>> so far. Interestingly, arch_unmap_one() could theoretically make this >>> fail already and trigger the warning. >>> >>> Apart from that warning, split_huge_page_to_list() should work as >>> expected: freezing the refcount will fail if still mapped and we'll remap. >>> >>> I'll include a separate patch to just remove that VM_WARN_ON_ONCE -- thanks! >>> >> >> From e6e983d841cd2aa2a9c8dc71779211881cf0d96f Mon Sep 17 00:00:00 2001 >> From: David Hildenbrand <david@redhat.com> >> Date: Sat, 19 Mar 2022 11:49:39 +0100 >> Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from >> unmap_page() >> >> We can already theoretically fail to unmap (still having page_mapped()) in >> case arch_unmap_one() fails. Although this applies only to anonymous pages >> for now, get rid of the VM_WARN_ON_ONCE_PAGE() completely: the caller -- >> split_huge_page_to_list() -- will fail to freeze the refcount and >> remap the page via remap_page(). So the caller can handle unmap errors >> just fine. >> >> This is a preparation for making try_to_migrate() fail on anonymous pages >> with GUP pins. > > Better to mention "making try_to_migrate() fail on pinned anonymous > pages may make this happen more usual". 
> >> >> Reported-by: Yang Shi <shy828301@gmail.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > Fine to me. Reviewed-by: Yang Shi <shy828301@gmail.com> Thanks, updated patch: From 4f68aa1067e43caa370a62b930094a4d07492f92 Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Sat, 19 Mar 2022 11:49:39 +0100 Subject: [PATCH] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() We can already theoretically fail to unmap (still having page_mapped()) in case arch_unmap_one() fails, which can happen on sparc. Failures to unmap are handled gracefully, just as if there are other references on the target page: freezing the refcount in split_huge_page_to_list() will fail if still mapped and we'll simply remap. In commit 504e070dc08f ("mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE, let's get rid of it completely now. This is a preparation for making try_to_migrate() fail on anonymous pages with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more frequently. Reported-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/huge_memory.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6fb409b9e4..0fe0ab3ec3fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2263,8 +2263,6 @@ static void unmap_page(struct page *page) try_to_migrate(page, ttu_flags); else try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); - - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } static void remap_page(struct page *page, unsigned int nr)
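Condensed, the exclusivity round-trip that the diff below adds for migration works out roughly as in this sketch. It is illustrative only -- the *_sketch helper names are made up -- but it uses the entry types and the RMAP_EXCLUSIVE flag the same way the patch does: writable migration entries imply an exclusive anonymous page, the new readable-exclusive entry type carries that information for write-protected entries, and it is turned back into RMAP_EXCLUSIVE when the migration entry is removed.

/*
 * Illustrative sketch, not the exact kernel code: how exclusivity is
 * encoded in the migration entry type on unmap and recovered as an
 * rmap flag when the migration entry is removed.
 */
static swp_entry_t migration_entry_for_sketch(struct page *subpage,
					      bool writable, bool anon_exclusive)
{
	if (writable)
		return make_writable_migration_entry(page_to_pfn(subpage));
	if (anon_exclusive)
		return make_readable_exclusive_migration_entry(page_to_pfn(subpage));
	return make_readable_migration_entry(page_to_pfn(subpage));
}

static int rmap_flags_for_sketch(swp_entry_t entry)
{
	/* writable and readable-exclusive entries imply an exclusive anon page */
	return is_readable_migration_entry(entry) ? 0 : RMAP_EXCLUSIVE;
}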
diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 51953bace0a3..1aef834e1d60 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -224,6 +224,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, { VM_BUG_ON_PAGE(!PageAnon(page), page); + /* + * No need to check+clear for already shared pages, including KSM + * pages. + */ + if (!PageAnonExclusive(page)) + goto dup; + /* * If this page may have been pinned by the parent process, * don't allow to duplicate the mapping but instead require to e.g., @@ -235,14 +242,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, unlikely(page_needs_cow_for_dma(vma, page)))) return -EBUSY; + ClearPageAnonExclusive(page); /* * It's okay to share the anon page between both processes, mapping * the page R/O into both processes. */ +dup: __page_dup_rmap(page, compound); return 0; } +/** + * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly + * shared to prepare for KSM or temporary unmapping + * @page: the exclusive anonymous page to try marking possibly shared + * + * The caller needs to hold the PT lock and has to have the page table entry + * cleared/invalidated+flushed, to properly sync against GUP-fast. + * + * This is similar to page_try_dup_anon_rmap(), however, not used during fork() + * to duplicate a mapping, but instead to prepare for KSM or temporarily + * unmapping a page (swap, migration) via page_remove_rmap(). + * + * Marking the page shared can only fail if the page may be pinned; device + * private pages cannot get pinned and consequently this function cannot fail. + * + * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY + * otherwise. + */ +static inline int page_try_share_anon_rmap(struct page *page) +{ + VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page); + + /* See page_try_dup_anon_rmap(). */ + if (likely(!is_device_private_page(page) && + unlikely(page_maybe_dma_pinned(page)))) + return -EBUSY; + + ClearPageAnonExclusive(page); + return 0; +} + /* * Called from mm/vmscan.c to handle paging out */ diff --git a/include/linux/swap.h b/include/linux/swap.h index b546e4bd5c5a..422765d1141c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -78,12 +78,19 @@ static inline int current_is_kswapd(void) #endif /* - * NUMA node memory migration support + * Page migration support. + * + * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and + * indicates that the referenced (part of) an anonymous page is exclusive to + * a single process. For SWP_MIGRATION_WRITE, that information is implicit: + * (part of) an anonymous page that are mapped writable are exclusive to a + * single process. 
*/ #ifdef CONFIG_MIGRATION -#define SWP_MIGRATION_NUM 2 -#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) +#define SWP_MIGRATION_NUM 3 +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) +#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2) #else #define SWP_MIGRATION_NUM 0 #endif diff --git a/include/linux/swapops.h b/include/linux/swapops.h index d356ab4047f7..06280fc1c99b 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) static inline int is_migration_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_MIGRATION_READ || + swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE || swp_type(entry) == SWP_MIGRATION_WRITE); } @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry) return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); } +static inline int is_readable_migration_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_MIGRATION_READ); +} + +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE); +} + static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) { return swp_entry(SWP_MIGRATION_READ, offset); } +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) +{ + return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset); +} + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) { return swp_entry(SWP_MIGRATION_WRITE, offset); @@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) return swp_entry(0, 0); } +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) { return swp_entry(0, 0); @@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry) { return 0; } +static inline int is_readable_migration_entry(swp_entry_t entry) +{ + return 0; +} #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0b6fb409b9e4..4872e7120ee1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, swp_entry_t entry = pmd_to_swp_entry(pmd); VM_BUG_ON(!is_pmd_migration_entry(pmd)); - if (is_writable_migration_entry(entry)) { + if (!is_readable_migration_entry(entry)) { entry = make_readable_migration_entry( swp_offset(entry)); pmd = swp_entry_to_pmd(entry); @@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page = pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); + /* Early check when only holding the PT lock. */ + if (PageAnonExclusive(page)) + goto reuse; + if (!trylock_page(page)) { get_page(page); spin_unlock(vmf->ptl); @@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) put_page(page); } + /* Recheck after temporarily dropping the PT lock. */ + if (PageAnonExclusive(page)) { + unlock_page(page); + goto reuse; + } + /* * See do_wp_page(): we can only map the page writable if there are * no additional references. 
Note that we always drain the LRU @@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) pmd_t entry; page_move_anon_rmap(page, vma); + unlock_page(page); +reuse: entry = pmd_mkyoung(orig_pmd); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - unlock_page(page); spin_unlock(vmf->ptl); return VM_FAULT_WRITE; } @@ -1741,6 +1752,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (is_swap_pmd(*pmd)) { swp_entry_t entry = pmd_to_swp_entry(*pmd); + struct page *page = pfn_swap_entry_to_page(entry); VM_BUG_ON(!is_pmd_migration_entry(*pmd)); if (is_writable_migration_entry(entry)) { @@ -1749,8 +1761,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, * A protection check is difficult so * just be safe and disable write */ - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); newpmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*pmd)) newpmd = pmd_swp_mksoft_dirty(newpmd); @@ -1959,6 +1973,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pgtable_t pgtable; pmd_t old_pmd, _pmd; bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; + bool anon_exclusive = false; unsigned long addr; int i; @@ -2040,6 +2055,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pmd_to_swp_entry(old_pmd); page = pfn_swap_entry_to_page(entry); write = is_writable_migration_entry(entry); + if (PageAnon(page)) + anon_exclusive = is_readable_exclusive_migration_entry(entry); young = false; soft_dirty = pmd_swp_soft_dirty(old_pmd); uffd_wp = pmd_swp_uffd_wp(old_pmd); @@ -2051,6 +2068,23 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, young = pmd_young(old_pmd); soft_dirty = pmd_soft_dirty(old_pmd); uffd_wp = pmd_uffd_wp(old_pmd); + + /* + * Without "freeze", we'll simply split the PMD, propagating the + * PageAnonExclusive() flag for each PTE by setting it for + * each subpage -- no need to (temporarily) clear. + * + * With "freeze" we want to replace mapped pages by + * migration entries right away. This is only possible if we + * managed to clear PageAnonExclusive() -- see + * set_pmd_migration_entry(). + * + * In case we cannot clear PageAnonExclusive(), split the PMD + * only and let try_to_migrate_one() fail later. 
+ */ + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) + freeze = false; } VM_BUG_ON_PAGE(!page_count(page), page); page_ref_add(page, HPAGE_PMD_NR - 1); @@ -2074,6 +2108,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (write) swp_entry = make_writable_migration_entry( page_to_pfn(page + i)); + else if (anon_exclusive) + swp_entry = make_readable_exclusive_migration_entry( + page_to_pfn(page + i)); else swp_entry = make_readable_migration_entry( page_to_pfn(page + i)); @@ -2085,6 +2122,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } else { entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); entry = maybe_mkwrite(entry, vma); + if (anon_exclusive) + SetPageAnonExclusive(page + i); if (!write) entry = pte_wrprotect(entry); if (!young) @@ -2315,6 +2354,13 @@ static void __split_huge_page_tail(struct page *head, int tail, * * After successful get_page_unless_zero() might follow flags change, * for example lock_page() which set PG_waiters. + * + * Note that for mapped sub-pages of an anonymous THP, + * PG_anon_exclusive has been cleared in unmap_page() and is stored in + * the migration entry instead from where remap_page() will restore it. + * We can still have PG_anon_exclusive set on effectively unmapped and + * unreferenced sub-pages of an anonymous THP: we can simply drop + * PG_anon_exclusive (-> PG_mappedtodisk) for these here. */ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; page_tail->flags |= (head->flags & @@ -3070,6 +3116,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, struct vm_area_struct *vma = pvmw->vma; struct mm_struct *mm = vma->vm_mm; unsigned long address = pvmw->address; + bool anon_exclusive; pmd_t pmdval; swp_entry_t entry; pmd_t pmdswp; @@ -3079,10 +3126,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, flush_cache_range(vma, address, address + HPAGE_PMD_SIZE); pmdval = pmdp_invalidate(vma, address, pvmw->pmd); + + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (anon_exclusive && page_try_share_anon_rmap(page)) { + set_pmd_at(mm, address, pvmw->pmd, pmdval); + return; + } + if (pmd_dirty(pmdval)) set_page_dirty(page); if (pmd_write(pmdval)) entry = make_writable_migration_entry(page_to_pfn(page)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry(page_to_pfn(page)); else entry = make_readable_migration_entry(page_to_pfn(page)); pmdswp = swp_entry_to_pmd(entry); @@ -3116,10 +3172,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); - if (PageAnon(new)) - page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND); - else + if (PageAnon(new)) { + int rmap_flags = RMAP_COMPOUND; + + if (!is_readable_migration_entry(entry)) + rmap_flags |= RMAP_EXCLUSIVE; + + page_add_anon_rmap(new, vma, mmun_start, rmap_flags); + } else { page_add_file_rmap(new, true); + } + VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new)); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) mlock_vma_page(new); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1ff0b9e1e28e..c0e19ea5ebb5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4765,7 +4765,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, is_hugetlb_entry_hwpoisoned(entry))) { 
swp_entry_t swp_entry = pte_to_swp_entry(entry); - if (is_writable_migration_entry(swp_entry) && cow) { + if (!is_readable_migration_entry(swp_entry) && cow) { /* * COW mappings require pages in both * parent and child to be set to read. @@ -5165,6 +5165,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, set_huge_ptep_writable(vma, haddr, ptep); return 0; } + VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), + old_page); /* * If the process that created a MAP_PRIVATE mapping is about to @@ -6160,12 +6162,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, } if (unlikely(is_hugetlb_entry_migration(pte))) { swp_entry_t entry = pte_to_swp_entry(pte); + struct page *page = pfn_swap_entry_to_page(entry); - if (is_writable_migration_entry(entry)) { + if (!is_readable_migration_entry(entry)) { pte_t newpte; - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry( + swp_offset(entry)); newpte = swp_entry_to_pte(entry); set_huge_swap_pte_at(mm, address, ptep, newpte, huge_page_size(h)); diff --git a/mm/ksm.c b/mm/ksm.c index 9ff28097bc0a..350b46432d65 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -866,6 +866,7 @@ static inline struct stable_node *page_stable_node(struct page *page) static inline void set_page_stable_node(struct page *page, struct stable_node *stable_node) { + VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page); page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); } @@ -1041,6 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, int swapped; int err = -EFAULT; struct mmu_notifier_range range; + bool anon_exclusive; pvmw.address = page_address_in_vma(page, vma); if (pvmw.address == -EFAULT) @@ -1058,9 +1060,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) goto out_unlock; + anon_exclusive = PageAnonExclusive(page); if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) || - mm_tlb_flush_pending(mm)) { + anon_exclusive || mm_tlb_flush_pending(mm)) { pte_t entry; swapped = PageSwapCache(page); @@ -1088,6 +1091,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } + + if (anon_exclusive && page_try_share_anon_rmap(page)) { + set_pte_at(mm, pvmw.address, pvmw.pte, entry); + goto out_unlock; + } + if (pte_dirty(entry)) set_page_dirty(page); @@ -1146,6 +1155,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, pte_unmap_unlock(ptep, ptl); goto out_mn; } + VM_BUG_ON_PAGE(PageAnonExclusive(page), page); + VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage); /* * No need to check ksm_use_zero_pages here: we can only have a diff --git a/mm/memory.c b/mm/memory.c index d01fab481134..222aaf277af4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -720,6 +720,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, else if (is_writable_device_exclusive_entry(entry)) pte = maybe_mkwrite(pte_mkdirty(pte), vma); + VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page))); + /* * No need to take a page reference as one was already * created when the swap entry was made. 
@@ -799,11 +801,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, rss[mm_counter(page)]++; - if (is_writable_migration_entry(entry) && + if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) { /* - * COW mappings require pages in both - * parent and child to be set to read. + * COW mappings require pages in both parent and child + * to be set to read. A previously exclusive entry is + * now shared. */ entry = make_readable_migration_entry( swp_offset(entry)); @@ -954,6 +957,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } + VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page)); /* * If it's a shared mapping, mark it clean in @@ -2943,6 +2947,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct page *page = vmf->page; pte_t entry; + + VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page)); + /* * Clear the pages cpupid information as the existing * information potentially belongs to a now completely @@ -3277,6 +3284,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) if (PageAnon(vmf->page)) { struct page *page = vmf->page; + /* + * If the page is exclusive to this process we must reuse the + * page without further checks. + */ + if (PageAnonExclusive(page)) + goto reuse; + /* * We have to verify under page lock: these early checks are * just an optimization to avoid locking the page and freeing @@ -3309,6 +3323,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) */ page_move_anon_rmap(page, vma); unlock_page(page); +reuse: wp_page_reuse(vmf); return VM_FAULT_WRITE; } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == @@ -3700,11 +3715,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * that are certainly not shared because we just allocated them without * exposing them to the swapcache. 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && - (page != swapcache || page_count(page) == 1)) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); - vmf->flags &= ~FAULT_FLAG_WRITE; - ret |= VM_FAULT_WRITE; + if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) { + if (vmf->flags & FAULT_FLAG_WRITE) { + pte = maybe_mkwrite(pte_mkdirty(pte), vma); + vmf->flags &= ~FAULT_FLAG_WRITE; + ret |= VM_FAULT_WRITE; + } rmap_flags |= RMAP_EXCLUSIVE; } flush_icache_page(vma, page); @@ -3724,6 +3740,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) page_add_anon_rmap(page, vma, vmf->address, rmap_flags); } + VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page))); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); diff --git a/mm/migrate.c b/mm/migrate.c index fd9eba33b34a..7f440d2103ce 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -188,6 +188,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(PageTail(page), page); while (page_vma_mapped_walk(&pvmw)) { + rmap_t rmap_flags = RMAP_NONE; + if (PageKsm(page)) new = page; else @@ -217,6 +219,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, else if (pte_swp_uffd_wp(*pvmw.pte)) pte = pte_mkuffd_wp(pte); + if (PageAnon(new) && !is_readable_migration_entry(entry)) + rmap_flags |= RMAP_EXCLUSIVE; + if (unlikely(is_device_private_page(new))) { if (pte_write(pte)) entry = make_writable_device_private_entry( @@ -239,7 +244,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, pte = arch_make_huge_pte(pte, shift, vma->vm_flags); if (PageAnon(new)) hugepage_add_anon_rmap(new, vma, pvmw.address, - RMAP_NONE); + rmap_flags); else page_dup_file_rmap(new, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); @@ -248,7 +253,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, { if (PageAnon(new)) page_add_anon_rmap(new, vma, pvmw.address, - RMAP_NONE); + rmap_flags); else page_add_file_rmap(new, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); @@ -379,6 +384,10 @@ int folio_migrate_mapping(struct address_space *mapping, /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; + /* + * Note: PG_anon_exclusive is always migrated via migration + * entries. + */ if (folio_test_swapbacked(folio)) __folio_set_swapbacked(newfolio); @@ -2302,15 +2311,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, * set up a special migration page table entry now. 
*/ if (trylock_page(page)) { + bool anon_exclusive; pte_t swp_pte; + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (anon_exclusive) { + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + + if (page_try_share_anon_rmap(page)) { + set_pte_at(mm, addr, ptep, pte); + unlock_page(page); + put_page(page); + mpfn = 0; + goto next; + } + } else { + ptep_get_and_clear(mm, addr, ptep); + } + migrate->cpages++; - ptep_get_and_clear(mm, addr, ptep); /* Setup special migration page table entry */ if (mpfn & MIGRATE_PFN_WRITE) entry = make_writable_migration_entry( page_to_pfn(page)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry( + page_to_pfn(page)); else entry = make_readable_migration_entry( page_to_pfn(page)); diff --git a/mm/mprotect.c b/mm/mprotect.c index 2887644fd150..9441675e993c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -141,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, pages++; } else if (is_swap_pte(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); + struct page *page = pfn_swap_entry_to_page(entry); pte_t newpte; if (is_writable_migration_entry(entry)) { @@ -148,8 +149,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, * A protection check is difficult so * just be safe and disable write */ - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); newpte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(oldpte)) newpte = pte_swp_mksoft_dirty(newpte); diff --git a/mm/rmap.c b/mm/rmap.c index ebe7140c4493..9d2a7e11e8cc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1048,6 +1048,7 @@ EXPORT_SYMBOL_GPL(folio_mkclean); void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; + struct page *subpage = page; page = compound_head(page); @@ -1061,6 +1062,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) * PageAnon()) will not see one without the other. */ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); + SetPageAnonExclusive(subpage); } /** @@ -1078,7 +1080,7 @@ static void __page_set_anon_rmap(struct page *page, BUG_ON(!anon_vma); if (PageAnon(page)) - return; + goto out; /* * If the page isn't exclusively mapped into this vma, @@ -1097,6 +1099,9 @@ static void __page_set_anon_rmap(struct page *page, anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); page->index = linear_page_index(vma, address); +out: + if (exclusive) + SetPageAnonExclusive(page); } /** @@ -1156,6 +1161,8 @@ void page_add_anon_rmap(struct page *page, } else { first = atomic_inc_and_test(&page->_mapcount); } + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) { int nr = compound ? 
thp_nr_pages(page) : 1; @@ -1422,7 +1429,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }; pte_t pteval; struct page *subpage; - bool ret = true; + bool anon_exclusive, ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; @@ -1485,6 +1492,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); if (PageHuge(page) && !PageAnon(page)) { /* @@ -1520,9 +1528,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } } - /* Nuke the page table entry. */ + /* + * Nuke the page table entry. When having to clear + * PageAnonExclusive(), we always have to flush. + */ flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); - if (should_defer_flush(mm, flags)) { + if (should_defer_flush(mm, flags) && !anon_exclusive) { /* * We clear the PTE but do not flush so potentially * a remote CPU could still be writing to the page. @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + if (anon_exclusive && + page_try_share_anon_rmap(subpage)) { + swap_free(entry); + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } + /* + * Note: We *don't* remember yet if the page was mapped + * exclusively in the swap entry, so swapin code has + * to re-determine that manually and might detect the + * page as possibly shared, for example, if there are + * other references on the page or if the page is under + * writeback. We made sure that there are no GUP pins + * on the page that would rely on it, so for GUP pins + * this is fine. + */ if (list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); if (list_empty(&mm->mmlist)) @@ -1723,7 +1752,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, }; pte_t pteval; struct page *subpage; - bool ret = true; + bool anon_exclusive, ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; @@ -1782,6 +1811,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; + anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage); if (PageHuge(page) && !PageAnon(page)) { /* @@ -1833,6 +1863,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, swp_entry_t entry; pte_t swp_pte; + if (anon_exclusive) + BUG_ON(page_try_share_anon_rmap(subpage)); + /* * Store the pfn of the page in a special migration * pte. 
do_swap_page() will wait until the migration @@ -1841,6 +1874,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, entry = pte_to_swp_entry(pteval); if (is_writable_device_private_entry(entry)) entry = make_writable_migration_entry(pfn); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry(pfn); else entry = make_readable_migration_entry(pfn); swp_pte = swp_entry_to_pte(entry); @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && + !anon_exclusive, page); + if (anon_exclusive && + page_try_share_anon_rmap(subpage)) { + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } /* * Store the pfn of the page in a special migration @@ -1912,6 +1956,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, if (pte_write(pteval)) entry = make_writable_migration_entry( page_to_pfn(subpage)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry( + page_to_pfn(subpage)); else entry = make_readable_migration_entry( page_to_pfn(subpage)); @@ -2425,6 +2472,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, BUG_ON(!anon_vma); /* address might be in next vma when migration races vma_adjust */ first = atomic_inc_and_test(compound_mapcount_ptr(page)); + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) __page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE);
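
Condensed from the try_to_migrate_one() hunks above, the core flow for a
PTE-mapped, exclusive anonymous page looks roughly like this. It is a
sketch only, using the variable names from the hunks above; the hugetlb,
device-private, soft-dirty/uffd-wp and remaining error-handling details
are omitted and no code beyond what the hunks show is implied:

/*
 * Rough sketch of the try_to_migrate_one() flow for a PTE-mapped
 * anonymous page. The PTE is cleared and flushed *before* trying to
 * clear PG_anon_exclusive, so GUP-fast cannot grab a new pin
 * concurrently (see II. below).
 */
anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage);

flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
pteval = ptep_clear_flush(vma, address, pvmw.pte);

if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
        /* The page may be pinned: restore the PTE and fail the unmap. */
        set_pte_at(mm, address, pvmw.pte, pteval);
        ret = false;
} else {
        /* Remember exclusivity in the type of the migration entry. */
        if (pte_write(pteval))
                entry = make_writable_migration_entry(page_to_pfn(subpage));
        else if (anon_exclusive)
                entry = make_readable_exclusive_migration_entry(page_to_pfn(subpage));
        else
                entry = make_readable_migration_entry(page_to_pfn(subpage));
        set_pte_at(mm, address, pvmw.pte, swp_entry_to_pte(entry));
}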

Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
exclusive, and use that information to make GUP pins reliable and stay
consistent with the page mapped into the page table even if the page
table entry gets write-protected.

With that information at hand, we can extend our COW logic to always
reuse anonymous pages that are exclusive. For anonymous pages that might
be shared, the existing logic applies.

As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry. Especially PTE vs. PMD-mapped
anonymous pages require more thought; some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page
tables exclusively in a single process -- multiple page table locks
apply. Further, due to MADV_WIPEONFORK we might not necessarily
write-protect all PTEs, and only some subpages might be pinned. Long
story short: once PTE-mapped, we have to track information about
exclusivity per sub-page, but until then, we can just track it for the
compound page in the head page and don't have to update a whole bunch of
subpages all the time for a simple PMD mapping of a THP.

For simplicity, this commit mostly talks about "anonymous pages", while
for THP it's actually "the part of an anonymous folio referenced via a
page table entry".

To not spill PG_anon_exclusive code all over the mm code-base, we let
the anon rmap code handle all PG_anon_exclusive logic it can easily
handle.

If a writable, present page table entry points at an anonymous
(sub)page, that (sub)page must be PG_anon_exclusive. If GUP wants to
take a reliable pin (FOLL_PIN) on an anonymous page referenced via a
present page table entry, it must only pin if PG_anon_exclusive is set
for the mapped (sub)page.

This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE; follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages
fully reliable.

Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared. Clearing will fail if there are GUP pins on the page:

* For fork(), this means having to copy the page and not being able to
  share it. fork() protects against concurrent GUP using the PT lock and
  the src_mm->write_protect_seq.

* For KSM, this means sharing will fail. For swap, this means unmapping
  will fail. For migration, this means migration will fail early. All
  three cases protect against concurrent GUP using the PT lock and a
  proper clear/invalidate+flush of the relevant page table entry.

This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the subsequent write fault ends up
replacing the page instead of reusing it. It improves the situation for
O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN if
fork() is *not* involved; however, swapout and fork() are still
problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP
users will fix the issue for them.

I. Details about basic handling

I.1. Fresh anonymous pages

page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
the mechanism by which fresh anonymous pages come to life (besides the
migration code, where we copy page->mapping), all fresh anonymous pages
will start out as exclusive.
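
The mm/rmap.c hunk above makes this concrete. Roughly -- a simplified
sketch based on that hunk, with the anon_vma selection for the
non-exclusive case omitted, not the complete function --
__page_set_anon_rmap() now ends up as:

/* Simplified sketch of __page_set_anon_rmap() after this patch. */
static void __page_set_anon_rmap(struct page *page,
                                 struct vm_area_struct *vma,
                                 unsigned long address, int exclusive)
{
        struct anon_vma *anon_vma = vma->anon_vma;

        BUG_ON(!anon_vma);

        /* Already PageAnon(): only the exclusive marking remains to do. */
        if (PageAnon(page))
                goto out;

        /* First mapping: set up page->mapping and page->index. */
        anon_vma = (void *)anon_vma + PAGE_MAPPING_ANON;
        WRITE_ONCE(page->mapping, (struct address_space *)anon_vma);
        page->index = linear_page_index(vma, address);
out:
        /* New: remember that this page is currently mapped exclusively. */
        if (exclusive)
                SetPageAnonExclusive(page);
}

Since page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() pass
exclusive=1, every freshly faulted-in anonymous page starts out with
PG_anon_exclusive set.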

I.2. COW reuse handling of anonymous pages

When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it. Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused. If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.

Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as
huge pages are a scarce resource.

I.3. Migration handling

try_to_migrate() has to try marking an exclusive anonymous page shared
via page_try_share_anon_rmap(). If it fails because there are GUP pins
on the page, unmapping fails. migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.

Writable migration entries implicitly point at shared anonymous pages.
For readable migration entries, that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.

When restoring a migration entry in remove_migration_pte(), information
about exclusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
information.

I.4. Swapout handling

try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on
the page, unmapping fails. For now, information about exclusivity is
lost. In the future, we might want to remember that information in the
swap entry in some cases; however, it requires more thought, care, and a
way to store that information in swap entries.
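
Both I.3 and I.4 hinge on the same helper. Its include/linux/rmap.h hunk
is not shown above, but the contract it has to provide can be sketched
roughly as follows. This is an illustration of the described semantics,
not the literal helper from this patch: it assumes the
ClearPageAnonExclusive() helper introduced together with the new page
flag and uses page_maybe_dma_pinned() as the pin check, while the real
implementation additionally has to care about device-private pages and
memory ordering against GUP-fast:

/*
 * Sketch: mark an exclusive anonymous page "possibly shared". This must
 * fail while the page may be pinned (FOLL_PIN), because such pins rely
 * on the page staying exclusive.
 */
static inline int page_try_share_anon_rmap(struct page *page)
{
        VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);

        /* A concurrent or existing GUP pin relies on exclusivity. */
        if (unlikely(page_maybe_dma_pinned(page)))
                return -EBUSY;

        ClearPageAnonExclusive(page);
        return 0;
}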

I.5. Swapin handling

do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that. do_swap_page() always
has to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

I.6. THP handling

__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.

a) In case we have a readable-exclusive PMD migration entry, simply
   insert readable-exclusive PTE migration entries.

b) In case we have a present PMD entry and we don't want to freeze
   ("convert to migration entries"), simply forward PG_anon_exclusive to
   all sub-pages; no need to temporarily clear the bit.

c) In case we have a present PMD entry and want to freeze, handle it
   similarly to try_to_migrate(): try marking the page shared first. In
   case we fail, we ignore the "freeze" instruction and simply split
   ordinarily. try_to_migrate() will properly fail because the THP is
   still mapped via PTEs.

When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.

I.7. fork() handling

fork() handling is relatively easy, because PG_anon_exclusive is only
expressive for some page table entry types.

a) Present anonymous pages

page_try_dup_anon_rmap() will mark the given subpage shared -- which
will fail if the page is pinned. If it failed, we have to copy the page
(or PTE-map a PMD to handle it on the PTE level).

Note that device exclusive entries are just a pointer at a PageAnon()
page. fork() will first convert a device exclusive entry to a present
page table entry and handle it just like present anonymous pages.

b) Device private entries

Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned. page_try_dup_anon_rmap()
will mark the given subpage shared, which cannot fail because they
cannot get pinned.

c) HW poison entries

PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.

d) Migration entries

Writable and readable-exclusive entries are converted to readable
entries: possibly shared.

I.8. mprotect() handling

mprotect() only has to properly handle the new readable-exclusive
migration entry: when write-protecting a migration entry that points at
an anonymous page, remember the information about exclusivity via the
"readable-exclusive" migration entry type.

II. Migration and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper
clear/invalidate+flush of the page table entry, to make the following
scenario impossible:

1. try_to_migrate() places a migration entry after checking for GUP pins
   and marks the page possibly shared.

2. GUP-fast pins the page due to lack of synchronization.

3. fork() converts the "writable/readable-exclusive" migration entry
   into a readable migration entry.

4. Migration fails due to the GUP pin (failing to freeze the refcount).

5. Migration entries are restored. PG_anon_exclusive is lost.

-> We have a pinned page that is not marked exclusive anymore.

Note that we move information about exclusivity from the page to the
migration entry as it otherwise highly overcomplicates fork() and
PTE-mapping a THP.

III. Swapout and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly shared
and synchronize against GUP-fast by a proper clear/invalidate+flush of
the page table entry, to make the following scenario impossible:

1. try_to_unmap() places a swap entry after checking for GUP pins and
   clears exclusivity information on the page.

2. GUP-fast pins the page due to lack of synchronization.

-> We have a pinned page that is not marked exclusive anymore.

If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II. would
apply. This is future work.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h    | 40 +++++++++++++++++++++
 include/linux/swap.h    | 15 +++++---
 include/linux/swapops.h | 25 +++++++++++++
 mm/huge_memory.c        | 77 +++++++++++++++++++++++++++++++++++++----
 mm/hugetlb.c            | 15 +++++---
 mm/ksm.c                | 13 ++++++-
 mm/memory.c             | 33 +++++++++++++-----
 mm/migrate.c            | 34 ++++++++++++++++--
 mm/mprotect.c           |  8 +++--
 mm/rmap.c               | 59 ++++++++++++++++++++++++++++---
 10 files changed, 285 insertions(+), 34 deletions(-)