[RFC,30/37] mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)

Message ID	20230823131350.114942-31-alexandru.elisei@arm.com (mailing list archive)
State	New, archived
Headers	Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
	From: Alexandru Elisei <alexandru.elisei@arm.com>
	To: catalin.marinas@arm.com, will@kernel.org, oliver.upton@linux.dev, maz@kernel.org, james.morse@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, arnd@arndb.de, akpm@linux-foundation.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, mhiramat@kernel.org, rppt@kernel.org, hughd@google.com
	Cc: pcc@google.com, steven.price@arm.com, anshuman.khandual@arm.com, vincenzo.frascino@arm.com, david@redhat.com, eugenis@google.com, kcc@google.com, hyesoo.yu@samsung.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, kvmarm@lists.linux.dev, linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
	Subject: [PATCH RFC 30/37] mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
	Date: Wed, 23 Aug 2023 14:13:43 +0100
	Message-Id: <20230823131350.114942-31-alexandru.elisei@arm.com>
	In-Reply-To: <20230823131350.114942-1-alexandru.elisei@arm.com>
	References: <20230823131350.114942-1-alexandru.elisei@arm.com>
	MIME-Version: 1.0
	Precedence: list
	Content-Type: text/plain; charset="us-ascii"
	Content-Transfer-Encoding: 7bit
	Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
	Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	[RFC,01/37] mm: page_alloc: Rename gfp_to_alloc_flags_cma -> gfp_to_alloc_flags_fast
	[RFC,01/37] mm: page_alloc: Rename gfp_to_alloc_flags_cma -> gfp_to_alloc_flags_fast
	[RFC,02/37] arm64: mte: Rework naming for tag manipulation functions
	[RFC,03/37] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
	[RFC,04/37] mm: Add MIGRATE_METADATA allocation policy
	[RFC,05/37] mm: Add memory statistics for the MIGRATE_METADATA allocation policy
	[RFC,06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
	[RFC,07/37] mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
	[RFC,08/37] mm: compaction: Account for free metadata pages in __compact_finished()
	[RFC,09/37] mm: compaction: Handle metadata pages as source for direct compaction
	[RFC,10/37] mm: compaction: Do not use MIGRATE_METADATA to replace pages with metadata
	[RFC,11/37] mm: migrate/mempolicy: Allocate metadata-enabled destination page
	[RFC,12/37] mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
	[RFC,13/37] arm64: mte: Reserve tag storage memory
	[RFC,14/37] arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist
	[RFC,15/37] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
	[RFC,16/37] arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled
	[RFC,17/37] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
	[RFC,18/37] arm64: mte: Check that tag storage blocks are in the same zone
	[RFC,19/37] mm: page_alloc: Manage metadata storage on page allocation
	[RFC,20/37] mm: compaction: Reserve metadata storage in compaction_alloc()
	[RFC,21/37] mm: khugepaged: Handle metadata-enabled VMAs
	[RFC,22/37] mm: shmem: Allocate metadata storage for in-memory filesystems
	[RFC,23/37] mm: Teach vma_alloc_folio() about metadata-enabled VMAs
	[RFC,24/37] mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
	[RFC,25/37] arm64: mte: Manage tag storage on page allocation
	[RFC,26/37] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
	[RFC,27/37] arm64: mte: Reserve tag block for the zero page
	[RFC,28/37] mm: sched: Introduce PF_MEMALLOC_ISOLATE
	[RFC,29/37] mm: arm64: Define the PAGE_METADATA_NONE page protection
	[RFC,30/37] mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
	[RFC,31/37] mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata storage
	[RFC,32/37] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
	[RFC,33/37] arm64: mte: swap/copypage: Handle tag restoring when missing tag storage
	[RFC,34/37] arm64: mte: Handle fatal signal in reserve_metadata_storage()
	[RFC,35/37] mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
	[RFC,36/37] KVM: arm64: Disable MTE is tag storage is enabled
	[RFC,37/37] arm64: mte: Enable tag storage management

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index ba316ffb9aef..27bde1d2609c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -531,6 +531,10 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 
 	mutex_lock(&tag_blocks_lock);
 
+	/* Can happen for concurrent accesses to a METADATA_NONE page. */
+	if (page_tag_storage_reserved(page))
+		goto out_unlock;
+
 	/* Make sure existing entries are not freed from out under out feet. */
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
 	for (block = start_block; block < end_block; block += region->block_size) {
@@ -568,6 +572,8 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
 
 	memalloc_isolate_restore(cflags);
+
+out_unlock:
 	mutex_unlock(&tag_blocks_lock);
 
 	return 0;
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..5a9af239e425 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
 	MR_CONTIG_RANGE,
 	MR_LONGTERM_PIN,
 	MR_DEMOTION,
+	MR_METADATA_NONE,
 	MR_TYPES
 };
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce87d55ecf87..6bd7d5810122 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2466,6 +2466,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
 #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
+/* Whether this protection change is to allocate metadata on next access */
+#define MM_CP_PROT_METADATA_NONE (1UL << 4)
 
 bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..6c4a6151c7b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/page-isolation.h>
 #include <linux/memremap.h>
 #include <linux/kmsan.h>
 #include <linux/ksm.h>
@@ -82,6 +83,7 @@
 #include <trace/events/kmem.h>
 
 #include <asm/io.h>
+#include <asm/memory_metadata.h>
 #include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -4681,6 +4683,151 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+/* Returns with the page reference dropped. */
+static void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
+{
+	struct migration_target_control mtc = {
+		.nid = NUMA_NO_NODE,
+		.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
+	};
+	LIST_HEAD(pagelist);
+	int ret, tries;
+
+	lru_cache_disable();
+
+	if (!isolate_lru_page(page)) {
+		put_page(page);
+		lru_cache_enable();
+		return;
+	}
+	/* Isolate just grabbed another reference, drop ours. */
+	put_page(page);
+
+	list_add_tail(&page->lru, &pagelist);
+
+	tries = 5;
+	while (tries--) {
+		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
+				    (unsigned long)&mtc, MIGRATE_SYNC, MR_METADATA_NONE, NULL);
+		if (ret == 0 || ret != -EBUSY)
+			break;
+	}
+
+	if (ret != 0) {
+		list_del(&page->lru);
+		putback_movable_pages(&pagelist);
+	}
+	lru_cache_enable();
+}
+
+static vm_fault_t do_metadata_none_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL;
+	bool do_migrate = false;
+	pte_t new_pte, old_pte;
+	bool writable = false;
+	vm_fault_t err;
+	int ret;
+
+	/*
+	 * The pte at this point cannot be used safely without validation
+	 * through pte_same().
+	 */
+	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+	/* Get the normal PTE */
+	old_pte = ptep_get(vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PTE could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pte_write(new_pte);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pte_writable(vma, vmf->address, new_pte))
+		writable = true;
+
+	page = vm_normal_page(vma, vmf->address, new_pte);
+	if (!page)
+		goto out_map;
+
+	/*
+	 * This should never happen, once a VMA has been marked as tagged, that
+	 * cannot be changed.
+	 */
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	/* Prevent the page from being unmapped from under us. */
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/*
+	 * Probably the page is being isolated for migration, replay the fault
+	 * to give time for the entry to be replaced by a migration pte.
+	 */
+	if (unlikely(is_migrate_isolate_page(page))) {
+		if (!(vmf->flags & FAULT_FLAG_TRIED))
+			err = VM_FAULT_RETRY;
+		else
+			err = 0;
+		put_page(page);
+		return 0;
+	} else if (is_migrate_metadata_page(page)) {
+		do_migrate = true;
+	} else {
+		ret = reserve_metadata_storage(page, 0, GFP_HIGHUSER_MOVABLE);
+		if (ret == -EINTR) {
+			put_page(page);
+			return VM_FAULT_RETRY;
+		} else if (ret) {
+			do_migrate = true;
+		}
+	}
+	if (do_migrate) {
+		migrate_metadata_none_page(page, vma);
+		/*
+		 * Either the page was migrated, in which case there's nothing
+		 * we need to do; either migration failed, in which case all we
+		 * can do is try again. So don't change the pte.
+		 */
+		return 0;
+	}
+
+	put_page(page);
+
+	spin_lock(vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	/*
+	 * Make it present again, depending on how arch implements
+	 * non-accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+	new_pte = pte_mkyoung(new_pte);
+	if (writable)
+		new_pte = pte_mkwrite(new_pte);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	return 0;
+}
+
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags)
 {
@@ -4941,8 +5088,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
 
-	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+		if (metadata_storage_enabled() && pte_metadata_none(vmf->orig_pte))
+			return do_metadata_none_page(vmf);
 		return do_numa_page(vmf);
+	}
 
 	spin_lock(vmf->ptl);
 	entry = vmf->orig_pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..2c022133aed3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,6 +33,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
+#include <asm/memory_metadata.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -89,6 +90,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 	long pages = 0;
 	int target_node = NUMA_NO_NODE;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool prot_metadata_none = cp_flags & MM_CP_PROT_METADATA_NONE;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
@@ -161,6 +163,40 @@ static long change_pte_range(struct mmu_gather *tlb,
 						jiffies_to_msecs(jiffies));
 			}
 
+			if (prot_metadata_none) {
+				struct page *page;
+
+				/*
+				 * Skip METADATA_NONE pages, but not NUMA pages,
+				 * just so we don't get two faults, one after
+				 * the other. The page fault handling code
+				 * might end up migrating the current page
+				 * anyway, so there really is no need to keep
+				 * the pte marked for NUMA balancing.
+				 */
+				if (pte_protnone(oldpte) && pte_metadata_none(oldpte))
+					continue;
+
+				page = vm_normal_page(vma, addr, oldpte);
+				if (!page || is_zone_device_page(page))
+					continue;
+
+				/* Page already mapped as tagged in a shared VMA. */
+				if (page_has_metadata(page))
+					continue;
+
+				/*
+				 * The LRU takes a page reference, which means
+				 * that page_count > 1 is true even if the page
+				 * is not COW. Reserving tag storage for a COW
+				 * page is ok, because one mapping of that page
+				 * won't be migrated; but not reserving tag
+				 * storage for a page is definitely wrong. So
+				 * don't skip pages that might be COW, like
+				 * NUMA does.
+				 */
+			}
+
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -531,6 +567,13 @@ long change_protection(struct mmu_gather *tlb,
 	WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
 #endif
 
+#ifdef CONFIG_MEMORY_METADATA
+	if (cp_flags & MM_CP_PROT_METADATA_NONE)
+		newprot = PAGE_METADATA_NONE;
+#else
+	WARN_ON_ONCE(cp_flags & MM_CP_PROT_METADATA_NONE);
+#endif
+
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot,
 						  cp_flags);
@@ -661,6 +704,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 	vma_set_page_prot(vma);
 
+	if (metadata_storage_enabled() && (newflags & VM_MTE) && !(oldflags & VM_MTE))
+		mm_cp_flags |= MM_CP_PROT_METADATA_NONE;
+
 	change_protection(tlb, vma, start, end, mm_cp_flags);
 
 	/*
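
The branch structure of do_metadata_none_page() in the mm/memory.c hunk can be easier to follow as a small user-space model. The sketch below is not kernel code and is not part of the patch: the enum, the struct and the function names are hypothetical, and the three inputs merely stand in for is_migrate_isolate_page(), is_migrate_metadata_page() and the return value of reserve_metadata_storage(). It collapses "return 0 without touching the PTE" and "return VM_FAULT_RETRY" into one replay outcome, which is an assumption made for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; none of them are kernel identifiers. */
enum fault_action {
	ACTION_REPLAY_FAULT,	/* leave the PTE alone and let the access fault again */
	ACTION_MIGRATE,		/* hand the page to the migration helper, PTE unchanged */
	ACTION_MAP,		/* restore vma->vm_page_prot on the PTE */
};

/* What the handler looks at once it has dropped the page table lock. */
struct page_state {
	bool isolated;		/* stands in for is_migrate_isolate_page() */
	bool tag_storage_page;	/* stands in for is_migrate_metadata_page() */
	int reserve_ret;	/* stands in for reserve_metadata_storage()'s result */
};

/* Mirrors the if / else if / else chain in the patch, in the same order. */
enum fault_action metadata_none_decide(const struct page_state *p)
{
	if (p->isolated)
		return ACTION_REPLAY_FAULT;
	if (p->tag_storage_page)
		return ACTION_MIGRATE;
	/* Tag storage is only reserved when neither check above fired. */
	if (p->reserve_ret == -EINTR)
		return ACTION_REPLAY_FAULT;
	if (p->reserve_ret != 0)
		return ACTION_MIGRATE;
	return ACTION_MAP;
}

/* Small self-check covering each branch of the model. */
void metadata_none_selfcheck(void)
{
	struct page_state isolated = { .isolated = true };
	struct page_state storage = { .tag_storage_page = true };
	struct page_state interrupted = { .reserve_ret = -EINTR };
	struct page_state failed = { .reserve_ret = -EBUSY };
	struct page_state reserved = { .reserve_ret = 0 };

	assert(metadata_none_decide(&isolated) == ACTION_REPLAY_FAULT);
	assert(metadata_none_decide(&storage) == ACTION_MIGRATE);
	assert(metadata_none_decide(&interrupted) == ACTION_REPLAY_FAULT);
	assert(metadata_none_decide(&failed) == ACTION_MIGRATE);
	assert(metadata_none_decide(&reserved) == ACTION_MAP);
}
```

Calling metadata_none_selfcheck() from a test program exercises all five paths. Only the last outcome corresponds to reaching the out_map label after the second pte_same() check; every other outcome leaves the PAGE_METADATA_NONE entry in place so the access faults again.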
