Message ID | 20230622144210.2623299-12-ryan.roberts@arm.com (mailing list archive)
---|---
State | New, archived
Series | Transparent Contiguous PTEs for User Mappings
On 6/22/23 07:42, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
>
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
>
> Write-enable and write-protect modifications are likely non-optimal and
> likely incur a regression in fork() performance. This will be addressed
> separately.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---

Hi Ryan!

While trying out the full series from your gitlab features/granule_perf/all
branch, I found it necessary to EXPORT a symbol in order to build this.
Please see below:

...

> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guaranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */
> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}

Here we need something like this, in order to get it to build in all
possible configurations:

	EXPORT_SYMBOL_GPL(contpte_ptep_get);

(and a corresponding "#include <linux/export.h>" at the top of the file).
Because the static inline functions invoke this routine, above.

thanks,
On 30/06/2023 02:54, John Hubbard wrote:
> On 6/22/23 07:42, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management.
>>
>> Write-enable and write-protect modifications are likely non-optimal and
>> likely incur a regression in fork() performance. This will be addressed
>> separately.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>
> Hi Ryan!
>
> While trying out the full series from your gitlab features/granule_perf/all
> branch, I found it necessary to EXPORT a symbol in order to build this.

Thanks for the bug report!

> Please see below:
>
> ...
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guaranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		/*
>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +		 * where some of the ptes in the range may be cleared but others
>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>> +		 */
>> +		if (pte_val(pte) == 0)
>> +			continue;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>
> Here we need something like this, in order to get it to build in all
> possible configurations:
>
> 	EXPORT_SYMBOL_GPL(contpte_ptep_get);
>
> (and a corresponding "#include <linux/export.h>" at the top of the file).
> Because the static inline functions invoke this routine, above.

A quick grep through the drivers directory shows:

ptep_get() is used by:
  - drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
  - drivers/misc/sgi-gru/grufault.c
  - drivers/vfio/vfio_iommu_type1.c
  - drivers/xen/privcmd.c

set_pte_at() is used by:
  - drivers/gpu/drm/i915/i915_mm.c
  - drivers/xen/xlate_mmu.c

None of the other symbols are called, but I guess it is possible that out of
tree modules are calling others. So on the basis that these symbols were
previously pure inline, I propose to export all the contpte_* symbols using
EXPORT_SYMBOL() so that anything that was previously calling them
successfully can continue to do so. Will include in v2.

Thanks,
Ryan

> thanks,
Hi Ryan, Some comments below. I did not have time to trim down the quoted text, so you may need to scroll through it. On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote: > With the ptep API sufficiently refactored, we can now introduce a new > "contpte" API layer, which transparently manages the PTE_CONT bit for > user mappings. Whenever it detects a set of PTEs that meet the > requirements for a contiguous range, the PTEs are re-painted with the > PTE_CONT bit. > > This initial change provides a baseline that can be optimized in future > commits. That said, fold/unfold operations (which imply tlb > invalidation) are avoided where possible with a few tricks for > access/dirty bit management. > > Write-enable and write-protect modifications are likely non-optimal and > likely incure a regression in fork() performance. This will be addressed > separately. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > arch/arm64/include/asm/pgtable.h | 137 ++++++++++++- > arch/arm64/mm/Makefile | 3 +- > arch/arm64/mm/contpte.c | 334 +++++++++++++++++++++++++++++++ > 3 files changed, 466 insertions(+), 8 deletions(-) > create mode 100644 arch/arm64/mm/contpte.c > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index 31df4d73f9ac..17ea534bc5b0 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma, > unsigned long addr, pte_t *ptep, > pte_t old_pte, pte_t new_pte); > > +/* > + * The contpte APIs are used to transparently manage the contiguous bit in ptes > + * where it is possible and makes sense to do so. The PTE_CONT bit is considered > + * a private implementation detail of the public ptep API (see below). 
> + */ > +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte); > +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte); > +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte); > +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep); > +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte, unsigned int nr); > +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep); > +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep); > +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep, > + pte_t entry, int dirty); > + > +static inline pte_t *contpte_align_down(pte_t *ptep) > +{ > + return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3); > +} > + > +static inline bool contpte_is_enabled(struct mm_struct *mm) > +{ > + /* > + * Don't attempt to apply the contig bit to kernel mappings, because > + * dynamically adding/removing the contig bit can cause page faults. > + * These racing faults are ok for user space, since they get serialized > + * on the PTL. But kernel mappings can't tolerate faults. > + */ > + > + return mm != &init_mm; > +} > + > +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte) > +{ > + /* > + * Only bother trying if both the virtual and physical addresses are > + * aligned and correspond to the last entry in a contig range. The core > + * code mostly modifies ranges from low to high, so this is the likely > + * the last modification in the contig range, so a good time to fold. 
> + */ > + > + bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1; > + bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1; > + > + if (contpte_is_enabled(mm) && > + pte_present(pte) && !pte_cont(pte) && > + valign && palign) > + __contpte_try_fold(mm, addr, ptep, pte); I would use pte_valid() here instead. pte_present() also includes the PTE_PROT_NONE option which we don't really care about as it's not accessible. I've been discussing with Alexandru Elisei about PTE_PROT_NONE and whether we can use other bits from the pte even if they clash with other valid permissions. Since the pte is not valid, in theory we could as long as nothing corrupts the (like a cont bit). The background to this is multiple migrate types (not just NUMA) for the MTE tag carveout reuse. > +} > + > +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte) > +{ > + if (contpte_is_enabled(mm) && > + pte_present(pte) && pte_cont(pte)) > + __contpte_try_unfold(mm, addr, ptep, pte); > +} Same here and probably most other places where pte_present() is used in this patch. > + > /* > * The below functions constitute the public API that arm64 presents to the > * core-mm to manipulate PTE entries within the their page tables (or at least > @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma, > * versions will automatically and transparently apply the contiguous bit where > * it makes sense to do so. Therefore any users that are contig-aware (e.g. > * hugetlb, kernel mapper) should NOT use these APIs, but instead use the > - * private versions, which are prefixed with double underscore. > + * private versions, which are prefixed with double underscore. All of these > + * APIs except for ptep_get_lockless() are expected to be called with the PTL > + * held. 
> */ > > #define ptep_get ptep_get > static inline pte_t ptep_get(pte_t *ptep) > { > - return __ptep_get(ptep); > + pte_t pte = __ptep_get(ptep); > + > + if (!pte_present(pte) || !pte_cont(pte)) > + return pte; > + > + return contpte_ptep_get(ptep, pte); > +} > + > +#define ptep_get_lockless ptep_get_lockless > +static inline pte_t ptep_get_lockless(pte_t *ptep) > +{ > + pte_t pte = __ptep_get(ptep); > + > + if (!pte_present(pte) || !pte_cont(pte)) > + return pte; > + > + return contpte_ptep_get_lockless(ptep); > } > > static inline void set_pte(pte_t *ptep, pte_t pte) > { > - __set_pte(ptep, pte); > + /* > + * We don't have the mm or vaddr so cannot unfold or fold contig entries > + * (since it requires tlb maintenance). set_pte() is not used in core > + * code, so this should never even be called. Regardless do our best to > + * service any call and emit a warning if there is any attempt to set a > + * pte on top of an existing contig range. > + */ > + pte_t orig_pte = __ptep_get(ptep); > + > + WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte)); > + __set_pte(ptep, pte_mknoncont(pte)); Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning only checks the old entry. 
> } > > #define set_ptes set_ptes > static inline void set_ptes(struct mm_struct *mm, unsigned long addr, > pte_t *ptep, pte_t pte, unsigned int nr) > { > - __set_ptes(mm, addr, ptep, pte, nr); > + pte = pte_mknoncont(pte); > + > + if (!contpte_is_enabled(mm)) > + __set_ptes(mm, addr, ptep, pte, nr); > + else if (nr == 1) { > + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); > + __set_ptes(mm, addr, ptep, pte, nr); > + contpte_try_fold(mm, addr, ptep, pte); > + } else > + contpte_set_ptes(mm, addr, ptep, pte, nr); > } > > static inline void pte_clear(struct mm_struct *mm, > unsigned long addr, pte_t *ptep) > { > + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); > __pte_clear(mm, addr, ptep); > } > > @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm, > static inline pte_t ptep_get_and_clear(struct mm_struct *mm, > unsigned long addr, pte_t *ptep) > { > + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); > return __ptep_get_and_clear(mm, addr, ptep); > } > > @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, > static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, > unsigned long addr, pte_t *ptep) > { > - return __ptep_test_and_clear_young(vma, addr, ptep); > + pte_t orig_pte = __ptep_get(ptep); > + > + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) > + return __ptep_test_and_clear_young(vma, addr, ptep); Since I've seen this construct a few times, you may want to turn it into a specific check: pte_valid_cont(). 
> + > + return contpte_ptep_test_and_clear_young(vma, addr, ptep); > } > > #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH > static inline int ptep_clear_flush_young(struct vm_area_struct *vma, > unsigned long addr, pte_t *ptep) > { > - return __ptep_clear_flush_young(vma, addr, ptep); > + pte_t orig_pte = __ptep_get(ptep); > + > + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) > + return __ptep_clear_flush_young(vma, addr, ptep); > + > + return contpte_ptep_clear_flush_young(vma, addr, ptep); > } > > #define __HAVE_ARCH_PTEP_SET_WRPROTECT > static inline void ptep_set_wrprotect(struct mm_struct *mm, > unsigned long addr, pte_t *ptep) > { > + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); > __ptep_set_wrprotect(mm, addr, ptep); > + contpte_try_fold(mm, addr, ptep, __ptep_get(ptep)); > } > > #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS > @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma, > unsigned long addr, pte_t *ptep, > pte_t entry, int dirty) > { > - return __ptep_set_access_flags(vma, addr, ptep, entry, dirty); > + pte_t orig_pte = __ptep_get(ptep); > + > + entry = pte_mknoncont(entry); As in a few other places, it's not clear to me why the pte_mknoncont() is needed. Here I expect 'entry' to be cont if *ptep is cont. > + > + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) > + return __ptep_set_access_flags(vma, addr, ptep, entry, dirty); Also wondering, can we have this check on 'entry' rather than 'orig_pte'? And maybe a warning if the cont bit differs between them. 
> + > + return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty); > } > > #endif /* !__ASSEMBLY__ */ > diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile > index dbd1bc95967d..70b6aba09b5d 100644 > --- a/arch/arm64/mm/Makefile > +++ b/arch/arm64/mm/Makefile > @@ -2,7 +2,8 @@ > obj-y := dma-mapping.o extable.o fault.o init.o \ > cache.o copypage.o flush.o \ > ioremap.o mmap.o pgd.o mmu.o \ > - context.o proc.o pageattr.o fixmap.o > + context.o proc.o pageattr.o fixmap.o \ > + contpte.o > obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o > obj-$(CONFIG_PTDUMP_CORE) += ptdump.o > obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c > new file mode 100644 > index 000000000000..e8e4a298fd53 > --- /dev/null > +++ b/arch/arm64/mm/contpte.c > @@ -0,0 +1,334 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (C) 2023 ARM Ltd. > + */ > + > +#include <linux/mm.h> > +#include <asm/tlbflush.h> > + > +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, int nr) > +{ > + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); > + unsigned long start_addr = addr; > + int i; > + > + for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) > + __pte_clear(mm, addr, ptep); > + > + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3); > +} > + > +static bool ptep_any_present(pte_t *ptep, int nr) Valid? 
> +{ > + int i; > + > + for (i = 0; i < nr; i++, ptep++) { > + if (pte_present(__ptep_get(ptep))) > + return true; > + } > + > + return false; > +} > + > +static void contpte_fold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte, bool fold) > +{ > + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); > + unsigned long start_addr; > + pte_t *start_ptep; > + int i; > + > + start_ptep = ptep = contpte_align_down(ptep); > + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); > + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte)); > + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte); > + > + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) { > + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep); > + > + if (pte_dirty(ptent)) > + pte = pte_mkdirty(pte); > + > + if (pte_young(ptent)) > + pte = pte_mkyoung(pte); > + } I presume this can be unsafe if any of the ptes in the range differ, so we need some higher level check. But that means we now have three loops for folding, one to check, another to clear and the last one via __set_ptes(). I guess we can't collapse the first two loops in a 'try' function as we need to do the cleaning (and would have to re-instate the old entries if they can't be made contiguous). > + > + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3); > + > + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES); > +} > + > +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte) > +{ > + /* > + * We have already checked that the virtual and pysical addresses are > + * correctly aligned for a contig mapping in contpte_try_fold() so the > + * remaining checks are to ensure that the contig range is fully covered > + * by a single folio, and ensure that all the ptes are present with > + * contiguous PFNs and matching prots. 
> + */ > + > + struct page *page = pte_page(pte); > + struct folio *folio = page_folio(page); > + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE; > + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE; > + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE); > + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE; > + unsigned long pfn; > + pgprot_t prot; > + pte_t subpte; > + pte_t *orig_ptep; > + int i; > + > + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr) > + return; > + > + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT); > + prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); > + orig_ptep = ptep; > + ptep = contpte_align_down(ptep); > + > + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { > + subpte = __ptep_get(ptep); > + subpte = pte_mkold(pte_mkclean(subpte)); IIUC, this function assumes ptes that only differ by the dirty status can be contiguous. That's probably ok, with a chance of the dirty status spreading to the adjacent ptes in the fold function. Maybe add a comment on why this is ok (or why it doesn't happen). > + > + if (!pte_present(subpte) || > + pte_pfn(subpte) != pfn || > + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot)) > + return; > + } > + > + contpte_fold(mm, addr, orig_ptep, pte, true); > +} > + > +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte) > +{ > + /* > + * We have already checked that the ptes are contiguous in > + * contpte_try_unfold(), so we can unfold unconditionally here. > + */ > + > + contpte_fold(mm, addr, ptep, pte, false); > +} So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly for the unfold case. Maybe it's clearer if we just have two separate functions (or document why the pte_mk*() are needed). > + > +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte) > +{ > + /* > + * Gather access/dirty bits, which may be populated in any of the ptes > + * of the contig range. 
We are guarranteed to be holding the PTL, so any > + * contiguous range cannot be unfolded or otherwise modified under our > + * feet. > + */ > + > + pte_t pte; > + int i; > + > + ptep = contpte_align_down(ptep); > + > + for (i = 0; i < CONT_PTES; i++, ptep++) { > + pte = __ptep_get(ptep); > + > + /* > + * Deal with the partial contpte_ptep_get_and_clear_full() case, > + * where some of the ptes in the range may be cleared but others > + * are still to do. See contpte_ptep_get_and_clear_full(). > + */ > + if (pte_val(pte) == 0) > + continue; > + > + if (pte_dirty(pte)) > + orig_pte = pte_mkdirty(orig_pte); > + > + if (pte_young(pte)) > + orig_pte = pte_mkyoung(orig_pte); > + } > + > + return orig_pte; > +} > + > +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) > +{ > + /* > + * Gather access/dirty bits, which may be populated in any of the ptes > + * of the contig range. We may not be holding the PTL, so any contiguous > + * range may be unfolded/modified/refolded under our feet. > + */ > + > + pte_t orig_pte; > + pgprot_t orig_prot; > + pte_t *ptep; > + unsigned long pfn; > + pte_t pte; > + pgprot_t prot; > + int i; > + > +retry: > + orig_pte = __ptep_get(orig_ptep); > + > + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) > + return orig_pte; I haven't looked through all the patches, so not entirely sure when this function is called. But since you mention that the range may be folded/unfolded, how do we deal with pte_cont() racing with something setting the contig bit? 
> + > + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte))); > + ptep = contpte_align_down(orig_ptep); > + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep); > + > + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { > + pte = __ptep_get(ptep); > + prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); > + > + if (!pte_present(pte) || !pte_cont(pte) || > + pte_pfn(pte) != pfn || > + pgprot_val(prot) != pgprot_val(orig_prot)) > + goto retry; It needs better documenting, I don't understand what the retry here is for (presumably to handle some races). Do we care about some memory ordering as well? __pte_get() only takes care of reading the ptep once. > + > + if (pte_dirty(pte)) > + orig_pte = pte_mkdirty(orig_pte); > + > + if (pte_young(pte)) > + orig_pte = pte_mkyoung(orig_pte); > + } > + > + return orig_pte; > +} > + > +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, > + pte_t *ptep, pte_t pte, unsigned int nr) > +{ > + unsigned long next; > + unsigned long end = addr + (nr << PAGE_SHIFT); > + unsigned long pfn = pte_pfn(pte); > + pgprot_t prot = pte_pgprot(pte); > + pte_t orig_pte; > + > + do { > + next = pte_cont_addr_end(addr, end); > + nr = (next - addr) >> PAGE_SHIFT; > + pte = pfn_pte(pfn, prot); > + > + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0) > + pte = pte_mkcont(pte); > + else > + pte = pte_mknoncont(pte); > + > + /* > + * If operating on a partial contiguous range then we must first > + * unfold the contiguous range if it was previously folded. > + * Otherwise we could end up with overlapping tlb entries. > + */ > + if (nr != CONT_PTES) > + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); > + > + /* > + * If we are replacing ptes that were contiguous or if the new > + * ptes are contiguous and any of the ptes being replaced are > + * present, we need to clear and flush the range to prevent > + * overlapping tlb entries. 
> + */ > + orig_pte = __ptep_get(ptep); > + if ((pte_present(orig_pte) && pte_cont(orig_pte)) || > + (pte_cont(pte) && ptep_any_present(ptep, nr))) > + ptep_clear_flush_range(mm, addr, ptep, nr); > + > + __set_ptes(mm, addr, ptep, pte, nr); > + > + addr = next; > + ptep += nr; > + pfn += nr; > + > + } while (addr != end); > +} > + > +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep) > +{ > + /* > + * ptep_clear_flush_young() technically requires us to clear the access > + * flag for a _single_ pte. However, the core-mm code actually tracks > + * access/dirty per folio, not per page. And since we only create a > + * contig range when the range is covered by a single folio, we can get > + * away with clearing young for the whole contig range here, so we avoid > + * having to unfold. > + */ > + > + int i; > + int young = 0; > + > + ptep = contpte_align_down(ptep); > + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); > + > + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) > + young |= __ptep_test_and_clear_young(vma, addr, ptep); > + > + return young; > +} > + > +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep) > +{ > + int young; > + > + young = contpte_ptep_test_and_clear_young(vma, addr, ptep); > + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); > + > + if (young) { > + /* > + * See comment in __ptep_clear_flush_young(); same rationale for > + * eliding the trailing DSB applies here. > + */ > + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE, > + PAGE_SIZE, true, 3); > + } > + > + return young; > +} > + > +int contpte_ptep_set_access_flags(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep, > + pte_t entry, int dirty) > +{ > + pte_t orig_pte; > + int i; > + > + /* > + * Gather the access/dirty bits for the contiguous range. If nothing has > + * changed, its a noop. 
> + */ > + orig_pte = ptep_get(ptep); > + if (pte_val(orig_pte) == pte_val(entry)) > + return 0; > + > + /* > + * We can fix up access/dirty bits without having to unfold/fold the > + * contig range. But if the write bit is changing, we need to go through > + * the full unfold/fold cycle. > + */ > + if (pte_write(orig_pte) == pte_write(entry)) { Depending on the architecture version, pte_write() either checks a software only bit or it checks the DBM one. > + /* > + * No need to flush here; This is always "more permissive" so we > + * can only be _adding_ the access or dirty bit. And since the > + * tlb can't cache an entry without the AF set and the dirty bit > + * is a SW bit, there can be no confusion. For HW access > + * management, we technically only need to update the flag on a > + * single pte in the range. But for SW access management, we > + * need to update all the ptes to prevent extra faults. > + */ On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel perspective but clean) may be cached in the TLB and we do need flushing. > + ptep = contpte_align_down(ptep); > + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); > + > + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) > + __ptep_set_access_flags(vma, addr, ptep, entry, 0); > + } else { > + /* > + * No need to flush in __ptep_set_access_flags() because we just > + * flushed the whole range in __contpte_try_unfold(). > + */ > + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); > + __ptep_set_access_flags(vma, addr, ptep, entry, 0); > + contpte_try_fold(vma->vm_mm, addr, ptep, entry); > + } > + > + return 1; > +}
On 03/07/2023 16:17, Catalin Marinas wrote: > Hi Ryan, > > Some comments below. I did not have time to trim down the quoted text, > so you may need to scroll through it. Thanks for the review! Looking at the comments, I think they all relate to implementation. Does that imply that you are happy with the shape/approach? Talking with Anshuman yesterday, he suggested putting this behind a new Kconfig option that defaults to disabled and also adding a command line option to disable it when compiled in. I think that makes sense for now at least to reduce risk of performance regression? > > On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote: >> With the ptep API sufficiently refactored, we can now introduce a new >> "contpte" API layer, which transparently manages the PTE_CONT bit for >> user mappings. Whenever it detects a set of PTEs that meet the >> requirements for a contiguous range, the PTEs are re-painted with the >> PTE_CONT bit. >> >> This initial change provides a baseline that can be optimized in future >> commits. That said, fold/unfold operations (which imply tlb >> invalidation) are avoided where possible with a few tricks for >> access/dirty bit management. >> >> Write-enable and write-protect modifications are likely non-optimal and >> likely incure a regression in fork() performance. This will be addressed >> separately. 
>> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> arch/arm64/include/asm/pgtable.h | 137 ++++++++++++- >> arch/arm64/mm/Makefile | 3 +- >> arch/arm64/mm/contpte.c | 334 +++++++++++++++++++++++++++++++ >> 3 files changed, 466 insertions(+), 8 deletions(-) >> create mode 100644 arch/arm64/mm/contpte.c >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index 31df4d73f9ac..17ea534bc5b0 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma, >> unsigned long addr, pte_t *ptep, >> pte_t old_pte, pte_t new_pte); >> >> +/* >> + * The contpte APIs are used to transparently manage the contiguous bit in ptes >> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered >> + * a private implementation detail of the public ptep API (see below). >> + */ >> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte); >> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte); >> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte); >> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep); >> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte, unsigned int nr); >> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, >> + unsigned long addr, pte_t *ptep); >> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, >> + unsigned long addr, pte_t *ptep); >> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma, >> + unsigned long addr, pte_t *ptep, >> + pte_t entry, int dirty); >> + >> +static inline pte_t *contpte_align_down(pte_t *ptep) >> +{ >> + return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3); >> +} >> + >> +static inline bool contpte_is_enabled(struct 
mm_struct *mm) >> +{ >> + /* >> + * Don't attempt to apply the contig bit to kernel mappings, because >> + * dynamically adding/removing the contig bit can cause page faults. >> + * These racing faults are ok for user space, since they get serialized >> + * on the PTL. But kernel mappings can't tolerate faults. >> + */ >> + >> + return mm != &init_mm; >> +} >> + >> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte) >> +{ >> + /* >> + * Only bother trying if both the virtual and physical addresses are >> + * aligned and correspond to the last entry in a contig range. The core >> + * code mostly modifies ranges from low to high, so this is the likely >> + * the last modification in the contig range, so a good time to fold. >> + */ >> + >> + bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1; >> + bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1; >> + >> + if (contpte_is_enabled(mm) && >> + pte_present(pte) && !pte_cont(pte) && >> + valign && palign) >> + __contpte_try_fold(mm, addr, ptep, pte); > > I would use pte_valid() here instead. pte_present() also includes the > PTE_PROT_NONE option which we don't really care about as it's not > accessible. Yep good point. I'll audit all of this and make the appropriate changes for v2. > > I've been discussing with Alexandru Elisei about PTE_PROT_NONE and > whether we can use other bits from the pte even if they clash with other > valid permissions. Since the pte is not valid, in theory we could as > long as nothing corrupts the (like a cont bit). The background to this > is multiple migrate types (not just NUMA) for the MTE tag carveout > reuse. ACK. 
> >> +} >> + >> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte) >> +{ >> + if (contpte_is_enabled(mm) && >> + pte_present(pte) && pte_cont(pte)) >> + __contpte_try_unfold(mm, addr, ptep, pte); >> +} > > Same here and probably most other places where pte_present() is used in > this patch. ACK. > >> + >> /* >> * The below functions constitute the public API that arm64 presents to the >> * core-mm to manipulate PTE entries within the their page tables (or at least >> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma, >> * versions will automatically and transparently apply the contiguous bit where >> * it makes sense to do so. Therefore any users that are contig-aware (e.g. >> * hugetlb, kernel mapper) should NOT use these APIs, but instead use the >> - * private versions, which are prefixed with double underscore. >> + * private versions, which are prefixed with double underscore. All of these >> + * APIs except for ptep_get_lockless() are expected to be called with the PTL >> + * held. >> */ >> >> #define ptep_get ptep_get >> static inline pte_t ptep_get(pte_t *ptep) >> { >> - return __ptep_get(ptep); >> + pte_t pte = __ptep_get(ptep); >> + >> + if (!pte_present(pte) || !pte_cont(pte)) >> + return pte; >> + >> + return contpte_ptep_get(ptep, pte); >> +} >> + >> +#define ptep_get_lockless ptep_get_lockless >> +static inline pte_t ptep_get_lockless(pte_t *ptep) >> +{ >> + pte_t pte = __ptep_get(ptep); >> + >> + if (!pte_present(pte) || !pte_cont(pte)) >> + return pte; >> + >> + return contpte_ptep_get_lockless(ptep); >> } >> >> static inline void set_pte(pte_t *ptep, pte_t pte) >> { >> - __set_pte(ptep, pte); >> + /* >> + * We don't have the mm or vaddr so cannot unfold or fold contig entries >> + * (since it requires tlb maintenance). set_pte() is not used in core >> + * code, so this should never even be called. 
Regardless do our best to >> + * service any call and emit a warning if there is any attempt to set a >> + * pte on top of an existing contig range. >> + */ >> + pte_t orig_pte = __ptep_get(ptep); >> + >> + WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte)); >> + __set_pte(ptep, pte_mknoncont(pte)); > > Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning > only checks the old entry. Originally, my intent was that the PTE_CONT bit would be totally private to this layer and that the bit should never leak to the generic code (i.e. ptep_get() would clear it before returning the pte, and all functions that accept a pte would WARN if the bit was set on entry). However, this approach proved problematic for accounting; I have a separate change that logs the amount of memory mapped as contpte in /proc/<pid>/smaps[_rollup]. For this to work, the PTE_CONT bit must be leaked to the generic code (ptep_get() no longer explicitly clears it). But if we deliberately leak it, then it's possible that it will be set in functions that take a pte, which would lead to incorrect behavior (potentially a contpte range that has some PTE_CONT bits set and others cleared). This happens because there is generic code that follows a pattern like this: pte = ptep_get_and_clear(ptep); pte = modify_some_bits(pte); set_pte_at(pte); To solve this, I'm explicitly clearing PTE_CONT from any pte that is passed in to one of these functions.
> >> } >> >> #define set_ptes set_ptes >> static inline void set_ptes(struct mm_struct *mm, unsigned long addr, >> pte_t *ptep, pte_t pte, unsigned int nr) >> { >> - __set_ptes(mm, addr, ptep, pte, nr); >> + pte = pte_mknoncont(pte); >> + >> + if (!contpte_is_enabled(mm)) >> + __set_ptes(mm, addr, ptep, pte, nr); >> + else if (nr == 1) { >> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); >> + __set_ptes(mm, addr, ptep, pte, nr); >> + contpte_try_fold(mm, addr, ptep, pte); >> + } else >> + contpte_set_ptes(mm, addr, ptep, pte, nr); >> } >> >> static inline void pte_clear(struct mm_struct *mm, >> unsigned long addr, pte_t *ptep) >> { >> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); >> __pte_clear(mm, addr, ptep); >> } >> >> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm, >> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >> unsigned long addr, pte_t *ptep) >> { >> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); >> return __ptep_get_and_clear(mm, addr, ptep); >> } >> >> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >> unsigned long addr, pte_t *ptep) >> { >> - return __ptep_test_and_clear_young(vma, addr, ptep); >> + pte_t orig_pte = __ptep_get(ptep); >> + >> + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) >> + return __ptep_test_and_clear_young(vma, addr, ptep); > > Since I've seen this construct a few times, you may want to turn it into > a specific check: pte_valid_cont(). ACK - will do for v2. 
>> + >> + return contpte_ptep_test_and_clear_young(vma, addr, ptep); >> } >> >> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH >> static inline int ptep_clear_flush_young(struct vm_area_struct *vma, >> unsigned long addr, pte_t *ptep) >> { >> - return __ptep_clear_flush_young(vma, addr, ptep); >> + pte_t orig_pte = __ptep_get(ptep); >> + >> + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) >> + return __ptep_clear_flush_young(vma, addr, ptep); >> + >> + return contpte_ptep_clear_flush_young(vma, addr, ptep); >> } >> >> #define __HAVE_ARCH_PTEP_SET_WRPROTECT >> static inline void ptep_set_wrprotect(struct mm_struct *mm, >> unsigned long addr, pte_t *ptep) >> { >> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); >> __ptep_set_wrprotect(mm, addr, ptep); >> + contpte_try_fold(mm, addr, ptep, __ptep_get(ptep)); >> } >> >> #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS >> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma, >> unsigned long addr, pte_t *ptep, >> pte_t entry, int dirty) >> { >> - return __ptep_set_access_flags(vma, addr, ptep, entry, dirty); >> + pte_t orig_pte = __ptep_get(ptep); >> + >> + entry = pte_mknoncont(entry); > > As in a few other places, it's not clear to me why the pte_mknoncont() > is needed. Here I expect 'entry' to be cont if *ptep is cont. See explanation above. > >> + >> + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) >> + return __ptep_set_access_flags(vma, addr, ptep, entry, dirty); > > Also wondering, can we have this check on 'entry' rather than > 'orig_pte'? And maybe a warning if the cont bit differs between them. No - the idea is that this API layer has exclusive control over whether PTE_CONT is set in the pgtable. Upper layers should never pass a pte with PTE_CONT set (except for the corner case described above, which we deal with by explicitly clearing PTE_CONT from the passed-in pte).
So the check must be on orig_pte - we are checking if a contpte range is present over the pte we are about to modify. If it is, then we need to handle it carefully (potentially by unfolding it first; handled by contpte_ptep_set_access_flags()). If there is no contpte range, then we can just handle it the "normal" way. > >> + >> + return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty); >> } >> >> #endif /* !__ASSEMBLY__ */ >> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile >> index dbd1bc95967d..70b6aba09b5d 100644 >> --- a/arch/arm64/mm/Makefile >> +++ b/arch/arm64/mm/Makefile >> @@ -2,7 +2,8 @@ >> obj-y := dma-mapping.o extable.o fault.o init.o \ >> cache.o copypage.o flush.o \ >> ioremap.o mmap.o pgd.o mmu.o \ >> - context.o proc.o pageattr.o fixmap.o >> + context.o proc.o pageattr.o fixmap.o \ >> + contpte.o >> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o >> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o >> obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c >> new file mode 100644 >> index 000000000000..e8e4a298fd53 >> --- /dev/null >> +++ b/arch/arm64/mm/contpte.c >> @@ -0,0 +1,334 @@ >> +// SPDX-License-Identifier: GPL-2.0-only >> +/* >> + * Copyright (C) 2023 ARM Ltd. >> + */ >> + >> +#include <linux/mm.h> >> +#include <asm/tlbflush.h> >> + >> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, int nr) >> +{ >> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); >> + unsigned long start_addr = addr; >> + int i; >> + >> + for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) >> + __pte_clear(mm, addr, ptep); >> + >> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3); >> +} >> + >> +static bool ptep_any_present(pte_t *ptep, int nr) > > Valid?
ACK > >> +{ >> + int i; >> + >> + for (i = 0; i < nr; i++, ptep++) { >> + if (pte_present(__ptep_get(ptep))) >> + return true; >> + } >> + >> + return false; >> +} >> + >> +static void contpte_fold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte, bool fold) >> +{ >> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); >> + unsigned long start_addr; >> + pte_t *start_ptep; >> + int i; >> + >> + start_ptep = ptep = contpte_align_down(ptep); >> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); >> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte)); >> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte); >> + >> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) { >> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep); >> + >> + if (pte_dirty(ptent)) >> + pte = pte_mkdirty(pte); >> + >> + if (pte_young(ptent)) >> + pte = pte_mkyoung(pte); >> + } > > I presume this can be unsafe if any of the ptes in the range differ, so > we need some higher level check. Sorry, I'm not quite sure what you mean here? The higher level check is where we look at the current value of the target PTE; if PTE_CONT is set then we know it is part of a contpte range. We are careful that PTE_CONT is set consistently for all (valid) PTEs in a contpte range, so we only need to check 1 entry. There is no risk of racing here because we are always serialized by the PTL. > But that means we now have three loops > for folding, one to check, another to clear and the last one via > __set_ptes(). I guess we can't collapse the first two loops in a 'try' > function as we need to do the cleaning (and would have to re-instate the > old entries if they can't be made contiguous). Yes, 3 loops, and I don't see how you would reduce that. The good news is that this folding path should be rarely taken; most qualifying ranges will be set via set_ptes() so they are written "pre-folded".
We only attempt to fold after setting the pte at the _end_ of the range (see contpte_try_fold()), and the checker loop in __contpte_try_fold() will usually exit on the second iteration if the memory is not physically contiguous. > >> + >> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3); >> + >> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES); >> +} >> + >> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte) >> +{ >> + /* >> + * We have already checked that the virtual and pysical addresses are >> + * correctly aligned for a contig mapping in contpte_try_fold() so the >> + * remaining checks are to ensure that the contig range is fully covered >> + * by a single folio, and ensure that all the ptes are present with >> + * contiguous PFNs and matching prots. >> + */ >> + >> + struct page *page = pte_page(pte); >> + struct folio *folio = page_folio(page); >> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE; >> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE; >> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE); >> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE; >> + unsigned long pfn; >> + pgprot_t prot; >> + pte_t subpte; >> + pte_t *orig_ptep; >> + int i; >> + >> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr) >> + return; >> + >> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT); >> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); >> + orig_ptep = ptep; >> + ptep = contpte_align_down(ptep); >> + >> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { >> + subpte = __ptep_get(ptep); >> + subpte = pte_mkold(pte_mkclean(subpte)); > > IIUC, this function assumes ptes that only differ by the dirty status > can be contiguous. That's probably ok, with a chance of the dirty status > spreading to the adjacent ptes in the fold function. Maybe add a comment > on why this is ok (or why it doesn't happen). 
Conceptually a contpte range only has a single access and dirty bit. So when folding, we OR together all the access bits and all the dirty bits from the constituent ptes to determine the single access and dirty bits for the contpte mapping. And when unfolding, we take the single access and dirty bit for the contpte mapping and apply those values to every individual entry. So yes, we ignore the access and dirty values for the subptes when evaluating whether a contiguous range exists. I'll add a comment. > >> + >> + if (!pte_present(subpte) || >> + pte_pfn(subpte) != pfn || >> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot)) >> + return; >> + } >> + >> + contpte_fold(mm, addr, orig_ptep, pte, true); >> +} >> + >> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte) >> +{ >> + /* >> + * We have already checked that the ptes are contiguous in >> + * contpte_try_unfold(), so we can unfold unconditionally here. >> + */ >> + >> + contpte_fold(mm, addr, ptep, pte, false); >> +} > > So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly > for the unfold case. Maybe it's clearer if we just have two separate > functions (or document why the pte_mk*() are needed). No, that's not the case. In the unfold case, we need to "collect" the single access and dirty bit from the contpte mapping (these may be in any of the entries), and set the final values for all individual ptes during unfolding. The obvious side effect here is that if any one page is dirty at fold time, the whole range will be marked as dirty after folding, then at unfolding all pages will be marked as dirty (same goes for access). This is the same concern that I raised in the cover letter. I don't think this is a problem from the kernel's point of view; the kernel will compress the per-page access/dirty info to per-folio and we only fold if the whole range is covered by a single folio.
But user space could observe this "over-dirtying" through /proc/<pid>/pagemap. I'm not sure if that's a problem in practice? > >> + >> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte) >> +{ >> + /* >> + * Gather access/dirty bits, which may be populated in any of the ptes >> + * of the contig range. We are guarranteed to be holding the PTL, so any >> + * contiguous range cannot be unfolded or otherwise modified under our >> + * feet. >> + */ >> + >> + pte_t pte; >> + int i; >> + >> + ptep = contpte_align_down(ptep); >> + >> + for (i = 0; i < CONT_PTES; i++, ptep++) { >> + pte = __ptep_get(ptep); >> + >> + /* >> + * Deal with the partial contpte_ptep_get_and_clear_full() case, >> + * where some of the ptes in the range may be cleared but others >> + * are still to do. See contpte_ptep_get_and_clear_full(). >> + */ >> + if (pte_val(pte) == 0) >> + continue; >> + >> + if (pte_dirty(pte)) >> + orig_pte = pte_mkdirty(orig_pte); >> + >> + if (pte_young(pte)) >> + orig_pte = pte_mkyoung(orig_pte); >> + } >> + >> + return orig_pte; >> +} >> + >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) >> +{ >> + /* >> + * Gather access/dirty bits, which may be populated in any of the ptes >> + * of the contig range. We may not be holding the PTL, so any contiguous >> + * range may be unfolded/modified/refolded under our feet. >> + */ >> + >> + pte_t orig_pte; >> + pgprot_t orig_prot; >> + pte_t *ptep; >> + unsigned long pfn; >> + pte_t pte; >> + pgprot_t prot; >> + int i; >> + >> +retry: >> + orig_pte = __ptep_get(orig_ptep); >> + >> + if (!pte_present(orig_pte) || !pte_cont(orig_pte)) >> + return orig_pte; > > I haven't looked through all the patches, so not entirely sure when this > function is called. ptep_get_lockless() is one of the mm optional arch interfaces. arm64 doesn't currently implement it, because ptep_get() (READ_ONCE()) is safe without the lock being held. 
But with the introduction of contpte mappings, there are cases now where we have to read the whole contpte range to gather access and dirty, which obviously isn't atomic. And doing that without the PTL is harder than when we have the PTL, so I've implemented ptep_get_lockless() so we can assume the PTL is held in ptep_get() and do the simple thing there (the common case). > But since you mention that the range may be > folded/unfolded, how do we deal with pte_cont() racing with something > setting the contig bit? ptep_get_lockless() is inherently racy. My intention was that we just need to ensure we read a pte or contpte range that is consistent with itself. > >> + >> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte))); >> + ptep = contpte_align_down(orig_ptep); >> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep); >> + >> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { >> + pte = __ptep_get(ptep); >> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); >> + >> + if (!pte_present(pte) || !pte_cont(pte) || >> + pte_pfn(pte) != pfn || >> + pgprot_val(prot) != pgprot_val(orig_prot)) >> + goto retry; > > It needs better documenting, I don't understand what the retry here is > for (presumably to handle some races). Do we care about some memory > ordering as well? __ptep_get() only takes care of reading the ptep once. The intention is that the loop keeps retrying until it scans a whole contpte range that is consistent with itself (i.e. PTE_CONT bit is set in all, pfn increments monotonically and pgprots are all the same). If any of those conditions are not true, it indicates we are racing with an update and need to retry until it's consistent. I'd need to think a bit more on whether we need anything special for memory ordering... To be honest, I'm not a big fan of this function. As far as I can tell, the only user of ptep_get_lockless() that cares about access/dirty is ptdump.
Perhaps we can re-spec this to not return access/dirty info (that would simplify it back to a READ_ONCE()), then figure out a way to hold the PTL for ptdump and use ptep_get() which will return the access/dirty info correctly. Do you think something like that could work? > >> + >> + if (pte_dirty(pte)) >> + orig_pte = pte_mkdirty(orig_pte); >> + >> + if (pte_young(pte)) >> + orig_pte = pte_mkyoung(orig_pte); >> + } >> + >> + return orig_pte; >> +} >> + >> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, >> + pte_t *ptep, pte_t pte, unsigned int nr) >> +{ >> + unsigned long next; >> + unsigned long end = addr + (nr << PAGE_SHIFT); >> + unsigned long pfn = pte_pfn(pte); >> + pgprot_t prot = pte_pgprot(pte); >> + pte_t orig_pte; >> + >> + do { >> + next = pte_cont_addr_end(addr, end); >> + nr = (next - addr) >> PAGE_SHIFT; >> + pte = pfn_pte(pfn, prot); >> + >> + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0) >> + pte = pte_mkcont(pte); >> + else >> + pte = pte_mknoncont(pte); >> + >> + /* >> + * If operating on a partial contiguous range then we must first >> + * unfold the contiguous range if it was previously folded. >> + * Otherwise we could end up with overlapping tlb entries. >> + */ >> + if (nr != CONT_PTES) >> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); >> + >> + /* >> + * If we are replacing ptes that were contiguous or if the new >> + * ptes are contiguous and any of the ptes being replaced are >> + * present, we need to clear and flush the range to prevent >> + * overlapping tlb entries. 
>> + */ >> + orig_pte = __ptep_get(ptep); >> + if ((pte_present(orig_pte) && pte_cont(orig_pte)) || >> + (pte_cont(pte) && ptep_any_present(ptep, nr))) >> + ptep_clear_flush_range(mm, addr, ptep, nr); >> + >> + __set_ptes(mm, addr, ptep, pte, nr); >> + >> + addr = next; >> + ptep += nr; >> + pfn += nr; >> + >> + } while (addr != end); >> +} >> + >> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, >> + unsigned long addr, pte_t *ptep) >> +{ >> + /* >> + * ptep_clear_flush_young() technically requires us to clear the access >> + * flag for a _single_ pte. However, the core-mm code actually tracks >> + * access/dirty per folio, not per page. And since we only create a >> + * contig range when the range is covered by a single folio, we can get >> + * away with clearing young for the whole contig range here, so we avoid >> + * having to unfold. >> + */ >> + >> + int i; >> + int young = 0; >> + >> + ptep = contpte_align_down(ptep); >> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); >> + >> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) >> + young |= __ptep_test_and_clear_young(vma, addr, ptep); >> + >> + return young; >> +} >> + >> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, >> + unsigned long addr, pte_t *ptep) >> +{ >> + int young; >> + >> + young = contpte_ptep_test_and_clear_young(vma, addr, ptep); >> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); >> + >> + if (young) { >> + /* >> + * See comment in __ptep_clear_flush_young(); same rationale for >> + * eliding the trailing DSB applies here. >> + */ >> + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE, >> + PAGE_SIZE, true, 3); >> + } >> + >> + return young; >> +} >> + >> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma, >> + unsigned long addr, pte_t *ptep, >> + pte_t entry, int dirty) >> +{ >> + pte_t orig_pte; >> + int i; >> + >> + /* >> + * Gather the access/dirty bits for the contiguous range. If nothing has >> + * changed, its a noop. 
>> + * changed, its a noop. >> + */ >> + orig_pte = ptep_get(ptep); >> + if (pte_val(orig_pte) == pte_val(entry)) >> + return 0; >> + >> + /* >> + * We can fix up access/dirty bits without having to unfold/fold the >> + * contig range. But if the write bit is changing, we need to go through >> + * the full unfold/fold cycle. >> + */ >> + if (pte_write(orig_pte) == pte_write(entry)) { > > Depending on the architecture version, pte_write() either checks a > software only bit or it checks the DBM one. > >> + /* >> + * No need to flush here; This is always "more permissive" so we >> + * can only be _adding_ the access or dirty bit. And since the >> + * tlb can't cache an entry without the AF set and the dirty bit >> + * is a SW bit, there can be no confusion. For HW access >> + * management, we technically only need to update the flag on a >> + * single pte in the range. But for SW access management, we >> + * need to update all the ptes to prevent extra faults. >> + */ > > On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel > perspective but clean) may be cached in the TLB and we do need flushing. I don't follow; The Arm ARM says: > > IPNQBP When an Access flag fault is generated, the translation table entry > causing the fault is not cached in a TLB. > > So the entry can only be in the TLB if AF is already 1. And given the dirty bit > is SW, it shouldn't affect the TLB state. And this function promises to only > change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1). > > So I'm not sure what case you are describing here? > >> + ptep = contpte_align_down(ptep); >> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); >> + >> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) >> + __ptep_set_access_flags(vma, addr, ptep, entry, 0); >> + } else { >> + /* >> + * No need to flush in __ptep_set_access_flags() because we just >> + * flushed the whole range in __contpte_try_unfold().
>> + */ >> + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); >> + __ptep_set_access_flags(vma, addr, ptep, entry, 0); >> + contpte_try_fold(vma->vm_mm, addr, ptep, entry); >> + } >> + >> + return 1; >> +} > Thanks, Ryan
On 04/07/2023 12:09, Ryan Roberts wrote: > On 03/07/2023 16:17, Catalin Marinas wrote: >> Hi Ryan, >> ... >>> + >>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma, >>> + unsigned long addr, pte_t *ptep, >>> + pte_t entry, int dirty) >>> +{ >>> + pte_t orig_pte; >>> + int i; >>> + >>> + /* >>> + * Gather the access/dirty bits for the contiguous range. If nothing has >>> + * changed, its a noop. >>> + */ >>> + orig_pte = ptep_get(ptep); >>> + if (pte_val(orig_pte) == pte_val(entry)) >>> + return 0; >>> + >>> + /* >>> + * We can fix up access/dirty bits without having to unfold/fold the >>> + * contig range. But if the write bit is changing, we need to go through >>> + * the full unfold/fold cycle. >>> + */ >>> + if (pte_write(orig_pte) == pte_write(entry)) { >> >> Depending on the architecture version, pte_write() either checks a >> software only bit or it checks the DBM one. >> >>> + /* >>> + * No need to flush here; This is always "more permissive" so we >>> + * can only be _adding_ the access or dirty bit. And since the >>> + * tlb can't cache an entry without the AF set and the dirty bit >>> + * is a SW bit, there can be no confusion. For HW access >>> + * management, we technically only need to update the flag on a >>> + * single pte in the range. But for SW access management, we >>> + * need to update all the ptes to prevent extra faults. >>> + */ >> >> On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel >> perspective but clean) may be cached in the TLB and we do need flushing. > > I don't follow; The Arm ARM says: > > IPNQBP When an Access flag fault is generated, the translation table entry > causing the fault is not cached in a TLB. > > So the entry can only be in the TLB if AF is already 1. And given the dirty bit > is SW, it shouldn't affect the TLB state. And this function promises to only > change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1). > > So I'm not sure what case you are describing here? 
Ahh sorry, I get your point now - on pre-DBM hardware, the HW sees a read-only PTE when the kernel considers it clean and this can be in the TLB. Then when making it dirty (from the kernel's perspective), we are removing the read-only protection from the HW perspective, so we need to flush the TLB entry. > >> >>> + ptep = contpte_align_down(ptep); >>> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); >>> + >>> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) >>> + __ptep_set_access_flags(vma, addr, ptep, entry, 0); Fixed by adding this after iterating through the ptes; the intent is to avoid the per-page tlb flush and instead flush the whole range at the end: if (dirty) __flush_tlb_range(vma, start_addr, addr, PAGE_SIZE, true, 3); >>> + } else { >>> + /* >>> + * No need to flush in __ptep_set_access_flags() because we just >>> + * flushed the whole range in __contpte_try_unfold(). >>> + */ >>> + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); >>> + __ptep_set_access_flags(vma, addr, ptep, entry, 0); I also think this is wrong; we must pass `dirty` as the last parameter so that __ptep_set_access_flags will flush if necessary. My comment about having just done the flush is incorrect - we have just done a flush, but the ptes are still valid with their old value so the HW could pull this into the TLB before we modify the value. >>> + contpte_try_fold(vma->vm_mm, addr, ptep, entry); >>> + } >>> + >>> + return 1; >>> +} >> > > Thanks, > Ryan >
On Tue, Jul 04, 2023 at 12:09:31PM +0100, Ryan Roberts wrote: > On 03/07/2023 16:17, Catalin Marinas wrote: > > Hi Ryan, > > > > Some comments below. I did not have time to trim down the quoted text, > > so you may need to scroll through it. > > Thanks for the review! > > Looking at the comments, I think they all relate to implementation. Does that > imply that you are happy with the shape/approach? I can't really tell yet as there are a few dependencies and I haven't applied them to look at the bigger picture. My preference would be to handle the large folio breaking/making in the core code via APIs like set_ptes() and eliminate the loop heuristics in the arm64 code to fold/unfold. Maybe it's not entirely possible; I need to look at the bigger picture with all the series applied (and on a bigger screen, writing this reply on a laptop in flight). > Talking with Anshuman yesterday, he suggested putting this behind a new Kconfig > option that defaults to disabled and also adding a command line option to > disable it when compiled in. I think that makes sense for now at least to reduce > risk of performance regression? I'm fine with a Kconfig option (maybe expert) but default enabled, otherwise it won't get enough coverage. AFAICT, the biggest risk of regression is the heuristics for folding/unfolding. In general the overhead should be offset by the reduced TLB pressure but we may find some pathological case where this gets in the way. > > On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote: > >> + /* > >> + * No need to flush here; This is always "more permissive" so we > >> + * can only be _adding_ the access or dirty bit. And since the > >> + * tlb can't cache an entry without the AF set and the dirty bit > >> + * is a SW bit, there can be no confusion. For HW access > >> + * management, we technically only need to update the flag on a > >> + * single pte in the range.
But for SW access management, we > >> + * need to update all the ptes to prevent extra faults. > >> + */ > > > > On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel > > perspective but clean) may be cached in the TLB and we do need flushing. > > I don't follow; The Arm ARM says: > > IPNQBP When an Access flag fault is generated, the translation table entry > causing the fault is not cached in a TLB. > > So the entry can only be in the TLB if AF is already 1. And given the dirty bit > is SW, it shouldn't affect the TLB state. And this function promises to only > change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1). > > So I'm not sure what case you are describing here? The comment for this function states that it sets the access/dirty flags as well as the write permission. Prior to DBM, the page is marked PTE_RDONLY and we take a fault. This function marks the page dirty by setting the software PTE_DIRTY bit (no need to worry) but also clearing PTE_RDONLY so that a subsequent access won't fault again. We do need the TLBI here since PTE_RDONLY is allowed to be cached in the TLB. Sorry, I did not reply to your other comments (we can talk in person in about a week's time). I also noticed you figured the above but I had written it already.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 31df4d73f9ac..17ea534bc5b0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t old_pte, pte_t new_pte);
 
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty);
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static inline bool contpte_is_enabled(struct mm_struct *mm)
+{
+	/*
+	 * Don't attempt to apply the contig bit to kernel mappings, because
+	 * dynamically adding/removing the contig bit can cause page faults.
+	 * These racing faults are ok for user space, since they get serialized
+	 * on the PTL. But kernel mappings can't tolerate faults.
+	 */
+
+	return mm != &init_mm;
+}
+
+static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte)
+{
+	/*
+	 * Only bother trying if both the virtual and physical addresses are
+	 * aligned and correspond to the last entry in a contig range. The core
+	 * code mostly modifies ranges from low to high, so this is likely the
+	 * last modification in the contig range, so a good time to fold.
+	 */
+
+	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
+	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
+
+	if (contpte_is_enabled(mm) &&
+	    pte_present(pte) && !pte_cont(pte) &&
+	    valign && palign)
+		__contpte_try_fold(mm, addr, ptep, pte);
+}
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte)
+{
+	if (contpte_is_enabled(mm) &&
+	    pte_present(pte) && pte_cont(pte))
+		__contpte_try_unfold(mm, addr, ptep, pte);
+}
+
 /*
  * The below functions constitute the public API that arm64 presents to the
  * core-mm to manipulate PTE entries within the their page tables (or at least
@@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
  * versions will automatically and transparently apply the contiguous bit where
  * it makes sense to do so. Therefore any users that are contig-aware (e.g.
  * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
- * private versions, which are prefixed with double underscore.
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
  */
 
 #define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
-	return __ptep_get(ptep);
+	pte_t pte = __ptep_get(ptep);
+
+	if (!pte_present(pte) || !pte_cont(pte))
+		return pte;
+
+	return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+	pte_t pte = __ptep_get(ptep);
+
+	if (!pte_present(pte) || !pte_cont(pte))
+		return pte;
+
+	return contpte_ptep_get_lockless(ptep);
 }
 
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	__set_pte(ptep, pte);
+	/*
+	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
+	 * (since it requires tlb maintenance). set_pte() is not used in core
+	 * code, so this should never even be called. Regardless, do our best to
+	 * service any call and emit a warning if there is any attempt to set a
+	 * pte on top of an existing contig range.
+	 */
+	pte_t orig_pte = __ptep_get(ptep);
+
+	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
+	__set_pte(ptep, pte_mknoncont(pte));
 }
 
 #define set_ptes set_ptes
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	__set_ptes(mm, addr, ptep, pte, nr);
+	pte = pte_mknoncont(pte);
+
+	if (!contpte_is_enabled(mm))
+		__set_ptes(mm, addr, ptep, pte, nr);
+	else if (nr == 1) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__set_ptes(mm, addr, ptep, pte, nr);
+		contpte_try_fold(mm, addr, ptep, pte);
+	} else
+		contpte_set_ptes(mm, addr, ptep, pte, nr);
 }
 
 static inline void pte_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 	__pte_clear(mm, addr, ptep);
 }
 
@@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 	return __ptep_get_and_clear(mm, addr, ptep);
 }
 
@@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep)
 {
-	return __ptep_test_and_clear_young(vma, addr, ptep);
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep)
 {
-	return __ptep_clear_flush_young(vma, addr, ptep);
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 	__ptep_set_wrprotect(mm, addr, ptep);
+	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty)
 {
-	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+	pte_t orig_pte = __ptep_get(ptep);
+
+	entry = pte_mknoncont(entry);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
 }
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..70b6aba09b5d 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -2,7 +2,8 @@
 obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
-				   context.o proc.o pageattr.o fixmap.o
+				   context.o proc.o pageattr.o fixmap.o \
+				   contpte.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..e8e4a298fd53
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,334 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <asm/tlbflush.h>
+
+static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, int nr)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	unsigned long start_addr = addr;
+	int i;
+
+	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
+		__pte_clear(mm, addr, ptep);
+
+	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+}
+
+static bool ptep_any_present(pte_t *ptep, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++, ptep++) {
+		if (pte_present(__ptep_get(ptep)))
+			return true;
+	}
+
+	return false;
+}
+
+static void contpte_fold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte, bool fold)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	unsigned long start_addr;
+	pte_t *start_ptep;
+	int i;
+
+	start_ptep = ptep = contpte_align_down(ptep);
+	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+		if (pte_dirty(ptent))
+			pte = pte_mkdirty(pte);
+
+		if (pte_young(ptent))
+			pte = pte_mkyoung(pte);
+	}
+
+	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the virtual and physical addresses are
+	 * correctly aligned for a contig mapping in contpte_try_fold() so the
+	 * remaining checks are to ensure that the contig range is fully covered
+	 * by a single folio, and ensure that all the ptes are present with
+	 * contiguous PFNs and matching prots.
+	 */
+
+	struct page *page = pte_page(pte);
+	struct folio *folio = page_folio(page);
+	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
+	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
+	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
+	unsigned long pfn;
+	pgprot_t prot;
+	pte_t subpte;
+	pte_t *orig_ptep;
+	int i;
+
+	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
+		return;
+
+	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
+	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+	orig_ptep = ptep;
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+		subpte = __ptep_get(ptep);
+		subpte = pte_mkold(pte_mkclean(subpte));
+
+		if (!pte_present(subpte) ||
+		    pte_pfn(subpte) != pfn ||
+		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
+			return;
+	}
+
+	contpte_fold(mm, addr, orig_ptep, pte, true);
+}
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the ptes are contiguous in
+	 * contpte_try_unfold(), so we can unfold unconditionally here.
+	 */
+
+	contpte_fold(mm, addr, ptep, pte, false);
+}
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We are guaranteed to be holding the PTL, so any
+	 * contiguous range cannot be unfolded or otherwise modified under our
+	 * feet.
+	 */
+
+	pte_t pte;
+	int i;
+
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++) {
+		pte = __ptep_get(ptep);
+
+		/*
+		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
+		 * where some of the ptes in the range may be cleared but others
+		 * are still to do. See contpte_ptep_get_and_clear_full().
+		 */
+		if (pte_val(pte) == 0)
+			continue;
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We may not be holding the PTL, so any contiguous
+	 * range may be unfolded/modified/refolded under our feet.
+	 */
+
+	pte_t orig_pte;
+	pgprot_t orig_prot;
+	pte_t *ptep;
+	unsigned long pfn;
+	pte_t pte;
+	pgprot_t prot;
+	int i;
+
+retry:
+	orig_pte = __ptep_get(orig_ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return orig_pte;
+
+	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+	ptep = contpte_align_down(orig_ptep);
+	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+		pte = __ptep_get(ptep);
+		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+		if (!pte_present(pte) || !pte_cont(pte) ||
+		    pte_pfn(pte) != pfn ||
+		    pgprot_val(prot) != pgprot_val(orig_prot))
+			goto retry;
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	unsigned long next;
+	unsigned long end = addr + (nr << PAGE_SHIFT);
+	unsigned long pfn = pte_pfn(pte);
+	pgprot_t prot = pte_pgprot(pte);
+	pte_t orig_pte;
+
+	do {
+		next = pte_cont_addr_end(addr, end);
+		nr = (next - addr) >> PAGE_SHIFT;
+		pte = pfn_pte(pfn, prot);
+
+		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+			pte = pte_mkcont(pte);
+		else
+			pte = pte_mknoncont(pte);
+
+		/*
+		 * If operating on a partial contiguous range then we must first
+		 * unfold the contiguous range if it was previously folded.
+		 * Otherwise we could end up with overlapping tlb entries.
+		 */
+		if (nr != CONT_PTES)
+			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+		/*
+		 * If we are replacing ptes that were contiguous or if the new
+		 * ptes are contiguous and any of the ptes being replaced are
+		 * present, we need to clear and flush the range to prevent
+		 * overlapping tlb entries.
+		 */
+		orig_pte = __ptep_get(ptep);
+		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
+		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
+			ptep_clear_flush_range(mm, addr, ptep, nr);
+
+		__set_ptes(mm, addr, ptep, pte, nr);
+
+		addr = next;
+		ptep += nr;
+		pfn += nr;
+
+	} while (addr != end);
+}
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * ptep_clear_flush_young() technically requires us to clear the access
+	 * flag for a _single_ pte. However, the core-mm code actually tracks
+	 * access/dirty per folio, not per page. And since we only create a
+	 * contig range when the range is covered by a single folio, we can get
+	 * away with clearing young for the whole contig range here, so we avoid
+	 * having to unfold.
+	 */
+
+	int i;
+	int young = 0;
+
+	ptep = contpte_align_down(ptep);
+	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+		young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return young;
+}
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	int young;
+
+	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+	if (young) {
+		/*
+		 * See comment in __ptep_clear_flush_young(); same rationale for
+		 * eliding the trailing DSB applies here.
+		 */
+		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+					PAGE_SIZE, true, 3);
+	}
+
+	return young;
+}
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					pte_t entry, int dirty)
+{
+	pte_t orig_pte;
+	int i;
+
+	/*
+	 * Gather the access/dirty bits for the contiguous range. If nothing has
+	 * changed, it's a noop.
+	 */
+	orig_pte = ptep_get(ptep);
+	if (pte_val(orig_pte) == pte_val(entry))
+		return 0;
+
+	/*
+	 * We can fix up access/dirty bits without having to unfold/fold the
+	 * contig range. But if the write bit is changing, we need to go through
+	 * the full unfold/fold cycle.
+	 */
+	if (pte_write(orig_pte) == pte_write(entry)) {
+		/*
+		 * No need to flush here; this is always "more permissive" so we
+		 * can only be _adding_ the access or dirty bit. And since the
+		 * tlb can't cache an entry without the AF set and the dirty bit
+		 * is a SW bit, there can be no confusion. For HW access
+		 * management, we technically only need to update the flag on a
+		 * single pte in the range. But for SW access management, we
+		 * need to update all the ptes to prevent extra faults.
+		 */
+		ptep = contpte_align_down(ptep);
+		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
+	} else {
+		/*
+		 * No need to flush in __ptep_set_access_flags() because we just
+		 * flushed the whole range in __contpte_try_unfold().
+		 */
+		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
+		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
+	}
+
+	return 1;
+}
With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings. Whenever it detects a set of PTEs that meet the
requirements for a contiguous range, the PTEs are re-painted with the
PTE_CONT bit.

This initial change provides a baseline that can be optimized in future
commits. That said, fold/unfold operations (which imply tlb
invalidation) are avoided where possible with a few tricks for
access/dirty bit management.

Write-enable and write-protect modifications are likely non-optimal and
likely incur a regression in fork() performance. This will be addressed
separately.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
 arch/arm64/mm/Makefile           |   3 +-
 arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
 3 files changed, 466 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c