[v1,11/14] arm64/mm: Wire up PTE_CONT for user mappings

Message ID	20230622144210.2623299-12-ryan.roberts@arm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> From: Ryan Roberts <ryan.roberts@arm.com> To: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Ard Biesheuvel <ardb@kernel.org>, Marc Zyngier <maz@kernel.org>, Oliver Upton <oliver.upton@linux.dev>, James Morse <james.morse@arm.com>, Suzuki K Poulose <suzuki.poulose@arm.com>, Zenghui Yu <yuzenghui@huawei.com>, Andrey Ryabinin <ryabinin.a.a@gmail.com>, Alexander Potapenko <glider@google.com>, Andrey Konovalov <andreyknvl@gmail.com>, Dmitry Vyukov <dvyukov@google.com>, Vincenzo Frascino <vincenzo.frascino@arm.com>, Andrew Morton <akpm@linux-foundation.org>, Anshuman Khandual <anshuman.khandual@arm.com>, Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>, Mark Rutland <mark.rutland@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings Date: Thu, 22 Jun 2023 15:42:06 +0100 Message-Id: <20230622144210.2623299-12-ryan.roberts@arm.com> In-Reply-To: <20230622144210.2623299-1-ryan.roberts@arm.com> References: <20230622144210.2623299-1-ryan.roberts@arm.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	Transparent Contiguous PTEs for User Mappings \| expand [v1,00/14] Transparent Contiguous PTEs for User Mappings [v1,01/14] arm64/mm: set_pte(): New layer to manage contig bit [v1,02/14] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit [v1,03/14] arm64/mm: pte_clear(): New layer to manage contig bit [v1,04/14] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit [v1,05/14] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit [v1,06/14] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit [v1,07/14] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit [v1,08/14] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit [v1,09/14] arm64/mm: ptep_get(): New layer to manage contig bit [v1,10/14] arm64/mm: Split __flush_tlb_range() to elide trailing DSB [v1,11/14] arm64/mm: Wire up PTE_CONT for user mappings [v1,12/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown [v1,13/14] mm: Batch-copy PTE ranges during fork() [v1,14/14] arm64/mm: Implement ptep_set_wrprotects() to optimize fork()

Ryan Roberts June 22, 2023, 2:42 p.m. UTC

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings. Whenever it detects a set of PTEs that meet the
requirements for a contiguous range, the PTEs are re-painted with the
PTE_CONT bit.

This initial change provides a baseline that can be optimized in future
commits. That said, fold/unfold operations (which imply tlb
invalidation) are avoided where possible with a few tricks for
access/dirty bit management.

Write-enable and write-protect modifications are likely non-optimal and
likely incure a regression in fork() performance. This will be addressed
separately.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
 arch/arm64/mm/Makefile           |   3 +-
 arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
 3 files changed, 466 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

John Hubbard June 30, 2023, 1:54 a.m. UTC | #1

On 6/22/23 07:42, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
> 
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
> 
> Write-enable and write-protect modifications are likely non-optimal and
> likely incure a regression in fork() performance. This will be addressed
> separately.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---

Hi Ryan!

While trying out the full series from your gitlab features/granule_perf/all
branch, I found it necessary to EXPORT a symbol in order to build this.
Please see below:

...
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */
> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}

Here we need something like this, in order to get it to build in all
possible configurations:

EXPORT_SYMBOL_GPL(contpte_ptep_get);

(and a corresponding "#include linux/export.h" at the top of the file).

Because, the static inline functions invoke this routine, above.

thanks,

Ryan Roberts July 3, 2023, 9:48 a.m. UTC | #2

On 30/06/2023 02:54, John Hubbard wrote:
> On 6/22/23 07:42, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management.
>>
>> Write-enable and write-protect modifications are likely non-optimal and
>> likely incure a regression in fork() performance. This will be addressed
>> separately.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
> 
> Hi Ryan!
> 
> While trying out the full series from your gitlab features/granule_perf/all
> branch, I found it necessary to EXPORT a symbol in order to build this.

Thanks for the bug report!

> Please see below:
> 
> ...
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +    /*
>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>> +     * of the contig range. We are guarranteed to be holding the PTL, so any
>> +     * contiguous range cannot be unfolded or otherwise modified under our
>> +     * feet.
>> +     */
>> +
>> +    pte_t pte;
>> +    int i;
>> +
>> +    ptep = contpte_align_down(ptep);
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +        pte = __ptep_get(ptep);
>> +
>> +        /*
>> +         * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +         * where some of the ptes in the range may be cleared but others
>> +         * are still to do. See contpte_ptep_get_and_clear_full().
>> +         */
>> +        if (pte_val(pte) == 0)
>> +            continue;
>> +
>> +        if (pte_dirty(pte))
>> +            orig_pte = pte_mkdirty(orig_pte);
>> +
>> +        if (pte_young(pte))
>> +            orig_pte = pte_mkyoung(orig_pte);
>> +    }
>> +
>> +    return orig_pte;
>> +}
> 
> Here we need something like this, in order to get it to build in all
> possible configurations:
> 
> EXPORT_SYMBOL_GPL(contpte_ptep_get);
> 
> (and a corresponding "#include linux/export.h" at the top of the file).
> 
> Because, the static inline functions invoke this routine, above.


A quick grep through the drivers directory shows:

ptep_get() is used by:
  - drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
  - drivers/misc/sgi-gru/grufault.c
  - drivers/vfio/vfio_iommu_type1.c
  - drivers/xen/privcmd.c

ptep_set_at() is used by:
  - drivers/gpu/drm/i915/i915_mm.c
  - drivers/xen/xlate_mmu.c

None of the other symbols are called, but I guess it is possible that out of
tree modules are calling others.

So on the basis that these symbols were previously pure inline, I propose to
export all the contpte_* symbols using EXPORT_SYMBOL() so that anything that was
previously calling them successfully continue to do so. Will include in v2.

Thanks,
Ryan


> 
> thanks,

Catalin Marinas July 3, 2023, 3:17 p.m. UTC | #3

Hi Ryan,

Some comments below. I did not have time to trim down the quoted text,
so you may need to scroll through it.

On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
> 
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
> 
> Write-enable and write-protect modifications are likely non-optimal and
> likely incure a regression in fork() performance. This will be addressed
> separately.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
>  arch/arm64/mm/Makefile           |   3 +-
>  arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
>  3 files changed, 466 insertions(+), 8 deletions(-)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 31df4d73f9ac..17ea534bc5b0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
>  
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty);
> +
> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> +}
> +
> +static inline bool contpte_is_enabled(struct mm_struct *mm)
> +{
> +	/*
> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> +	 * dynamically adding/removing the contig bit can cause page faults.
> +	 * These racing faults are ok for user space, since they get serialized
> +	 * on the PTL. But kernel mappings can't tolerate faults.
> +	 */
> +
> +	return mm != &init_mm;
> +}
> +
> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * Only bother trying if both the virtual and physical addresses are
> +	 * aligned and correspond to the last entry in a contig range. The core
> +	 * code mostly modifies ranges from low to high, so this is the likely
> +	 * the last modification in the contig range, so a good time to fold.
> +	 */
> +
> +	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
> +	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
> +
> +	if (contpte_is_enabled(mm) &&
> +	    pte_present(pte) && !pte_cont(pte) &&
> +	    valign && palign)
> +		__contpte_try_fold(mm, addr, ptep, pte);

I would use pte_valid() here instead. pte_present() also includes the
PTE_PROT_NONE option which we don't really care about as it's not
accessible.

I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
whether we can use other bits from the pte even if they clash with other
valid permissions. Since the pte is not valid, in theory we could as
long as nothing corrupts the (like a cont bit). The background to this
is multiple migrate types (not just NUMA) for the MTE tag carveout
reuse.

> +}
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	if (contpte_is_enabled(mm) &&
> +	    pte_present(pte) && pte_cont(pte))
> +		__contpte_try_unfold(mm, addr, ptep, pte);
> +}

Same here and probably most other places where pte_present() is used in
this patch.

> +
>  /*
>   * The below functions constitute the public API that arm64 presents to the
>   * core-mm to manipulate PTE entries within the their page tables (or at least
> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>   * versions will automatically and transparently apply the contiguous bit where
>   * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>   * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> - * private versions, which are prefixed with double underscore.
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
>   */
>  
>  #define ptep_get ptep_get
>  static inline pte_t ptep_get(pte_t *ptep)
>  {
> -	return __ptep_get(ptep);
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (!pte_present(pte) || !pte_cont(pte))
> +		return pte;
> +
> +	return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (!pte_present(pte) || !pte_cont(pte))
> +		return pte;
> +
> +	return contpte_ptep_get_lockless(ptep);
>  }
>  
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
> -	__set_pte(ptep, pte);
> +	/*
> +	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
> +	 * (since it requires tlb maintenance). set_pte() is not used in core
> +	 * code, so this should never even be called. Regardless do our best to
> +	 * service any call and emit a warning if there is any attempt to set a
> +	 * pte on top of an existing contig range.
> +	 */
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
> +	__set_pte(ptep, pte_mknoncont(pte));

Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
only checks the old entry.

>  }
>  
>  #define set_ptes set_ptes
>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> -	__set_ptes(mm, addr, ptep, pte, nr);
> +	pte = pte_mknoncont(pte);
> +
> +	if (!contpte_is_enabled(mm))
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +	else if (nr == 1) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +		contpte_try_fold(mm, addr, ptep, pte);
> +	} else
> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>  }
>  
>  static inline void pte_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	__pte_clear(mm, addr, ptep);
>  }
>  
> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	return __ptep_get_and_clear(mm, addr, ptep);
>  }
>  
> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	return __ptep_test_and_clear_young(vma, addr, ptep);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_test_and_clear_young(vma, addr, ptep);

Since I've seen this construct a few times, you may want to turn it into
a specific check: pte_valid_cont().

> +
> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>  }
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	return __ptep_clear_flush_young(vma, addr, ptep);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	__ptep_set_wrprotect(mm, addr, ptep);
> +	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep,
>  				pte_t entry, int dirty)
>  {
> -	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	entry = pte_mknoncont(entry);

As in a few other places, it's not clear to me why the pte_mknoncont()
is needed. Here I expect 'entry' to be cont if *ptep is cont.

> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);

Also wondering, can we have this check on 'entry' rather than
'orig_pte'? And maybe a warning if the cont bit differs between them.

> +
> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>  }
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..70b6aba09b5d 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -2,7 +2,8 @@
>  obj-y				:= dma-mapping.o extable.o fault.o init.o \
>  				   cache.o copypage.o flush.o \
>  				   ioremap.o mmap.o pgd.o mmu.o \
> -				   context.o proc.o pageattr.o fixmap.o
> +				   context.o proc.o pageattr.o fixmap.o \
> +				   contpte.o
>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..e8e4a298fd53
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,334 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <asm/tlbflush.h>
> +
> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, int nr)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr = addr;
> +	int i;
> +
> +	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> +		__pte_clear(mm, addr, ptep);
> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +}
> +
> +static bool ptep_any_present(pte_t *ptep, int nr)

Valid?

> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++, ptep++) {
> +		if (pte_present(__ptep_get(ptep)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte, bool fold)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr;
> +	pte_t *start_ptep;
> +	int i;
> +
> +	start_ptep = ptep = contpte_align_down(ptep);
> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> +		if (pte_dirty(ptent))
> +			pte = pte_mkdirty(pte);
> +
> +		if (pte_young(ptent))
> +			pte = pte_mkyoung(pte);
> +	}

I presume this can be unsafe if any of the ptes in the range differ, so
we need some higher level check. But that means we now have three loops
for folding, one to check, another to clear and the last one via
__set_ptes(). I guess we can't collapse the first two loops in a 'try'
function as we need to do the cleaning (and would have to re-instate the
old entries if they can't be made contiguous).

> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the virtual and pysical addresses are
> +	 * correctly aligned for a contig mapping in contpte_try_fold() so the
> +	 * remaining checks are to ensure that the contig range is fully covered
> +	 * by a single folio, and ensure that all the ptes are present with
> +	 * contiguous PFNs and matching prots.
> +	 */
> +
> +	struct page *page = pte_page(pte);
> +	struct folio *folio = page_folio(page);
> +	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> +	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> +	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	pte_t subpte;
> +	pte_t *orig_ptep;
> +	int i;
> +
> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> +		return;
> +
> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +	orig_ptep = ptep;
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		subpte = __ptep_get(ptep);
> +		subpte = pte_mkold(pte_mkclean(subpte));

IIUC, this function assumes ptes that only differ by the dirty status
can be contiguous. That's probably ok, with a chance of the dirty status
spreading to the adjacent ptes in the fold function. Maybe add a comment
on why this is ok (or why it doesn't happen).

> +
> +		if (!pte_present(subpte) ||
> +		    pte_pfn(subpte) != pfn ||
> +		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> +			return;
> +	}
> +
> +	contpte_fold(mm, addr, orig_ptep, pte, true);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the ptes are contiguous in
> +	 * contpte_try_unfold(), so we can unfold unconditionally here.
> +	 */
> +
> +	contpte_fold(mm, addr, ptep, pte, false);
> +}

So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
for the unfold case. Maybe it's clearer if we just have two separate
functions (or document why the pte_mk*() are needed).

> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */
> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> +	 * range may be unfolded/modified/refolded under our feet.
> +	 */
> +
> +	pte_t orig_pte;
> +	pgprot_t orig_prot;
> +	pte_t *ptep;
> +	unsigned long pfn;
> +	pte_t pte;
> +	pgprot_t prot;
> +	int i;
> +
> +retry:
> +	orig_pte = __ptep_get(orig_ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return orig_pte;

I haven't looked through all the patches, so not entirely sure when this
function is called. But since you mention that the range may be
folded/unfolded, how do we deal with pte_cont() racing with something
setting the contig bit?

> +
> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> +	ptep = contpte_align_down(orig_ptep);
> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		pte = __ptep_get(ptep);
> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +		if (!pte_present(pte) || !pte_cont(pte) ||
> +		   pte_pfn(pte) != pfn ||
> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> +			goto retry;

It needs better documenting, I don't understand what the retry here is
for (presumably to handle some races). Do we care about some memory
ordering as well? __pte_get() only takes care of reading the ptep once.

> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	unsigned long next;
> +	unsigned long end = addr + (nr << PAGE_SHIFT);
> +	unsigned long pfn = pte_pfn(pte);
> +	pgprot_t prot = pte_pgprot(pte);
> +	pte_t orig_pte;
> +
> +	do {
> +		next = pte_cont_addr_end(addr, end);
> +		nr = (next - addr) >> PAGE_SHIFT;
> +		pte = pfn_pte(pfn, prot);
> +
> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> +			pte = pte_mkcont(pte);
> +		else
> +			pte = pte_mknoncont(pte);
> +
> +		/*
> +		 * If operating on a partial contiguous range then we must first
> +		 * unfold the contiguous range if it was previously folded.
> +		 * Otherwise we could end up with overlapping tlb entries.
> +		 */
> +		if (nr != CONT_PTES)
> +			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> +		/*
> +		 * If we are replacing ptes that were contiguous or if the new
> +		 * ptes are contiguous and any of the ptes being replaced are
> +		 * present, we need to clear and flush the range to prevent
> +		 * overlapping tlb entries.
> +		 */
> +		orig_pte = __ptep_get(ptep);
> +		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
> +		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
> +			ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +
> +		addr = next;
> +		ptep += nr;
> +		pfn += nr;
> +
> +	} while (addr != end);
> +}
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * ptep_clear_flush_young() technically requires us to clear the access
> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
> +	 * access/dirty per folio, not per page. And since we only create a
> +	 * contig range when the range is covered by a single folio, we can get
> +	 * away with clearing young for the whole contig range here, so we avoid
> +	 * having to unfold.
> +	 */
> +
> +	int i;
> +	int young = 0;
> +
> +	ptep = contpte_align_down(ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return young;
> +}
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	int young;
> +
> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	if (young) {
> +		/*
> +		 * See comment in __ptep_clear_flush_young(); same rationale for
> +		 * eliding the trailing DSB applies here.
> +		 */
> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +					 PAGE_SIZE, true, 3);
> +	}
> +
> +	return young;
> +}
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					pte_t entry, int dirty)
> +{
> +	pte_t orig_pte;
> +	int i;
> +
> +	/*
> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
> +	 * changed, its a noop.
> +	 */
> +	orig_pte = ptep_get(ptep);
> +	if (pte_val(orig_pte) == pte_val(entry))
> +		return 0;
> +
> +	/*
> +	 * We can fix up access/dirty bits without having to unfold/fold the
> +	 * contig range. But if the write bit is changing, we need to go through
> +	 * the full unfold/fold cycle.
> +	 */
> +	if (pte_write(orig_pte) == pte_write(entry)) {

Depending on the architecture version, pte_write() either checks a
software only bit or it checks the DBM one.

> +		/*
> +		 * No need to flush here; This is always "more permissive" so we
> +		 * can only be _adding_ the access or dirty bit. And since the
> +		 * tlb can't cache an entry without the AF set and the dirty bit
> +		 * is a SW bit, there can be no confusion. For HW access
> +		 * management, we technically only need to update the flag on a
> +		 * single pte in the range. But for SW access management, we
> +		 * need to update all the ptes to prevent extra faults.
> +		 */

On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
perspective but clean) may be cached in the TLB and we do need flushing.

> +		ptep = contpte_align_down(ptep);
> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +	} else {
> +		/*
> +		 * No need to flush in __ptep_set_access_flags() because we just
> +		 * flushed the whole range in __contpte_try_unfold().
> +		 */
> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> +	}
> +
> +	return 1;
> +}

Ryan Roberts July 4, 2023, 11:09 a.m. UTC | #4

On 03/07/2023 16:17, Catalin Marinas wrote:
> Hi Ryan,
> 
> Some comments below. I did not have time to trim down the quoted text,
> so you may need to scroll through it.

Thanks for the review!

Looking at the comments, I think they all relate to implementation. Does that
imply that you are happy with the shape/approach?

Talking with Anshuman yesterday, he suggested putting this behind a new Kconfig
option that defaults to disabled and also adding a command line option to
disable it when compiled in. I think that makes sense for now at least to reduce
risk of performance regression?

> 
> On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management.
>>
>> Write-enable and write-protect modifications are likely non-optimal and
>> likely incure a regression in fork() performance. This will be addressed
>> separately.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
>>  arch/arm64/mm/Makefile           |   3 +-
>>  arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
>>  3 files changed, 466 insertions(+), 8 deletions(-)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 31df4d73f9ac..17ea534bc5b0 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>  				    unsigned long addr, pte_t *ptep,
>>  				    pte_t old_pte, pte_t new_pte);
>>  
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>> +}
>> +
>> +static inline bool contpte_is_enabled(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +
>> +	return mm != &init_mm;
>> +}
>> +
>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * Only bother trying if both the virtual and physical addresses are
>> +	 * aligned and correspond to the last entry in a contig range. The core
>> +	 * code mostly modifies ranges from low to high, so this is the likely
>> +	 * the last modification in the contig range, so a good time to fold.
>> +	 */
>> +
>> +	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
>> +	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
>> +
>> +	if (contpte_is_enabled(mm) &&
>> +	    pte_present(pte) && !pte_cont(pte) &&
>> +	    valign && palign)
>> +		__contpte_try_fold(mm, addr, ptep, pte);
> 
> I would use pte_valid() here instead. pte_present() also includes the
> PTE_PROT_NONE option which we don't really care about as it's not
> accessible.

Yep good point. I'll audit all of this and make the appropriate changes for v2.

> 
> I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
> whether we can use other bits from the pte even if they clash with other
> valid permissions. Since the pte is not valid, in theory we could as
> long as nothing corrupts the (like a cont bit). The background to this
> is multiple migrate types (not just NUMA) for the MTE tag carveout
> reuse.

ACK.

> 
>> +}
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (contpte_is_enabled(mm) &&
>> +	    pte_present(pte) && pte_cont(pte))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
> 
> Same here and probably most other places where pte_present() is used in
> this patch.

ACK.

> 
>> +
>>  /*
>>   * The below functions constitute the public API that arm64 presents to the
>>   * core-mm to manipulate PTE entries within the their page tables (or at least
>> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   * versions will automatically and transparently apply the contiguous bit where
>>   * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>   * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> - * private versions, which are prefixed with double underscore.
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>>   */
>>  
>>  #define ptep_get ptep_get
>>  static inline pte_t ptep_get(pte_t *ptep)
>>  {
>> -	return __ptep_get(ptep);
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(pte) || !pte_cont(pte))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(pte) || !pte_cont(pte))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>>  }
>>  
>>  static inline void set_pte(pte_t *ptep, pte_t pte)
>>  {
>> -	__set_pte(ptep, pte);
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
>> +	 * (since it requires tlb maintenance). set_pte() is not used in core
>> +	 * code, so this should never even be called. Regardless do our best to
>> +	 * service any call and emit a warning if there is any attempt to set a
>> +	 * pte on top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
> 
> Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
> only checks the old entry.

Originally, it was my intent that PTE_CONT bit would be totally private to this
layer and the bit should never leak to the generic code (i.e. ptep_get() would
clear it before returning the pte and all functions that accept a pte would WARN
if the bit was set on entry.

However, this approach proved problematic for accounting; I have a separate
change that logs the amount of memory mapped as contpte in
/proc/<pid>/smaps[_rollup]. For this to work, the PTE_CONT bit must be leaked to
the generic code (ptep_get() no longer explicitly clears it). But if we
deliberately leak it, then its possible that it will be set in functions that
take a pte, which would lead to incorrect behavior (potentially leading to a
contpte range that has some PTE_CONT bits set and others cleared). This happens
because there is generic code that follows a pattern like this:

  pte = ptep_get_and_clear(ptep)
  pte = modify_some_bits(pte)
  set_pte_at(pte)

To solve this, I'm explicitly clearing CONT_PTE from any pte that is passed in
to one of these functions.

> 
>>  }
>>  
>>  #define set_ptes set_ptes
>>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>  				pte_t *ptep, pte_t pte, unsigned int nr)
>>  {
>> -	__set_ptes(mm, addr, ptep, pte, nr);
>> +	pte = pte_mknoncont(pte);
>> +
>> +	if (!contpte_is_enabled(mm))
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +	else if (nr == 1) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +		contpte_try_fold(mm, addr, ptep, pte);
>> +	} else
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>  }
>>  
>>  static inline void pte_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	__pte_clear(mm, addr, ptep);
>>  }
>>  
>> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	return __ptep_get_and_clear(mm, addr, ptep);
>>  }
>>  
>> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	return __ptep_test_and_clear_young(vma, addr, ptep);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
> 
> Since I've seen this construct a few times, you may want to turn it into
> a specific check: pte_valid_cont().

ACK - will do for v2.

> 
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	return __ptep_clear_flush_young(vma, addr, ptep);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	__ptep_set_wrprotect(mm, addr, ptep);
>> +	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty)
>>  {
>> -	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
> 
> As in a few other places, it's not clear to me why the pte_mknoncont()
> is needed. Here I expect 'entry' to be cont if *ptep is cont.

See explanation above.

> 
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> 
> Also wondering, can we have this check on 'entry' rather than
> 'orig_pte'? And maybe a warning if the cont bit differs between them.

No - the idea is that this API layer has exclusicve control over whether
PTE_CONT is set in the pgtable. Upper layers should never pass a pte with
CONT_PTE set (except for the corner case described above which we deal with by
explicitly clearing PTE_CONT from the passed in pte).

So the check must be on orig_pte - we are checking if a contpte range is present
over the pte we are about to modify. If it is, then we need to handle it
carefully (potentially by unfolding it first; handled by
contpte_ptep_set_access_flags()). If there is no contprte range, then we can
just handle it the "normal" way.

> 
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>  }
>>  
>>  #endif /* !__ASSEMBLY__ */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..70b6aba09b5d 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -2,7 +2,8 @@
>>  obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>> -				   context.o proc.o pageattr.o fixmap.o
>> +				   context.o proc.o pageattr.o fixmap.o \
>> +				   contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..e8e4a298fd53
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,334 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, int nr)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr = addr;
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>> +		__pte_clear(mm, addr, ptep);
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +}
>> +
>> +static bool ptep_any_present(pte_t *ptep, int nr)
> 
> Valid?

ACK

> 
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++, ptep++) {
>> +		if (pte_present(__ptep_get(ptep)))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte, bool fold)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> I presume this can be unsafe if any of the ptes in the range differ, so
> we need some higher level check. 

Sorry I'm not quite sure what you mean here? The higher level check is where we
look at the current value of the target PTE; if PTE_CONT is set then we know it
is part of a contpte range. We are careful that PTE_CONT is set consistently for
all (valid) PTEs in a contpte range, so we only need to check 1 entry. There is
no risk of racing here because we are always serialized by the PTL.

> But that means we now have three loops
> for folding, one to check, another to clear and the last one via
> __set_ptes(). I guess we can't collapse the first two loops in a 'try'
> function as we need to do the cleaning (and would have to re-instate the
> old entries if they can't be made contiguous).

Yes 3 loops, and I don't see how you would reduce that. The good news is that
this folding path should be rarely taken; most qualifying ranges will be set via
set_ptes() so they are ritten "pre-folded". We only attempt to fold after
setting the pte at the _end_ of the range (see contpte_try_fold()), and the
checker loop in __contpte_try_fold() will usually exit on the second iteration
if the memory is not physically contiguous.

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the virtual and pysical addresses are
>> +	 * correctly aligned for a contig mapping in contpte_try_fold() so the
>> +	 * remaining checks are to ensure that the contig range is fully covered
>> +	 * by a single folio, and ensure that all the ptes are present with
>> +	 * contiguous PFNs and matching prots.
>> +	 */
>> +
>> +	struct page *page = pte_page(pte);
>> +	struct folio *folio = page_folio(page);
>> +	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> +	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> +	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	pte_t subpte;
>> +	pte_t *orig_ptep;
>> +	int i;
>> +
>> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> +		return;
>> +
>> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +	orig_ptep = ptep;
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		subpte = __ptep_get(ptep);
>> +		subpte = pte_mkold(pte_mkclean(subpte));
> 
> IIUC, this function assumes ptes that only differ by the dirty status
> can be contiguous. That's probably ok, with a chance of the dirty status
> spreading to the adjacent ptes in the fold function. Maybe add a comment
> on why this is ok (or why it doesn't happen).

Conceptually a contpte range only has a single access and dirty bit. So when
folding, we or all the access bits and all the dirty bits from the constituent
ptes to determine the single access and dirty bits for the contpte mapping. And
when unfolding, we take the single access and dirty bit for the contpte mapping
and apply those values to every individual entry.

So yes, we ignore the access and dirty values for the subptes when evaluating
whether a contiguous range exists. I'll add a comment.

> 
>> +
>> +		if (!pte_present(subpte) ||
>> +		    pte_pfn(subpte) != pfn ||
>> +		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>> +			return;
>> +	}
>> +
>> +	contpte_fold(mm, addr, orig_ptep, pte, true);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so we can unfold unconditionally here.
>> +	 */
>> +
>> +	contpte_fold(mm, addr, ptep, pte, false);
>> +}
> 
> So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
> for the unfold case. Maybe it's clearer if we just have two separate
> functions (or document why the pte_mk*() are needed).

No that's not the case. In the unfold case, we need to "collect" the single
access and dirty bit from the contpte mapping (these may be in any of the
entries), and set the final values for all individual ptes during unfolding.

The obvious side effect here is that if any one page is dirty at fold time, the
whole range will be marked as dirty after folding, then at unfolding all pages
will be marked as dirty (same goes for access). This is the same concern that I
raised in the cover letter. I don't think this is a problem from the kernel's
point of view; the kernel will compress the per-page access/dirty info to
per-folio and we only fold if the whole range is covered by a single folio. But
user space could observe this "over-dirtying" through /proc/<pid>/pagemap. I'm
not sure if that's a problem in practice?

> 
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		/*
>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +		 * where some of the ptes in the range may be cleared but others
>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>> +		 */
>> +		if (pte_val(pte) == 0)
>> +			continue;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet.
>> +	 */
>> +
>> +	pte_t orig_pte;
>> +	pgprot_t orig_prot;
>> +	pte_t *ptep;
>> +	unsigned long pfn;
>> +	pte_t pte;
>> +	pgprot_t prot;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return orig_pte;
> 
> I haven't looked through all the patches, so not entirely sure when this
> function is called. 

ptep_get_lockless() is one of the mm optional arch interfaces. arm64 doesn't
currently implement it, because ptep_get() (READ_ONCE()) is safe without the
lock being held. But with the introduction of contpte mappings, there are cases
now where we have to read the whole contpte range to gather access and dirty,
which obviously isn't atomic. And doing that without the PTL is harder than when
we have the PTL, so I've implemented ptep_get_lockless() so we can assume the
PTL is held in ptep_get() and do the simple thing there (the common case).

> But since you mention that the range may be
> folded/unfolded, how do we deal with pte_cont() racing with something
> setting the contig bit?

ptep_get_lockless() is inherrently racy. My intedntion was that we just need to
ensure we read a pte or contpte range that is consistent with itself.

> 
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_present(pte) || !pte_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
> 
> It needs better documenting, I don't understand what the retry here is
> for (presumably to handle some races). Do we care about some memory
> ordering as well? __pte_get() only takes care of reading the ptep once.

The intention is that the loop keeps retrying until it scans a whole contpte
range that is consistent with itself (i.e. PTE_CONT bit is set in all, pfn
increments monotonically and pgprots are all the same). If any of those
considtions are not true, it indicates we are racing with an update and need to
retry until its consistent. I'd need to think a bit more on whether we need
anything special for memory ordering...

To be honest, I'm not a big fan of this function. As far as I can tell, the only
user of ptep_get_lockless() that cares about access/dirty is ptdump. Perhaps we
can re-spec this to not return access/dirty info (that would simplify it back to
a READ_ONCE()), then figure out a way to hold the PTL for ptdump and use
ptep_get() which will return the access/dirty info correctly. Do you think
something like that could work?

> 
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end = addr + (nr << PAGE_SHIFT);
>> +	unsigned long pfn = pte_pfn(pte);
>> +	pgprot_t prot = pte_pgprot(pte);
>> +	pte_t orig_pte;
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		/*
>> +		 * If operating on a partial contiguous range then we must first
>> +		 * unfold the contiguous range if it was previously folded.
>> +		 * Otherwise we could end up with overlapping tlb entries.
>> +		 */
>> +		if (nr != CONT_PTES)
>> +			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +		/*
>> +		 * If we are replacing ptes that were contiguous or if the new
>> +		 * ptes are contiguous and any of the ptes being replaced are
>> +		 * present, we need to clear and flush the range to prevent
>> +		 * overlapping tlb entries.
>> +		 */
>> +		orig_pte = __ptep_get(ptep);
>> +		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
>> +		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
>> +			ptep_clear_flush_range(mm, addr, ptep, nr);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int i;
>> +	int young = 0;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, its a noop.
>> +	 */
>> +	orig_pte = ptep_get(ptep);
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold/fold the
>> +	 * contig range. But if the write bit is changing, we need to go through
>> +	 * the full unfold/fold cycle.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
> 
> Depending on the architecture version, pte_write() either checks a
> software only bit or it checks the DBM one.
> 
>> +		/*
>> +		 * No need to flush here; This is always "more permissive" so we
>> +		 * can only be _adding_ the access or dirty bit. And since the
>> +		 * tlb can't cache an entry without the AF set and the dirty bit
>> +		 * is a SW bit, there can be no confusion. For HW access
>> +		 * management, we technically only need to update the flag on a
>> +		 * single pte in the range. But for SW access management, we
>> +		 * need to update all the ptes to prevent extra faults.
>> +		 */
> 
> On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
> perspective but clean) may be cached in the TLB and we do need flushing.

I don't follow; The Arm ARM says:

  IPNQBP When an Access flag fault is generated, the translation table entry
         causing the fault is not cached in a TLB.

So the entry can only be in the TLB if AF is already 1. And given the dirty bit
is SW, it shouldn't affect the TLB state. And this function promises to only
change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).

So I'm not sure what case you are describing here?

> 
>> +		ptep = contpte_align_down(ptep);
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +	} else {
>> +		/*
>> +		 * No need to flush in __ptep_set_access_flags() because we just
>> +		 * flushed the whole range in __contpte_try_unfold().
>> +		 */
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>> +	}
>> +
>> +	return 1;
>> +}
> 

Thanks,
Ryan

Ryan Roberts July 5, 2023, 1:13 p.m. UTC | #5

On 04/07/2023 12:09, Ryan Roberts wrote:
> On 03/07/2023 16:17, Catalin Marinas wrote:
>> Hi Ryan,
>>

...

>>> +
>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep,
>>> +					pte_t entry, int dirty)
>>> +{
>>> +	pte_t orig_pte;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>>> +	 * changed, its a noop.
>>> +	 */
>>> +	orig_pte = ptep_get(ptep);
>>> +	if (pte_val(orig_pte) == pte_val(entry))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * We can fix up access/dirty bits without having to unfold/fold the
>>> +	 * contig range. But if the write bit is changing, we need to go through
>>> +	 * the full unfold/fold cycle.
>>> +	 */
>>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>>
>> Depending on the architecture version, pte_write() either checks a
>> software only bit or it checks the DBM one.
>>
>>> +		/*
>>> +		 * No need to flush here; This is always "more permissive" so we
>>> +		 * can only be _adding_ the access or dirty bit. And since the
>>> +		 * tlb can't cache an entry without the AF set and the dirty bit
>>> +		 * is a SW bit, there can be no confusion. For HW access
>>> +		 * management, we technically only need to update the flag on a
>>> +		 * single pte in the range. But for SW access management, we
>>> +		 * need to update all the ptes to prevent extra faults.
>>> +		 */
>>
>> On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
>> perspective but clean) may be cached in the TLB and we do need flushing.
> 
> I don't follow; The Arm ARM says:
> 
>   IPNQBP When an Access flag fault is generated, the translation table entry
>          causing the fault is not cached in a TLB.
> 
> So the entry can only be in the TLB if AF is already 1. And given the dirty bit
> is SW, it shouldn't affect the TLB state. And this function promises to only
> change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).
> 
> So I'm not sure what case you are describing here?

Ahh sorry, I get your point now - on pre-DBM hardware, the HW sees a read-only
PTE when the kernel considers it clean and this can be in the TLB. Then when
making it dirty (from kernel's perspective), we are removing the read-only
protection from the HW perspective, so we need to flush the TLB entry.

> 
>>
>>> +		ptep = contpte_align_down(ptep);
>>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);

Fixed by adding this after iterating though the ptes, intent it to avoid the
per-page tlb flash and instead flush the whole range at the end:

		if (dirty)
			__flush_tlb_range(vma, start_addr, addr,
							PAGE_SIZE, true, 3);

>>> +	} else {
>>> +		/*
>>> +		 * No need to flush in __ptep_set_access_flags() because we just
>>> +		 * flushed the whole range in __contpte_try_unfold().
>>> +		 */
>>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);

I also think this is wrong; we must pass `dirty` as the last parameter so that
__ptep_set_access_flags will flush if neccessary. My comment about having just
done the flush is incorrect - we have just done a flush, but the ptes are still
valid with their old value so the HW could pull this into the TLB before we
modify the value.

>>> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>>> +	}
>>> +
>>> +	return 1;
>>> +}
>>
> 
> Thanks,
> Ryan
>

Catalin Marinas July 16, 2023, 3:09 p.m. UTC | #6

On Tue, Jul 04, 2023 at 12:09:31PM +0100, Ryan Roberts wrote:
> On 03/07/2023 16:17, Catalin Marinas wrote:
> > Hi Ryan,
> > 
> > Some comments below. I did not have time to trim down the quoted text,
> > so you may need to scroll through it.
> 
> Thanks for the review!
> 
> Looking at the comments, I think they all relate to implementation. Does that
> imply that you are happy with the shape/approach?

I can't really tell yet as there are a few dependencies and I haven't
applied them to look at the bigger picture. My preference would be to
handle the large folio breaking/making in the core code via APIs like
set_ptes() and eliminate the loop heuristics in the arm64
code to fold/unfold. Maybe it's not entirely possible I need to look at
the bigger picture with all the series applied (and on a bigger screen,
writing this reply on a laptop in flight).

> Talking with Anshuman yesterday, he suggested putting this behind a new Kconfig
> option that defaults to disabled and also adding a command line option to
> disable it when compiled in. I think that makes sense for now at least to reduce
> risk of performance regression?

I'm fine with a Kconfig option (maybe expert) but default enabled,
otherwise it won't get enough coverage. AFAICT, the biggest risk of
regression is the heuristics for folding/unfolding. In general the
overhead should be offset by the reduced TLB pressure but we may find
some pathological case where this gets in the way.

> > On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
> >> +		/*
> >> +		 * No need to flush here; This is always "more permissive" so we
> >> +		 * can only be _adding_ the access or dirty bit. And since the
> >> +		 * tlb can't cache an entry without the AF set and the dirty bit
> >> +		 * is a SW bit, there can be no confusion. For HW access
> >> +		 * management, we technically only need to update the flag on a
> >> +		 * single pte in the range. But for SW access management, we
> >> +		 * need to update all the ptes to prevent extra faults.
> >> +		 */
> > 
> > On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
> > perspective but clean) may be cached in the TLB and we do need flushing.
> 
> I don't follow; The Arm ARM says:
> 
>   IPNQBP When an Access flag fault is generated, the translation table entry
>          causing the fault is not cached in a TLB.
> 
> So the entry can only be in the TLB if AF is already 1. And given the dirty bit
> is SW, it shouldn't affect the TLB state. And this function promises to only
> change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).
> 
> So I'm not sure what case you are describing here?

The comment for this function states that it sets the access/dirty flags
as well as the write permission. Prior to DBM, the page is marked
PTE_RDONLY and we take a fault. This function marks the page dirty by
setting the software PTE_DIRTY bit (no need to worry) but also clearing
PTE_RDONLY so that a subsequent access won't fault again. We do need the
TLBI here since PTE_RDONLY is allowed to be cached in the TLB.

Sorry, I did not reply to your other comments (we can talk in person in
about a week time). I also noticed you figured the above but I had
written it already.

[v1,11/14] arm64/mm: Wire up PTE_CONT for user mappings

Commit Message

Comments

Patch