Message ID | 20200223192520.20808-4-aarcange@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote: > With multiple NUMA nodes and multiple sockets, the tlbi broadcast > shall be delivered through the interconnects in turn increasing the > CPU interconnect traffic and the latency of the tlbi broadcast > instruction. To avoid the synchronous delivery of the tlbi broadcast > before the tlbi instruction can be retired, the hardware would need to > implement a replicated mm_cpumask bitflag for each ASID and every CPU > would need to tell every other CPU which ASID is being loaded. Exactly > what x86 does with mm_cpumask in software. > > Even within a single NUMA node the latency of the tlbi broadcast > instruction increases almost linearly with the number of CPUs trying > to send tlbi broadcasts at the same time. > > If a single thread of the process is running and it's also running in > the CPU issuing the TLB flush, or if no thread of the process are > running, we can achieve full SMP scalability in the arm64 TLB flushng > by skipping the tlbi broadcasting. > > After the local TLB flush this means the ASID context goes out of sync > in all CPUs except the local one. This can be tracked on the per-mm > cpumask: if the bit is set it means the ASID context is stale for that > CPU. This results in an extra local ASID TLB flush only when threads > are running in new CPUs after a TLB flush. > > Skipping the tlbi instruction broadcasting is already implemented in > local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(), > flush_tlb_range() and flush_tlb_page() too. > > The below benchmarks are measured on a non-NUMA 32 CPUs system (ARMv8 > Ampere), so it should be far from a worst case scenario: the > enterprise kernel config allows multiple NUMA nodes with NR_CPUS set > by default to 4096. > > === stock === > > # cat for-each-cpu.sh > #!/bin/bash > > for i in $(seq `nproc`); do > "$@" &>/dev/null & > done > wait > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000 > [..] > 2.1696 +- 0.0122 seconds time elapsed ( +- 0.56% ) > > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest > [..] > 0.99018 +- 0.00360 seconds time elapsed ( +- 0.36% ) > > # cat sort-compute > #!/bin/bash > > for x in `seq 256`; do > for i in `seq 32`; do /usr/bin/sort </usr/bin/sort >/dev/null; done & > done > wait > # perf stat -r 10 -e dummy ./sort-compute > [..] > 1.8094 +- 0.0139 seconds time elapsed ( +- 0.77% ) > > === patch applied === > > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000 > [..] > 0.13941 +- 0.00449 seconds time elapsed ( +- 3.22% ) > > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest > [..] > 0.90510 +- 0.00262 seconds time elapsed ( +- 0.29% ) > > # perf stat -r 10 -e dummy ./sort-compute > [..] 
> 1.64025 +- 0.00618 seconds time elapsed ( +- 0.38% ) > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > --- > arch/arm64/include/asm/efi.h | 2 +- > arch/arm64/include/asm/mmu.h | 4 +- > arch/arm64/include/asm/mmu_context.h | 33 ++++++++-- > arch/arm64/include/asm/tlbflush.h | 95 +++++++++++++++++++++++++++- > arch/arm64/mm/context.c | 54 ++++++++++++++++ > 5 files changed, 177 insertions(+), 11 deletions(-) > > diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h > index 44531a69d32b..5d9a1433d918 100644 > --- a/arch/arm64/include/asm/efi.h > +++ b/arch/arm64/include/asm/efi.h > @@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt) > > static inline void efi_set_pgd(struct mm_struct *mm) > { > - __switch_mm(mm); > + __switch_mm(mm, smp_processor_id()); > > if (system_uses_ttbr0_pan()) { > if (mm != current->active_mm) { > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index e4d862420bb4..9072fd7bc5f8 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -20,6 +20,7 @@ typedef struct { > atomic64_t id; > void *vdso; > unsigned long flags; > + atomic_t nr_active_mm; > } mm_context_t; > > /* > @@ -27,7 +28,8 @@ typedef struct { > * ASID change and therefore doesn't need to reload the counter using > * atomic64_read. > */ > -#define ASID(mm) ((mm)->context.id.counter & 0xffff) > +#define __ASID(asid) ((asid) & 0xffff) > +#define ASID(mm) __ASID((mm)->context.id.counter) > > extern bool arm64_use_ng_mappings; > > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 3827ff4040a3..9c66fe317e2f 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -175,7 +175,10 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) > #define destroy_context(mm) do { } while(0) > void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); > > -#define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) > +#define init_new_context(tsk,mm) \ > + ({ atomic64_set(&(mm)->context.id, 0); \ > + atomic_set(&(mm)->context.nr_active_mm, 0); \ > + 0; }) > > #ifdef CONFIG_ARM64_SW_TTBR0_PAN > static inline void update_saved_ttbr0(struct task_struct *tsk, > @@ -203,6 +206,15 @@ static inline void update_saved_ttbr0(struct task_struct *tsk, > static inline void > enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > { > + unsigned int cpu = smp_processor_id(); > + if (per_cpu(cpu_not_lazy_tlb, cpu) && > + is_idle_task(tsk)) { > + per_cpu(cpu_not_lazy_tlb, cpu) = false; > + if (!system_uses_ttbr0_pan()) > + cpu_set_reserved_ttbr0(); > + atomic_dec(&mm->context.nr_active_mm); > + } > + VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0); > /* > * We don't actually care about the ttbr0 mapping, so point it at the > * zero page. > @@ -210,10 +222,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > update_saved_ttbr0(tsk, &init_mm); > } > > -static inline void __switch_mm(struct mm_struct *next) > +static inline void __switch_mm(struct mm_struct *next, unsigned int cpu) > { > - unsigned int cpu = smp_processor_id(); > - > /* > * init_mm.pgd does not contain any user mappings and it is always > * active for kernel addresses in TTBR1. Just set the reserved TTBR0. 
> @@ -230,8 +240,19 @@ static inline void > switch_mm(struct mm_struct *prev, struct mm_struct *next, > struct task_struct *tsk) > { > - if (prev != next) > - __switch_mm(next); > + unsigned int cpu = smp_processor_id(); > + > + if (!per_cpu(cpu_not_lazy_tlb, cpu)) { > + per_cpu(cpu_not_lazy_tlb, cpu) = true; > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + } else if (prev != next) { > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + atomic_dec(&prev->context.nr_active_mm); > + } > + VM_WARN_ON(!atomic_read(&next->context.nr_active_mm)); > + VM_WARN_ON(atomic_read(&prev->context.nr_active_mm) < 0); > > /* > * Update the saved TTBR0_EL1 of the scheduled-in task as the previous > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index bc3949064725..0bd987ff9cbd 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void) > isb(); > } > > +static inline void local_flush_tlb_asid(unsigned long asid) > +{ > + asid = __TLBI_VADDR(0, __ASID(asid)); > + dsb(nshst); > + __tlbi(aside1, asid); > + __tlbi_user(aside1, asid); > + dsb(nsh); > +} > + > static inline void flush_tlb_all(void) > { > dsb(ishst); > @@ -144,9 +153,38 @@ static inline void flush_tlb_all(void) > isb(); > } > > +DECLARE_PER_CPU(bool, cpu_not_lazy_tlb); > + > +enum tlb_flush_types { > + TLB_FLUSH_NO, > + TLB_FLUSH_LOCAL, > + TLB_FLUSH_BROADCAST, > +}; > +extern enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, > + unsigned int cpu); > + > static inline void flush_tlb_mm(struct mm_struct *mm) > { > unsigned long asid = __TLBI_VADDR(0, ASID(mm)); > + enum tlb_flush_types flush; > + > + flush = tlb_flush_check(mm, get_cpu()); > + switch (flush) { > + case TLB_FLUSH_LOCAL: > + > + dsb(nshst); > + __tlbi(aside1, asid); > + __tlbi_user(aside1, asid); > + dsb(nsh); > + > + /* fall through */ > + case TLB_FLUSH_NO: > + put_cpu(); > + return; > + case TLB_FLUSH_BROADCAST: > + break; > + } > + put_cpu(); > > dsb(ishst); > __tlbi(aside1is, asid); > @@ -167,7 +205,31 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > static inline void flush_tlb_page(struct vm_area_struct *vma, > unsigned long uaddr) > { > - flush_tlb_page_nosync(vma, uaddr); > + struct mm_struct *mm = vma->vm_mm; > + unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm)); > + enum tlb_flush_types flush; > + > + flush = tlb_flush_check(mm, get_cpu()); > + switch (flush) { > + case TLB_FLUSH_LOCAL: > + > + dsb(nshst); > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > + dsb(nsh); > + > + /* fall through */ > + case TLB_FLUSH_NO: > + put_cpu(); > + return; > + case TLB_FLUSH_BROADCAST: > + break; > + } > + put_cpu(); > + > + dsb(ishst); > + __tlbi(vale1is, addr); > + __tlbi_user(vale1is, addr); > dsb(ish); > } > > @@ -181,14 +243,16 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > unsigned long start, unsigned long end, > unsigned long stride, bool last_level) > { > - unsigned long asid = ASID(vma->vm_mm); > + struct mm_struct *mm = vma->vm_mm; > + unsigned long asid = ASID(mm); > unsigned long addr; > + enum tlb_flush_types flush; > > start = round_down(start, stride); > end = round_up(end, stride); > > if ((end - start) >= (MAX_TLBI_OPS * stride)) { > - flush_tlb_mm(vma->vm_mm); > + flush_tlb_mm(mm); > return; > } > > @@ -198,6 +262,31 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > start = __TLBI_VADDR(start, asid); 
> end = __TLBI_VADDR(end, asid); > > + flush = tlb_flush_check(mm, get_cpu()); > + switch (flush) { > + case TLB_FLUSH_LOCAL: > + > + dsb(nshst); > + for (addr = start; addr < end; addr += stride) { > + if (last_level) { > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > + } else { > + __tlbi(vae1, addr); > + __tlbi_user(vae1, addr); > + } > + } > + dsb(nsh); > + > + /* fall through */ > + case TLB_FLUSH_NO: > + put_cpu(); > + return; > + case TLB_FLUSH_BROADCAST: > + break; > + } > + put_cpu(); > + > dsb(ishst); > for (addr = start; addr < end; addr += stride) { > if (last_level) { > diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c > index 8ef73e89d514..3152b7f7da12 100644 > --- a/arch/arm64/mm/context.c > +++ b/arch/arm64/mm/context.c > @@ -25,6 +25,7 @@ static unsigned long *asid_map; > static DEFINE_PER_CPU(atomic64_t, active_asids); > static DEFINE_PER_CPU(u64, reserved_asids); > static cpumask_t tlb_flush_pending; > +DEFINE_PER_CPU(bool, cpu_not_lazy_tlb); > > #define ASID_MASK (~GENMASK(asid_bits - 1, 0)) > #define ASID_FIRST_VERSION (1UL << asid_bits) > @@ -191,6 +192,12 @@ static u64 new_context(struct mm_struct *mm) > set_asid: > __set_bit(asid, asid_map); > cur_idx = asid; > + /* > + * check_and_switch_context() will change the ASID of this mm > + * so no need of extra ASID local TLB flushes: the new ASID > + * isn't stale anymore after the tlb_flush_pending was set. > + */ > + cpumask_clear(mm_cpumask(mm)); > return idx2asid(asid) | generation; > } > > @@ -240,6 +247,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); > > switch_mm_fastpath: > + /* > + * Enforce CPU ordering between the atomic_inc(nr_active_mm) > + * in switch_mm() and the below cpumask_test_cpu(mm_cpumask). > + */ > + smp_mb(); > + if (cpumask_test_cpu(cpu, mm_cpumask(mm))) { > + cpumask_clear_cpu(cpu, mm_cpumask(mm)); > + local_flush_tlb_asid(asid); > + } > > arm64_apply_bp_hardening(); > > @@ -251,6 +267,44 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > cpu_switch_mm(mm->pgd, mm); > } > > +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu) > +{ > + if (atomic_read(&mm->context.nr_active_mm) <= 1) { > + bool is_local = current->active_mm == mm && > + per_cpu(cpu_not_lazy_tlb, cpu); > + cpumask_t *stale_cpumask = mm_cpumask(mm); > + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask); > + bool local_is_clear = false; > + if (next_zero < nr_cpu_ids && > + (is_local && next_zero == cpu)) { > + next_zero = cpumask_next_zero(next_zero, stale_cpumask); > + local_is_clear = true; > + } > + if (next_zero < nr_cpu_ids) { > + cpumask_setall(stale_cpumask); > + local_is_clear = false; > + } > + > + /* > + * Enforce CPU ordering between the above > + * cpumask_setall(mm_cpumask) and the below > + * atomic_read(nr_active_mm). > + */ > + smp_mb(); > + > + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) { > + if (is_local) { > + if (!local_is_clear) > + cpumask_clear_cpu(cpu, stale_cpumask); > + return TLB_FLUSH_LOCAL; > + } > + if (atomic_read(&mm->context.nr_active_mm) == 0) > + return TLB_FLUSH_NO; > + } > + } > + return TLB_FLUSH_BROADCAST; > +} > + > /* Errata workaround post TTBRx_EL1 update. */ > asmlinkage void post_ttbr_update_workaround(void) > { May I suggest the following (cosmetic) changes to this patch? 
I'm testing these changes against RHEL integration + regression tests, and I'll also run them against a specially crafted test to measure the impact on task-switching, if any. (I'll report back, soon) Cheers! Rafael -- diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index bb73f02bef28..14eceba302bc 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -182,25 +182,21 @@ static inline void flush_tlb_mm(struct mm_struct *mm) flush = tlb_flush_check(mm, get_cpu()); switch (flush) { case TLB_FLUSH_LOCAL: - dsb(nshst); __tlbi(aside1, asid); __tlbi_user(aside1, asid); dsb(nsh); - /* fall through */ case TLB_FLUSH_NO: - put_cpu(); - return; + break; case TLB_FLUSH_BROADCAST: + dsb(ishst); + __tlbi(aside1is, asid); + __tlbi_user(aside1is, asid); + dsb(ish); break; } put_cpu(); - - dsb(ishst); - __tlbi(aside1is, asid); - __tlbi_user(aside1is, asid); - dsb(ish); } static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, @@ -223,25 +219,21 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, flush = tlb_flush_check(mm, get_cpu()); switch (flush) { case TLB_FLUSH_LOCAL: - dsb(nshst); __tlbi(vale1, addr); __tlbi_user(vale1, addr); dsb(nsh); - /* fall through */ case TLB_FLUSH_NO: - put_cpu(); - return; + break; case TLB_FLUSH_BROADCAST: + dsb(ishst); + __tlbi(vale1is, addr); + __tlbi_user(vale1is, addr); + dsb(ish); break; } put_cpu(); - - dsb(ishst); - __tlbi(vale1is, addr); - __tlbi_user(vale1is, addr); - dsb(ish); } /* @@ -276,7 +268,6 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, flush = tlb_flush_check(mm, get_cpu()); switch (flush) { case TLB_FLUSH_LOCAL: - dsb(nshst); for (addr = start; addr < end; addr += stride) { if (last_level) { @@ -288,27 +279,24 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, } } dsb(nsh); - /* fall through */ case TLB_FLUSH_NO: - put_cpu(); - return; + break; case TLB_FLUSH_BROADCAST: + dsb(ishst); + for (addr = start; addr < end; addr += stride) { + if (last_level) { + __tlbi(vale1is, addr); + __tlbi_user(vale1is, addr); + } else { + __tlbi(vae1is, addr); + __tlbi_user(vae1is, addr); + } + } + dsb(ish); break; } put_cpu(); - - dsb(ishst); - for (addr = start; addr < end; addr += stride) { - if (last_level) { - __tlbi(vale1is, addr); - __tlbi_user(vale1is, addr); - } else { - __tlbi(vae1is, addr); - __tlbi_user(vae1is, addr); - } - } - dsb(ish); } static inline void flush_tlb_range(struct vm_area_struct *vma,
On Mon, Mar 02, 2020 at 10:24:51AM -0500, Rafael Aquini wrote: [...] > I'm testing these changes against RHEL integration + regression > tests, and I'll also run them against a specially crafted test > to measure the impact on task-switching, if any. (I'll report back, > soon) > As promised, I ran the the patches against a full round of kernel integration/regression tests (the same we use to run for RHEL kernels) unfortunately, there is no easy way to share these internal results, but apart from a couple warnings -- which were not related to the test build -- everything went on very smoothly with the patches on top of a RHEL-8 test-kernel. I grabbed some perf numbers, to serve as kernel build benchmark. Test system is a 1 socket 32 core 3.3Ghz ARMv8 Ampere eMAG 256GB RAM. rpmbuild spawns the build with make -j32 and besides the stock kernel RPM, it also builds the -debug flavor RPM and all the sub-RPMs for tools and extra modules. * stock RHEL-8 build: Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm': 27,817,487.14 msec task-clock # 15.015 CPUs utilized 1,318,586 context-switches # 0.047 K/sec 515,342 cpu-migrations # 0.019 K/sec 68,513,346 page-faults # 0.002 M/sec 91,713,983,302,759 cycles # 3.297 GHz 49,871,167,452,081 instructions # 0.54 insn per cycle 23,801,939,026,338 cache-references # 855.647 M/sec 568,847,557,178 cache-misses # 2.390 % of all cache refs 145,305,286,469 dTLB-loads # 5.224 M/sec 752,451,698 dTLB-load-misses # 0.52% of all dTLB cache hits 1852.656905157 seconds time elapsed 26866.849105000 seconds user 965.756120000 seconds sys * RHEL8 kernel + Andrea's patches: Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm': 27,713,883.25 msec task-clock # 15.137 CPUs utilized 1,310,196 context-switches # 0.047 K/sec 511,909 cpu-migrations # 0.018 K/sec 68,535,178 page-faults # 0.002 M/sec 91,412,320,904,990 cycles # 3.298 GHz 49,844,016,063,738 instructions # 0.55 insn per cycle 23,795,774,331,203 cache-references # 858.623 M/sec 568,445,037,308 cache-misses # 2.389 % of all cache refs 135,868,301,669 dTLB-loads # 4.903 M/sec 746,267,581 dTLB-load-misses # 0.55% of all dTLB cache hits 1830.813507976 seconds time elapsed 26785.529337000 seconds user 943.251641000 seconds sys I also wanted to measure the impact of the increased amount of code in the task switching path (in order to decide which TLB invalidation scheme to pick), to figure out what would be the worst case scenario for single threads of execution constrained into one core and yelding the CPU to each other. 
I wrote the small test (attached) and grabbed some numbers in the same box, for the sake of comparison: * stock RHEL-8 build: Performance counter stats for './tlb-test' (1000 runs): 122.67 msec task-clock # 0.997 CPUs utilized ( +- 0.04% ) 32,297 context-switches # 0.263 M/sec ( +- 0.00% ) 0 cpu-migrations # 0.000 K/sec 325 page-faults # 0.003 M/sec ( +- 0.01% ) 404,648,928 cycles # 3.299 GHz ( +- 0.04% ) 239,856,265 instructions # 0.59 insn per cycle ( +- 0.00% ) 121,189,080 cache-references # 987.964 M/sec ( +- 0.00% ) 3,414,396 cache-misses # 2.817 % of all cache refs ( +- 0.05% ) 2,820,754 dTLB-loads # 22.996 M/sec ( +- 0.04% ) 34,545 dTLB-load-misses # 1.22% of all dTLB cache hits ( +- 6.16% ) 0.1230361 +- 0.0000435 seconds time elapsed ( +- 0.04% ) * RHEL8 kernel + Andrea's patches: Performance counter stats for './tlb-test' (1000 runs): 125.57 msec task-clock # 0.997 CPUs utilized ( +- 0.05% ) 32,244 context-switches # 0.257 M/sec ( +- 0.01% ) 0 cpu-migrations # 0.000 K/sec 325 page-faults # 0.003 M/sec ( +- 0.01% ) 413,692,492 cycles # 3.295 GHz ( +- 0.04% ) 241,017,764 instructions # 0.58 insn per cycle ( +- 0.00% ) 121,155,050 cache-references # 964.856 M/sec ( +- 0.00% ) 3,512,035 cache-misses # 2.899 % of all cache refs ( +- 0.04% ) 2,912,475 dTLB-loads # 23.194 M/sec ( +- 0.02% ) 45,340 dTLB-load-misses # 1.56% of all dTLB cache hits ( +- 5.07% ) 0.1259462 +- 0.0000634 seconds time elapsed ( +- 0.05% ) AFAICS, the aforementioned benchmark numbers are suggesting that there is, virtually, no considerable impact, or very minimal and no detrimental impact to ordinary workloads imposed by the changes, and Andrea's benchmarks are showing/suggesting that a broad range of particular workloads will considerably benefit from the changes. With the numbers above, added to what I've seen in our (internal) integration tests, I'm confident on the stability of the changes. -- Rafael // SPDX-License-Identifier: BSD-2-Clause /* * Copyright (c) 2020, Rafael Aquini <aquini@redhat.com> * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
* * compile with: gcc -o tlb-test -D_GNU_SOURCE -lpthread tlb-test.c * dependencies: * - _GNU_SOURCE required for asprintf(3), sched_getcpu(3) && sched_setaffinity(2) * - libpthreads required for POSIX semaphores */ #include <stdio.h> #include <stdarg.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <fcntl.h> #include <errno.h> #include <sched.h> #include <time.h> #include <semaphore.h> #include <sys/types.h> #include <sys/times.h> #include <sys/time.h> #include <sys/stat.h> #include <sys/wait.h> #include <sys/mman.h> #ifndef NCHILDS #define NCHILDS 4 #endif #ifndef NPAGES #define NPAGES 32 #endif #ifndef NRUNS #define NRUNS 8192 #endif #ifdef DEBUG #define DPRINTF(...) fprintf(stderr, __VA_ARGS__) #else #define DPRINTF(...) #endif #define ERROR_EXIT(msg) \ do { \ char *estr = NULL; \ asprintf(&estr, "[%s:%d] %s", __FILE__, __LINE__, msg); \ perror(estr); \ exit(EXIT_FAILURE); \ } while (0) static const char *prg_name = "tlb-test"; static long system_hz; static long page_size; static sem_t *sem; /* * Fisher-Yates shuffler algorithm [Statistical Tables (London, 1938), Ex.12], * adapted to computer language by R. Durstenfeld [CACM 7 (1964), 420], and * presented by Donald E. Knuth at: * "The Art of Computer Programming, Volume 2: Seminumerical Algorithms" * [Algorithm P (shuffling) under Section 3.4 OTHER TYPES OF RANDOM QUANTITIES] */ void fy_shuffler(unsigned long *buf, unsigned long len) { unsigned long j, u, tmp; for (j = len - 1; j > 0; j--) { u = rand() % j; tmp = *(buf + u); *(buf + u) = *(buf + j); *(buf + j) = tmp; } } unsigned long usec_diff(struct timeval *a, struct timeval *b) { unsigned long usec; usec = (b->tv_sec - a->tv_sec) * 1000000; usec += b->tv_usec - a->tv_usec; return usec; } unsigned long workload(void *addr, size_t len, unsigned long *fault_order, int child) { struct timeval start, end; unsigned long i; gettimeofday(&start, NULL); for (i = 0; i < len; i++) { unsigned long p = *(fault_order + i); *((unsigned char *)(addr + (p * page_size))) = ((i * p) % 0xff); } gettimeofday(&end, NULL); DPRINTF("[%s: child-%d (CPU=%d PID=%ld)] RUNNING! \n", prg_name, child, sched_getcpu(), (long) getpid()); return usec_diff(&start, &end); } int child(int n, FILE *stream) { unsigned long pages[NPAGES]; size_t map_sz; int i, runs; void *addr; double elapsed = 0; for (i = 0; i < NPAGES; i++) pages[i] = i; map_sz = page_size * NPAGES; addr = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) ERROR_EXIT("mmap"); if (madvise(addr, map_sz, MADV_NOHUGEPAGE) == -1) ERROR_EXIT("madvise"); srand(time(NULL)); for (runs = 0; runs < NRUNS; runs++) { sem_wait(sem); elapsed += workload(addr, NPAGES, pages, n); fy_shuffler(pages, NPAGES); sem_post(sem); /* * relinquish the CPU to provide a small backoff, so other tasks * get a fair chance on aquiring the semaphore. 
*/ sched_yield(); } fprintf(stream, "[%s: child-%d (CPU=%d PID=%ld)] %lf msecs\n", prg_name, n, sched_getcpu(), (long) getpid(), (double )(elapsed / 1000)); return 0; } int main(int argc, char *argv[]) { pid_t pid[NCHILDS]; int i, ret, status; cpu_set_t set; CPU_ZERO(&set); /* clear the set */ CPU_SET(1, &set); if (sched_setaffinity(0, sizeof(cpu_set_t), &set) == -1) ERROR_EXIT("sched_setaffinity"); if ((system_hz = sysconf(_SC_CLK_TCK)) == -1) ERROR_EXIT("sysconf"); if ((page_size = sysconf(_SC_PAGESIZE)) == -1) ERROR_EXIT("sysconf"); sem = sem_open(prg_name, O_CREAT, S_IRUSR | S_IWUSR, 0); if (sem == SEM_FAILED) ERROR_EXIT("sem_open"); for (i = 0; i < NCHILDS; i++) { pid[i] = fork(); switch (pid[i]) { case -1: /* fork() has failed */ ERROR_EXIT("fork"); break; case 0: /* child of a sucessful fork() */ ret = child(i+1, stdout); exit(ret); break; } } sem_post(sem); for (;;) { if (wait(&status) == -1) { if (errno == ECHILD) { goto out; } else { ERROR_EXIT("wait"); } } } out: sem_close(sem); sem_unlink(prg_name); exit(EXIT_SUCCESS); }
Hi Andrea, On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote: > switch_mm(struct mm_struct *prev, struct mm_struct *next, > struct task_struct *tsk) > { > - if (prev != next) > - __switch_mm(next); > + unsigned int cpu = smp_processor_id(); > + > + if (!per_cpu(cpu_not_lazy_tlb, cpu)) { > + per_cpu(cpu_not_lazy_tlb, cpu) = true; > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + } else if (prev != next) { > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + atomic_dec(&prev->context.nr_active_mm); > + } IIUC, nr_active_mm keeps track of how many instances of the current pgd (TTBR0_EL1) are active. > +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu) > +{ > + if (atomic_read(&mm->context.nr_active_mm) <= 1) { > + bool is_local = current->active_mm == mm && > + per_cpu(cpu_not_lazy_tlb, cpu); > + cpumask_t *stale_cpumask = mm_cpumask(mm); > + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask); > + bool local_is_clear = false; > + if (next_zero < nr_cpu_ids && > + (is_local && next_zero == cpu)) { > + next_zero = cpumask_next_zero(next_zero, stale_cpumask); > + local_is_clear = true; > + } > + if (next_zero < nr_cpu_ids) { > + cpumask_setall(stale_cpumask); > + local_is_clear = false; > + } > + > + /* > + * Enforce CPU ordering between the above > + * cpumask_setall(mm_cpumask) and the below > + * atomic_read(nr_active_mm). > + */ > + smp_mb(); > + > + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) { > + if (is_local) { > + if (!local_is_clear) > + cpumask_clear_cpu(cpu, stale_cpumask); > + return TLB_FLUSH_LOCAL; > + } > + if (atomic_read(&mm->context.nr_active_mm) == 0) > + return TLB_FLUSH_NO; > + } > + } > + return TLB_FLUSH_BROADCAST; And this code here can assume that if nr_active_mm <= 1, no broadcast is necessary. One concern I have is the ordering between TTBR0_EL1 update in cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We only have an ISB for context synchronisation on that CPU but I don't think the architecture guarantees any relation between sysreg access and the memory update. We have a DSB but that's further down in switch_to(). However, what worries me more is that you can now potentially do a TLB shootdown without clearing the intermediate (e.g. VA to pte) walk caches from the TLB. Even if the corresponding pgd and ASID are no longer active on other CPUs, I'm not sure it's entirely safe to free (and re-allocate) pages belonging to a pgtable without first flushing the TLB. All the architecture spec states is that the software must first clear the entry followed by TLBI (the break-before-make rules). That said, the benchmark numbers are not very encouraging. Around 1% improvement in a single run, it can as well be noise. Also something like hackbench may also show a slight impact on the context switch path. Maybe with a true NUMA machine with hundreds of CPUs we may see a difference, but it depends on how well the TLBI is implemented. Thanks.
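For readers unfamiliar with the break-before-make rules Catalin refers to above, here is a hedged sketch of the sequence the architecture requires when changing a live, valid translation table entry, written with the arm64 kernel's barrier/TLBI helpers. The function name bbm_replace_pte() is made up, and the bare WRITE_ONCE() stores and the choice of a by-VA+ASID invalidation are illustrative assumptions, not code from this patch or from the kernel's own page-table helpers:

/*
 * Illustrative only: the Break-Before-Make sequence for replacing a
 * live valid PTE. bbm_replace_pte() is a made-up name; real kernel
 * code goes through pte_clear()/flush_tlb_page()/set_pte_at() style
 * helpers rather than open-coding this.
 */
static void bbm_replace_pte(pte_t *ptep, pte_t new_pte,
			    unsigned long addr, unsigned long asid)
{
	/* 1. Break: replace the old valid entry with an invalid one. */
	WRITE_ONCE(*ptep, __pte(0));

	/* 2. Make the invalid entry visible to the table walkers. */
	dsb(ishst);

	/* 3. Invalidate cached copies of the old entry, by VA+ASID,
	 *    on all CPUs in the inner shareable domain. */
	__tlbi(vae1is, __TLBI_VADDR(addr, asid));

	/* 4. Wait for the invalidation to complete everywhere. */
	dsb(ish);

	/* 5. Make: only now install the new entry. */
	WRITE_ONCE(*ptep, new_pte);
	dsb(ishst);
}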
Hi Catalin,

On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote:
> IIUC, nr_active_mm keeps track of how many instances of the current pgd
> (TTBR0_EL1) are active.

Correct.

> And this code here can assume that if nr_active_mm <= 1, no broadcast is
> necessary.

Yes.

> One concern I have is the ordering between TTBR0_EL1 update in
> cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We
> only have an ISB for context synchronisation on that CPU but I don't
> think the architecture guarantees any relation between sysreg access and
> the memory update. We have a DSB but that's further down in switch_to().

There are several cpu_do_switch_mm updating TTBR0_EL1 and nr_active_mm
updates that can happen on different CPUs simultaneously. It's hard to
see exactly which one you refer to.

Overall the idea here is that even if a speculative tlb lookup happens
in between those updates while the "mm" is going away and
atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't
matter because no userland software can use such stale tlb anymore
until local_flush_tlb_asid() gets rid of it.

The v1 patch (before I posted the incremental mm_count check) had issues
with speculatively loaded stale tlb entries only because they weren't
guaranteed to be flushed when the kernel thread switched back to a
userland process. So it relied on the CPU not speculatively loading
random pagetables while the kernel thread was running in lazy tlb mode,
but here the flush is guaranteed and in turn the CPU can always load any
TLB entry it wants at any given time.

> However, what worries me more is that you can now potentially do a TLB
> shootdown without clearing the intermediate (e.g. VA to pte) walk caches
> from the TLB. Even if the corresponding pgd and ASID are no longer
> active on other CPUs, I'm not sure it's entirely safe to free (and
> re-allocate) pages belonging to a pgtable without first flushing the

With regard to not doing a tlbi broadcast, nothing fundamentally changed
between v1 (single threaded, using mm_users) and the latest version
(multithreaded, introducing nr_active_mm). The v1 only skipped the tlbi
broadcast in remote CPUs that had run the asid of a single threaded
process before a CPU migration, but the pages could already be
reallocated from the point of view of the remote CPUs. In the current
version the page can be reallocated even from the point of view of the
current CPU. However, the fact the window has been enlarged
significantly should be a good thing: if something had been wrong with
it, it would have been far easier to reproduce.

This concern is still a corollary of the previous paragraph: it's
still about stale tlb entries being left in an asid that can't ever be
used through the current asid.

> TLB. All the architecture spec states is that the software must first
> clear the entry followed by TLBI (the break-before-make rules).

The "break" in "break-before-make" is still guaranteed or it wouldn't
boot: however it's not implemented with the tlbi broadcast anymore.

The break is implemented by enforcing no stale tlb entry of such asid
exists in the TLB while any userland code runs.

X86 specs supposed an OS would allocate a TSS per-process and you
would do a context switch by using a task gate. I recall the first
Linux version I used had a TSS per process as envisioned by the
specs. Later the TSS become per-CPU and the esp0 pointer was updated
instead (native_load_sp0) and the stack was switched by hand.
I guess reading the specs may result confusing after such a software
change, that doesn't mean the software shouldn't optimize things
behind the specs if it's safe to do and it's not explicitly forbidden.

The page being reused for another virtual address on another CPU isn't
necessarily an invalid scenario from the point of view of the CPU. It
only looks invalid if you assume the page is freed. If you instead think
of it like a MAP_SHARED page that gets one more mapping associated with
it (the reuse of the page) from another CPU, it may look more
legitimate. The fact there's an old mapping left in the MAP_SHARED
pagecache indefinitely doesn't mean the CPU with such an old mapping
left in the TLB is allowed to change the content of the page, as long as
the software never writes to that virtual address through the old
mapping. The important thing is that the content of the page must not
change unless the software running on the CPU explicitly writes through
the virtual address that corresponds to the stale TLB entry (and it's
guaranteed the software won't write to it). The stale TLB of such an
asid is eventually flushed either through a bump of the asid generation
or through a local asid flush.

> That said, the benchmark numbers are not very encouraging. Around 1%
> improvement in a single run, it can as well be noise. Also something
> like hackbench may also show a slight impact on the context switch path.

I recall I tested hackbench and it appeared faster with processes;
otherwise it was within measurement error. hackbench with processes is
fork heavy, so it gets some benefit because all the post-fork
copy-on-write faults would otherwise trigger tlbi broadcasts on all CPUs
to flush the wrprotected tlb entry. Specifically, the flush converted to
a local tlb flush is ptep_clear_flush_notify in wp_page_copy(), and
there's one for each page being modified by parent or child.

> Maybe with a true NUMA machine with hundreds of CPUs we may see a
> difference, but it depends on how well the TLBI is implemented.

The numbers in the commit header were not from a single run. perf stat
-r 10 -e dummy runs the benchmark 10 times and then shows the stdev
along with the number, so you can see what the noise was. It wasn't only
a 1% improvement either, and overall there's no noise in the
measurement.

tcmalloc_large_heap_fragmentation_unittest, which simulates dealing with
small objects in many different containers at the same time, was 9.4%
faster (%stdev 0.36% 0.29%), with 32 CPUs and no NUMA. The 256-way
parallel run of 32 `sort` invocations in a row was 10.3% faster (%stdev
0.77% 0.38%), with 32 CPUs and no NUMA. The multithreaded microbenchmark
runs 16 times faster, but that's not meaningful by itself; it's still a
good hint that real life workloads (especially those with frequent
MADV_DONTNEED) will run faster, and they did (and a verification that
now also multithreaded apps can be optimized).

Rafael already posted a benchmark specifically stressing the context
switch. It's reasonable to expect any multi-socket NUMA system to show
more benefit from the optimization than the 32 CPU non-NUMA box used
for the current benchmarks.

Thanks,
Andrea
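To make the fork-heavy copy-on-write pattern described above concrete, here is a minimal userspace sketch; it is not hackbench itself and not part of the patch, and the NPAGES/NCHILD values are arbitrary. Each post-fork write in a child faults on a write-protected PTE and goes through wp_page_copy(), whose ptep_clear_flush_notify() is the flush that this series turns into a local (non-broadcast) invalidation when only one CPU is running the mm:

/* Minimal sketch of a fork-heavy COW workload (not part of the patch;
 * NPAGES and NCHILD are arbitrary). Each write below hits a
 * write-protected PTE inherited from the parent and triggers one
 * copy-on-write fault per page. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NPAGES	4096
#define NCHILD	8

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, NPAGES * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xaa, NPAGES * page);	/* populate in the parent */

	for (int i = 0; i < NCHILD; i++) {
		if (fork() == 0) {
			/* COW faults: one wp_page_copy() per page. */
			for (long p = 0; p < NPAGES; p++)
				buf[p * page] = (char)p;
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;	/* reap all children */
	return 0;
}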
Hi Andrea, On Fri, Mar 13, 2020 at 11:16:09PM -0400, Andrea Arcangeli wrote: > On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote: > > One concern I have is the ordering between TTBR0_EL1 update in > > cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We > > only have an ISB for context synchronisation on that CPU but I don't > > think the architecture guarantees any relation between sysreg access and > > the memory update. We have a DSB but that's further down in switch_to(). > > There are several cpu_do_switch_mm updating TTBR0_EL1 and nr_active_mm > updates that can happen on different CPUs simultaneously. It's hard to > see exactly which one you refer to. > > Overall the idea here is that even if a speculative tlb lookup happens > in between those updates while the "mm" is going away and > atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't > matter because no userland software can use such stale tlb anymore > until local_flush_tlb_asid() gets rid of it. I think you're focussing on the use of the end-to-end translations by architecturally executed instructions, while Catalin is describing problems resulting from the use of intermediate translations by speculative walks. The concern here is that speculation can result in intermediate walk entries in the TLB being used to continue page table walks, and these walks can have side-effects regardless of whether the resulting entires are consumed by architecturally executed instructions. For example, say we free a PMD page for which there is a stale intermediate entry in a TLB, and that page gets reallocated for arbitrary date. A speculative walk can consume that data as-if it were a PMD, and depending on the value it sees a number of things can happen. If the value happens to point at MMIO, the MMU might read from that MMIO with side-effects. If the value happens to contain an architecturally invalid configuration the result can be CONSTRAINED UNPREDICTABLE and not limited to the corresponding ASID. Even if the resulting end-to-end translation is never architecturally consumed there are potential problems here, and I think that this series is assuming stronger-than-architected behaviour from the MMU and page table walks. > This concern is still a corollary of the previous paragraph: it's > still about stale tlb entries being left in an asid that can't ever be > used through the current asid. I'm not certain if the architecture generally allows or forbids walks to be continued for an ASID that is not live in a TTBR, but there are cases where that is possible, so I don't think the above accurately captures the situation. Stale intermediate entries can certainly be consumed in some cases. > > TLB. All the architecture spec states is that the software must first > > clear the entry followed by TLBI (the break-before-make rules). > > The "break" in "break-before-make" is still guaranteed or it wouldn't > boot: however it's not implemented with the tlbi broadcast anymore. > > The break is implemented by enforcing no stale tlb entry of such asid > exists in the TLB while any userland code runs. I understand the point you're making, but please don't overload the terminology. Break-Before-Make is a well-defined term which refers to a specific sequence which includes TLB invalidation, and the above is not a break per that definition. > X86 specs supposed an OS would allocate a TSS per-process and you > would do a context switch by using a task gate. 
I recall the first > Linux version I used had a TSS per process as envisioned by the > specs. Later the TSS become per-CPU and the esp0 pointer was updated > instead (native_load_sp0) and the stack was switched by hand. > > I guess reading the specs may result confusing after such a software > change, that doesn't mean the software shouldn't optimize things > behind the specs if it's safe to do and it's not explicitly forbidden. I think the important thing is that this is not necessarily safe. The architecture currently states that a Break-Before-Make sequence is necessary, and this series does not follow that. AFAICT, this series relies on: * An ISB completing prior page table walks when updating TTBR. I don't believe this is necessarily the case, given how things work for an EL1->EL2 transition where there can be ongoing EL1 walks. * Walks never being initiated for `inactive` contexts within the current translation regime. e.g. while ASID x is installed, never starting a walk for ASID y. I can imagine that the architecture may permit a form of this starting with intermediate walk entries in the TLBs. Before we can say whether this series is safe, we'd need to have a better description of how a PE may use `inactive` values for an in-context translation regime. I've asked for some clarification on these points. Thanks, Mark.
Hi Andrea, On Mon, Mar 16, 2020 at 02:09:07PM +0000, Mark Rutland wrote: > AFAICT, this series relies on: > > * An ISB completing prior page table walks when updating TTBR. I don't > believe this is necessarily the case, given how things work for an > EL1->EL2 transition where there can be ongoing EL1 walks. I've had confirmation that a DSB is necessary (after the MSR and ISB) to complete any ongoing translation table walks for the stale context. Without a DSB, those walks can observe subsequent stores and encounter the usual set of CONSTRAINED UNPREDICTABLE behaviours (e.g. walking into MMIO with side-effects, continuing from amalgamted entries, etc). Those issues are purely to do with the walk, and apply regardless of whether the resulting translations are architecturally consumed. > * Walks never being initiated for `inactive` contexts within the current > translation regime. e.g. while ASID x is installed, never starting a > walk for ASID y. I can imagine that the architecture may permit a form > of this starting with intermediate walk entries in the TLBs. I'm still chasing this point. Thanks, Mark.
Hello Mark, On Tue, Mar 31, 2020 at 10:45:11AM +0100, Mark Rutland wrote: > Hi Andrea, > > On Mon, Mar 16, 2020 at 02:09:07PM +0000, Mark Rutland wrote: > > AFAICT, this series relies on: > > > > * An ISB completing prior page table walks when updating TTBR. I don't > > believe this is necessarily the case, given how things work for an > > EL1->EL2 transition where there can be ongoing EL1 walks. > > I've had confirmation that a DSB is necessary (after the MSR and ISB) to > complete any ongoing translation table walks for the stale context. > > Without a DSB, those walks can observe subsequent stores and encounter > the usual set of CONSTRAINED UNPREDICTABLE behaviours (e.g. walking into > MMIO with side-effects, continuing from amalgamted entries, etc). Those > issues are purely to do with the walk, and apply regardless of whether > the resulting translations are architecturally consumed. Ok, sorry I didn't get it earlier... I attempted a quick fix below. From ab30d8082be62fe24a97eceec5dbfeea8e278511 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli <aarcange@redhat.com> Date: Tue, 31 Mar 2020 20:03:43 -0400 Subject: [PATCH 1/1] arm64: tlb: skip tlbi broadcast, fix speculative tlb lookups Without DSB in between "MSR; ISB" and "atomic_dec(&nr_active_mm)" there's the risk a speculative pagecache lookup may still be walking pagetables of the unloaded asid after nr_active_mm has been decreased. In such case the remote CPU could free the pagetables and reuse the memory without first issuing a tlbi broadcast, while the speculative tlb lookup still runs on the unloaded asid. For this reason the speculative pagetable walks needs to be flushed before decreasing nr_active_mm. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- arch/arm64/include/asm/mmu_context.h | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index 9c66fe317e2f..d821ea3ce839 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -210,8 +210,18 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) if (per_cpu(cpu_not_lazy_tlb, cpu) && is_idle_task(tsk)) { per_cpu(cpu_not_lazy_tlb, cpu) = false; - if (!system_uses_ttbr0_pan()) + if (!system_uses_ttbr0_pan()) { cpu_set_reserved_ttbr0(); + /* + * DSB will flush the speculative pagetable + * walks on the old asid. It's required before + * decreasing nr_active_mm because after + * decreasing nr_active_mm the tlbi broadcast + * may not happen on the unloaded asid before + * the pagetables are freed. + */ + dsb(ish); + } atomic_dec(&mm->context.nr_active_mm); } VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0); @@ -249,6 +259,14 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, } else if (prev != next) { atomic_inc(&next->context.nr_active_mm); __switch_mm(next, cpu); + /* + * DSB will flush the speculative pagetable walks on the old + * asid. It's required before decreasing nr_active_mm because + * after decreasing nr_active_mm the tlbi broadcast may not + * happen on the unloaded asid before the pagetables are + * freed. + */ + dsb(ish); atomic_dec(&prev->context.nr_active_mm); } VM_WARN_ON(!atomic_read(&next->context.nr_active_mm)); I didn't test it yet, because this being a theoretical issue it is better reviewed in the source. > > * Walks never being initiated for `inactive` contexts within the current > > translation regime. e.g. while ASID x is installed, never starting a > > walk for ASID y. 
> > I can imagine that the architecture may permit a form
> > of this starting with intermediate walk entries in the TLBs.
>
> I'm still chasing this point.

Appreciated! I'll keep my fingers crossed that you don't find that
speculative lookups can randomly start on an unloaded ASID. That would
also imply that it would be impossible on arm64 to use different asids
on different CPUs, as is normally done on other arches.

Thanks,
Andrea
diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h index 44531a69d32b..5d9a1433d918 100644 --- a/arch/arm64/include/asm/efi.h +++ b/arch/arm64/include/asm/efi.h @@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt) static inline void efi_set_pgd(struct mm_struct *mm) { - __switch_mm(mm); + __switch_mm(mm, smp_processor_id()); if (system_uses_ttbr0_pan()) { if (mm != current->active_mm) { diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index e4d862420bb4..9072fd7bc5f8 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -20,6 +20,7 @@ typedef struct { atomic64_t id; void *vdso; unsigned long flags; + atomic_t nr_active_mm; } mm_context_t; /* @@ -27,7 +28,8 @@ typedef struct { * ASID change and therefore doesn't need to reload the counter using * atomic64_read. */ -#define ASID(mm) ((mm)->context.id.counter & 0xffff) +#define __ASID(asid) ((asid) & 0xffff) +#define ASID(mm) __ASID((mm)->context.id.counter) extern bool arm64_use_ng_mappings; diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index 3827ff4040a3..9c66fe317e2f 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -175,7 +175,10 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) #define destroy_context(mm) do { } while(0) void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); -#define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) +#define init_new_context(tsk,mm) \ + ({ atomic64_set(&(mm)->context.id, 0); \ + atomic_set(&(mm)->context.nr_active_mm, 0); \ + 0; }) #ifdef CONFIG_ARM64_SW_TTBR0_PAN static inline void update_saved_ttbr0(struct task_struct *tsk, @@ -203,6 +206,15 @@ static inline void update_saved_ttbr0(struct task_struct *tsk, static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { + unsigned int cpu = smp_processor_id(); + if (per_cpu(cpu_not_lazy_tlb, cpu) && + is_idle_task(tsk)) { + per_cpu(cpu_not_lazy_tlb, cpu) = false; + if (!system_uses_ttbr0_pan()) + cpu_set_reserved_ttbr0(); + atomic_dec(&mm->context.nr_active_mm); + } + VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0); /* * We don't actually care about the ttbr0 mapping, so point it at the * zero page. @@ -210,10 +222,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) update_saved_ttbr0(tsk, &init_mm); } -static inline void __switch_mm(struct mm_struct *next) +static inline void __switch_mm(struct mm_struct *next, unsigned int cpu) { - unsigned int cpu = smp_processor_id(); - /* * init_mm.pgd does not contain any user mappings and it is always * active for kernel addresses in TTBR1. Just set the reserved TTBR0. 
@@ -230,8 +240,19 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { - if (prev != next) - __switch_mm(next); + unsigned int cpu = smp_processor_id(); + + if (!per_cpu(cpu_not_lazy_tlb, cpu)) { + per_cpu(cpu_not_lazy_tlb, cpu) = true; + atomic_inc(&next->context.nr_active_mm); + __switch_mm(next, cpu); + } else if (prev != next) { + atomic_inc(&next->context.nr_active_mm); + __switch_mm(next, cpu); + atomic_dec(&prev->context.nr_active_mm); + } + VM_WARN_ON(!atomic_read(&next->context.nr_active_mm)); + VM_WARN_ON(atomic_read(&prev->context.nr_active_mm) < 0); /* * Update the saved TTBR0_EL1 of the scheduled-in task as the previous diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index bc3949064725..0bd987ff9cbd 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void) isb(); } +static inline void local_flush_tlb_asid(unsigned long asid) +{ + asid = __TLBI_VADDR(0, __ASID(asid)); + dsb(nshst); + __tlbi(aside1, asid); + __tlbi_user(aside1, asid); + dsb(nsh); +} + static inline void flush_tlb_all(void) { dsb(ishst); @@ -144,9 +153,38 @@ static inline void flush_tlb_all(void) isb(); } +DECLARE_PER_CPU(bool, cpu_not_lazy_tlb); + +enum tlb_flush_types { + TLB_FLUSH_NO, + TLB_FLUSH_LOCAL, + TLB_FLUSH_BROADCAST, +}; +extern enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, + unsigned int cpu); + static inline void flush_tlb_mm(struct mm_struct *mm) { unsigned long asid = __TLBI_VADDR(0, ASID(mm)); + enum tlb_flush_types flush; + + flush = tlb_flush_check(mm, get_cpu()); + switch (flush) { + case TLB_FLUSH_LOCAL: + + dsb(nshst); + __tlbi(aside1, asid); + __tlbi_user(aside1, asid); + dsb(nsh); + + /* fall through */ + case TLB_FLUSH_NO: + put_cpu(); + return; + case TLB_FLUSH_BROADCAST: + break; + } + put_cpu(); dsb(ishst); __tlbi(aside1is, asid); @@ -167,7 +205,31 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr) { - flush_tlb_page_nosync(vma, uaddr); + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm)); + enum tlb_flush_types flush; + + flush = tlb_flush_check(mm, get_cpu()); + switch (flush) { + case TLB_FLUSH_LOCAL: + + dsb(nshst); + __tlbi(vale1, addr); + __tlbi_user(vale1, addr); + dsb(nsh); + + /* fall through */ + case TLB_FLUSH_NO: + put_cpu(); + return; + case TLB_FLUSH_BROADCAST: + break; + } + put_cpu(); + + dsb(ishst); + __tlbi(vale1is, addr); + __tlbi_user(vale1is, addr); dsb(ish); } @@ -181,14 +243,16 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, unsigned long stride, bool last_level) { - unsigned long asid = ASID(vma->vm_mm); + struct mm_struct *mm = vma->vm_mm; + unsigned long asid = ASID(mm); unsigned long addr; + enum tlb_flush_types flush; start = round_down(start, stride); end = round_up(end, stride); if ((end - start) >= (MAX_TLBI_OPS * stride)) { - flush_tlb_mm(vma->vm_mm); + flush_tlb_mm(mm); return; } @@ -198,6 +262,31 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, start = __TLBI_VADDR(start, asid); end = __TLBI_VADDR(end, asid); + flush = tlb_flush_check(mm, get_cpu()); + switch (flush) { + case TLB_FLUSH_LOCAL: + + dsb(nshst); + for (addr = start; addr < end; addr += stride) { + if (last_level) { + __tlbi(vale1, addr); + __tlbi_user(vale1, addr); + } else { + 
__tlbi(vae1, addr); + __tlbi_user(vae1, addr); + } + } + dsb(nsh); + + /* fall through */ + case TLB_FLUSH_NO: + put_cpu(); + return; + case TLB_FLUSH_BROADCAST: + break; + } + put_cpu(); + dsb(ishst); for (addr = start; addr < end; addr += stride) { if (last_level) { diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c index 8ef73e89d514..3152b7f7da12 100644 --- a/arch/arm64/mm/context.c +++ b/arch/arm64/mm/context.c @@ -25,6 +25,7 @@ static unsigned long *asid_map; static DEFINE_PER_CPU(atomic64_t, active_asids); static DEFINE_PER_CPU(u64, reserved_asids); static cpumask_t tlb_flush_pending; +DEFINE_PER_CPU(bool, cpu_not_lazy_tlb); #define ASID_MASK (~GENMASK(asid_bits - 1, 0)) #define ASID_FIRST_VERSION (1UL << asid_bits) @@ -191,6 +192,12 @@ static u64 new_context(struct mm_struct *mm) set_asid: __set_bit(asid, asid_map); cur_idx = asid; + /* + * check_and_switch_context() will change the ASID of this mm + * so no need of extra ASID local TLB flushes: the new ASID + * isn't stale anymore after the tlb_flush_pending was set. + */ + cpumask_clear(mm_cpumask(mm)); return idx2asid(asid) | generation; } @@ -240,6 +247,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); switch_mm_fastpath: + /* + * Enforce CPU ordering between the atomic_inc(nr_active_mm) + * in switch_mm() and the below cpumask_test_cpu(mm_cpumask). + */ + smp_mb(); + if (cpumask_test_cpu(cpu, mm_cpumask(mm))) { + cpumask_clear_cpu(cpu, mm_cpumask(mm)); + local_flush_tlb_asid(asid); + } arm64_apply_bp_hardening(); @@ -251,6 +267,44 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) cpu_switch_mm(mm->pgd, mm); } +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu) +{ + if (atomic_read(&mm->context.nr_active_mm) <= 1) { + bool is_local = current->active_mm == mm && + per_cpu(cpu_not_lazy_tlb, cpu); + cpumask_t *stale_cpumask = mm_cpumask(mm); + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask); + bool local_is_clear = false; + if (next_zero < nr_cpu_ids && + (is_local && next_zero == cpu)) { + next_zero = cpumask_next_zero(next_zero, stale_cpumask); + local_is_clear = true; + } + if (next_zero < nr_cpu_ids) { + cpumask_setall(stale_cpumask); + local_is_clear = false; + } + + /* + * Enforce CPU ordering between the above + * cpumask_setall(mm_cpumask) and the below + * atomic_read(nr_active_mm). + */ + smp_mb(); + + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) { + if (is_local) { + if (!local_is_clear) + cpumask_clear_cpu(cpu, stale_cpumask); + return TLB_FLUSH_LOCAL; + } + if (atomic_read(&mm->context.nr_active_mm) == 0) + return TLB_FLUSH_NO; + } + } + return TLB_FLUSH_BROADCAST; +} + /* Errata workaround post TTBRx_EL1 update. */ asmlinkage void post_ttbr_update_workaround(void) {
With multiple NUMA nodes and multiple sockets, the tlbi broadcast must be
delivered through the interconnects, in turn increasing the CPU
interconnect traffic and the latency of the tlbi broadcast instruction.
To avoid the synchronous delivery of the tlbi broadcast before the tlbi
instruction can be retired, the hardware would need to implement a
replicated mm_cpumask bitflag for each ASID, and every CPU would need to
tell every other CPU which ASID is being loaded: exactly what x86 does
with mm_cpumask in software.

Even within a single NUMA node the latency of the tlbi broadcast
instruction increases almost linearly with the number of CPUs trying to
send tlbi broadcasts at the same time.

If a single thread of the process is running and it's also running on the
CPU issuing the TLB flush, or if no threads of the process are running,
we can achieve full SMP scalability in the arm64 TLB flushing by skipping
the tlbi broadcast.

After the local TLB flush this means the ASID context goes out of sync in
all CPUs except the local one. This can be tracked in the per-mm cpumask:
if the bit is set it means the ASID context is stale for that CPU. This
results in an extra local ASID TLB flush only when threads run on new
CPUs after a TLB flush.

Skipping the tlbi instruction broadcast is already implemented in
local_flush_tlb_all(); this patch only extends it to flush_tlb_mm(),
flush_tlb_range() and flush_tlb_page() too.

The benchmarks below are measured on a non-NUMA 32 CPU system (ARMv8
Ampere), so they should be far from a worst case scenario: the enterprise
kernel config allows multiple NUMA nodes with NR_CPUS set by default to
4096.

=== stock ===

# cat for-each-cpu.sh
#!/bin/bash

for i in $(seq `nproc`); do
	"$@" &>/dev/null &
done
wait

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
          2.1696 +- 0.0122 seconds time elapsed  ( +-  0.56% )

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
         0.99018 +- 0.00360 seconds time elapsed  ( +-  0.36% )

# cat sort-compute
#!/bin/bash

for x in `seq 256`; do
	for i in `seq 32`; do /usr/bin/sort </usr/bin/sort >/dev/null; done &
done
wait

# perf stat -r 10 -e dummy ./sort-compute
[..]
          1.8094 +- 0.0139 seconds time elapsed  ( +-  0.77% )

=== patch applied ===

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
         0.13941 +- 0.00449 seconds time elapsed  ( +-  3.22% )

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
         0.90510 +- 0.00262 seconds time elapsed  ( +-  0.29% )

# perf stat -r 10 -e dummy ./sort-compute
[..]
         1.64025 +- 0.00618 seconds time elapsed  ( +-  0.38% )

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm64/include/asm/efi.h         |  2 +-
 arch/arm64/include/asm/mmu.h         |  4 +-
 arch/arm64/include/asm/mmu_context.h | 33 ++++++++--
 arch/arm64/include/asm/tlbflush.h    | 95 +++++++++++++++++++++++++++-
 arch/arm64/mm/context.c              | 54 ++++++++++++++++
 5 files changed, 177 insertions(+), 11 deletions(-)
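The mprotect-threaded source used for the numbers above is not included in the thread, so the following is only a guess at the general shape of such a microbenchmark; the thread count, the mapping size and the way the iteration argument is consumed are all assumptions. What it illustrates is that every permission-reducing mprotect() ends in a TLB invalidation of the affected range, which is exactly the operation whose broadcast cost the patch avoids when the mm is active on a single CPU:

/* Hedged reconstruction, not the original benchmark: NTHREADS threads
 * each repeatedly toggle the protection of a private page. Every
 * permission-reducing mprotect() forces the kernel to flush the stale
 * TLB entry for that page. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define NTHREADS 16

static long iterations;
static long page_size;

static void *worker(void *arg)
{
	char *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	(void)arg;
	if (p == MAP_FAILED)
		return NULL;
	p[0] = 1;	/* fault the page in */

	for (long i = 0; i < iterations; i++) {
		/* Dropping write permission invalidates the TLB entry;
		 * this is where the broadcast vs. local tlbi cost shows. */
		mprotect(p, page_size, PROT_READ);
		mprotect(p, page_size, PROT_READ | PROT_WRITE);
		p[0] = (char)i;
	}
	munmap(p, page_size);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NTHREADS];

	iterations = (argc > 1) ? atol(argv[1]) : 10000;
	page_size = sysconf(_SC_PAGESIZE);

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}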