Message ID | 20200223192520.20808-4-aarcange@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [1/3] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote: > With multiple NUMA nodes and multiple sockets, the tlbi broadcast > shall be delivered through the interconnects in turn increasing the > CPU interconnect traffic and the latency of the tlbi broadcast > instruction. To avoid the synchronous delivery of the tlbi broadcast > before the tlbi instruction can be retired, the hardware would need to > implement a replicated mm_cpumask bitflag for each ASID and every CPU > would need to tell every other CPU which ASID is being loaded. Exactly > what x86 does with mm_cpumask in software. > > Even within a single NUMA node the latency of the tlbi broadcast > instruction increases almost linearly with the number of CPUs trying > to send tlbi broadcasts at the same time. > > If a single thread of the process is running and it's also running in > the CPU issuing the TLB flush, or if no thread of the process are > running, we can achieve full SMP scalability in the arm64 TLB flushng > by skipping the tlbi broadcasting. > > After the local TLB flush this means the ASID context goes out of sync > in all CPUs except the local one. This can be tracked on the per-mm > cpumask: if the bit is set it means the ASID context is stale for that > CPU. This results in an extra local ASID TLB flush only when threads > are running in new CPUs after a TLB flush. > > Skipping the tlbi instruction broadcasting is already implemented in > local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(), > flush_tlb_range() and flush_tlb_page() too. > > The below benchmarks are measured on a non-NUMA 32 CPUs system (ARMv8 > Ampere), so it should be far from a worst case scenario: the > enterprise kernel config allows multiple NUMA nodes with NR_CPUS set > by default to 4096. > > === stock === > > # cat for-each-cpu.sh > #!/bin/bash > > for i in $(seq `nproc`); do > "$@" &>/dev/null & > done > wait > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000 > [..] > 2.1696 +- 0.0122 seconds time elapsed ( +- 0.56% ) > > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest > [..] > 0.99018 +- 0.00360 seconds time elapsed ( +- 0.36% ) > > # cat sort-compute > #!/bin/bash > > for x in `seq 256`; do > for i in `seq 32`; do /usr/bin/sort </usr/bin/sort >/dev/null; done & > done > wait > # perf stat -r 10 -e dummy ./sort-compute > [..] > 1.8094 +- 0.0139 seconds time elapsed ( +- 0.77% ) > > === patch applied === > > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000 > [..] > 0.13941 +- 0.00449 seconds time elapsed ( +- 3.22% ) > > # perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest > [..] > 0.90510 +- 0.00262 seconds time elapsed ( +- 0.29% ) > > # perf stat -r 10 -e dummy ./sort-compute > [..] 
> 1.64025 +- 0.00618 seconds time elapsed ( +- 0.38% ) > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > --- > arch/arm64/include/asm/efi.h | 2 +- > arch/arm64/include/asm/mmu.h | 4 +- > arch/arm64/include/asm/mmu_context.h | 33 ++++++++-- > arch/arm64/include/asm/tlbflush.h | 95 +++++++++++++++++++++++++++- > arch/arm64/mm/context.c | 54 ++++++++++++++++ > 5 files changed, 177 insertions(+), 11 deletions(-) > > diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h > index 44531a69d32b..5d9a1433d918 100644 > --- a/arch/arm64/include/asm/efi.h > +++ b/arch/arm64/include/asm/efi.h > @@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt) > > static inline void efi_set_pgd(struct mm_struct *mm) > { > - __switch_mm(mm); > + __switch_mm(mm, smp_processor_id()); > > if (system_uses_ttbr0_pan()) { > if (mm != current->active_mm) { > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index e4d862420bb4..9072fd7bc5f8 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -20,6 +20,7 @@ typedef struct { > atomic64_t id; > void *vdso; > unsigned long flags; > + atomic_t nr_active_mm; > } mm_context_t; > > /* > @@ -27,7 +28,8 @@ typedef struct { > * ASID change and therefore doesn't need to reload the counter using > * atomic64_read. > */ > -#define ASID(mm) ((mm)->context.id.counter & 0xffff) > +#define __ASID(asid) ((asid) & 0xffff) > +#define ASID(mm) __ASID((mm)->context.id.counter) > > extern bool arm64_use_ng_mappings; > > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 3827ff4040a3..9c66fe317e2f 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -175,7 +175,10 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) > #define destroy_context(mm) do { } while(0) > void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); > > -#define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) > +#define init_new_context(tsk,mm) \ > + ({ atomic64_set(&(mm)->context.id, 0); \ > + atomic_set(&(mm)->context.nr_active_mm, 0); \ > + 0; }) > > #ifdef CONFIG_ARM64_SW_TTBR0_PAN > static inline void update_saved_ttbr0(struct task_struct *tsk, > @@ -203,6 +206,15 @@ static inline void update_saved_ttbr0(struct task_struct *tsk, > static inline void > enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > { > + unsigned int cpu = smp_processor_id(); > + if (per_cpu(cpu_not_lazy_tlb, cpu) && > + is_idle_task(tsk)) { > + per_cpu(cpu_not_lazy_tlb, cpu) = false; > + if (!system_uses_ttbr0_pan()) > + cpu_set_reserved_ttbr0(); > + atomic_dec(&mm->context.nr_active_mm); > + } > + VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0); > /* > * We don't actually care about the ttbr0 mapping, so point it at the > * zero page. > @@ -210,10 +222,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > update_saved_ttbr0(tsk, &init_mm); > } > > -static inline void __switch_mm(struct mm_struct *next) > +static inline void __switch_mm(struct mm_struct *next, unsigned int cpu) > { > - unsigned int cpu = smp_processor_id(); > - > /* > * init_mm.pgd does not contain any user mappings and it is always > * active for kernel addresses in TTBR1. Just set the reserved TTBR0. 
> @@ -230,8 +240,19 @@ static inline void > switch_mm(struct mm_struct *prev, struct mm_struct *next, > struct task_struct *tsk) > { > - if (prev != next) > - __switch_mm(next); > + unsigned int cpu = smp_processor_id(); > + > + if (!per_cpu(cpu_not_lazy_tlb, cpu)) { > + per_cpu(cpu_not_lazy_tlb, cpu) = true; > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + } else if (prev != next) { > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + atomic_dec(&prev->context.nr_active_mm); > + } > + VM_WARN_ON(!atomic_read(&next->context.nr_active_mm)); > + VM_WARN_ON(atomic_read(&prev->context.nr_active_mm) < 0); > > /* > * Update the saved TTBR0_EL1 of the scheduled-in task as the previous > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index bc3949064725..0bd987ff9cbd 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void) > isb(); > } > > +static inline void local_flush_tlb_asid(unsigned long asid) > +{ > + asid = __TLBI_VADDR(0, __ASID(asid)); > + dsb(nshst); > + __tlbi(aside1, asid); > + __tlbi_user(aside1, asid); > + dsb(nsh); > +} > + > static inline void flush_tlb_all(void) > { > dsb(ishst); > @@ -144,9 +153,38 @@ static inline void flush_tlb_all(void) > isb(); > } > > +DECLARE_PER_CPU(bool, cpu_not_lazy_tlb); > + > +enum tlb_flush_types { > + TLB_FLUSH_NO, > + TLB_FLUSH_LOCAL, > + TLB_FLUSH_BROADCAST, > +}; > +extern enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, > + unsigned int cpu); > + > static inline void flush_tlb_mm(struct mm_struct *mm) > { > unsigned long asid = __TLBI_VADDR(0, ASID(mm)); > + enum tlb_flush_types flush; > + > + flush = tlb_flush_check(mm, get_cpu()); > + switch (flush) { > + case TLB_FLUSH_LOCAL: > + > + dsb(nshst); > + __tlbi(aside1, asid); > + __tlbi_user(aside1, asid); > + dsb(nsh); > + > + /* fall through */ > + case TLB_FLUSH_NO: > + put_cpu(); > + return; > + case TLB_FLUSH_BROADCAST: > + break; > + } > + put_cpu(); > > dsb(ishst); > __tlbi(aside1is, asid); > @@ -167,7 +205,31 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, > static inline void flush_tlb_page(struct vm_area_struct *vma, > unsigned long uaddr) > { > - flush_tlb_page_nosync(vma, uaddr); > + struct mm_struct *mm = vma->vm_mm; > + unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm)); > + enum tlb_flush_types flush; > + > + flush = tlb_flush_check(mm, get_cpu()); > + switch (flush) { > + case TLB_FLUSH_LOCAL: > + > + dsb(nshst); > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > + dsb(nsh); > + > + /* fall through */ > + case TLB_FLUSH_NO: > + put_cpu(); > + return; > + case TLB_FLUSH_BROADCAST: > + break; > + } > + put_cpu(); > + > + dsb(ishst); > + __tlbi(vale1is, addr); > + __tlbi_user(vale1is, addr); > dsb(ish); > } > > @@ -181,14 +243,16 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > unsigned long start, unsigned long end, > unsigned long stride, bool last_level) > { > - unsigned long asid = ASID(vma->vm_mm); > + struct mm_struct *mm = vma->vm_mm; > + unsigned long asid = ASID(mm); > unsigned long addr; > + enum tlb_flush_types flush; > > start = round_down(start, stride); > end = round_up(end, stride); > > if ((end - start) >= (MAX_TLBI_OPS * stride)) { > - flush_tlb_mm(vma->vm_mm); > + flush_tlb_mm(mm); > return; > } > > @@ -198,6 +262,31 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > start = __TLBI_VADDR(start, asid); 
> end = __TLBI_VADDR(end, asid); > > + flush = tlb_flush_check(mm, get_cpu()); > + switch (flush) { > + case TLB_FLUSH_LOCAL: > + > + dsb(nshst); > + for (addr = start; addr < end; addr += stride) { > + if (last_level) { > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > + } else { > + __tlbi(vae1, addr); > + __tlbi_user(vae1, addr); > + } > + } > + dsb(nsh); > + > + /* fall through */ > + case TLB_FLUSH_NO: > + put_cpu(); > + return; > + case TLB_FLUSH_BROADCAST: > + break; > + } > + put_cpu(); > + > dsb(ishst); > for (addr = start; addr < end; addr += stride) { > if (last_level) { > diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c > index 8ef73e89d514..3152b7f7da12 100644 > --- a/arch/arm64/mm/context.c > +++ b/arch/arm64/mm/context.c > @@ -25,6 +25,7 @@ static unsigned long *asid_map; > static DEFINE_PER_CPU(atomic64_t, active_asids); > static DEFINE_PER_CPU(u64, reserved_asids); > static cpumask_t tlb_flush_pending; > +DEFINE_PER_CPU(bool, cpu_not_lazy_tlb); > > #define ASID_MASK (~GENMASK(asid_bits - 1, 0)) > #define ASID_FIRST_VERSION (1UL << asid_bits) > @@ -191,6 +192,12 @@ static u64 new_context(struct mm_struct *mm) > set_asid: > __set_bit(asid, asid_map); > cur_idx = asid; > + /* > + * check_and_switch_context() will change the ASID of this mm > + * so no need of extra ASID local TLB flushes: the new ASID > + * isn't stale anymore after the tlb_flush_pending was set. > + */ > + cpumask_clear(mm_cpumask(mm)); > return idx2asid(asid) | generation; > } > > @@ -240,6 +247,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); > > switch_mm_fastpath: > + /* > + * Enforce CPU ordering between the atomic_inc(nr_active_mm) > + * in switch_mm() and the below cpumask_test_cpu(mm_cpumask). > + */ > + smp_mb(); > + if (cpumask_test_cpu(cpu, mm_cpumask(mm))) { > + cpumask_clear_cpu(cpu, mm_cpumask(mm)); > + local_flush_tlb_asid(asid); > + } > > arm64_apply_bp_hardening(); > > @@ -251,6 +267,44 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > cpu_switch_mm(mm->pgd, mm); > } > > +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu) > +{ > + if (atomic_read(&mm->context.nr_active_mm) <= 1) { > + bool is_local = current->active_mm == mm && > + per_cpu(cpu_not_lazy_tlb, cpu); > + cpumask_t *stale_cpumask = mm_cpumask(mm); > + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask); > + bool local_is_clear = false; > + if (next_zero < nr_cpu_ids && > + (is_local && next_zero == cpu)) { > + next_zero = cpumask_next_zero(next_zero, stale_cpumask); > + local_is_clear = true; > + } > + if (next_zero < nr_cpu_ids) { > + cpumask_setall(stale_cpumask); > + local_is_clear = false; > + } > + > + /* > + * Enforce CPU ordering between the above > + * cpumask_setall(mm_cpumask) and the below > + * atomic_read(nr_active_mm). > + */ > + smp_mb(); > + > + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) { > + if (is_local) { > + if (!local_is_clear) > + cpumask_clear_cpu(cpu, stale_cpumask); > + return TLB_FLUSH_LOCAL; > + } > + if (atomic_read(&mm->context.nr_active_mm) == 0) > + return TLB_FLUSH_NO; > + } > + } > + return TLB_FLUSH_BROADCAST; > +} > + > /* Errata workaround post TTBRx_EL1 update. */ > asmlinkage void post_ttbr_update_workaround(void) > { May I suggest the following (cosmetic) changes to this patch? 
I'm testing these changes against RHEL integration + regression tests, and I'll also run them against a specially crafted test to measure the impact on task-switching, if any. (I'll report back, soon) Cheers! Rafael -- diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index bb73f02bef28..14eceba302bc 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -182,25 +182,21 @@ static inline void flush_tlb_mm(struct mm_struct *mm) flush = tlb_flush_check(mm, get_cpu()); switch (flush) { case TLB_FLUSH_LOCAL: - dsb(nshst); __tlbi(aside1, asid); __tlbi_user(aside1, asid); dsb(nsh); - /* fall through */ case TLB_FLUSH_NO: - put_cpu(); - return; + break; case TLB_FLUSH_BROADCAST: + dsb(ishst); + __tlbi(aside1is, asid); + __tlbi_user(aside1is, asid); + dsb(ish); break; } put_cpu(); - - dsb(ishst); - __tlbi(aside1is, asid); - __tlbi_user(aside1is, asid); - dsb(ish); } static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, @@ -223,25 +219,21 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, flush = tlb_flush_check(mm, get_cpu()); switch (flush) { case TLB_FLUSH_LOCAL: - dsb(nshst); __tlbi(vale1, addr); __tlbi_user(vale1, addr); dsb(nsh); - /* fall through */ case TLB_FLUSH_NO: - put_cpu(); - return; + break; case TLB_FLUSH_BROADCAST: + dsb(ishst); + __tlbi(vale1is, addr); + __tlbi_user(vale1is, addr); + dsb(ish); break; } put_cpu(); - - dsb(ishst); - __tlbi(vale1is, addr); - __tlbi_user(vale1is, addr); - dsb(ish); } /* @@ -276,7 +268,6 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, flush = tlb_flush_check(mm, get_cpu()); switch (flush) { case TLB_FLUSH_LOCAL: - dsb(nshst); for (addr = start; addr < end; addr += stride) { if (last_level) { @@ -288,27 +279,24 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, } } dsb(nsh); - /* fall through */ case TLB_FLUSH_NO: - put_cpu(); - return; + break; case TLB_FLUSH_BROADCAST: + dsb(ishst); + for (addr = start; addr < end; addr += stride) { + if (last_level) { + __tlbi(vale1is, addr); + __tlbi_user(vale1is, addr); + } else { + __tlbi(vae1is, addr); + __tlbi_user(vae1is, addr); + } + } + dsb(ish); break; } put_cpu(); - - dsb(ishst); - for (addr = start; addr < end; addr += stride) { - if (last_level) { - __tlbi(vale1is, addr); - __tlbi_user(vale1is, addr); - } else { - __tlbi(vae1is, addr); - __tlbi_user(vae1is, addr); - } - } - dsb(ish); } static inline void flush_tlb_range(struct vm_area_struct *vma,
On Mon, Mar 02, 2020 at 10:24:51AM -0500, Rafael Aquini wrote: [...] > I'm testing these changes against RHEL integration + regression > tests, and I'll also run them against a specially crafted test > to measure the impact on task-switching, if any. (I'll report back, > soon) > As promised, I ran the the patches against a full round of kernel integration/regression tests (the same we use to run for RHEL kernels) unfortunately, there is no easy way to share these internal results, but apart from a couple warnings -- which were not related to the test build -- everything went on very smoothly with the patches on top of a RHEL-8 test-kernel. I grabbed some perf numbers, to serve as kernel build benchmark. Test system is a 1 socket 32 core 3.3Ghz ARMv8 Ampere eMAG 256GB RAM. rpmbuild spawns the build with make -j32 and besides the stock kernel RPM, it also builds the -debug flavor RPM and all the sub-RPMs for tools and extra modules. * stock RHEL-8 build: Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm': 27,817,487.14 msec task-clock # 15.015 CPUs utilized 1,318,586 context-switches # 0.047 K/sec 515,342 cpu-migrations # 0.019 K/sec 68,513,346 page-faults # 0.002 M/sec 91,713,983,302,759 cycles # 3.297 GHz 49,871,167,452,081 instructions # 0.54 insn per cycle 23,801,939,026,338 cache-references # 855.647 M/sec 568,847,557,178 cache-misses # 2.390 % of all cache refs 145,305,286,469 dTLB-loads # 5.224 M/sec 752,451,698 dTLB-load-misses # 0.52% of all dTLB cache hits 1852.656905157 seconds time elapsed 26866.849105000 seconds user 965.756120000 seconds sys * RHEL8 kernel + Andrea's patches: Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm': 27,713,883.25 msec task-clock # 15.137 CPUs utilized 1,310,196 context-switches # 0.047 K/sec 511,909 cpu-migrations # 0.018 K/sec 68,535,178 page-faults # 0.002 M/sec 91,412,320,904,990 cycles # 3.298 GHz 49,844,016,063,738 instructions # 0.55 insn per cycle 23,795,774,331,203 cache-references # 858.623 M/sec 568,445,037,308 cache-misses # 2.389 % of all cache refs 135,868,301,669 dTLB-loads # 4.903 M/sec 746,267,581 dTLB-load-misses # 0.55% of all dTLB cache hits 1830.813507976 seconds time elapsed 26785.529337000 seconds user 943.251641000 seconds sys I also wanted to measure the impact of the increased amount of code in the task switching path (in order to decide which TLB invalidation scheme to pick), to figure out what would be the worst case scenario for single threads of execution constrained into one core and yelding the CPU to each other. 
I wrote the small test (attached) and grabbed some numbers in the same box, for the sake of comparison: * stock RHEL-8 build: Performance counter stats for './tlb-test' (1000 runs): 122.67 msec task-clock # 0.997 CPUs utilized ( +- 0.04% ) 32,297 context-switches # 0.263 M/sec ( +- 0.00% ) 0 cpu-migrations # 0.000 K/sec 325 page-faults # 0.003 M/sec ( +- 0.01% ) 404,648,928 cycles # 3.299 GHz ( +- 0.04% ) 239,856,265 instructions # 0.59 insn per cycle ( +- 0.00% ) 121,189,080 cache-references # 987.964 M/sec ( +- 0.00% ) 3,414,396 cache-misses # 2.817 % of all cache refs ( +- 0.05% ) 2,820,754 dTLB-loads # 22.996 M/sec ( +- 0.04% ) 34,545 dTLB-load-misses # 1.22% of all dTLB cache hits ( +- 6.16% ) 0.1230361 +- 0.0000435 seconds time elapsed ( +- 0.04% ) * RHEL8 kernel + Andrea's patches: Performance counter stats for './tlb-test' (1000 runs): 125.57 msec task-clock # 0.997 CPUs utilized ( +- 0.05% ) 32,244 context-switches # 0.257 M/sec ( +- 0.01% ) 0 cpu-migrations # 0.000 K/sec 325 page-faults # 0.003 M/sec ( +- 0.01% ) 413,692,492 cycles # 3.295 GHz ( +- 0.04% ) 241,017,764 instructions # 0.58 insn per cycle ( +- 0.00% ) 121,155,050 cache-references # 964.856 M/sec ( +- 0.00% ) 3,512,035 cache-misses # 2.899 % of all cache refs ( +- 0.04% ) 2,912,475 dTLB-loads # 23.194 M/sec ( +- 0.02% ) 45,340 dTLB-load-misses # 1.56% of all dTLB cache hits ( +- 5.07% ) 0.1259462 +- 0.0000634 seconds time elapsed ( +- 0.05% ) AFAICS, the aforementioned benchmark numbers are suggesting that there is, virtually, no considerable impact, or very minimal and no detrimental impact to ordinary workloads imposed by the changes, and Andrea's benchmarks are showing/suggesting that a broad range of particular workloads will considerably benefit from the changes. With the numbers above, added to what I've seen in our (internal) integration tests, I'm confident on the stability of the changes. -- Rafael // SPDX-License-Identifier: BSD-2-Clause /* * Copyright (c) 2020, Rafael Aquini <aquini@redhat.com> * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
* * compile with: gcc -o tlb-test -D_GNU_SOURCE -lpthread tlb-test.c * dependencies: * - _GNU_SOURCE required for asprintf(3), sched_getcpu(3) && sched_setaffinity(2) * - libpthreads required for POSIX semaphores */ #include <stdio.h> #include <stdarg.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <fcntl.h> #include <errno.h> #include <sched.h> #include <time.h> #include <semaphore.h> #include <sys/types.h> #include <sys/times.h> #include <sys/time.h> #include <sys/stat.h> #include <sys/wait.h> #include <sys/mman.h> #ifndef NCHILDS #define NCHILDS 4 #endif #ifndef NPAGES #define NPAGES 32 #endif #ifndef NRUNS #define NRUNS 8192 #endif #ifdef DEBUG #define DPRINTF(...) fprintf(stderr, __VA_ARGS__) #else #define DPRINTF(...) #endif #define ERROR_EXIT(msg) \ do { \ char *estr = NULL; \ asprintf(&estr, "[%s:%d] %s", __FILE__, __LINE__, msg); \ perror(estr); \ exit(EXIT_FAILURE); \ } while (0) static const char *prg_name = "tlb-test"; static long system_hz; static long page_size; static sem_t *sem; /* * Fisher-Yates shuffler algorithm [Statistical Tables (London, 1938), Ex.12], * adapted to computer language by R. Durstenfeld [CACM 7 (1964), 420], and * presented by Donald E. Knuth at: * "The Art of Computer Programming, Volume 2: Seminumerical Algorithms" * [Algorithm P (shuffling) under Section 3.4 OTHER TYPES OF RANDOM QUANTITIES] */ void fy_shuffler(unsigned long *buf, unsigned long len) { unsigned long j, u, tmp; for (j = len - 1; j > 0; j--) { u = rand() % j; tmp = *(buf + u); *(buf + u) = *(buf + j); *(buf + j) = tmp; } } unsigned long usec_diff(struct timeval *a, struct timeval *b) { unsigned long usec; usec = (b->tv_sec - a->tv_sec) * 1000000; usec += b->tv_usec - a->tv_usec; return usec; } unsigned long workload(void *addr, size_t len, unsigned long *fault_order, int child) { struct timeval start, end; unsigned long i; gettimeofday(&start, NULL); for (i = 0; i < len; i++) { unsigned long p = *(fault_order + i); *((unsigned char *)(addr + (p * page_size))) = ((i * p) % 0xff); } gettimeofday(&end, NULL); DPRINTF("[%s: child-%d (CPU=%d PID=%ld)] RUNNING! \n", prg_name, child, sched_getcpu(), (long) getpid()); return usec_diff(&start, &end); } int child(int n, FILE *stream) { unsigned long pages[NPAGES]; size_t map_sz; int i, runs; void *addr; double elapsed = 0; for (i = 0; i < NPAGES; i++) pages[i] = i; map_sz = page_size * NPAGES; addr = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) ERROR_EXIT("mmap"); if (madvise(addr, map_sz, MADV_NOHUGEPAGE) == -1) ERROR_EXIT("madvise"); srand(time(NULL)); for (runs = 0; runs < NRUNS; runs++) { sem_wait(sem); elapsed += workload(addr, NPAGES, pages, n); fy_shuffler(pages, NPAGES); sem_post(sem); /* * relinquish the CPU to provide a small backoff, so other tasks * get a fair chance on aquiring the semaphore. 
*/ sched_yield(); } fprintf(stream, "[%s: child-%d (CPU=%d PID=%ld)] %lf msecs\n", prg_name, n, sched_getcpu(), (long) getpid(), (double )(elapsed / 1000)); return 0; } int main(int argc, char *argv[]) { pid_t pid[NCHILDS]; int i, ret, status; cpu_set_t set; CPU_ZERO(&set); /* clear the set */ CPU_SET(1, &set); if (sched_setaffinity(0, sizeof(cpu_set_t), &set) == -1) ERROR_EXIT("sched_setaffinity"); if ((system_hz = sysconf(_SC_CLK_TCK)) == -1) ERROR_EXIT("sysconf"); if ((page_size = sysconf(_SC_PAGESIZE)) == -1) ERROR_EXIT("sysconf"); sem = sem_open(prg_name, O_CREAT, S_IRUSR | S_IWUSR, 0); if (sem == SEM_FAILED) ERROR_EXIT("sem_open"); for (i = 0; i < NCHILDS; i++) { pid[i] = fork(); switch (pid[i]) { case -1: /* fork() has failed */ ERROR_EXIT("fork"); break; case 0: /* child of a sucessful fork() */ ret = child(i+1, stdout); exit(ret); break; } } sem_post(sem); for (;;) { if (wait(&status) == -1) { if (errno == ECHILD) { goto out; } else { ERROR_EXIT("wait"); } } } out: sem_close(sem); sem_unlink(prg_name); exit(EXIT_SUCCESS); }
Hi Andrea, On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote: > switch_mm(struct mm_struct *prev, struct mm_struct *next, > struct task_struct *tsk) > { > - if (prev != next) > - __switch_mm(next); > + unsigned int cpu = smp_processor_id(); > + > + if (!per_cpu(cpu_not_lazy_tlb, cpu)) { > + per_cpu(cpu_not_lazy_tlb, cpu) = true; > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + } else if (prev != next) { > + atomic_inc(&next->context.nr_active_mm); > + __switch_mm(next, cpu); > + atomic_dec(&prev->context.nr_active_mm); > + } IIUC, nr_active_mm keeps track of how many instances of the current pgd (TTBR0_EL1) are active. > +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu) > +{ > + if (atomic_read(&mm->context.nr_active_mm) <= 1) { > + bool is_local = current->active_mm == mm && > + per_cpu(cpu_not_lazy_tlb, cpu); > + cpumask_t *stale_cpumask = mm_cpumask(mm); > + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask); > + bool local_is_clear = false; > + if (next_zero < nr_cpu_ids && > + (is_local && next_zero == cpu)) { > + next_zero = cpumask_next_zero(next_zero, stale_cpumask); > + local_is_clear = true; > + } > + if (next_zero < nr_cpu_ids) { > + cpumask_setall(stale_cpumask); > + local_is_clear = false; > + } > + > + /* > + * Enforce CPU ordering between the above > + * cpumask_setall(mm_cpumask) and the below > + * atomic_read(nr_active_mm). > + */ > + smp_mb(); > + > + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) { > + if (is_local) { > + if (!local_is_clear) > + cpumask_clear_cpu(cpu, stale_cpumask); > + return TLB_FLUSH_LOCAL; > + } > + if (atomic_read(&mm->context.nr_active_mm) == 0) > + return TLB_FLUSH_NO; > + } > + } > + return TLB_FLUSH_BROADCAST; And this code here can assume that if nr_active_mm <= 1, no broadcast is necessary. One concern I have is the ordering between TTBR0_EL1 update in cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We only have an ISB for context synchronisation on that CPU but I don't think the architecture guarantees any relation between sysreg access and the memory update. We have a DSB but that's further down in switch_to(). However, what worries me more is that you can now potentially do a TLB shootdown without clearing the intermediate (e.g. VA to pte) walk caches from the TLB. Even if the corresponding pgd and ASID are no longer active on other CPUs, I'm not sure it's entirely safe to free (and re-allocate) pages belonging to a pgtable without first flushing the TLB. All the architecture spec states is that the software must first clear the entry followed by TLBI (the break-before-make rules). That said, the benchmark numbers are not very encouraging. Around 1% improvement in a single run, it can as well be noise. Also something like hackbench may also show a slight impact on the context switch path. Maybe with a true NUMA machine with hundreds of CPUs we may see a difference, but it depends on how well the TLBI is implemented. Thanks.
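For readers unfamiliar with the break-before-make rules Catalin refers to above, here is a hedged sketch of the sequence the architecture requires when changing a live, valid translation table entry, written with the arm64 kernel's barrier/TLBI helpers. The function name bbm_replace_pte() is made up, and the bare WRITE_ONCE() stores and the choice of a by-VA+ASID invalidation are illustrative assumptions, not code from this patch or from the kernel's own page-table helpers:

/*
 * Illustrative only: the Break-Before-Make sequence for replacing a
 * live valid PTE. bbm_replace_pte() is a made-up name; real kernel
 * code goes through pte_clear()/flush_tlb_page()/set_pte_at() style
 * helpers rather than open-coding this.
 */
static void bbm_replace_pte(pte_t *ptep, pte_t new_pte,
			    unsigned long addr, unsigned long asid)
{
	/* 1. Break: replace the old valid entry with an invalid one. */
	WRITE_ONCE(*ptep, __pte(0));

	/* 2. Make the invalid entry visible to the table walkers. */
	dsb(ishst);

	/* 3. Invalidate cached copies of the old entry, by VA+ASID,
	 *    on all CPUs in the inner shareable domain. */
	__tlbi(vae1is, __TLBI_VADDR(addr, asid));

	/* 4. Wait for the invalidation to complete everywhere. */
	dsb(ish);

	/* 5. Make: only now install the new entry. */
	WRITE_ONCE(*ptep, new_pte);
	dsb(ishst);
}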
Hi Catalin,

On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote:
> IIUC, nr_active_mm keeps track of how many instances of the current pgd
> (TTBR0_EL1) are active.

Correct.

> And this code here can assume that if nr_active_mm <= 1, no broadcast is
> necessary.

Yes.

> One concern I have is the ordering between TTBR0_EL1 update in
> cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We
> only have an ISB for context synchronisation on that CPU but I don't
> think the architecture guarantees any relation between sysreg access and
> the memory update. We have a DSB but that's further down in switch_to().

There are several cpu_do_switch_mm updating TTBR0_EL1 and nr_active_mm
updates that can happen on different CPUs simultaneously. It's hard to
see exactly which one you refer to.

Overall the idea here is that even if a speculative tlb lookup happens
in between those updates while the "mm" is going away and
atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't
matter because no userland software can use such stale tlb anymore
until local_flush_tlb_asid() gets rid of it.

The v1 patch (before I posted the incremental mm_count check) had issues
with speculatively loaded stale tlb entries only because they weren't
guaranteed to be flushed when the kernel thread switched back to a
userland process. So it relied on the CPU not speculatively loading
random pagetables while the kernel thread was running in lazy tlb mode,
but here the flush is guaranteed and in turn the CPU can always load any
TLB entry it wants at any given time.

> However, what worries me more is that you can now potentially do a TLB
> shootdown without clearing the intermediate (e.g. VA to pte) walk caches
> from the TLB. Even if the corresponding pgd and ASID are no longer
> active on other CPUs, I'm not sure it's entirely safe to free (and
> re-allocate) pages belonging to a pgtable without first flushing the

With regard to not doing a tlbi broadcast, nothing fundamentally changed
between v1 (single threaded, using mm_users) and the latest version
(multithreaded, introducing nr_active_mm). The v1 only skipped the tlbi
broadcast in remote CPUs that had run the asid of a single threaded
process before a CPU migration, but the pages could already be
reallocated from the point of view of the remote CPUs. In the current
version the page can be reallocated even from the point of view of the
current CPU. However, the fact the window has been enlarged
significantly should be a good thing: if something had been wrong with
it, it would have been far easier to reproduce.

This concern is still a corollary of the previous paragraph: it's
still about stale tlb entries being left in an asid that can't ever be
used through the current asid.

> TLB. All the architecture spec states is that the software must first
> clear the entry followed by TLBI (the break-before-make rules).

The "break" in "break-before-make" is still guaranteed or it wouldn't
boot: however it's not implemented with the tlbi broadcast anymore.

The break is implemented by enforcing no stale tlb entry of such asid
exists in the TLB while any userland code runs.

X86 specs supposed an OS would allocate a TSS per-process and you
would do a context switch by using a task gate. I recall the first
Linux version I used had a TSS per process as envisioned by the
specs. Later the TSS become per-CPU and the esp0 pointer was updated
instead (native_load_sp0) and the stack was switched by hand.
I guess reading the specs may result confusing after such a software
change, that doesn't mean the software shouldn't optimize things
behind the specs if it's safe to do and it's not explicitly forbidden.

The page being reused for another virtual address on another CPU isn't
necessarily an invalid scenario from the point of view of the CPU. It
only looks invalid if you assume the page is freed. If you instead think
of it like a MAP_SHARED page that gets one more mapping associated with
it (the reuse of the page) from another CPU, it may look more
legitimate. The fact there's an old mapping left in the MAP_SHARED
pagecache indefinitely doesn't mean the CPU with such an old mapping
left in the TLB is allowed to change the content of the page, as long as
the software never writes to that virtual address through the old
mapping. The important thing is that the content of the page must not
change unless the software running on the CPU explicitly writes through
the virtual address that corresponds to the stale TLB entry (and it's
guaranteed the software won't write to it). The stale TLB of such an
asid is eventually flushed either through a bump of the asid generation
or through a local asid flush.

> That said, the benchmark numbers are not very encouraging. Around 1%
> improvement in a single run, it can as well be noise. Also something
> like hackbench may also show a slight impact on the context switch path.

I recall I tested hackbench and it appeared faster with processes;
otherwise it was within measurement error. hackbench with processes is
fork heavy, so it gets some benefit because all the post-fork
copy-on-write faults would otherwise trigger tlbi broadcasts on all CPUs
to flush the wrprotected tlb entry. Specifically, the flush converted to
a local tlb flush is ptep_clear_flush_notify in wp_page_copy(), and
there's one for each page being modified by parent or child.

> Maybe with a true NUMA machine with hundreds of CPUs we may see a
> difference, but it depends on how well the TLBI is implemented.

The numbers in the commit header were not from a single run. perf stat
-r 10 -e dummy runs the benchmark 10 times and then shows the stdev
along with the number, so you can see what the noise was. It wasn't only
a 1% improvement either, and overall there's no noise in the
measurement.

tcmalloc_large_heap_fragmentation_unittest, which simulates dealing with
small objects in many different containers at the same time, was 9.4%
faster (%stdev 0.36% 0.29%), with 32 CPUs and no NUMA. The 256-way
parallel run of 32 `sort` invocations in a row was 10.3% faster (%stdev
0.77% 0.38%), with 32 CPUs and no NUMA. The multithreaded microbenchmark
runs 16 times faster, but that's not meaningful by itself; it's still a
good hint that real life workloads (especially those with frequent
MADV_DONTNEED) will run faster, and they did (and a verification that
now also multithreaded apps can be optimized).

Rafael already posted a benchmark specifically stressing the context
switch. It's reasonable to expect any multi-socket NUMA system to show
more benefit from the optimization than the 32 CPU non-NUMA box used
for the current benchmarks.

Thanks,
Andrea
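To make the fork-heavy copy-on-write pattern described above concrete, here is a minimal userspace sketch; it is not hackbench itself and not part of the patch, and the NPAGES/NCHILD values are arbitrary. Each post-fork write in a child faults on a write-protected PTE and goes through wp_page_copy(), whose ptep_clear_flush_notify() is the flush that this series turns into a local (non-broadcast) invalidation when only one CPU is running the mm:

/* Minimal sketch of a fork-heavy COW workload (not part of the patch;
 * NPAGES and NCHILD are arbitrary). Each write below hits a
 * write-protected PTE inherited from the parent and triggers one
 * copy-on-write fault per page. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NPAGES	4096
#define NCHILD	8

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, NPAGES * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xaa, NPAGES * page);	/* populate in the parent */

	for (int i = 0; i < NCHILD; i++) {
		if (fork() == 0) {
			/* COW faults: one wp_page_copy() per page. */
			for (long p = 0; p < NPAGES; p++)
				buf[p * page] = (char)p;
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;	/* reap all children */
	return 0;
}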
Hi Andrea, On Fri, Mar 13, 2020 at 11:16:09PM -0400, Andrea Arcangeli wrote: > On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote: > > One concern I have is the ordering between TTBR0_EL1 update in > > cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We > > only have an ISB for context synchronisation on that CPU but I don't > > think the architecture guarantees any relation between sysreg access and > > the memory update. We have a DSB but that's further down in switch_to(). > > There are several cpu_do_switch_mm updating TTBR0_EL1 and nr_active_mm > updates that can happen on different CPUs simultaneously. It's hard to > see exactly which one you refer to. > > Overall the idea here is that even if a speculative tlb lookup happens > in between those updates while the "mm" is going away and > atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't > matter because no userland software can use such stale tlb anymore > until local_flush_tlb_asid() gets rid of it. I think you're focussing on the use of the end-to-end translations by architecturally executed instructions, while Catalin is describing problems resulting from the use of intermediate translations by speculative walks. The concern here is that speculation can result in intermediate walk entries in the TLB being used to continue page table walks, and these walks can have side-effects regardless of whether the resulting entires are consumed by architecturally executed instructions. For example, say we free a PMD page for which there is a stale intermediate entry in a TLB, and that page gets reallocated for arbitrary date. A speculative walk can consume that data as-if it were a PMD, and depending on the value it sees a number of things can happen. If the value happens to point at MMIO, the MMU might read from that MMIO with side-effects. If the value happens to contain an architecturally invalid configuration the result can be CONSTRAINED UNPREDICTABLE and not limited to the corresponding ASID. Even if the resulting end-to-end translation is never architecturally consumed there are potential problems here, and I think that this series is assuming stronger-than-architected behaviour from the MMU and page table walks. > This concern is still a corollary of the previous paragraph: it's > still about stale tlb entries being left in an asid that can't ever be > used through the current asid. I'm not certain if the architecture generally allows or forbids walks to be continued for an ASID that is not live in a TTBR, but there are cases where that is possible, so I don't think the above accurately captures the situation. Stale intermediate entries can certainly be consumed in some cases. > > TLB. All the architecture spec states is that the software must first > > clear the entry followed by TLBI (the break-before-make rules). > > The "break" in "break-before-make" is still guaranteed or it wouldn't > boot: however it's not implemented with the tlbi broadcast anymore. > > The break is implemented by enforcing no stale tlb entry of such asid > exists in the TLB while any userland code runs. I understand the point you're making, but please don't overload the terminology. Break-Before-Make is a well-defined term which refers to a specific sequence which includes TLB invalidation, and the above is not a break per that definition. > X86 specs supposed an OS would allocate a TSS per-process and you > would do a context switch by using a task gate. 
I recall the first > Linux version I used had a TSS per process as envisioned by the > specs. Later the TSS become per-CPU and the esp0 pointer was updated > instead (native_load_sp0) and the stack was switched by hand. > > I guess reading the specs may result confusing after such a software > change, that doesn't mean the software shouldn't optimize things > behind the specs if it's safe to do and it's not explicitly forbidden. I think the important thing is that this is not necessarily safe. The architecture currently states that a Break-Before-Make sequence is necessary, and this series does not follow that. AFAICT, this series relies on: * An ISB completing prior page table walks when updating TTBR. I don't believe this is necessarily the case, given how things work for an EL1->EL2 transition where there can be ongoing EL1 walks. * Walks never being initiated for `inactive` contexts within the current translation regime. e.g. while ASID x is installed, never starting a walk for ASID y. I can imagine that the architecture may permit a form of this starting with intermediate walk entries in the TLBs. Before we can say whether this series is safe, we'd need to have a better description of how a PE may use `inactive` values for an in-context translation regime. I've asked for some clarification on these points. Thanks, Mark.
Hi Andrea, On Mon, Mar 16, 2020 at 02:09:07PM +0000, Mark Rutland wrote: > AFAICT, this series relies on: > > * An ISB completing prior page table walks when updating TTBR. I don't > believe this is necessarily the case, given how things work for an > EL1->EL2 transition where there can be ongoing EL1 walks. I've had confirmation that a DSB is necessary (after the MSR and ISB) to complete any ongoing translation table walks for the stale context. Without a DSB, those walks can observe subsequent stores and encounter the usual set of CONSTRAINED UNPREDICTABLE behaviours (e.g. walking into MMIO with side-effects, continuing from amalgamted entries, etc). Those issues are purely to do with the walk, and apply regardless of whether the resulting translations are architecturally consumed. > * Walks never being initiated for `inactive` contexts within the current > translation regime. e.g. while ASID x is installed, never starting a > walk for ASID y. I can imagine that the architecture may permit a form > of this starting with intermediate walk entries in the TLBs. I'm still chasing this point. Thanks, Mark.
Hello Mark, On Tue, Mar 31, 2020 at 10:45:11AM +0100, Mark Rutland wrote: > Hi Andrea, > > On Mon, Mar 16, 2020 at 02:09:07PM +0000, Mark Rutland wrote: > > AFAICT, this series relies on: > > > > * An ISB completing prior page table walks when updating TTBR. I don't > > believe this is necessarily the case, given how things work for an > > EL1->EL2 transition where there can be ongoing EL1 walks. > > I've had confirmation that a DSB is necessary (after the MSR and ISB) to > complete any ongoing translation table walks for the stale context. > > Without a DSB, those walks can observe subsequent stores and encounter > the usual set of CONSTRAINED UNPREDICTABLE behaviours (e.g. walking into > MMIO with side-effects, continuing from amalgamted entries, etc). Those > issues are purely to do with the walk, and apply regardless of whether > the resulting translations are architecturally consumed. Ok, sorry I didn't get it earlier... I attempted a quick fix below. From ab30d8082be62fe24a97eceec5dbfeea8e278511 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli <aarcange@redhat.com> Date: Tue, 31 Mar 2020 20:03:43 -0400 Subject: [PATCH 1/1] arm64: tlb: skip tlbi broadcast, fix speculative tlb lookups Without DSB in between "MSR; ISB" and "atomic_dec(&nr_active_mm)" there's the risk a speculative pagecache lookup may still be walking pagetables of the unloaded asid after nr_active_mm has been decreased. In such case the remote CPU could free the pagetables and reuse the memory without first issuing a tlbi broadcast, while the speculative tlb lookup still runs on the unloaded asid. For this reason the speculative pagetable walks needs to be flushed before decreasing nr_active_mm. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- arch/arm64/include/asm/mmu_context.h | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index 9c66fe317e2f..d821ea3ce839 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -210,8 +210,18 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) if (per_cpu(cpu_not_lazy_tlb, cpu) && is_idle_task(tsk)) { per_cpu(cpu_not_lazy_tlb, cpu) = false; - if (!system_uses_ttbr0_pan()) + if (!system_uses_ttbr0_pan()) { cpu_set_reserved_ttbr0(); + /* + * DSB will flush the speculative pagetable + * walks on the old asid. It's required before + * decreasing nr_active_mm because after + * decreasing nr_active_mm the tlbi broadcast + * may not happen on the unloaded asid before + * the pagetables are freed. + */ + dsb(ish); + } atomic_dec(&mm->context.nr_active_mm); } VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0); @@ -249,6 +259,14 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, } else if (prev != next) { atomic_inc(&next->context.nr_active_mm); __switch_mm(next, cpu); + /* + * DSB will flush the speculative pagetable walks on the old + * asid. It's required before decreasing nr_active_mm because + * after decreasing nr_active_mm the tlbi broadcast may not + * happen on the unloaded asid before the pagetables are + * freed. + */ + dsb(ish); atomic_dec(&prev->context.nr_active_mm); } VM_WARN_ON(!atomic_read(&next->context.nr_active_mm)); I didn't test it yet, because this being a theoretical issue it is better reviewed in the source. > > * Walks never being initiated for `inactive` contexts within the current > > translation regime. e.g. while ASID x is installed, never starting a > > walk for ASID y. 
> > I can imagine that the architecture may permit a form
> > of this starting with intermediate walk entries in the TLBs.
>
> I'm still chasing this point.

Appreciated! I'll keep my fingers crossed that you don't find that
speculative lookups can randomly start on an unloaded ASID. That would
also imply that it would be impossible on arm64 to use different asids
on different CPUs, as is normally done on other arches.

Thanks,
Andrea
diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h index 44531a69d32b..5d9a1433d918 100644 --- a/arch/arm64/include/asm/efi.h +++ b/arch/arm64/include/asm/efi.h @@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt) static inline void efi_set_pgd(struct mm_struct *mm) { - __switch_mm(mm); + __switch_mm(mm, smp_processor_id()); if (system_uses_ttbr0_pan()) { if (mm != current->active_mm) { diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index e4d862420bb4..9072fd7bc5f8 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -20,6 +20,7 @@ typedef struct { atomic64_t id; void *vdso; unsigned long flags; + atomic_t nr_active_mm; } mm_context_t; /* @@ -27,7 +28,8 @@ typedef struct { * ASID change and therefore doesn't need to reload the counter using * atomic64_read. */ -#define ASID(mm) ((mm)->context.id.counter & 0xffff) +#define __ASID(asid) ((asid) & 0xffff) +#define ASID(mm) __ASID((mm)->context.id.counter) extern bool arm64_use_ng_mappings; diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index 3827ff4040a3..9c66fe317e2f 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -175,7 +175,10 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) #define destroy_context(mm) do { } while(0) void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); -#define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) +#define init_new_context(tsk,mm) \ + ({ atomic64_set(&(mm)->context.id, 0); \ + atomic_set(&(mm)->context.nr_active_mm, 0); \ + 0; }) #ifdef CONFIG_ARM64_SW_TTBR0_PAN static inline void update_saved_ttbr0(struct task_struct *tsk, @@ -203,6 +206,15 @@ static inline void update_saved_ttbr0(struct task_struct *tsk, static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { + unsigned int cpu = smp_processor_id(); + if (per_cpu(cpu_not_lazy_tlb, cpu) && + is_idle_task(tsk)) { + per_cpu(cpu_not_lazy_tlb, cpu) = false; + if (!system_uses_ttbr0_pan()) + cpu_set_reserved_ttbr0(); + atomic_dec(&mm->context.nr_active_mm); + } + VM_WARN_ON(atomic_read(&mm->context.nr_active_mm) < 0); /* * We don't actually care about the ttbr0 mapping, so point it at the * zero page. @@ -210,10 +222,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) update_saved_ttbr0(tsk, &init_mm); } -static inline void __switch_mm(struct mm_struct *next) +static inline void __switch_mm(struct mm_struct *next, unsigned int cpu) { - unsigned int cpu = smp_processor_id(); - /* * init_mm.pgd does not contain any user mappings and it is always * active for kernel addresses in TTBR1. Just set the reserved TTBR0. 
@@ -230,8 +240,19 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { - if (prev != next) - __switch_mm(next); + unsigned int cpu = smp_processor_id(); + + if (!per_cpu(cpu_not_lazy_tlb, cpu)) { + per_cpu(cpu_not_lazy_tlb, cpu) = true; + atomic_inc(&next->context.nr_active_mm); + __switch_mm(next, cpu); + } else if (prev != next) { + atomic_inc(&next->context.nr_active_mm); + __switch_mm(next, cpu); + atomic_dec(&prev->context.nr_active_mm); + } + VM_WARN_ON(!atomic_read(&next->context.nr_active_mm)); + VM_WARN_ON(atomic_read(&prev->context.nr_active_mm) < 0); /* * Update the saved TTBR0_EL1 of the scheduled-in task as the previous diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index bc3949064725..0bd987ff9cbd 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void) isb(); } +static inline void local_flush_tlb_asid(unsigned long asid) +{ + asid = __TLBI_VADDR(0, __ASID(asid)); + dsb(nshst); + __tlbi(aside1, asid); + __tlbi_user(aside1, asid); + dsb(nsh); +} + static inline void flush_tlb_all(void) { dsb(ishst); @@ -144,9 +153,38 @@ static inline void flush_tlb_all(void) isb(); } +DECLARE_PER_CPU(bool, cpu_not_lazy_tlb); + +enum tlb_flush_types { + TLB_FLUSH_NO, + TLB_FLUSH_LOCAL, + TLB_FLUSH_BROADCAST, +}; +extern enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, + unsigned int cpu); + static inline void flush_tlb_mm(struct mm_struct *mm) { unsigned long asid = __TLBI_VADDR(0, ASID(mm)); + enum tlb_flush_types flush; + + flush = tlb_flush_check(mm, get_cpu()); + switch (flush) { + case TLB_FLUSH_LOCAL: + + dsb(nshst); + __tlbi(aside1, asid); + __tlbi_user(aside1, asid); + dsb(nsh); + + /* fall through */ + case TLB_FLUSH_NO: + put_cpu(); + return; + case TLB_FLUSH_BROADCAST: + break; + } + put_cpu(); dsb(ishst); __tlbi(aside1is, asid); @@ -167,7 +205,31 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr) { - flush_tlb_page_nosync(vma, uaddr); + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm)); + enum tlb_flush_types flush; + + flush = tlb_flush_check(mm, get_cpu()); + switch (flush) { + case TLB_FLUSH_LOCAL: + + dsb(nshst); + __tlbi(vale1, addr); + __tlbi_user(vale1, addr); + dsb(nsh); + + /* fall through */ + case TLB_FLUSH_NO: + put_cpu(); + return; + case TLB_FLUSH_BROADCAST: + break; + } + put_cpu(); + + dsb(ishst); + __tlbi(vale1is, addr); + __tlbi_user(vale1is, addr); dsb(ish); } @@ -181,14 +243,16 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, unsigned long stride, bool last_level) { - unsigned long asid = ASID(vma->vm_mm); + struct mm_struct *mm = vma->vm_mm; + unsigned long asid = ASID(mm); unsigned long addr; + enum tlb_flush_types flush; start = round_down(start, stride); end = round_up(end, stride); if ((end - start) >= (MAX_TLBI_OPS * stride)) { - flush_tlb_mm(vma->vm_mm); + flush_tlb_mm(mm); return; } @@ -198,6 +262,31 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, start = __TLBI_VADDR(start, asid); end = __TLBI_VADDR(end, asid); + flush = tlb_flush_check(mm, get_cpu()); + switch (flush) { + case TLB_FLUSH_LOCAL: + + dsb(nshst); + for (addr = start; addr < end; addr += stride) { + if (last_level) { + __tlbi(vale1, addr); + __tlbi_user(vale1, addr); + } else { + 
__tlbi(vae1, addr); + __tlbi_user(vae1, addr); + } + } + dsb(nsh); + + /* fall through */ + case TLB_FLUSH_NO: + put_cpu(); + return; + case TLB_FLUSH_BROADCAST: + break; + } + put_cpu(); + dsb(ishst); for (addr = start; addr < end; addr += stride) { if (last_level) { diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c index 8ef73e89d514..3152b7f7da12 100644 --- a/arch/arm64/mm/context.c +++ b/arch/arm64/mm/context.c @@ -25,6 +25,7 @@ static unsigned long *asid_map; static DEFINE_PER_CPU(atomic64_t, active_asids); static DEFINE_PER_CPU(u64, reserved_asids); static cpumask_t tlb_flush_pending; +DEFINE_PER_CPU(bool, cpu_not_lazy_tlb); #define ASID_MASK (~GENMASK(asid_bits - 1, 0)) #define ASID_FIRST_VERSION (1UL << asid_bits) @@ -191,6 +192,12 @@ static u64 new_context(struct mm_struct *mm) set_asid: __set_bit(asid, asid_map); cur_idx = asid; + /* + * check_and_switch_context() will change the ASID of this mm + * so no need of extra ASID local TLB flushes: the new ASID + * isn't stale anymore after the tlb_flush_pending was set. + */ + cpumask_clear(mm_cpumask(mm)); return idx2asid(asid) | generation; } @@ -240,6 +247,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); switch_mm_fastpath: + /* + * Enforce CPU ordering between the atomic_inc(nr_active_mm) + * in switch_mm() and the below cpumask_test_cpu(mm_cpumask). + */ + smp_mb(); + if (cpumask_test_cpu(cpu, mm_cpumask(mm))) { + cpumask_clear_cpu(cpu, mm_cpumask(mm)); + local_flush_tlb_asid(asid); + } arm64_apply_bp_hardening(); @@ -251,6 +267,44 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) cpu_switch_mm(mm->pgd, mm); } +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu) +{ + if (atomic_read(&mm->context.nr_active_mm) <= 1) { + bool is_local = current->active_mm == mm && + per_cpu(cpu_not_lazy_tlb, cpu); + cpumask_t *stale_cpumask = mm_cpumask(mm); + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask); + bool local_is_clear = false; + if (next_zero < nr_cpu_ids && + (is_local && next_zero == cpu)) { + next_zero = cpumask_next_zero(next_zero, stale_cpumask); + local_is_clear = true; + } + if (next_zero < nr_cpu_ids) { + cpumask_setall(stale_cpumask); + local_is_clear = false; + } + + /* + * Enforce CPU ordering between the above + * cpumask_setall(mm_cpumask) and the below + * atomic_read(nr_active_mm). + */ + smp_mb(); + + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) { + if (is_local) { + if (!local_is_clear) + cpumask_clear_cpu(cpu, stale_cpumask); + return TLB_FLUSH_LOCAL; + } + if (atomic_read(&mm->context.nr_active_mm) == 0) + return TLB_FLUSH_NO; + } + } + return TLB_FLUSH_BROADCAST; +} + /* Errata workaround post TTBRx_EL1 update. */ asmlinkage void post_ttbr_update_workaround(void) {
With multiple NUMA nodes and multiple sockets, the tlbi broadcast must be
delivered through the interconnects, in turn increasing the CPU
interconnect traffic and the latency of the tlbi broadcast instruction.
To avoid the synchronous delivery of the tlbi broadcast before the tlbi
instruction can be retired, the hardware would need to implement a
replicated mm_cpumask bitflag for each ASID, and every CPU would need to
tell every other CPU which ASID is being loaded: exactly what x86 does
with mm_cpumask in software.

Even within a single NUMA node the latency of the tlbi broadcast
instruction increases almost linearly with the number of CPUs trying to
send tlbi broadcasts at the same time.

If a single thread of the process is running and it's also running on the
CPU issuing the TLB flush, or if no threads of the process are running,
we can achieve full SMP scalability in the arm64 TLB flushing by skipping
the tlbi broadcast.

After the local TLB flush this means the ASID context goes out of sync in
all CPUs except the local one. This can be tracked in the per-mm cpumask:
if the bit is set it means the ASID context is stale for that CPU. This
results in an extra local ASID TLB flush only when threads run on new
CPUs after a TLB flush.

Skipping the tlbi instruction broadcast is already implemented in
local_flush_tlb_all(); this patch only extends it to flush_tlb_mm(),
flush_tlb_range() and flush_tlb_page() too.

The benchmarks below are measured on a non-NUMA 32 CPU system (ARMv8
Ampere), so they should be far from a worst case scenario: the enterprise
kernel config allows multiple NUMA nodes with NR_CPUS set by default to
4096.

=== stock ===

# cat for-each-cpu.sh
#!/bin/bash

for i in $(seq `nproc`); do
	"$@" &>/dev/null &
done
wait

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
          2.1696 +- 0.0122 seconds time elapsed  ( +-  0.56% )

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
         0.99018 +- 0.00360 seconds time elapsed  ( +-  0.36% )

# cat sort-compute
#!/bin/bash

for x in `seq 256`; do
	for i in `seq 32`; do /usr/bin/sort </usr/bin/sort >/dev/null; done &
done
wait

# perf stat -r 10 -e dummy ./sort-compute
[..]
          1.8094 +- 0.0139 seconds time elapsed  ( +-  0.77% )

=== patch applied ===

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
         0.13941 +- 0.00449 seconds time elapsed  ( +-  3.22% )

# perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
         0.90510 +- 0.00262 seconds time elapsed  ( +-  0.29% )

# perf stat -r 10 -e dummy ./sort-compute
[..]
         1.64025 +- 0.00618 seconds time elapsed  ( +-  0.38% )

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm64/include/asm/efi.h         |  2 +-
 arch/arm64/include/asm/mmu.h         |  4 +-
 arch/arm64/include/asm/mmu_context.h | 33 ++++++++--
 arch/arm64/include/asm/tlbflush.h    | 95 +++++++++++++++++++++++++++-
 arch/arm64/mm/context.c              | 54 ++++++++++++++++
 5 files changed, 177 insertions(+), 11 deletions(-)
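The mprotect-threaded source used for the numbers above is not included in the thread, so the following is only a guess at the general shape of such a microbenchmark; the thread count, the mapping size and the way the iteration argument is consumed are all assumptions. What it illustrates is that every permission-reducing mprotect() ends in a TLB invalidation of the affected range, which is exactly the operation whose broadcast cost the patch avoids when the mm is active on a single CPU:

/* Hedged reconstruction, not the original benchmark: NTHREADS threads
 * each repeatedly toggle the protection of a private page. Every
 * permission-reducing mprotect() forces the kernel to flush the stale
 * TLB entry for that page. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define NTHREADS 16

static long iterations;
static long page_size;

static void *worker(void *arg)
{
	char *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	(void)arg;
	if (p == MAP_FAILED)
		return NULL;
	p[0] = 1;	/* fault the page in */

	for (long i = 0; i < iterations; i++) {
		/* Dropping write permission invalidates the TLB entry;
		 * this is where the broadcast vs. local tlbi cost shows. */
		mprotect(p, page_size, PROT_READ);
		mprotect(p, page_size, PROT_READ | PROT_WRITE);
		p[0] = (char)i;
	}
	munmap(p, page_size);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NTHREADS];

	iterations = (argc > 1) ? atol(argv[1]) : 10000;
	page_size = sysconf(_SC_PAGESIZE);

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}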