@@ -354,6 +354,32 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
+static inline void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
/* nothing to do */
@@ -65,6 +65,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
unsigned long uaddr);
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
@@ -202,3 +202,111 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
}
+
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * Check whether the tlb flush required by @batch, as of @ugen, has
+ * already been done on all the CPUs in the batch. The batch itself
+ * will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race can leave ugen_done smaller than it
+ * should be, which only results in an unnecessary tlb flush.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's only an optimization. Just skip on a cmpxchg failure
+ * rather than retrying.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race can leave ugen_done smaller than it
+ * should be, which only results in an unnecessary tlb flush.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's only an optimization. Just skip on a cmpxchg failure
+ * rather than retrying.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
@@ -293,6 +293,10 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
@@ -1240,6 +1240,114 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * Check whether the tlb flush required by @batch, as of @ugen, has
+ * already been done on all the CPUs in the batch. The batch itself
+ * will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race can leave ugen_done smaller than it
+ * should be, which only results in an unnecessary tlb flush.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's only an optimization. Just skip on a cmpxchg failure
+ * rather than retrying.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race can leave ugen_done smaller than it
+ * should be, which only results in an unnecessary tlb flush.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's only an optimization. Just skip on a cmpxchg failure
+ * rather than retrying.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
@@ -1377,6 +1377,7 @@ struct task_struct {
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
int luf_no_shootdown;
int luf_takeoff_started;
+ unsigned long luf_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
@@ -1246,6 +1246,7 @@ void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void reset_batch(struct tlbflush_unmap_batch *batch);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
@@ -1261,6 +1262,9 @@ static inline void try_to_unmap_flush_takeoff(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+}
static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
{
}
@@ -668,9 +668,11 @@ bool luf_takeoff_start(void)
*/
void luf_takeoff_end(void)
{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
+ unsigned long cur_luf_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -697,10 +699,19 @@ void luf_takeoff_end(void)
if (no_shootdown)
goto out;
+ cur_luf_ugen = current->luf_ugen;
+
+ current->luf_ugen = 0;
+
+ if (cur_luf_ugen && arch_tlbbatch_diet(&tlb_ubc_takeoff->arch, cur_luf_ugen))
+ reset_batch(tlb_ubc_takeoff);
+
try_to_unmap_flush_takeoff();
out:
- if (outmost)
+ if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
+ VM_WARN_ON(current->luf_ugen);
+ }
}
/*
@@ -757,6 +768,7 @@ bool luf_takeoff_check_and_fold(struct page *page)
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned short luf_key = page_luf_key(page);
struct luf_batch *lb;
+ unsigned long lb_ugen;
unsigned long flags;
/*
@@ -770,13 +782,25 @@ bool luf_takeoff_check_and_fold(struct page *page)
if (!luf_key)
return true;
- if (current->luf_no_shootdown)
- return false;
-
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
+
fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
#endif
@@ -656,7 +656,7 @@ static unsigned long new_luf_ugen(void)
return ugen;
}
-static void reset_batch(struct tlbflush_unmap_batch *batch)
+void reset_batch(struct tlbflush_unmap_batch *batch)
{
arch_tlbbatch_clear(&batch->arch);
batch->flush_required = false;
@@ -743,8 +743,14 @@ static void __fold_luf_batch(struct luf_batch *dst_lb,
* more tlb shootdown might be needed to fulfill the newer
* request. Conservatively keep the newer one.
*/
- if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen)) {
+ /*
+ * A good opportunity to shrink the batch using the old ugen.
+ */
+ if (dst_lb->ugen && arch_tlbbatch_diet(&dst_lb->batch.arch, dst_lb->ugen))
+ reset_batch(&dst_lb->batch);
dst_lb->ugen = src_ugen;
+ }
fold_batch(&dst_lb->batch, src_batch, false);
}
@@ -772,17 +778,45 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static unsigned long tlb_flush_start(void)
+{
+ /*
+ * The memory barrier implied by the atomic operation in
+ * new_luf_ugen() prevents the read of luf_ugen from being
+ * reordered after the following tlb flush.
+ */
+ return new_luf_ugen();
+}
+
+static void tlb_flush_end(struct arch_tlbflush_unmap_batch *arch,
+ struct mm_struct *mm, unsigned long ugen)
+{
+ /*
+ * Prevent the following marking from being reordered before
+ * the actual tlb flush.
+ */
+ smp_mb();
+
+ if (arch)
+ arch_tlbbatch_mark_ugen(arch, ugen);
+ if (mm)
+ arch_mm_mark_ugen(mm, ugen);
+}
+
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+ unsigned long ugen;
if (!tlb_ubc_takeoff->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+ tlb_flush_end(&tlb_ubc_takeoff->arch, NULL, ugen);
/*
* Now that tlb shootdown of tlb_ubc_takeoff has been performed,
@@ -871,13 +905,17 @@ void try_to_unmap_flush(void)
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+ unsigned long ugen;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc->arch);
+ tlb_flush_end(&tlb_ubc->arch, NULL, ugen);
+
reset_batch(tlb_ubc);
}
@@ -1009,7 +1047,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
if (pending != flushed) {
+ unsigned long ugen;
+
+ ugen = tlb_flush_start();
arch_flush_tlb_batched_pending(mm);
+ tlb_flush_end(NULL, mm, ugen);
/*
* If the new TLB flushing is pending during flushing, leave
luf mechanism performs tlb shootdown for mappings that have been unmapped in lazy manner. However, it doesn't have to perform tlb shootdown to cpus that already have been done by others since the tlb shootdown was desired. Since luf already introduced its own generation number used as a global timestamp, luf_ugen, it's possible to selectively pick cpus that have been done tlb flush required. This patch introduced APIs that use the generation number to select and remove those cpus so that it can perform tlb shootdown with a smaller cpumask, for all the CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH archs, x86, riscv, and arm64. Signed-off-by: Byungchul Park <byungchul@sk.com> --- arch/arm64/include/asm/tlbflush.h | 26 +++++++ arch/riscv/include/asm/tlbflush.h | 4 ++ arch/riscv/mm/tlbflush.c | 108 ++++++++++++++++++++++++++++++ arch/x86/include/asm/tlbflush.h | 4 ++ arch/x86/mm/tlb.c | 108 ++++++++++++++++++++++++++++++ include/linux/sched.h | 1 + mm/internal.h | 4 ++ mm/page_alloc.c | 32 +++++++-- mm/rmap.c | 46 ++++++++++++- 9 files changed, 327 insertions(+), 6 deletions(-)