From patchwork Fri Aug 23 22:41:48 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nadav Amit X-Patchwork-Id: 11112889 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id ECC4D14E5 for ; Sat, 24 Aug 2019 06:05:00 +0000 (UTC) Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id C5F922133F for ; Sat, 24 Aug 2019 06:05:00 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C5F922133F Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=vmware.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=xen-devel-bounces@lists.xenproject.org Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1i1P8Y-0004Lr-MJ; Sat, 24 Aug 2019 06:02:54 +0000 Received: from all-amaz-eas1.inumbo.com ([34.197.232.57] helo=us1-amaz-eas2.inumbo.com) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1i1P8X-0004Lk-Jt for xen-devel@lists.xenproject.org; Sat, 24 Aug 2019 06:02:53 +0000 X-Inumbo-ID: cd586ae8-c634-11e9-adf1-12813bfff9fa Received: from mail-pf1-f196.google.com (unknown [209.85.210.196]) by us1-amaz-eas2.inumbo.com (Halon) with ESMTPS id cd586ae8-c634-11e9-adf1-12813bfff9fa; Sat, 24 Aug 2019 06:02:50 +0000 (UTC) Received: by mail-pf1-f196.google.com with SMTP id w16so8035992pfn.7 for ; Fri, 23 Aug 2019 23:02:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=F7wnaN2sm4dDETq+DUPx8pSGm3FuGxmG+QLwLZs+eRM=; b=S47Uy6FMew48+EEL1i38ELNutf6QjrIKjxDb1eUH4NgQKL9fpMUql2XHRC/1G60N3G F1SNO8g3WCsPhkEceDVRwOZuSRFw34QJysQ8hjdbXSeUw5DMC4qQLkHKmvUWjiqex5dt uB5YJbu3XIKKKoKCo3jQbnLq5ChJpgSLmDYq6mZnWPfrjDudi+5f4oIDDQ4u9R1+Kyat xSTkbMC9sg7fjXA87b+riiQ6u+NaI+zwVnyWhQ59zJUv0NpbSX2eY55BY7+3bKcgGsGb EGxeGCrVzTCeqhWJKMTjZK3Mb7wXABS4Vh0TLw/YZAT5BS0aiKSJPbeAhRYoO80Ct/Hd mqLA== X-Gm-Message-State: APjAAAXRP9qlIy5j1Sw6fhW3bgJVV/P0P3+8Jy2EZby0FDF3micmODKU 6u35OMSEK/m2XuNhmNNeiXc= X-Google-Smtp-Source: APXvYqz52LafkL2XCB2zaw0uHZgr1PbVP18me3xzKilmH63LO1lCskfJOHdA3QwemTRsPXXXLUh0sw== X-Received: by 2002:aa7:8602:: with SMTP id p2mr9258456pfn.138.1566626569056; Fri, 23 Aug 2019 23:02:49 -0700 (PDT) Received: from sc2-haas01-esx0118.eng.vmware.com ([66.170.99.1]) by smtp.gmail.com with ESMTPSA id i11sm6505645pfk.34.2019.08.23.23.02.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 23 Aug 2019 23:02:47 -0700 (PDT) From: Nadav Amit To: Andy Lutomirski , Dave Hansen Date: Fri, 23 Aug 2019 15:41:48 -0700 Message-Id: <20190823224153.15223-5-namit@vmware.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190823224153.15223-1-namit@vmware.com> References: <20190823224153.15223-1-namit@vmware.com> Subject: [Xen-devel] [PATCH v4 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently X-BeenThere: xen-devel@lists.xenproject.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Cc: Sasha Levin , Juergen Gross , linux-hyperv@vger.kernel.org, Stephen Hemminger , xen-devel@lists.xenproject.org, kvm@vger.kernel.org, Peter Zijlstra , Haiyang Zhang , x86@kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, Ingo Molnar , Nadav Amit , Paolo Bonzini , Borislav Petkov , Thomas Gleixner , "K. Y. Srinivasan" , Boris Ostrovsky MIME-Version: 1.0 Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" To improve TLB shootdown performance, flush the remote and local TLBs concurrently. Introduce flush_tlb_multi() that does so. Introduce paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen and hyper-v are only compile-tested). While the updated smp infrastructure is capable of running a function on a single local core, it is not optimized for this case. The multiple function calls and the indirect branch introduce some overhead, and might make local TLB flushes slower than they were before the recent changes. Before calling the SMP infrastructure, check if only a local TLB flush is needed to restore the lost performance in this common case. This requires to check mm_cpumask() one more time, but unless this mask is updated very frequently, this should impact performance negatively. Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Sasha Levin Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: x86@kernel.org Cc: Juergen Gross Cc: Paolo Bonzini Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Boris Ostrovsky Cc: linux-hyperv@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Cc: kvm@vger.kernel.org Cc: xen-devel@lists.xenproject.org Reviewed-by: Michael Kelley # Hyper-v parts Reviewed-by: Juergen Gross # Xen and paravirt parts Reviewed-by: Dave Hansen Signed-off-by: Nadav Amit --- arch/x86/hyperv/mmu.c | 10 +++--- arch/x86/include/asm/paravirt.h | 6 ++-- arch/x86/include/asm/paravirt_types.h | 4 +-- arch/x86/include/asm/tlbflush.h | 8 ++--- arch/x86/include/asm/trace/hyperv.h | 2 +- arch/x86/kernel/kvm.c | 11 +++++-- arch/x86/kernel/paravirt.c | 2 +- arch/x86/mm/tlb.c | 45 ++++++++++++++++++--------- arch/x86/xen/mmu_pv.c | 11 +++---- include/trace/events/xen.h | 2 +- 10 files changed, 61 insertions(+), 40 deletions(-) diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c index e65d7fe6489f..8740d8b21db3 100644 --- a/arch/x86/hyperv/mmu.c +++ b/arch/x86/hyperv/mmu.c @@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset, return gva_n - offset; } -static void hyperv_flush_tlb_others(const struct cpumask *cpus, - const struct flush_tlb_info *info) +static void hyperv_flush_tlb_multi(const struct cpumask *cpus, + const struct flush_tlb_info *info) { int cpu, vcpu, gva_n, max_gvas; struct hv_tlb_flush **flush_pcpu; @@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus, u64 status = U64_MAX; unsigned long flags; - trace_hyperv_mmu_flush_tlb_others(cpus, info); + trace_hyperv_mmu_flush_tlb_multi(cpus, info); if (!hv_hypercall_pg) goto do_native; @@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus, if (!(status & HV_HYPERCALL_RESULT_MASK)) return; do_native: - native_flush_tlb_others(cpus, info); + native_flush_tlb_multi(cpus, info); } static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus, @@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void) return; pr_info("Using hypercall for remote TLB flush\n"); - pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others; + pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi; pv_ops.mmu.tlb_remove_table = tlb_remove_table; } diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 69089d46f128..bc4829c9b3f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr) PVOP_VCALL1(mmu.flush_tlb_one_user, addr); } -static inline void flush_tlb_others(const struct cpumask *cpumask, - const struct flush_tlb_info *info) +static inline void flush_tlb_multi(const struct cpumask *cpumask, + const struct flush_tlb_info *info) { - PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info); + PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info); } static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 70b654f3ffe5..63fa751344bf 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -206,8 +206,8 @@ struct pv_mmu_ops { void (*flush_tlb_user)(void); void (*flush_tlb_kernel)(void); void (*flush_tlb_one_user)(unsigned long addr); - void (*flush_tlb_others)(const struct cpumask *cpus, - const struct flush_tlb_info *info); + void (*flush_tlb_multi)(const struct cpumask *cpus, + const struct flush_tlb_info *info); void (*tlb_remove_table)(struct mmu_gather *tlb, void *table); diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 2f6e9be163ae..559195f79c2f 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -533,7 +533,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr) * - flush_tlb_page(vma, vmaddr) flushes one page * - flush_tlb_range(vma, start, end) flushes a range of pages * - flush_tlb_kernel_range(start, end) flushes a range of kernel pages - * - flush_tlb_others(cpumask, info) flushes TLBs on other cpus + * - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus * * ..but the i386 has somewhat limited tlb flushing capabilities, * and page-granular flushes are available only on i486 and up. @@ -586,7 +586,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false); } -void native_flush_tlb_others(const struct cpumask *cpumask, +void native_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info); static inline u64 inc_mm_tlb_gen(struct mm_struct *mm) @@ -610,8 +610,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); #ifndef CONFIG_PARAVIRT -#define flush_tlb_others(mask, info) \ - native_flush_tlb_others(mask, info) +#define flush_tlb_multi(mask, info) \ + native_flush_tlb_multi(mask, info) #define paravirt_tlb_remove_table(tlb, page) \ tlb_remove_page(tlb, (void *)(page)) diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h index ace464f09681..85ca8560c7f9 100644 --- a/arch/x86/include/asm/trace/hyperv.h +++ b/arch/x86/include/asm/trace/hyperv.h @@ -8,7 +8,7 @@ #if IS_ENABLED(CONFIG_HYPERV) -TRACE_EVENT(hyperv_mmu_flush_tlb_others, +TRACE_EVENT(hyperv_mmu_flush_tlb_multi, TP_PROTO(const struct cpumask *cpus, const struct flush_tlb_info *info), TP_ARGS(cpus, info), diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4cc967178bf9..0941d2d7f1cb 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -592,7 +592,7 @@ static void __init kvm_apf_trap_init(void) static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask); -static void kvm_flush_tlb_others(const struct cpumask *cpumask, +static void kvm_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info) { u8 state; @@ -606,6 +606,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask, * queue flush_on_enter for pre-empted vCPUs */ for_each_cpu(cpu, flushmask) { + /* + * The local vCPU is never preempted, so we do not explicitly + * skip check for local vCPU - it will never be cleared from + * flushmask. + */ src = &per_cpu(steal_time, cpu); state = READ_ONCE(src->preempted); if ((state & KVM_VCPU_PREEMPTED)) { @@ -615,7 +620,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask, } } - native_flush_tlb_others(flushmask, info); + native_flush_tlb_multi(flushmask, info); } static void __init kvm_guest_init(void) @@ -637,7 +642,7 @@ static void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) && !kvm_para_has_hint(KVM_HINTS_REALTIME) && kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others; + pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi; pv_ops.mmu.tlb_remove_table = tlb_remove_table; } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 59d3d2763a9e..5520a04c84ba 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -359,7 +359,7 @@ struct paravirt_patch_template pv_ops = { .mmu.flush_tlb_user = native_flush_tlb, .mmu.flush_tlb_kernel = native_flush_tlb_global, .mmu.flush_tlb_one_user = native_flush_tlb_one_user, - .mmu.flush_tlb_others = native_flush_tlb_others, + .mmu.flush_tlb_multi = native_flush_tlb_multi, .mmu.tlb_remove_table = (void (*)(struct mmu_gather *, void *))tlb_remove_page, diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index c3ca3545d78a..5376a5447bd0 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -562,7 +562,7 @@ static void flush_tlb_func(void *info) * garbage into our TLB. Since switching to init_mm is barely * slower than a minimal flush, just switch to init_mm. * - * This should be rare, with native_flush_tlb_others skipping + * This should be rare, with native_flush_tlb_multi() skipping * IPIs to lazy TLB mode CPUs. */ switch_mm_irqs_off(NULL, &init_mm, NULL); @@ -660,9 +660,14 @@ static bool tlb_is_not_lazy(int cpu) static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask); -void native_flush_tlb_others(const struct cpumask *cpumask, - const struct flush_tlb_info *info) +void native_flush_tlb_multi(const struct cpumask *cpumask, + const struct flush_tlb_info *info) { + /* + * Do accounting and tracing. Note that there are (and have always been) + * cases in which a remote TLB flush will be traced, but eventually + * would not happen. + */ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); if (info->end == TLB_FLUSH_ALL) trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); @@ -682,10 +687,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * means that the percpu tlb_gen variables won't be updated * and we'll do pointless flushes on future context switches. * - * Rather than hooking native_flush_tlb_others() here, I think + * Rather than hooking native_flush_tlb_multi() here, I think * that UV should be updated so that smp_call_function_many(), * etc, are optimal on UV. */ + flush_tlb_func((void *)info); + cpumask = uv_flush_tlb_others(cpumask, info); if (cpumask) smp_call_function_many(cpumask, flush_tlb_func, @@ -704,7 +711,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * doing a speculative memory access. */ if (info->freed_tables) { - smp_call_function_many(cpumask, flush_tlb_func, (void *)info, 1); + __smp_call_function_many(cpumask, flush_tlb_func, (void *)info, + SCF_WAIT | SCF_RUN_LOCAL); } else { /* * Although we could have used on_each_cpu_cond_mask(), @@ -731,7 +739,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask, if (tlb_is_not_lazy(cpu)) __cpumask_set_cpu(cpu, cond_cpumask); } - smp_call_function_many(cond_cpumask, flush_tlb_func, (void *)info, 1); + __smp_call_function_many(cond_cpumask, flush_tlb_func, + (void *)info, SCF_WAIT | SCF_RUN_LOCAL); } } @@ -812,16 +821,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables, new_tlb_gen); - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + /* + * flush_tlb_multi() is not optimized for the common case in which only + * a local TLB flush is needed. Optimize this use-case by calling + * flush_tlb_func_local() directly in this case. + */ + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) { + flush_tlb_multi(mm_cpumask(mm), info); + } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { lockdep_assert_irqs_enabled(); local_irq_disable(); flush_tlb_func(info); local_irq_enable(); } - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), info); - put_flush_tlb_info(); put_cpu(); } @@ -875,16 +888,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0); - if (cpumask_test_cpu(cpu, &batch->cpumask)) { + /* + * flush_tlb_multi() is not optimized for the common case in which only + * a local TLB flush is needed. Optimize this use-case by calling + * flush_tlb_func_local() directly in this case. + */ + if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) { + flush_tlb_multi(&batch->cpumask, info); + } else if (cpumask_test_cpu(cpu, &batch->cpumask)) { lockdep_assert_irqs_enabled(); local_irq_disable(); flush_tlb_func(info); local_irq_enable(); } - if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) - flush_tlb_others(&batch->cpumask, info); - cpumask_clear(&batch->cpumask); put_flush_tlb_info(); diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 26e8b326966d..48f7c7eb4dbc 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -1345,8 +1345,8 @@ static void xen_flush_tlb_one_user(unsigned long addr) preempt_enable(); } -static void xen_flush_tlb_others(const struct cpumask *cpus, - const struct flush_tlb_info *info) +static void xen_flush_tlb_multi(const struct cpumask *cpus, + const struct flush_tlb_info *info) { struct { struct mmuext_op op; @@ -1356,7 +1356,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus, const size_t mc_entry_size = sizeof(args->op) + sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus()); - trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end); + trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end); if (cpumask_empty(cpus)) return; /* nothing to do */ @@ -1365,9 +1365,8 @@ static void xen_flush_tlb_others(const struct cpumask *cpus, args = mcs.args; args->op.arg2.vcpumask = to_cpumask(args->mask); - /* Remove us, and any offline CPUS. */ + /* Remove any offline CPUs */ cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask); - cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask)); args->op.cmd = MMUEXT_TLB_FLUSH_MULTI; if (info->end != TLB_FLUSH_ALL && @@ -2396,7 +2395,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = { .flush_tlb_user = xen_flush_tlb, .flush_tlb_kernel = xen_flush_tlb, .flush_tlb_one_user = xen_flush_tlb_one_user, - .flush_tlb_others = xen_flush_tlb_others, + .flush_tlb_multi = xen_flush_tlb_multi, .tlb_remove_table = tlb_remove_table, .pgd_alloc = xen_pgd_alloc, diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h index 9a0e8af21310..546022acf160 100644 --- a/include/trace/events/xen.h +++ b/include/trace/events/xen.h @@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user, TP_printk("addr %lx", __entry->addr) ); -TRACE_EVENT(xen_mmu_flush_tlb_others, +TRACE_EVENT(xen_mmu_flush_tlb_multi, TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm, unsigned long addr, unsigned long end), TP_ARGS(cpus, mm, addr, end),