From patchwork Fri Aug 23 22:41:44 2019
From: Nadav Amit
To: Andy Lutomirski, Dave Hansen
Date: Fri, 23 Aug 2019 15:41:44 -0700
Message-Id: <20190823224153.15223-1-namit@vmware.com>
Subject: [Xen-devel] [PATCH v4 0/9] x86/tlb: Concurrent TLB flushes

[ Similar cover letter to v3, with updated performance numbers on
  Skylake. Sorry for the time it has taken since the last version. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead: each PTE flush can take hundreds
of cycles. This patch-set allows TLB flushes to be run concurrently:
first request the remote CPUs to initiate the flush, then run it
locally, and finally wait for the remote CPUs to finish their work. In
addition, there are various small optimizations to avoid, for example,
unwarranted false sharing.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on the
previous behavior of on_each_cpu(), which invoked the function first
remotely and only then locally. [Peter says he remembers that someone
might have relied on it, but without further information it is hard to
know how to address it.]

Running sysbench on dax/ext4 with emulated pmem, write-cache disabled,
on a 2-socket, 56-logical-core (28 cores + SMT) Skylake, 5 repetitions:

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.  tip-aug22 avg (stdev)  +patch-set avg (stdev)  change
  ---  ---------------------  ----------------------  ------
  1    1152920 (7453)         1169469 (9059)          +1.4%
  2    1545832 (12555)        1584172 (10484)         +2.4%
  4    2480703 (12039)        2518641 (12875)         +1.5%
  8    3684486 (26007)        3840343 (44144)         +4.2%
  16   4981567 (23565)        5125756 (15458)         +2.8%
  32   5679542 (10116)        5887826 (6121)          +3.6%
  56   5630944 (17937)        5812514 (26264)         +3.2%

(Note that on configurations with up to 28 threads, numactl was used to
pin all threads to socket 1, which explains the drop in performance
when going to 32 threads.)

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):
  Th.  tip-aug22 avg (stdev)  +patch-set avg (stdev)  change
  ---  ---------------------  ----------------------  ------
  1    1444119 (8524)         1469606 (10527)         +1.7%
  2    1921540 (24169)        1961899 (14450)         +2.1%
  4    3073716 (21786)        3199880 (16774)         +4.1%
  8    4700698 (49534)        4802312 (11043)          0%
  16   6005180 (6366)         6006656 (31624)          0%
  32   6826466 (10496)        6886622 (19110)         +0.8%
  56   6832344 (13468)        6885586 (20646)         +0.8%

The results are somewhat different from those previously reported on
Haswell-X, and the maximum performance improvement is smaller. However,
the improvement is still significant.

v3 -> v4:
* Merge flush_tlb_func_local() and flush_tlb_func_remote() [Peter]
* Prevent preemption in on_each_cpu(). It is not strictly needed, but
  it prevents concerns. [Peter/tglx]
* Add acked-by and reviewed-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix the Hyper-V and Xen implementations [Andrew]
* Fix redundant TLB flushes

v1 -> v2:
* Remove the patches that Thomas took [tglx]
* Add compile-tested Hyper-V and Xen implementations [Dave]
* Remove UV [Andy]
* Add lazy optimization, remove inline keyword [Dave]
* Restructure the patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi() [Juergen]
* Remove async invalidation optimizations [Andy]
* Add KVM support [Paolo]

Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Dave Hansen
Cc: Haiyang Zhang
Cc: Ingo Molnar
Cc: Josh Poimboeuf
Cc: Juergen Gross
Cc: "K. Y. Srinivasan"
Cc: Paolo Bonzini
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Sasha Levin
Cc: Stephen Hemminger
Cc: Thomas Gleixner
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  10 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  52 +++----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 +-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 195 ++++++++++++++------------
 arch/x86/xen/mmu_pv.c                 |  11 +-
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  34 ++++-
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 138 +++++++++---------
 14 files changed, 254 insertions(+), 221 deletions(-)