From patchwork Mon Mar 4 18:18:13 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Emilio Cota X-Patchwork-Id: 10838229 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8693D13B5 for ; Mon, 4 Mar 2019 18:52:47 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 72C6B2B0FE for ; Mon, 4 Mar 2019 18:52:47 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 659532B10F; Mon, 4 Mar 2019 18:52:47 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id C61FC2B0FE for ; Mon, 4 Mar 2019 18:52:46 +0000 (UTC) Received: from localhost ([127.0.0.1]:59088 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1h0shh-00043f-UO for patchwork-qemu-devel@patchwork.kernel.org; Mon, 04 Mar 2019 13:52:45 -0500 Received: from eggs.gnu.org ([209.51.188.92]:45182) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1h0sBJ-0000c4-BG for qemu-devel@nongnu.org; Mon, 04 Mar 2019 13:19:20 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1h0sBI-0002Ny-6T for qemu-devel@nongnu.org; Mon, 04 Mar 2019 13:19:17 -0500 Received: from wout2-smtp.messagingengine.com ([64.147.123.25]:37775) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1h0sBH-0001yh-Su for qemu-devel@nongnu.org; Mon, 04 Mar 2019 13:19:16 -0500 Received: from compute4.internal (compute4.nyi.internal [10.202.2.44]) by mailout.west.internal (Postfix) with ESMTP id 2E71936F9; Mon, 4 Mar 2019 13:18:32 -0500 (EST) Received: from mailfrontend2 ([10.202.2.163]) by compute4.internal (MEProxy); Mon, 04 Mar 2019 13:18:32 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=braap.org; h= from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-type:content-transfer-encoding; s=mesmtp; bh=Z85MJ9xxlAmYzG2jWGNy9SG3vq4lvVeG0F3Ch6SIMC0=; b=s1EQg5iJox2T xF/+/oLQqjg3kzvqvyugJrrNuph8AElN3C3fB/HKWi9EeH+v1w2cZzx6PaCtzPfk +ovKaMZmVqXdIOi7FsFrBqOaZikX88ocWiNmjaSpthcqyUHehh8eejmHGtcQqwdL N4In+E0Sxl67ws1rd77lZHZbfGsjN6w= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:from:in-reply-to:message-id:mime-version:references :subject:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm2; bh=Z85MJ9xxlAmYzG2jWGNy9SG3vq4lvVeG0F3Ch6SIM C0=; b=Y0E7OKVTbRiuK9/N+N4c8RrQiQdszqt0bpyAHgs+5NPVUAFEowFLVJ5MO 6gV+dwMHORBm2xJxOXYr6vrPxqM5mS4xIK1OtyKS8HdWE3dQ1LljHbS1n4TtpnjN dPMX1ZeO0Br+9qM99zmmQsgccD5TmqBXYftGKDpQx+fp4BHRoTdPsMMIafS3VFCy HkI5cHh1wTMnt5dFsPGzDULgFIoBMBywi0xwXVaWueMZ+ok+ijt8I+uEgLJTcRsz sM/dD24nwIC34WqO0UukSrPq8Obq9Enm7hando8uRVwLc2kOCf9NDOeMrPvXktkK ZePEoG1MEDk6xz0AwMGjTxUVfG8bQ== X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedutddrfedugdduuddtucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne goufhushhpvggtthffohhmrghinhculdegledmnecujfgurhephffvufffkffojghfgggt gfesthekredtredtjeenucfhrhhomhepfdfgmhhilhhiohcuifdrucevohhtrgdfuceotg hothgrsegsrhgrrghprdhorhhgqeenucffohhmrghinhepihhmghhurhdrtghomhenucfk phepuddvkedrheelrddvtddrvdduieenucfrrghrrghmpehmrghilhhfrhhomheptghoth grsegsrhgrrghprdhorhhgnecuvehluhhsthgvrhfuihiivgeptd X-ME-Proxy: Received: from localhost (flamenco.cs.columbia.edu [128.59.20.216]) by mail.messagingengine.com (Postfix) with ESMTPA id 7AE4F10319; Mon, 4 Mar 2019 13:18:31 -0500 (EST) From: "Emilio G. Cota" To: qemu-devel@nongnu.org Date: Mon, 4 Mar 2019 13:18:13 -0500 Message-Id: <20190304181813.8075-74-cota@braap.org> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190304181813.8075-1-cota@braap.org> References: <20190304181813.8075-1-cota@braap.org> MIME-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 64.147.123.25 Subject: [Qemu-devel] [PATCH v7 73/73] cputlb: queue async flush jobs without the BQL X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?utf-8?q?Alex_Benn=C3=A9e?= , Richard Henderson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP This yields sizable scalability improvements, as the below results show. Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell) Workload: Ubuntu 18.04 ppc64 compiling the linux kernel with "make -j N", where N is the number of cores in the guest. Speedup vs a single thread (higher is better): 14 +---------------------------------------------------------------+ | + + + + + + $$$$$$ + | | $$$$$ | | $$$$$$ | 12 |-+ $A$$ +-| | $$ | | $$$ | 10 |-+ $$ ##D#####################D +-| | $$$ #####**B**************** | | $$####***** ***** | | A$#***** B | 8 |-+ $$B** +-| | $$** | | $** | 6 |-+ $$* +-| | A** | | $B | | $ | 4 |-+ $* +-| | $ | | $ | 2 |-+ $ +-| | $ +cputlb-no-bql $$A$$ | | A +per-cpu-lock ##D## | | + + + + + + baseline **B** | 0 +---------------------------------------------------------------+ 1 4 8 12 16 20 24 28 Guest vCPUs png: https://imgur.com/zZRvS7q Some notes: - baseline corresponds to the commit before this series - per-cpu-lock is the commit that converts the CPU loop to per-cpu locks. - cputlb-no-bql is this commit. - I'm using taskset to assign cores to threads, favouring locality whenever possible but not using SMT. When N=1, I'm using a single host core, which leads to superlinear speedups (since with more cores the I/O thread can execute while vCPU threads sleep). In the future I might use N+1 host cores for N guest cores to avoid this, or perhaps pin guest threads to cores one-by-one. Single-threaded performance is affected very lightly. Results below for debian aarch64 bootup+test for the entire series on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host: - Before: Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs): 7269.033478 task-clock (msec) # 0.998 CPUs utilized ( +- 0.06% ) 30,659,870,302 cycles # 4.218 GHz ( +- 0.06% ) 54,790,540,051 instructions # 1.79 insns per cycle ( +- 0.05% ) 9,796,441,380 branches # 1347.695 M/sec ( +- 0.05% ) 165,132,201 branch-misses # 1.69% of all branches ( +- 0.12% ) 7.287011656 seconds time elapsed ( +- 0.10% ) - After: 7375.924053 task-clock (msec) # 0.998 CPUs utilized ( +- 0.13% ) 31,107,548,846 cycles # 4.217 GHz ( +- 0.12% ) 55,355,668,947 instructions # 1.78 insns per cycle ( +- 0.05% ) 9,929,917,664 branches # 1346.261 M/sec ( +- 0.04% ) 166,547,442 branch-misses # 1.68% of all branches ( +- 0.09% ) 7.389068145 seconds time elapsed ( +- 0.13% ) That is, a 1.37% slowdown. Reviewed-by: Alex Bennée Reviewed-by: Richard Henderson Tested-by: Alex Bennée Signed-off-by: Emilio G. Cota --- accel/tcg/cputlb.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index 88cc8389e9..d9e0814b5c 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -260,7 +260,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn, CPU_FOREACH(cpu) { if (cpu != src) { - async_run_on_cpu(cpu, fn, d); + async_run_on_cpu_no_bql(cpu, fn, d); } } } @@ -336,8 +336,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap) tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap); if (cpu->created && !qemu_cpu_is_self(cpu)) { - async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work, - RUN_ON_CPU_HOST_INT(idxmap)); + async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work, + RUN_ON_CPU_HOST_INT(idxmap)); } else { tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap)); } @@ -481,8 +481,8 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap) addr_and_mmu_idx |= idxmap; if (!qemu_cpu_is_self(cpu)) { - async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_work, - RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx)); + async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_work, + RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx)); } else { tlb_flush_page_by_mmuidx_async_work( cpu, RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));