[v10,73/73] cputlb: queue async flush jobs without the BQL

From: "Emilio G. Cota" <cota@braap.org>

This yields sizable scalability improvements, as the results below show.

Host: Two Intel Xeon Silver 4114 20-core CPUs at 2.20 GHz

VM: Ubuntu 18.04 ppc64

                   Speedup vs a single thread for kernel build                  

  7 +-----------------------------------------------------------------------+  
    |         +          +         +         +         +          +         |  
    |                                    ###########       baseline ******* |  
    |                               #####           ####   cpu lock ####### |  
    |                             ##                    ####                |  
  6 |-+                         ##                          ##            +-|  
    |                         ##                              ####          |  
    |                       ##                                    ###       |  
    |                     ##        *****                            #      |  
    |                   ##      ****     ***                          #     |  
    |                 ##     ***            *                               |  
  5 |-+             ##    ***                ****                         +-|  
    |              #  ****                       **                         |  
    |             # **                             **                       |  
    |             #*                                 **                     |  
    |          #*                                          **               |  
    |         #*                                             *              |  
    |         #                                               ******        |  
    |        #                                                      **      |  
    |       #                                                         *     |  
  3 |-+     #                                                             +-|  
    |      #                                                                |  
    |      #                                                                |  
    |     #                                                                 |  
    |     #                                                                 |  
  2 |-+  #                                                                +-|  
    |    #                                                                  |  
    |   #                                                                   |  
    |   #                                                                   |  
    |  #                                                                    |  
    |  #      +          +         +         +         +          +         |  
  1 +-----------------------------------------------------------------------+  
    0         5          10        15        20        25         30        35  
                                   Guest vCPUs  

The charts are also available here:
https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

Some notes:
- "baseline" is the commit immediately before this series
- "cpu lock" is this series applied

Single-threaded performance is only slightly affected. The results
below are for a Debian aarch64 bootup+test with the entire series
applied, on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )

       7.287011656 seconds time elapsed                                          ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )

       7.389068145 seconds time elapsed                                          ( +-  0.13% )

That is, a slowdown of roughly 1.4% (7.287 s vs. 7.389 s elapsed).
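
The slowdown can be cross-checked from the counters quoted above. The
following is a minimal sketch in Python, not part of the series: the
dictionary names and the helper relative_change() are illustrative
assumptions, and the values are copied from the two perf stat runs.

```python
# Values copied from the "Before" and "After" perf stat outputs above.
before = {
    "task_clock_ms": 7269.033478,
    "cycles": 30_659_870_302,
    "instructions": 54_790_540_051,
    "elapsed_s": 7.287011656,
}
after = {
    "task_clock_ms": 7375.924053,
    "cycles": 31_107_548_846,
    "instructions": 55_355_668_947,
    "elapsed_s": 7.389068145,
}


def relative_change(old, new):
    """Return the relative change from old to new (0.01 means +1%)."""
    return (new - old) / old


# Relative change of every metric between the two runs.
changes = {name: relative_change(before[name], after[name]) for name in before}

for name, delta in changes.items():
    print(f"{name:>14}: {delta:+.2%}")
```

Comparing several counters rather than elapsed time alone may help
separate extra executed work (instructions) from timing noise.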

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
[Updated the speedup chart results for re-based series.]
Signed-off-by: Robert Foley <robert.foley@linaro.org>
---
 accel/tcg/cputlb.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Message ID	20200617210231.4393-74-robert.foley@linaro.org (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=E1zC=76=nongnu.org=qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C21CA2088E
From: Robert Foley <robert.foley@linaro.org>
To: qemu-devel@nongnu.org
Subject: [PATCH v10 73/73] cputlb: queue async flush jobs without the BQL
Date: Wed, 17 Jun 2020 17:02:31 -0400
Message-Id: <20200617210231.4393-74-robert.foley@linaro.org>
In-Reply-To: <20200617210231.4393-1-robert.foley@linaro.org>
References: <20200617210231.4393-1-robert.foley@linaro.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=2607:f8b0:4864:20::742; envelope-from=robert.foley@linaro.org; helo=mail-qk1-x742.google.com
Precedence: list
Cc: robert.foley@linaro.org, cota@braap.org, Paolo Bonzini <pbonzini@redhat.com>, peter.puhov@linaro.org, alex.bennee@linaro.org, Richard Henderson <rth@twiddle.net>
Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
Series	per-CPU locks
