Message ID: 20230117102930.1053337-1-usama.arif@bytedance.com
Series:     KVM: arm64: implement vcpu_is_preempted check
On 17/01/2023 10:29, Usama Arif wrote:
> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> to check if a vCPU was scheduled out, which is useful to know in case it was
> holding a lock. vcpu_is_preempted is well integrated in core kernel code and can
> be used to improve performance in locking (owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (available_idle_cpu, which is used in several places in kernel/sched/fair.c,
> e.g. in wake_affine to determine which CPU can run soonest).
>
> This patchset shows significant improvement on overcommitted hosts (vCPUs > pCPUs),
> as waiting for preempted vCPUs reduces performance.
>

Hi,

Just wanted to check if there are any comments for this?

Thanks,
Usama

> If merged, vcpu_is_preempted could also be used to optimize IPI performance (along
> with directed yield to the target IPI vCPU), similar to how it is done in x86
> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/).
>
> All the results in the experiments below were gathered on an AWS r6g.metal instance,
> which has 64 pCPUs.
>
> The following table shows the UnixBench index results on a 128 vCPU VM
> with (6.0+vcpu_is_preempted) and without (6.0 base) the patchset:
>
> TestName                               6.0 base  6.0+vcpu_is_preempted  % improvement
> Dhrystone 2 using register variables   187761    191274.7                    1.87
> Double-Precision Whetstone             96743.6   98414.4                     1.73
> Execl Throughput                       689.3     10426                    1412.55
> File Copy 1024 bufsize 2000 maxblocks  549.5     3165                      475.98
> File Copy 256 bufsize 500 maxblocks    400.7     2084.7                    420.26
> File Copy 4096 bufsize 8000 maxblocks  894.3     5003.2                    459.45
> Pipe Throughput                        76819.5   78601.5                     2.32
> Pipe-based Context Switching           3444.8    13414.5                   289.41
> Process Creation                       301.1     293.4                      -2.56
> Shell Scripts (1 concurrent)           1248.1    28300.6                  2167.49
> Shell Scripts (8 concurrent)           781.2     26222.3                  3256.67
> System Call Overhead                   3426      3729.4                      8.86
>
> System Benchmarks Index Score          3053      11534                     277.79
>
> This shows a 278% overall improvement with these patches.
>
> The biggest improvement is in the shell scripts benchmark, which forks a lot of
> processes and therefore acquires an rwsem (the mmap lock, as the call stack
> below shows) where a large chunk of time is spent in the base kernel. This can
> be seen in one of the call stacks from the perf output of the shell scripts
> benchmark on base (pseudo-NMI enabled for the perf numbers below):
>
> - 33.79% el0_svc
>    - 33.43% do_el0_svc
>       - 33.43% el0_svc_common.constprop.3
>          - 33.30% invoke_syscall
>             - 17.27% __arm64_sys_clone
>                - 17.27% __do_sys_clone
>                   - 17.26% kernel_clone
>                      - 16.73% copy_process
>                         - 11.91% dup_mm
>                            - 11.82% dup_mmap
>                               - 9.15% down_write
>                                  - 8.87% rwsem_down_write_slowpath
>                                     - 8.48% osq_lock
>
> Just under 50% of the total time in the shell scripts benchmark ends up being
> spent in osq_lock in the base kernel:
>
> Children  Self    Command  Shared Object      Symbol
>   17.19%  10.71%  sh       [kernel.kallsyms]  [k] osq_lock
>    6.17%   4.04%  sort     [kernel.kallsyms]  [k] osq_lock
>    4.20%   2.60%  multi.   [kernel.kallsyms]  [k] osq_lock
>    3.77%   2.47%  grep     [kernel.kallsyms]  [k] osq_lock
>    3.50%   2.24%  expr     [kernel.kallsyms]  [k] osq_lock
>    3.41%   2.23%  od       [kernel.kallsyms]  [k] osq_lock
>    3.36%   2.15%  rm       [kernel.kallsyms]  [k] osq_lock
>    3.28%   2.12%  tee      [kernel.kallsyms]  [k] osq_lock
>    3.16%   2.02%  wc       [kernel.kallsyms]  [k] osq_lock
>    0.21%   0.13%  looper   [kernel.kallsyms]  [k] osq_lock
>    0.01%   0.00%  Run      [kernel.kallsyms]  [k] osq_lock
>
> and this comes down to less than 1% in total with the 6.0+vcpu_is_preempted kernel:
>
> Children  Self    Command  Shared Object      Symbol
>    0.26%   0.21%  sh       [kernel.kallsyms]  [k] osq_lock
>    0.10%   0.08%  multi.   [kernel.kallsyms]  [k] osq_lock
>    0.04%   0.04%  sort     [kernel.kallsyms]  [k] osq_lock
>    0.02%   0.01%  grep     [kernel.kallsyms]  [k] osq_lock
>    0.02%   0.02%  od       [kernel.kallsyms]  [k] osq_lock
>    0.01%   0.01%  tee      [kernel.kallsyms]  [k] osq_lock
>    0.01%   0.00%  expr     [kernel.kallsyms]  [k] osq_lock
>    0.01%   0.01%  looper   [kernel.kallsyms]  [k] osq_lock
>    0.00%   0.00%  wc       [kernel.kallsyms]  [k] osq_lock
>    0.00%   0.00%  rm       [kernel.kallsyms]  [k] osq_lock
>
> To make sure there is no change in performance when vCPUs < pCPUs, UnixBench
> was also run on a 32 vCPU VM. The kernel with vcpu_is_preempted implemented
> performed 0.9% better overall than the base kernel, and the individual
> benchmarks were within +/-2% of 6.0 base. Hence the patches have no negative
> effect when vCPUs < pCPUs.
>
> The respective QEMU change to test this is at
> https://github.com/uarif1/qemu/commit/2da2c2927ae8de8f03f439804a0dad9cf68501b6.
>
> Looking forward to your response!
> Thanks,
> Usama
>
> ---
> v2->v3
> - Updated the patchset from 6.0 to 6.2-rc3
> - Made pv_lock_init an early_initcall
> - Improved documentation
> - Changed pvlock_vcpu_state to an aligned struct
> - Minor improvements
>
> RFC->v2
> - Fixed table and code referencing in pvlock documentation
> - Switched to using a single hypercall similar to ptp_kvm and made the check
>   for has_kvm_pvlock simpler
>
> Usama Arif (6):
>   KVM: arm64: Document PV-lock interface
>   KVM: arm64: Add SMCCC paravirtualised lock calls
>   KVM: arm64: Support pvlock preempted via shared structure
>   KVM: arm64: Provide VCPU attributes for PV lock
>   KVM: arm64: Support the VCPU preemption check
>   KVM: selftests: add tests for PV time specific hypercall
>
>  Documentation/virt/kvm/arm/hypercalls.rst  |   3 +
>  Documentation/virt/kvm/arm/index.rst       |   1 +
>  Documentation/virt/kvm/arm/pvlock.rst      |  54 +++++++++
>  Documentation/virt/kvm/devices/vcpu.rst    |  25 ++++
>  arch/arm64/include/asm/kvm_host.h          |  25 ++++
>  arch/arm64/include/asm/paravirt.h          |   2 +
>  arch/arm64/include/asm/pvlock-abi.h        |  15 +++
>  arch/arm64/include/asm/spinlock.h          |  16 ++-
>  arch/arm64/include/uapi/asm/kvm.h          |   3 +
>  arch/arm64/kernel/paravirt.c               | 113 ++++++++++++++++++
>  arch/arm64/kvm/Makefile                    |   2 +-
>  arch/arm64/kvm/arm.c                       |   8 ++
>  arch/arm64/kvm/guest.c                     |   9 ++
>  arch/arm64/kvm/hypercalls.c                |   8 ++
>  arch/arm64/kvm/pvlock.c                    | 100 ++++++++++++++++
>  include/linux/arm-smccc.h                  |   8 ++
>  include/uapi/linux/kvm.h                   |   2 +
>  tools/arch/arm64/include/uapi/asm/kvm.h    |   1 +
>  tools/include/linux/arm-smccc.h            |   8 ++
>  .../selftests/kvm/aarch64/hypercalls.c     |   2 +
>  20 files changed, 403 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
>  create mode 100644 arch/arm64/include/asm/pvlock-abi.h
>  create mode 100644 arch/arm64/kvm/pvlock.c
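For context, a minimal sketch of the guest-side mechanism the cover letter
describes: the host publishes a per-vCPU "preempted" flag in a structure shared
with the guest, and vcpu_is_preempted() reduces to a plain memory read of that
flag. The struct name pvlock_vcpu_state comes from the v3 changelog above; the
exact field layout and the helper names pvlock_vcpu_region and
pv_vcpu_is_preempted are assumptions for illustration, not necessarily the
ABI of the series:

	#include <linux/compiler.h>
	#include <linux/percpu.h>
	#include <asm/byteorder.h>

	/* Assumed layout: one 64-bit flag padded out to a cache line,
	 * per the v3 note "Changed pvlock_vcpu_state to aligned struct". */
	struct pvlock_vcpu_state {
		__le64 preempted;	/* non-zero while the host has this vCPU scheduled out */
		u8 reserved[56];	/* pad to 64 bytes */
	} __aligned(64);

	static DEFINE_PER_CPU(struct pvlock_vcpu_state, pvlock_vcpu_region) __aligned(64);

	/* Guest-side check: a single load, cheap enough for lock spin loops. */
	static bool pv_vcpu_is_preempted(unsigned int cpu)
	{
		struct pvlock_vcpu_state *st = &per_cpu(pvlock_vcpu_region, cpu);

		return !!le64_to_cpu(READ_ONCE(st->preempted));
	}

Core code then consumes this through existing helpers such as owner_on_cpu(),
so optimistic spinners stop burning cycles once the lock holder's vCPU is
flagged as preempted, which is where the osq_lock samples above disappear.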
On Tue, 14 Feb 2023 16:06:26 +0000, Usama Arif <usama.arif@bytedance.com> wrote:
>
> On 17/01/2023 10:29, Usama Arif wrote:
> > This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> > to check if a vCPU was scheduled out, which is useful to know in case it was
> > holding a lock. vcpu_is_preempted is well integrated in core kernel code and can
> > be used to improve performance in locking (owner_on_cpu usage in mutex_spin_on_owner,
> > mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> > (available_idle_cpu, which is used in several places in kernel/sched/fair.c,
> > e.g. in wake_affine to determine which CPU can run soonest).
> >
> > This patchset shows significant improvement on overcommitted hosts (vCPUs > pCPUs),
> > as waiting for preempted vCPUs reduces performance.
>
> Hi,
>
> Just wanted to check if there are any comments for this?

Not a lot, I'm afraid. My concerns with this thing are still the same:

- it is KVM-specific
- it doesn't work with nested virtualisation
- its correctness is unproven on arm64

I'm also not going to entertain any of this without the core arm64
maintainers saying that they will enable this.

Thanks,

	M.
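Marc's first point, that the interface is KVM-specific, follows from how such
a service would be discovered: through KVM's vendor-specific hypercall range,
the same path the cover letter cites for ptp_kvm. A hedged sketch of the
probe, using the pv_lock_init/early_initcall shape from the v3 changelog;
ARM_SMCCC_KVM_FUNC_PV_LOCK is an assumed constant, not an upstream definition:

	#include <linux/arm-smccc.h>
	#include <linux/init.h>

	static int __init pv_lock_init(void)
	{
		/* kvm_arm_hyp_service_available() only reports services
		 * advertised via KVM's vendor hypercall range, so only KVM
		 * guests can ever discover the feature; that is the
		 * "KVM-specific" part of the concern.
		 * ARM_SMCCC_KVM_FUNC_PV_LOCK is assumed here, following the
		 * ptp_kvm precedent. */
		if (!kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_PV_LOCK))
			return 0;

		/* ...register the shared pvlock_vcpu_state pages with the
		 * host and install the vcpu_is_preempted() hook here... */
		return 0;
	}
	early_initcall(pv_lock_init);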