From patchwork Mon Jul 24 10:41:54 2023
X-Patchwork-Submitter: "Zhang, Xiong Y"
X-Patchwork-Id: 13324370
From: Xiong Zhang
To: kvm@vger.kernel.org
Cc: seanjc@google.com, like.xu.linux@gmail.com, weijiang.yang@intel.com,
 zhiyuan.lv@intel.com, zhenyu.z.wang@intel.com, kan.liang@intel.com,
 Xiong Zhang
Subject: [PATCH] Documentation: KVM: Add vPMU implementation and gap document
Date: Mon, 24 Jul 2023 18:41:54 +0800
Message-Id: <20230724104154.259573-1-xiong.y.zhang@intel.com>
X-Mailing-List: kvm@vger.kernel.org

Add a vPMU implementation and gap document explaining the vArch PMU and
vLBR implementation in KVM, and especially the current gaps in
supporting coexisting host and guest perf events.

Signed-off-by: Xiong Zhang
---
 Documentation/virt/kvm/x86/index.rst |   1 +
 Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 Documentation/virt/kvm/x86/pmu.rst

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@ KVM for x86 systems
    mmu
    msr
    nested-vmx
+   pmu
    running-nested-guests
    timekeeping

diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..e95e8c88e0e0
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,249 @@
.. SPDX-License-Identifier: GPL-2.0

==========================
PMU virtualization for X86
==========================

:Author: Xiong Zhang
:Copyright: (c) 2023, Intel. All rights reserved.

.. Contents

1. Overview
2. Perf Scheduler
3. Arch PMU virtualization
4. LBR virtualization

1.
Overview
===========

KVM has supported PMU virtualization on x86 for many years, providing
an MSR-based Arch PMU interface to the guest. The major features
include Arch PMU v2, LBR and PEBS. Users can profile performance in the
guest in the same way as on the host.

KVM is a normal perf subsystem user, just like any other. When the
guest accesses vPMU MSRs, KVM traps the access and creates a perf event
for it. This perf event takes part in perf scheduling to request PMU
resources, and the guest then uses the resources it obtains.

This document describes the x86 PMU virtualization architecture design
and its open issues. It is organized as follows: the next section
describes the Linux perf scheduler in more detail, since it plays a key
role in the vPMU implementation and allocates PMU resources for guest
use. Then Arch PMU virtualization and LBR virtualization are
introduced; each feature has sections covering an implementation
overview and the expectations and gaps when host and guest perf events
coexist.

2. Perf Scheduler
=================

The perf scheduler is responsible for choosing which events are active
at any given moment and for binding counters to perf events. Since a
processor has a limited number of PMU counters and related resources,
only a limited number of perf events can be active at once; an inactive
perf event may become active at the next opportunity. The perf
scheduler defines rules to control this.

The following cases usually cause a perf event reschedule:

1) A context switch from one task to a different task.
2) An event is manually enabled.
3) A call to perf_event_open() with the disabled field of the
perf_event_attr argument set to 0.

When a perf event reschedule is needed on a specific CPU, perf sends an
IPI to the target CPU. The IPI handler activates events ordered by
event type, iterating over all eligible events.

When a perf event is scheduled out, the counter mapped to the event is
disabled, and the counter's configuration and count value are saved.
When a perf event is scheduled in, the perf driver assigns a counter to
the event, and the counter's configuration and count value are restored
from the last save.

Perf defines four types of events, listed from highest to lowest
priority:

a. Per-cpu pinned: the event should be measured on the specified
   logical core whenever it is enabled.
b. Per-process pinned: the event should be measured whenever it is
   enabled and the process is running on any logical core.
c. Per-cpu flexible: the event should be measured on the specified
   logical core whenever it is enabled.
d. Per-process flexible: the event should be measured whenever it is
   enabled and the process is running on any logical core.

If an event cannot be scheduled because no resource is available for
it, the outcome depends on its type: a pinned event goes into the error
state and is excluded from perf scheduling, and the only way to recover
it is to re-enable it; a flexible event goes into the inactive state
and can be multiplexed with other events if needed.

3. Arch PMU virtualization
==========================

3.1. Overview
-------------

Once KVM/QEMU exposes the vCPU's Arch PMU capability to the guest, the
guest PMU driver accesses the Arch PMU MSRs (including fixed and
general-purpose counters) just as the host does. All guest accesses to
Arch PMU MSRs are intercepted.

When a guest virtual counter is enabled through a guest MSR write, the
KVM trap handler creates a kvm perf event through the perf subsystem.
The kvm perf event's attributes are derived from the guest virtual
counter's MSR settings.

When the guest later changes the virtual counter's settings, the KVM
trap handler releases the old kvm perf event and creates a new one with
the new settings.

When the guest reads the virtual counter's value, the KVM trap handler
reads the kvm perf event's counter value and accumulates it into the
previously saved counter value.
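The read path described above (previously saved value plus the live kvm
perf event count) can be sketched as a small model. This is illustrative
Python, not KVM's actual code, and all names are hypothetical:

```python
class VirtualCounter:
    """Toy model of a guest virtual counter backed by kvm perf events.

    Each reconfiguration releases the backing perf event, so the final
    count of the old event must be folded into a saved value to keep
    the guest-visible counter accumulating monotonically.
    """

    def __init__(self):
        self.prev_count = 0   # accumulated counts from released events
        self.event_count = 0  # count of the current backing perf event

    def on_config_change(self):
        # The guest rewrote the counter's config MSR: fold the old kvm
        # perf event's final count into the saved value, then the new
        # event starts counting from zero.
        self.prev_count += self.event_count
        self.event_count = 0

    def on_guest_read(self):
        # Guest counter read: live perf event count plus everything
        # accumulated by earlier, already-released events.
        return self.prev_count + self.event_count
```

For example, if the first backing event counted 100 before a
reconfiguration and the new event has counted 50, the guest reads 150.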
When the guest no longer accesses the virtual counter's MSRs within a
scheduling time slice and the virtual counter is disabled, KVM releases
the kvm perf event.

::

    ----------------------------
    |          Guest           |
    |      perf subsystem      |
    ----------------------------
         |            ^
    vMSR |            | vPMI
         v            |
    ----------------------------
    |    vPMU   KVM   vCPU     |
    ----------------------------
         |            ^
    Call |            | Callbacks
         v            |
    ---------------------------
    |    Host Linux Kernel    |
    |     perf subsystem      |
    ---------------------------
         |            ^
     MSR |            | PMI
         v            |
        --------------------
        |    PMU   CPU     |
        --------------------

Each guest virtual counter has a corresponding kvm perf event, which
joins the host perf scheduler and complies with its scheduling rules.
When the kvm perf event is scheduled by the host perf scheduler and is
active, the guest virtual counter supplies correct values. However, if
another host perf event comes in and takes over the kvm perf event's
resources, the kvm perf event becomes inactive, and the virtual counter
then supplies wrong, meaningless values.

3.2. Host and Guest perf event contention
-----------------------------------------

The kvm perf event is a per-process pinned event, which is the
second-highest priority type. When the kvm perf event is active, it can
be preempted by a host per-cpu pinned perf event, and it can preempt
host flexible perf events. Such preemption can be temporarily
prohibited by disabling host IRQs.

According to the perf scheduler rules, the following results are
expected when host and guest perf events coexist:

1) If host per-cpu pinned events occupy all the HW resources, the kvm
perf event cannot become active since no resource is available, and the
virtual counter value is always zero when the guest reads it.
2)
If host per-cpu pinned events release HW resources while the kvm perf
event is inactive, the kvm perf event can claim the resources and
become active. The guest then gets correct values from the virtual
counter while the kvm perf event is active, but the guest's total count
is incorrect, since counts were lost while the kvm perf event was
inactive.
3) If the kvm perf event is active and a host per-cpu pinned perf event
then becomes active and reclaims the kvm perf event's resources, the
kvm perf event becomes inactive. The virtual counter keeps the
previously saved value unchanged when the guest reads it, so the
guest's total count is incorrect.
4) If host flexible perf events occupy all the HW resources, the kvm
perf event can still become active by preempting a host flexible perf
event's resources, and the guest gets correct values from the virtual
counter.
5) If the kvm perf event is active and other host flexible perf events
then request to become active, the kvm perf event still owns its
resources and stays active, so the guest gets correct values from the
virtual counter.

3.3. vPMU Arch Gaps
-------------------

The coexistence of host and guest perf events has the following gaps:

1) When the guest accesses PMU MSRs for the first time, KVM traps the
access and creates a kvm perf event, but this event may be inactive
because of contention with host perf events. The guest does not notice
this, and when it reads the virtual counter the returned value is zero.
2) When the kvm perf event is active, a host per-cpu pinned perf event
can reclaim the kvm perf event's resources at any time once resource
contention happens. The guest does not notice this either, and its
subsequent counter accesses return wrong data.

Because of these gaps, the mailing list discussed the topic under the
title "Reconsider the current approach of vPMU".
https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/

The major suggestion in that discussion is that the host pass some
counters through to the guest, but this suggestion is not feasible for
the following reasons:

a. The processor has several counters, but the counters are not equal:
   some events must be bound to a specific counter.
b. If a special counter is passed through to the guest, the host can no
   longer support the events bound to it and loses some capability.
c. If a normal counter is passed through to the guest, the guest can
   support only general events, so the guest has limited capability.

So both host and guest lose capability in pass-through mode.

4. LBR Virtualization
=====================

4.1. Overview
-------------

Once KVM/QEMU exposes the vCPU's LBR capability to the guest, the guest
LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR and the
record MSRs) just as the host does. The guest's first access to an
LBR-related MSR is always intercepted. The KVM trap handler creates a
vLBR perf event which enables the callstack mode and has none of the
hardware counters assigned. Host perf enables and schedules this event
as usual.

When the vLBR event is scheduled by the host perf scheduler and is
active, the host LBR MSRs are owned by the guest and passed through to
it, and the guest accesses them without VM-Exits. However, if another
host LBR event comes in and takes over the LBR facility, the vLBR event
becomes inactive, and the guest's subsequent accesses to the LBR MSRs
are trapped and meaningless.

Like the kvm perf event, the vLBR event is released when the guest does
not access LBR-related MSRs within a scheduling time slice and has
cleared the LBR enable bit; the pass-through state of the LBR MSRs is
then canceled.

4.2. Host and Guest LBR contention
----------------------------------

The vLBR event is a per-process pinned event, which is the
second-highest priority type.
The vLBR event contends for the LBR facility together with other host
LBR events. According to the perf scheduler rules, when the vLBR event
is active, it can be preempted by a host per-cpu pinned LBR event, and
it can preempt host flexible LBR events. Such preemption can be
temporarily prohibited by disabling host IRQs, since the perf scheduler
uses an IPI to change the LBR owner.

The following results are expected when host and guest LBR events
coexist:

1) If a host per-cpu pinned LBR event is active when the VM starts, the
guest vLBR event cannot preempt the LBR facility, so the guest cannot
use the LBR.
2) If host flexible LBR events are active when the VM starts, the guest
vLBR event can preempt the LBR facility, so the guest can use the LBR.
3) If a host per-cpu pinned LBR event becomes enabled while the guest
vLBR event is active, the guest vLBR event loses the LBR facility and
the guest cannot use the LBR anymore.
4) If a host flexible LBR event becomes enabled while the guest vLBR
event is active, the guest vLBR event keeps the LBR facility and the
guest can still use the LBR.
5) If a host per-cpu pinned LBR event becomes inactive while the guest
vLBR event is inactive, the guest vLBR event can become active and own
the LBR facility, so the guest can use the LBR.

4.3. vLBR Arch Gaps
-------------------

As with the vPMU Arch gaps, the vLBR event can be preempted by a host
per-cpu pinned event at any time, or may be inactive from the moment it
is created, but the guest cannot notice this, so the guest reads
meaningless values while the vLBR event is inactive.
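The contention outcomes listed in sections 3.2 and 4.2 follow directly
from the event-type priority order of section 2. They can be captured
in a toy model; this is illustrative Python, not the perf scheduler,
and the names are hypothetical (ties between events of the same type
are ignored):

```python
# Perf event types in scheduling priority order (high to low), as
# listed in section 2 of this document.
PRIORITY = {
    "cpu_pinned": 0,        # per-cpu pinned
    "process_pinned": 1,    # per-process pinned: kvm perf / vLBR events
    "cpu_flexible": 2,      # per-cpu flexible
    "process_flexible": 3,  # per-process flexible
}


def guest_event_stays_active(incoming: str) -> bool:
    """Whether an active kvm perf event or vLBR event (per-process
    pinned) keeps its hardware resource when a host event of type
    `incoming` contends for it.

    Toy model of the expected outcomes: only a strictly
    higher-priority host event can preempt the guest's event.
    """
    return PRIORITY[incoming] > PRIORITY["process_pinned"]
```

Under this model, an incoming host per-cpu pinned event preempts the
guest's event (cases 3 above and in 3.2), while incoming host flexible
events do not (cases 4 and 5).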