Documentation: KVM: Add vPMU implementaion and gap document

Message ID

20230724104154.259573-1-xiong.y.zhang@intel.com (mailing list archive)

State

New, archived

Headers

From: Xiong Zhang <xiong.y.zhang@intel.com>
To: kvm@vger.kernel.org
Cc: seanjc@google.com, like.xu.linux@gmail.com,
        weijiang.yang@intel.com, zhiyuan.lv@intel.com,
        zhenyu.z.wang@intel.com, kan.liang@intel.com,
        Xiong Zhang <xiong.y.zhang@intel.com>
Subject: [PATCH] Documentation: KVM: Add vPMU implementaion and gap document
Date: Mon, 24 Jul 2023 18:41:54 +0800
Message-Id: <20230724104154.259573-1-xiong.y.zhang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Documentation: KVM: Add vPMU implementaion and gap document

Commit Message

Zhang, Xiong Y July 24, 2023, 10:41 a.m. UTC

Add a vPMU implementation and gap document to explain vArch PMU and vLBR
implementation in kvm, especially the current gap to support host and
guest perf event coexist.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
---
 Documentation/virt/kvm/x86/index.rst |   1 +
 Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 Documentation/virt/kvm/x86/pmu.rst

Comments

kernel test robot July 25, 2023, 1:48 p.m. UTC | #1

Hi Xiong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on mst-vhost/linux-next linus/master v6.5-rc3 next-20230725]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiong-Zhang/Documentation-KVM-Add-vPMU-implementaion-and-gap-document/20230724-184443
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20230724104154.259573-1-xiong.y.zhang%40intel.com
patch subject: [PATCH] Documentation: KVM: Add vPMU implementaion and gap document
reproduce: (https://download.01.org/0day-ci/archive/20230725/202307252116.sD1ngIZF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307252116.sD1ngIZF-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Documentation/virt/kvm/x86/pmu.rst:104: WARNING: Unexpected indentation.
>> Documentation/virt/kvm/x86/pmu.rst:104: WARNING: Unexpected section title or transition.

vim +104 Documentation/virt/kvm/x86/pmu.rst

   100	
   101	When guest no longer access the virtual counter's MSR within a
   102	scheduling time slice and the virtual counter is disabled, KVM will
   103	release the kvm perf event.
 > 104	  ----------------------------
   105	  |  Guest                   |
   106	  |  perf subsystem          |
   107	  ----------------------------
   108	       |            ^
   109	  vMSR |            | vPMI
   110	       v            |
   111	  ----------------------------
   112	  |  vPMU        KVM vCPU    |
   113	  ----------------------------
   114	        |          ^
   115	  Call  |          | Callbacks
   116	        v          |
   117	  ---------------------------
   118	  | Host Linux Kernel       |
   119	  | perf subsystem          |
   120	  ---------------------------
   121	               |       ^
   122	           MSR |       | PMI
   123	               v       |
   124	         --------------------
   125		 | PMU        CPU   |
   126	         --------------------
   127

Yang, Weijiang July 26, 2023, 8:01 a.m. UTC | #2

On 7/24/2023 6:41 PM, Xiong Zhang wrote:
> Add a vPMU implementation and gap document to explain vArch PMU and vLBR
> implementation in kvm, especially the current gap to support host and
> guest perf event coexist.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> ---
>   Documentation/virt/kvm/x86/index.rst |   1 +
>   Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
>   2 files changed, 250 insertions(+)
>   create mode 100644 Documentation/virt/kvm/x86/pmu.rst
>
> diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> index 9ece6b8dc817..02c1c7b01bf3 100644
> --- a/Documentation/virt/kvm/x86/index.rst
> +++ b/Documentation/virt/kvm/x86/index.rst
> @@ -14,5 +14,6 @@ KVM for x86 systems
>      mmu
>      msr
>      nested-vmx
> +   pmu
>      running-nested-guests
>      timekeeping
> diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
> new file mode 100644
> index 000000000000..e95e8c88e0e0
> --- /dev/null
> +++ b/Documentation/virt/kvm/x86/pmu.rst
> @@ -0,0 +1,249 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +PMU virtualization for X86
> +==========================
> +
> +:Author: Xiong Zhang <xiong.y.zhang@intel.com>
> +:Copyright: (c) 2023, Intel.  All rights reserved.
> +
> +.. Contents
> +
> +1. Overview
> +2. Perf Scheduler
> +3. Arch PMU virtualization
> +4. LBR virtualization
> +
> +1. Overview
> +===========
> +
> +KVM has supported PMU virtualization on x86 for many years and provides
> +MSR based Arch PMU interface to the guest. The major features include
> +Arch PMU v2, LBR and PEBS. Users have the same operation to profile
> +performance in guest and host.
> +KVM is a normal perf subsystem user as other perf subsystem users. When
> +the guest access vPMU MSRs, KVM traps it and creates a perf event for it.
> +This perf event takes part in perf scheduler to request PMU resources
> +and let the guest use these resources.
> +
> +This document describes the X86 PMU virtualization architecture design
> +and opens. It is organized as follows: Next section describes more
> +details of Linux perf scheduler as it takes a key role in vPMU
> +implementation and allocates PMU resources for guest usage. Then Arch
> +PMU virtualization and LBR virtualization are introduced, each feature
> +has sections to introduce implementation overview,  the expectation and
> +gaps when host and guest perf events coexist.
> +
> +2. Perf Scheduler
> +=================
> +
> +Perf scheduler's responsibility is choosing which events are active at
> +one moment and binding counter with perf event. As processor has limited
> +PMU counters and other resource, only limited perf events can be active
> +at one moment, the inactive perf event may be active in the next moment,
> +perf scheduler has defined rules to control these things.
> +
> +Usually the following cases cause perf event reschedule:
> +1) On a context switch from one task to a different task.
> +2) When an event is manually enabled.
> +3) A call to perf_event_open() with disabled field of the
> +perf_event_attr argument set to 0.

And when the perf scheduler timer expires.

> +
> +When perf event reschedule is needed on a specific cpu, perf will send
> +an IPI to the target cpu, and the IPI handler will activate events
> +ordered by event type, and will iterate all the eligible events.

IIUC, this is only true for the event creation case, not for all of the
reschedule cases above.

> +
> +When a perf event is sched out, this event mapped counter is disabled,
> +and the counter's setting and count value are saved. When a perf event
> +is sched in, perf driver assigns a counter to this event, the counter's
> +setting and count values are restored from last saved.
> +
> +Perf defines four types event, their priority are from high to low:
> +a. Per-cpu pinned: the event should be measured on the specified logical
> +core whenever it is enabled.
> +b. Per-process pinned: the event should be measured whenever it is
> +enabled and the process is running on any logical cores.
> +c. Per-cpu flexible: the event should measured on the specified logical
> +core whenever it is enabled.
> +d. Per-process flexible: the event should be measured whenever it is
> +enabled and the process is running on any logical cores.
> +
> +If the event could not be scheduled because no resource is available for
> +it, pinned event goes into error state and is excluded from perf
> +scheduler, the only way to recover it is re-enable it, flexible event
> +goes into inactive state and can be multiplexed with other events if
> +needed.

Maybe you can add some diagrams or list some key definitions, data
structures, and prototypes to help readers understand perf scheduling,
since it is the key part of the perf subsystem.

> +
> +3. Arch PMU virtualization
> +==========================
> +
> +3.1. Overview
> +-------------
> +
> +Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest
> +PMU driver would access the Arch PMU MSRs (including Fixed and GP
> +counter) as the host does. All the guest Arch PMU MSRs accessing are
> +interceptable.
> +
> +When a guest virtual counter is enabled through guest MSR writing, the
> +KVM trap will create a kvm perf event through the perf subsystem. The
> +kvm perf event's attribute is gotten from the guest virtual counter's
> +MSR setting.
> +
> +When a guest changes the virtual counter's setting later, the KVM trap
> +will release the old kvm perf event then create a new kvm perf event
> +with the new setting.
> +
> +When guest read the virtual counter's count number, the kvm trap will
> +read kvm perf event's counter value and accumulate it to the previous
> +counter value.
> +
> +When guest no longer access the virtual counter's MSR within a
> +scheduling time slice and the virtual counter is disabled, KVM will
> +release the kvm perf event.
> +  ----------------------------
> +  |  Guest                   |
> +  |  perf subsystem          |
> +  ----------------------------
> +       |            ^
> +  vMSR |            | vPMI
> +       v            |
> +  ----------------------------
> +  |  vPMU        KVM vCPU    |
> +  ----------------------------
> +        |          ^
> +  Call  |          | Callbacks
> +        v          |
> +  ---------------------------
> +  | Host Linux Kernel       |
> +  | perf subsystem          |
> +  ---------------------------
> +               |       ^
> +           MSR |       | PMI
> +               v       |
> +         --------------------
> +	 | PMU        CPU   |
> +         --------------------
> +
> +Each guest virtual counter has a corresponding kvm perf event, and the
> +kvm perf event joins host perf scheduler and complies with host perf
> +scheduler rule. When kvm perf event is scheduled by host perf scheduler
> +and is active, the guest virtual counter could supply the correct value.
> +However, if another host perf event comes in and takes over the kvm perf
> +event resource, the kvm perf event will be inactive, then the virtual
> +counter supplies wrong and meaningless value.

IMHO, the data is still valid for a preempted event, as it is saved when
the event is sched_out. But it doesn't match the running task under
profiling, and this is normal when perf preemption exists.

> +
> +3.2. Host and Guest perf event contention
> +-----------------------------------------
> +
> +Kvm perf event is a per-process pinned event, its priority is second.
> +When kvm perf event is active, it can be preempted by host per-cpu
> +pinned perf event, or it can preempt host flexible perf events. Such
> +preemption can be temporarily prohibited through disabling host IRQ.
> +
> +The following results are expected when host and guest perf event
> +coexist according to perf scheduler rule:
> +1). if host per cpu pinned events occupy all the HW resource, kvm perf
> +event can not be active as no available resource, the virtual counter
> +value is  zero always when the guest read it.
> +2). if host per cpu pinned event release HW resource, and kvm perf event
> +is inactive, kvm perf event can claim the HW resource and switch into
> +active, then the guest can get the correct value from the guest virtual
> +counter during kvm perf event is active, but the guest total counter
> +value is not correct since counter value is lost during kvm perf event
> +is inactive.
> +3). if kvm perf event is active, then host per cpu pinned perf event
> +becomes active and reclaims kvm perf event resource, kvm perf event will
> +be inactive. Finally the virtual counter value is kept unchanged and
> +stores previous saved value when the guest reads it. So the guest toatal
> +counter isn't correct.
> +4). If host flexible perf events occupy all the HW resource, kvm perf
> +event can be active and preempts host flexible perf event resource,
> +guest can get the correct value from the guest virtual counter.
> +5). if kvm perf event is active, then other host flexible perf events
> +request to active, kvm perf event still own the resource and active, so
> +guest can get the correct value from the guest virtual counter.
> +
> +3.3. vPMU Arch Gaps
> +-------------------
> +
> +The coexist of host and guest perf events has gap:
> +1). when guest accesses PMU MSRs at the first time, KVM will trap it and
> +create kvm perf event, but this event may be inactive because the
> +contention with host perf event. But guest doesn't notice this and when
> +guest read virtual counter, the return value is zero.
> +2). when kvm perf event is active, host per-cpu pinned perf event can
> +reclaim kvm perf event resource at any time once resource contention
> +happens. But guest doesn't notice this neither and guest following
> +counter accesses get wrong data.
> +So maillist had some discussion titled "Reconsider the current approach
> +of vPMU".
> +
> +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> +
> +The major suggestion in this discussion is host pass-through some
> +counters into guest, but this suggestion is not feasible, the reasons
> +are:
> +a. processor has several counters, but counters are not equal, some
> +event must bind with a specific counter.
> +b. if a special counter is passthrough into guest, host can not support
> +such event and lose some capability.
> +c. if a normal counter is passthrough into guest, guest can support
> +general event only, and the guest has limited capability.
> +So both host and guest lose capability in pass-through mode.
> +
> +4. LBR Virtualization
> +=====================
> +
> +4.1. Overview
> +-------------
> +
> +The guest LBR driver would access the LBR MSR (including IA32_DEBUGCTLMSR
> +and records MSRs) as host does once KVM/QEMU export vcpu's LBR capability
> +into guest,  The first guest access on LBR related MSRs is always
> +interceptable. The KVM trap would create a vLBR perf event which enables
> +the callstack mode and none of the hardware counters are assigned. The
> +host perf would enable and schedule this event as usual.
> +
> +When vLBR event is scheduled by host perf scheduler and is active, host
> +LBR MSRs are owned by guest and are pass-through into guest, guest will
> +access them without VM Exit. However, if another host LBR event comes in
> +and takes over the LBR facility, the vLBR event will be inactive, and
> +guest following accesses to the LBR MSRs will be trapped and meaningless.

Is this true only when the host creates a pinned LBR event? Otherwise,
it won't preempt the guest vLBR.


> +
> +As kvm perf event, vLBR event will be released when guest doesn't access
> +LBR-related MSRs within a scheduling time slice and guest unset LBR
> +enable bit, then the pass-through state of the LBR MSRs will be canceled.
> +
> +4.2. Host and Guest LBR contention
> +----------------------------------
> +
> +vLBR event is a per-process pinned event, its priority is second. vLBR
> +event together with host other LBR event to contend LBR resource,
> +according to perf scheduler rule, when vLBR event is active, it can be
> +preempted by host per-cpu pinned LBR event, or it can preempt host
> +flexible LBR event. Such preemption can be temporarily prohibited
> +through disabling host IRQ as perf scheduler uses IPI to change LBR owner.
> +
> +The following results are expected when host and guest LBR event coexist:
> +1) If host per cpu pinned LBR event is active when vm starts, the guest
> +vLBR event can not preempt the LBR resource, so the guest can not use
> +LBR.
> +2). If host flexible LBR events are active when vm starts, guest vLBR
> +event can preempt LBR, so the guest can use LBR.
> +3). If host per cpu pinned LBR event becomes enabled when guest vLBR
> +event is active, the guest vLBR event will lose LBR and the guest can
> +not use LBR anymore.
> +4). If host flexible LBR event becomes enabled when guest vLBR event is
> +active, the guest vLBR event keeps LBR, the guest can still use LBR.
> +5). If host per cpu pinned LBR event becomes inactive when guest vLBR
> +event is inactive, guest vLBR event can be active and own LBR, the guest
> +can use LBR.

Anyway, the vLBR problems are still induced by perf scheduling
priorities. If you clearly state the current gaps of vPMU, the vLBR
issue becomes clear as well, and then this section could be omitted.

> +
> +4.3. vLBR Arch Gaps
> +-------------------
> +
> +Like vPMU Arch Gap, vLBR event can be preempted by host Per cpu pinned
> +event at any time, or vLBR event is inactive at creation, but guest
> +can not notice this, so the guest will get meaningless value when the
> +vLBR event is inactive.

Zhang, Xiong Y July 28, 2023, 8:33 a.m. UTC | #3

> On 7/24/2023 6:41 PM, Xiong Zhang wrote:
> > Add a vPMU implementation and gap document to explain vArch PMU and
> > vLBR implementation in kvm, especially the current gap to support host
> > and guest perf event coexist.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > ---
> >   Documentation/virt/kvm/x86/index.rst |   1 +
> >   Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
> >   2 files changed, 250 insertions(+)
> >   create mode 100644 Documentation/virt/kvm/x86/pmu.rst
> >
> > diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> > index 9ece6b8dc817..02c1c7b01bf3 100644
> > --- a/Documentation/virt/kvm/x86/index.rst
> > +++ b/Documentation/virt/kvm/x86/index.rst
> > @@ -14,5 +14,6 @@ KVM for x86 systems
> >      mmu
> >      msr
> >      nested-vmx
> > +   pmu
> >      running-nested-guests
> >      timekeeping
> > diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
> > new file mode 100644
> > index 000000000000..e95e8c88e0e0
> > --- /dev/null
> > +++ b/Documentation/virt/kvm/x86/pmu.rst
> > @@ -0,0 +1,249 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==========================
> > +PMU virtualization for X86
> > +==========================
> > +
> > +:Author: Xiong Zhang <xiong.y.zhang@intel.com>
> > +:Copyright: (c) 2023, Intel.  All rights reserved.
> > +
> > +.. Contents
> > +
> > +1. Overview
> > +2. Perf Scheduler
> > +3. Arch PMU virtualization
> > +4. LBR virtualization
> > +
> > +1. Overview
> > +===========
> > +
> > +KVM has supported PMU virtualization on x86 for many years and
> > +provides MSR based Arch PMU interface to the guest. The major
> > +features include Arch PMU v2, LBR and PEBS. Users have the same
> > +operation to profile performance in guest and host.
> > +KVM is a normal perf subsystem user as other perf subsystem users.
> > +When the guest access vPMU MSRs, KVM traps it and creates a perf event for it.
> > +This perf event takes part in perf scheduler to request PMU resources
> > +and let the guest use these resources.
> > +
> > +This document describes the X86 PMU virtualization architecture
> > +design and opens. It is organized as follows: Next section describes
> > +more details of Linux perf scheduler as it takes a key role in vPMU
> > +implementation and allocates PMU resources for guest usage. Then Arch
> > +PMU virtualization and LBR virtualization are introduced, each
> > +feature has sections to introduce implementation overview,  the
> > +expectation and gaps when host and guest perf events coexist.
> > +
> > +2. Perf Scheduler
> > +=================
> > +
> > +Perf scheduler's responsibility is choosing which events are active
> > +at one moment and binding counter with perf event. As processor has
> > +limited PMU counters and other resource, only limited perf events can
> > +be active at one moment, the inactive perf event may be active in the
> > +next moment, perf scheduler has defined rules to control these things.
> > +
> > +Usually the following cases cause perf event reschedule:
> > +1) On a context switch from one task to a different task.
> > +2) When an event is manually enabled.
> > +3) A call to perf_event_open() with disabled field of the
> > +perf_event_attr argument set to 0.
> 
> And when perf scheduler timer expires.
[Zhang, Xiong Y] Yes, when perf_mux_hrtimer expires, perf reschedules perf
events. But I am hesitant about whether it should be added. perf_mux_hrtimer
is used for flexible events when counter multiplexing happens, and it has
little to do with kvm pinned events. If perf_mux_hrtimer is added here, perf
multiplexing would have to be introduced as well. This perf scheduler section
is meant to help readers understand the kvm perf event; it is not a complete
perf scheduler document. Besides perf_mux_hrtimer, there are more corner cases
that cause a perf event reschedule and are not listed here.
> 
> > +
> > +When perf event reschedule is needed on a specific cpu, perf will
> > +send an IPI to the target cpu, and the IPI handler will activate
> > +events ordered by event type, and will iterate all the eligible events.
> 
> IIUC, this is only true for the event create case, not for all above reschedule cases.
[Zhang, Xiong Y] Yes, perf_event_open() and perf_event_enable() send an IPI,
but a task switch and perf_mux_hrtimer do not. I will modify this sentence.
> 
> > +
> > +When a perf event is sched out, this event mapped counter is
> > +disabled, and the counter's setting and count value are saved. When a
> > +perf event is sched in, perf driver assigns a counter to this event,
> > +the counter's setting and count values are restored from last saved.
> > +
> > +Perf defines four types event, their priority are from high to low:
> > +a. Per-cpu pinned: the event should be measured on the specified
> > +logical core whenever it is enabled.
> > +b. Per-process pinned: the event should be measured whenever it is
> > +enabled and the process is running on any logical cores.
> > +c. Per-cpu flexible: the event should measured on the specified
> > +logical core whenever it is enabled.
> > +d. Per-process flexible: the event should be measured whenever it is
> > +enabled and the process is running on any logical cores.
> > +
> > +If the event could not be scheduled because no resource is available
> > +for it, pinned event goes into error state and is excluded from perf
> > +scheduler, the only way to recover it is re-enable it, flexible event
> > +goes into inactive state and can be multiplexed with other events if
> > +needed.
> 
> Maybe you can add some diagrams or list some key definitions/data
> structures/prototypes
> 
> to facilitate readers to understand more about perf schedule since it's the key of
> perf subsystem.
[Zhang, Xiong Y] I will try to add some diagrams. 
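
Until a diagram exists, the priority rules in the quoted section can be
sketched as a small Python model. This is only an illustration and not part
of the patch: the names Event, State and schedule() are hypothetical, and the
model assumes every counter is interchangeable, which real hardware does not
guarantee.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    ERROR = "error"


@dataclass
class Event:
    name: str
    per_cpu: bool   # True: per-cpu event, False: per-process event
    pinned: bool    # True: pinned event, False: flexible event
    state: State = State.INACTIVE

    def priority(self) -> int:
        # 0: per-cpu pinned, 1: per-process pinned,
        # 2: per-cpu flexible, 3: per-process flexible
        return (0 if self.pinned else 2) + (0 if self.per_cpu else 1)


def schedule(events, num_counters):
    """Hand out counters in priority order (hypothetical model)."""
    free = num_counters
    for ev in sorted(events, key=Event.priority):
        if ev.state is State.ERROR:
            continue  # excluded from scheduling until it is re-enabled
        if free > 0:
            ev.state = State.ACTIVE
            free -= 1
        elif ev.pinned:
            ev.state = State.ERROR      # pinned event without a resource
        else:
            ev.state = State.INACTIVE   # flexible event waits its turn
    return events


if __name__ == "__main__":
    demo = [
        Event("host-flexible", per_cpu=True, pinned=False),
        Event("kvm", per_cpu=False, pinned=True),
        Event("host-pinned", per_cpu=True, pinned=True),
    ]
    for ev in schedule(demo, num_counters=1):
        print(ev.name, ev.state.value)
```

With a single counter, the host per-cpu pinned event wins, the kvm
(per-process pinned) event goes into the error state, and the host flexible
event stays inactive. With two counters, the kvm event becomes active ahead
of the host flexible event.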
> 
> > +
> > +3. Arch PMU virtualization
> > +==========================
> > +
> > +3.1. Overview
> > +-------------
> > +
> > +Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest
> > +PMU driver would access the Arch PMU MSRs (including Fixed and GP
> > +counter) as the host does. All the guest Arch PMU MSRs accessing are
> > +interceptable.
> > +
> > +When a guest virtual counter is enabled through guest MSR writing,
> > +the KVM trap will create a kvm perf event through the perf subsystem.
> > +The kvm perf event's attribute is gotten from the guest virtual
> > +counter's MSR setting.
> > +
> > +When a guest changes the virtual counter's setting later, the KVM
> > +trap will release the old kvm perf event then create a new kvm perf
> > +event with the new setting.
> > +
> > +When guest read the virtual counter's count number, the kvm trap will
> > +read kvm perf event's counter value and accumulate it to the previous
> > +counter value.
> > +
> > +When guest no longer access the virtual counter's MSR within a
> > +scheduling time slice and the virtual counter is disabled, KVM will
> > +release the kvm perf event.
> > +  ----------------------------
> > +  |  Guest                   |
> > +  |  perf subsystem          |
> > +  ----------------------------
> > +       |            ^
> > +  vMSR |            | vPMI
> > +       v            |
> > +  ----------------------------
> > +  |  vPMU        KVM vCPU    |
> > +  ----------------------------
> > +        |          ^
> > +  Call  |          | Callbacks
> > +        v          |
> > +  ---------------------------
> > +  | Host Linux Kernel       |
> > +  | perf subsystem          |
> > +  ---------------------------
> > +               |       ^
> > +           MSR |       | PMI
> > +               v       |
> > +         --------------------
> > +	 | PMU        CPU   |
> > +         --------------------
> > +
> > +Each guest virtual counter has a corresponding kvm perf event, and
> > +the kvm perf event joins host perf scheduler and complies with host
> > +perf scheduler rule. When kvm perf event is scheduled by host perf
> > +scheduler and is active, the guest virtual counter could supply the correct value.
> > +However, if another host perf event comes in and takes over the kvm
> > +perf event resource, the kvm perf event will be inactive, then the
> > +virtual counter supplies wrong and meaningless value.
> 
> IMHO, the data is still valid for preempted event as it's saved when the event is
> sched_out.
> 
> But it doesn't match the running task under profiling, and this is normal when perf
> 
> preemption exits.
[Zhang, Xiong Y] The virtual counter supplies a saved value while it is
preempted. When preemption happens, perf_event->running_time stops, but
perf_event->enabling_time continues to increase, so host perf can still
produce an estimated counter value in the end. But host perf cannot notify
the guest virtual counter of this preemption so that guest perf would stop
guest_perf_event->running_time, so the guest gets wrong data.
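
A small numeric sketch of this point, in Python. It is not kernel code: the
function name scaled_count() is hypothetical, and it only mirrors the usual
way perf tooling extrapolates a count from the enabled and running times
that perf reports.

```python
def scaled_count(raw_count: int, time_enabled: int, time_running: int) -> float:
    """Estimate the full-period count from a partially scheduled event."""
    if time_running == 0:
        return 0.0  # the event never ran, so nothing can be extrapolated
    return raw_count * time_enabled / time_running


if __name__ == "__main__":
    # Host view: the kvm perf event was enabled for 10 ms but only ran for
    # 5 ms because a host per-cpu pinned event preempted it.
    host_estimate = scaled_count(1000, time_enabled=10, time_running=5)

    # Guest view: the guest is not told about the preemption, so it believes
    # its event ran for the whole 10 ms and applies no correction.
    guest_estimate = scaled_count(1000, time_enabled=10, time_running=10)

    print(host_estimate, guest_estimate)
```

The host can correct for the time the event was scheduled out, while the
guest, which sees equal enabled and running times, takes the raw value at
face value and under-reports.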
> 
> > +
> > +3.2. Host and Guest perf event contention
> > +-----------------------------------------
> > +
> > +Kvm perf event is a per-process pinned event, its priority is second.
> > +When kvm perf event is active, it can be preempted by host per-cpu
> > +pinned perf event, or it can preempt host flexible perf events. Such
> > +preemption can be temporarily prohibited through disabling host IRQ.
> > +
> > +The following results are expected when host and guest perf event
> > +coexist according to perf scheduler rule:
> > +1). if host per cpu pinned events occupy all the HW resource, kvm
> > +perf event can not be active as no available resource, the virtual
> > +counter value is  zero always when the guest read it.
> > +2). if host per cpu pinned event release HW resource, and kvm perf
> > +event is inactive, kvm perf event can claim the HW resource and
> > +switch into active, then the guest can get the correct value from the
> > +guest virtual counter during kvm perf event is active, but the guest
> > +total counter value is not correct since counter value is lost during
> > +kvm perf event is inactive.
> > +3). if kvm perf event is active, then host per cpu pinned perf event
> > +becomes active and reclaims kvm perf event resource, kvm perf event
> > +will be inactive. Finally the virtual counter value is kept unchanged
> > +and stores previous saved value when the guest reads it. So the guest
> > +toatal counter isn't correct.
> > +4). If host flexible perf events occupy all the HW resource, kvm perf
> > +event can be active and preempts host flexible perf event resource,
> > +guest can get the correct value from the guest virtual counter.
> > +5). if kvm perf event is active, then other host flexible perf events
> > +request to active, kvm perf event still own the resource and active,
> > +so guest can get the correct value from the guest virtual counter.
> > +
> > +3.3. vPMU Arch Gaps
> > +-------------------
> > +
> > +The coexist of host and guest perf events has gap:
> > +1). when guest accesses PMU MSRs at the first time, KVM will trap it
> > +and create kvm perf event, but this event may be inactive because the
> > +contention with host perf event. But guest doesn't notice this and
> > +when guest read virtual counter, the return value is zero.
> > +2). when kvm perf event is active, host per-cpu pinned perf event can
> > +reclaim kvm perf event resource at any time once resource contention
> > +happens. But guest doesn't notice this neither and guest following
> > +counter accesses get wrong data.
> > +So maillist had some discussion titled "Reconsider the current
> > +approach of vPMU".
> > +
> > +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> > +
> > +The major suggestion in this discussion is host pass-through some
> > +counters into guest, but this suggestion is not feasible, the reasons
> > +are:
> > +a. processor has several counters, but counters are not equal, some
> > +event must bind with a specific counter.
> > +b. if a special counter is passthrough into guest, host can not
> > +support such event and lose some capability.
> > +c. if a normal counter is passthrough into guest, guest can support
> > +general event only, and the guest has limited capability.
> > +So both host and guest lose capability in pass-through mode.
> > +
> > +4. LBR Virtualization
> > +=====================
> > +
> > +4.1. Overview
> > +-------------
> > +
> > +The guest LBR driver would access the LBR MSR (including
> > +IA32_DEBUGCTLMSR and records MSRs) as host does once KVM/QEMU export
> > +vcpu's LBR capability into guest,  The first guest access on LBR
> > +related MSRs is always interceptable. The KVM trap would create a
> > +vLBR perf event which enables the callstack mode and none of the
> > +hardware counters are assigned. The host perf would enable and schedule this event as usual.
> > +
> > +When vLBR event is scheduled by host perf scheduler and is active,
> > +host LBR MSRs are owned by guest and are pass-through into guest,
> > +guest will access them without VM Exit. However, if another host LBR
> > +event comes in and takes over the LBR facility, the vLBR event will
> > +be inactive, and guest following accesses to the LBR MSRs will be trapped and meaningless.
> 
> Is this true only when host created a pinned LBR event? Otherwise, it won't
> preempt
> 
> the guest vLBR.
[Zhang, Xiong Y] Yes, the host can create a per-cpu pinned LBR event, for example:
perf record -b -a -e Instructions:D

thanks
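
The LBR contention cases in section 4.2 can be modelled in a few lines of
Python. This is a deliberately simplified sketch, not the kernel's logic: it
treats the LBR facility as a single resource owned by the highest-priority
enabled event, and the names PRIORITY, lbr_owner() and guest_can_use_lbr()
are hypothetical.

```python
# Lower number means higher priority, following the order given in the
# perf scheduler section of the patch.
PRIORITY = {
    "cpu-pinned": 0,
    "task-pinned": 1,
    "cpu-flexible": 2,
    "task-flexible": 3,
}


def lbr_owner(enabled_events):
    """Return the name of the event that owns the LBR, or None if idle.

    enabled_events is a list of (name, kind) tuples, where kind is one of
    the keys of PRIORITY.
    """
    if not enabled_events:
        return None
    return min(enabled_events, key=lambda ev: PRIORITY[ev[1]])[0]


def guest_can_use_lbr(host_events):
    """The vLBR event is modelled as a per-process pinned event."""
    contenders = list(host_events) + [("vLBR", "task-pinned")]
    return lbr_owner(contenders) == "vLBR"


if __name__ == "__main__":
    print(guest_can_use_lbr([("host-lbr", "cpu-pinned")]))
    print(guest_can_use_lbr([("host-lbr", "cpu-flexible")]))
    print(guest_can_use_lbr([]))
```

In this model a host per-cpu pinned LBR event always takes the LBR away from
the guest, while host flexible LBR events never do, which matches the
expected results listed in the quoted section.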
> 
> 
> > +
> > +As kvm perf event, vLBR event will be released when guest doesn't
> > +access LBR-related MSRs within a scheduling time slice and guest
> > +unset LBR enable bit, then the pass-through state of the LBR MSRs will be canceled.
> > +
> > +4.2. Host and Guest LBR contention
> > +----------------------------------
> > +
> > +The vLBR event is a per-process pinned event, the second-highest
> > +priority. It contends for the LBR resource with other host LBR events.
> > +According to the perf scheduler rules, an active vLBR event can be
> > +preempted by a host per-cpu pinned LBR event, and it can preempt a host
> > +flexible LBR event. Such preemption can be temporarily prevented by
> > +disabling host IRQs, as the perf scheduler uses an IPI to change LBR owner.
> > +
> > +The following results are expected when host and guest LBR events coexist:
> > +1). If a host per-cpu pinned LBR event is active when the VM starts, the
> > +guest vLBR event cannot preempt the LBR resource, so the guest cannot
> > +use LBR.
> > +2). If host flexible LBR events are active when the VM starts, the guest
> > +vLBR event can preempt LBR, so the guest can use LBR.
> > +3). If a host per-cpu pinned LBR event becomes enabled while the guest
> > +vLBR event is active, the guest vLBR event loses LBR and the guest can
> > +no longer use LBR.
> > +4). If a host flexible LBR event becomes enabled while the guest vLBR
> > +event is active, the vLBR event keeps LBR and the guest can still use it.
> > +5). If a host per-cpu pinned LBR event becomes inactive while the guest
> > +vLBR event is inactive, the guest vLBR event can become active and own
> > +LBR, so the guest can use LBR.
> 
> Anyway, the vLBR problems are still induced by perf scheduling priorities. If
> you can clearly state the current gaps of vPMU, the vLBR issue is also clear,
> and this section could be omitted.
> 
> > +
> > +4.3. vLBR Arch Gaps
> > +-------------------
> > +
> > +As with the vPMU Arch gaps, the vLBR event can be preempted by a host
> > +per-cpu pinned event at any time, or can be inactive at creation. The
> > +guest cannot notice this, so it gets meaningless values while the vLBR
> > +event is inactive.

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@  KVM for x86 systems
    mmu
    msr
    nested-vmx
+   pmu
    running-nested-guests
    timekeeping
diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..e95e8c88e0e0
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,249 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PMU virtualization for X86
+==========================
+
+:Author: Xiong Zhang <xiong.y.zhang@intel.com>
+:Copyright: (c) 2023, Intel.  All rights reserved.
+
+.. Contents
+
+1. Overview
+2. Perf Scheduler
+3. Arch PMU virtualization
+4. LBR virtualization
+
+1. Overview
+===========
+
+KVM has supported PMU virtualization on x86 for many years and provides
+an MSR-based Arch PMU interface to the guest. The major features include
+Arch PMU v2, LBR and PEBS. Users profile performance in the guest with
+the same operations as on the host.
+KVM is an ordinary user of the perf subsystem, like any other. When the
+guest accesses a vPMU MSR, KVM traps the access and creates a perf event
+for it. This perf event takes part in perf scheduling to request PMU
+resources, which the guest then uses.
+
+This document describes the x86 PMU virtualization design and its open
+issues. It is organized as follows: the next section describes the
+Linux perf scheduler, which plays a key role in the vPMU implementation
+and allocates PMU resources for guest use. Arch PMU virtualization and
+LBR virtualization are then introduced; each has subsections covering
+the implementation overview, the expected behavior, and the gaps when
+host and guest perf events coexist.
+
+2. Perf Scheduler
+=================
+
+The perf scheduler chooses which events are active at a given moment
+and binds a counter to each active perf event. Because a processor has
+a limited number of PMU counters and other resources, only a limited
+number of perf events can be active at once. An inactive perf event may
+become active later; the perf scheduler defines rules to control this.
+
+The following cases usually cause perf events to be rescheduled:
+1) A context switch from one task to a different task.
+2) An event being manually enabled.
+3) A call to perf_event_open() with the disabled field of the
+perf_event_attr argument set to 0.
+
+When perf events must be rescheduled on a specific cpu, perf sends an
+IPI to the target cpu. The IPI handler iterates over all eligible
+events and activates them in order of event type.
+
+When a perf event is scheduled out, the counter mapped to it is
+disabled, and the counter's setting and count value are saved. When a
+perf event is scheduled in, the perf driver assigns a counter to it and
+restores the counter's setting and count value from the last save.
+
+Perf defines four types of event, listed from high to low priority:
+a. Per-cpu pinned: the event should be measured on the specified logical
+core whenever it is enabled.
+b. Per-process pinned: the event should be measured whenever it is
+enabled and the process is running on any logical core.
+c. Per-cpu flexible: the event should be measured on the specified
+logical core whenever it is enabled.
+d. Per-process flexible: the event should be measured whenever it is
+enabled and the process is running on any logical core.
+
+If an event cannot be scheduled because no resource is available for
+it, the outcome depends on its type. A pinned event goes into error
+state and is excluded from the perf scheduler; the only way to recover
+it is to re-enable it. A flexible event goes into inactive state and
+can be multiplexed with other events if needed.
+
+3. Arch PMU virtualization
+==========================
+
+3.1. Overview
+-------------
+
+Once KVM/QEMU exposes the vcpu's Arch PMU capability to the guest, the
+guest PMU driver accesses the Arch PMU MSRs (including the fixed and GP
+counters) as the host does. All guest accesses to Arch PMU MSRs can be
+intercepted.
+
+When a guest virtual counter is enabled through a guest MSR write, the
+KVM trap handler creates a kvm perf event through the perf subsystem.
+The kvm perf event's attributes are derived from the guest virtual
+counter's MSR setting.
+
+When the guest later changes the virtual counter's setting, the KVM trap
+handler releases the old kvm perf event and creates a new kvm perf event
+with the new setting.
+
+When the guest reads the virtual counter's count, the KVM trap handler
+reads the kvm perf event's counter value and adds it to the previously
+accumulated counter value.
+
+When the guest has not accessed the virtual counter's MSR within a
+scheduling time slice and the virtual counter is disabled, KVM releases
+the kvm perf event.
+  ----------------------------
+  |  Guest                   |
+  |  perf subsystem          |
+  ----------------------------
+       |            ^
+  vMSR |            | vPMI
+       v            |
+  ----------------------------
+  |  vPMU        KVM vCPU    |
+  ----------------------------
+        |          ^
+  Call  |          | Callbacks
+        v          |
+  ---------------------------
+  | Host Linux Kernel       |
+  | perf subsystem          |
+  ---------------------------
+               |       ^
+           MSR |       | PMI
+               v       |
+         --------------------
+	 | PMU        CPU   |
+         --------------------
+
+Each guest virtual counter has a corresponding kvm perf event, which
+takes part in host perf scheduling and follows the host perf scheduler
+rules. While the kvm perf event is scheduled in by the host perf
+scheduler and is active, the guest virtual counter supplies the correct
+value. However, if another host perf event comes in and takes over the
+kvm perf event's resource, the kvm perf event becomes inactive, and the
+virtual counter then supplies a wrong and meaningless value.
+
+3.2. Host and Guest perf event contention
+-----------------------------------------
+
+A kvm perf event is a per-process pinned event, the second-highest
+priority. When active, it can be preempted by a host per-cpu pinned
+perf event, and it can preempt host flexible perf events. Such
+preemption can be temporarily prevented by disabling host IRQs.
+
+According to the perf scheduler rules, the following results are
+expected when host and guest perf events coexist:
+1). If host per-cpu pinned events occupy all the HW resources, the kvm
+perf event cannot become active because no resource is available, and
+the virtual counter value is always zero when the guest reads it.
+2). If a host per-cpu pinned event releases a HW resource while the kvm
+perf event is inactive, the kvm perf event can claim the HW resource and
+become active. The guest then gets the correct value from the guest
+virtual counter while the kvm perf event is active, but the guest's
+total count is not correct, since counts were lost while the kvm perf
+event was inactive.
+3). If the kvm perf event is active and a host per-cpu pinned perf event
+then becomes active and reclaims the kvm perf event's resource, the kvm
+perf event becomes inactive. The virtual counter value then stays
+unchanged, and the guest reads the previously saved value, so the
+guest's total count is not correct.
+4). If host flexible perf events occupy all the HW resources, the kvm
+perf event can become active by preempting a host flexible perf event,
+and the guest gets the correct value from the guest virtual counter.
+5). If the kvm perf event is active and other host flexible perf events
+then request activation, the kvm perf event keeps the resource and stays
+active, so the guest gets the correct value from the virtual counter.
+
+3.3. vPMU Arch Gaps
+-------------------
+
+The coexistence of host and guest perf events has the following gaps:
+1). When the guest accesses PMU MSRs for the first time, KVM traps the
+access and creates a kvm perf event, but this event may be inactive
+because of contention with host perf events. The guest does not notice
+this, and when it reads the virtual counter, the returned value is zero.
+2). When the kvm perf event is active, a host per-cpu pinned perf event
+can reclaim the kvm perf event's resource at any time once resource
+contention happens. The guest does not notice this either, and its
+subsequent counter accesses return wrong data.
+These gaps led to a mailing list discussion titled "Reconsider the
+current approach of vPMU":
+
+https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
+
+The major suggestion in this discussion is for the host to pass some
+counters through to the guest, but this suggestion is not feasible, for
+the following reasons:
+a. A processor has several counters, but the counters are not equal:
+some events must be bound to a specific counter.
+b. If a special counter is passed through to the guest, the host cannot
+support the events bound to it and loses some capability.
+c. If a normal counter is passed through to the guest, the guest can
+support general events only, so the guest has limited capability.
+So both host and guest lose capability in pass-through mode.
+
+4. LBR Virtualization
+=====================
+
+4.1. Overview
+-------------
+
+Once KVM/QEMU exposes the vcpu's LBR capability to the guest, the guest
+LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR and the
+record MSRs) as the host does. The first guest access to an LBR-related
+MSR is always intercepted. The KVM trap handler creates a vLBR perf
+event, which enables callstack mode and is assigned no hardware
+counter. Host perf enables and schedules this event as usual.
+
+While the vLBR event is scheduled in by the host perf scheduler and is
+active, the host LBR MSRs are owned by the guest and passed through, so
+the guest accesses them without a VM exit. However, if another host LBR
+event takes over the LBR facility, the vLBR event becomes inactive, and
+later guest accesses to the LBR MSRs are trapped and are meaningless.
+
+Like a kvm perf event, the vLBR event is released when the guest has not
+accessed LBR-related MSRs within a scheduling time slice and has cleared
+the LBR enable bit; the pass-through state of the LBR MSRs is then canceled.
+
+4.2. Host and Guest LBR contention
+----------------------------------
+
+The vLBR event is a per-process pinned event, the second-highest
+priority. It contends for the LBR resource with other host LBR events.
+According to the perf scheduler rules, an active vLBR event can be
+preempted by a host per-cpu pinned LBR event, and it can preempt a host
+flexible LBR event. Such preemption can be temporarily prevented by
+disabling host IRQs, as the perf scheduler uses an IPI to change LBR owner.
+
+The following results are expected when host and guest LBR events coexist:
+1). If a host per-cpu pinned LBR event is active when the VM starts, the
+guest vLBR event cannot preempt the LBR resource, so the guest cannot
+use LBR.
+2). If host flexible LBR events are active when the VM starts, the guest
+vLBR event can preempt LBR, so the guest can use LBR.
+3). If a host per-cpu pinned LBR event becomes enabled while the guest
+vLBR event is active, the guest vLBR event loses LBR and the guest can
+no longer use LBR.
+4). If a host flexible LBR event becomes enabled while the guest vLBR
+event is active, the vLBR event keeps LBR and the guest can still use it.
+5). If a host per-cpu pinned LBR event becomes inactive while the guest
+vLBR event is inactive, the guest vLBR event can become active and own
+LBR, so the guest can use LBR.
+
+4.3. vLBR Arch Gaps
+-------------------
+
+As with the vPMU Arch gaps, the vLBR event can be preempted by a host
+per-cpu pinned event at any time, or can be inactive at creation. The
+guest cannot notice this, so it gets meaningless values while the vLBR
+event is inactive.
