diff mbox series

[v3] Documentation: KVM: Add vPMU implementaion and gap document

Message ID 20230810054518.329117-1-xiong.y.zhang@intel.com (mailing list archive)
State New, archived
Headers show
Series [v3] Documentation: KVM: Add vPMU implementaion and gap document | expand

Commit Message

Zhang, Xiong Y Aug. 10, 2023, 5:45 a.m. UTC
Add a vPMU implementation and gap document to explain vArch PMU and vLBR
implementation in kvm, especially the current gap to support host and
guest perf event coexist.

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
---
Changelog:
v2 -> v3:
* When kvm perf event is inactive, it is in error state actually,
so inactive is changed into error.
* Fix make htmldoc warning

v1 -> v2:
* Refactor perf scheduler section
* Correct one sentence in vArch PMU section
---
 Documentation/virt/kvm/x86/index.rst |   1 +
 Documentation/virt/kvm/x86/pmu.rst   | 332 +++++++++++++++++++++++++++
 2 files changed, 333 insertions(+)
 create mode 100644 Documentation/virt/kvm/x86/pmu.rst


base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60

Comments

Like Xu Aug. 21, 2023, 5:40 a.m. UTC | #1
On 10/8/2023 1:45 pm, Xiong Zhang wrote:
> Add a vPMU implementation and gap document to explain vArch PMU and vLBR
> implementation in kvm, especially the current gap to support host and
> guest perf event coexist.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> ---
> Changelog:
> v2 -> v3:
> * When kvm perf event is inactive, it is in error state actually,
> so inactive is changed into error.
> * Fix make htmldoc warning
> 
> v1 -> v2:
> * Refactor perf scheduler section
> * Correct one sentence in vArch PMU section
> ---
>   Documentation/virt/kvm/x86/index.rst |   1 +
>   Documentation/virt/kvm/x86/pmu.rst   | 332 +++++++++++++++++++++++++++
>   2 files changed, 333 insertions(+)
>   create mode 100644 Documentation/virt/kvm/x86/pmu.rst
> 
> diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> index 9ece6b8dc817..02c1c7b01bf3 100644
> --- a/Documentation/virt/kvm/x86/index.rst
> +++ b/Documentation/virt/kvm/x86/index.rst
> @@ -14,5 +14,6 @@ KVM for x86 systems
>      mmu
>      msr
>      nested-vmx
> +   pmu
>      running-nested-guests
>      timekeeping
> diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
> new file mode 100644
> index 000000000000..503516c41119
> --- /dev/null
> +++ b/Documentation/virt/kvm/x86/pmu.rst
> @@ -0,0 +1,332 @@
> +.. SPDX-License-Identifier: GPL-2.0

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#34: FILE: Documentation/virt/kvm/x86/pmu.rst:1:
+.. SPDX-License-Identifier: GPL-2.0

> +
> +==========================
> +PMU virtualization for X86

For Intel ?

Many of the descriptions below are Intel features, so you either have to
add AMD part or limit the scope, otherwise the reader will be confused.

> +==========================
> +
> +:Author: Xiong Zhang <xiong.y.zhang@intel.com>
> +:Copyright: (c) 2023, Intel.  All rights reserved.
> +
> +.. Contents
> +
> +1. Overview
> +2. Perf Scheduler Basic

I would strongly suggest moving this section out of KVM subsystem,
Two possible locations:

- Documentation/admin-guide/perf
- tools/perf/Documentation/

Or the better one, the man2/perf_event_open.2 page in the man-pages project:

- 
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/perf_event_open.2

It's more generic and someone more appropriate could update and maintain it for 
more readers.
Here we only need to refer to it so as not to bother users with 
outdated/mismatched content.

> +3. Arch PMU virtualization
> +4. LBR virtualization

How about organizing it this way:

1. PMU Overview
2. Capability Enumeration
3. Basic Counter Virtualization
4. Intel PEBS Virtualization
5. Intel LBR Virtualization
6. KVM PMU APIs Guidance
7. Limitations and TODOs
8. Reference

> +
> +1. Overview
> +===========
> +
> +KVM has supported PMU virtualization on x86 for many years and provides
> +MSR based Arch PMU interface to the guest. The major features include
> +Arch PMU v2, LBR and PEBS. Users have the same operation to profile
> +performance in guest and host.
> +KVM is a normal perf subsystem user as other perf subsystem users. When
> +the guest access vPMU MSRs, KVM traps it and creates a perf event for it.
> +This perf event takes part in perf scheduler to request PMU resources
> +and let the guest use these resources.
> +
> +This document describes the X86 PMU virtualization architecture design
> +and opens. It is organized as follows: Next section describes more
> +details of Linux perf scheduler as it takes a key role in vPMU
> +implementation and allocates PMU resources for guest usage. Then Arch
> +PMU virtualization and LBR virtualization are introduced, each feature
> +has sections to introduce implementation overview,  the expectation and
> +gaps when host and guest perf events coexist.
> +
> +2. Perf Scheduler Basic
> +=======================
> +
> +Perf subsystem users can not get PMU counter or resource directly, user
> +should create a perf event first and specify event’s attribute which is
> +used to choose PMU counters, then perf event joins in perf scheduler,
> +perf scheduler assigns the corresponding PMU counter to a perf event.
> +
> +Perf event is created by perf_event_open() system call::
> +
> +    int syscall(SYS_perf_event_open, struct perf_event_attr *,
> +		pid, cpu, group_fd, flags)
> +    struct perf_event_attr {
> +	    ......
> +	    /* Major type: hardware/software/tracepoint/etc. */
> +	    __u32   type;
> +	    /* Type specific configuration information. */
> +	    __u64   config;
> +	    union {
> +		    __u64      sample_period;
> +		    __u64      sample_freq;
> +	    }
> +	   __u64   disabled :1;
> +	           pinned   :1;
> +		   exclude_user  :1;
> +		   exclude_kernel :1;
> +		   exclude_host   :1;
> +	           exclude_guest  :1;
> +	......
> +    }
> +
> +The pid and cpu arguments allow specifying which process and CPU
> +to monitor::
> +
> +  pid == 0 and cpu == -1
> +        This measures the calling process/thread on any CPU.
> +  pid == 0 and cpu >= 0
> +        This measures the calling process/thread only when running on
> +	the specified cpu.
> +  pid > 0 and cpu == -1
> +        This measures the specified process/thread on any cpu.
> +  pid > 0 and cpu >= 0
> +        This  measures the specified process/thread only when running
> +	on the specified CPU.
> +  pid == -1 and cpu >= 0
> +        This measures all processes/threads on the specified CPU.
> +  pid == -1 and cpu == -1
> +        This setting is invalid and will return an error.
> +
> +Perf scheduler's responsibility is choosing which events are active at
> +one moment and binding counter with perf event. As processor has limited
> +PMU counters and other resource, only limited perf events can be active
> +at one moment, the inactive perf event may be active in the next moment,
> +perf scheduler has defined rules to control these things.
> +
> +Perf scheduler defines four types of perf event, defined by the pid and
> +cpu arguments in perf_event_open(), plus perf_event_attr.pinned, their
> +schedule priority are: per_cpu pinned > per_process pinned
> +> per_cpu flexible > per_process flexible. High priority events can
> +preempt low priority events when resources contend.
> +
> +perf event type::
> +
> +  --------------------------------------------------------
> +  |                      |   pid   |   cpu   |   pinned  |
> +  --------------------------------------------------------
> +  | Per-cpu pinned       |   *    |   >= 0   |     1     |
> +  --------------------------------------------------------
> +  | Per-process pinned   |  >= 0  |    *     |     1     |
> +  --------------------------------------------------------
> +  | Per-cpu flexible     |   *    |   >= 0   |     0     |
> +  --------------------------------------------------------
> +  | Per-process flexible | >= 0   |    *     |     0     |
> +  --------------------------------------------------------
> +
> +perf_event abstract::
> +
> +    struct perf_event {
> +	    struct list_head       event_entry;
> +	    ......
> +	    struct pmu             *pmu;
> +	    enum perf_event_state  state;
> +	    local64_t              count;
> +	    u64                    total_time_enabled;
> +	    u64                    total_time_running;
> +	    struct perf_event_attr attr;
> +	    ......
> +    }
> +
> +For per-cpu perf event, it is linked into per cpu global variable
> +perf_cpu_context, for per-process perf event, it is linked into
> +task_struct->perf_event_context.
> +
> +Usually the following cases cause perf event reschedule:
> +1) In a context switch from one task to a different task.
> +2) When an event is manually enabled.
> +3) A call to perf_event_open() with disabled field of the
> +perf_event_attr argument set to 0.
> +
> +When perf_event_open() or perf_event_enable() is called, perf event
> +reschedule is needed on a specific cpu, perf will send an IPI to the
> +target cpu, and the IPI handler will activate events ordered by event
> +type, and will iterate all the eligible events in per cpu gloable
> +variable perf_cpu_context and current->perf_event_context.
> +
> +When a perf event is sched out, this event mapped counter is disabled,
> +and the counter's setting and count value are saved. When a perf event
> +is sched in, perf driver assigns a counter to this event, the counter's
> +setting and count values are restored from last saved.
> +
> +If the event could not be scheduled because no resource is available for
> +it, pinned event goes into error state and is excluded from perf
> +scheduler, the only way to recover it is re-enable it, flexible event
> +goes into inactive state and can be multiplexed with other events if
> +needed.
> +
> +
> +3. Arch PMU virtualization
> +==========================
> +
> +3.1. Overview
> +-------------
> +
> +Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest
> +PMU driver would access the Arch PMU MSRs (including Fixed and GP
> +counter) as the host does. All the guest Arch PMU MSRs accessing are
> +interceptable.
> +
> +When a guest virtual counter is enabled through guest MSR writing, the
> +KVM trap will create a kvm perf event through the perf subsystem. The
> +kvm perf event's attribute is gotten from the guest virtual counter's
> +MSR setting.
> +
> +When a guest changes the virtual counter's setting later, the KVM trap
> +will release the old kvm perf event then create a new kvm perf event
> +with the new setting.
> +
> +When guest read the virtual counter's count number, the kvm trap will
> +read kvm perf event's counter value and accumulate it to the previous
> +counter value.
> +
> +When guest no longer access the virtual counter's MSR within a
> +scheduling time slice and the virtual counter is disabled, KVM will
> +release the kvm perf event.
> +
> +vPMU diagram::
> +
> +  ----------------------------
> +  |  Guest                   |
> +  |  perf subsystem          |
> +  ----------------------------
> +       |            ^
> +  vMSR |            | vPMI
> +       v            |
> +  ----------------------------
> +  |  vPMU        KVM vCPU    |
> +  ----------------------------
> +        |          ^
> +  Call  |          | Callbacks
> +        v          |
> +  ---------------------------
> +  | Host Linux Kernel       |
> +  | perf subsystem          |
> +  ---------------------------
> +               |       ^
> +           MSR |       | PMI
> +               v       |
> +         --------------------
> +	 | PMU        CPU   |
> +         --------------------
> +
> +Each guest virtual counter has a corresponding kvm perf event, and the
> +kvm perf event joins host perf scheduler and complies with host perf
> +scheduler rule. When kvm perf event is scheduled by host perf scheduler
> +and is active, the guest virtual counter could supply the correct value.
> +However, if another host perf event comes in and takes over the kvm perf
> +event resource, the kvm perf event will be in error state, then the
> +virtual counter keeps the saved value when the kvm perf event is preempted.
> +But guest perf doesn't notice the underbeach virtual counter is stopped, so
> +the final guest profiling data is wrong.
> +
> +3.2. Host and Guest perf event contention
> +-----------------------------------------
> +
> +Kvm perf event is a per-process pinned event, its priority is second.
> +When kvm perf event is active, it can be preempted by host per-cpu
> +pinned perf event, or it can preempt host flexible perf events. Such
> +preemption can be temporarily prohibited through disabling host IRQ.
> +
> +The following results are expected when host and guest perf event
> +coexist according to perf scheduler rule:
> +1). if host per cpu pinned events occupy all the HW resource, kvm perf
> +event can not be active as no available resource, the virtual counter
> +value is zero always when the guest reads it.
> +2). if host per cpu pinned event release HW resource, and kvm perf event
> +is in error state, kvm perf event can claim the HW resource and switch into
> +active, then the guest can get the correct value from the guest virtual
> +counter during kvm perf event is active, but the guest total counter
> +value is not correct since counter value is lost during kvm perf event
> +is in error state.
> +3). if kvm perf event is active, then host per cpu pinned perf event
> +becomes active and reclaims kvm perf event resource, kvm perf event will
> +be in error state. Finally the virtual counter value is kept unchanged and
> +stores previous saved value when the guest reads it. So the guest total
> +counter isn't correct.
> +4). If host flexible perf events occupy all the HW resource, kvm perf
> +event can be active and preempts host flexible perf event resource,
> +the guest can get the correct value from the guest virtual counter.
> +5). if kvm perf event is active, then other host flexible perf events
> +request to active, kvm perf event still own the resource and active, so
> +the guest can get the correct value from the guest virtual counter.
> +
> +3.3. vPMU Arch Gaps
> +-------------------
> +
> +The coexist of host and guest perf events has gaps:
> +1). when guest accesses PMU MSRs at the first time, KVM will trap it and
> +create kvm perf event, but this event may be not active because the
> +contention with host perf event. But guest doesn't notice this and when
> +guest read virtual counter, the return value is zero.
> +2). when kvm perf event is active, host per-cpu pinned perf event can
> +reclaim kvm perf event resource at any time once resource contention
> +happens. But guest doesn't notice this neither and guest following
> +counter accesses get wrong data.
> +So maillist had some discussion titled "Reconsider the current approach
> +of vPMU".
> +
> +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> +
> +The major suggestion in this discussion is host pass-through some
> +counters into guest, but this suggestion is not feasible, the reasons
> +are:
> +a. processor has several counters, but counters are not equal, some
> +event must bind with a specific counter.
> +b. if a special counter is passthrough into guest, host can not support
> +such events and lose some capability.
> +c. if a normal counter is passthrough into guest, guest can support
> +general event only, and the guest has limited capability.
> +So both host and guest lose capability in pass-through mode.
> +
> +4. LBR Virtualization
> +=====================
> +
> +4.1. Overview
> +-------------
> +
> +The guest LBR driver would access the LBR MSR (including IA32_DEBUGCTLMSR
> +and records MSRs) as host does once KVM/QEMU export vcpu's LBR capability
> +into guest,  The first guest access on LBR related MSRs is always
> +interceptable. The KVM trap would create a vLBR perf event which enables
> +the callstack mode and none of the hardware counters are assigned. The
> +host perf would enable and schedule this event as usual.
> +
> +When vLBR event is scheduled by host perf scheduler and is active, host
> +LBR MSRs are owned by guest and are pass-through into guest, guest will
> +access them without VM Exit. However, if another host LBR event comes in
> +and takes over the LBR facility, the vLBR event will be in error state,
> +and the guest following access to the LBR MSRs will be trapped and
> +meaningless.
> +
> +As kvm perf event, vLBR event will be released when guest doesn't access
> +LBR-related MSRs within a scheduling time slice and guest unset LBR
> +enable bit, then the pass-through state of the LBR MSRs will be canceled.
> +
> +4.2. Host and Guest LBR contention
> +----------------------------------
> +
> +vLBR event is a per-process pinned event, its priority is second. vLBR
> +event together with host other LBR event to contend LBR resource,
> +according to perf scheduler rule, when vLBR event is active, it can be
> +preempted by host per-cpu pinned LBR event, or it can preempt host
> +flexible LBR event. Such preemption can be temporarily prohibited
> +through disabling host IRQ as perf scheduler uses IPI to change LBR owner.
> +
> +The following results are expected when host and guest LBR event coexist:
> +1) If host per cpu pinned LBR event is active when vm starts, the guest
> +vLBR event can not preempt the LBR resource, so the guest can not use
> +LBR.
> +2). If host flexible LBR events are active when vm starts, guest vLBR
> +event can preempt LBR, so the guest can use LBR.
> +3). If host per cpu pinned LBR event becomes enabled when guest vLBR
> +event is active, the guest vLBR event will lose LBR and the guest can
> +not use LBR anymore.
> +4). If host flexible LBR event becomes enabled when guest vLBR event is
> +active, the guest vLBR event keeps LBR, the guest can still use LBR.
> +5). If host per cpu pinned LBR event turns off when guest vLBR event is
> +not active, guest vLBR event can be active and own LBR, the guest can use
> +LBR.
> +
> +4.3. vLBR Arch Gaps
> +-------------------
> +
> +Like vPMU Arch Gap, vLBR event can be preempted by host Per cpu pinned
> +event at any time, or vLBR event is not active at creation, but guest
> +can not notice this, so the guest will get meaningless value when the
> +vLBR event is not active.
> 
> base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60
Like Xu Aug. 21, 2023, 6:04 a.m. UTC | #2
On 10/8/2023 1:45 pm, Xiong Zhang wrote:
> +1. Overview
> +===========
> +
> +KVM has supported PMU virtualization on x86 for many years and provides

"logical core level PMU virtualization"

> +MSR based Arch PMU interface to the guest. The major features include

Drop the "MSR based Arch PMU interface" statement, considering RDPMC.

> +Arch PMU v2, LBR and PEBS. Users have the same operation to profile

include (Arch for Intel) pmu versions up to 2, basic counters, Intel LBR and 
PEBS, etc.

> +performance in guest and host.

Any pmu profiler that works on host based on core pmu (e.g. Linux perl tool) can
be expected to work on guest.

On linux host and guest, the perf subsystem manages pmu hardware resources.

> +KVM is a normal perf subsystem user as other perf subsystem users. When

, but in the kernel space like nmi_watchdog and some ebpf programs that use pmu.

> +the guest access vPMU MSRs, KVM traps it and creates a perf event for it.

This is inaccurate.

KVM emulates the different guest's PMU capabilities as accurately as
possible via creating the appropriate perf_event(s) in the kernel space
to occupy the corresponding PMU resources from the host.

> +This perf event takes part in perf scheduler to request PMU resources
> +and let the guest use these resources.

 From the host perf-core perspective, these KVM-created perf events
take part in perf scheduler, just like any other perf_events.

> +
> +This document describes the X86 PMU virtualization architecture design

Intel PMU only or add more AMD stuff.

> +and opens. It is organized as follows: Next section describes more
> +details of Linux perf scheduler as it takes a key role in vPMU
> +implementation and allocates PMU resources for guest usage. Then Arch
> +PMU virtualization and LBR virtualization are introduced, each feature
> +has sections to introduce implementation overview,  the expectation and
> +gaps when host and guest perf events coexist.
Like Xu Aug. 21, 2023, 6:39 a.m. UTC | #3
On 10/8/2023 1:45 pm, Xiong Zhang wrote:
> +2. Perf Scheduler Basic
> +=======================
> +
> +Perf subsystem users can not get PMU counter or resource directly, user

s/can not get/not expected to access PMU hw counters/

> +should create a perf event first and specify event’s attribute which is

event’s attribute, drop the Unicode character.

attribute --> attributes which are

> +used to choose PMU counters, then perf event joins in perf scheduler,
> +perf scheduler assigns the corresponding PMU counter to a perf event.

"Counter" is not generic enough for LBR case.

The number of perf_event is not necessarily 1:1 mapped to the number of PMU
hardware resources (such as counters) acquired either.

> +
> +Perf event is created by perf_event_open() system call::

KVM is using the perf_event_create_kernel_counter() API.

The difference between these two interfaces is worth being described here.
But generic perf scheduler behavior doesn't fit.

> +
> +    int syscall(SYS_perf_event_open, struct perf_event_attr *,
> +		pid, cpu, group_fd, flags)
> +    struct perf_event_attr {
> +	    ......
> +	    /* Major type: hardware/software/tracepoint/etc. */
> +	    __u32   type;
> +	    /* Type specific configuration information. */
> +	    __u64   config;
> +	    union {
> +		    __u64      sample_period;
> +		    __u64      sample_freq;
> +	    }
> +	   __u64   disabled :1;
> +	           pinned   :1;
> +		   exclude_user  :1;
> +		   exclude_kernel :1;
> +		   exclude_host   :1;
> +	           exclude_guest  :1;
> +	......
> +    }
> +
> +The pid and cpu arguments allow specifying which process and CPU
> +to monitor::
> +
> +  pid == 0 and cpu == -1
> +        This measures the calling process/thread on any CPU.
> +  pid == 0 and cpu >= 0
> +        This measures the calling process/thread only when running on
> +	the specified cpu.
> +  pid > 0 and cpu == -1
> +        This measures the specified process/thread on any cpu.
> +  pid > 0 and cpu >= 0
> +        This  measures the specified process/thread only when running
> +	on the specified CPU.
> +  pid == -1 and cpu >= 0
> +        This measures all processes/threads on the specified CPU.
> +  pid == -1 and cpu == -1
> +        This setting is invalid and will return an error.
> +
> +Perf scheduler's responsibility is choosing which events are active at
> +one moment and binding counter with perf event. As processor has limited

This is not rigorous at all, perf manages a lot of kernel abstractions,
and hardware pmu is just one part of it.

> +PMU counters and other resource, only limited perf events can be active
> +at one moment, the inactive perf event may be active in the next moment,
> +perf scheduler has defined rules to control these things.

Some developers often ask about the mechanics of events/counters multiplexing,
which is expect to be mentioned here for generic perf behavior.

> +
> +Perf scheduler defines four types of perf event, defined by the pid and
> +cpu arguments in perf_event_open(), plus perf_event_attr.pinned, their
> +schedule priority are: per_cpu pinned > per_process pinned
> +> per_cpu flexible > per_process flexible. High priority events can
> +preempt low priority events when resources contend.

It's not "per-process",

  *  - CPU pinned (EVENT_CPU | EVENT_PINNED)
  *  - task pinned (EVENT_PINNED)
  *  - CPU flexible (EVENT_CPU | EVENT_FLEXIBLE)
  *  - task flexible (EVENT_FLEXIBLE).

It would be nice to mention here that perf function that handles prioritization:

static void ctx_resched(struct perf_cpu_context *cpuctx,
			struct perf_event_context *task_ctx,
			enum event_type_t event_type)

I wouldn't be surprised if the comment around ctx_resched() is out of date.

> +
> +perf event type::
> +
> +  --------------------------------------------------------
> +  |                      |   pid   |   cpu   |   pinned  |
> +  --------------------------------------------------------
> +  | Per-cpu pinned       |   *    |   >= 0   |     1     |
> +  --------------------------------------------------------
> +  | Per-process pinned   |  >= 0  |    *     |     1     |
> +  --------------------------------------------------------
> +  | Per-cpu flexible     |   *    |   >= 0   |     0     |
> +  --------------------------------------------------------
> +  | Per-process flexible | >= 0   |    *     |     0     |
> +  --------------------------------------------------------
> +
> +perf_event abstract::
> +
> +    struct perf_event {
> +	    struct list_head       event_entry;
> +	    ......
> +	    struct pmu             *pmu;
> +	    enum perf_event_state  state;
> +	    local64_t              count;
> +	    u64                    total_time_enabled;
> +	    u64                    total_time_running;
> +	    struct perf_event_attr attr;
> +	    ......
> +    }
> +
> +For per-cpu perf event, it is linked into per cpu global variable
> +perf_cpu_context, for per-process perf event, it is linked into
> +task_struct->perf_event_context.
> +
> +Usually the following cases cause perf event reschedule:
> +1) In a context switch from one task to a different task.
> +2) When an event is manually enabled.
> +3) A call to perf_event_open() with disabled field of the
> +perf_event_attr argument set to 0.
> +
> +When perf_event_open() or perf_event_enable() is called, perf event
> +reschedule is needed on a specific cpu, perf will send an IPI to the
> +target cpu, and the IPI handler will activate events ordered by event
> +type, and will iterate all the eligible events in per cpu gloable
> +variable perf_cpu_context and current->perf_event_context.
> +
> +When a perf event is sched out, this event mapped counter is disabled,
> +and the counter's setting and count value are saved. When a perf event
> +is sched in, perf driver assigns a counter to this event, the counter's
> +setting and count values are restored from last saved.
> +
> +If the event could not be scheduled because no resource is available for
> +it, pinned event goes into error state and is excluded from perf
> +scheduler, the only way to recover it is re-enable it, flexible event
> +goes into inactive state and can be multiplexed with other events if
> +needed.

I highly doubt that these are internal behaviors that the perf system is designed
to do, some are, some aren't, and some have exceptions like the BTS event.

Obviously this part needs to be reviewed by more perf developers.
Trying to muddle through will only mislead more developers.

I'd much rather see those perf descriptions in the perf comments or man-pages.
 From time to time, perf core will refactor or change its internal implementations.
Like Xu Aug. 21, 2023, 7:23 a.m. UTC | #4
On 10/8/2023 1:45 pm, Xiong Zhang wrote:
> +3. Arch PMU virtualization
> +==========================
> +
> +3.1. Overview
> +-------------
> +
> +Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest

KVM user space (like QEMU)

> +PMU driver would access the Arch PMU MSRs (including Fixed and GP
> +counter) as the host does. All the guest Arch PMU MSRs accessing are

as expected from guest emulated capabilities (like CPUID hinted).

No need to mention "MSRs accessing are interceptable", and it's not true.

> +interceptable.

More, we need to mention here what to expect if the user space sets more PMU
capabilities than KVM can support.

> +
> +When a guest virtual counter is enabled through guest MSR writing, the
> +KVM trap will create a kvm perf event through the perf subsystem. The
> +kvm perf event's attribute is gotten from the guest virtual counter's
> +MSR setting.

This part fits the "Basic Counter Virtualization", not true for vLBR emulation.

> +
> +When a guest changes the virtual counter's setting later, the KVM trap
> +will release the old kvm perf event then create a new kvm perf event
> +with the new setting.

It's not true. Often times perf_event will be reused

Here it is also necessary to mention the life cycle management of perf_event,
such as the lazy-release mechanism in the KVM context.

> +
> +When guest read the virtual counter's count number, the kvm trap will
> +read kvm perf event's counter value and accumulate it to the previous
> +counter value.

Again, it only fits the  "Basic Counter Virtualization".

> +
> +When guest no longer access the virtual counter's MSR within a
> +scheduling time slice and the virtual counter is disabled, KVM will
> +release the kvm perf event.

The timing of the release and why needs to be explained in more detail here.

> +
> +vPMU diagram::
> +
> +  ----------------------------
> +  |  Guest                   |
> +  |  perf subsystem          |
> +  ----------------------------
> +       |            ^
> +  vMSR |            | vPMI

Try to cover RDPMC trap.

> +       v            |
> +  ----------------------------
> +  |  vPMU        KVM vCPU    |
> +  ----------------------------
> +        |          ^
> +  Call  |          | Callbacks
> +        v          |
> +  ---------------------------
> +  | Host Linux Kernel       |
> +  | perf subsystem          |
> +  ---------------------------
> +               |       ^
> +           MSR |       | PMI
> +               v       |
> +         --------------------
> +	 | PMU        CPU   |
> +         --------------------
> +

Again, it only fits the "Basic Counter Virtualization".

> +Each guest virtual counter has a corresponding kvm perf event, and the
> +kvm perf event joins host perf scheduler and complies with host perf
> +scheduler rule. When kvm perf event is scheduled by host perf scheduler
> +and is active, the guest virtual counter could supply the correct value.
> +However, if another host perf event comes in and takes over the kvm perf
> +event resource, the kvm perf event will be in error state, then the

How about case "in the INACTIVE state" ? And more:

enum perf_event_state {
	PERF_EVENT_STATE_DEAD		= -4,
	PERF_EVENT_STATE_EXIT		= -3,
	PERF_EVENT_STATE_ERROR		= -2,
	PERF_EVENT_STATE_OFF		= -1,
	PERF_EVENT_STATE_INACTIVE	=  0,
	PERF_EVENT_STATE_ACTIVE		=  1,
};

> +virtual counter keeps the saved value when the kvm perf event is preempted.
> +But guest perf doesn't notice the underbeach virtual counter is stopped, so
> +the final guest profiling data is wrong.

It is also important to mention here the impact on vPMC when a counter
is share in a multiplexing approach.

> +
> +3.2. Host and Guest perf event contention
> +-----------------------------------------
> +
> +Kvm perf event is a per-process pinned event, its priority is second.

Always use "KVM".

> +When kvm perf event is active, it can be preempted by host per-cpu
> +pinned perf event, or it can preempt host flexible perf events. Such
> +preemption can be temporarily prohibited through disabling host IRQ.
> +
> +The following results are expected when host and guest perf event
> +coexist according to perf scheduler rule:
> +1). if host per cpu pinned events occupy all the HW resource, kvm perf
> +event can not be active as no available resource, the virtual counter

Not prioritizing enough to grab limited resources.

Again, we need selftests to cover this part of the behavior, and the following.

> +value is zero always when the guest reads it.
> +2). if host per cpu pinned event release HW resource, and kvm perf event
> +is in error state, kvm perf event can claim the HW resource and switch into

I think there might be a window period here. Is the transition from
the error state to the active state done by event re-enanablement, or
by recreating perf_event, or is perf-core internally re-sched the events ?

> +active, then the guest can get the correct value from the guest virtual
> +counter during kvm perf event is active, but the guest total counter
> +value is not correct since counter value is lost during kvm perf event
> +is in error state.
> +3). if kvm perf event is active, then host per cpu pinned perf event
> +becomes active and reclaims kvm perf event resource, kvm perf event will
> +be in error state. Finally the virtual counter value is kept unchanged and
> +stores previous saved value when the guest reads it. So the guest total
> +counter isn't correct.
> +4). If host flexible perf events occupy all the HW resource, kvm perf
> +event can be active and preempts host flexible perf event resource,
> +the guest can get the correct value from the guest virtual counter.
> +5). if kvm perf event is active, then other host flexible perf events
> +request to active, kvm perf event still own the resource and active, so
> +the guest can get the correct value from the guest virtual counter.

What happens when an event of the same priority is added ?

Here we can have one table to cover all combinations.

> +
> +3.3. vPMU Arch Gaps
> +-------------------
> +
> +The coexist of host and guest perf events has gaps:
> +1). when guest accesses PMU MSRs at the first time, KVM will trap it and
> +create kvm perf event, but this event may be not active because the
> +contention with host perf event. But guest doesn't notice this and when
> +guest read virtual counter, the return value is zero.

Or historical old values.

> +2). when kvm perf event is active, host per-cpu pinned perf event can
> +reclaim kvm perf event resource at any time once resource contention

Not reclaimed at any time, but at a point in time beyond KVM's control or awareness.

That's the root of the pmu-coexistence issues.

> +happens. But guest doesn't notice this neither and guest following
> +counter accesses get wrong data.
> +So maillist had some discussion titled "Reconsider the current approach
> +of vPMU".
> +
> +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> +
> +The major suggestion in this discussion is host pass-through some
> +counters into guest, but this suggestion is not feasible, the reasons
> +are:
> +a. processor has several counters, but counters are not equal, some
> +event must bind with a specific counter.
> +b. if a special counter is passthrough into guest, host can not support
> +such events and lose some capability.
> +c. if a normal counter is passthrough into guest, guest can support
> +general event only, and the guest has limited capability.
> +So both host and guest lose capability in pass-through mode.
> +

In short, any solution that statically splits pmu resources between
host and guest at run time will not survive.

But one line of thought is to completely disable the ability to host perf
from init, which we haven't tried it yet.
Like Xu Aug. 21, 2023, 7:38 a.m. UTC | #5
On 10/8/2023 1:45 pm, Xiong Zhang wrote:
> +4. LBR Virtualization
> +=====================
> +
> +4.1. Overview
> +-------------
> +
> +The guest LBR driver would access the LBR MSR (including IA32_DEBUGCTLMSR
> +and records MSRs) as host does once KVM/QEMU export vcpu's LBR capability
> +into guest,  The first guest access on LBR related MSRs is always
> +interceptable. The KVM trap would create a vLBR perf event which enables

intercepted.

> +the callstack mode and none of the hardware counters are assigned. The

the LBR callstack mode.

> +host perf would enable and schedule this event as usual.

in the absence of contention.

> +
> +When vLBR event is scheduled by host perf scheduler and is active, host
> +LBR MSRs are owned by guest and are pass-through into guest, guest will
> +access them without VM Exit. However, if another host LBR event comes in
> +and takes over the LBR facility, the vLBR event will be in error state,

Doesn't LBR event have a scheduling priority ?

> +and the guest following access to the LBR MSRs will be trapped and
> +meaningless.

One description missing here is whether KVM retries to create LBR events 
frequently ?

> +
> +As kvm perf event, vLBR event will be released when guest doesn't access
> +LBR-related MSRs within a scheduling time slice and guest unset LBR

Not all LBR-related MSRs, but the DEBUGCTLMSR_LBR bit at now.

> +enable bit, then the pass-through state of the LBR MSRs will be canceled.
> +
> +4.2. Host and Guest LBR contention
> +----------------------------------
> +
> +vLBR event is a per-process pinned event, its priority is second. vLBR
> +event together with host other LBR event to contend LBR resource,
> +according to perf scheduler rule, when vLBR event is active, it can be
> +preempted by host per-cpu pinned LBR event, or it can preempt host
> +flexible LBR event. Such preemption can be temporarily prohibited
> +through disabling host IRQ as perf scheduler uses IPI to change LBR owner.

Those same descriptions can be shared with counter contention.

> +
> +The following results are expected when host and guest LBR event coexist:
> +1) If host per cpu pinned LBR event is active when vm starts, the guest
> +vLBR event can not preempt the LBR resource, so the guest can not use
> +LBR.

This is not the same as the current implementation. One could argue that this
is expected, but the current state of the system must be described first.

> +2). If host flexible LBR events are active when vm starts, guest vLBR
> +event can preempt LBR, so the guest can use LBR.
> +3). If host per cpu pinned LBR event becomes enabled when guest vLBR
> +event is active, the guest vLBR event will lose LBR and the guest can
> +not use LBR anymore.
> +4). If host flexible LBR event becomes enabled when guest vLBR event is
> +active, the guest vLBR event keeps LBR, the guest can still use LBR.
> +5). If host per cpu pinned LBR event turns off when guest vLBR event is
> +not active, guest vLBR event can be active and own LBR, the guest can use
> +LBR.
> +
> +4.3. vLBR Arch Gaps
> +-------------------
> +
> +Like vPMU Arch Gap, vLBR event can be preempted by host Per cpu pinned
> +event at any time, or vLBR event is not active at creation, but guest
> +can not notice this, so the guest will get meaningless value when the
> +vLBR event is not active.

Another gap is live-migration support for guest LBR.
Zhang, Xiong Y Aug. 22, 2023, 8:39 a.m. UTC | #6
> 
> On 10/8/2023 1:45 pm, Xiong Zhang wrote:
> > Add a vPMU implementation and gap document to explain vArch PMU and
> > vLBR implementation in kvm, especially the current gap to support host
> > and guest perf event coexist.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> > ---
> > Changelog:
> > v2 -> v3:
> > * When kvm perf event is inactive, it is in error state actually, so
> > inactive is changed into error.
> > * Fix make htmldoc warning
> >
> > v1 -> v2:
> > * Refactor perf scheduler section
> > * Correct one sentence in vArch PMU section
> > ---
> >   Documentation/virt/kvm/x86/index.rst |   1 +
> >   Documentation/virt/kvm/x86/pmu.rst   | 332
> +++++++++++++++++++++++++++
> >   2 files changed, 333 insertions(+)
> >   create mode 100644 Documentation/virt/kvm/x86/pmu.rst
> >
> > diff --git a/Documentation/virt/kvm/x86/index.rst
> > b/Documentation/virt/kvm/x86/index.rst
> > index 9ece6b8dc817..02c1c7b01bf3 100644
> > --- a/Documentation/virt/kvm/x86/index.rst
> > +++ b/Documentation/virt/kvm/x86/index.rst
> > @@ -14,5 +14,6 @@ KVM for x86 systems
> >      mmu
> >      msr
> >      nested-vmx
> > +   pmu
> >      running-nested-guests
> >      timekeeping
> > diff --git a/Documentation/virt/kvm/x86/pmu.rst
> > b/Documentation/virt/kvm/x86/pmu.rst
> > new file mode 100644
> > index 000000000000..503516c41119
> > --- /dev/null
> > +++ b/Documentation/virt/kvm/x86/pmu.rst
> > @@ -0,0 +1,332 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> 
> WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
> #34: FILE: Documentation/virt/kvm/x86/pmu.rst:1:
> +.. SPDX-License-Identifier: GPL-2.0
> 
> > +
> > +==========================
> > +PMU virtualization for X86
> 
> For Intel ?
> 
> Many of the descriptions below are Intel features, so you either have to add
> AMD part or limit the scope, otherwise the reader will be confused.
> 
> > +==========================
> > +
> > +:Author: Xiong Zhang <xiong.y.zhang@intel.com>
> > +:Copyright: (c) 2023, Intel.  All rights reserved.
> > +
> > +.. Contents
> > +
> > +1. Overview
> > +2. Perf Scheduler Basic
> 
> I would strongly suggest moving this section out of KVM subsystem, Two
> possible locations:
> 
> - Documentation/admin-guide/perf
> - tools/perf/Documentation/
> 
> Or the better one, the man2/perf_event_open.2 page in the man-pages project:
> 
> -
> https://git.kernel.org/pub/scm/docs/man-pages/man-
> pages.git/tree/man2/perf_event_open.2
> 
> It's more generic and someone more appropriate could update and maintain it
> for more readers.
> Here we only need to refer to it so as not to bother users with
> outdated/mismatched content.
> 
> > +3. Arch PMU virtualization
> > +4. LBR virtualization
> 
> How about organizing it this way:
> 
> 1. PMU Overview
> 2. Capability Enumeration
> 3. Basic Counter Virtualization
> 4. Intel PEBS Virtualization
> 5. Intel LBR Virtualization
> 6. KVM PMU APIs Guidance
> 7. Limitations and TODOs
> 8. Reference
Thanks a lot for reviewing and giving suggestions.
I will import your other suggestions and adopt this organizing.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@  KVM for x86 systems
    mmu
    msr
    nested-vmx
+   pmu
    running-nested-guests
    timekeeping
diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..503516c41119
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,332 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PMU virtualization for X86
+==========================
+
+:Author: Xiong Zhang <xiong.y.zhang@intel.com>
+:Copyright: (c) 2023, Intel.  All rights reserved.
+
+.. Contents
+
+1. Overview
+2. Perf Scheduler Basic
+3. Arch PMU virtualization
+4. LBR virtualization
+
+1. Overview
+===========
+
+KVM has supported PMU virtualization on x86 for many years and provides
+MSR based Arch PMU interface to the guest. The major features include
+Arch PMU v2, LBR and PEBS. Users have the same operation to profile
+performance in guest and host.
+KVM is a normal perf subsystem user as other perf subsystem users. When
+the guest access vPMU MSRs, KVM traps it and creates a perf event for it.
+This perf event takes part in perf scheduler to request PMU resources
+and let the guest use these resources.
+
+This document describes the X86 PMU virtualization architecture design
+and opens. It is organized as follows: Next section describes more
+details of Linux perf scheduler as it takes a key role in vPMU
+implementation and allocates PMU resources for guest usage. Then Arch
+PMU virtualization and LBR virtualization are introduced, each feature
+has sections to introduce implementation overview,  the expectation and
+gaps when host and guest perf events coexist.
+
+2. Perf Scheduler Basic
+=======================
+
+Perf subsystem users can not get PMU counter or resource directly, user
+should create a perf event first and specify event’s attribute which is
+used to choose PMU counters, then perf event joins in perf scheduler,
+perf scheduler assigns the corresponding PMU counter to a perf event.
+
+Perf event is created by perf_event_open() system call::
+
+    int syscall(SYS_perf_event_open, struct perf_event_attr *,
+		pid, cpu, group_fd, flags)
+    struct perf_event_attr {
+	    ......
+	    /* Major type: hardware/software/tracepoint/etc. */
+	    __u32   type;
+	    /* Type specific configuration information. */
+	    __u64   config;
+	    union {
+		    __u64      sample_period;
+		    __u64      sample_freq;
+	    }
+	   __u64   disabled :1;
+	           pinned   :1;
+		   exclude_user  :1;
+		   exclude_kernel :1;
+		   exclude_host   :1;
+	           exclude_guest  :1;
+	......
+    }
+
+The pid and cpu arguments allow specifying which process and CPU
+to monitor::
+
+  pid == 0 and cpu == -1
+        This measures the calling process/thread on any CPU.
+  pid == 0 and cpu >= 0
+        This measures the calling process/thread only when running on
+	the specified cpu.
+  pid > 0 and cpu == -1
+        This measures the specified process/thread on any cpu.
+  pid > 0 and cpu >= 0
+        This  measures the specified process/thread only when running
+	on the specified CPU.
+  pid == -1 and cpu >= 0
+        This measures all processes/threads on the specified CPU.
+  pid == -1 and cpu == -1
+        This setting is invalid and will return an error.
+
+Perf scheduler's responsibility is choosing which events are active at
+one moment and binding counter with perf event. As processor has limited
+PMU counters and other resource, only limited perf events can be active
+at one moment, the inactive perf event may be active in the next moment,
+perf scheduler has defined rules to control these things.
+
+Perf scheduler defines four types of perf event, defined by the pid and
+cpu arguments in perf_event_open(), plus perf_event_attr.pinned, their
+schedule priority are: per_cpu pinned > per_process pinned
+> per_cpu flexible > per_process flexible. High priority events can
+preempt low priority events when resources contend.
+
+perf event type::
+
+  --------------------------------------------------------
+  |                      |   pid   |   cpu   |   pinned  |
+  --------------------------------------------------------
+  | Per-cpu pinned       |   *    |   >= 0   |     1     |
+  --------------------------------------------------------
+  | Per-process pinned   |  >= 0  |    *     |     1     |
+  --------------------------------------------------------
+  | Per-cpu flexible     |   *    |   >= 0   |     0     |
+  --------------------------------------------------------
+  | Per-process flexible | >= 0   |    *     |     0     |
+  --------------------------------------------------------
+
+perf_event abstract::
+
+    struct perf_event {
+	    struct list_head       event_entry;
+	    ......
+	    struct pmu             *pmu;
+	    enum perf_event_state  state;
+	    local64_t              count;
+	    u64                    total_time_enabled;
+	    u64                    total_time_running;
+	    struct perf_event_attr attr;
+	    ......
+    }
+
+For per-cpu perf event, it is linked into per cpu global variable
+perf_cpu_context, for per-process perf event, it is linked into
+task_struct->perf_event_context.
+
+Usually the following cases cause perf event reschedule:
+1) In a context switch from one task to a different task.
+2) When an event is manually enabled.
+3) A call to perf_event_open() with disabled field of the
+perf_event_attr argument set to 0.
+
+When perf_event_open() or perf_event_enable() is called, perf event
+reschedule is needed on a specific cpu, perf will send an IPI to the
+target cpu, and the IPI handler will activate events ordered by event
+type, and will iterate all the eligible events in per cpu gloable
+variable perf_cpu_context and current->perf_event_context.
+
+When a perf event is sched out, this event mapped counter is disabled,
+and the counter's setting and count value are saved. When a perf event
+is sched in, perf driver assigns a counter to this event, the counter's
+setting and count values are restored from last saved.
+
+If the event could not be scheduled because no resource is available for
+it, pinned event goes into error state and is excluded from perf
+scheduler, the only way to recover it is re-enable it, flexible event
+goes into inactive state and can be multiplexed with other events if
+needed.
+
+
+3. Arch PMU virtualization
+==========================
+
+3.1. Overview
+-------------
+
+Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest
+PMU driver would access the Arch PMU MSRs (including Fixed and GP
+counter) as the host does. All the guest Arch PMU MSRs accessing are
+interceptable.
+
+When a guest virtual counter is enabled through guest MSR writing, the
+KVM trap will create a kvm perf event through the perf subsystem. The
+kvm perf event's attribute is gotten from the guest virtual counter's
+MSR setting.
+
+When a guest changes the virtual counter's setting later, the KVM trap
+will release the old kvm perf event then create a new kvm perf event
+with the new setting.
+
+When guest read the virtual counter's count number, the kvm trap will
+read kvm perf event's counter value and accumulate it to the previous
+counter value.
+
+When guest no longer access the virtual counter's MSR within a
+scheduling time slice and the virtual counter is disabled, KVM will
+release the kvm perf event.
+
+vPMU diagram::
+
+  ----------------------------
+  |  Guest                   |
+  |  perf subsystem          |
+  ----------------------------
+       |            ^
+  vMSR |            | vPMI
+       v            |
+  ----------------------------
+  |  vPMU        KVM vCPU    |
+  ----------------------------
+        |          ^
+  Call  |          | Callbacks
+        v          |
+  ---------------------------
+  | Host Linux Kernel       |
+  | perf subsystem          |
+  ---------------------------
+               |       ^
+           MSR |       | PMI
+               v       |
+         --------------------
+	 | PMU        CPU   |
+         --------------------
+
+Each guest virtual counter has a corresponding kvm perf event, and the
+kvm perf event joins host perf scheduler and complies with host perf
+scheduler rule. When kvm perf event is scheduled by host perf scheduler
+and is active, the guest virtual counter could supply the correct value.
+However, if another host perf event comes in and takes over the kvm perf
+event resource, the kvm perf event will be in error state, then the
+virtual counter keeps the saved value when the kvm perf event is preempted.
+But guest perf doesn't notice the underbeach virtual counter is stopped, so
+the final guest profiling data is wrong.
+
+3.2. Host and Guest perf event contention
+-----------------------------------------
+
+Kvm perf event is a per-process pinned event, its priority is second.
+When kvm perf event is active, it can be preempted by host per-cpu
+pinned perf event, or it can preempt host flexible perf events. Such
+preemption can be temporarily prohibited through disabling host IRQ.
+
+The following results are expected when host and guest perf event
+coexist according to perf scheduler rule:
+1). if host per cpu pinned events occupy all the HW resource, kvm perf
+event can not be active as no available resource, the virtual counter
+value is zero always when the guest reads it.
+2). if host per cpu pinned event release HW resource, and kvm perf event
+is in error state, kvm perf event can claim the HW resource and switch into
+active, then the guest can get the correct value from the guest virtual
+counter during kvm perf event is active, but the guest total counter
+value is not correct since counter value is lost during kvm perf event
+is in error state.
+3). if kvm perf event is active, then host per cpu pinned perf event
+becomes active and reclaims kvm perf event resource, kvm perf event will
+be in error state. Finally the virtual counter value is kept unchanged and
+stores previous saved value when the guest reads it. So the guest total
+counter isn't correct.
+4). If host flexible perf events occupy all the HW resource, kvm perf
+event can be active and preempts host flexible perf event resource,
+the guest can get the correct value from the guest virtual counter.
+5). if kvm perf event is active, then other host flexible perf events
+request to active, kvm perf event still own the resource and active, so
+the guest can get the correct value from the guest virtual counter.
+
+3.3. vPMU Arch Gaps
+-------------------
+
+The coexist of host and guest perf events has gaps:
+1). when guest accesses PMU MSRs at the first time, KVM will trap it and
+create kvm perf event, but this event may be not active because the
+contention with host perf event. But guest doesn't notice this and when
+guest read virtual counter, the return value is zero.
+2). when kvm perf event is active, host per-cpu pinned perf event can
+reclaim kvm perf event resource at any time once resource contention
+happens. But guest doesn't notice this neither and guest following
+counter accesses get wrong data.
+So maillist had some discussion titled "Reconsider the current approach
+of vPMU".
+
+https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
+
+The major suggestion in this discussion is host pass-through some
+counters into guest, but this suggestion is not feasible, the reasons
+are:
+a. processor has several counters, but counters are not equal, some
+event must bind with a specific counter.
+b. if a special counter is passthrough into guest, host can not support
+such events and lose some capability.
+c. if a normal counter is passthrough into guest, guest can support
+general event only, and the guest has limited capability.
+So both host and guest lose capability in pass-through mode.
+
+4. LBR Virtualization
+=====================
+
+4.1. Overview
+-------------
+
+The guest LBR driver would access the LBR MSR (including IA32_DEBUGCTLMSR
+and records MSRs) as host does once KVM/QEMU export vcpu's LBR capability
+into guest,  The first guest access on LBR related MSRs is always
+interceptable. The KVM trap would create a vLBR perf event which enables
+the callstack mode and none of the hardware counters are assigned. The
+host perf would enable and schedule this event as usual.
+
+When vLBR event is scheduled by host perf scheduler and is active, host
+LBR MSRs are owned by guest and are pass-through into guest, guest will
+access them without VM Exit. However, if another host LBR event comes in
+and takes over the LBR facility, the vLBR event will be in error state,
+and the guest following access to the LBR MSRs will be trapped and
+meaningless.
+
+As kvm perf event, vLBR event will be released when guest doesn't access
+LBR-related MSRs within a scheduling time slice and guest unset LBR
+enable bit, then the pass-through state of the LBR MSRs will be canceled.
+
+4.2. Host and Guest LBR contention
+----------------------------------
+
+vLBR event is a per-process pinned event, its priority is second. vLBR
+event together with host other LBR event to contend LBR resource,
+according to perf scheduler rule, when vLBR event is active, it can be
+preempted by host per-cpu pinned LBR event, or it can preempt host
+flexible LBR event. Such preemption can be temporarily prohibited
+through disabling host IRQ as perf scheduler uses IPI to change LBR owner.
+
+The following results are expected when host and guest LBR event coexist:
+1) If host per cpu pinned LBR event is active when vm starts, the guest
+vLBR event can not preempt the LBR resource, so the guest can not use
+LBR.
+2). If host flexible LBR events are active when vm starts, guest vLBR
+event can preempt LBR, so the guest can use LBR.
+3). If host per cpu pinned LBR event becomes enabled when guest vLBR
+event is active, the guest vLBR event will lose LBR and the guest can
+not use LBR anymore.
+4). If host flexible LBR event becomes enabled when guest vLBR event is
+active, the guest vLBR event keeps LBR, the guest can still use LBR.
+5). If host per cpu pinned LBR event turns off when guest vLBR event is
+not active, guest vLBR event can be active and own LBR, the guest can use
+LBR.
+
+4.3. vLBR Arch Gaps
+-------------------
+
+Like vPMU Arch Gap, vLBR event can be preempted by host Per cpu pinned
+event at any time, or vLBR event is not active at creation, but guest
+can not notice this, so the guest will get meaningless value when the
+vLBR event is not active.