
[RFC,00/41] KVM: x86/pmu: Introduce passthrough vPMU

Message ID 20240126085444.324918-1-xiong.y.zhang@linux.intel.com (mailing list archive)
Series KVM: x86/pmu: Introduce passthrough vPMU

Message

Xiong Zhang Jan. 26, 2024, 8:54 a.m. UTC
Background
===
KVM has supported vPMU for years in the form of an emulated vPMU: KVM
presents a virtual PMU to the guest, where accesses to the PMU are trapped
and converted into perf events. These perf events are scheduled along with
other perf events at the host level, sharing the HW resources. In the
emulated vPMU design, KVM is a client of the perf subsystem and has no
control over the HW PMU resources at the host level.

This emulated vPMU has the following drawbacks:
1. Poor performance: guest PMU MSR accesses cause VM-exits, and some
involve expensive host perf API calls. Once the guest PMU starts
multiplexing its counters, KVM wastes the majority of its time
re-creating/starting/releasing KVM perf events, and guest perf
performance drops dramatically.
2. The backend of a guest perf event may be swapped out or disabled
silently. This is because the host perf scheduler treats KVM perf events
and other host perf events equally, so they contend for HW resources.
KVM perf events become inactive when all HW resources have been claimed
by host perf events, and KVM cannot report this backend error to the
guest. This silent error is a red flag for using vPMU in production.
3. It is hard to add new vPMU features. For each new vPMU feature, KVM
needs to emulate new MSRs, which involves both the perf and KVM
subsystems; in most cases a vendor-specific perf API has to be added,
which is hard to get accepted.

The community has discussed these drawbacks for years and reconsidered
the current emulated vPMU [1]. In the latest discussion [2], both the
perf and KVM x86 communities agreed to try a passthrough vPMU, so we
worked with Google engineers to develop this RFC. It is currently
implemented for Intel CPUs only; other architectures can be added later.
The complete RFC source code can be found at:
https://github.com/googleprodkernel/linux-kvm/tree/passthrough-pmu-rfc

Under the passthrough vPMU, the VM directly accesses all HW PMU
general-purpose counters and some of the fixed counters, so the VM sees
the x86 PMU HW transparently. All host perf events using the x86 PMU are
stopped while the VM is running and are restarted at VM-exit. This has
the following benefits:
1. Better performance: guest accesses to x86 PMU MSRs and rdpmc cause no
VM-exit and no host perf API call.
2. Guest perf events exclusively own the HW resources while the guest is
running. Host perf events are stopped and give up the HW resources at
VM-entry, and resume running after VM-exit.
3. It is easier to enable new PMU features: KVM just needs to pass the
new MSRs through and save/restore them at VM-exit and VM-entry; no perf
API needs to be added.

Note that the passthrough vPMU does satisfy the enterprise-level
requirement of secure PMU usage by intercepting guest accesses to all
event selectors. But the key problem with the passthrough vPMU is that
host users lose the ability to profile the guest. Users who want to
profile a guest from the host should not enable passthrough vPMU mode.
Another problem is that the NMI watchdog is no longer fully functional.
Please see the design opens for more details.

Implementation
===
To pass the host x86 PMU through to the guest, a PMU context switch is
mandatory. This RFC implements the PMU context switch at the
VM-entry/exit boundary.

At VM-entry:
1. KVM calls the perf-supplied perf_guest_enter() interface; perf stops
all perf events that use the host x86 PMU.
2. KVM calls the perf-supplied perf_guest_switch_to_kvm_pmi_vector()
interface; perf switches the PMI vector to a separate kvm_pmi_vector, so
that from this point on KVM handles PMIs and injects HW PMIs into the
guest.
3. KVM restores the guest PMU context.

To support the KVM PMU filter feature for security, the EVENT_SELECT and
FIXED_CTR_CTRL MSRs are intercepted; all other MSRs defined in the
Architectural Performance Monitoring spec, as well as rdpmc, are passed
through, so the guest can access them without a VM-exit while running.
When a guest counter overflows, the HW PMI is delivered on the dedicated
kvm_pmi_vector, and KVM injects a virtual PMI into the guest through the
virtual local APIC.

At VM-exit:
1. KVM saves and clears the guest PMU context.
2. KVM calls the perf-supplied perf_guest_switch_to_host_pmi_vector()
interface; perf switches the PMI vector back to the host NMI, so that
the host handles PMIs from this point on.
3. KVM calls the perf-supplied perf_guest_exit() interface; perf
reschedules all the perf events, and the events stopped at VM-entry are
restarted here.

Design Opens
===
We hit some design opens during this POC and seek input from the
community:

1. Host system-wide / QEMU event handling while the VM is running
   At VM-entry, all host perf events that use the host x86 PMU are
   stopped. Events with attr.exclude_guest = 1 are stopped here and
   restarted after VM-exit. Events without attr.exclude_guest = 1 end up
   in an error state and cannot recover to the active state even after
   the guest stops running. This impacts host perf significantly and
   requires host system-wide perf events to set attr.exclude_guest = 1.

   This also requires the QEMU process's perf events to set
   attr.exclude_guest = 1.

   While the VM is running, creating a system-wide or QEMU-process perf
   event without attr.exclude_guest = 1 fails with -EBUSY.

2. NMI watchdog
   The perf event for the NMI watchdog is a system-wide CPU-pinned
   event, so it is also stopped while a VM is running. It does not set
   attr.exclude_guest = 1, so we add that in this RFC. But this still
   means the NMI watchdog loses its function while a VM is running.

   Two candidates exist for replacing the NMI watchdog's perf event:
   a. The buddy hardlockup detector [3] may not be reliable enough to
      replace the perf event.
   b. The HPET-based hardlockup detector [4] is not in the upstream
      kernel.

3. Dedicated kvm_pmi_vector
   With the emulated vPMU, the host PMI handler notifies KVM to inject a
   virtual PMI into the guest when a physical PMI belongs to a guest
   counter. If the same mechanism were used with the passthrough vPMU,
   PMI skid could cause a physical PMI belonging to the guest to arrive
   after VM-exit, and the host PMI handler could not tell whether such a
   PMI belongs to the host or the guest.
   So this RFC uses a dedicated kvm_pmi_vector: only PMIs belonging to
   the guest use this vector, while PMIs belonging to the host still use
   the NMI vector.

   If PMI skid is ignored (it is a concern especially on AMD), the host
   NMI vector could be used for guest PMIs as well. That method is
   simpler and does not require the x86 subsystem to reserve a dedicated
   kvm_pmi_vector, and we did not hit the skid-PMI issue on modern Intel
   processors.

4. Per-VM passthrough mode configuration
   The current RFC uses a read-only KVM module parameter,
   enable_passthrough_pmu, which decides at module load time whether the
   vPMU runs in passthrough mode or emulated mode.
   Do we need per-VM passthrough mode configuration? With it, an admin
   could launch some non-passthrough VMs and profile them from the host,
   although the admin still could not profile all VMs once a passthrough
   VM exists. This means the passthrough vPMU and the emulated vPMU
   would coexist on one platform, which is challenging to implement. As
   the commit message of commit 0011 explains, the main challenge is
   that the passthrough vPMU and the emulated vPMU support different
   vPMU features, which ends up with two different values for
   kvm_caps.supported_perf_cap, a field initialized at module load time.
   Supporting this requires more refactoring.

Commit organization
===
0000 ~ 0003: perf extends exclude_guest to stop perf events during
             guest running.
0004 ~ 0009: perf interface for the dedicated kvm_pmi_vector.
0010 ~ 0032: core passthrough vPMU with the PMU context switch at the
             VM-entry/exit boundary.
0033 ~ 0037: intercept the EVENT_SELECT and FIXED_CTR_CTRL MSRs for
             the KVM PMU filter feature.
0038 ~ 0039: add emulated instructions to guest counters.
0040 ~ 0041: fixes for passthrough vPMU live migration and nested VMs.

Performance Data
===
Measurement method:
Step 1: the guest runs the workload without perf to get the baseline
        workload score.
Step 2: the guest runs the workload with perf commands to get the perf
        workload score.
Step 3: the perf overhead on the workload is (first - second) / first.
Finally: compare the perf overhead between the emulated vPMU and the
         passthrough vPMU.

Workload: SPECint-2017
HW platform: Sapphire Rapids, 1 socket, 56 cores, no SMT
Perf command:
a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

Guest performance overhead:
---------------------------------------------------------------------------
| Test case          | emulated vPMU | all passthrough | passthrough with |
|                    |               |                 | event filters    |
---------------------------------------------------------------------------
| basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
---------------------------------------------------------------------------
| multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
---------------------------------------------------------------------------
Note: "passthrough with event filters" means KVM intercepts the
EVENT_SELECT and FIXED_CTR_CTRL MSRs to support the KVM PMU filter
feature for security; this is the current RFC implementation. To measure
the impact of EVENT_SELECT interception, we modified the RFC source to
pass all the MSRs through to the guest; this is "all passthrough" in the
table above.

Conclusions:
1. The passthrough vPMU performs much better than the emulated vPMU.
2. Intercepting the EVENT_SELECT and FIXED_CTR_CTRL MSRs costs about 2%
overhead.
3. Since the PMU context switch happens at VM-exit/entry, more VM-exits
mean more vPMU overhead. This impacts not only perf but also other
benchmarks with massive VM-exit rates, such as fio. We will optimize
this in the second phase of the passthrough vPMU work.

Remaining Work
===
1. Optimize the PMU context switch to reduce passthrough vPMU overhead.
2. Add more PMU features such as LBR, PEBS, and perf metrics.
3. vPMU live migration.

Reference
===
1. https://lore.kernel.org/lkml/2db2ebbe-e552-b974-fc77-870d958465ba@gmail.com/
2. https://lkml.kernel.org/kvm/ZRRl6y1GL-7RM63x@google.com/
3. https://lwn.net/Articles/932497/
4. https://lwn.net/Articles/924927/

Dapeng Mi (4):
  x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
  KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  KVM: x86/pmu: Clear PERF_METRICS MSR for guest

Kan Liang (2):
  perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  perf: Support guest enter/exit interfaces

Mingwei Zhang (22):
  perf: core/x86: Forbid PMI handler when guest own PMU
  perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
    x86_pmu_cap
  KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and
    propage to KVM instance
  KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
  KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
  KVM: x86/pmu: Allow RDPMC pass through
  KVM: x86/pmu: Create a function prototype to disable MSR interception
  KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR
    interception
  KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with
    perf capabilities
  KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
  KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
    context
  KVM: x86/pmu: Introduce function prototype for Intel CPU to
    save/restore PMU context
  KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid
    information leakage
  KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU
    capability
  KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  KVM: x86/pmu: Make check_pmu_event_filter() an exported function
  KVM: x86/pmu: Allow writing to event selector for GP counters if event
    is allowed
  KVM: x86/pmu: Allow writing to fixed counter selector if counter is
    exposed
  KVM: x86/pmu: Introduce PMU helper to increment counter
  KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from
    non-passthrough vPMU
  KVM: nVMX: Add nested virtualization support for passthrough PMU

Xiong Zhang (13):
  perf: Set exclude_guest onto nmi_watchdog
  perf: core/x86: Add support to register a new vector for PMI handling
  KVM: x86/pmu: Register PMI handler for passthrough PMU
  perf: x86: Add function to switch PMI handler
  perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
  KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
  KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
  KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
  KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  KVM: x86/pmu: Intercept EVENT_SELECT MSR
  KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR

 arch/x86/events/core.c                   |  38 +++++
 arch/x86/events/intel/core.c             |   8 +
 arch/x86/events/perf_event.h             |   1 +
 arch/x86/include/asm/hardirq.h           |   1 +
 arch/x86/include/asm/idtentry.h          |   1 +
 arch/x86/include/asm/irq.h               |   1 +
 arch/x86/include/asm/irq_vectors.h       |   2 +-
 arch/x86/include/asm/kvm-x86-pmu-ops.h   |   3 +
 arch/x86/include/asm/kvm_host.h          |   8 +
 arch/x86/include/asm/msr-index.h         |   1 +
 arch/x86/include/asm/perf_event.h        |   4 +
 arch/x86/include/asm/vmx.h               |   1 +
 arch/x86/kernel/idt.c                    |   1 +
 arch/x86/kernel/irq.c                    |  29 ++++
 arch/x86/kvm/cpuid.c                     |   4 +
 arch/x86/kvm/lapic.h                     |   5 +
 arch/x86/kvm/pmu.c                       | 102 ++++++++++++-
 arch/x86/kvm/pmu.h                       |  37 ++++-
 arch/x86/kvm/vmx/capabilities.h          |   1 +
 arch/x86/kvm/vmx/nested.c                |  52 +++++++
 arch/x86/kvm/vmx/pmu_intel.c             | 186 +++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.c                   | 176 +++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h                   |   3 +-
 arch/x86/kvm/x86.c                       |  37 ++++-
 arch/x86/kvm/x86.h                       |   2 +
 include/linux/perf_event.h               |  11 ++
 kernel/events/core.c                     | 179 ++++++++++++++++++++++
 kernel/watchdog_perf.c                   |   1 +
 tools/arch/x86/include/asm/irq_vectors.h |   1 +
 29 files changed, 852 insertions(+), 44 deletions(-)


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86

Comments

Sean Christopherson April 11, 2024, 5:03 p.m. UTC | #1
<bikeshed>

I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
emulates the control plane (controls and event selectors), while the data is
fully passed through (counters).

</bikeshed>

On Fri, Jan 26, 2024, Xiong Zhang wrote:

> 1. host system wide / QEMU events handling during VM running
>    At VM-entry, all the host perf events which use host x86 PMU will be
>    stopped. These events with attr.exclude_guest = 1 will be stopped here
>    and re-started after vm-exit. These events without attr.exclude_guest=1
>    will be in error state, and they cannot recovery into active state even
>    if the guest stops running. This impacts host perf a lot and request
>    host system wide perf events have attr.exclude_guest=1.
> 
>    This requests QEMU Process's perf event with attr.exclude_guest=1 also.
> 
>    During VM running, perf event creation for system wide and QEMU
>    process without attr.exclude_guest=1 fail with -EBUSY. 
> 
> 2. NMI watchdog
>    the perf event for NMI watchdog is a system wide cpu pinned event, it
>    will be stopped also during vm running, but it doesn't have
>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
>    watchdog loses function during VM running.
> 
>    Two candidates exist for replacing perf event of NMI watchdog:
>    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.

I think the simplest solution is to allow mediated PMU usage if and only if
the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
problem to solve.

> 3. Dedicated kvm_pmi_vector
>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
>    PMI into guest when physical PMI belongs to guest counter. If the
>    same mechanism is used in passthrough vPMU and PMI skid exists
>    which cause physical PMI belonging to guest happens after VM-exit,
>    then the host PMI handler couldn't identify this PMI belongs to
>    host or guest.
>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>    has this vector only. The PMI belonging to host still has an NMI
>    vector.
> 
>    Without considering PMI skid especially for AMD, the host NMI vector
>    could be used for guest PMI also, this method is simpler and doesn't

I don't see how multiplexing NMIs between guest and host is simpler.  At best,
the complexity is a wash, just in different locations, and I highly doubt it's
a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
LVTPC.

E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.

>    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
>    didn't meet the skid PMI issue on modern Intel processors.
> 
> 4. per-VM passthrough mode configuration
>    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
>    it decides vPMU is passthrough mode or emulated mode at kvm module
>    load time.
>    Do we need the capability of per-VM passthrough mode configuration?
>    So an admin can launch some non-passthrough VM and profile these
>    non-passthrough VMs in host, but admin still cannot profile all
>    the VMs once passthrough VM existence. This means passthrough vPMU
>    and emulated vPMU mix on one platform, it has challenges to implement.
>    As the commit message in commit 0011, the main challenge is 
>    passthrough vPMU and emulated vPMU have different vPMU features, this
>    ends up with two different values for kvm_cap.supported_perf_cap, which
>    is initialized at module load time. To support it, more refactor is
>    needed.

I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
existing vPMU support entirely, but that's probably not realistic, at least not
in the near future.

> Remain Works
> ===
> 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.

Before this gets out of its "RFC" phase, I would at least like line of sight to
a more optimized switch.  I 100% agree that starting with a conservative
implementation is the way to go, and the kernel absolutely needs to be able to
profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
guest PMU loaded for the entirety of KVM_RUN isn't a viable option.

But I also don't want to get into a situation where we can't figure out a clean,
robust way to do the optimized context switch without needing (another) massive
rewrite.
Sean Christopherson April 11, 2024, 11:25 p.m. UTC | #2
On Fri, Jan 26, 2024, Xiong Zhang wrote:
> [... full patch list snipped; it is quoted verbatim from the cover letter ...]

All done with this pass.  Looks quite good, nothing on the KVM side scares me.  Nice!

I haven't spent much time thinking about whether or not the overall implementation
correct/optimal, i.e. I mostly just reviewed the mechanics.  I'll make sure to
spend a bit more time on that for the next RFC.

Please be sure to rebase to kvm-x86/next for the next RFC, there are a few patches
that will change quite a bit.
Mingwei Zhang April 11, 2024, 11:56 p.m. UTC | #3
Hi Sean,

On Thu, Apr 11, 2024 at 4:26 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > [... full patch list snipped; it is quoted verbatim from the cover letter ...]
>
> All done with this pass.  Looks quite good, nothing on the KVM side scares me.  Nice!

yay! Thank you Sean for the review!

>
> I haven't spent much time thinking about whether or not the overall implementation
> correct/optimal, i.e. I mostly just reviewed the mechanics.  I'll make sure to
> spend a bit more time on that for the next RFC.

Yes, I am expecting the debate/discussion in PUCK after v2 is sent
out. There should be room for optimization as well.

>
> Please be sure to rebase to kvm-x86/next for the next RFC, there are a few patches
> that will change quite a bit.

Will do the rebase, and all of the feedback will be incorporated into
the v2 updates. In v2, we will add AMD support for the passthrough vPMU,
and we will do our best to deliver it in high quality.

Thanks.
-Mingwei
Xiong Zhang April 12, 2024, 2:19 a.m. UTC | #4
On 4/12/2024 1:03 AM, Sean Christopherson wrote:
> <bikeshed>
> 
> I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> emulates the control plane (controls and event selectors), while the data is
> fully passed through (counters).
> 
> </bikeshed>
> 
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> 
>> 1. host system wide / QEMU events handling during VM running
>>    At VM-entry, all the host perf events which use host x86 PMU will be
>>    stopped. These events with attr.exclude_guest = 1 will be stopped here
>>    and re-started after vm-exit. These events without attr.exclude_guest=1
>>    will be in error state, and they cannot recovery into active state even
>>    if the guest stops running. This impacts host perf a lot and request
>>    host system wide perf events have attr.exclude_guest=1.
>>
>>    This requests QEMU Process's perf event with attr.exclude_guest=1 also.
>>
>>    During VM running, perf event creation for system wide and QEMU
>>    process without attr.exclude_guest=1 fail with -EBUSY. 
>>
>> 2. NMI watchdog
>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
>>    will be stopped also during vm running, but it doesn't have
>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
>>    watchdog loses function during VM running.
>>
>>    Two candidates exist for replacing perf event of NMI watchdog:
>>    a. Buddy hardlock detector[3] may be not reliable to replace perf event.
>>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
> 
> I think the simplest solution is to allow mediated PMU usage if and only if
> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> problem to solve.
Makes sense. KVM should not affect high-priority host work.
The NMI watchdog is a client of perf and uses a system-wide perf event,
and perf cannot distinguish whether a given system-wide perf event is
the NMI watchdog or something else. So how about we extend this
suggestion to all system-wide perf events: the mediated PMU is allowed
only when all system-wide perf events are disabled or nonexistent at VM
creation?
But the NMI watchdog is usually enabled, so this would limit mediated
PMU usage.
> 
>> 3. Dedicated kvm_pmi_vector
>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
>>    PMI into guest when physical PMI belongs to guest counter. If the
>>    same mechanism is used in passthrough vPMU and PMI skid exists
>>    which cause physical PMI belonging to guest happens after VM-exit,
>>    then the host PMI handler couldn't identify this PMI belongs to
>>    host or guest.
>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>>    has this vector only. The PMI belonging to host still has an NMI
>>    vector.
>>
>>    Without considering PMI skid especially for AMD, the host NMI vector
>>    could be used for guest PMI also, this method is simpler and doesn't
> 
> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> the complexity is a wash, just in different locations, and I highly doubt it's
> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> LVTPC.
When kvm_intel.pt_mode=PT_MODE_HOST_GUEST, the guest PT PMI is an NMI
multiplexed between guest and host; we could extend the guest PT PMI
framework to the mediated PMU, so I think this approach is simpler.
> 
> E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> 
>>    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
>>    didn't meet the skid PMI issue on modern Intel processors.
>>
>> 4. per-VM passthrough mode configuration
>>    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
>>    it decides vPMU is passthrough mode or emulated mode at kvm module
>>    load time.
>>    Do we need the capability of per-VM passthrough mode configuration?
>>    So an admin can launch some non-passthrough VM and profile these
>>    non-passthrough VMs in host, but admin still cannot profile all
>>    the VMs once passthrough VM existence. This means passthrough vPMU
>>    and emulated vPMU mix on one platform, it has challenges to implement.
>>    As the commit message in commit 0011, the main challenge is 
>>    passthrough vPMU and emulated vPMU have different vPMU features, this
>>    ends up with two different values for kvm_cap.supported_perf_cap, which
>>    is initialized at module load time. To support it, more refactor is
>>    needed.
> 
> I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> existing vPMU support entirely, but that's probably not be realistic, at least not
> in the near future.
> 
>> Remain Works
>> ===
>> 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> 
> Before this gets out of its "RFC" phase, I would at least like line of sight to
> a more optimized switch.  I 100% agree that starting with a conservative
> implementation is the way to go, and the kernel absolutely needs to be able to
> profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> 
> But I also don't want to get into a situation where we can't figure out a clean,
> robust way to do the optimized context switch without needing (another) massive
> rewrite.
> 
The current PMU context switch happens at each VM-entry/exit, which impacts guest performance even if the guest doesn't use the PMU. As a first optimization, we will switch the PMU context only when the guest actually uses the PMU.
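The lazy-switch idea above can be sketched in a few lines of illustrative userspace C. All names here are hypothetical stand-ins, not actual KVM code; the point is simply that the save/restore cost is only paid once the guest has touched the PMU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU PMU bookkeeping; a real implementation would live
 * in KVM's VM-entry/exit paths. */
struct vcpu_pmu {
	bool guest_uses_pmu;	/* set when the guest writes a PMU MSR */
	int  switch_count;	/* number of save/restore operations done */
};

/* VM-entry: skip the PMU context switch entirely for PMU-idle guests. */
static void pmu_vm_entry(struct vcpu_pmu *pmu)
{
	if (pmu->guest_uses_pmu)
		pmu->switch_count++;	/* stand-in for save-host + load-guest */
}

/* VM-exit: symmetric to VM-entry. */
static void pmu_vm_exit(struct vcpu_pmu *pmu)
{
	if (pmu->guest_uses_pmu)
		pmu->switch_count++;	/* stand-in for save-guest + load-host */
}
```

A guest that never touches a PMU MSR would pay zero switch cost per entry/exit; once it programs a counter, every entry/exit pays the full cost, matching the conservative behavior.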

thanks
Sean Christopherson April 12, 2024, 6:32 p.m. UTC | #5
On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
> >> 2. NMI watchdog
>>    the perf event for the NMI watchdog is a system-wide, CPU-pinned event. It
>>    will also be stopped during VM running, but it doesn't have
>>    attr.exclude_guest=1; we add that in this RFC. This still means the NMI
>>    watchdog loses its function while a VM is running.
> >>
> >>    Two candidates exist for replacing perf event of NMI watchdog:
>>    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
>>    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
> > 
> > I think the simplest solution is to allow mediated PMU usage if and only if
> > the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> > problem to solve.
> Makes sense. KVM should not affect high-priority host work.
> The NMI watchdog is a client of perf and is a system-wide perf event. Perf can't
> distinguish whether a system-wide perf event is the NMI watchdog or something
> else, so how about we extend this suggestion to all system-wide perf events?
> The mediated PMU would only be allowed when all system-wide perf events are
> disabled or nonexistent at VM creation.

What other kernel-driven system wide perf events are there?

> But the NMI watchdog is usually enabled, which will limit mediated PMU usage.

I don't think it is at all unreasonable to require users that want optimal PMU
virtualization to adjust their environment.  And we can and should document the
tradeoffs and alternatives, e.g. so that users that want better PMU results don't
need to re-discover all the "gotchas" on their own.

This would even be one of the rare times where I would be ok with a dmesg log.
E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
perf events, pr_warn() to explain the conflict and direct the user at documentation
explaining how to make their system compatible with mediated PMU usage.
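As a rough sketch of what such a gate could look like (plain userspace C with hypothetical types; a real implementation would walk perf's per-CPU contexts inside the kernel and pr_warn() on conflict):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a host perf event: just the two fields
 * the mediated-PMU gate would care about. */
struct host_event {
	bool system_wide;	/* CPU-scoped rather than task-scoped */
	bool exclude_guest;	/* attr.exclude_guest */
};

/*
 * Return true if enabling the mediated PMU is compatible with the current
 * host events: every system-wide event must carry exclude_guest=1 (the NMI
 * watchdog event does not, so its presence would block mediated-PMU usage).
 */
static bool mediated_pmu_compatible(const struct host_event *ev, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (ev[i].system_wide && !ev[i].exclude_guest)
			return false;	/* would be silently stopped/errored */
	}
	return true;
}
```

This keeps the policy decision in one place: whether the blocking event is the NMI watchdog or a user's `perf record -a` session is irrelevant to the check itself.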

> >> 3. Dedicated kvm_pmi_vector
> >>    In emulated vPMU, the host PMI handler notifies KVM to inject a virtual
> >>    PMI into the guest when a physical PMI belongs to a guest counter. If the
> >>    same mechanism is used in passthrough vPMU and PMI skid exists,
> >>    which causes a physical PMI belonging to the guest to arrive after VM-exit,
> >>    then the host PMI handler couldn't identify whether this PMI belongs to
> >>    host or guest.
> >>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> >>    has this vector only. The PMI belonging to host still has an NMI
> >>    vector.
> >>
> >>    Without considering PMI skid especially for AMD, the host NMI vector
> >>    could be used for guest PMI also, this method is simpler and doesn't
> > 
> > I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> > the complexity is a wash, just in different locations, and I highly doubt it's
> > a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> > LVTPC.
> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
> NMI between guest and host, we could extend guest PT's PMI framework to
> mediated PMU. so I think this is simpler.

Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
PT framework of multiplexing NMI?
Xiong Zhang April 15, 2024, 1:06 a.m. UTC | #6
On 4/13/2024 2:32 AM, Sean Christopherson wrote:
> On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
>>>> 2. NMI watchdog
>>>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
>>>>    will be stopped also during vm running, but it doesn't have
>>>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
>>>>    watchdog loses function during VM running.
>>>>
>>>>    Two candidates exist for replacing perf event of NMI watchdog:
>>>>    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
>>>>    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
>>>
>>> I think the simplest solution is to allow mediated PMU usage if and only if
>>> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
>>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
>>> problem to solve.
>> Make sense. KVM should not affect host high priority work.
>> NMI watchdog is a client of perf and is a system wide perf event, perf can't
>> distinguish a system wide perf event is NMI watchdog or others, so how about
>> we extend this suggestion to all the system wide perf events ?  mediated PMU
>> is only allowed when all system wide perf events are disabled or non-exist at
>> vm creation.
> 
> What other kernel-driven system wide perf events are there?
Does "kernel-driven" mean perf events created through perf_event_create_kernel_counter(), like the NMI watchdog and KVM perf events?
Users can also create system-wide perf events through "perf record -e {} -a"; I call these user-driven system-wide perf events.
The perf subsystem doesn't distinguish between "kernel-driven" and "user-driven" system-wide perf events.
> 
>> but NMI watchdog is usually enabled, this will limit mediated PMU usage.
> 
> I don't think it is at all unreasonable to require users that want optimal PMU
> virtualization to adjust their environment.  And we can and should document the
> tradeoffs and alternatives, e.g. so that users that want better PMU results don't
> need to re-discover all the "gotchas" on their own.
> 
> This would even be one of the rare times where I would be ok with a dmesg log.
> E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
> perf events, pr_warn() to explain the conflict and direct the user at documentation
> explaining how to make their system compatible with mediated PMU usage.
> 
>>>> 3. Dedicated kvm_pmi_vector
>>>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
>>>>    PMI into guest when physical PMI belongs to guest counter. If the
>>>>    same mechanism is used in passthrough vPMU and PMI skid exists
>>>>    which cause physical PMI belonging to guest happens after VM-exit,
>>>>    then the host PMI handler couldn't identify this PMI belongs to
>>>>    host or guest.
>>>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>>>>    has this vector only. The PMI belonging to host still has an NMI
>>>>    vector.
>>>>
>>>>    Without considering PMI skid especially for AMD, the host NMI vector
>>>>    could be used for guest PMI also, this method is simpler and doesn't
>>>
>>> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
>>> the complexity is a wash, just in different locations, and I highly doubt it's
>>> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
>>> LVTPC.
>> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
>> NMI between guest and host, we could extend guest PT's PMI framework to
>> mediated PMU. so I think this is simpler.
> 
> Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
> PT framework of multiplexing NMI?
Here, "this" means "extending the PT framework of multiplexing NMI".

thanks
>
Sean Christopherson April 15, 2024, 3:05 p.m. UTC | #7
On Mon, Apr 15, 2024, Xiong Y Zhang wrote:
> On 4/13/2024 2:32 AM, Sean Christopherson wrote:
> > On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
> >>>> 2. NMI watchdog
> >>>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
> >>>>    will be stopped also during vm running, but it doesn't have
> >>>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> >>>>    watchdog loses function during VM running.
> >>>>
> >>>>    Two candidates exist for replacing perf event of NMI watchdog:
> >>>>    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
> >>>>    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
> >>>
> >>> I think the simplest solution is to allow mediated PMU usage if and only if
> >>> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> >>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> >>> problem to solve.
> >> Make sense. KVM should not affect host high priority work.
> >> NMI watchdog is a client of perf and is a system wide perf event, perf can't
> >> distinguish a system wide perf event is NMI watchdog or others, so how about
> >> we extend this suggestion to all the system wide perf events ?  mediated PMU
> >> is only allowed when all system wide perf events are disabled or non-exist at
> >> vm creation.
> > 
> > What other kernel-driven system wide perf events are there?
> does "kernel-driven" mean perf events created through
> perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ?

By kernel-driven I meant events that aren't tied to a single userspace process
or action.

E.g. KVM creates events, but those events are effectively user-driven because
they will go away if the associated VM terminates.

> User can create system wide perf event through "perf record -e {} -a" also, I
> call it as user-driven system wide perf events.  Perf subsystem doesn't
> distinguish "kernel-driven" and "user-driven" system wide perf events.

Right, but us humans can build a list, even if it's only for documentation, e.g.
to provide help for someone to run KVM guests with mediated PMUs, but can't
because there are active !exclude_guest events.

> >> but NMI watchdog is usually enabled, this will limit mediated PMU usage.
> > 
> > I don't think it is at all unreasonable to require users that want optimal PMU
> > virtualization to adjust their environment.  And we can and should document the
> > tradeoffs and alternatives, e.g. so that users that want better PMU results don't
> > need to re-discover all the "gotchas" on their own.
> > 
> > This would even be one of the rare times where I would be ok with a dmesg log.
> > E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
> > perf events, pr_warn() to explain the conflict and direct the user at documentation
> > > explaining how to make their system compatible with mediated PMU usage.
> > > 
> >>>> 3. Dedicated kvm_pmi_vector
> >>>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> >>>>    PMI into guest when physical PMI belongs to guest counter. If the
> >>>>    same mechanism is used in passthrough vPMU and PMI skid exists
> >>>>    which cause physical PMI belonging to guest happens after VM-exit,
> >>>>    then the host PMI handler couldn't identify this PMI belongs to
> >>>>    host or guest.
> >>>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> >>>>    has this vector only. The PMI belonging to host still has an NMI
> >>>>    vector.
> >>>>
> >>>>    Without considering PMI skid especially for AMD, the host NMI vector
> >>>>    could be used for guest PMI also, this method is simpler and doesn't
> >>>
> >>> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> >>> the complexity is a wash, just in different locations, and I highly doubt it's
> >>> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> >>> LVTPC.
> >> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
> >> NMI between guest and host, we could extend guest PT's PMI framework to
> >> mediated PMU. so I think this is simpler.
> > 
> > Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
> > PT framework of multiplexing NMI?
> here "this" means "extending the PT framework of multiplexing NMI".

The PT framework's multiplexing is just as crude as regular PMIs though.  Perf
basically just asks KVM: is this yours?  And KVM simply checks that the callback
occurred while KVM_HANDLING_NMI is set.

E.g. prior to commit 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region"),
nothing would prevent perf from miscontruing a host PMI as a guest PMI, because
KVM re-enabled host PT prior to servicing guest NMIs, i.e. host PT would be active
while KVM_HANDLING_NMI is set.

And conversely, if a guest PMI skids past VM-Exit, as things currently stand, the
NMI will always be treated as host PMI, because KVM will not be in KVM_HANDLING_NMI.
KVM's emulated PMI can (and should) eliminate false positives for host PMIs by
precisely checking exclude_guest, but that doesn't help with false negatives for
guest PMIs, nor does it help with NMIs that aren't perf related, i.e. didn't come
from the LVTPC.

Is a naive implementation simpler?  Maybe.  But IMO, multiplexing NMI and getting
all the edge cases right is more complex than using a dedicated vector for guest
PMIs, as the latter provides a "hard" boundary and allows the kernel to _know_ that
an interrupt is for a guest PMI.
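The attribution difference can be reduced to a toy model (all names and vector numbers below are hypothetical): with a shared NMI vector, KVM can only guess ownership from its own state, so a skidded guest PMI is silently misattributed, whereas a dedicated vector makes ownership a property of the interrupt itself.

```c
#include <assert.h>
#include <stdbool.h>

enum pmi_owner { PMI_HOST, PMI_GUEST };

/* Shared-vector heuristic: perf asks KVM "is this yours?", and KVM can only
 * answer based on whether the NMI arrived while it was handling a
 * VM-exit-induced NMI.  A guest PMI that skids past that window is
 * misattributed to the host. */
static enum pmi_owner classify_shared_nmi(bool kvm_handling_nmi)
{
	return kvm_handling_nmi ? PMI_GUEST : PMI_HOST;
}

/* Dedicated-vector scheme: the vector number itself is the "hard" boundary,
 * independent of any skid. */
static enum pmi_owner classify_dedicated(int vector, int kvm_pmi_vector)
{
	return vector == kvm_pmi_vector ? PMI_GUEST : PMI_HOST;
}
```

In this model, a guest PMI that skids past VM-exit arrives with kvm_handling_nmi == false and is classified as a host PMI by the heuristic, while the dedicated-vector classifier still attributes it correctly.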
Xiong Zhang April 16, 2024, 5:11 a.m. UTC | #8
On 4/15/2024 11:05 PM, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Xiong Y Zhang wrote:
>> On 4/13/2024 2:32 AM, Sean Christopherson wrote:
>>> On Fri, Apr 12, 2024, Xiong Y Zhang wrote:
>>>>>> 2. NMI watchdog
>>>>>>    the perf event for NMI watchdog is a system wide cpu pinned event, it
>>>>>>    will be stopped also during vm running, but it doesn't have
>>>>>>    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
>>>>>>    watchdog loses function during VM running.
>>>>>>
>>>>>>    Two candidates exist for replacing perf event of NMI watchdog:
>>>>>>    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
>>>>>>    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
>>>>>
>>>>> I think the simplest solution is to allow mediated PMU usage if and only if
>>>>> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
>>>>> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
>>>>> problem to solve.
>>>> Make sense. KVM should not affect host high priority work.
>>>> NMI watchdog is a client of perf and is a system wide perf event, perf can't
>>>> distinguish a system wide perf event is NMI watchdog or others, so how about
>>>> we extend this suggestion to all the system wide perf events ?  mediated PMU
>>>> is only allowed when all system wide perf events are disabled or non-exist at
>>>> vm creation.
>>>
>>> What other kernel-driven system wide perf events are there?
>> does "kernel-driven" mean perf events created through
>> perf_event_create_kernel_counter() like nmi_watchdog and kvm perf events ?
> 
> By kernel-driven I meant events that aren't tied to a single userspace process
> or action.
> 
> E.g. KVM creates events, but those events are effectively user-driven because
> they will go away if the associated VM terminates.
> 
>> User can create system wide perf event through "perf record -e {} -a" also, I
>> call it as user-driven system wide perf events.  Perf subsystem doesn't
>> distinguish "kernel-driven" and "user-driven" system wide perf events.
> 
> Right, but us humans can build a list, even if it's only for documentation, e.g.
> to provide help for someone to run KVM guests with mediated PMUs, but can't
> because there are active !exclude_guest events.
> 
>>>> but NMI watchdog is usually enabled, this will limit mediated PMU usage.
>>>
>>> I don't think it is at all unreasonable to require users that want optimal PMU
>>> virtualization to adjust their environment.  And we can and should document the
>>> tradeoffs and alternatives, e.g. so that users that want better PMU results don't
>>> need to re-discover all the "gotchas" on their own.
>>>
>>> This would even be one of the rare times where I would be ok with a dmesg log.
>>> E.g. if KVM is loaded with enable_mediated_pmu=true, but there are system wide
>>> perf events, pr_warn() to explain the conflict and direct the user at documentation
>>> explaining how to make their system compatible with mediated PMU usage.
>>> 
>>>>>> 3. Dedicated kvm_pmi_vector
>>>>>>    In emulated vPMU, host PMI handler notify KVM to inject a virtual
>>>>>>    PMI into guest when physical PMI belongs to guest counter. If the
>>>>>>    same mechanism is used in passthrough vPMU and PMI skid exists
>>>>>>    which cause physical PMI belonging to guest happens after VM-exit,
>>>>>>    then the host PMI handler couldn't identify this PMI belongs to
>>>>>>    host or guest.
>>>>>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>>>>>>    has this vector only. The PMI belonging to host still has an NMI
>>>>>>    vector.
>>>>>>
>>>>>>    Without considering PMI skid especially for AMD, the host NMI vector
>>>>>>    could be used for guest PMI also, this method is simpler and doesn't
>>>>>
>>>>> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
>>>>> the complexity is a wash, just in different locations, and I highly doubt it's
>>>>> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
>>>>> LVTPC.
>>>> when kvm_intel.pt_mode=PT_MODE_HOST_GUEST, guest PT's PMI is a multiplexing
>>>> NMI between guest and host, we could extend guest PT's PMI framework to
>>>> mediated PMU. so I think this is simpler.
>>>
>>> Heh, what do you mean by "this"?  Using a dedicated IRQ vector, or extending the
>>> PT framework of multiplexing NMI?
>> here "this" means "extending the PT framework of multiplexing NMI".
> 
> The PT framework's multiplexing is just as crude as regular PMIs though.  Perf
> basically just asks KVM: is this yours?  And KVM simply checks that the callback
> occurred while KVM_HANDLING_NMI is set.
> 
> E.g. prior to commit 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region"),
> nothing would prevent perf from misconstruing a host PMI as a guest PMI, because
> KVM re-enabled host PT prior to servicing guest NMIs, i.e. host PT would be active
> while KVM_HANDLING_NMI is set.
> 
> And conversely, if a guest PMI skids past VM-Exit, as things currently stand, the
> NMI will always be treated as host PMI, because KVM will not be in KVM_HANDLING_NMI.
> KVM's emulated PMI can (and should) eliminate false positives for host PMIs by
> precisely checking exclude_guest, but that doesn't help with false negatives for
> guest PMIs, nor does it help with NMIs that aren't perf related, i.e. didn't come
> from the LVTPC.
> 
> Is a naive implementation simpler?  Maybe.  But IMO, multiplexing NMI and getting
> all the edge cases right is more complex than using a dedicated vector for guest
> PMIs, as the latter provides a "hard" boundary and allows the kernel to _know_ that
> an interrupt is for a guest PMI.
Totally agree on the complexity of fixing the multiplexing-NMI corner cases. Thanks for the explanation.
Mingwei Zhang April 18, 2024, 8:46 p.m. UTC | #9
On Thu, Apr 11, 2024, Sean Christopherson wrote:
> <bikeshed>
> 
> I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> emulates the control plane (controls and event selectors), while the data is
> fully passed through (counters).
> 
> </bikeshed>
Sean,

I feel "mediated PMU" seems a little bit off the ..., no? In
KVM, almost all features are mediated. In our specific case, the
legacy PMU is mediated by KVM and the perf subsystem on the host. In the
new design, it is mediated by KVM only.

We intercept the control plane in the current design, but the only thing
we do is event filtering. There is no fancy code change to emulate the
control registers. So, it is still passthrough logic.
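That interception model, counters passed through while event selectors are intercepted only to apply the VM's event filter, might be sketched like this (hypothetical helper names; the selector layout follows the architectural PERFEVTSEL format, with the event select in bits 7:0 and the unit mask in bits 15:8):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-VM allow-list of event_select|umask values, i.e. bits
 * 15:0 of IA32_PERFEVTSELx. */
struct pmu_event_filter {
	const uint16_t *allowed;
	size_t nr;
};

/*
 * Called on an intercepted WRMSR to IA32_PERFEVTSELx: allow the write to
 * reach hardware only if the programmed event passes the VM's filter.
 * The counters (IA32_PMCx) themselves would stay passed through.
 */
static bool evtsel_write_allowed(const struct pmu_event_filter *f,
				 uint64_t evtsel)
{
	uint16_t event = (uint16_t)(evtsel & 0xffff);	/* select + umask */

	for (size_t i = 0; i < f->nr; i++) {
		if (f->allowed[i] == event)
			return true;
	}
	return false;
}
```

A write that passes the filter would simply be forwarded to the real MSR; a rejected write could be dropped or zeroed, which is the only "emulation" the control plane needs in this design.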

In some (rare) business cases, I think we could fully pass through
the control plane as well, for instance a sole-tenant machine, or a
full-machine VM + full offload. If there is a CPU erratum, KVM
can force a VM-exit and dynamically intercept the selectors on all vCPUs
with filters checked. This is not supported in the current RFC, but may be
doable in later versions.

With the above, I wonder if we can still use "passthrough PMU" for
simplicity? But I have no strong opinion if you really want to keep this
name; I would just need some time to convince myself.

Thanks.
-Mingwei
> 
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> 
> > 1. host system wide / QEMU events handling during VM running
> >    At VM-entry, all the host perf events which use host x86 PMU will be
> >    stopped. These events with attr.exclude_guest = 1 will be stopped here
> >    and re-started after vm-exit. These events without attr.exclude_guest=1
> >    will be in an error state, and they cannot recover to an active state even
> >    if the guest stops running. This impacts host perf a lot and requires that
> >    host system-wide perf events have attr.exclude_guest=1.
> > 
> >    This requires the QEMU process's perf events to have attr.exclude_guest=1 also.
> > 
> >    During VM running, perf event creation for system-wide and QEMU
> >    process events without attr.exclude_guest=1 fails with -EBUSY.
> > 
> > 2. NMI watchdog
> >    the perf event for NMI watchdog is a system wide cpu pinned event, it
> >    will be stopped also during vm running, but it doesn't have
> >    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> >    watchdog loses function during VM running.
> > 
> >    Two candidates exist for replacing perf event of NMI watchdog:
> >    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
> >    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
> 
> I think the simplest solution is to allow mediated PMU usage if and only if
> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> problem to solve.
> 
> > 3. Dedicated kvm_pmi_vector
> >    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> >    PMI into guest when physical PMI belongs to guest counter. If the
> >    same mechanism is used in passthrough vPMU and PMI skid exists
> >    which cause physical PMI belonging to guest happens after VM-exit,
> >    then the host PMI handler couldn't identify this PMI belongs to
> >    host or guest.
> >    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> >    has this vector only. The PMI belonging to host still has an NMI
> >    vector.
> > 
> >    Without considering PMI skid especially for AMD, the host NMI vector
> >    could be used for guest PMI also, this method is simpler and doesn't
> 
> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> the complexity is a wash, just in different locations, and I highly doubt it's
> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> LVTPC.
> 
> E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> 
> >    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
> >    didn't meet the skid PMI issue on modern Intel processors.
> > 
> > 4. per-VM passthrough mode configuration
> >    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
> >    it decides vPMU is passthrough mode or emulated mode at kvm module
> >    load time.
> >    Do we need the capability of per-VM passthrough mode configuration?
> >    So an admin can launch some non-passthrough VM and profile these
> >    non-passthrough VMs in host, but admin still cannot profile all
> >    the VMs once passthrough VM existence. This means passthrough vPMU
> >    and emulated vPMU mix on one platform, it has challenges to implement.
> >    As the commit message in commit 0011, the main challenge is 
> >    passthrough vPMU and emulated vPMU have different vPMU features, this
> >    ends up with two different values for kvm_cap.supported_perf_cap, which
> >    is initialized at module load time. To support it, more refactor is
> >    needed.
> 
> I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> existing vPMU support entirely, but that's probably not realistic, at least not
> in the near future.
> 
> > Remain Works
> > ===
> > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> 
> Before this gets out of its "RFC" phase, I would at least like line of sight to
> a more optimized switch.  I 100% agree that starting with a conservative
> implementation is the way to go, and the kernel absolutely needs to be able to
> profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> 
> But I also don't want to get into a situation where we can't figure out a clean,
> robust way to do the optimized context switch without needing (another) massive
> rewrite.
Mingwei Zhang April 18, 2024, 9:52 p.m. UTC | #10
On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > <bikeshed>
> > 
> > I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> > emulates the control plane (controls and event selectors), while the data is
> > fully passed through (counters).
> > 
> > </bikeshed>
> Sean,
> 
> I feel "mediated PMU" seems to be a little bit off the ..., no? In
> KVM, almost all of features are mediated. In our specific case, the
> legacy PMU is mediated by KVM and perf subsystem on the host. In new
> design, it is mediated by KVM only.
> 
> We intercept the control plane in current design, but the only thing
> we do is the event filtering. No fancy code change to emulate the control
> registers. So, it is still a passthrough logic.
> 
> In some (rare) business cases, I think maybe we could fully passthrough
> the control plane as well. For instance, sole-tenant machine, or
> full-machine VM + full offload. In case if there is a cpu errata, KVM
> can force vmexit and dynamically intercept the selectors on all vcpus
> with filters checked. It is not supported in current RFC, but maybe
> doable in later versions.
> 
> With the above, I wonder if we can still use passthrough PMU for
> simplicity? But no strong opinion if you really want to keep this name.
> I would have to take some time to convince myself.
> 

One proposal: maybe "direct vPMU"? I think there are many words that
focus on the "passthrough" side but not on the "interception/mediation"
side.

> Thanks.
> -Mingwei
> > 
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > 
> > > 1. host system wide / QEMU events handling during VM running
> > >    At VM-entry, all the host perf events which use host x86 PMU will be
> > >    stopped. These events with attr.exclude_guest = 1 will be stopped here
> > >    and re-started after vm-exit. These events without attr.exclude_guest=1
> > >    will be in an error state, and they cannot recover to an active state even
> > >    if the guest stops running. This impacts host perf a lot and requires that
> > >    host system-wide perf events have attr.exclude_guest=1.
> > > 
> > >    This requires the QEMU process's perf events to have attr.exclude_guest=1 also.
> > > 
> > >    During VM running, perf event creation for system wide and QEMU
> > >    process without attr.exclude_guest=1 fail with -EBUSY. 
> > > 
> > > 2. NMI watchdog
> > >    the perf event for NMI watchdog is a system wide cpu pinned event, it
> > >    will be stopped also during vm running, but it doesn't have
> > >    attr.exclude_guest=1, we add it in this RFC. But this still means NMI
> > >    watchdog loses function during VM running.
> > > 
> > >    Two candidates exist for replacing perf event of NMI watchdog:
> > >    a. The buddy hardlockup detector [3] may not be reliable enough to replace the perf event.
> > >    b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
> > 
> > I think the simplest solution is to allow mediated PMU usage if and only if
> > the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> > problem to solve.
> > 
> > > 3. Dedicated kvm_pmi_vector
> > >    In emulated vPMU, host PMI handler notify KVM to inject a virtual
> > >    PMI into guest when physical PMI belongs to guest counter. If the
> > >    same mechanism is used in passthrough vPMU and PMI skid exists
> > >    which cause physical PMI belonging to guest happens after VM-exit,
> > >    then the host PMI handler couldn't identify this PMI belongs to
> > >    host or guest.
> > >    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
> > >    has this vector only. The PMI belonging to host still has an NMI
> > >    vector.
> > > 
> > >    Without considering PMI skid especially for AMD, the host NMI vector
> > >    could be used for guest PMI also, this method is simpler and doesn't
> > 
> > I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> > the complexity is a wash, just in different locations, and I highly doubt it's
> > a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> > LVTPC.
> > 
> > E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> > SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> > 
> > >    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
> > >    didn't meet the skid PMI issue on modern Intel processors.
> > > 
> > > 4. per-VM passthrough mode configuration
> > >    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
> > >    it decides vPMU is passthrough mode or emulated mode at kvm module
> > >    load time.
> > >    Do we need the capability of per-VM passthrough mode configuration?
> > >    So an admin can launch some non-passthrough VM and profile these
> > >    non-passthrough VMs in host, but admin still cannot profile all
> > >    the VMs once passthrough VM existence. This means passthrough vPMU
> > >    and emulated vPMU mix on one platform, it has challenges to implement.
> > >    As the commit message in commit 0011, the main challenge is 
> > >    passthrough vPMU and emulated vPMU have different vPMU features, this
> > >    ends up with two different values for kvm_cap.supported_perf_cap, which
> > >    is initialized at module load time. To support it, more refactor is
> > >    needed.
> > 
> > I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> > existing vPMU support entirely, but that's probably not realistic, at least not
> > in the near future.
> > 
> > > Remain Works
> > > ===
> > > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> > 
> > Before this gets out of its "RFC" phase, I would at least like line of sight to
> > a more optimized switch.  I 100% agree that starting with a conservative
> > implementation is the way to go, and the kernel absolutely needs to be able to
> > profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> > guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> > 
> > But I also don't want to get into a situation where we can't figure out a clean,
> > robust way to do the optimized context switch without needing (another) massive
> > rewrite.
Sean Christopherson April 19, 2024, 7:14 p.m. UTC | #11
On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > <bikeshed>
> > 
> > I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> > emulates the control plane (controls and event selectors), while the data is
> > fully passed through (counters).
> > 
> > </bikeshed>
> Sean,
> 
> I feel "mediated PMU" seems to be a little bit off the ..., no? In
> KVM, almost all features are mediated. In our specific case, the
> legacy PMU is mediated by KVM and the perf subsystem on the host. In
> the new design, it is mediated by KVM only.

Currently, at a feature level, I mentally bin things into two rough categories
in KVM:

 1. Virtualized - Guest state is loaded into hardware, or hardware supports
                  running with both host and guest state (e.g. TSC scaling), and
                  the guest has full read/write access to its state while running.

 2. Emulated    - Guest state is never loaded into hardware, and instead the 
                  feature/state is emulated in software.

There is no "Passthrough" because that's (mostly) covered by my Virtualized
definition.   And because I also think of passthrough as being about *assets*,
not about the features themselves.
 
They are far from perfect definitions, e.g. individual assets can be passed through,
virtualized by hardware, or emulated in software.  But for the most part, I think
classifying features as virtualized vs. emulated works well, as it helps reason
about the expected behavior and performance of a feature.

E.g. for some virtualized features, certain assets may need to be explicitly passed
through, e.g. access to x2APIC MSRs for APICv.  But APICv itself still falls
into the virtualized category, e.g. the "real" APIC state isn't passed through
to the guest.

If KVM didn't already have a PMU implementation to deal with, this wouldn't be
an issue, e.g. we'd just add "enable_pmu" and I'd mentally bin it into the
virtualized category.  But we need to distinguish between the two PMU models,
and using "enable_virtualized_pmu" would be comically confusing for users. :-)

And because this is user visible, I would like to come up with a name that (some)
KVM users will already be familiar with, i.e. will have some chance of intuitively
understanding it without having to go read the docs.

Which is why I proposed "mediated"; what we are proposing for the PMU is similar
to the "mediated device" concepts in VFIO.  And I also think "mediated" is a good
fit in general, e.g. this becomes my third classification:

 3. Mediated    - Guest state is context switched at VM-Enter/VM-Exit, i.e. is loaded
                  into hardware, but the guest does NOT have full read/write access
                  to the feature.

But my main motivation for using "mediated" really is that I hope that it will
help KVM users grok the basic gist of the design without having to read and
understand KVM documentation, because there is already existing terminology in
the broader KVM space.

> We intercept the control plane in the current design, but the only thing
> we do is the event filtering. No fancy code change to emulate the control
> registers. So, it is still a passthrough logic.

It's not though.  Passthrough very specifically means the guest has unfettered
access to some asset, and/or KVM does no filtering/adjustments whatsoever.

"Direct" is similar, e.g. KVM uses "direct" in MMU context to refer to addresses
that don't require KVM to intervene and translate.  E.g. entire MMUs can be direct,
but individual shadow pages can also be direct (no corresponding guest PTE to
translate).

For this flavor of PMU, it's not full passthrough or direct.  Some assets are
passed through, e.g. PMCs, but others are not.  

> In some (rare) business cases, I think maybe we could fully pass through
> the control plane as well. For instance, sole-tenant machine, or
> full-machine VM + full offload. If there is a CPU erratum, KVM
> can force a vmexit and dynamically intercept the selectors on all vcpus
> with filters checked. It is not supported in the current RFC, but maybe
> doable in later versions.

Heh, that's an argument for using something other than "passthrough", because if
we ever do support such a use case, we'd end up with enable_fully_passthrough_pmu,
or in the spirit of KVM shortlogs, really_passthrough_pmu :-)

Though I think even then I would vote for "enable_dedicated_pmu", or something
along those lines, purely to avoid overloading "passthrough", i.e. to try to use
passthrough strictly when talking about assets, not features.  And because unless
we can also passthrough LVTPC, it still wouldn't be a complete passthrough of the
PMU as KVM would be emulating PMIs.
Mingwei Zhang April 19, 2024, 10:02 p.m. UTC | #12
> Currently, at a feature level, I mentally bin things into two rough categories
> in KVM:
>
>  1. Virtualized - Guest state is loaded into hardware, or hardware supports
>                   running with both host and guest state (e.g. TSC scaling), and
>                   the guest has full read/write access to its state while running.
>
>  2. Emulated    - Guest state is never loaded into hardware, and instead the
>                   feature/state is emulated in software.
>
> There is no "Passthrough" because that's (mostly) covered by my Virtualized
> definition.   And because I also think of passthrough as being about *assets*,
> not about the features themselves.

Sure. In fact, "virtualized" works for me as well. My mind is aligned with this.

>
> They are far from perfect definitions, e.g. individual assets can be passed through,
> virtualized by hardware, or emulated in software.  But for the most part, I think
> classifying features as virtualized vs. emulated works well, as it helps reason
> about the expected behavior and performance of a feature.
>
> E.g. for some virtualized features, certain assets may need to be explicitly passed
> through, e.g. access to x2APIC MSRs for APICv.  But APICv itself still falls
> into the virtualized category, e.g. the "real" APIC state isn't passed through
> to the guest.
>
> If KVM didn't already have a PMU implementation to deal with, this wouldn't be
> an issue, e.g. we'd just add "enable_pmu" and I'd mentally bin it into the
> virtualized category.  But we need to distinguish between the two PMU models,
> and using "enable_virtualized_pmu" would be comically confusing for users. :-)
>
> And because this is user visible, I would like to come up with a name that (some)
> KVM users will already be familiar with, i.e. will have some chance of intuitively
> understanding it without having to go read the docs.
>
> Which is why I proposed "mediated"; what we are proposing for the PMU is similar
> to the "mediated device" concepts in VFIO.  And I also think "mediated" is a good
> fit in general, e.g. this becomes my third classification:
>
>  3. Mediated    - Guest state is context switched at VM-Enter/VM-Exit, i.e. is loaded
>                   into hardware, but the guest does NOT have full read/write access
>                   to the feature.
>
> But my main motivation for using "mediated" really is that I hope that it will
> help KVM users grok the basic gist of the design without having to read and
> understand KVM documentation, because there is already existing terminology in
> the broader KVM space.

Understood. "Mediated" captures the fact that KVM sits in between,
but I feel we can find a better name :)
>
> > We intercept the control plane in the current design, but the only thing
> > we do is the event filtering. No fancy code change to emulate the control
> > registers. So, it is still a passthrough logic.
>
> It's not though.  Passthrough very specifically means the guest has unfettered
> access to some asset, and/or KVM does no filtering/adjustments whatsoever.
>
> "Direct" is similar, e.g. KVM uses "direct" in MMU context to refer to addresses
> that don't require KVM to intervene and translate.  E.g. entire MMUs can be direct,
> but individual shadow pages can also be direct (no corresponding guest PTE to
> translate).

Oh, isn't "direct" a perfect word for this? Look, our new design does
not require KVM to translate the event encodings into perf events and
back into encodings (in the perf subsystem) before they reach hardware.
It is really "direct" in this sense, no?

Neither does KVM do any translation of the event encodings across
micro-architectures. So, it is really _direct_ from this perspective
as well.

On the other hand, "direct" suggests straightforward passthrough, which
is not always the case here, since KVM retains the power of control.

>
> For this flavor of PMU, it's not full passthrough or direct.  Some assets are
> passed through, e.g. PMCs, but others are not.
>
> > In some (rare) business cases, I think maybe we could fully pass through
> > the control plane as well. For instance, sole-tenant machine, or
> > full-machine VM + full offload. If there is a CPU erratum, KVM
> > can force a vmexit and dynamically intercept the selectors on all vcpus
> > with filters checked. It is not supported in the current RFC, but maybe
> > doable in later versions.
>
> Heh, that's an argument for using something other than "passthrough", because if
> we ever do support such a use case, we'd end up with enable_fully_passthrough_pmu,
> or in the spirit of KVM shortlogs, really_passthrough_pmu :-)

Full passthrough is possible, and names like "really_passthrough" could
all live under the "direct PMU" umbrella.

>
> Though I think even then I would vote for "enable_dedicated_pmu", or something
> along those lines, purely to avoid overloading "passthrough", i.e. to try to use
> passthrough strictly when talking about assets, not features.  And because unless
> we can also passthrough LVTPC, it still wouldn't be a complete passthrough of the
> PMU as KVM would be emulating PMIs.

I agree to avoid "passthrough". Dedicated is also a fine word. It
indicates the PMU is dedicated to serving the KVM guests. But the scope
might be a little narrow. This is just my opinion. Maybe it is because
my mind has been stuck with "direct" :)

Thanks.

-Mingwei