[v4,07/13] perf/x86: Add constraint for guest perf metrics event

Message ID 20230927033124.1226509-8-dapeng1.mi@linux.intel.com (mailing list archive)
State New, archived
Series Enable fixed counter 3 and topdown perf metrics for vPMU

Commit Message

Mi, Dapeng Sept. 27, 2023, 3:31 a.m. UTC
When guest wants to use PERF_METRICS MSR, a virtual metrics event needs
to be created in the perf subsystem so that the guest can have exclusive
ownership of the PERF_METRICS MSR.

We introduce a new vmetrics constraint so that this virtual metrics event
can be coupled with the slots event as an event group and take part in
host perf scheduling. Since guest metric events are always recognized as
the vCPU process's events on the host, they are time-sharing multiplexed
with other host metric events; we therefore choose bit 48
(INTEL_PMC_IDX_METRIC_BASE) as the index of this virtual metrics event.

Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 28 +++++++++++++++++++++-------
 arch/x86/events/perf_event.h      |  1 +
 arch/x86/include/asm/perf_event.h | 15 +++++++++++++++
 3 files changed, 37 insertions(+), 7 deletions(-)
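
A rough sketch of the shape of such a constraint, built from macros that
already exist in arch/x86/events/perf_event.h; the names and event encoding
below (intel_vmetrics_constraint, INTEL_FIXED_VMETRICS_EVENT, 0x0400) are
illustrative placeholders and may not match the actual diff:

  /*
   * Illustrative only: constrain a virtual metrics event to bit 48,
   * i.e. INTEL_PMC_IDX_METRIC_BASE, so it can only be scheduled there.
   */
  #define INTEL_FIXED_VMETRICS_EVENT      0x0400  /* hypothetical encoding */

  static struct event_constraint intel_vmetrics_constraint =
          EVENT_CONSTRAINT(INTEL_FIXED_VMETRICS_EVENT,
                           1ULL << INTEL_PMC_IDX_METRIC_BASE,
                           INTEL_ARCH_EVENT_MASK);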

Comments

Peter Zijlstra Sept. 27, 2023, 11:33 a.m. UTC | #1
On Wed, Sep 27, 2023 at 11:31:18AM +0800, Dapeng Mi wrote:
> When guest wants to use PERF_METRICS MSR, a virtual metrics event needs
> to be created in the perf subsystem so that the guest can have exclusive
> ownership of the PERF_METRICS MSR.

Urgh, can someone please remind me how all that is supposed to work
again? The guest is just a task that wants the event. If the
host creates a CPU event, then that gets scheduled with higher priority
and the task loses out, no joy.

So you cannot guarantee the guest gets anything.

That is, I remember we've had this exact problem before, but I keep
forgetting how this all is supposed to work. I don't use this virt stuff
(and every time I try qemu arguments defeat me and I give up in
disgust).
Sean Christopherson Sept. 27, 2023, 5:27 p.m. UTC | #2
+Jim, David, and Mingwei

On Wed, Sep 27, 2023, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 11:31:18AM +0800, Dapeng Mi wrote:
> > When guest wants to use PERF_METRICS MSR, a virtual metrics event needs
> > to be created in the perf subsystem so that the guest can have exclusive
> > ownership of the PERF_METRICS MSR.
> 
> Urgh, can someone please remind me how all that is supposed to work
> again? The guest is just a task that wants the event. If the
> host creates a CPU event, then that gets scheduled with higher priority
> and the task loses out, no joy.
> 
> So you cannot guarantee the guest gets anything.
> 
> That is, I remember we've had this exact problem before, but I keep
> forgetting how this all is supposed to work. I don't use this virt stuff
> (and every time I try qemu arguments defeat me and I give up in
> disgust).

I don't think it does work, at least not without a very, very carefully crafted
setup and a host userspace that knows it must not use certain aspects of perf.
E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
in hardware, KVM+perf simply disables the counter.

And for top-down slots, getting anything remotely accurate requires pinning vCPUs
1:1 with pCPUs and enumerating an accurate topology to the guest:

  The count is distributed among unhalted logical processors (hyper-threads) who
  share the same physical core, in processors that support Intel Hyper-Threading
  Technology.

Jumping the gun a bit (we're in the *super* early stages of scraping together a
rough PoC), but I think we should effectively put KVM's current vPMU support into
maintenance-only mode, i.e. stop adding new features unless they are *very* simple
to enable, and instead pursue an implementation that (a) lets userspace (and/or
the kernel builder) completely disable host perf (or possibly just host perf usage
of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
has been turned off in the host.

I.e. keep KVM's existing best-effort vPMU support, e.g. for setups where the
platform owner is also the VM user (running a Windows VM on a Linux box, hosting
a Linux VM in ChromeOS, etc...).  But for anything advanced and for hard guarantees,
e.g. cloud providers that want to expose fully featured vPMU to customers, force
the platform owner to choose between using perf (or again, perf with hardware PMU)
in the host, and exposing the hardware PMU to the guest.

Hardware vendors are pushing us in the direction whether we like it or not, e.g.
SNP and TDX want to disallow profiling the guest from the host, ARM has an
upcoming PMU model where (IIUC) it can't be virtualized without a passthrough
approach, Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,
and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
requires an absurd amount of complexity throughout the kernel and userspace.

Note, a similar idea was floated and rejected in the past[*], but that failed
proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
which I agree would create an awful ABI for the host.  If we make the "knob" a
Kconfig or kernel param, i.e. require the platform owner to opt-out of using perf
no later than at boot time, then I think we can provide a sane ABI, keep the
implementation simple, all without breaking existing users that utilize perf in
the host to profile guests.

[*] https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5Q093K3rMUeAsF3vEL39MGV5Br9wEQQ@mail.gmail.com
Mi, Dapeng Sept. 28, 2023, 9:24 a.m. UTC | #3
On 9/28/2023 1:27 AM, Sean Christopherson wrote:
> +Jim, David, and Mingwei
>
> On Wed, Sep 27, 2023, Peter Zijlstra wrote:
>> On Wed, Sep 27, 2023 at 11:31:18AM +0800, Dapeng Mi wrote:
>>> When guest wants to use PERF_METRICS MSR, a virtual metrics event needs
>>> to be created in the perf subsystem so that the guest can have exclusive
>>> ownership of the PERF_METRICS MSR.
>> Urgh, can someone please remind me how all that is supposed to work
>> again? The guest is just a task that wants the event. If the
>> host creates a CPU event, then that gets scheduled with higher priority
>> and the task loses out, no joy.


It looks like I used inaccurate wording in the comments. Yes, it's not
*exclusive* from the host's point of view.  Currently the perf events created
by KVM are task-pinned events, and they can indeed be preempted
by CPU-pinned host events, which have higher priority. This is a long
standing issue for the vPMU. We have had some internal discussion
about it, but it seems there is no good way to solve it
thoroughly within the current vPMU framework.

But if there are no such CPU-pinned events (which have the highest
priority) on the host, KVM perf events can share the HW resources with
other host events via time-multiplexing.

>> So you cannot guarantee the guest gets anything.
>>
>> That is, I remember we've had this exact problem before, but I keep
>> forgetting how this all is supposed to work. I don't use this virt stuff
>> (and every time I try qemu arguments defeat me and I give up in
>> disgust).
> I don't think it does work, at least not without a very, very carefully crafted
> setup and a host userspace that knows it must not use certain aspects of perf.
> E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
> in hardware, KVM+perf simply disables the counter.
>
> And for top-down slots, getting anything remotely accurate requires pinning vCPUs
> 1:1 with pCPUs and enumerating an accurate topology to the guest:
>
>    The count is distributed among unhalted logical processors (hyper-threads) who
>    share the same physical core, in processors that support Intel Hyper-Threading
>    Technology.
>
> Jumping the gun a bit (we're in the *super* early stages of scraping together a
> rough PoC), but I think we should effectively put KVM's current vPMU support into
> maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> to enable, and instead pursue an implementation that (a) lets userspace (and/or
> the kernel builder) completely disable host perf (or possibly just host perf usage
> of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> has been turned off in the host.
>
> I.e. keep KVM's existing best-effort vPMU support, e.g. for setups where the
> platform owner is also the VM user (running a Windows VM on a Linux box, hosting
> a Linux VM in ChromeOS, etc...).  But for anything advanced and for hard guarantees,
> e.g. cloud providers that want to expose fully featured vPMU to customers, force
> the platform owner to choose between using perf (or again, perf with hardware PMU)
> in the host, and exposing the hardware PMU to the guest.
>
> Hardware vendors are pushing us in the direction whether we like it or not, e.g.
> SNP and TDX want to disallow profiling the guest from the host, ARM has an
> upcoming PMU model where (IIUC) it can't be virtualized without a passthrough
> approach, Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,
> and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
> requires an absurd amount of complexity throughout the kernel and userspace.
>
> Note, a similar idea was floated and rejected in the past[*], but that failed
> proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> which I agree would create an awful ABI for the host.  If we make the "knob" a
> Kconfig or kernel param, i.e. require the platform owner to opt-out of using perf
> no later than at boot time, then I think we can provide a sane ABI, keep the
> implementation simple, all without breaking existing users that utilize perf in
> the host to profile guests.
>
> [*] https://lore.kernel.org/all/CALMp9eRBOmwz=mspp0m5Q093K3rMUeAsF3vEL39MGV5Br9wEQQ@mail.gmail.com
Peter Zijlstra Sept. 29, 2023, 11:53 a.m. UTC | #4
On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:

> I don't think it does work, at least not without a very, very carefully crafted
> setup and a host userspace that knows it must not use certain aspects of perf.
> E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
> in hardware, KVM+perf simply disables the counter.

I have distinct memories of there being patches to rewrite the PEBS
buffer, but I really can't remember what we ended up doing. Like I said,
I can't operate KVM in any meaningful way -- it's a monster :-(

> And for top-down slots, getting anything remotely accurate requires pinning vCPUs
> 1:1 with pCPUs and enumerating an accurate topology to the guest:
> 
>   The count is distributed among unhalted logical processors (hyper-threads) who
>   share the same physical core, in processors that support Intel Hyper-Threading
>   Technology.

So IIRC slots is per logical CPU, it counts the actual pipeline stages
going towards that logical CPU, this is required to make it work on SMT
at all -- even for native.

But it's been a long while since that was explained -- and because it
was a call, I can't very well read it back, god how I hate calls :-(

> Jumping the gun a bit (we're in the *super* early stages of scraping together a
> rough PoC), but I think we should effectively put KVM's current vPMU support into
> maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> to enable, and instead pursue an implementation that (a) lets userspace (and/or
> the kernel builder) completely disable host perf (or possibly just host perf usage
> of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> has been turned off in the host.

I don't think you need to go that far, host can use PMU just fine as
long as it doesn't overlap with a vCPU. Basically, if you force
perf_attr::exclude_guest on everything your vCPU can haz the full thing.
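
For reference, a minimal userspace sketch of such a host-only event, using
nothing beyond the standard perf_event_open() interface (error handling
omitted):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <string.h>
  #include <unistd.h>

  /* Count cycles for the calling task, but never while a guest is running. */
  static int open_host_only_cycles(void)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.exclude_guest = 1;         /* the attribute being discussed */

          /* pid = 0 (current task), cpu = -1 (any CPU), no group, no flags */
          return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }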

> Hardware vendors are pushing us in the direction whether we like it or not, e.g.
> SNP and TDX want to disallow profiling the guest from the host, 

Yeah, sekjoerity model etc.. bah.

> ARM has an upcoming PMU model where (IIUC) it can't be virtualized
> without a passthrough approach,

:-(

> Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,

Anybody's hybrid things are a clusterfuck, hybrid vs virt doesn't work
sanely on ARM either AFAIU.

I intensely dislike hybrid (and virt ofc), but alas we get to live with
that mess :/ And it's only going to get worse I fear..

At least (for now) AMD hybrid is committed to identical ISA, including
PMUs with their Zen4+Zen4c things. We'll have to wait and see how
that'll end up.

> and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
> requires an absurd amount of complexity throughout the kernel and userspace.

I'm not sure about top-down, the other two, for sure.

My main beef with top-down is the ludicrously bad hardware interface we
have on big cores, I like the atom interface a *ton* better.

> Note, a similar idea was floated and rejected in the past[*], but that failed
> proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> which I agree would create an awful ABI for the host.  If we make the "knob" a
> Kconfig 

Must not be Kconfig, distros would have no sane choice.

> or kernel param, i.e. require the platform owner to opt-out of using perf
> no later than at boot time, then I think we can provide a sane ABI, keep the
> implementation simple, all without breaking existing users that utilize perf in
> the host to profile guests.

It's a shit choice to have to make. At the same time I'm not sure I have
a better proposal.

It does mean a host cannot profile one guest and have pass-through on the
other. Eg. have a development and production guest on the same box. This
is pretty crap.

Making it a guest-boot-option would allow that, but then the host gets
complicated again. I think I can make it trivially work for per-task
events, simply error the creation of events without exclude_guest for
affected vCPU tasks. But the CPU events are tricky.


I will firmly reject anything that takes the PMU away from the host
entirely though.

Also, NMI watchdog needs a solution.. Ideally hardware grows a second
per-CPU timer we can program to NMI.
Ravi Bangoria Sept. 29, 2023, 3:20 p.m. UTC | #5
On 29-Sep-23 5:23 PM, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
> 
>> I don't think it does work, at least not without a very, very carefully crafted
>> setup and a host userspace that knows it must not use certain aspects of perf.
>> E.g. for PEBS, if the guest virtual counters don't map 1:1 to the "real" counters
>> in hardware, KVM+perf simply disables the counter.
> 
> I have distinct memories of there being patches to rewrite the PEBS
> buffer, but I really can't remember what we ended up doing. Like I said,
> I can't operate KVM in any meaningful way -- it's a monster :-(
> 
>> And for top-down slots, getting anything remotely accurate requires pinning vCPUs
>> 1:1 with pCPUs and enumerating an accurate toplogy to the guest:
>>
>>   The count is distributed among unhalted logical processors (hyper-threads) who
>>   share the same physical core, in processors that support Intel Hyper-Threading
>>   Technology.
> 
> So IIRC slots is per logical CPU, it counts the actual pipeline stages
> going towards that logical CPU, this is required to make it work on SMT
> at all -- even for native.
> 
> But it's been a long while since that was explained -- and because it
> was a call, I can't very well read it back, god how I hate calls :-(
> 
>> Jumping the gun a bit (we're in the *super* early stages of scraping together a
>> rough PoC), but I think we should effectively put KVM's current vPMU support into
>> maintenance-only mode, i.e. stop adding new features unless they are *very* simple
>> to enable, and instead pursue an implementation that (a) lets userspace (and/or
>> the kernel builder) completely disable host perf (or possibly just host perf usage
>> of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
>> has been turned off in the host.
> 
> I don't think you need to go that far, host can use PMU just fine as
> long as it doesn't overlap with a vCPU. Basically, if you force
> perf_attr::exclude_guest on everything your vCPU can haz the full thing.
> 
>> Hardware vendors are pushing us in the direction whether we like it or not, e.g.
>> SNP and TDX want to disallow profiling the guest from the host, 
> 
> Yeah, sekjoerity model etc.. bah.
> 
>> ARM has an upcoming PMU model where (IIUC) it can't be virtualized
>> without a passthrough approach,
> 
> :-(
> 
>> Intel's hybrid CPUs are a complete trainwreck unless vCPUs are pinned,
> 
> Anybody's hybrid things are a clusterfuck, hybrid vs virt doesn't work
> sanely on ARM either AFAIU.
> 
> I intensely dislike hybrid (and virt ofc), but alas we get to live with
> that mess :/ And it's only going to get worse I fear..
> 
> At least (for now) AMD hybrid is committed to identical ISA, including
> PMUs with their Zen4+Zen4c things. We'll have to wait and see how
> that'll end up.
> 
>> and virtualizing things like top-down slots, PEBS, and LBRs in the shared model
>> requires an absurd amount of complexity throughout the kernel and userspace.
> 
> I'm not sure about top-down, the other two, for sure.
> 
> My main beef with top-down is the ludicrously bad hardware interface we
> have on big cores, I like the atom interface a *ton* better.
> 
>> Note, a similar idea was floated and rejected in the past[*], but that failed
>> proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
>> which I agree would create an awful ABI for the host.  If we make the "knob" a
>> Kconfig 
> 
> Must not be Kconfig, distros would have no sane choice.
> 
>> or kernel param, i.e. require the platform owner to opt-out of using perf
>> no later than at boot time, then I think we can provide a sane ABI, keep the
>> implementation simple, all without breaking existing users that utilize perf in
>> the host to profile guests.
> 
> It's a shit choice to have to make. At the same time I'm not sure I have
> a better proposal.

How about keying off a PMU-specific KVM module parameter? Something
like what Manali has proposed for AMD VIBS? Please see solution 1.1:

https://lore.kernel.org/r/3a6c693e-1ef4-6542-bc90-d4468773b97d@amd.com

> It does mean a host cannot profile one guest and have pass-through on the
> other. Eg. have a development and production guest on the same box. This
> is pretty crap.
> 
> Making it a guest-boot-option would allow that, but then the host gets
> complicated again. I think I can make it trivially work for per-task
> events, simply error the creation of events without exclude_guest for
> affected vCPU tasks. But the CPU events are tricky.
> 
> 
> I will firmly reject anything that takes the PMU away from the host
> entirely though.
> 
> Also, NMI watchdog needs a solution.. Ideally hardware grows a second
> per-CPU timer we can program to NMI.

Thanks,
Ravi
Sean Christopherson Sept. 29, 2023, 3:46 p.m. UTC | #6
On Fri, Sep 29, 2023, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
> > Jumping the gun a bit (we're in the *super* early stages of scraping together a
> > rough PoC), but I think we should effectively put KVM's current vPMU support into
> > maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> > to enable, and instead pursue an implementation that (a) lets userspace (and/or
> > the kernel builder) completely disable host perf (or possibly just host perf usage
> > of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> > has been turned off in the host.
> 
> I don't think you need to go that far, host can use PMU just fine as
> long as it doesn't overlap with a vCPU. Basically, if you force
> perf_attr::exclude_guest on everything your vCPU can haz the full thing.

Complexity aside, my understanding is that the overhead of trapping and emulating
all of the guest counter and MSR accesses results in unacceptably degraded functionality
for the guest.  And we haven't even gotten to things like arch LBRs where context
switching MSRs between the guest and host is going to be quite costly.

> > Note, a similar idea was floated and rejected in the past[*], but that failed
> > proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> > which I agree would create an awful ABI for the host.  If we make the "knob" a
> > Kconfig 
> 
> Must not be Kconfig, distros would have no sane choice.

Or not only a Kconfig?  E.g. similar to how the kernel has
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS and nopku.

> > or kernel param, i.e. require the platform owner to opt-out of using perf
> > no later than at boot time, then I think we can provide a sane ABI, keep the
> > implementation simple, all without breaking existing users that utilize perf in
> > the host to profile guests.
> 
> It's a shit choice to have to make. At the same time I'm not sure I have
> a better proposal.
> 
> It does mean a host cannot profile one guest and have pass-through on the
> other. Eg. have a development and production guest on the same box. This
> is pretty crap.
> 
> Making it a guest-boot-option would allow that, but then the host gets
> complicated again. I think I can make it trivially work for per-task
> events, simply error the creation of events without exclude_guest for
> affected vCPU tasks. But the CPU events are tricky.
> 
> 
> I will firmly reject anything that takes the PMU away from the host
> entirely though.

Why?  What is so wrong with supporting use cases where the platform owner *wants*
to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
were complex, highly invasive, and/or difficult to maintain, then I can understand
the pushback.  

But if we simply allow hiding hardware PMU support, then isn't the cost to perf
just a few lines in init_hw_perf_events()?  And if we put a stake in the ground
and say that exposing "advanced" PMU features to KVM guests requires a passthrough
PMU, i.e. the PMU to be hidden from the host, that will significantly reduce our
maintenance and complexity.
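
A minimal sketch of those few lines, assuming a made-up "nohostpmu" boot
parameter (the real knob name and placement are exactly what is being
debated here):

  static bool nohostpmu __initdata;       /* hypothetical boot parameter */

  static int __init nohostpmu_setup(char *str)
  {
          nohostpmu = true;
          return 1;
  }
  __setup("nohostpmu", nohostpmu_setup);

  /* ... and at the top of init_hw_perf_events(): */
          if (nohostpmu) {
                  pr_info("host PMU disabled, hardware counters left to guests\n");
                  return 0;       /* hardware PMU never registered with perf */
          }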

The kernel allows disabling almost literally every other feature that is even
remotely optional, I don't understand why the hardware PMU is special.
Jim Mattson Sept. 30, 2023, 3:29 a.m. UTC | #7
On Fri, Sep 29, 2023 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Sep 29, 2023, Peter Zijlstra wrote:
> > On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
> > > Jumping the gun a bit (we're in the *super* early stages of scraping together a
> > > rough PoC), but I think we should effectively put KVM's current vPMU support into
> > > maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> > > to enable, and instead pursue an implementation that (a) lets userspace (and/or
> > > the kernel builder) completely disable host perf (or possibly just host perf usage
> > > of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> > > has been turned off in the host.
> >
> > I don't think you need to go that far, host can use PMU just fine as
> > long as it doesn't overlap with a vCPU. Basically, if you force
> > perf_attr::exclude_guest on everything your vCPU can haz the full thing.
>
> Complexity aside, my understanding is that the overhead of trapping and emulating
> all of the guest counter and MSR accesses results in unacceptably degraded functionality
> for the guest.  And we haven't even gotten to things like arch LBRs where context
> switching MSRs between the guest and host is going to be quite costly.

Trapping and emulating all of the PMU MSR accesses is ludicrously
slow, especially when the guest is multiplexing events.

Also, the current scheme of implicitly tying together usage mode and
priority means that KVM's "task pinned" perf_events always lose to
someone else's "CPU pinned" perf_events. Even if those "CPU pinned"
perf events are tagged "exclude_guest," the counters they occupy are
not available for KVM's "exclude_host" events, because host perf won't
multiplex a counter between an "exclude_host" event and an
"exclude_guest" event, even though the two events don't overlap.
Frankly, we wouldn't want it to, because that would introduce
egregious overheads at VM-entry and VM-exit. What we would need would
be a mechanism for allocating KVM's "task pinned" perf_events at the
highest priority, so they always win.

For things to work well in the "vPMU as a client of host perf" world,
we need to have the following at a minimum:
1) Guaranteed identity mapping of guest PMCs to host PMCs, so that we
don't have to intercept accesses to IA32_PERF_GLOBAL_CTRL.
2) Exclusive ownership of the PMU MSRs while in the KVM_RUN loop, so
that we don't have to switch any PMU MSRs on VM-entry/VM-exit (with
the exception of IA32_PERF_GLOBAL_CTRL, which has guest and host
fields in the VMCS).

There are other issues with the current implementation, like the
ridiculous overhead of bumping a counter in software to account for an
emulated instruction. That should just be a RDMSR, an increment, a
WRMSR, and the conditional synthesis of a guest PMI on overflow.
Instead, we have to pause a perf_event and reprogram it before
continuing. Putting a high-level abstraction between the guest PMU and
the host PMU does not yield the most efficient implementation.
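
For illustration, roughly the flow described above under a passthrough
model, with the guest counter resident in hardware; kvm_pmc_msr() is a
placeholder helper, not an existing KVM function:

  static void kvm_passthrough_count_emulated_insn(struct kvm_pmc *pmc)
  {
          u32 msr = kvm_pmc_msr(pmc);             /* placeholder helper */
          u64 val;

          rdmsrl(msr, val);                       /* RDMSR     */
          val = (val + 1) & pmc_bitmask(pmc);     /* increment */
          wrmsrl(msr, val);                       /* WRMSR     */

          if (!val)                               /* wrapped to 0: overflow */
                  kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
  }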

> > > Note, a similar idea was floated and rejected in the past[*], but that failed
> > > proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> > > which I agree would create an awful ABI for the host.  If we make the "knob" a
> > > Kconfig
> >
> > Must not be Kconfig, distros would have no sane choice.
>
> Or not only a Kconfig?  E.g. similar to how the kernel has
> CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS and nopku.
>
> > > or kernel param, i.e. require the platform owner to opt-out of using perf
> > > no later than at boot time, then I think we can provide a sane ABI, keep the
> > > implementation simple, all without breaking existing users that utilize perf in
> > > the host to profile guests.
> >
> > It's a shit choice to have to make. At the same time I'm not sure I have
> > a better proposal.
> >
> > It does mean a host cannot profile one guest and have pass-through on the
> > other. Eg. have a development and production guest on the same box. This
> > is pretty crap.
> >
> > Making it a guest-boot-option would allow that, but then the host gets
> > complicated again. I think I can make it trivially work for per-task
> > events, simply error the creation of events without exclude_guest for
> > affected vCPU tasks. But the CPU events are tricky.
> >
> >
> > I will firmly reject anything that takes the PMU away from the host
> > > entirely though.
>
> Why?  What is so wrong with supporting use cases where the platform owner *wants*
> to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
> were complex, highly invasive, and/or difficult to maintain, then I can understand
> the pushback.
>
> But if we simply allow hiding hardware PMU support, then isn't the cost to perf
> just a few lines in init_hw_perf_events()?  And if we put a stake in the ground
> and say that exposing "advanced" PMU features to KVM guests requires a passthrough
> PMU, i.e. the PMU to be hidden from the host, that will significantly reduce our
> maintenance and complexity.
>
> The kernel allows disabling almost literally every other feature that is even
> remotely optional, I don't understand why the hardware PMU is special.
Namhyung Kim Oct. 1, 2023, 12:31 a.m. UTC | #8
Hello,

On Fri, Sep 29, 2023 at 8:29 PM Jim Mattson <jmattson@google.com> wrote:
> For things to work well in the "vPMU as a client of host perf" world,
> we need to have the following at a minimum:
> 1) Guaranteed identity mapping of guest PMCs to host PMCs, so that we
> don't have to intercept accesses to IA32_PERF_GLOBAL_CTRL.
> 2) Exclusive ownership of the PMU MSRs while in the KVM_RUN loop, so
> that we don't have to switch any PMU MSRs on VM-entry/VM-exit (with
> the exception of IA32_PERF_GLOBAL_CTRL, which has guest and host
> fields in the VMCS).

IIUC that means multiplexing should be handled in the guest.
Not sure it can support mixing host and guest events at the
same time.

Thanks,
Namhyung
Peter Zijlstra Oct. 2, 2023, 11:57 a.m. UTC | #9
On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:

> > I will firmly reject anything that takes the PMU away from the host
> > entirely though.
> 
> Why?  What is so wrong with supporting use cases where the platform owner *wants*
> to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
> were complex, highly invasive, and/or difficult to maintain, then I can understand
> the pushback.  

Because it sucks.

You're forcing people to choose between no host PMU or a slow guest PMU.
And that's simply not a sane choice for most people -- worse it's not a
choice based in technical reality.

It's a choice out of laziness, disabling host PMU is not a requirement
for pass-through. 

Like I wrote, all we need to do is ensure vCPU tasks will never have a
perf-event scheduled that covers guest mode. Currently this would be
achievable by having event creation for both:

 - CPU events without attr::exclude_guest=1, and
 - task events for vCPU task of interest without attr::exclude_guest=1

error with -EBUSY or something.

This ensures there are no events active for those vCPU tasks at VMENTER
time and you can haz pass-through.
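
A sketch of where such a check could sit in the perf_event_open() path;
perf_has_passthrough_vcpus() and task_is_passthrough_vcpu() are made-up
placeholders for whatever bookkeeping perf/KVM would actually grow:

  static int check_exclude_guest(struct perf_event_attr *attr,
                                 struct task_struct *task, int cpu)
  {
          if (attr->exclude_guest)
                  return 0;

          /* CPU events that could end up counting in guest mode ... */
          if (!task && cpu >= 0 && perf_has_passthrough_vcpus())
                  return -EBUSY;

          /* ... and task events attached to a pass-through vCPU task. */
          if (task && task_is_passthrough_vcpu(task))
                  return -EBUSY;

          return 0;
  }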
Peter Zijlstra Oct. 2, 2023, 12:29 p.m. UTC | #10
On Fri, Sep 29, 2023 at 08:50:07PM +0530, Ravi Bangoria wrote:
> On 29-Sep-23 5:23 PM, Peter Zijlstra wrote:

> > I don't think you need to go that far, host can use PMU just fine as
> > long as it doesn't overlap with a vCPU. Basically, if you force
> > perf_attr::exclude_guest on everything your vCPU can haz the full thing.

^ this..

> How about keying off a PMU-specific KVM module parameter? Something
> like what Manali has proposed for AMD VIBS? Please see solution 1.1:
> 
> https://lore.kernel.org/r/3a6c693e-1ef4-6542-bc90-d4468773b97d@amd.com

So I hadn't read that, but isn't that more or less what I proposed
above?
Ingo Molnar Oct. 2, 2023, 1:30 p.m. UTC | #11
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
> 
> > > I will firmly reject anything that takes the PMU away from the host
> > > entirely though.
> > 
> > Why?  What is so wrong with supporting use cases where the platform owner *wants*
> > to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
> > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > the pushback.  
> 
> Because it sucks.
> 
> You're forcing people to choose between no host PMU or a slow guest PMU.
> And that's simply not a sane choice for most people -- worse it's not a
> choice based in technical reality.
> 
> It's a choice out of laziness, disabling host PMU is not a requirement
> for pass-through. 

Not just a choice of laziness, but it will clearly be forced upon users
by external entities:

   "Pass ownership of the PMU to the guest and have no host PMU, or you
    won't have sane guest PMU support at all. If you disagree, please open
    a support ticket, which we'll ignore."

The host OS shouldn't offer facilities that severely limit its own capabilities,
when there's a better solution. We don't give the FPU to apps exclusively either,
it would be insanely stupid for a platform to do that.

Thanks,

	Ingo
David Dunn Oct. 2, 2023, 3:23 p.m. UTC | #12
On Mon, Oct 2, 2023 at 6:30 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> The host OS shouldn't offer facilities that severely limit its own capabilities,
> when there's a better solution. We don't give the FPU to apps exclusively either,
> it would be insanely stupid for a platform to do that.
>

If you think of the guest VM as a usermode application (which it
effectively is), the analogous situation is that there is no way to
tell the usermode application which portions of the FPU state might be
used by the kernel without context switching.  Although the kernel can
and does use FPU state, it doesn't zero out a portion of that state
whenever the kernel needs to use the FPU.

Today there is no way for a guest to dynamically adjust which PMU
state is valid or invalid.  And this changes based on usage by other
commands run on the host.  As observed by perf subsystem running in
the guest kernel, this looks like counters that simply zero out and
stop counting at random.

I think the request here is that there be a way for KVM to be able to
tell the guest kernel (running the perf subsystem) that it has a
functional HW PMU.  And for that to be true.  This doesn't mean taking
away the use of the PMU any more than exposing the FPU to usermode
applications means taking away the FPU from the kernel.  But it does
mean that when entering the KVM run loop, the host perf system needs
to context switch away the host PMU state and allow KVM to load the
guest PMU state.  And much like the FPU situation, the portion of the
host kernel that runs between the context switch to the KVM thread and
VMENTER to the guest cannot use the PMU.

This obviously should be a policy set by the host owner.  They are
deliberately giving up the ability to profile that small portion of
the host (KVM VCPU thread cannot be profiled) in return for providing a
full set of perf functionality to the guest kernel.

Dave Dunn
Sean Christopherson Oct. 2, 2023, 3:56 p.m. UTC | #13
On Mon, Oct 02, 2023, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
> > 
> > > > I will firmly reject anything that takes the PMU away from the host
> > > > entirely though.
> > > 
> > > Why?  What is so wrong with supporting use cases where the platform owner *wants*
> > > to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
> > > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > > the pushback.  
> > 
> > Because it sucks.
>
> > You're forcing people to choose between no host PMU or a slow guest PMU.

Nowhere did I say that we wouldn't take patches to improve the existing vPMU
support.  But that's largely a moot point because I don't think it's possible to
improve the current approach to the point where it would provide a performant,
functional guest PMU.

> > And that's simply not a sane choice for most people --

It's better than the status quo, which is that no one gets to choose, everyone
gets a slow guest PMU.

> > worse it's not a choice based in technical reality.

The technical reality is that context switching the PMU between host and guest
requires reading and writing far too many MSRs for KVM to be able to context
switch at every VM-Enter and every VM-Exit.  And PMIs skidding past VM-Exit adds
another layer of complexity to deal with.

> > It's a choice out of laziness, disabling host PMU is not a requirement
> > for pass-through.

The requirement isn't passthrough access, the requirements are that the guest's
PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
guest doesn't have a meaningful impact on guest performance.

> Not just a choice of laziness, but it will clearly be forced upon users
> by external entities:
> 
>    "Pass ownership of the PMU to the guest and have no host PMU, or you
>     won't have sane guest PMU support at all. If you disagree, please open
>     a support ticket, which we'll ignore."

We don't have sane guest PMU support today.  In the 12+ years since commit
f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests"), KVM has
never provided anything remotely close to a sane vPMU.  It *mostly* works if host
perf is quiesced, but that "good enough" approach doesn't suffice for any form of
PMU usage that requires a high level of accuracy and precision.

> The host OS shouldn't offer facilities that severely limit its own capabilities,
> when there's a better solution. We don't give the FPU to apps exclusively either,
> it would be insanely stupid for a platform to do that.

The FPU can be efficiently context switched, guest state remains resident in
hardware so long as the vCPU task is scheduled in (ignoring infrequent FPU usage
from IRQ context), and guest usage of the FPU doesn't require trap-and-emulate
behavior in KVM.

As David said, ceding the hardware PMU for all of kvm_arch_vcpu_ioctl_run()
(modulo the vCPU task being scheduled out) is likely a viable alternative.

 : But it does mean that when entering the KVM run loop, the host perf system 
 : needs to context switch away the host PMU state and allow KVM to load the guest
 : PMU state.  And much like the FPU situation, the portion of the host kernel
 : that runs between the context switch to the KVM thread and VMENTER to the guest
 : cannot use the PMU.

If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
loop, then I'm all for exploring that option.  But that idea got shot down over
a year ago[*].  Or at least, that was my reading of things.  Maybe it was just a
misunderstanding because we didn't do a good job of defining the behavior.

I am completely ok with either approach, but I am not ok with being nak'd on both.
Because unless there's a magical third option lurking, those two options are the
only ways for KVM to provide a vPMU that meets the requirements for slice-of-hardware
use cases.

[*] https://lore.kernel.org/all/YgPCm1WIt9dHuoEo@hirez.programming.kicks-ass.net
Mingwei Zhang Oct. 2, 2023, 7:02 p.m. UTC | #14
On Mon, Oct 2, 2023 at 8:23 AM David Dunn <daviddunn@google.com> wrote:
>
> On Mon, Oct 2, 2023 at 6:30 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > The host OS shouldn't offer facilities that severely limit its own capabilities,
> > when there's a better solution. We don't give the FPU to apps exclusively either,
> > it would be insanely stupid for a platform to do that.
> >
>
> If you think of the guest VM as a usermode application (which it
> effectively is), the analogous situation is that there is no way to
> tell the usermode application which portions of the FPU state might be
> used by the kernel without context switching.  Although the kernel can
> and does use FPU state, it doesn't zero out a portion of that state
> whenever the kernel needs to use the FPU.
>
> Today there is no way for a guest to dynamically adjust which PMU
> state is valid or invalid.  And this changes based on usage by other
> commands run on the host.  As observed by perf subsystem running in
> the guest kernel, this looks like counters that simply zero out and
> stop counting at random.
>
> I think the request here is that there be a way for KVM to be able to
> tell the guest kernel (running the perf subsystem) that it has a
> functional HW PMU.  And for that to be true.  This doesn't mean taking
> away the use of the PMU any more than exposing the FPU to usermode
> applications means taking away the FPU from the kernel.  But it does
> mean that when entering the KVM run loop, the host perf system needs
> to context switch away the host PMU state and allow KVM to load the
> guest PMU state.  And much like the FPU situation, the portion of the
> host kernel that runs between the context switch to the KVM thread and
> VMENTER to the guest cannot use the PMU.
>
> This obviously should be a policy set by the host owner.  They are
> deliberately giving up the ability to profile that small portion of
> the host (KVM VCPU thread cannot be profiled) in return for providing a
> full set of perf functionality to the guest kernel.
>

+1

I was pretty confused until I read this one. Pass-through vPMU for the
guest VM does not conflict with the host PMU software. All we need is
to accept the feasibility that host PMU software (perf subsystem in
Linux) can co-exist with pass-through vPMU in KVM. They could both
work directly on the hardware PMU, operating the registers etc...

To achieve this, I think what we really ask of the perf subsystem in
Linux is two things:
 - A full context switch for the hardware PMU. Currently, the perf subsystem
is the exclusive owner of this piece of hardware, so this code is missing.
 - NMI sharing or NMI control transfer. Either KVM registers its own
NMI handler and gets control transferred to it, or Linux extends the existing
NMI handler to serve two entities in the kernel.

Once the above is achieved, I believe KVM and the perf subsystem in Linux
could harmoniously share the hardware PMU, instead of forcing the former
to be a client of the latter.
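
A minimal sketch of what the first ask could look like for the Intel
general-purpose counters only; a real implementation would also need to
cover the fixed counters, GLOBAL_STATUS, PEBS, LBRs, etc.:

  struct gp_pmu_state {
          u64 global_ctrl;
          u64 eventsel[8];
          u64 counter[8];
  };

  static void gp_pmu_save(struct gp_pmu_state *s, int nr_counters)
  {
          int i;

          rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);
          wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);   /* stop counting first */
          for (i = 0; i < nr_counters; i++) {
                  rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, s->eventsel[i]);
                  rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + i, s->counter[i]);
          }
  }

  static void gp_pmu_restore(const struct gp_pmu_state *s, int nr_counters)
  {
          int i;

          for (i = 0; i < nr_counters; i++) {
                  wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, s->eventsel[i]);
                  wrmsrl(MSR_ARCH_PERFMON_PERFCTR0 + i, s->counter[i]);
          }
          wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);
  }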

To step back a little bit, we are not asking about the feasibility,
since KVM and perf subsystem sharing hardware PMU is a reality because
of TDX/SEV-SNP. So, I think all that is just a draft proposal to make
the sharing clean and efficient.

Thanks.
-Mingwei

> Dave Dunn
Liang, Kan Oct. 2, 2023, 7:50 p.m. UTC | #15
On 2023-10-02 11:56 a.m., Sean Christopherson wrote:
> I am completely ok with either approach, but I am not ok with being nak'd on both.
> Because unless there's a magical third option lurking, those two options are the
> only ways for KVM to provide a vPMU that meets the requirements for slice-of-hardware
> use cases.

It may be doable to introduce a knob to enable/disable the "CPU-pinned"
capability of perf_event. If the capability is disabled, KVM doesn't
need to worry about the counters being taken away without notification,
since "task pinned" events have the highest priority at the moment.

It should also keep the flexibility that sometimes the host wants to
profile the guest.
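
A sketch of such a knob, assuming a sysctl (the name is made up) that
simply refuses to create new CPU-pinned events:

  int sysctl_perf_allow_cpu_pinned __read_mostly = 1;     /* hypothetical knob */

  /* In the perf_event_open() syscall path; illustrative only. */
  static int check_cpu_pinned(struct perf_event_attr *attr,
                              struct task_struct *task, int cpu)
  {
          /* "CPU-pinned": a pinned event that targets a CPU, not a task. */
          if (attr->pinned && !task && cpu >= 0 &&
              !sysctl_perf_allow_cpu_pinned)
                  return -EBUSY;
          return 0;
  }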

Now, the NMI watchdog is using a "CPU-pinned" event. But I think it can
be replaced by the buddy system, commit 1f423c905a6b
("watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs")

Thanks,
Kan
Peter Zijlstra Oct. 2, 2023, 8:40 p.m. UTC | #16
On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
> > > 
> > > > > I will firmly reject anything that takes the PMU away from the host
> > > > > entirely though.
> > > > 
> > > > Why?  What is so wrong with supporting use cases where the platform owner *wants*
> > > > to give up host PMU and NMI watchdog functionality?  If disabling host PMU usage
> > > > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > > > the pushback.  
> > > 
> > > Because it sucks.
> >
> > > You're forcing people to choose between no host PMU or a slow guest PMU.
> 
> Nowhere did I say that we wouldn't take patches to improve the existing vPMU
> support.

Nowhere did I talk about vPMU -- I explicitly mentioned pass-through.

> > > worse it's not a choice based in technical reality.
> 
> The technical reality is that context switching the PMU between host and guest
> requires reading and writing far too many MSRs for KVM to be able to context
> switch at every VM-Enter and every VM-Exit.  And PMIs skidding past VM-Exit adds
> another layer of complexity to deal with.

I'm not sure what you're suggesting here. It will have to save/restore
all those MSRs anyway. Suppose it switches between vCPUs.

> > > It's a choice out of laziness, disabling host PMU is not a requirement
> > > for pass-through.
> 
> The requirement isn't passthrough access, the requirements are that the guest's
> PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> guest doesn't have a meaningful impact on guest performance.

Given you don't think that trapping MSR accesses is viable, what else
besides pass-through did you have in mind?

> > Not just a choice of laziness, but it will clearly be forced upon users
> > by external entities:
> > 
> >    "Pass ownership of the PMU to the guest and have no host PMU, or you
> >     won't have sane guest PMU support at all. If you disagree, please open
> >     a support ticket, which we'll ignore."
> 
> We don't have sane guest PMU support today.

Because KVM is too damn hard to use, rebooting a machine is *sooo* much
easier -- and I'm really not kidding here.

Anyway, you want pass-through, but that doesn't mean host cannot use
PMU when vCPU thread is not running.

> If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
> loop, then I'm all for exploring that option.  But that idea got shot down over
> a year ago[*]. 

I never saw that idea in that thread. You virt people keep talking like
I know how KVM works -- I'm not joking when I say I have no clue about
virt.

Sometimes I get a little clue after y'all keep bashing me over the head,
but it quickly erases itself.

> Or at least, that was my reading of things.  Maybe it was just a
> misunderstanding because we didn't do a good job of defining the behavior.

This might be the case. I don't particularly care where the guest
boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
PMU is usable again etc..

Re-reading parts of that linked thread, I see mention of
PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
remember all that nonsense. Worst part is that I can't find the relevant
perf code when I grep for that string :/


Anyway, what I don't like is KVM silently changing all events to
::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
events to hard error when they run into a vCPU with pass-through on
(PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
::exclude_guest=0 events when a vCPU with pass-through exists -- with
minimal scope (this probably means all CPU events, but only relevant
vCPU events).

It also means ::exclude_guest should actually work -- it often does not
today -- the IBS thing for example totally ignores it.

Anyway, none of this means host cannot use PMU because virt muck wants
it.
Peter Zijlstra Oct. 2, 2023, 8:52 p.m. UTC | #17
On Mon, Oct 02, 2023 at 03:50:24PM -0400, Liang, Kan wrote:

> Now, the NMI watchdog is using a "CPU-pinned" event. But I think it can
> be replaced by the buddy system, commit 1f423c905a6b
> ("watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs")

For some simple cases. I've had plenty of experience with situations where
that thing would be completely useless.

That is, at some point the 'all CPUs hard locked up' scenario was
something I ran into a lot (although I can't for the life of me remember
wtf I was doing at the time). All that needs is a single
spin_lock_irqsave() on a common lock (or group of locks, like the
rq->lock). Before you know it, the whole machine is a brick.

That said; if you augment this thing with a bunch of CPUs that have
HPET-NMI and IPI-NMI for backtraces, it might actually be useful.
Sean Christopherson Oct. 3, 2023, 12:56 a.m. UTC | #18
On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> > > > worse it's not a choice based in technical reality.
> > 
> > The technical reality is that context switching the PMU between host and guest
> > requires reading and writing far too many MSRs for KVM to be able to context
> > switch at every VM-Enter and every VM-Exit.  And PMIs skidding past VM-Exit adds
> > another layer of complexity to deal with.
> 
> I'm not sure what you're suggesting here. It will have to save/restore
> all those MSRs anyway. Suppose it switches between vCPUs.

The "when" is what's important.   If KVM took a literal interpretation of
"exclude guest" for pass-through MSRs, then KVM would context switch all those
MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
<1500 cycles, and "fastpath" exits are something like half that.  Switching all
the MSRs is likely 1000+ cycles, if not double that.

FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
vCPU is pinned 1:1 with a host pCPU.  I suspect it's a similar story for the other
CSPs that are trying to provide accurate PMUs to guests.  If a vCPU is scheduled
out, then yes, a bunch of context switching will need to happen.  But for the
types of VMs that are the target audience, their vCPUs will rarely be scheduled
out.

> > > > It's a choice out of laziness, disabling host PMU is not a requirement
> > > > for pass-through.
> > 
> > The requirement isn't passthrough access, the requirements are that the guest's
> > PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> > guest doesn't have a meaningful impact on guest performance.
> 
> Given you don't think that trapping MSR accesses is viable, what else
> besides pass-through did you have in mind?

Sorry, I didn't mean to imply that we don't want pass-through of MSRs.  What I was
trying to say is that *just* passthrough MSRs doesn't solve the problem, because
again I thought the whole "context switch PMU state less often" approach had been
firmly nak'd.

> > > Not just a choice of laziness, but it will clearly be forced upon users
> > > by external entities:
> > > 
> > >    "Pass ownership of the PMU to the guest and have no host PMU, or you
> > >     won't have sane guest PMU support at all. If you disagree, please open
> > >     a support ticket, which we'll ignore."
> > 
> > We don't have sane guest PMU support today.
> 
> Because KVM is too damn hard to use, rebooting a machine is *sooo* much
> easier -- and I'm really not kidding here.
> 
> Anyway, you want pass-through, but that doesn't mean host cannot use
> PMU when vCPU thread is not running.
> 
> > If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
> > loop, then I'm all for exploring that option.  But that idea got shot down over
> > a year ago[*]. 
> 
> I never saw that idea in that thread. You virt people keep talking like
> I know how KVM works -- I'm not joking when I say I have no clue about
> virt.
> 
> Sometimes I get a little clue after y'all keep bashing me over the head,
> but it quickly erases itself.
>
> > Or at least, that was my reading of things.  Maybe it was just a
> > misunderstanding because we didn't do a good job of defining the behavior.
> 
> This might be the case. I don't particularly care where the guest
> boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> PMU is usable again etc..

Well drat, that there would have saved a wee bit of frustration.  Better late
than never though, that's for sure.

Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
out or KVM exits to userspace, would mean that host perf events won't be active
for potentially large swaths of non-KVM code.  Any function calls or event/exception
handlers that occur within the context of ioctl(KVM_RUN) would run with host
perf events disabled.
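
In rough pseudo-kernel-C, the boundary being described is something like
the below, where perf_guest_pmu_load()/put() stand in for whatever
interface perf would grow:

  /* Illustrative shape of the KVM_RUN-scoped PMU ownership being discussed. */
  static int run_vcpu_with_guest_pmu(struct kvm_vcpu *vcpu)
  {
          int ret;

          perf_guest_pmu_load(vcpu);      /* host perf events quiesced here */

          /*
           * All VM-Exits, interrupts and exception fixups handled inside
           * this loop run with host perf events inactive.
           */
          ret = vcpu_run(vcpu);

          perf_guest_pmu_put(vcpu);       /* host perf events usable again */
          return ret;
  }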

Are you ok with that approach?  Assuming we don't completely botch things, the
interfaces are sane, we can come up with a clean solution for handling NMIs, etc.

> Re-reading parts of that linked thread, I see mention of
> PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
> remember all that nonsense. Worst part is that I can't find the relevant
> perf code when I grep for that string :/

The PT stuff is actually an example of what we don't want, at least not exactly.
The concept of a hard switch between guest and host is ok, but as-is, KVM's PT
code does a big pile of MSR reads and writes on every VM-Enter and VM-Exit.

> Anyway, what I don't like is KVM silently changing all events to
> ::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
> events to hard error when they run into a vCPU with pass-through on
> (PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
> ::exclude_guest=0 events when a vCPU with pass-through exists -- with
> minimal scope (this probably means all CPU events, but only relevant
> vCPU events).

Agreed, I am definitely against KVM silently doing anything.  And the more that
is surfaced to the user, the better.

> It also means ::exclude_guest should actually work -- it often does not
> today -- the IBS thing for example totally ignores it.

Is that already in-tree, or are you talking about Manali's proposed series to
support virtualizing IBS?

> Anyway, none of this means host cannot use PMU because virt muck wants
> it.
Ravi Bangoria Oct. 3, 2023, 6:36 a.m. UTC | #19
On 02-Oct-23 5:59 PM, Peter Zijlstra wrote:
> On Fri, Sep 29, 2023 at 08:50:07PM +0530, Ravi Bangoria wrote:
>> On 29-Sep-23 5:23 PM, Peter Zijlstra wrote:
> 
>>> I don't think you need to go that far, host can use PMU just fine as
>>> long as it doesn't overlap with a vCPU. Basically, if you force
>>> perf_attr::exclude_guest on everything your vCPU can haz the full thing.
> 
> ^ this..
> 
>> How about keying off a PMU-specific KVM module parameter? Something
>> like what Manali has proposed for AMD VIBS? Please see solution 1.1:
>>
>> https://lore.kernel.org/r/3a6c693e-1ef4-6542-bc90-d4468773b97d@amd.com
> 
> So I hadn't read that, but isn't that more or less what I proposed
> above?

Yes, it's pretty much the same as:

   "Like I wrote, all we need to do is ensure vCPU tasks will never have a
   perf-event scheduled that covers guest mode. Currently this would be
   achievable by having event creation for both:

    - CPU events without attr::exclude_guest=1, and
    - task events for vCPU task of interest without attr::exclude_guest=1

   error with -EBUSY or something."

Thanks,
Ravi
Peter Zijlstra Oct. 3, 2023, 8:16 a.m. UTC | #20
On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, Peter Zijlstra wrote:

> > I'm not sure what you're suggesting here. It will have to save/restore
> > all those MSRs anyway. Suppose it switches between vCPUs.
> 
> The "when" is what's important.   If KVM took a literal interpretation of
> "exclude guest" for pass-through MSRs, then KVM would context switch all those
> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
> <1500 cycles, and "fastpath" exits are something like half that.  Switching all
> the MSRs is likely 1000+ cycles, if not double that.

See, you're the virt-nerd and I'm sure you know what you're talking
about, but I have no clue :-) I didn't know there were different levels
of vm-exit.

> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> vCPU is pinned 1:1 with a host pCPU.

I've been given to understand that vm-exit is a bad word in this
scenario, any exit is a fail. They get MWAIT and all the other crap and
more or less pretend to be real hardware.

So why do you care about those MSRs so much? That should 'never' happen
in this scenario.

> > > Or at least, that was my reading of things.  Maybe it was just a
> > > misunderstanding because we didn't do a good job of defining the behavior.
> > 
> > This might be the case. I don't particularly care where the guest
> > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > PMU is usable again etc..
> 
> Well drat, that there would have saved a wee bit of frustration.  Better late
> than never though, that's for sure.
> 
> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> out or KVM exits to userspace, would mean that host perf events won't be active
> for potentially large swaths of non-KVM code.  Any function calls or event/exception
> handlers that occur within the context of ioctl(KVM_RUN) would run with host
> perf events disabled.

Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
sounds like a ton more.

/me frobs around the kvm code some...

Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
seems to run with IRQs disabled, so at most you can trigger a #PF or
something, which will then trip an exception fixup because you can't run
#PF with IRQs disabled etc..

That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
coupled with the existing kvm_x86_handle_exit_irqoff() seems like
reasonable solution from where I'm sitting. That also more or less
matches the FPU state save/restore AFAICT.

Or are you talking about the whole of vcpu_run() ? That seems like a
massive amount of code, and doesn't look like anything I'd call a
fast-path. Also, much of that loop has preemption enabled...

> Are you ok with that approach?  Assuming we don't completely botch things, the
> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.

Since you steal the whole PMU, can't you re-route the PMI to something
that's virt friendly too?

> > It also means ::exclude_guest should actually work -- it often does not
> > today -- the IBS thing for example totally ignores it.
> 
> Is that already an in-tree, or are you talking about Manali's proposed series to
> support virtualizing IBS?

The IBS code as is, it totally ignores ::exclude_guest. Manali was going
to add some of it. But I'm not at all sure about the state of the other
PMU drivers we have.

Just for giggles, P4 has VMX support... /me runs like crazy
Sean Christopherson Oct. 3, 2023, 3:23 p.m. UTC | #21
On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> > On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> 
> > > I'm not sure what you're suggesting here. It will have to save/restore
> > > all those MSRs anyway. Suppose it switches between vCPUs.
> > 
> > The "when" is what's important.   If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that.  Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
> 
> See, you're the virt-nerd and I'm sure you know what you're talking
> about, but I have no clue :-) I didn't know there were different levels
> of vm-exit.

An exit is essentially a fancy exception/event.  The hardware transition from
guest=>host is the exception itself (VM-Exit), and the transition back to guest
is analogous to the IRET (VM-Enter).

In between, software will do some amount of work, and the amount of work that is
done can vary quite significantly depending on what caused the exit.

> > FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> > vCPU is pinned 1:1 with a host pCPU.
> 
> I've been given to understand that vm-exit is a bad word in this
> scenario, any exit is a fail. They get MWAIT and all the other crap and
> more or less pretend to be real hardware.
> 
> So why do you care about those MSRs so much? That should 'never' happen
> in this scenario.

It's not feasible to completely avoid exits, as current/upcoming hardware doesn't
(yet) virtualize a few important things.  Off the top of my head, the two most
relevant flows are:

  - APIC_LVTPC entry and PMU counters.  If a PMU counter overflows, the NMI that
    is generated will trigger a hardware level NMI and cause an exit.  And sadly,
    the guest's NMI handler (assuming the guest is also using NMIs for PMIs) will
    trigger another exit when it clears the mask bit in its LVTPC entry.

  - Timer related IRQs, both in the guest and host.  These are the biggest source
    of exits on modern hardware.  Neither AMD nor Intel provide a virtual APIC
    timer, and so KVM must trap and emulate writes to TSC_DEADLINE (or to APIC_TMICT),
    and the subsequent IRQ will also cause an exit.

The cumulative cost of all exits is important, but the latency of each individual
exit is even more critical, especially for PMU related stuff.  E.g. if the guest
is trying to use perf/PMU to profile a workload, adding a few thousand cycles to
each exit will introduce too much noise into the results.

> > > > Or at least, that was my reading of things.  Maybe it was just a
> > > > misunderstanding because we didn't do a good job of defining the behavior.
> > > 
> > > This might be the case. I don't particularly care where the guest
> > > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > > PMU is usable again etc..
> > 
> > Well drat, that there would have saved a wee bit of frustration.  Better late
> > than never though, that's for sure.
> > 
> > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > out or KVM exits to userspace, would mean that host perf events won't be active
> > for potentially large swaths of non-KVM code.  Any function calls or event/exception
> > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > perf events disabled.
> 
> Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> sounds like a ton more.
> 
> /me frobs around the kvm code some...
> 
> Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> seems to run with IRQs disabled, so at most you can trigger a #PF or
> something, which will then trip an exception fixup because you can't run
> #PF with IRQs disabled etc..
>
> That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> reasonable solution from where I'm sitting. That also more or less
> matches the FPU state save/restore AFAICT.
> 
> Or are you talking about the whole of vcpu_run() ? That seems like a
> massive amount of code, and doesn't look like anything I'd call a
> fast-path. Also, much of that loop has preemption enabled...

The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
uses preempt notifiers to context switch state if the vCPU task is scheduled
out/in, we'd use those hooks to swap PMU state.

Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
that KVM can handle internally, the roundtrip is <1500 cycles.  The exit_fastpath loop is
roughly half that.

But for exits that are more complex, e.g. if the guest hits the equivalent of a
page fault, the cost of handling the page fault can vary significantly.  It might
be <1500, but it might also be 10x that if handling the page fault requires faulting
in a new page in the host.

We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
because doing too much work with IRQs disabled can cause latency problems for the
host.  This isn't much of a concern for slice-of-hardware setups, but would be
quite problematic for other use cases.

And except for obviously slow paths (from the guest's perspective), extra latency
on any exit can be problematic.  E.g. even if we got to the point where KVM handles
99% of exits in the fastpath (may or may not be feasible), a not-fastpath exit at an
inopportune time could throw off the guest's profiling results, introduce unacceptable
jitter, etc.

> > Are you ok with that approach?  Assuming we don't completely botch things, the
> > interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
> 
> Since you steal the whole PMU, can't you re-route the PMI to something
> that's virt friendly too?

Hmm, actually, we probably could.  It would require modifying the host's APIC_LVTPC
entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
vector.  As gross as that sounds, it might actually be cleaner overall than
deciphering whether an NMI belongs to the host or guest, and it would almost
certainly yield lower latency for guest PMIs.
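
Concretely, the LVTPC swizzle could look roughly like this minimal sketch;
KVM_GUEST_PMI_VECTOR is an invented vector name, and the surrounding PMU
load/put hooks are assumed, not existing code:

static void kvm_reroute_host_lvtpc(bool entering_guest)
{
	/*
	 * Deliver PMIs on a KVM-owned vector while the guest owns the PMU,
	 * and restore normal NMI delivery when it does not.
	 */
	if (entering_guest)
		apic_write(APIC_LVTPC, KVM_GUEST_PMI_VECTOR);
	else
		apic_write(APIC_LVTPC, APIC_DM_NMI);
}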
Manali Shukla Oct. 3, 2023, 5:31 p.m. UTC | #22
On 10/3/2023 1:46 PM, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
>> On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> 
>>> I'm not sure what you're suggesting here. It will have to save/restore
>>> all those MSRs anyway. Suppose it switches between vCPUs.
>>
>> The "when" is what's important.   If KVM took a literal interpretation of
>> "exclude guest" for pass-through MSRs, then KVM would context switch all those
>> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
>> reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
>> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
>> VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
>> <1500 cycles, and "fastpath" exits are something like half that.  Switching all
>> the MSRs is likely 1000+ cycles, if not double that.
> 
> See, you're the virt-nerd and I'm sure you know what you're talking
> about, but I have no clue :-) I didn't know there were different levels
> of vm-exit.
> 
>> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
>> vCPU is pinned 1:1 with a host pCPU.
> 
> I've been given to understand that vm-exit is a bad word in this
> scenario, any exit is a fail. They get MWAIT and all the other crap and
> more or less pretend to be real hardware.
> 
> So why do you care about those MSRs so much? That should 'never' happen
> in this scenario.
> 
>>>> Or at least, that was my reading of things.  Maybe it was just a
>>>> misunderstanding because we didn't do a good job of defining the behavior.
>>>
>>> This might be the case. I don't particularly care where the guest
>>> boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
>>> PMU is usable again etc..
>>
>> Well drat, that there would have saved a wee bit of frustration.  Better late
>> than never though, that's for sure.
>>
>> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
>> out or KVM exits to userspace, would mean that host perf events won't be active
>> for potentially large swaths of non-KVM code.  Any function calls or event/exception
>> handlers that occur within the context of ioctl(KVM_RUN) would run with host
>> perf events disabled.
> 
> Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> sounds like a ton more.
> 
> /me frobs around the kvm code some...
> 
> Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> seems to run with IRQs disabled, so at most you can trigger a #PF or
> something, which will then trip an exception fixup because you can't run
> #PF with IRQs disabled etc..
> 
> That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> reasonable solution from where I'm sitting. That also more or less
> matches the FPU state save/restore AFAICT.
> 
> Or are you talking about the whole of vcpu_run() ? That seems like a
> massive amount of code, and doesn't look like anything I'd call a
> fast-path. Also, much of that loop has preemption enabled...
> 
>> Are you ok with that approach?  Assuming we don't completely botch things, the
>> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
> 
> Since you steal the whole PMU, can't you re-route the PMI to something
> that's virt friendly too?
> 
>>> It also means ::exclude_guest should actually work -- it often does not
>>> today -- the IBS thing for example totally ignores it.
>>
>> Is that already an in-tree, or are you talking about Manali's proposed series to
>> support virtualizing IBS?
> 
> The IBS code as is, it totally ignores ::exclude_guest. Manali was going
> to add some of it. But I'm not at all sure about the state of the other
> PMU drivers we have.
> 
> Just for giggles, P4 has VMX support... /me runs like crazy

I am working on Solution 1.1 from the approach proposed in [*]. 
I will send V2 (for IBS virtualization series) based on it shortly.

* https://lore.kernel.org/all/20230908133114.GK19320@noisy.programming.kicks-ass.net/T/#m7389910e577966c93a0b50fbaf9442be80dc730b

- Manali
Jim Mattson Oct. 3, 2023, 6:21 p.m. UTC | #23
On Tue, Oct 3, 2023 at 8:23 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> > > On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> >
> > > > I'm not sure what you're suggesting here. It will have to save/restore
> > > > all those MSRs anyway. Suppose it switches between vCPUs.
> > >
> > > The "when" is what's important.   If KVM took a literal interpretation of
> > > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > > reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
> > > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > > VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
> > > <1500 cycles, and "fastpath" exits are something like half that.  Switching all
> > > the MSRs is likely 1000+ cycles, if not double that.
> >
> > See, you're the virt-nerd and I'm sure you know what you're talking
> > about, but I have no clue :-) I didn't know there were different levels
> > of vm-exit.
>
> An exit is essentially a fancy exception/event.  The hardware transition from
> guest=>host is the exception itself (VM-Exit), and the transition back to guest
> is analagous to the IRET (VM-Enter).
>
> In between, software will do some amount of work, and the amount of work that is
> done can vary quite significantly depending on what caused the exit.
>
> > > FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> > > vCPU is pinned 1:1 with a host pCPU.
> >
> > I've been given to understand that vm-exit is a bad word in this
> > scenario, any exit is a fail. They get MWAIT and all the other crap and
> > more or less pretend to be real hardware.
> >
> > So why do you care about those MSRs so much? That should 'never' happen
> > in this scenario.
>
> It's not feasible to completely avoid exits, as current/upcoming hardware doesn't
> (yet) virtualize a few important things.  Off the top of my head, the two most
> relevant flows are:
>
>   - APIC_LVTPC entry and PMU counters.  If a PMU counter overflows, the NMI that
>     is generated will trigger a hardware level NMI and cause an exit.  And sadly,
>     the guest's NMI handler (assuming the guest is also using NMIs for PMIs) will
>     trigger another exit when it clears the mask bit in its LVTPC entry.

In addition, when the guest PMI handler writes to
IA32_PERF_GLOBAL_CTRL to disable all counters (and again later to
re-enable the counters), KVM has to intercept that as well, with
today's implementation. Similarly, on each guest timer tick, when
guest perf is multiplexing PMCs, KVM has to intercept writes to
IA32_PERF_GLOBAL_CTRL.

Furthermore, in some cases, Linux perf seems to double-disable
counters, using both the individual enable bits in each PerfEvtSel, as
well as the bits in PERF_GLOBAL_CTRL.  KVM has to intercept writes to
the PerfEvtSels as well. Off-topic, but I'd like to request that Linux
perf *only* use the enable bits in IA32_PERF_GLOBAL_CTRL on
architectures where that is supported. Just leave the enable bits set
in the PerfEvtSels, to avoid unnecessary VM-exits. :)
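
To make the intercept cost concrete, a simplified sketch of the guest-side
writes described above, each of which KVM intercepts today (MSR and bit
names are the Linux definitions; the function is illustrative, not actual
guest code):

static void guest_disable_counter0(u64 evtsel0)
{
	/* Stop everything via global ctrl -- one intercepted write. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

	/*
	 * Then also clear the per-counter enable bit -- a second, largely
	 * redundant, intercepted write.
	 */
	wrmsrl(MSR_P6_EVNTSEL0, evtsel0 & ~ARCH_PERFMON_EVENTSEL_ENABLE);
}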

>   - Timer related IRQs, both in the guest and host.  These are the biggest source
>     of exits on modern hardware.  Neither AMD nor Intel provide a virtual APIC
>     timer, and so KVM must trap and emulate writes to TSC_DEADLINE (or to APIC_TMICT),
>     and the subsequent IRQ will also cause an exit.
>
> The cumulative cost of all exits is important, but the latency of each individual
> exit is even more critical, especially for PMU related stuff.  E.g. if the guest
> is trying to use perf/PMU to profile a workload, adding a few thousand cycles to
> each exit will introduce too much noise into the results.
>
> > > > > Or at least, that was my reading of things.  Maybe it was just a
> > > > > misunderstanding because we didn't do a good job of defining the behavior.
> > > >
> > > > This might be the case. I don't particularly care where the guest
> > > > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > > > PMU is usable again etc..
> > >
> > > Well drat, that there would have saved a wee bit of frustration.  Better late
> > > than never though, that's for sure.
> > >
> > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > for potentially large swaths of non-KVM code.  Any function calls or event/exception
> > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > perf events disabled.
> >
> > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > sounds like a ton more.
> >
> > /me frobs around the kvm code some...
> >
> > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > something, which will then trip an exception fixup because you can't run
> > #PF with IRQs disabled etc..
> >
> > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > reasonable solution from where I'm sitting. That also more or less
> > matches the FPU state save/restore AFAICT.
> >
> > Or are you talking about the whole of vcpu_run() ? That seems like a
> > massive amount of code, and doesn't look like anything I'd call a
> > fast-path. Also, much of that loop has preemption enabled...
>
> The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
> uses preempt notifiers to context switch state if the vCPU task is scheduled
> out/in, we'd use those hooks to swap PMU state.
>
> Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
> that KVM can handle internally, the roundtrip is <1500.   The exit_fastpath loop is
> roughly half that.
>
> But for exits that are more complex, e.g. if the guest hits the equivalent of a
> page fault, the cost of handling the page fault can vary significantly.  It might
> be <1500, but it might also be 10x that if handling the page fault requires faulting
> in a new page in the host.
>
> We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> because doing too much work with IRQs disabled can cause latency problems for the
> host.  This isn't much of a concern for slice-of-hardware setups, but would be
> quite problematic for other use cases.
>
> And except for obviously slow paths (from the guest's perspective), extra latency
> on any exit can be problematic.  E.g. even if we got to the point where KVM handles
> 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> inopportune time could throw off the guest's profiling results, introduce unacceptable
> jitter, etc.
>
> > > Are you ok with that approach?  Assuming we don't completely botch things, the
> > > interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
> >
> > Since you steal the whole PMU, can't you re-route the PMI to something
> > that's virt friendly too?
>
> Hmm, actually, we probably could.  It would require modifying the host's APIC_LVTPC
> entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
> vector.  As gross as that sounds, it might actually be cleaner overall than
> deciphering whether an NMI belongs to the host or guest, and it would almost
> certainly yield lower latency for guest PMIs.

Ugh.  Can't KVM just install its own NMI handler? Either way, it's
possible for late PMIs to arrive in the wrong context.
Mingwei Zhang Oct. 3, 2023, 10:02 p.m. UTC | #24
On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> > > > > worse it's not a choice based in technical reality.
> > >
> > > The technical reality is that context switching the PMU between host and guest
> > > requires reading and writing far too many MSRs for KVM to be able to context
> > > switch at every VM-Enter and every VM-Exit.  And PMIs skidding past VM-Exit adds
> > > another layer of complexity to deal with.
> >
> > I'm not sure what you're suggesting here. It will have to save/restore
> > all those MSRs anyway. Suppose it switches between vCPUs.
>
> The "when" is what's important.   If KVM took a literal interpretation of
> "exclude guest" for pass-through MSRs, then KVM would context switch all those
> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
> <1500 cycles, and "fastpath" exits are something like half that.  Switching all
> the MSRs is likely 1000+ cycles, if not double that.

Hi Sean,

Sorry, I have no intention to interrupt the conversation, but this is
slightly confusing to me.

I remember when doing AMX, we added a gigantic 8KB buffer to the FPU
context switch. That works well in Linux today. Why can't we do the
same for PMU? Assuming we context switch all counters, selectors and
global stuff there?

On the VM boundary, all we need is for global ctrl, right? We stop all
counters when we exit from the guest and restore the guest value of
global control when entering it. But the actual PMU context switch
should be deferred roughly to the same time we switch FPU (xsave
state). This means we do that when switching task_struct and/or
returning to userspace.

Please kindly correct me if this is flawed.
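
A hypothetical sketch of that split, assuming pass-through counters and
the guest's shadow global_ctrl in struct kvm_pmu; none of this is
existing KVM code:

/* At VM-Exit: stop the guest's counters, touch nothing else. */
static void kvm_pmu_stop_on_vmexit(struct kvm_vcpu *vcpu)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
}

/* At VM-Enter: re-arm the guest's view of global ctrl. */
static void kvm_pmu_resume_on_vmenter(struct kvm_vcpu *vcpu)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, vcpu_to_pmu(vcpu)->global_ctrl);
}

/*
 * The full save/restore of counters and selectors is deferred to
 * sched-out or return-to-userspace, like the xsave state.
 */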

ah, I think I understand what you are saying... So, "If KVM took a
literal interpretation of "exclude guest" for pass-through MSRs..."

perf_event.attr.exclude_guest might need a different meaning if we
have a pass-through PMU for KVM: exclude_guest=1 could no longer mean
that the host counters are restored at the VM-Exit boundary, because
doing that would be a performance disaster.

Thanks.
-Mingwei




>
> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> vCPU is pinned 1:1 with a host pCPU.  I suspect it's a similar story for the other
> CSPs that are trying to provide accurate PMUs to guests.  If a vCPU is scheduled
> out, then yes, a bunch of context switching will need to happen.  But for the
> types of VMs that are the target audience, their vCPUs will rarely be scheduled
> out.
>
> > > > > It's a choice out of lazyness, disabling host PMU is not a requirement
> > > > > for pass-through.
> > >
> > > The requirement isn't passthrough access, the requirements are that the guest's
> > > PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> > > guest doesn't have a meaningful impact on guest performance.
> >
> > Given you don't think that trapping MSR accesses is viable, what else
> > besides pass-through did you have in mind?
>
> Sorry, I didn't mean to imply that we don't want pass-through of MSRs.  What I was
> trying to say is that *just* passthrough MSRs doesn't solve the problem, because
> again I thought the whole "context switch PMU state less often" approach had been
> firmly nak'd.
>
> > > > Not just a choice of laziness, but it will clearly be forced upon users
> > > > by external entities:
> > > >
> > > >    "Pass ownership of the PMU to the guest and have no host PMU, or you
> > > >     won't have sane guest PMU support at all. If you disagree, please open
> > > >     a support ticket, which we'll ignore."
> > >
> > > We don't have sane guest PMU support today.
> >
> > Because KVM is too damn hard to use, rebooting a machine is *sooo* much
> > easier -- and I'm really not kidding here.
> >
> > Anyway, you want pass-through, but that doesn't mean host cannot use
> > PMU when vCPU thread is not running.
> >
> > > If y'all are willing to let KVM redefined exclude_guest to be KVM's outer run
> > > loop, then I'm all for exploring that option.  But that idea got shot down over
> > > a year ago[*].
> >
> > I never saw that idea in that thread. You virt people keep talking like
> > I know how KVM works -- I'm not joking when I say I have no clue about
> > virt.
> >
> > Sometimes I get a little clue after y'all keep bashing me over the head,
> > but it quickly erases itself.
> >
> > > Or at least, that was my reading of things.  Maybe it was just a
> > > misunderstanding because we didn't do a good job of defining the behavior.
> >
> > This might be the case. I don't particularly care where the guest
> > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > PMU is usable again etc..
>
> Well drat, that there would have saved a wee bit of frustration.  Better late
> than never though, that's for sure.
>
> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> out or KVM exits to userspace, would mean that host perf events won't be active
> for potentially large swaths of non-KVM code.  Any function calls or event/exception
> handlers that occur within the context of ioctl(KVM_RUN) would run with host
> perf events disabled.
>
> Are you ok with that approach?  Assuming we don't completely botch things, the
> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
>
> > Re-reading parts of that linked thread, I see mention of
> > PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
> > remember all that nonsense. Worst part is that I can't find the relevant
> > perf code when I grep for that string :/
>
> The PT stuff is actually an example of what we don't want, at least not exactly.
> The concept of a hard switch between guest and host is ok, but as-is, KVM's PT
> code does a big pile of MSR reads and writes on every VM-Enter and VM-Exit.
>
> > Anyway, what I don't like is KVM silently changing all events to
> > ::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
> > events to hard error when they run into a vCPU with pass-through on
> > (PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
> > ::exclude_guest=0 events when a vCPU with pass-through exists -- with
> > minimal scope (this probably means all CPU events, but only relevant
> > vCPU events).
>
> Agreed, I am definitely against KVM silently doing anything.  And the more that
> is surfaced to the user, the better.
>
> > It also means ::exclude_guest should actually work -- it often does not
> > today -- the IBS thing for example totally ignores it.
>
> Is that already an in-tree, or are you talking about Manali's proposed series to
> support virtualizing IBS?
>
> > Anyway, none of this means host cannot use PMU because virt muck wants
> > it.
Peter Zijlstra Oct. 4, 2023, 11:21 a.m. UTC | #25
On Tue, Oct 03, 2023 at 08:23:26AM -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:

> > > Well drat, that there would have saved a wee bit of frustration.  Better late
> > > than never though, that's for sure.
> > > 
> > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > for potentially large swaths of non-KVM code.  Any function calls or event/exception
> > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > perf events disabled.
> > 
> > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > sounds like a ton more.
> > 
> > /me frobs around the kvm code some...
> > 
> > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > something, which will then trip an exception fixup because you can't run
> > #PF with IRQs disabled etc..
> >
> > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > reasonable solution from where I'm sitting. That also more or less
> > matches the FPU state save/restore AFAICT.
> > 
> > Or are you talking about the whole of vcpu_run() ? That seems like a
> > massive amount of code, and doesn't look like anything I'd call a
> > fast-path. Also, much of that loop has preemption enabled...
> 
> The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
> uses preempt notifiers to context switch state if the vCPU task is scheduled
> out/in, we'd use those hooks to swap PMU state.
> 
> Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
> that KVM can handle internally, the roundtrip is <1500.   The exit_fastpath loop is
> roughly half that.
> 
> But for exits that are more complex, e.g. if the guest hits the equivalent of a
> page fault, the cost of handling the page fault can vary significantly.  It might
> be <1500, but it might also be 10x that if handling the page fault requires faulting
> in a new page in the host.
> 
> We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> because doing too much work with IRQs disabled can cause latency problems for the
> host.  This isn't much of a concern for slice-of-hardware setups, but would be
> quite problematic for other use cases.
> 
> And except for obviously slow paths (from the guest's perspective), extra latency
> on any exit can be problematic.  E.g. even if we got to the point where KVM handles
> 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> inopportune time could throw off the guest's profiling results, introduce unacceptable
> jitter, etc.

I'm confused... the PMU must not be running after vm-exit. It must not
be able to profile the host. So what jitter are you talking about?

Even if we persist the MSR contents, the PMU itself must be disabled on
vm-exit and enabled on vm-enter. If not by hardware then by software
poking at the global ctrl msr.

I also don't buy the latency argument, we already do full and complete
PMU rewrites with IRQs disabled in the context switch path. And as
mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
save/restore.

I would much prefer we keep the PMU swizzle inside the IRQ disabled
region of vcpu_enter_guest(). That's already a ton better than you have
today.
Peter Zijlstra Oct. 4, 2023, 11:32 a.m. UTC | #26
On Tue, Oct 03, 2023 at 11:21:46AM -0700, Jim Mattson wrote:
> On Tue, Oct 3, 2023 at 8:23 AM Sean Christopherson <seanjc@google.com> wrote:

> > > Since you steal the whole PMU, can't you re-route the PMI to something
> > > that's virt friendly too?
> >
> > Hmm, actually, we probably could.  It would require modifying the host's APIC_LVTPC
> > entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
> > vector.  As gross as that sounds, it might actually be cleaner overall than
> > deciphering whether an NMI belongs to the host or guest, and it would almost
> > certainly yield lower latency for guest PMIs.
> 
> Ugh.  Can't KVM just install its own NMI handler? Either way, it's
> possible for late PMIs to arrive in the wrong context.

I don't think you realize what a horrible trainwreck the NMI handler is.
Every handler has to be able to determine if the NMI is theirs to
handle.

If we go do this whole swizzle thing we must find a sequence of PMU
'instructions' that syncs against the PMI, because otherwise we're going
to lose PMIs and that's going to be a *TON* of pain.

I'll put it on the agenda for the next time I talk with the hardware
folks. But IIRC the AMD thing is *much* worse in this regard than the
Intel one.
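
For context, every participant on the x86 NMI chain has to claim or
decline each NMI; a KVM-owned handler would have to do the same, roughly
like the sketch below (the per-CPU guest_pmu_loaded flag is invented for
illustration):

static DEFINE_PER_CPU(bool, guest_pmu_loaded);

static int kvm_guest_pmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	/* Not ours unless the guest currently owns the PMU on this CPU. */
	if (!this_cpu_read(guest_pmu_loaded))
		return NMI_DONE;

	/* ... record/forward the PMI to the current vCPU ... */
	return NMI_HANDLED;
}

static int __init kvm_guest_pmi_init(void)
{
	return register_nmi_handler(NMI_LOCAL, kvm_guest_pmi_handler, 0,
				    "kvm_guest_pmi");
}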
Mingwei Zhang Oct. 4, 2023, 7:51 p.m. UTC | #27
On Wed, Oct 4, 2023 at 4:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Oct 03, 2023 at 08:23:26AM -0700, Sean Christopherson wrote:
> > On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
>
> > > > Well drat, that there would have saved a wee bit of frustration.  Better late
> > > > than never though, that's for sure.
> > > >
> > > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > > for potentially large swaths of non-KVM code.  Any function calls or event/exception
> > > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > > perf events disabled.
> > >
> > > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > > sounds like a ton more.
> > >
> > > /me frobs around the kvm code some...
> > >
> > > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > > something, which will then trip an exception fixup because you can't run
> > > #PF with IRQs disabled etc..
> > >
> > > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > > reasonable solution from where I'm sitting. That also more or less
> > > matches the FPU state save/restore AFAICT.
> > >
> > > Or are you talking about the whole of vcpu_run() ? That seems like a
> > > massive amount of code, and doesn't look like anything I'd call a
> > > fast-path. Also, much of that loop has preemption enabled...
> >
> > The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
> > uses preempt notifiers to context switch state if the vCPU task is scheduled
> > out/in, we'd use those hooks to swap PMU state.
> >
> > Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
> > that KVM can handle internally, the roundtrip is <1500.   The exit_fastpath loop is
> > roughly half that.
> >
> > But for exits that are more complex, e.g. if the guest hits the equivalent of a
> > page fault, the cost of handling the page fault can vary significantly.  It might
> > be <1500, but it might also be 10x that if handling the page fault requires faulting
> > in a new page in the host.
> >
> > We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> > because doing too much work with IRQs disabled can cause latency problems for the
> > host.  This isn't much of a concern for slice-of-hardware setups, but would be
> > quite problematic for other use cases.
> >
> > And except for obviously slow paths (from the guest's perspective), extra latency
> > on any exit can be problematic.  E.g. even if we got to the point where KVM handles
> > 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> > inopportune time could throw off the guest's profiling results, introduce unacceptable
> > jitter, etc.
>
> I'm confused... the PMU must not be running after vm-exit. It must not
> be able to profile the host. So what jitter are you talking about?
>
> Even if we persist the MSR contents, the PMU itself must be disabled on
> vm-exit and enabled on vm-enter. If not by hardware then by software
> poking at the global ctrl msr.
>
> I also don't buy the latency argument, we already do full and complete
> PMU rewrites with IRQs disabled in the context switch path. And as
> mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
> save/restore.
>
> I would much prefer we keep the PMU swizzle inside the IRQ disabled
> region of vcpu_enter_guest(). That's already a ton better than you have
> today.

Peter, I think the jitter Sean was talking about is a potential
issue in the pass-through implementation. If KVM follows the perf
subsystem requirement, then after VM-Exit, any perf_event with
exclude_guest=1 (and higher priority?) should start counting. Because
the guest VM exclusively owns the PMU with all counters at that point,
the gigantic MSR save/restore is needed, which requires a whole bunch
of wrmsrs. That would be a performance disaster, since VM-Exits can
happen at a very high frequency.

In comparison, if we are talking about the existing non-pass-through
implementation, then the PMU context switch immediately becomes
simple: only a global ctrl tweak is needed at the VM boundary (to stop
exclude_host events and start exclude_guest events in one shot), since
the guest VM and the host perf subsystem share the hardware PMU counters.

Peter, the latency argument for the pass-through implementation is
something that we hope you can buy. This should be relatively easy
to prove. I can provide some data if you need it.

To cope with that, KVM might need to defer that msr save/restore for
PMU to a later point in pass-through implementation. But that will be
conflicting with the support of the perf_event with exclude_guest=1.
So, I guess that's why Sean mentioned this: "If y'all are willing to
let KVM redefined exclude_guest to be KVM's outer run loop, then I'm
all for exploring that option."

Note that the situation is similar to AMX, i.e., when the guest exits
to the host, the FPU should be switched to the host FPU as well, but
because AMX state is too big and thus too slow to switch, KVM defers
that to a very late point.

Hope this explains a little bit and sorry if this might be an
injection of noise.

Thanks.
-Mingwei


>
Sean Christopherson Oct. 4, 2023, 8:43 p.m. UTC | #28
On Tue, Oct 03, 2023, Mingwei Zhang wrote:
> On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <seanjc@google.com> wrote:
> > The "when" is what's important.   If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU).  The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter.  E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that.  Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
> 
> Hi Sean,
> 
> Sorry, I have no intention to interrupt the conversation, but this is
> slightly confusing to me.
> 
> I remember when doing AMX, we added gigantic 8KB memory in the FPU
> context switch. That works well in Linux today. Why can't we do the
> same for PMU? Assuming we context switch all counters, selectors and
> global stuff there?

That's what we (Google folks) are proposing.  However, there are significant
side effects if KVM context switches PMU outside of vcpu_run(), whereas the FPU
doesn't suffer the same problems.

Keeping the guest FPU resident for the duration of vcpu_run() is, in terms of
functionality, completely transparent to the rest of the kernel.  From the kernel's
perspective, the guest FPU is just a variation of a userspace FPU, and the kernel
is already designed to save/restore userspace/guest FPU state when the kernel wants
to use the FPU for whatever reason.  And crucially, kernel FPU usage is explicit
and contained, e.g. see kernel_fpu_{begin,end}(), and comes with mechanisms for
KVM to detect when the guest FPU needs to be reloaded (see TIF_NEED_FPU_LOAD).

The PMU is a completely different story.  PMU usage, a.k.a. perf, by design is
"always running".  KVM can't transparently stop host usage of the PMU, as disabling
host PMU usage stops perf events from counting/profiling whatever it is they're
supposed to profile.

Today, KVM minimizes the "downtime" of host PMU usage by context switching PMU
state at VM-Enter and VM-Exit, or at least as close as possible, e.g. for LBRs
and Intel PT.

What we are proposing would *significantly* increase the downtime, to the point
where it would almost be unbounded in some paths, e.g. if KVM faults in a page,
gup() could go swap in memory from disk, install PTEs, and so on and so forth.
If the host is trying to profile something related to swap or memory management,
they're out of luck.
Sean Christopherson Oct. 4, 2023, 9:50 p.m. UTC | #29
On Wed, Oct 04, 2023, Mingwei Zhang wrote:
> On Wed, Oct 4, 2023 at 4:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > Or are you talking about the whole of vcpu_run() ? That seems like a
> > > > massive amount of code, and doesn't look like anything I'd call a
> > > > fast-path. Also, much of that loop has preemption enabled...
> > >
> > > The whole of vcpu_run().  And yes, much of it runs with preemption enabled.  KVM
> > > uses preempt notifiers to context switch state if the vCPU task is scheduled
> > > out/in, we'd use those hooks to swap PMU state.
> > >
> > > Jumping back to the exception analogy, not all exits are equal.  For "simple" exits
> > > that KVM can handle internally, the roundtrip is <1500.   The exit_fastpath loop is
> > > roughly half that.
> > >
> > > But for exits that are more complex, e.g. if the guest hits the equivalent of a
> > > page fault, the cost of handling the page fault can vary significantly.  It might
> > > be <1500, but it might also be 10x that if handling the page fault requires faulting
> > > in a new page in the host.
> > >
> > > We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> > > because doing too much work with IRQs disabled can cause latency problems for the
> > > host.  This isn't much of a concern for slice-of-hardware setups, but would be
> > > quite problematic for other use cases.
> > >
> > > And except for obviously slow paths (from the guest's perspective), extra latency
> > > on any exit can be problematic.  E.g. even if we got to the point where KVM handles
> > > 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> > > inopportune time could throw off the guest's profiling results, introduce unacceptable
> > > jitter, etc.
> >
> > I'm confused... the PMU must not be running after vm-exit. It must not
> > be able to profile the host. So what jitter are you talking about?
> >
> > Even if we persist the MSR contents, the PMU itself must be disabled on
> > vm-exit and enabled on vm-enter. If not by hardware then by software
> > poking at the global ctrl msr.
> >
> > I also don't buy the latency argument, we already do full and complete
> > PMU rewrites with IRQs disabled in the context switch path. And as
> > mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
> > save/restore.
> >
> > I would much prefer we keep the PMU swizzle inside the IRQ disabled
> > region of vcpu_enter_guest(). That's already a ton better than you have
> > today.

...

> Peter, that latency argument in pass-through implementation is
> something that we hope you could buy. This should be relatively easy
> to prove. I can provide some data if you need.

You and Peter are talking about two different latencies.  Or rather, how the
latency impacts two different things.

What Peter is talking about is the latency impact on the host, specifically how
much work is done with IRQs disabled.

What you are talking about is the latency impact on the guest, i.e. how much
guest performance is affected if KVM swaps MSRs on every exit.

Peter is contending that swapping PMU MSRs with IRQs disabled isn't a big deal,
because the kernel already does as much during a context switch.  I agree, *if*
we're talking about only adding the PMU MSRs.

You (and I) are contending that the latency impact on the guest will be too high
if KVM swaps in the inner VM-Exit loop.  This is not analogous to host context
switches, as VM-Exits can occur at a much higher frequency than context switches,
and can be triggered by events that have nothing to do with the guest.

There's some confusion here though because of what I said earlier:

  We don't want to get too aggressive with moving stuff into the exit_fastpath
  loop, because doing too much work with IRQs disabled can cause latency problems
  for the host. 

By "stuff" I wasn't talking about PMU MSRs, I was referring to all exit handling
that KVM *could* move into the IRQs disabled section in order to mitigate the
concerns that we have about the latency impacts on the guest.  E.g. if most exits
are handled in the IRQs disabled section, then KVM could handle most exits without
swapping PMU state and thus limit the impact on guest performance, and not cause
too much of the host perf "downtime" that I mentioned in the other thread[*].

However, my concern is that handling most exits with IRQs disabled would result
in KVM doing too much work with IRQs disabled, i.e. would impact the host latency
that Peter is talking about.  And I'm more than a bit terrified of calling into
code that doesn't expect to be called with IRQs disabled.

Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
register swapping?

  A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
  B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
     are enabled, *if and only if* the current CPU has one or more perf events
     that want to use the hardware PMU
  C. Put guest PMU state at vcpu_put()
  D. Add a perf callback that is invoked from IRQ context when perf wants to
     configure a new PMU-based event, *before* actually programming the MSRs,
     and have KVM's callback put the guest PMU state

If there are host perf events that want to use the PMU, then KVM will swap fairly
aggressively and the "downtime" of the host perf events will be limited to the
small window around VM-Enter/VM-Exit.

If there are no such host events, KVM will swap on the first entry to the guest,
and keep the guest PMU loaded until the vCPU is put.

The perf callback in (D) would allow perf to program system-wide events on all
CPUs without clobbering guest PMU state.

I think that would make everyone happy.  As long as our hosts don't create perf
events, then we get the "swap as little as possible" behavior without significantly
impacting the host's ability to utilize perf.  If our host screws up and creates
perf events on CPUs that are running vCPUs, then the degraded vCPU performance is
on us.

Rough sketch below, minus the perf callback and any of the actual swapping logic.

[*] https://lore.kernel.org/all/ZR3Ohk50rSofAnSL@google.com

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..86699d310224 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,30 @@ struct kvm_pmu_ops {
 
 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
 
+static inline void kvm_load_guest_pmu(struct kvm_vcpu *vcpu)
+{
+       struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+       lockdep_assert_irqs_disabled();
+
+       if (pmu->guest_state_loaded)
+               return;
+
+       <swap state>
+       pmu->guest_state_loaded = true;
+}
+
+static inline void kvm_put_guest_pmu(struct kvm_vcpu *vcpu)
+{
+       struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+       lockdep_assert_irqs_disabled();
+
+       if (!pmu->guest_state_loaded)
+               return;
+
+       <swap state>
+       pmu->guest_state_loaded = false;
+}
+
 static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 {
        /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e645f5b1e2c..93a8f268c37b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4903,8 +4903,20 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+       unsigned long flags;
        int idx;
 
+       /*
+        * This can get false positives, but not false negatives, i.e. KVM will
+        * never fail to put the PMU, but may unnecessarily disable IRQs to
+        * safely check if the PMU is still loaded.
+        */
+       if (kvm_is_guest_pmu_loaded(vcpu)) {
+               local_irq_save(flags);
+               kvm_put_guest_pmu(vcpu);
+               local_irq_restore(flags);
+       }
+
        if (vcpu->preempted) {
                if (!vcpu->arch.guest_state_protected)
                        vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
@@ -10759,6 +10771,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                set_debugreg(0, 7);
        }
 
+       kvm_load_guest_pmu(vcpu);
+
        guest_timing_enter_irqoff();
 
        for (;;) {
@@ -10810,6 +10824,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
        if (hw_breakpoint_active())
                hw_breakpoint_restore();
 
+       if (perf_has_hw_events())
+               kvm_put_guest_pmu(vcpu);
+
        vcpu->arch.last_vmentry_cpu = vcpu->cpu;
        vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
Sean Christopherson Oct. 4, 2023, 10:05 p.m. UTC | #30
On Wed, Oct 04, 2023, Sean Christopherson wrote:
> Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
> register swapping?
> 
>   A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
>   B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
>      are enabled, *if and only if* the current CPU has one or more perf events
>      that want to use the hardware PMU
>   C. Put guest PMU state at vcpu_put()
>   D. Add a perf callback that is invoked from IRQ context when perf wants to
>      configure a new PMU-based event, *before* actually programming the MSRs,
>      and have KVM's callback put the guest PMU state
> 
> If there are host perf events that want to use the PMU, then KVM will swap fairly
> aggressively and the "downtime" of the host perf events will be limited to the
> small window around VM-Enter/VM-Exit.
> 
> If there are no such host events, KVM will swap on the first entry to the guest,
> and keep the guest PMU loaded until the vCPU is put.
> 
> The perf callback in (D) would allow perf to program system-wide events on all
> CPUs without clobbering guest PMU state.
> 
> I think that would make everyone happy.  As long as our hosts don't create perf
> events, then we get the "swap as little as possible" behavior without significantly
> impacting the host's ability to utilize perf.  If our host screws up and creates
> perf events on CPUs that are running vCPUs, then the degraded vCPU performance is
> on us.
> 
> Rough sketch below, minus the perf callback or any of actual swapping logic.

Another reason to go for an approach that doesn't completely kill off host PMU
usage: just because we don't plan on enabling perf events in *production*, there
will undoubtedly be times where we want to enable perf events to debug issues
(outside of prod) in the host kernel/KVM that affect VMs with a passthrough PMU.

So I'll add a self-NAK to the idea of completely disabling the host PMU, I think
that would burn us quite badly at some point.
Like Xu Oct. 8, 2023, 10:04 a.m. UTC | #31
Hi all,

On 5/10/2023 6:05 am, Sean Christopherson wrote:
> So I'll add a self-NAK to the idea of completely disabling the host PMU, I think
> that would burn us quite badly at some point.

I seem to have missed the party, so allow me to add a few more comments
to better facilitate future discussion in this direction:

(1) PMU counters on TEE

SGX and SEV are already part of the upstream kernel, but we don't have
a clear idea what kind of performance data would be obtained by
sampling enclaves or a SEV guest with hardware PMU counters on the
host (would perf-report show holes where the data is missing, or pure
encrypted data?), nor have we established the right expectations.
On AMD, however, profiling a SEV-SNP guest is supported:

"Fingerprinting attack protection is also not supported in the current
generation of these technologies. Fingerprinting attacks attempt to
determine what code the VM is running by monitoring its access patterns,
performance counter information, etc." (AMD SEV-SNP White Paper, 2020)

(2) PMU Guest/Host Co-existence Development

The introduction of pt_mode in the KVM was misleading, leading subsequent
developers to believe that static slicing of pmu facility usage was allowed.

In user scenarios, host perf should treat PMU resource requests from
vCPUs as regular requests (which can be weighted unequally under the
host's authority, IMO) while still allowing the host to profile any
software entity (including hypervisor and guest code, including TEE
code in debug mode). Functionality takes precedence over performance.

The semantics of exclude_guest/host should be tied to the hw-event isolation
settings on the hardware interfaces, not to the human-defined sw-context.
The perf subsystem is the arbiter of pmu resource allocation on the host,
and any attempt to change the status quo (or maintenance scope) will not
succeed. Therefore, vPMU developers are required to be familiar with the
implementation details of both perf and kvm, and try not to add perf APIs
dedicated to serving KVM blindly.
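
For reference, the exclude_guest bit in question is the one user space
sets in perf_event_attr; a minimal host-side example using only the
existing UAPI:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count cycles on CPU 0 for all tasks, explicitly excluding guest mode. */
static int open_host_only_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;	/* don't count while a vCPU is in guest mode */

	return syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
		       -1 /* group_fd */, 0 /* flags */);
}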

Getting host and guests to share limited PMU resources harmoniously is not
particularly difficult compared to real rocket science in the kernel, so
please don't be intimidated.

(3) Performance Concern in Co-existence

I wonder if it would be possible to add a knob to turn off the perf counter
multiplexing mechanism on the host, so that in coexistence scenarios, the
number of VM exits on the vCPU would not be increased by counter rotations
due to timer expiration.

For normal counters shared between guest and host, the number of counter
MSR switches required at the VM-entry level will be relatively small.
(The number of counters is growing; for LBR, it is possible to share LBR
select values to avoid frequent switching, but of course this requires the
implementation of a software filtering mechanism when the host/guest read
the LBR records, and some additional PMI; for DS-based PEBS, host and guest
PEBS buffers are automatically segregated based on linear address).

There is a lot of room for optimisation here; in real scenarios,
triggering a large number of register switches in the host/guest PMU
is to be expected and easy to observe (accompanied by a large number
of PMIs).

If we are really worried about the virtualisation overhead of vPMU, then
virtio-pmu might be an option. In this technology direction, the back-end
pmu can add more performance events of interest to the VM (including host
un-core and off-core events, host-side software events, etc.) In terms of
implementation, the semantics of the MSRLIST instruction can be re-used,
along with compatibility with the different PMU hardware interfaces on ARM
and Risc-v, which is also very friendly to production environments based on
its virtio nature.

(4) New vPMU Feature Development

We should not put KVM's current vPMU support into maintenance-only mode.
Users want more PMU features in the guest, like AMD vIBS, higher Intel
PMU versions, Intel topdown and Arch LBR, with more on the way. The
maturity of the different features' patch sets isn't the same, but we
can't ignore these
real needs because of available time for key maintainers, apathy towards
contributors, mindset avoidance and laziness, and preference for certain
technology stacks. These technical challenges will attract an influx of
open source heroes to push the technology forward, which is good in the
long run.

(5) More to think about

Similar to the guest PMU feature, the debugging feature may face the
same situation. For example, what happens when you debug code inside
the host and the guest at the same time (the host debugs
hypervisor/guest code and the guest debugs guest code only)?

Forgive my ignorance and offence, but we don't want to see a KVM subsystem
controlled and driven by Google's demands.

Please feel free to share comments to move forward.

Thanks,
Like Xu
Manali Shukla Oct. 9, 2023, 5:03 p.m. UTC | #32
Hi all,

I would like to add following things to the discussion just for the awareness of
everyone.

Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
working on prototyping it.

As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
type C: guest PMC states are loaded at VMRUN automatically but host PMC states
are not saved by hardware. If hypervisor is using the performance counters, it
is hypervisor's responsibility to save PERF_CTL registers to host save area
prior to VMRUN and restore them after VMEXIT. In order to tackle PMC overflow
interrupts in guest itself, NMI virtualization or AVIC can be used, so that
interrupt on PMC overflow in guest will not leak to host.
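
A rough sketch of that host-side save/restore around VMRUN is below; the
helper names and the hook point are illustrative only, not existing KVM code:

/* Illustrative sketch only: preserve the six core PERF_CTL MSRs across
 * VMRUN, since Swap type C loads guest state but does not save host state.
 * Helper names are made up; real code would hook KVM's SVM vcpu_run path
 * and follow the host save area conventions. */
#define AMD_PERF_CTL(i)		(0xc0010200 + 2 * (i))	/* MSR_F15H_PERF_CTLn */
#define AMD_NUM_CORE_PMCS	6

static u64 host_perf_ctl[AMD_NUM_CORE_PMCS];

static void svm_save_host_perf_ctl(void)	/* call before VMRUN */
{
	int i;

	for (i = 0; i < AMD_NUM_CORE_PMCS; i++)
		rdmsrl(AMD_PERF_CTL(i), host_perf_ctl[i]);
}

static void svm_restore_host_perf_ctl(void)	/* call after VMEXIT */
{
	int i;

	for (i = 0; i < AMD_NUM_CORE_PMCS; i++)
		wrmsrl(AMD_PERF_CTL(i), host_perf_ctl[i]);
}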

- Manali
Peter Zijlstra Oct. 11, 2023, 2:15 p.m. UTC | #33
On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> Hi all,
> 
> I would like to add following things to the discussion just for the awareness of
> everyone.
> 
> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> working on prototyping it.
> 
> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> are not saved by hardware.

Per the previous discussion, doing this while host has active counters
that do not have ::exclude_guest=1 is invalid and must result in an
error.
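
(For reference, a host-side counter that honours that rule just sets the
exclude_guest bit in perf_event_attr; a minimal userspace sketch:)

/* Sketch: open a CPU-wide cycles counter that explicitly excludes guest
 * mode, so it never counts while a vCPU is running on that CPU.
 * A cpu-wide (pid == -1) event needs CAP_PERFMON or root. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;		/* do not count in guest mode */

	fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
		     -1 /* group */, 0 /* flags */);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	printf("counting host-only cycles on CPU0, fd=%d\n", fd);
	close(fd);
	return 0;
}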

Also, I'm assuming it is all optional, a host can still profile a guest
if all is configured just so?

> If hypervisor is using the performance counters, it
> is hypervisor's responsibility to save PERF_CTL registers to host save area
> prior to VMRUN and restore them after VMEXIT. 

Does VMEXIT clear global_ctrl at least?

> In order to tackle PMC overflow
> interrupts in guest itself, NMI virtualization or AVIC can be used, so that
> interrupt on PMC overflow in guest will not leak to host.

Can you please clarify -- AMD has this history with very dodgy PMI
boundaries. See the whole amd_pmu_adjust_nmi_window() crud. Even the
PMUv2 update didn't fix that nonsense.

How is any virt stuff supposed to fix this? If the hardware is late
delivering PMI, what guarantees a guest PMI does not land in host
context and vice-versa?

How does NMI virtualization (what even is that) or AVIC (I'm assuming
that's a virtual interrupt controller) help?

Please make very sure, with your hardware team, that PMI must not be
delivered after clearing global_ctrl (preferably) or at the very least,
there exists a sequence of operations that provides a hard barrier
to order PMI.
Peter Zijlstra Oct. 11, 2023, 2:20 p.m. UTC | #34
On Wed, Oct 04, 2023 at 02:50:46PM -0700, Sean Christopherson wrote:

> Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
> register swapping?
> 
>   A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
>   B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
>      are enabled, *if and only if* the current CPU has one or perf events that
>      wants to use the hardware PMU
>   C. Put guest PMU state at vcpu_put()
>   D. Add a perf callback that is invoked from IRQ context when perf wants to
>      configure a new PMU-based events, *before* actually programming the MSRs,
>      and have KVM's callback put the guest PMU state
> 

No real objection, but I would suggest arriving at that solution by
first building the simple-stupid thing and then making it more
complicated in additional patches. Hmm?
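
(Roughly, the A-D flow quoted above amounts to the sketch below; every
helper name is an illustrative placeholder, not an existing KVM or perf
symbol.)

/* Rough sketch of the A-D ordering quoted above; all helpers here are
 * illustrative placeholders, not existing KVM or perf code. */
static void vcpu_run_pmu_sketch(struct kvm_vcpu *vcpu)
{
	local_irq_disable();
	kvm_load_guest_pmu(vcpu);		/* A: after IRQs are disabled */

	enter_guest(vcpu);

	/* B: only pay for the swap if host perf wants the hardware PMU */
	if (perf_hw_events_pending_on_this_cpu())
		kvm_put_guest_pmu(vcpu);
	local_irq_enable();
}

/*
 * C: the final put happens in vcpu_put() when the vCPU is descheduled.
 * D: a perf callback, invoked before perf programs a new hardware event,
 *    would also call kvm_put_guest_pmu() so perf never observes guest state.
 */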
Sean Christopherson Oct. 13, 2023, 5:02 p.m. UTC | #35
On Wed, Oct 11, 2023, Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 02:50:46PM -0700, Sean Christopherson wrote:
> 
> > Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
> > register swapping?
> > 
> >   A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
> >   B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
> >      are enabled, *if and only if* the current CPU has one or perf events that
> >      wants to use the hardware PMU
> >   C. Put guest PMU state at vcpu_put()
> >   D. Add a perf callback that is invoked from IRQ context when perf wants to
> >      configure a new PMU-based events, *before* actually programming the MSRs,
> >      and have KVM's callback put the guest PMU state
> > 
> 
> No real objection, but I would suggest arriving at that solution by
> first building the simple-stupid thing and then making it more
> complicated in additional patches. Hmm?

Yeah, for sure.
Manali Shukla Oct. 17, 2023, 10:24 a.m. UTC | #36
On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
>> Hi all,
>>
>> I would like to add following things to the discussion just for the awareness of
>> everyone.
>>
>> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
>> working on prototyping it.
>>
>> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
>> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
>> are not saved by hardware.
> 
> Per the previous discussion, doing this while host has active counters
> that do not have ::exclude_guest=1 is invalid and must result in an
> error.
> 

Yeah, exclude_guest should be enforced on host, while host has active PMC
counters and VPMC is enabled.

> Also, I'm assuming it is all optional, a host can still profile a guest
> if all is configured just so?
> 

Correct, host should be able to profile guest, if VPMC is not enabled.

>> If hypervisor is using the performance counters, it
>> is hypervisor's responsibility to save PERF_CTL registers to host save area
>> prior to VMRUN and restore them after VMEXIT. 
> 
> Does VMEXIT clear global_ctrl at least?
> 

global_ctrl will be initialized to reset value(0x3F) during VMEXIT. Similarly,
all the perf_ctl and perf_ctr are initialized to reset values(0) at VMEXIT.

>> In order to tackle PMC overflow
>> interrupts in guest itself, NMI virtualization or AVIC can be used, so that
>> interrupt on PMC overflow in guest will not leak to host.
> 
> Can you please clarify -- AMD has this history with very dodgy PMI
> boundaries. See the whole amd_pmu_adjust_nmi_window() crud. Even the
> PMUv2 update didn't fix that nonsense.
> 
> How is any virt stuff supposed to fix this? If the hardware is late
> delivering PMI, what guarantees a guest PMI does not land in host
> context and vice-versa?
> 
> How does NMI virtualization (what even is that) or AVIC (I'm assuming
> that's a virtual interrupt controller) help?
> 

When NMI virtualization is enabled and the source of the VNMI is in the guest,
microcode will make sure that the VNMI is delivered directly to the guest (even
if the NMI was delivered late), so there is no issue of a guest NMI leaking to
the host.

> Please make very sure, with your hardware team, that PMI must not be
> delivered after clearing global_ctrl (preferably) or at the very least,
> there exists a sequence of operations that provides a hard barrier
> to order PMI.
> 
We are verifying all the corner cases while prototyping PMC virtualization.
As of now, we don't see the issue of guest NMIs leaking to the host, but the
latency issue still stays.
Mingwei Zhang Oct. 17, 2023, 4:58 p.m. UTC | #37
On Tue, Oct 17, 2023 at 3:24 AM Manali Shukla <manali.shukla@amd.com> wrote:
>
> On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> > On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> >> Hi all,
> >>
> >> I would like to add following things to the discussion just for the awareness of
> >> everyone.
> >>
> >> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> >> working on prototyping it.
> >>
> >> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> >> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> >> are not saved by hardware.
> >
> > Per the previous discussion, doing this while host has active counters
> > that do not have ::exclude_guest=1 is invalid and must result in an
> > error.
> >
>
> Yeah, exclude_guest should be enforced on host, while host has active PMC
> counters and VPMC is enabled.
>
> > Also, I'm assuming it is all optional, a host can still profile a guest
> > if all is configured just so?
> >
>
> Correct, host should be able to profile guest, if VPMC is not enabled.
>
> >> If hypervisor is using the performance counters, it
> >> is hypervisor's responsibility to save PERF_CTL registers to host save area
> >> prior to VMRUN and restore them after VMEXIT.
> >
> > Does VMEXIT clear global_ctrl at least?
> >
>
> global_ctrl will be initialized to reset value(0x3F) during VMEXIT. Similarly,
> all the perf_ctl and perf_ctr are initialized to reset values(0) at VMEXIT.

Are these guest values (automatically) saved in the guest area of VMCB
on VMEXIT?

>
> >> In order to tackle PMC overflow
> >> interrupts in guest itself, NMI virtualization or AVIC can be used, so that
> >> interrupt on PMC overflow in guest will not leak to host.
> >
> > Can you please clarify -- AMD has this history with very dodgy PMI
> > boundaries. See the whole amd_pmu_adjust_nmi_window() crud. Even the
> > PMUv2 update didn't fix that nonsense.
> >
> > How is any virt stuff supposed to fix this? If the hardware is late
> > delivering PMI, what guarantees a guest PMI does not land in host
> > context and vice-versa?
> >
> > How does NMI virtualization (what even is that) or AVIC (I'm assuming
> > that's a virtual interrupt controller) help?
> >
>
> When NMI virtualization is enabled and the source of the VNMI is in the guest,
> microcode will make sure that the VNMI is delivered directly to the guest (even
> if the NMI was delivered late), so there is no issue of a guest NMI leaking to
> the host.
>
> > Please make very sure, with your hardware team, that PMI must not be
> > delivered after clearing global_ctrl (preferably) or at the very least,
> > there exists a sequence of operations that provides a hard barrier
> > to order PMI.
> >
> We are verifying all the corner cases while prototyping PMC virtualization.
> As of now, we don't see the issue of guest NMIs leaking to the host, but the
> latency issue still stays.

How long is the latency? From the existing code (amd_pmu_disable_all()
=> amd_pmu_check_overflow()), it seems to take up to 50 microseconds
in amd_pmu_check_overflow(). But I wonder how much in reality.

Thanks.
-Mingwei
>
Mingwei Zhang Oct. 18, 2023, 12:01 a.m. UTC | #38
On Tue, Oct 17, 2023 at 9:58 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 3:24 AM Manali Shukla <manali.shukla@amd.com> wrote:
> >
> > On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> > > On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> > >> Hi all,
> > >>
> > >> I would like to add following things to the discussion just for the awareness of
> > >> everyone.
> > >>
> > >> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> > >> working on prototyping it.
> > >>
> > >> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> > >> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> > >> are not saved by hardware.
> > >
> > > Per the previous discussion, doing this while host has active counters
> > > that do not have ::exclude_guest=1 is invalid and must result in an
> > > error.
> > >
> >
> > Yeah, exclude_guest should be enforced on host, while host has active PMC
> > counters and VPMC is enabled.
> >
> > > Also, I'm assuming it is all optional, a host can still profile a guest
> > > if all is configured just so?
> > >
> >
> > Correct, host should be able to profile guest, if VPMC is not enabled.
> >
> > >> If hypervisor is using the performance counters, it
> > >> is hypervisor's responsibility to save PERF_CTL registers to host save area
> > >> prior to VMRUN and restore them after VMEXIT.
> > >
> > > Does VMEXIT clear global_ctrl at least?
> > >
> >
> > global_ctrl will be initialized to reset value(0x3F) during VMEXIT. Similarly,
> > all the perf_ctl and perf_ctr are initialized to reset values(0) at VMEXIT.
>
> Are these guest values (automatically) saved in the guest area of VMCB
> on VMEXIT?
>

Never mind on this one. If both registers are Swap type C, then they
should be saved to the guest area of the VMCB on VMEXIT (according to APM
vol. 2, Table B-3). So this means KVM does not need to intercept these
MSRs in a pass-through implementation.
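
(If that holds, pass-through would roughly amount to clearing the
MSR-permission-bitmap intercepts for those registers; a sketch with purely
illustrative helper names:)

/* Illustrative sketch only: let the guest access the core PERF_CTL/PERF_CTR
 * MSRs directly by clearing their intercepts in the MSR permission bitmap.
 * svm_clear_msr_intercept() is a made-up helper; KVM's real SVM code has its
 * own bitmap accessors. */
#define MSR_F15H_PERF_CTL0	0xc0010200	/* CTLn/CTRn interleave by 2 */
#define MSR_F15H_PERF_CTR0	0xc0010201
#define NUM_CORE_PMCS		6

static void svm_passthrough_core_pmu(struct kvm_vcpu *vcpu)
{
	int i;

	for (i = 0; i < NUM_CORE_PMCS; i++) {
		svm_clear_msr_intercept(vcpu, MSR_F15H_PERF_CTL0 + 2 * i);
		svm_clear_msr_intercept(vcpu, MSR_F15H_PERF_CTR0 + 2 * i);
	}
}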

Thanks.
-Mingwei
diff mbox series

Patch

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fa355d3658a6..1c349290677c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3158,17 +3158,26 @@  intel_bts_constraints(struct perf_event *event)
 	return NULL;
 }
 
+static struct event_constraint *intel_virt_event_constraints[] __read_mostly = {
+	&vlbr_constraint,
+	&vmetrics_constraint,
+};
+
 /*
- * Note: matches a fake event, like Fixed2.
+ * Note: matches a virtual event, like vmetrics.
  */
 static struct event_constraint *
-intel_vlbr_constraints(struct perf_event *event)
+intel_virt_constraints(struct perf_event *event)
 {
-	struct event_constraint *c = &vlbr_constraint;
+	int i;
+	struct event_constraint *c;
 
-	if (unlikely(constraint_match(c, event->hw.config))) {
-		event->hw.flags |= c->flags;
-		return c;
+	for (i = 0; i < ARRAY_SIZE(intel_virt_event_constraints); i++) {
+		c = intel_virt_event_constraints[i];
+		if (unlikely(constraint_match(c, event->hw.config))) {
+			event->hw.flags |= c->flags;
+			return c;
+		}
 	}
 
 	return NULL;
@@ -3368,7 +3377,7 @@  __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 {
 	struct event_constraint *c;
 
-	c = intel_vlbr_constraints(event);
+	c = intel_virt_constraints(event);
 	if (c)
 		return c;
 
@@ -5369,6 +5378,11 @@  static struct attribute *spr_tsx_events_attrs[] = {
 	NULL,
 };
 
+struct event_constraint vmetrics_constraint =
+	__EVENT_CONSTRAINT(INTEL_FIXED_VMETRICS_EVENT,
+			   (1ULL << INTEL_PMC_IDX_FIXED_VMETRICS),
+			   FIXED_EVENT_FLAGS, 1, 0, 0);
+
 static ssize_t freeze_on_smi_show(struct device *cdev,
 				  struct device_attribute *attr,
 				  char *buf)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c8ba2be7585d..a0d12989a483 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1482,6 +1482,7 @@  void reserve_lbr_buffers(void);
 
 extern struct event_constraint bts_constraint;
 extern struct event_constraint vlbr_constraint;
+extern struct event_constraint vmetrics_constraint;
 
 void intel_pmu_enable_bts(u64 config);
 
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 63e1ce1f4b27..d767807aae91 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -390,6 +390,21 @@  static inline bool is_topdown_idx(int idx)
  */
 #define INTEL_PMC_IDX_FIXED_VLBR		(GLOBAL_STATUS_LBRS_FROZEN_BIT)
 
+/*
+ * We model guest TopDown metrics event tracing similarly.
+ *
+ * Guest metric events are recognized as vCPU process's events on host, they
+ * would be time-sharing multiplexed with other host metric events, so that
+ * we choose bit 48 (INTEL_PMC_IDX_METRIC_BASE) as the index of virtual
+ * metrics event.
+ */
+#define INTEL_PMC_IDX_FIXED_VMETRICS		(INTEL_PMC_IDX_METRIC_BASE)
+
+/*
+ * Pseudo-encoding the guest metrics event as event=0x00,umask=0x11,
+ * since it would claim bit 48 which is effectively Fixed16.
+ */
+#define INTEL_FIXED_VMETRICS_EVENT		0x1100
 /*
  * Pseudo-encoding the guest LBR event as event=0x00,umask=0x1b,
  * since it would claim bit 58 which is effectively Fixed26.