
[0/1] KVM: x86/vPMU: Speed up vmexit for AMD Zen 4 CPUs

Message ID 20231109180646.2963718-1-khorenko@virtuozzo.com (mailing list archive)

Message

Konstantin Khorenko Nov. 9, 2023, 6:06 p.m. UTC
We have detected a significant performance drop in our atomic test, which
checks the CPUID instruction rate inside an L1 VM on an AMD node.

Investigation led to 2 mainstream patches which introduced extra event
accounting:

   018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
   9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")

On an AMD Zen 3 CPU that resulted in an immediate 43% drop in the CPUID
rate.

With the latest mainstream kernel the performance difference is much
smaller but still quite noticeable: 13.4%, and it shows up on AMD CPUs only.

It looks like iterating over all PMCs in kvm_pmu_trigger_event() is cheap
on Intel CPUs and expensive on AMD CPUs.

So the idea behind this patch is to skip the iteration over PMCs entirely
when the PMU is disabled for a VM, or when the PMU is enabled but there
are no active PMCs at all.

Unfortunately:
 * the current kernel code does not differentiate whether the PMU is
   globally enabled for a VM or not (pmu->version is always 1)
 * AMD CPUs older than Zen 4 do not support PMU v2, so an efficient
   check for enabled PMCs is not possible

=> the patch speeds up vmexit on AMD Zen 4 CPUs only, which is sad,
   but it does not hurt other CPUs - and this is fortunate!
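
A minimal sketch of the early-out check this patch adds, reconstructed from
the guest_pmu_is_enabled() fragment shown in measurement (3) below; the
exact function name, placement and comments are assumptions, not the final
patch:

static bool guest_pmu_is_enabled(struct kvm_pmu *pmu)
{
	/* vPMU is disabled for the whole VM (pmu->version == 0). */
	if (!pmu->version)
		return false;

	/*
	 * With PMU v2 the guest has PERF_GLOBAL_CTRL: if none of the valid
	 * enable bits are set there, no counter can be counting and the
	 * emulated event can be dropped without touching any PMC.
	 * Pre-Zen 4 AMD CPUs only expose PMU v1, so they never take this
	 * fast path.
	 */
	if (pmu->version >= 2 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
		return false;

	return true;
}

kvm_pmu_trigger_event() would then return immediately when
guest_pmu_is_enabled() is false, skipping the PMC iteration entirely.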

I have no access to a node with an AMD Zen 4 CPU, so I had to test on an
AMD Zen 3 CPU and I hope my expectations are right for AMD Zen 4.

I would appreciate it if anyone could perform the test on a real AMD Zen 4 node.

AMD performance results:
CPU: AMD Zen 3 (three!): AMD EPYC 7443P 24-Core Processor

 * The test binary is run inside an AlmaLinux 9 VM with their stock kernel
   5.14.0-284.11.1.el9_2.x86_64.
 * The test binary checks the CPUID instruction rate (instructions per sec).
 * Default VM config (PMU is off, pmu->version is reported as 1).
 * The Host runs the kernel under test.

 # for i in 1 2 3 4 5 ; do ./at_cpu_cpuid.pub ; done | \
   awk -e '{print $4;}' | \
   cut -f1 --delimiter='.' | \
   ./avg.sh

Measurements:
1. Host runs the latest stock mainstream kernel, commit 305230142ae0.
2. Host runs the same mainstream kernel + the current patch.
3. Host runs the same mainstream kernel + the current patch + force
   guest_pmu_is_enabled() to always return "false" using the following change:

   -       if (pmu->version >= 2 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
   +       if (pmu->version == 1 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))

   -----------------------------------------
   | Kernels       | CPUID rate            |
   -----------------------------------------
   | 1.            | 1360250               |
   | 2.            | 1365536 (+ 0.4%)      |
   | 3.            | 1541850 (+13.4%)      |
   -----------------------------------------

Measurement (2) shows only fluctuation; the performance is not increased
because the test was done on a Zen 3 CPU, so the fast check for active
PMCs cannot be used.
Measurement (3) shows the performance boost expected on a Zen 4 CPU under
the same test.

The test used:
# cat at_cpu_cpuid.pub.cpp
/*
 * The test executes CPUID instruction in a loop and reports the calls rate.
 */

#include <stdio.h>
#include <time.h>

/* #define CPUID_EAX            0x80000002 */
#define CPUID_EAX               0x29a
#define CPUID_ECX               0

#define TEST_EXEC_SECS          30      // in seconds
#define LOOPS_APPROX_RATE       1000000

static inline void cpuid(unsigned int _eax, unsigned int _ecx)
{
        unsigned int regs[4] = {_eax, 0, _ecx, 0};

        asm __volatile__(
                "cpuid"
                : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
                :  "0" (regs[0]),  "1" (regs[1]),  "2" (regs[2]),  "3" (regs[3])
                : "memory");
}

double cpuid_rate_loops(int loops_num)
{
        int i;
        clock_t start_time, end_time;
        double spent_time, rate;

        start_time = clock();

        for (i = 0; i < loops_num; i++)
                cpuid((unsigned int)CPUID_EAX, (unsigned int)CPUID_ECX);

        end_time = clock();
        spent_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;

        rate = (double)loops_num / spent_time;

        return rate;
}

int main(int argc, char* argv[])
{
        double approx_rate, rate;
        int loops;

        /* First we detect approximate CPUIDs rate. */
        approx_rate = cpuid_rate_loops(LOOPS_APPROX_RATE);

        /*
         * How many loops there should be in order to run the test for
         * TEST_EXEC_SECS seconds?
         */
        loops = (int)(approx_rate * TEST_EXEC_SECS);

        /* Get the precise instructions rate. */
        rate = cpuid_rate_loops(loops);

        printf( "CPUID instructions rate: %f instructions/second\n", rate);

        return 0;
}

Konstantin Khorenko (1):
  KVM: x86/vPMU: Check PMU is enabled for vCPU before searching for PMC

 arch/x86/kvm/pmu.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

Jim Mattson Nov. 9, 2023, 10:44 p.m. UTC | #1
On Thu, Nov 9, 2023 at 10:24 AM Konstantin Khorenko
<khorenko@virtuozzo.com> wrote:
>
> We have detected significant performance drop of our atomic test which
> checks the rate of CPUID instructions rate inside an L1 VM on an AMD
> node.
>
> Investigation led to 2 mainstream patches which have introduced extra
> events accounting:
>
>    018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
>    9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
>
> And on an AMD Zen 3 CPU that resulted in immediate 43% drop in the CPUID
> rate.
>
> Checking latest mainsteam kernel the performance difference is much less
> but still quite noticeable: 13.4% and shows up on AMD CPUs only.
>
> Looks like iteration over all PMCs in kvm_pmu_trigger_event() is cheap
> on Intel and expensive on AMD CPUs.
>
> So the idea behind this patch is to skip iterations over PMCs at all in
> case PMU is disabled for a VM completely or PMU is enabled for a VM, but
> there are no active PMCs at all.

A better solution may be to maintain two bitmaps of general purpose
counters that need to be incremented, one for instructions retired and
one for branch instructions retired. Set or clear these bits whenever
the PerfEvtSelN MSRs are written. I think I would keep the PGC bits
separate, on those microarchitectures that support PGC. Then,
kvm_pmu_trigger_event() need only consult the appropriate bitmap (or
the logical and of that bitmap with PGC). In most cases, the value
will be zero, and the function can simply return.

This would work even for AMD microarchitectures that don't support PGC.
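
A rough sketch of that idea (the bitmap fields and the *_EVENTSEL constants
below are made-up names for illustration, not existing KVM code; a real
implementation would also need to cover Intel fixed counters):

static void kvm_pmu_update_trigger_bitmaps(struct kvm_pmu *pmu,
					   struct kvm_pmc *pmc)
{
	u64 event = pmc->eventsel & kvm_pmu_ops.EVENTSEL_EVENT;

	/* Recompute this PMC's membership on every PerfEvtSelN write. */
	__clear_bit(pmc->idx, pmu->insn_retired_pmcs);
	__clear_bit(pmc->idx, pmu->branches_retired_pmcs);

	if (!(pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE))
		return;

	if (event == INSN_RETIRED_EVENTSEL)
		__set_bit(pmc->idx, pmu->insn_retired_pmcs);
	else if (event == BRANCHES_RETIRED_EVENTSEL)
		__set_bit(pmc->idx, pmu->branches_retired_pmcs);
}

kvm_pmu_trigger_event() would then consult only the matching bitmap
(optionally ANDed with global_ctrl where PGC exists) and return right away
when it is empty.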

> Unfortunately
>  * current kernel code does not differentiate if PMU is globally enabled
>    for a VM or not (pmu->version is always 1)
>  * AMD CPUs older than Zen 4 do not support PMU v2 and thus efficient
>    check for enabled PMCs is not possible
>
> => the patch speeds up vmexit for AMD Zen 4 CPUs only, this is sad.
>    but the patch does not hurt other CPUs - and this is fortunate!
>
> i have no access to a node with AMD Zen 4 CPU, so i had to test on
> AMD Zen 3 CPU and i hope my expectations are right for AMD Zen 4.
>
> i would appreciate if anyone perform the test of a real AMD Zen 4 node.
>
> AMD performance results:
> CPU: AMD Zen 3 (three!): AMD EPYC 7443P 24-Core Processor
>
>  * The test binary is run inside an AlmaLinux 9 VM with their stock kernel
>    5.14.0-284.11.1.el9_2.x86_64.
>  * Test binary checks the CPUID instractions rate (instructions per sec).
>  * Default VM config (PMU is off, pmu->version is reported as 1).
>  * The Host runs the kernel under test.
>
>  # for i in 1 2 3 4 5 ; do ./at_cpu_cpuid.pub ; done | \
>    awk -e '{print $4;}' | \
>    cut -f1 --delimiter='.' | \
>    ./avg.sh
>
> Measurements:
> 1. Host runs stock latest mainstream kernel commit 305230142ae0.
> 2. Host runs same mainstream kernel + current patch.
> 3. Host runs same mainstream kernel + current patch + force
>    guest_pmu_is_enabled() to always return "false" using following change:
>
>    -       if (pmu->version >= 2 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
>    +       if (pmu->version == 1 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
>
>    -----------------------------------------
>    | Kernels       | CPUID rate            |
>    -----------------------------------------
>    | 1.            | 1360250               |
>    | 2.            | 1365536 (+ 0.4%)      |
>    | 3.            | 1541850 (+13.4%)      |
>    -----------------------------------------
>
> Measurement (2) gives some fluctuation, the performance is not increased
> because the test was done on a Zen 3 CPU, so we are unable to use fast
> check for active PMCs.
> Measurement (3) shows expected performance boost on a Zen 4 CPU under
> the same test.
>
> The test used:
> # cat at_cpu_cpuid.pub.cpp
> /*
>  * The test executes CPUID instruction in a loop and reports the calls rate.
>  */
>
> #include <stdio.h>
> #include <time.h>
>
> /* #define CPUID_EAX            0x80000002 */
> #define CPUID_EAX               0x29a
> #define CPUID_ECX               0
>
> #define TEST_EXEC_SECS          30      // in seconds
> #define LOOPS_APPROX_RATE       1000000
>
> static inline void cpuid(unsigned int _eax, unsigned int _ecx)
> {
>         unsigned int regs[4] = {_eax, 0, _ecx, 0};
>
>         asm __volatile__(
>                 "cpuid"
>                 : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
>                 :  "0" (regs[0]),  "1" (regs[1]),  "2" (regs[2]),  "3" (regs[3])
>                 : "memory");
> }
>
> double cpuid_rate_loops(int loops_num)
> {
>         int i;
>         clock_t start_time, end_time;
>         double spent_time, rate;
>
>         start_time = clock();
>
>         for (i = 0; i < loops_num; i++)
>                 cpuid((unsigned int)CPUID_EAX, (unsigned int)CPUID_ECX);
>
>         end_time = clock();
>         spent_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
>
>         rate = (double)loops_num / spent_time;
>
>         return rate;
> }
>
> int main(int argc, char* argv[])
> {
>         double approx_rate, rate;
>         int loops;
>
>         /* First we detect approximate CPUIDs rate. */
>         approx_rate = cpuid_rate_loops(LOOPS_APPROX_RATE);
>
>         /*
>          * How many loops there should be in order to run the test for
>          * TEST_EXEC_SECS seconds?
>          */
>         loops = (int)(approx_rate * TEST_EXEC_SECS);
>
>         /* Get the precise instructions rate. */
>         rate = cpuid_rate_loops(loops);
>
>         printf( "CPUID instructions rate: %f instructions/second\n", rate);
>
>         return 0;
> }
>
> Konstantin Khorenko (1):
>   KVM: x86/vPMU: Check PMU is enabled for vCPU before searching for PMC
>
>  arch/x86/kvm/pmu.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> --
> 2.39.3
>
>
Jim Mattson Nov. 9, 2023, 10:52 p.m. UTC | #2
On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
<khorenko@virtuozzo.com> wrote:
>
> Hi All,
>
> as a followup for my patch: i have noticed that
> currently Intel kernel code provides an ability to detect if PMU is totally disabled for a VM
> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/> in the VM config which
> results in "-cpu pmu=off" qemu option).
>
> So the question is - is it possible to enhance the code for AMD to also honor PMU VM setting or it is
> impossible by design?

The AMD architectural specification prior to AMD PMU v2 does not allow
one to describe a CPU (via CPUID or MSRs) that has fewer than 4
general purpose PMU counters. While AMD PMU v2 does allow one to
describe such a CPU, legacy software that knows nothing of AMD PMU v2
can expect four counters regardless.

Having said that, KVM does provide a per-VM capability for disabling
the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
section 8.35 in Documentation/virt/kvm/api.rst.
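
For reference, a userspace VMM enables that capability on the VM file
descriptor before creating any vCPUs, roughly like this (minimal sketch,
error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Must be issued on the VM fd before the first vCPU is created. */
static int disable_vpmu(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_PMU_CAPABILITY,
		.args = { KVM_PMU_CAP_DISABLE },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}
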
Sean Christopherson Nov. 9, 2023, 11:42 p.m. UTC | #3
On Thu, Nov 09, 2023, Jim Mattson wrote:
> On Thu, Nov 9, 2023 at 10:24 AM Konstantin Khorenko
> <khorenko@virtuozzo.com> wrote:
> >
> > We have detected significant performance drop of our atomic test which
> > checks the rate of CPUID instructions rate inside an L1 VM on an AMD
> > node.
> >
> > Investigation led to 2 mainstream patches which have introduced extra
> > events accounting:
> >
> >    018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
> >    9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> >
> > And on an AMD Zen 3 CPU that resulted in immediate 43% drop in the CPUID
> > rate.
> >
> > Checking latest mainsteam kernel the performance difference is much less
> > but still quite noticeable: 13.4% and shows up on AMD CPUs only.
> >
> > Looks like iteration over all PMCs in kvm_pmu_trigger_event() is cheap
> > on Intel and expensive on AMD CPUs.
> >
> > So the idea behind this patch is to skip iterations over PMCs at all in
> > case PMU is disabled for a VM completely or PMU is enabled for a VM, but
> > there are no active PMCs at all.
> 
> A better solution may be to maintain two bitmaps of general purpose
> counters that need to be incremented, one for instructions retired and
> one for branch instructions retired. Set or clear these bits whenever
> the PerfEvtSelN MSRs are written. I think I would keep the PGC bits
> separate, on those microarchitectures that support PGC. Then,
> kvm_pmu_trigger_event() need only consult the appropriate bitmap (or
> the logical and of that bitmap with PGC). In most cases, the value
> will be zero, and the function can simply return.
> 
> This would work even for AMD microarchitectures that don't support PGC.

Yeah.  There are multiple lower-hanging fruits to be picked though, most of which
won't conflict with using dedicated per-event bitmaps, or at worst are trivial
to resolve.

 1. Don't call into perf to get the eventsel (which generates an indirect call)
    on every invocation, let alone every iteration.

 2. Avoid getting the CPL when it's irrelevant.

 3. Check the eventsel before querying the event filter.

 4. Mask out PMCs that aren't globally enabled from the get-go (masking out
    PMCs based on eventsel would essentially be the same as per-event bitmaps).

I'm definitely not opposed to per-event bitmaps, but it'd be nice to avoid them,
e.g. if we can eke out 99% of the performance just by doing a few obvious
optimizations.

This is the end result of what I'm testing and will (hopefully) post shortly:

static inline bool pmc_is_eventsel_match(struct kvm_pmc *pmc, u64 eventsel)
{
	return !((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB);
}

static inline bool cpl_is_matched(struct kvm_pmc *pmc)
{
	bool select_os, select_user;
	u64 config;

	if (pmc_is_gp(pmc)) {
		config = pmc->eventsel;
		select_os = config & ARCH_PERFMON_EVENTSEL_OS;
		select_user = config & ARCH_PERFMON_EVENTSEL_USR;
	} else {
		config = fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
					  pmc->idx - KVM_FIXED_PMC_BASE_IDX);
		select_os = config & 0x1;
		select_user = config & 0x2;
	}

	/*
	 * Skip the CPL lookup, which isn't free on Intel, if the result will
	 * be the same regardless of the CPL.
	 */
	if (select_os == select_user)
		return select_os;

	return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
}

void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
{
	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	struct kvm_pmc *pmc;
	int i;

	BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);

	if (!kvm_pmu_has_perf_global_ctrl(pmu))
		bitmap_copy(bitmap, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
	else if (!bitmap_and(bitmap, pmu->all_valid_pmc_idx,
			     (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX))
		return;

	kvm_for_each_pmc(pmu, pmc, i, bitmap) {
		/* Ignore checks for edge detect, pin control, invert and CMASK bits */
		if (!pmc_is_eventsel_match(pmc, eventsel) ||
		    !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc))
			continue;

		kvm_pmu_incr_counter(pmc);
	}
}
Denis V. Lunev Nov. 9, 2023, 11:46 p.m. UTC | #4
On 11/9/23 23:52, Jim Mattson wrote:
> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
> <khorenko@virtuozzo.com> wrote:
>> Hi All,
>>
>> as a followup for my patch: i have noticed that
>> currently Intel kernel code provides an ability to detect if PMU is totally disabled for a VM
>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/> in the VM config which
>> results in "-cpu pmu=off" qemu option).
>>
>> So the question is - is it possible to enhance the code for AMD to also honor PMU VM setting or it is
>> impossible by design?
> The AMD architectural specification prior to AMD PMU v2 does not allow
> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
> general purpose PMU counters. While AMD PMU v2 does allow one to
> describe such a CPU, legacy software that knows nothing of AMD PMU v2
> can expect four counters regardless.
>
> Having said that, KVM does provide a per-VM capability for disabling
> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
> section 8.35 in Documentation/virt/kvm/api.rst.
But this means in particular that QEMU should immediately
use KVM_PMU_CAP_DISABLE if this capability is supported and
PMU=off. I am not seeing this code, thus I believe that we have missed
this. I think that this change is worth adding. We will measure the
impact :-)

Den
Dongli Zhang Nov. 10, 2023, 12:01 a.m. UTC | #5
On 11/9/23 3:46 PM, Denis V. Lunev wrote:
> On 11/9/23 23:52, Jim Mattson wrote:
>> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
>> <khorenko@virtuozzo.com> wrote:
>>> Hi All,
>>>
>>> as a followup for my patch: i have noticed that
>>> currently Intel kernel code provides an ability to detect if PMU is totally
>>> disabled for a VM
>>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/>
>>> in the VM config which
>>> results in "-cpu pmu=off" qemu option).
>>>
>>> So the question is - is it possible to enhance the code for AMD to also honor
>>> PMU VM setting or it is
>>> impossible by design?
>> The AMD architectural specification prior to AMD PMU v2 does not allow
>> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
>> general purpose PMU counters. While AMD PMU v2 does allow one to
>> describe such a CPU, legacy software that knows nothing of AMD PMU v2
>> can expect four counters regardless.
>>
>> Having said that, KVM does provide a per-VM capability for disabling
>> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
>> section 8.35 in Documentation/virt/kvm/api.rst.
> But this means in particular that QEMU should immediately
> use this KVM_PMU_CAP_DISABLE if this capability is supported and PMU=off. I am
> not seeing this code thus I believe that we have missed this. I think that this
> change worth adding. We will measure the impact :-) Den
> 

I used to have a patch to use KVM_PMU_CAP_DISABLE in QEMU, but that did not draw
many developers' attention.

https://lore.kernel.org/qemu-devel/20230621013821.6874-2-dongli.zhang@oracle.com/

It is time to re-send it again.

Dongli Zhang
Jim Mattson Nov. 10, 2023, 12:02 a.m. UTC | #6
On Thu, Nov 9, 2023 at 3:46 PM Denis V. Lunev <den@virtuozzo.com> wrote:
>
> On 11/9/23 23:52, Jim Mattson wrote:
> > On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
> > <khorenko@virtuozzo.com> wrote:
> >> Hi All,
> >>
> >> as a followup for my patch: i have noticed that
> >> currently Intel kernel code provides an ability to detect if PMU is totally disabled for a VM
> >> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
> >> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/> in the VM config which
> >> results in "-cpu pmu=off" qemu option).
> >>
> >> So the question is - is it possible to enhance the code for AMD to also honor PMU VM setting or it is
> >> impossible by design?
> > The AMD architectural specification prior to AMD PMU v2 does not allow
> > one to describe a CPU (via CPUID or MSRs) that has fewer than 4
> > general purpose PMU counters. While AMD PMU v2 does allow one to
> > describe such a CPU, legacy software that knows nothing of AMD PMU v2
> > can expect four counters regardless.
> >
> > Having said that, KVM does provide a per-VM capability for disabling
> > the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
> > section 8.35 in Documentation/virt/kvm/api.rst.
> But this means in particular that QEMU should immediately
> use this KVM_PMU_CAP_DISABLE if this capability is supported and
> PMU=off. I am not seeing this code thus I believe that we have missed
> this. I think that this change worth adding. We will measure the impact
> :-) Den

At present, KVM will still blindly cycle through each GP counter (4
minimum for AMD) until it checks vcpu->kvm->arch.enable_pmu at the top
of get_gp_pmc_amd().

Sean's proposal to clear the metadata should eliminate the overhead
for VMs with the virtual PMU disabled. My proposal should eliminate
the overhead even for VMs with the virtual PMU enabled, as long as no
counters are programmed for "instructions retired."
Jim Mattson Nov. 10, 2023, 12:17 a.m. UTC | #7
On Thu, Nov 9, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 09, 2023, Jim Mattson wrote:
> > On Thu, Nov 9, 2023 at 10:24 AM Konstantin Khorenko
> > <khorenko@virtuozzo.com> wrote:
> > >
> > > We have detected significant performance drop of our atomic test which
> > > checks the rate of CPUID instructions rate inside an L1 VM on an AMD
> > > node.
> > >
> > > Investigation led to 2 mainstream patches which have introduced extra
> > > events accounting:
> > >
> > >    018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
> > >    9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > >
> > > And on an AMD Zen 3 CPU that resulted in immediate 43% drop in the CPUID
> > > rate.
> > >
> > > Checking latest mainsteam kernel the performance difference is much less
> > > but still quite noticeable: 13.4% and shows up on AMD CPUs only.
> > >
> > > Looks like iteration over all PMCs in kvm_pmu_trigger_event() is cheap
> > > on Intel and expensive on AMD CPUs.
> > >
> > > So the idea behind this patch is to skip iterations over PMCs at all in
> > > case PMU is disabled for a VM completely or PMU is enabled for a VM, but
> > > there are no active PMCs at all.
> >
> > A better solution may be to maintain two bitmaps of general purpose
> > counters that need to be incremented, one for instructions retired and
> > one for branch instructions retired. Set or clear these bits whenever
> > the PerfEvtSelN MSRs are written. I think I would keep the PGC bits
> > separate, on those microarchitectures that support PGC. Then,
> > kvm_pmu_trigger_event() need only consult the appropriate bitmap (or
> > the logical and of that bitmap with PGC). In most cases, the value
> > will be zero, and the function can simply return.
> >
> > This would work even for AMD microarchitectures that don't support PGC.
>
> Yeah.  There are multiple lower-hanging fruits to be picked though, most of which
> won't conflict with using dedicated per-event bitmaps, or at worst are trivial
> to resolve.
>
>  1. Don't call into perf to get the eventsel (which generates an indirect call)
>     on every invocation, let alone every iteration.
>
>  2. Avoid getting the CPL when it's irrelevant.
>
>  3. Check the eventsel before querying the event filter.
>
>  4. Mask out PMCs that aren't globally enabled from the get-go (masking out
>     PMCs based on eventsel would essentially be the same as per-event bitmaps).

The code below only looks at PGC. Even on CPUs that support PGC, some
PMU clients still use the enable bits in the PerfEvtSelN. Linux perf,
for instance, can't seem to make up its mind whether to use PGC or
PerfEvtSelN.EN.

> I'm definitely not opposed to per-event bitmaps, but it'd be nice to avoid them,
> e.g. if we can eke out 99% of the performance just by doing a few obvious
> optimizations.
>
> This is the end result of what I'm testing and will (hopefully) post shortly:
>
> static inline bool pmc_is_eventsel_match(struct kvm_pmc *pmc, u64 eventsel)
> {
>         return !((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB);
> }

The top nybble of AMD's 3-nybble event select collides with Intel's
IN_TX and IN_TXCP bits. I think we can assert that the vCPU can't be
in a transaction if KVM is emulating an instruction, but this probably
merits a comment. The function name should also be more descriptive,
so that it doesn't get misused out of context. :)

> static inline bool cpl_is_matched(struct kvm_pmc *pmc)
> {
>         bool select_os, select_user;
>         u64 config;
>
>         if (pmc_is_gp(pmc)) {
>                 config = pmc->eventsel;
>                 select_os = config & ARCH_PERFMON_EVENTSEL_OS;
>                 select_user = config & ARCH_PERFMON_EVENTSEL_USR;
>         } else {
>                 config = fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
>                                           pmc->idx - KVM_FIXED_PMC_BASE_IDX);
>                 select_os = config & 0x1;
>                 select_user = config & 0x2;
>         }
>
>         /*
>          * Skip the CPL lookup, which isn't free on Intel, if the result will
>          * be the same regardless of the CPL.
>          */
>         if (select_os == select_user)
>                 return select_os;
>
>         return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
> }
>
> void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
> {
>         DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
>         struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>         struct kvm_pmc *pmc;
>         int i;
>
>         BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);
>
>         if (!kvm_pmu_has_perf_global_ctrl(pmu))
>                 bitmap_copy(bitmap, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>         else if (!bitmap_and(bitmap, pmu->all_valid_pmc_idx,
>                              (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX))
>                 return;
>
>         kvm_for_each_pmc(pmu, pmc, i, bitmap) {
>                 /* Ignore checks for edge detect, pin control, invert and CMASK bits */
>                 if (!pmc_is_eventsel_match(pmc, eventsel) ||
>                     !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc))
>                         continue;
>
>                 kvm_pmu_incr_counter(pmc);
>         }
> }
Sean Christopherson Nov. 10, 2023, 12:52 a.m. UTC | #8
On Thu, Nov 09, 2023, Jim Mattson wrote:
> On Thu, Nov 9, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Nov 09, 2023, Jim Mattson wrote:
> > > A better solution may be to maintain two bitmaps of general purpose
> > > counters that need to be incremented, one for instructions retired and
> > > one for branch instructions retired. Set or clear these bits whenever
> > > the PerfEvtSelN MSRs are written. I think I would keep the PGC bits
> > > separate, on those microarchitectures that support PGC. Then,
> > > kvm_pmu_trigger_event() need only consult the appropriate bitmap (or
> > > the logical and of that bitmap with PGC). In most cases, the value
> > > will be zero, and the function can simply return.
> > >
> > > This would work even for AMD microarchitectures that don't support PGC.
> >
> > Yeah.  There are multiple lower-hanging fruits to be picked though, most of which
> > won't conflict with using dedicated per-event bitmaps, or at worst are trivial
> > to resolve.
> >
> >  1. Don't call into perf to get the eventsel (which generates an indirect call)
> >     on every invocation, let alone every iteration.
> >
> >  2. Avoid getting the CPL when it's irrelevant.
> >
> >  3. Check the eventsel before querying the event filter.
> >
> >  4. Mask out PMCs that aren't globally enabled from the get-go (masking out
> >     PMCs based on eventsel would essentially be the same as per-event bitmaps).
> 
> The code below only looks at PGC. Even on CPUs that support PGC, some
> PMU clients still use the enable bits in the PerfEvtSelN. Linux perf,
> for instance, can't seem to make up its mind whether to use PGC or
> PerfEvtSelN.EN.

Ugh, as in perf leaves all GLOBAL_CTRL bits set and toggles only the eventsel
enable bit?  Lame.

> > I'm definitely not opposed to per-event bitmaps, but it'd be nice to avoid them,
> > e.g. if we can eke out 99% of the performance just by doing a few obvious
> > optimizations.
> >
> > This is the end result of what I'm testing and will (hopefully) post shortly:
> >
> > static inline bool pmc_is_eventsel_match(struct kvm_pmc *pmc, u64 eventsel)
> > {
> >         return !((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB);
> > }
> 
> The top nybble of AMD's 3-nybble event select collides with Intel's
> IN_TX and IN_TXCP bits. I think we can assert that the vCPU can't be
> in a transaction if KVM is emulating an instruction, but this probably
> merits a comment.

Argh, more pre-existing crud.  This is silly, the vendor mask is already in
kvm_pmu_ops.EVENTSEL_EVENT.  What's one more patch...

> The function name should also be more descriptive, so that it doesn't get
> misused out of context. :)

Heh, I think I'll just delete the darn thing.  That also makes it easier to
understand the comment about ignoring certain fields.
Sean Christopherson Nov. 10, 2023, 1:14 a.m. UTC | #9
On Fri, Nov 10, 2023, Sean Christopherson wrote:
> On Thu, Nov 09, 2023, Jim Mattson wrote:
> > On Thu, Nov 9, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote:
> > > static inline bool pmc_is_eventsel_match(struct kvm_pmc *pmc, u64 eventsel)
> > > {
> > >         return !((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB);
> > > }
> > 
> > The top nybble of AMD's 3-nybble event select collides with Intel's
> > IN_TX and IN_TXCP bits. I think we can assert that the vCPU can't be
> > in a transaction if KVM is emulating an instruction, but this probably
> > merits a comment.
> 
> Argh, more pre-existing crud.  This is silly, the vendor mask is already in
> kvm_pmu_ops.EVENTSEL_EVENT.  What's one more patch...

Ah, I see what you're saying.  Checking the bits is actually correct, probably through
sheer dumb luck.  I'll expand the comment to cover that and the reserved bits.
Denis V. Lunev Nov. 13, 2023, 9:31 a.m. UTC | #10
On 11/10/23 01:01, Dongli Zhang wrote:
>
> On 11/9/23 3:46 PM, Denis V. Lunev wrote:
>> On 11/9/23 23:52, Jim Mattson wrote:
>>> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
>>> <khorenko@virtuozzo.com> wrote:
>>>> Hi All,
>>>>
>>>> as a followup for my patch: i have noticed that
>>>> currently Intel kernel code provides an ability to detect if PMU is totally
>>>> disabled for a VM
>>>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>>>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/>
>>>> in the VM config which
>>>> results in "-cpu pmu=off" qemu option).
>>>>
>>>> So the question is - is it possible to enhance the code for AMD to also honor
>>>> PMU VM setting or it is
>>>> impossible by design?
>>> The AMD architectural specification prior to AMD PMU v2 does not allow
>>> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
>>> general purpose PMU counters. While AMD PMU v2 does allow one to
>>> describe such a CPU, legacy software that knows nothing of AMD PMU v2
>>> can expect four counters regardless.
>>>
>>> Having said that, KVM does provide a per-VM capability for disabling
>>> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
>>> section 8.35 in Documentation/virt/kvm/api.rst.
>> But this means in particular that QEMU should immediately
>> use this KVM_PMU_CAP_DISABLE if this capability is supported and PMU=off. I am
>> not seeing this code thus I believe that we have missed this. I think that this
>> change worth adding. We will measure the impact :-) Den
>>
> I used to have a patch to use KVM_PMU_CAP_DISABLE in QEMU, but that did not draw
> many developers' attention.
>
> https://lore.kernel.org/qemu-devel/20230621013821.6874-2-dongli.zhang@oracle.com/
>
> It is time to first re-send that again.
>
> Dongli Zhang
We have checked that setting KVM_PMU_CAP_DISABLE really helps.
Konstantin has done this and this is good. On the other hand, looking
into these patches I disagree with them. We should not introduce a new
option for QEMU. If the PMU is disabled, i.e. we assume that pmu=off is
passed on the command line, we should set KVM_PMU_CAP_DISABLE for that
virtual machine.

Den
Dongli Zhang Nov. 13, 2023, 2:14 p.m. UTC | #11
Hi Denis,

On 11/13/23 01:31, Denis V. Lunev wrote:
> On 11/10/23 01:01, Dongli Zhang wrote:
>>
>> On 11/9/23 3:46 PM, Denis V. Lunev wrote:
>>> On 11/9/23 23:52, Jim Mattson wrote:
>>>> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
>>>> <khorenko@virtuozzo.com> wrote:
>>>>> Hi All,
>>>>>
>>>>> as a followup for my patch: i have noticed that
>>>>> currently Intel kernel code provides an ability to detect if PMU is totally
>>>>> disabled for a VM
>>>>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>>>>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/>
>>>>> in the VM config which
>>>>> results in "-cpu pmu=off" qemu option).
>>>>>
>>>>> So the question is - is it possible to enhance the code for AMD to also honor
>>>>> PMU VM setting or it is
>>>>> impossible by design?
>>>> The AMD architectural specification prior to AMD PMU v2 does not allow
>>>> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
>>>> general purpose PMU counters. While AMD PMU v2 does allow one to
>>>> describe such a CPU, legacy software that knows nothing of AMD PMU v2
>>>> can expect four counters regardless.
>>>>
>>>> Having said that, KVM does provide a per-VM capability for disabling
>>>> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
>>>> section 8.35 in Documentation/virt/kvm/api.rst.
>>> But this means in particular that QEMU should immediately
>>> use this KVM_PMU_CAP_DISABLE if this capability is supported and PMU=off. I am
>>> not seeing this code thus I believe that we have missed this. I think that this
>>> change worth adding. We will measure the impact :-) Den
>>>
>> I used to have a patch to use KVM_PMU_CAP_DISABLE in QEMU, but that did not draw
>> many developers' attention.
>>
>> https://lore.kernel.org/qemu-devel/20230621013821.6874-2-dongli.zhang@oracle.com/
>> It is time to first re-send that again.
>>
>> Dongli Zhang
> We have checked that setting KVM_PMU_CAP_DISABLE really helps. Konstantin has
> done this and this is good. On the other hand, looking into these patches I
> disagree with them. We should not introduce new option for QEMU. If PMU is
> disabled, i.e. we assume that pmu=off passed in the command line, we should set
> KVM_PMU_CAP_DISABLE for that virtual machine. Den

Can I assume you meant pmu=off, that is, cpu->enable_pmu in QEMU?

In my opinion, cpu->enable_pmu is the option to control the CPU feature.
It may be used by any accelerator, and it is orthogonal to the KVM cap.


The KVM_PMU_CAP_DISABLE is only specific to the KVM accelerator.


That's why I had introduced a new option, to allow configuring the VM in two
dimensions.

It means one dimension for AMD, but two for Intel: disabling the PMU via CPUID, or
via the KVM cap.

Anyway, this is the KVM mailing list, and I may initiate the discussion on the QEMU list.

Thank you very much!

Dongli Zhang
Denis V. Lunev Nov. 13, 2023, 2:42 p.m. UTC | #12
On 11/13/23 15:14, Dongli Zhang wrote:
> Hi Denis,
>
> On 11/13/23 01:31, Denis V. Lunev wrote:
>> On 11/10/23 01:01, Dongli Zhang wrote:
>>> On 11/9/23 3:46 PM, Denis V. Lunev wrote:
>>>> On 11/9/23 23:52, Jim Mattson wrote:
>>>>> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
>>>>> <khorenko@virtuozzo.com> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> as a followup for my patch: i have noticed that
>>>>>> currently Intel kernel code provides an ability to detect if PMU is totally
>>>>>> disabled for a VM
>>>>>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>>>>>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/>
>>>>>> in the VM config which
>>>>>> results in "-cpu pmu=off" qemu option).
>>>>>>
>>>>>> So the question is - is it possible to enhance the code for AMD to also honor
>>>>>> PMU VM setting or it is
>>>>>> impossible by design?
>>>>> The AMD architectural specification prior to AMD PMU v2 does not allow
>>>>> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
>>>>> general purpose PMU counters. While AMD PMU v2 does allow one to
>>>>> describe such a CPU, legacy software that knows nothing of AMD PMU v2
>>>>> can expect four counters regardless.
>>>>>
>>>>> Having said that, KVM does provide a per-VM capability for disabling
>>>>> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
>>>>> section 8.35 in Documentation/virt/kvm/api.rst.
>>>> But this means in particular that QEMU should immediately
>>>> use this KVM_PMU_CAP_DISABLE if this capability is supported and PMU=off. I am
>>>> not seeing this code thus I believe that we have missed this. I think that this
>>>> change worth adding. We will measure the impact :-) Den
>>>>
>>> I used to have a patch to use KVM_PMU_CAP_DISABLE in QEMU, but that did not draw
>>> many developers' attention.
>>>
>>> https://lore.kernel.org/qemu-devel/20230621013821.6874-2-dongli.zhang@oracle.com/
>>> It is time to first re-send that again.
>>>
>>> Dongli Zhang
>> We have checked that setting KVM_PMU_CAP_DISABLE really helps. Konstantin has
>> done this and this is good. On the other hand, looking into these patches I
>> disagree with them. We should not introduce new option for QEMU. If PMU is
>> disabled, i.e. we assume that pmu=off passed in the command line, we should set
>> KVM_PMU_CAP_DISABLE for that virtual machine. Den
> Can I assume you meant pmu=off, that is, cpu->enable_pmu in QEMU?
>
> In my opinion, cpu->enable_pmu indicates the option to control the cpu features.
> It may be used by any accelerators, and it is orthogonal to the KVM cap.
>
>
> The KVM_PMU_CAP_DISABLE is only specific to the KVM accelerator.
>
>
> That's why I had introduced a new option, to allow to configure the VM in my
> dimensions.
>
> It means one dimension to AMD, but two for Intel: to disable PMU via cpuid, or
> KVM cap.
>
> Anyway, this is KVM mailing list, and I may initiate the discussion in QEMU list.
>
> Thank you very much!
>
> Dongli Zhang
With the option pmu='off' it is expected that the PMU should be
off for the guest. At the moment (without this KVM capability)
we can disable the PMU for Intel only and thus have a performance
degradation on AMD.

This option disables the PMU, so normally, when we are running a
KVM guest and want the PMU to be off, it would be required to
* disable the CPUID leaf for Intel
* set KVM_PMU_CAP_DISABLE for both processors

This would be quite natural and transparent for libvirt. Alexander
will prepare the patch today or tomorrow for the discussion.

Den
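
A hedged sketch of the QEMU-side wiring being discussed (helper name and
call site are assumptions, not the actual patch):

/*
 * When the guest is configured with "-cpu ...,pmu=off", also ask KVM to
 * drop vPMU emulation entirely.  KVM_CAP_PMU_CAPABILITY has to be enabled
 * on the VM before the first vCPU is created.
 */
static void kvm_maybe_disable_vpmu(KVMState *s, bool pmu_enabled)
{
    if (pmu_enabled) {
        return;
    }

    if (kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY) & KVM_PMU_CAP_DISABLE) {
        kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0, KVM_PMU_CAP_DISABLE);
    }
}
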
Dongli Zhang Nov. 13, 2023, 4:17 p.m. UTC | #13
On 11/13/23 06:42, Denis V. Lunev wrote:
> On 11/13/23 15:14, Dongli Zhang wrote:
>> Hi Denis,
>>
>> On 11/13/23 01:31, Denis V. Lunev wrote:
>>> On 11/10/23 01:01, Dongli Zhang wrote:
>>>> On 11/9/23 3:46 PM, Denis V. Lunev wrote:
>>>>> On 11/9/23 23:52, Jim Mattson wrote:
>>>>>> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
>>>>>> <khorenko@virtuozzo.com> wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> as a followup for my patch: i have noticed that
>>>>>>> currently Intel kernel code provides an ability to detect if PMU is totally
>>>>>>> disabled for a VM
>>>>>>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>>>>>>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/>
>>>>>>> in the VM config which
>>>>>>> results in "-cpu pmu=off" qemu option).
>>>>>>>
>>>>>>> So the question is - is it possible to enhance the code for AMD to also
>>>>>>> honor
>>>>>>> PMU VM setting or it is
>>>>>>> impossible by design?
>>>>>> The AMD architectural specification prior to AMD PMU v2 does not allow
>>>>>> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
>>>>>> general purpose PMU counters. While AMD PMU v2 does allow one to
>>>>>> describe such a CPU, legacy software that knows nothing of AMD PMU v2
>>>>>> can expect four counters regardless.
>>>>>>
>>>>>> Having said that, KVM does provide a per-VM capability for disabling
>>>>>> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
>>>>>> section 8.35 in Documentation/virt/kvm/api.rst.
>>>>> But this means in particular that QEMU should immediately
>>>>> use this KVM_PMU_CAP_DISABLE if this capability is supported and PMU=off. I am
>>>>> not seeing this code thus I believe that we have missed this. I think that
>>>>> this
>>>>> change worth adding. We will measure the impact :-) Den
>>>>>
>>>> I used to have a patch to use KVM_PMU_CAP_DISABLE in QEMU, but that did not
>>>> draw
>>>> many developers' attention.
>>>>
>>>> https://lore.kernel.org/qemu-devel/20230621013821.6874-2-dongli.zhang@oracle.com/
>>>> It is time to first re-send that again.
>>>>
>>>> Dongli Zhang
>>> We have checked that setting KVM_PMU_CAP_DISABLE really helps. Konstantin has
>>> done this and this is good. On the other hand, looking into these patches I
>>> disagree with them. We should not introduce new option for QEMU. If PMU is
>>> disabled, i.e. we assume that pmu=off passed in the command line, we should set
>>> KVM_PMU_CAP_DISABLE for that virtual machine. Den
>> Can I assume you meant pmu=off, that is, cpu->enable_pmu in QEMU?
>>
>> In my opinion, cpu->enable_pmu indicates the option to control the cpu features.
>> It may be used by any accelerators, and it is orthogonal to the KVM cap.
>>
>>
>> The KVM_PMU_CAP_DISABLE is only specific to the KVM accelerator.
>>
>>
>> That's why I had introduced a new option, to allow to configure the VM in my
>> dimensions.
>>
>> It means one dimension to AMD, but two for Intel: to disable PMU via cpuid, or
>> KVM cap.
>>
>> Anyway, this is KVM mailing list, and I may initiate the discussion in QEMU list.
>>
>> Thank you very much!
>>
>> Dongli Zhang
> with the option pmu='off' it is expected that PMU should be
> off for the guest. At the moment (without this KVM capability)
> we can disable PMU for Intel only and thus have performance
> degradation on AMD.
> 
> This option disables PMU and thus normally when we are
> running KVM guest and wanting PMU to be off it would
> be required to
> * disable CPUID leaf for Intel
> * set KVM_PMU_CAP_DISABLE for both processors This would be quite natural and
> transparent for the libvirt. Alexander will prepare the patch today or tomorrow
> for the discussion. Den

That is what I had implemented in v1 of the patch.

https://lore.kernel.org/all/20221119122901.2469-3-dongli.zhang@oracle.com/

However, I changed that after people suggested introducing a new property.

Dongli Zhang
Denis V. Lunev Nov. 13, 2023, 4:33 p.m. UTC | #14
On 11/13/23 17:17, Dongli Zhang wrote:
>
> On 11/13/23 06:42, Denis V. Lunev wrote:
>> On 11/13/23 15:14, Dongli Zhang wrote:
>>> Hi Denis,
>>>
>>> On 11/13/23 01:31, Denis V. Lunev wrote:
>>>> On 11/10/23 01:01, Dongli Zhang wrote:
>>>>> On 11/9/23 3:46 PM, Denis V. Lunev wrote:
>>>>>> On 11/9/23 23:52, Jim Mattson wrote:
>>>>>>> On Thu, Nov 9, 2023 at 10:18 AM Konstantin Khorenko
>>>>>>> <khorenko@virtuozzo.com> wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> as a followup for my patch: i have noticed that
>>>>>>>> currently Intel kernel code provides an ability to detect if PMU is totally
>>>>>>>> disabled for a VM
>>>>>>>> (pmu->version == 0 in this case), but for AMD code pmu->version is never 0,
>>>>>>>> no matter if PMU is enabled or disabled for a VM (i mean <pmu state='off'/>
>>>>>>>> in the VM config which
>>>>>>>> results in "-cpu pmu=off" qemu option).
>>>>>>>>
>>>>>>>> So the question is - is it possible to enhance the code for AMD to also
>>>>>>>> honor
>>>>>>>> PMU VM setting or it is
>>>>>>>> impossible by design?
>>>>>>> The AMD architectural specification prior to AMD PMU v2 does not allow
>>>>>>> one to describe a CPU (via CPUID or MSRs) that has fewer than 4
>>>>>>> general purpose PMU counters. While AMD PMU v2 does allow one to
>>>>>>> describe such a CPU, legacy software that knows nothing of AMD PMU v2
>>>>>>> can expect four counters regardless.
>>>>>>>
>>>>>>> Having said that, KVM does provide a per-VM capability for disabling
>>>>>>> the virtual PMU: KVM_CAP_PMU_CAPABILITY(KVM_PMU_CAP_DISABLE). See
>>>>>>> section 8.35 in Documentation/virt/kvm/api.rst.
>>>>>> But this means in particular that QEMU should immediately
>>>>>> use this KVM_PMU_CAP_DISABLE if this capability is supported and PMU=off. I am
>>>>>> not seeing this code thus I believe that we have missed this. I think that
>>>>>> this
>>>>>> change worth adding. We will measure the impact :-) Den
>>>>>>
>>>>> I used to have a patch to use KVM_PMU_CAP_DISABLE in QEMU, but that did not
>>>>> draw
>>>>> many developers' attention.
>>>>>
>>>>> https://lore.kernel.org/qemu-devel/20230621013821.6874-2-dongli.zhang@oracle.com/
>>>>> It is time to first re-send that again.
>>>>>
>>>>> Dongli Zhang
>>>> We have checked that setting KVM_PMU_CAP_DISABLE really helps. Konstantin has
>>>> done this and this is good. On the other hand, looking into these patches I
>>>> disagree with them. We should not introduce new option for QEMU. If PMU is
>>>> disabled, i.e. we assume that pmu=off passed in the command line, we should set
>>>> KVM_PMU_CAP_DISABLE for that virtual machine. Den
>>> Can I assume you meant pmu=off, that is, cpu->enable_pmu in QEMU?
>>>
>>> In my opinion, cpu->enable_pmu indicates the option to control the cpu features.
>>> It may be used by any accelerators, and it is orthogonal to the KVM cap.
>>>
>>>
>>> The KVM_PMU_CAP_DISABLE is only specific to the KVM accelerator.
>>>
>>>
>>> That's why I had introduced a new option, to allow to configure the VM in my
>>> dimensions.
>>>
>>> It means one dimension to AMD, but two for Intel: to disable PMU via cpuid, or
>>> KVM cap.
>>>
>>> Anyway, this is KVM mailing list, and I may initiate the discussion in QEMU list.
>>>
>>> Thank you very much!
>>>
>>> Dongli Zhang
>> with the option pmu='off' it is expected that PMU should be
>> off for the guest. At the moment (without this KVM capability)
>> we can disable PMU for Intel only and thus have performance
>> degradation on AMD.
>>
>> This option disables PMU and thus normally when we are
>> running KVM guest and wanting PMU to be off it would
>> be required to
>> * disable CPUID leaf for Intel
>> * set KVM_PMU_CAP_DISABLE for both processors This would be quite natural and
>> transparent for the libvirt. Alexander will prepare the patch today or tomorrow
>> for the discussion. Den
> That is what I had implemented in the v1 of patch.
>
> https://lore.kernel.org/all/20221119122901.2469-3-dongli.zhang@oracle.com/
>
> However, I changed that after people suggested introduce a new property.
>
> Dongli Zhang
That would save a bit of our work :)

For me this patch looks absolutely awesome and is doing exactly
what I want to do in our downstream. This would get us the required
15+% benefit for each VMexit.

Den