diff mbox series

[v2] KVM: arm64: Allow to limit number of PMU counters

Message ID 20200908205730.23898-1-graf@amazon.com (mailing list archive)
State New, archived
Headers show
Series [v2] KVM: arm64: Allow to limit number of PMU counters | expand

Commit Message

Alexander Graf Sept. 8, 2020, 8:57 p.m. UTC
We currently pass through the number of PMU counters that we have available
in hardware to guests. So if my host supports 10 concurrently active PMU
counters, my guest will be able to spawn 10 counters as well.

This is undesirable if we also want to use the PMU on the host for
monitoring. In that case, we want to split the PMU between guest and
host.

To help that case, let's add a PMU attr that allows us to limit the number
of PMU counters that we expose. With this patch in place, user space can
keep some counters free for host use.

Signed-off-by: Alexander Graf <graf@amazon.com>

---

Because this patch touches the same code paths as the vPMU filtering one
and the vPMU filtering generalized a few conditions in the attr path,
I've based it on top. Please let me know if you want it independent instead.

v1 -> v2:

  - Add documentation
  - Add read support
---
 Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
 arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
 arch/arm64/kvm/pmu-emul.c               | 32 ++++++++++++++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c               |  5 +++++
 include/kvm/arm_pmu.h                   |  1 +
 5 files changed, 67 insertions(+), 3 deletions(-)

Comments

Andrew Jones Sept. 9, 2020, 6:25 a.m. UTC | #1
On Tue, Sep 08, 2020 at 10:57:30PM +0200, Alexander Graf wrote:
> We currently pass through the number of PMU counters that we have available
> in hardware to guests. So if my host supports 10 concurrently active PMU
> counters, my guest will be able to spawn 10 counters as well.
> 
> This is undesireable if we also want to use the PMU on the host for
> monitoring. In that case, we want to split the PMU between guest and
> host.
> 
> To help that case, let's add a PMU attr that allows us to limit the number
> of PMU counters that we expose. With this patch in place, user space can
> keep some counters free for host use.

Hi Alex,

Is there any reason to use the device API instead of just giving the user
control over the necessary PMCR_EL0 bits through set/get-one-reg?

Thanks,
drew

> 
> Signed-off-by: Alexander Graf <graf@amazon.com>
> 
> ---
> 
> Because this patch touches the same code paths as the vPMU filtering one
> and the vPMU filtering generalized a few conditions in the attr path,
> I've based it on top. Please let me know if you want it independent instead.
> 
> v1 -> v2:
> 
>   - Add documentation
>   - Add read support
> ---
>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>  arch/arm64/kvm/pmu-emul.c               | 32 ++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>  include/kvm/arm_pmu.h                   |  1 +
>  5 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 203b91e93151..1a1c8d8c8b1d 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the cycle counter is possible
>  using event 0x11 (CPU_CYCLES).
>  
>  
> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
> +---------------------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the limit of concurrent
> +             events is a pointer to an int
> +
> +:Returns:
> +
> +	 =======  ======================================================
> +	 -ENODEV: PMUv3 not supported
> +	 -EBUSY:  PMUv3 already initialized
> +	 -EINVAL: Too large number of events
> +	 =======  ======================================================
> +
> +Reconfigure the limit of concurrent PMU events that the guest can monitor.
> +This number is directly exposed as part of the PMCR_EL0 register.
> +
> +On vcpu creation, this attribute is set to the hardware limit of the current
> +platform. If you need to determine the hardware limit, you can read this
> +attribute before setting it.
> +
> +Restrictions: The default value for this property is the number of hardware
> +supported events. Only values that are smaller than the hardware limit can
> +be set.
> +
>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>  =================================
>  
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 7b1511d6ce44..db025c0b5a40 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>  
>  /* Device Control API on vcpu fd */
>  #define KVM_ARM_VCPU_PMU_V3_CTRL	0
> -#define   KVM_ARM_VCPU_PMU_V3_IRQ	0
> -#define   KVM_ARM_VCPU_PMU_V3_INIT	1
> -#define   KVM_ARM_VCPU_PMU_V3_FILTER	2
> +#define   KVM_ARM_VCPU_PMU_V3_IRQ		0
> +#define   KVM_ARM_VCPU_PMU_V3_INIT		1
> +#define   KVM_ARM_VCPU_PMU_V3_FILTER		2
> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS	3
>  #define KVM_ARM_VCPU_TIMER_CTRL		1
>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 0458860bade2..c7915b95fec0 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>  
>  	for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>  		pmu->pmc[i].idx = i;
> +
> +	pmu->num_events = perf_num_counters() - 1;
>  }
>  
>  /**
> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  
>  		return 0;
>  	}
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
> +		u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
> +		int __user *uaddr = (int __user *)(long)attr->addr;
> +		u32 num_events;
> +
> +		if (get_user(num_events, uaddr))
> +			return -EFAULT;
> +
> +		if (num_events >= perf_num_counters())
> +			return -EINVAL;
> +
> +		vcpu->arch.pmu.num_events = num_events;
> +
> +		num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
> +
> +		return 0;
> +	}
>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>  		return kvm_arm_pmu_v3_init(vcpu);
>  	}
> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  		irq = vcpu->arch.pmu.irq_num;
>  		return put_user(irq, uaddr);
>  	}
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
> +		int __user *uaddr = (int __user *)(long)attr->addr;
> +		u32 num_events;
> +
> +		if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
> +			return -ENODEV;
> +
> +		num_events = vcpu->arch.pmu.num_events;
> +		return put_user(num_events, uaddr);
> +	}
>  	}
>  
>  	return -ENXIO;
> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  	case KVM_ARM_VCPU_PMU_V3_IRQ:
>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>  	case KVM_ARM_VCPU_PMU_V3_FILTER:
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>  		if (kvm_arm_support_pmu_v3() &&
>  		    test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>  			return 0;
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 20ab2a7d37ca..d51e39600bbd 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>  	       | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
>  	if (!system_supports_32bit_el0())
>  		val |= ARMV8_PMU_PMCR_LC;
> +
> +	/* Override number of event selectors */
> +	val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
> +	val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
> +
>  	__vcpu_sys_reg(vcpu, r->reg) = val;
>  }
>  
> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
> index 98cbfe885a53..ea3fc96a37d9 100644
> --- a/include/kvm/arm_pmu.h
> +++ b/include/kvm/arm_pmu.h
> @@ -27,6 +27,7 @@ struct kvm_pmu {
>  	bool ready;
>  	bool created;
>  	bool irq_level;
> +	u8 num_events;
>  };
>  
>  #define kvm_arm_pmu_v3_ready(v)		((v)->arch.pmu.ready)
> -- 
> 2.16.4
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 
>
Alexander Graf Sept. 9, 2020, 8:43 a.m. UTC | #2
Hey Drew!

On 09.09.20 08:25, Andrew Jones wrote:
> 
> On Tue, Sep 08, 2020 at 10:57:30PM +0200, Alexander Graf wrote:
>> We currently pass through the number of PMU counters that we have available
>> in hardware to guests. So if my host supports 10 concurrently active PMU
>> counters, my guest will be able to spawn 10 counters as well.
>>
>> This is undesireable if we also want to use the PMU on the host for
>> monitoring. In that case, we want to split the PMU between guest and
>> host.
>>
>> To help that case, let's add a PMU attr that allows us to limit the number
>> of PMU counters that we expose. With this patch in place, user space can
>> keep some counters free for host use.
> 
> Hi Alex,
> 
> Is there any reason to use the device API instead of just giving the user
> control over the necessary PMCR_EL0 bits through set/get-one-reg?

I mostly used the attr interface because I was in that particular mental 
mode after looking at the filtering bits :).

Today, the PMCR_EL0 register gets reset implicitly on every vcpu reset 
call. How would we persist the counter field across resets? Would we in 
the first place?

I'm slightly hazy how the ONE_REG API would look like here. Do you have 
recommendations?


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Andrew Jones Sept. 9, 2020, 9:56 a.m. UTC | #3
On Wed, Sep 09, 2020 at 10:43:41AM +0200, Alexander Graf wrote:
> Hey Drew!
> 
> On 09.09.20 08:25, Andrew Jones wrote:
> > 
> > On Tue, Sep 08, 2020 at 10:57:30PM +0200, Alexander Graf wrote:
> > > We currently pass through the number of PMU counters that we have available
> > > in hardware to guests. So if my host supports 10 concurrently active PMU
> > > counters, my guest will be able to spawn 10 counters as well.
> > > 
> > > This is undesireable if we also want to use the PMU on the host for
> > > monitoring. In that case, we want to split the PMU between guest and
> > > host.
> > > 
> > > To help that case, let's add a PMU attr that allows us to limit the number
> > > of PMU counters that we expose. With this patch in place, user space can
> > > keep some counters free for host use.
> > 
> > Hi Alex,
> > 
> > Is there any reason to use the device API instead of just giving the user
> > control over the necessary PMCR_EL0 bits through set/get-one-reg?
> 
> I mostly used the attr interface because I was in that particular mental
> mode after looking at the filtering bits :).
> 
> Today, the PMCR_EL0 register gets reset implicitly on every vcpu reset call.
> How would we persist the counter field across resets? Would we in the first
> place?
> 
> I'm slightly hazy how the ONE_REG API would look like here. Do you have
> recommendations?
>

Using the set/get_user hooks of the sysreg table we can accept a user
input PMCR_EL0. We would only accept one that matches what the hardware
and KVM supports though (EINVAL otherwise). We'll need to modify reset to
use the value selected by the user too, which we can store in 'val' of the
sysreg table.

Since userspace will likely get before set in order to know what's valid,
we'll need to provide the current reset state on get until it has been
set. I'm not sure how to track whether it has been set or not. Maybe new
state is needed or an initial val=0 or val=~0 may work.

Thanks,
drew
Eric Auger Sept. 10, 2020, 7:21 a.m. UTC | #4
Hi Alex,

On 9/8/20 10:57 PM, Alexander Graf wrote:
> We currently pass through the number of PMU counters that we have available
> in hardware to guests. So if my host supports 10 concurrently active PMU
> counters, my guest will be able to spawn 10 counters as well.
> 
> This is undesireable if we also want to use the PMU on the host for
> monitoring. In that case, we want to split the PMU between guest and
> host.
but don't we have a trap and emulate approach as opposed to current SPE
implementation?
> 
> To help that case, let's add a PMU attr that allows us to limit the number
> of PMU counters that we expose. With this patch in place, user space can
> keep some counters free for host use.
> 
> Signed-off-by: Alexander Graf <graf@amazon.com>
> 
> ---
> 
> Because this patch touches the same code paths as the vPMU filtering one
> and the vPMU filtering generalized a few conditions in the attr path,
> I've based it on top. Please let me know if you want it independent instead.
> 
> v1 -> v2:
> 
>   - Add documentation
>   - Add read support
> ---
>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>  arch/arm64/kvm/pmu-emul.c               | 32 ++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>  include/kvm/arm_pmu.h                   |  1 +
>  5 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 203b91e93151..1a1c8d8c8b1d 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the cycle counter is possible
>  using event 0x11 (CPU_CYCLES).
>  
>  
> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
> +---------------------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the limit of concurrent
> +             events is a pointer to an int
> +
> +:Returns:
> +
> +	 =======  ======================================================
> +	 -ENODEV: PMUv3 not supported
> +	 -EBUSY:  PMUv3 already initialized
> +	 -EINVAL: Too large number of events
s/events/event counters

I see that in perf code indeed num_events is used for that but I think
for the end-user the event counter terminology is better as it fits the
ARM spec.
> +	 =======  ======================================================
> +
> +Reconfigure the limit of concurrent PMU events that the guest can monitor.
here also
> +This number is directly exposed as part of the PMCR_EL0 register.
Maybe quote the "N" field
> +
> +On vcpu creation, this attribute is set to the hardware limit of the current
> +platform. If you need to determine the hardware limit, you can read this
> +attribute before setting it.
> +
> +Restrictions: The default value for this property is the number of hardware
> +supported events. Only values that are smaller than the hardware limit can
event counters
> +be set.
> +
>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>  =================================
>  
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 7b1511d6ce44..db025c0b5a40 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>  
>  /* Device Control API on vcpu fd */
>  #define KVM_ARM_VCPU_PMU_V3_CTRL	0
> -#define   KVM_ARM_VCPU_PMU_V3_IRQ	0
> -#define   KVM_ARM_VCPU_PMU_V3_INIT	1
> -#define   KVM_ARM_VCPU_PMU_V3_FILTER	2
> +#define   KVM_ARM_VCPU_PMU_V3_IRQ		0
> +#define   KVM_ARM_VCPU_PMU_V3_INIT		1
> +#define   KVM_ARM_VCPU_PMU_V3_FILTER		2
> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS	3
>  #define KVM_ARM_VCPU_TIMER_CTRL		1
>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 0458860bade2..c7915b95fec0 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>  
>  	for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>  		pmu->pmc[i].idx = i;
> +
> +	pmu->num_events = perf_num_counters() - 1;
>  }
>  
>  /**
> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  
>  		return 0;
>  	}
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
> +		u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
> +		int __user *uaddr = (int __user *)(long)attr->addr;
> +		u32 num_events;
> +
> +		if (get_user(num_events, uaddr))
> +			return -EFAULT;
> +
> +		if (num_events >= perf_num_counters())
> +			return -EINVAL;
> +
> +		vcpu->arch.pmu.num_events = num_events;
> +
> +		num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
> +
> +		return 0;
> +	}
>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>  		return kvm_arm_pmu_v3_init(vcpu);
>  	}
> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  		irq = vcpu->arch.pmu.irq_num;
>  		return put_user(irq, uaddr);
>  	}
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
> +		int __user *uaddr = (int __user *)(long)attr->addr;
> +		u32 num_events;
> +
> +		if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
> +			return -ENODEV;
> +
> +		num_events = vcpu->arch.pmu.num_events;
> +		return put_user(num_events, uaddr);
> +	}
>  	}
>  
>  	return -ENXIO;
> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  	case KVM_ARM_VCPU_PMU_V3_IRQ:
>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>  	case KVM_ARM_VCPU_PMU_V3_FILTER:
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>  		if (kvm_arm_support_pmu_v3() &&
>  		    test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>  			return 0;
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 20ab2a7d37ca..d51e39600bbd 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>  	       | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
>  	if (!system_supports_32bit_el0())
>  		val |= ARMV8_PMU_PMCR_LC;
> +
> +	/* Override number of event selectors */
> +	val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
> +	val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
> +
>  	__vcpu_sys_reg(vcpu, r->reg) = val;
>  }
>  
> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
> index 98cbfe885a53..ea3fc96a37d9 100644
> --- a/include/kvm/arm_pmu.h
> +++ b/include/kvm/arm_pmu.h
> @@ -27,6 +27,7 @@ struct kvm_pmu {
>  	bool ready;
>  	bool created;
>  	bool irq_level;
> +	u8 num_events;
>  };
>  
>  #define kvm_arm_pmu_v3_ready(v)		((v)->arch.pmu.ready)
> 

Thanks

Eric
Eric Auger Sept. 10, 2020, 8:35 a.m. UTC | #5
Hi Alex,

On 9/10/20 9:21 AM, Auger Eric wrote:
> Hi Alex,
> 
> On 9/8/20 10:57 PM, Alexander Graf wrote:
>> We currently pass through the number of PMU counters that we have available
>> in hardware to guests. So if my host supports 10 concurrently active PMU
>> counters, my guest will be able to spawn 10 counters as well.
>>
>> This is undesireable if we also want to use the PMU on the host for
>> monitoring. In that case, we want to split the PMU between guest and
>> host.
> but don't we have a trap and emulate approach as opposed to current SPE
> implementation?

Looking again at the switch code (__pmu_switch_to_guest), I see we just
enable/disable the event counters owned by either the host/guest. I
thought this was more involved. So now it is clear.

Thanks

Eric
>>
>> To help that case, let's add a PMU attr that allows us to limit the number
>> of PMU counters that we expose. With this patch in place, user space can
>> keep some counters free for host use.
>>
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>
>> ---
>>
>> Because this patch touches the same code paths as the vPMU filtering one
>> and the vPMU filtering generalized a few conditions in the attr path,
>> I've based it on top. Please let me know if you want it independent instead.
>>
>> v1 -> v2:
>>
>>   - Add documentation
>>   - Add read support
>> ---
>>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>  arch/arm64/kvm/pmu-emul.c               | 32 ++++++++++++++++++++++++++++++++
>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>  include/kvm/arm_pmu.h                   |  1 +
>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>> index 203b91e93151..1a1c8d8c8b1d 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the cycle counter is possible
>>  using event 0x11 (CPU_CYCLES).
>>  
>>  
>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>> +---------------------------------------------
>> +
>> +:Parameters: in kvm_device_attr.addr the address for the limit of concurrent
>> +             events is a pointer to an int
>> +
>> +:Returns:
>> +
>> +	 =======  ======================================================
>> +	 -ENODEV: PMUv3 not supported
>> +	 -EBUSY:  PMUv3 already initialized
>> +	 -EINVAL: Too large number of events
> s/events/event counters
> 
> I see that in perf code indeed num_events is used for that but I think
> for the end-user the event counter terminology is better as it fits the
> ARM spec.
>> +	 =======  ======================================================
>> +
>> +Reconfigure the limit of concurrent PMU events that the guest can monitor.
> here also
>> +This number is directly exposed as part of the PMCR_EL0 register.
> Maybe quote the "N" field
>> +
>> +On vcpu creation, this attribute is set to the hardware limit of the current
>> +platform. If you need to determine the hardware limit, you can read this
>> +attribute before setting it.
>> +
>> +Restrictions: The default value for this property is the number of hardware
>> +supported events. Only values that are smaller than the hardware limit can
> event counters
>> +be set.
>> +
>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>  =================================
>>  
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 7b1511d6ce44..db025c0b5a40 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>  
>>  /* Device Control API on vcpu fd */
>>  #define KVM_ARM_VCPU_PMU_V3_CTRL	0
>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ	0
>> -#define   KVM_ARM_VCPU_PMU_V3_INIT	1
>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER	2
>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ		0
>> +#define   KVM_ARM_VCPU_PMU_V3_INIT		1
>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER		2
>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS	3
>>  #define KVM_ARM_VCPU_TIMER_CTRL		1
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>> index 0458860bade2..c7915b95fec0 100644
>> --- a/arch/arm64/kvm/pmu-emul.c
>> +++ b/arch/arm64/kvm/pmu-emul.c
>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>  
>>  	for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>  		pmu->pmc[i].idx = i;
>> +
>> +	pmu->num_events = perf_num_counters() - 1;
>>  }
>>  
>>  /**
>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>  
>>  		return 0;
>>  	}
>> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>> +		u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
>> +		int __user *uaddr = (int __user *)(long)attr->addr;
>> +		u32 num_events;
>> +
>> +		if (get_user(num_events, uaddr))
>> +			return -EFAULT;
>> +
>> +		if (num_events >= perf_num_counters())
>> +			return -EINVAL;
>> +
>> +		vcpu->arch.pmu.num_events = num_events;
>> +
>> +		num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>> +
>> +		return 0;
>> +	}
>>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>>  		return kvm_arm_pmu_v3_init(vcpu);
>>  	}
>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>  		irq = vcpu->arch.pmu.irq_num;
>>  		return put_user(irq, uaddr);
>>  	}
>> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>> +		int __user *uaddr = (int __user *)(long)attr->addr;
>> +		u32 num_events;
>> +
>> +		if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>> +			return -ENODEV;
>> +
>> +		num_events = vcpu->arch.pmu.num_events;
>> +		return put_user(num_events, uaddr);
>> +	}
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>  	case KVM_ARM_VCPU_PMU_V3_IRQ:
>>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>>  	case KVM_ARM_VCPU_PMU_V3_FILTER:
>> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>  		if (kvm_arm_support_pmu_v3() &&
>>  		    test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>  			return 0;
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 20ab2a7d37ca..d51e39600bbd 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>>  	       | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
>>  	if (!system_supports_32bit_el0())
>>  		val |= ARMV8_PMU_PMCR_LC;
>> +
>> +	/* Override number of event selectors */
>> +	val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>> +	val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>> +
>>  	__vcpu_sys_reg(vcpu, r->reg) = val;
>>  }
>>  
>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>> index 98cbfe885a53..ea3fc96a37d9 100644
>> --- a/include/kvm/arm_pmu.h
>> +++ b/include/kvm/arm_pmu.h
>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>  	bool ready;
>>  	bool created;
>>  	bool irq_level;
>> +	u8 num_events;
>>  };
>>  
>>  #define kvm_arm_pmu_v3_ready(v)		((v)->arch.pmu.ready)
>>
> 
> Thanks
> 
> Eric
>
Marc Zyngier Sept. 10, 2020, 10:06 a.m. UTC | #6
On 2020-09-08 21:57, Alexander Graf wrote:
> We currently pass through the number of PMU counters that we have 
> available
> in hardware to guests. So if my host supports 10 concurrently active 
> PMU
> counters, my guest will be able to spawn 10 counters as well.
> 
> This is undesireable if we also want to use the PMU on the host for
> monitoring. In that case, we want to split the PMU between guest and
> host.
> 
> To help that case, let's add a PMU attr that allows us to limit the 
> number
> of PMU counters that we expose. With this patch in place, user space 
> can
> keep some counters free for host use.
> 
> Signed-off-by: Alexander Graf <graf@amazon.com>
> 
> ---
> 
> Because this patch touches the same code paths as the vPMU filtering 
> one
> and the vPMU filtering generalized a few conditions in the attr path,
> I've based it on top. Please let me know if you want it independent 
> instead.
> 
> v1 -> v2:
> 
>   - Add documentation
>   - Add read support
> ---
>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>  arch/arm64/kvm/pmu-emul.c               | 32 
> ++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>  include/kvm/arm_pmu.h                   |  1 +
>  5 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst
> b/Documentation/virt/kvm/devices/vcpu.rst
> index 203b91e93151..1a1c8d8c8b1d 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
> cycle counter is possible
>  using event 0x11 (CPU_CYCLES).
> 
> 
> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
> +---------------------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the limit of 
> concurrent
> +             events is a pointer to an int
> +
> +:Returns:
> +
> +	 =======  ======================================================
> +	 -ENODEV: PMUv3 not supported
> +	 -EBUSY:  PMUv3 already initialized
> +	 -EINVAL: Too large number of events
> +	 =======  ======================================================
> +
> +Reconfigure the limit of concurrent PMU events that the guest can 
> monitor.
> +This number is directly exposed as part of the PMCR_EL0 register.
> +
> +On vcpu creation, this attribute is set to the hardware limit of the 
> current
> +platform. If you need to determine the hardware limit, you can read 
> this
> +attribute before setting it.
> +
> +Restrictions: The default value for this property is the number of 
> hardware
> +supported events. Only values that are smaller than the hardware limit 
> can
> +be set.
> +
>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>  =================================
> 
> diff --git a/arch/arm64/include/uapi/asm/kvm.h
> b/arch/arm64/include/uapi/asm/kvm.h
> index 7b1511d6ce44..db025c0b5a40 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
> 
>  /* Device Control API on vcpu fd */
>  #define KVM_ARM_VCPU_PMU_V3_CTRL	0
> -#define   KVM_ARM_VCPU_PMU_V3_IRQ	0
> -#define   KVM_ARM_VCPU_PMU_V3_INIT	1
> -#define   KVM_ARM_VCPU_PMU_V3_FILTER	2
> +#define   KVM_ARM_VCPU_PMU_V3_IRQ		0
> +#define   KVM_ARM_VCPU_PMU_V3_INIT		1
> +#define   KVM_ARM_VCPU_PMU_V3_FILTER		2
> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS	3
>  #define KVM_ARM_VCPU_TIMER_CTRL		1
>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 0458860bade2..c7915b95fec0 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
> 
>  	for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>  		pmu->pmc[i].idx = i;
> +
> +	pmu->num_events = perf_num_counters() - 1;
>  }
> 
>  /**
> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
> *vcpu, struct kvm_device_attr *attr)
> 
>  		return 0;
>  	}
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
> +		u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
> +		int __user *uaddr = (int __user *)(long)attr->addr;
> +		u32 num_events;
> +
> +		if (get_user(num_events, uaddr))
> +			return -EFAULT;
> +
> +		if (num_events >= perf_num_counters())
> +			return -EINVAL;
> +
> +		vcpu->arch.pmu.num_events = num_events;
> +
> +		num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
> +		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
> +
> +		return 0;
> +	}
>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>  		return kvm_arm_pmu_v3_init(vcpu);
>  	}
> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
> *vcpu, struct kvm_device_attr *attr)
>  		irq = vcpu->arch.pmu.irq_num;
>  		return put_user(irq, uaddr);
>  	}
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
> +		int __user *uaddr = (int __user *)(long)attr->addr;
> +		u32 num_events;
> +
> +		if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
> +			return -ENODEV;
> +
> +		num_events = vcpu->arch.pmu.num_events;
> +		return put_user(num_events, uaddr);
> +	}
>  	}
> 
>  	return -ENXIO;
> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu
> *vcpu, struct kvm_device_attr *attr)
>  	case KVM_ARM_VCPU_PMU_V3_IRQ:
>  	case KVM_ARM_VCPU_PMU_V3_INIT:
>  	case KVM_ARM_VCPU_PMU_V3_FILTER:
> +	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>  		if (kvm_arm_support_pmu_v3() &&
>  		    test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>  			return 0;
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 20ab2a7d37ca..d51e39600bbd 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu,
> const struct sys_reg_desc *r)
>  	       | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
>  	if (!system_supports_32bit_el0())
>  		val |= ARMV8_PMU_PMCR_LC;
> +
> +	/* Override number of event selectors */
> +	val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
> +	val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
> +
>  	__vcpu_sys_reg(vcpu, r->reg) = val;
>  }
> 
> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
> index 98cbfe885a53..ea3fc96a37d9 100644
> --- a/include/kvm/arm_pmu.h
> +++ b/include/kvm/arm_pmu.h
> @@ -27,6 +27,7 @@ struct kvm_pmu {
>  	bool ready;
>  	bool created;
>  	bool irq_level;
> +	u8 num_events;
>  };
> 
>  #define kvm_arm_pmu_v3_ready(v)		((v)->arch.pmu.ready)

I see several problems with this approach:

- userspace doesn't really have a good way to retrieve the number of
   counters.

- Limiting the number of counters for the guest doesn't mean anything
   when it comes to the actual use of the HW counters, given that we
   don't allocate them ourselves (it's all perf doing the actual work).

- If you want to "pin" counters for the host, why don't you just do
   that before starting the guest?

I think you need to look at the bigger picture: how to limit the use
of physical counter usage for a given userspace task. This needs
to happen in perf itself, and not in KVM.

         M.
Alexander Graf Sept. 10, 2020, 10:18 a.m. UTC | #7
On 10.09.20 12:06, Marc Zyngier wrote:
> 
> On 2020-09-08 21:57, Alexander Graf wrote:
>> We currently pass through the number of PMU counters that we have
>> available
>> in hardware to guests. So if my host supports 10 concurrently active
>> PMU
>> counters, my guest will be able to spawn 10 counters as well.
>>
>> This is undesireable if we also want to use the PMU on the host for
>> monitoring. In that case, we want to split the PMU between guest and
>> host.
>>
>> To help that case, let's add a PMU attr that allows us to limit the
>> number
>> of PMU counters that we expose. With this patch in place, user space
>> can
>> keep some counters free for host use.
>>
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>
>> ---
>>
>> Because this patch touches the same code paths as the vPMU filtering
>> one
>> and the vPMU filtering generalized a few conditions in the attr path,
>> I've based it on top. Please let me know if you want it independent
>> instead.
>>
>> v1 -> v2:
>>
>>   - Add documentation
>>   - Add read support
>> ---
>>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>  arch/arm64/kvm/pmu-emul.c               | 32
>> ++++++++++++++++++++++++++++++++
>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>  include/kvm/arm_pmu.h                   |  1 +
>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst
>> b/Documentation/virt/kvm/devices/vcpu.rst
>> index 203b91e93151..1a1c8d8c8b1d 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
>> cycle counter is possible
>>  using event 0x11 (CPU_CYCLES).
>>
>>
>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>> +---------------------------------------------
>> +
>> +:Parameters: in kvm_device_attr.addr the address for the limit of
>> concurrent
>> +             events is a pointer to an int
>> +
>> +:Returns:
>> +
>> +      =======  ======================================================
>> +      -ENODEV: PMUv3 not supported
>> +      -EBUSY:  PMUv3 already initialized
>> +      -EINVAL: Too large number of events
>> +      =======  ======================================================
>> +
>> +Reconfigure the limit of concurrent PMU events that the guest can
>> monitor.
>> +This number is directly exposed as part of the PMCR_EL0 register.
>> +
>> +On vcpu creation, this attribute is set to the hardware limit of the
>> current
>> +platform. If you need to determine the hardware limit, you can read
>> this
>> +attribute before setting it.
>> +
>> +Restrictions: The default value for this property is the number of
>> hardware
>> +supported events. Only values that are smaller than the hardware limit
>> can
>> +be set.
>> +
>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>  =================================
>>
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 7b1511d6ce44..db025c0b5a40 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>
>>  /* Device Control API on vcpu fd */
>>  #define KVM_ARM_VCPU_PMU_V3_CTRL     0
>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
>> -#define   KVM_ARM_VCPU_PMU_V3_INIT   1
>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ            0
>> +#define   KVM_ARM_VCPU_PMU_V3_INIT           1
>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER         2
>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS     3
>>  #define KVM_ARM_VCPU_TIMER_CTRL              1
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER              0
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER              1
>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>> index 0458860bade2..c7915b95fec0 100644
>> --- a/arch/arm64/kvm/pmu-emul.c
>> +++ b/arch/arm64/kvm/pmu-emul.c
>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>
>>       for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>               pmu->pmc[i].idx = i;
>> +
>> +     pmu->num_events = perf_num_counters() - 1;
>>  }
>>
>>  /**
>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
>> *vcpu, struct kvm_device_attr *attr)
>>
>>               return 0;
>>       }
>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>> +             u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>> +             u32 num_events;
>> +
>> +             if (get_user(num_events, uaddr))
>> +                     return -EFAULT;
>> +
>> +             if (num_events >= perf_num_counters())
>> +                     return -EINVAL;
>> +
>> +             vcpu->arch.pmu.num_events = num_events;
>> +
>> +             num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>> +
>> +             return 0;
>> +     }
>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>               return kvm_arm_pmu_v3_init(vcpu);
>>       }
>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
>> *vcpu, struct kvm_device_attr *attr)
>>               irq = vcpu->arch.pmu.irq_num;
>>               return put_user(irq, uaddr);
>>       }
>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>> +             u32 num_events;
>> +
>> +             if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>> +                     return -ENODEV;
>> +
>> +             num_events = vcpu->arch.pmu.num_events;
>> +             return put_user(num_events, uaddr);
>> +     }
>>       }
>>
>>       return -ENXIO;
>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu
>> *vcpu, struct kvm_device_attr *attr)
>>       case KVM_ARM_VCPU_PMU_V3_IRQ:
>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>       case KVM_ARM_VCPU_PMU_V3_FILTER:
>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>               if (kvm_arm_support_pmu_v3() &&
>>                   test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>                       return 0;
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 20ab2a7d37ca..d51e39600bbd 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu,
>> const struct sys_reg_desc *r)
>>              | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
>>       if (!system_supports_32bit_el0())
>>               val |= ARMV8_PMU_PMCR_LC;
>> +
>> +     /* Override number of event selectors */
>> +     val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>> +     val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>> +
>>       __vcpu_sys_reg(vcpu, r->reg) = val;
>>  }
>>
>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>> index 98cbfe885a53..ea3fc96a37d9 100644
>> --- a/include/kvm/arm_pmu.h
>> +++ b/include/kvm/arm_pmu.h
>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>       bool ready;
>>       bool created;
>>       bool irq_level;
>> +     u8 num_events;
>>  };
>>
>>  #define kvm_arm_pmu_v3_ready(v)              ((v)->arch.pmu.ready)
> 
> I see several problems with this approach:
> 
> - userspace doesn't really have a good way to retrieve the number of
>    counters.
It does with v2, because it can then just read the register ;). I agree 
that it's clunky though.

> 
> - Limiting the number of counters for the guest doesn't mean anything
>    when it comes to the actual use of the HW counters, given that we
>    don't allocate them ourselves (it's all perf doing the actual work).

We do cap the number of actively requestable counters via perf by the 
PMCR.N limit. So in a way, it does mean something.

> - If you want to "pin" counters for the host, why don't you just do
>    that before starting the guest?

You can do that. Imagine I have 10 counters. I pin 4 of them to the 
host. I still tell my guest that it can use 6. That means perf will then 
time slice and juggle 10 guest event counters on those remaining 6 
hardware counters. That juggling heavily reduces accuracy.

> I think you need to look at the bigger picture: how to limit the use
> of physical counter usage for a given userspace task. This needs
> to happen in perf itself, and not in KVM.

That's definitely another way to look at it that I agree with.

What we really want is to expose the number of counters the guest has 
available, not the number of counters hardware can support at maximum.

So in theory it would be enough to ask perf how many counters it has 
free for me to consume without overcommitting. But that number could 
potentially change between multiple invocations of KVM and thus break 
things like live migration, no?

Maybe what we really want is an interface to perf from user space to say 
"how many counters can you dedicate to me?" and "reserve them for me". 
Then user space could reserve them as dedicated counters and KVM would 
just need to either probe for the reservation or get told by user space 
what to expose via ONE_REG as Drew suggested. It would then be up to user 
space to ensure that the reservation matches the number of exposed counters.


Thoughts?

Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Robin Murphy Sept. 10, 2020, 3:52 p.m. UTC | #8
On 2020-09-10 11:18, Alexander Graf wrote:
> 
> 
> On 10.09.20 12:06, Marc Zyngier wrote:
>>
>> On 2020-09-08 21:57, Alexander Graf wrote:
>>> We currently pass through the number of PMU counters that we have
>>> available
>>> in hardware to guests. So if my host supports 10 concurrently active
>>> PMU
>>> counters, my guest will be able to spawn 10 counters as well.
>>>
>>> This is undesireable if we also want to use the PMU on the host for
>>> monitoring. In that case, we want to split the PMU between guest and
>>> host.
>>>
>>> To help that case, let's add a PMU attr that allows us to limit the
>>> number
>>> of PMU counters that we expose. With this patch in place, user space
>>> can
>>> keep some counters free for host use.
>>>
>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>>
>>> ---
>>>
>>> Because this patch touches the same code paths as the vPMU filtering
>>> one
>>> and the vPMU filtering generalized a few conditions in the attr path,
>>> I've based it on top. Please let me know if you want it independent
>>> instead.
>>>
>>> v1 -> v2:
>>>
>>>   - Add documentation
>>>   - Add read support
>>> ---
>>>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>>  arch/arm64/kvm/pmu-emul.c               | 32
>>> ++++++++++++++++++++++++++++++++
>>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>>  include/kvm/arm_pmu.h                   |  1 +
>>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst
>>> b/Documentation/virt/kvm/devices/vcpu.rst
>>> index 203b91e93151..1a1c8d8c8b1d 100644
>>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
>>> cycle counter is possible
>>>  using event 0x11 (CPU_CYCLES).
>>>
>>>
>>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>>> +---------------------------------------------
>>> +
>>> +:Parameters: in kvm_device_attr.addr the address for the limit of
>>> concurrent
>>> +             events is a pointer to an int
>>> +
>>> +:Returns:
>>> +
>>> +      =======  ======================================================
>>> +      -ENODEV: PMUv3 not supported
>>> +      -EBUSY:  PMUv3 already initialized
>>> +      -EINVAL: Too large number of events
>>> +      =======  ======================================================
>>> +
>>> +Reconfigure the limit of concurrent PMU events that the guest can
>>> monitor.
>>> +This number is directly exposed as part of the PMCR_EL0 register.
>>> +
>>> +On vcpu creation, this attribute is set to the hardware limit of the
>>> current
>>> +platform. If you need to determine the hardware limit, you can read
>>> this
>>> +attribute before setting it.
>>> +
>>> +Restrictions: The default value for this property is the number of
>>> hardware
>>> +supported events. Only values that are smaller than the hardware limit
>>> can
>>> +be set.
>>> +
>>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>>  =================================
>>>
>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>> b/arch/arm64/include/uapi/asm/kvm.h
>>> index 7b1511d6ce44..db025c0b5a40 100644
>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>>
>>>  /* Device Control API on vcpu fd */
>>>  #define KVM_ARM_VCPU_PMU_V3_CTRL     0
>>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
>>> -#define   KVM_ARM_VCPU_PMU_V3_INIT   1
>>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
>>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ            0
>>> +#define   KVM_ARM_VCPU_PMU_V3_INIT           1
>>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER         2
>>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS     3
>>>  #define KVM_ARM_VCPU_TIMER_CTRL              1
>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER              0
>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER              1
>>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>>> index 0458860bade2..c7915b95fec0 100644
>>> --- a/arch/arm64/kvm/pmu-emul.c
>>> +++ b/arch/arm64/kvm/pmu-emul.c
>>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>>
>>>       for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>>               pmu->pmc[i].idx = i;
>>> +
>>> +     pmu->num_events = perf_num_counters() - 1;
>>>  }
>>>
>>>  /**
>>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
>>> *vcpu, struct kvm_device_attr *attr)
>>>
>>>               return 0;
>>>       }
>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>> +             u64 mask = ARMV8_PMU_PMCR_N_MASK << 
>>> ARMV8_PMU_PMCR_N_SHIFT;
>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>> +             u32 num_events;
>>> +
>>> +             if (get_user(num_events, uaddr))
>>> +                     return -EFAULT;
>>> +
>>> +             if (num_events >= perf_num_counters())
>>> +                     return -EINVAL;
>>> +
>>> +             vcpu->arch.pmu.num_events = num_events;
>>> +
>>> +             num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>>> +
>>> +             return 0;
>>> +     }
>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>               return kvm_arm_pmu_v3_init(vcpu);
>>>       }
>>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
>>> *vcpu, struct kvm_device_attr *attr)
>>>               irq = vcpu->arch.pmu.irq_num;
>>>               return put_user(irq, uaddr);
>>>       }
>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>> +             u32 num_events;
>>> +
>>> +             if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>> +                     return -ENODEV;
>>> +
>>> +             num_events = vcpu->arch.pmu.num_events;
>>> +             return put_user(num_events, uaddr);
>>> +     }
>>>       }
>>>
>>>       return -ENXIO;
>>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu
>>> *vcpu, struct kvm_device_attr *attr)
>>>       case KVM_ARM_VCPU_PMU_V3_IRQ:
>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>       case KVM_ARM_VCPU_PMU_V3_FILTER:
>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>>               if (kvm_arm_support_pmu_v3() &&
>>>                   test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>                       return 0;
>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>> index 20ab2a7d37ca..d51e39600bbd 100644
>>> --- a/arch/arm64/kvm/sys_regs.c
>>> +++ b/arch/arm64/kvm/sys_regs.c
>>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu,
>>> const struct sys_reg_desc *r)
>>>              | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & 
>>> (~ARMV8_PMU_PMCR_E);
>>>       if (!system_supports_32bit_el0())
>>>               val |= ARMV8_PMU_PMCR_LC;
>>> +
>>> +     /* Override number of event selectors */
>>> +     val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>>> +     val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>>> +
>>>       __vcpu_sys_reg(vcpu, r->reg) = val;
>>>  }
>>>
>>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>>> index 98cbfe885a53..ea3fc96a37d9 100644
>>> --- a/include/kvm/arm_pmu.h
>>> +++ b/include/kvm/arm_pmu.h
>>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>>       bool ready;
>>>       bool created;
>>>       bool irq_level;
>>> +     u8 num_events;
>>>  };
>>>
>>>  #define kvm_arm_pmu_v3_ready(v)              ((v)->arch.pmu.ready)
>>
>> I see several problems with this approach:
>>
>> - userspace doesn't really have a good way to retrieve the number of
>>    counters.
> It does with v2, because it can then just read the register ;). I agree 
> that it's clunky though.
> 
>>
>> - Limiting the number of counters for the guest doesn't mean anything
>>    when it comes to the actual use of the HW counters, given that we
>>    don't allocate them ourselves (it's all perf doing the actual work).
> 
> We do cap the number of actively requestable counters via perf by the 
> PMCR.N limit. So in a way, it does mean something.
> 
>> - If you want to "pin" counters for the host, why don't you just do
>>    that before starting the guest?
> 
> You can do that. Imagine I have 10 counters. I pin 4 of them to the 
> host. I still tell my guest that it can use 6. That means perf will then 
> time slice and juggle 10 guest event counters on those remaining 6 
> hardware counters. That juggling heavily reduces accuracy.
> 
>> I think you need to look at the bigger picture: how to limit the use
>> of physical counter usage for a given userspace task. This needs
>> to happen in perf itself, and not in KVM.
> 
> That's definitely another way to look at it that I agree with.
> 
> What we really want is to expose the number of counters the guest has 
> available, not the number of counters hardware can support at maximum.
> 
> So in theory it would be enough to ask perf how many counters it does 
> have free for me to consume without overcommitting. But that would 
> potentially change between multiple invocations of KVM and thus break 
> things like live migration, no?
> 
> Maybe what we really want is an interface to perf from user space to say 
> "how many counters can you dedicate to me?" and "reserve them for me". 
> Then user space could reserve them as dedicated counters and KVM would 
> just need to either probe for the reservation or get told by user space 
> what to expose via ONE_REG as Drew suggested. It'd be up to user space 
> to ensure that the reservation matches the number of exposed counters then.

Note that if the aim is to avoid the guest seeing unexpectedly weird 
behaviour, then it's not just the *number* of counters that matters, but 
the underlying physical allocation too, thanks to the possibility of 
chained events.

Robin.
Alexander Graf Sept. 10, 2020, 4:46 p.m. UTC | #9
On 10.09.20 17:52, Robin Murphy wrote:
> 
> On 2020-09-10 11:18, Alexander Graf wrote:
>>
>>
>> On 10.09.20 12:06, Marc Zyngier wrote:
>>>
>>> On 2020-09-08 21:57, Alexander Graf wrote:
>>>> We currently pass through the number of PMU counters that we have
>>>> available
>>>> in hardware to guests. So if my host supports 10 concurrently active
>>>> PMU
>>>> counters, my guest will be able to spawn 10 counters as well.
>>>>
>>>> This is undesireable if we also want to use the PMU on the host for
>>>> monitoring. In that case, we want to split the PMU between guest and
>>>> host.
>>>>
>>>> To help that case, let's add a PMU attr that allows us to limit the
>>>> number
>>>> of PMU counters that we expose. With this patch in place, user space
>>>> can
>>>> keep some counters free for host use.
>>>>
>>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>>>
>>>> ---
>>>>
>>>> Because this patch touches the same code paths as the vPMU filtering
>>>> one
>>>> and the vPMU filtering generalized a few conditions in the attr path,
>>>> I've based it on top. Please let me know if you want it independent
>>>> instead.
>>>>
>>>> v1 -> v2:
>>>>
>>>>   - Add documentation
>>>>   - Add read support
>>>> ---
>>>>  Documentation/virt/kvm/devices/vcpu.rst | 25 +++++++++++++++++++++++++
>>>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>>>  arch/arm64/kvm/pmu-emul.c               | 32
>>>> ++++++++++++++++++++++++++++++++
>>>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>>>  include/kvm/arm_pmu.h                   |  1 +
>>>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst
>>>> b/Documentation/virt/kvm/devices/vcpu.rst
>>>> index 203b91e93151..1a1c8d8c8b1d 100644
>>>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>>>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>>>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
>>>> cycle counter is possible
>>>>  using event 0x11 (CPU_CYCLES).
>>>>
>>>>
>>>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>>>> +---------------------------------------------
>>>> +
>>>> +:Parameters: in kvm_device_attr.addr the address for the limit of
>>>> concurrent
>>>> +             events is a pointer to an int
>>>> +
>>>> +:Returns:
>>>> +
>>>> +      =======  ======================================================
>>>> +      -ENODEV: PMUv3 not supported
>>>> +      -EBUSY:  PMUv3 already initialized
>>>> +      -EINVAL: Too large number of events
>>>> +      =======  ======================================================
>>>> +
>>>> +Reconfigure the limit of concurrent PMU events that the guest can
>>>> monitor.
>>>> +This number is directly exposed as part of the PMCR_EL0 register.
>>>> +
>>>> +On vcpu creation, this attribute is set to the hardware limit of the
>>>> current
>>>> +platform. If you need to determine the hardware limit, you can read
>>>> this
>>>> +attribute before setting it.
>>>> +
>>>> +Restrictions: The default value for this property is the number of
>>>> hardware
>>>> +supported events. Only values that are smaller than the hardware limit
>>>> can
>>>> +be set.
>>>> +
>>>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>>>  =================================
>>>>
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>>> b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 7b1511d6ce44..db025c0b5a40 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>>>
>>>>  /* Device Control API on vcpu fd */
>>>>  #define KVM_ARM_VCPU_PMU_V3_CTRL     0
>>>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
>>>> -#define   KVM_ARM_VCPU_PMU_V3_INIT   1
>>>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
>>>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ            0
>>>> +#define   KVM_ARM_VCPU_PMU_V3_INIT           1
>>>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER         2
>>>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS     3
>>>>  #define KVM_ARM_VCPU_TIMER_CTRL              1
>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER              0
>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER              1
>>>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>>>> index 0458860bade2..c7915b95fec0 100644
>>>> --- a/arch/arm64/kvm/pmu-emul.c
>>>> +++ b/arch/arm64/kvm/pmu-emul.c
>>>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>>>
>>>>       for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>>>               pmu->pmc[i].idx = i;
>>>> +
>>>> +     pmu->num_events = perf_num_counters() - 1;
>>>>  }
>>>>
>>>>  /**
>>>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
>>>> *vcpu, struct kvm_device_attr *attr)
>>>>
>>>>               return 0;
>>>>       }
>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>>> +             u64 mask = ARMV8_PMU_PMCR_N_MASK <<
>>>> ARMV8_PMU_PMCR_N_SHIFT;
>>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>>> +             u32 num_events;
>>>> +
>>>> +             if (get_user(num_events, uaddr))
>>>> +                     return -EFAULT;
>>>> +
>>>> +             if (num_events >= perf_num_counters())
>>>> +                     return -EINVAL;
>>>> +
>>>> +             vcpu->arch.pmu.num_events = num_events;
>>>> +
>>>> +             num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>>>> +
>>>> +             return 0;
>>>> +     }
>>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>>               return kvm_arm_pmu_v3_init(vcpu);
>>>>       }
>>>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
>>>> *vcpu, struct kvm_device_attr *attr)
>>>>               irq = vcpu->arch.pmu.irq_num;
>>>>               return put_user(irq, uaddr);
>>>>       }
>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>>> +             u32 num_events;
>>>> +
>>>> +             if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>> +                     return -ENODEV;
>>>> +
>>>> +             num_events = vcpu->arch.pmu.num_events;
>>>> +             return put_user(num_events, uaddr);
>>>> +     }
>>>>       }
>>>>
>>>>       return -ENXIO;
>>>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu
>>>> *vcpu, struct kvm_device_attr *attr)
>>>>       case KVM_ARM_VCPU_PMU_V3_IRQ:
>>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>>       case KVM_ARM_VCPU_PMU_V3_FILTER:
>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>>>               if (kvm_arm_support_pmu_v3() &&
>>>>                   test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>>                       return 0;
>>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>>> index 20ab2a7d37ca..d51e39600bbd 100644
>>>> --- a/arch/arm64/kvm/sys_regs.c
>>>> +++ b/arch/arm64/kvm/sys_regs.c
>>>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu,
>>>> const struct sys_reg_desc *r)
>>>>              | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) &
>>>> (~ARMV8_PMU_PMCR_E);
>>>>       if (!system_supports_32bit_el0())
>>>>               val |= ARMV8_PMU_PMCR_LC;
>>>> +
>>>> +     /* Override number of event selectors */
>>>> +     val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>>>> +     val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>>>> +
>>>>       __vcpu_sys_reg(vcpu, r->reg) = val;
>>>>  }
>>>>
>>>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>>>> index 98cbfe885a53..ea3fc96a37d9 100644
>>>> --- a/include/kvm/arm_pmu.h
>>>> +++ b/include/kvm/arm_pmu.h
>>>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>>>       bool ready;
>>>>       bool created;
>>>>       bool irq_level;
>>>> +     u8 num_events;
>>>>  };
>>>>
>>>>  #define kvm_arm_pmu_v3_ready(v)              ((v)->arch.pmu.ready)
>>>
>>> I see several problems with this approach:
>>>
>>> - userspace doesn't really have a good way to retrieve the number of
>>>    counters.
>> It does with v2, because it can then just read the register ;). I agree
>> that it's clunky though.
>>
>>>
>>> - Limiting the number of counters for the guest doesn't mean anything
>>>    when it comes to the actual use of the HW counters, given that we
>>>    don't allocate them ourselves (it's all perf doing the actual work).
>>
>> We do cap the number of actively requestable counters via perf by the
>> PMCR.N limit. So in a way, it does mean something.
>>
>>> - If you want to "pin" counters for the host, why don't you just do
>>>    that before starting the guest?
>>
>> You can do that. Imagine I have 10 counters. I pin 4 of them to the
>> host. I still tell my guest that it can use 6. That means perf will then
>> time slice and juggle 10 guest event counters on those remaining 6
>> hardware counters. That juggling heavily reduces accuracy.
>>
>>> I think you need to look at the bigger picture: how to limit the use
>>> of physical counter usage for a given userspace task. This needs
>>> to happen in perf itself, and not in KVM.
>>
>> That's definitely another way to look at it that I agree with.
>>
>> What we really want is to expose the number of counters the guest has
>> available, not the number of counters hardware can support at maximum.
>>
>> So in theory it would be enough to ask perf how many counters it does
>> have free for me to consume without overcommitting. But that would
>> potentially change between multiple invocations of KVM and thus break
>> things like live migration, no?
>>
>> Maybe what we really want is an interface to perf from user space to say
>> "how many counters can you dedicate to me?" and "reserve them for me".
>> Then user space could reserve them as dedicated counters and KVM would
>> just need to either probe for the reservation or get told by user space
>> what to expose via ONE_REG as Drew suggested. It'd be up to user space
>> to ensure that the reservation matches the number of exposed counters 
>> then.
> 
> Note that if the aim is to avoid the guest seeing unexpectedly weird
> behaviour, then it's not just the *number* of counters that matters, but
> the underlying physical allocation too, thanks to the possibility of
> chained events.

Ideally, wouldn't guest chaining propagate into host chaining as well? 
I'd have to double-check whether it does, but in my naive thinking, if I 
reserve 4 hardware counters for the guest and the guest ends up using 4 
hardware counters regardless of their chaining attributes, I'd still be 
able to fit them all?


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Robin Murphy Sept. 10, 2020, 5:41 p.m. UTC | #10
On 2020-09-10 17:46, Alexander Graf wrote:
> 
> 
> On 10.09.20 17:52, Robin Murphy wrote:
>>
>> On 2020-09-10 11:18, Alexander Graf wrote:
>>>
>>>
>>> On 10.09.20 12:06, Marc Zyngier wrote:
>>>>
>>>> On 2020-09-08 21:57, Alexander Graf wrote:
>>>>> We currently pass through the number of PMU counters that we have
>>>>> available
>>>>> in hardware to guests. So if my host supports 10 concurrently active
>>>>> PMU
>>>>> counters, my guest will be able to spawn 10 counters as well.
>>>>>
>>>>> This is undesirable if we also want to use the PMU on the host for
>>>>> monitoring. In that case, we want to split the PMU between guest and
>>>>> host.
>>>>>
>>>>> To help that case, let's add a PMU attr that allows us to limit the
>>>>> number
>>>>> of PMU counters that we expose. With this patch in place, user space
>>>>> can
>>>>> keep some counters free for host use.
>>>>>
>>>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>>>>
>>>>> ---
>>>>>
>>>>> Because this patch touches the same code paths as the vPMU filtering
>>>>> one
>>>>> and the vPMU filtering generalized a few conditions in the attr path,
>>>>> I've based it on top. Please let me know if you want it independent
>>>>> instead.
>>>>>
>>>>> v1 -> v2:
>>>>>
>>>>>   - Add documentation
>>>>>   - Add read support
>>>>> ---
>>>>>  Documentation/virt/kvm/devices/vcpu.rst | 25 
>>>>> +++++++++++++++++++++++++
>>>>>  arch/arm64/include/uapi/asm/kvm.h       |  7 ++++---
>>>>>  arch/arm64/kvm/pmu-emul.c               | 32
>>>>> ++++++++++++++++++++++++++++++++
>>>>>  arch/arm64/kvm/sys_regs.c               |  5 +++++
>>>>>  include/kvm/arm_pmu.h                   |  1 +
>>>>>  5 files changed, 67 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst
>>>>> b/Documentation/virt/kvm/devices/vcpu.rst
>>>>> index 203b91e93151..1a1c8d8c8b1d 100644
>>>>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>>>>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>>>>> @@ -102,6 +102,31 @@ isn't strictly speaking an event. Filtering the
>>>>> cycle counter is possible
>>>>>  using event 0x11 (CPU_CYCLES).
>>>>>
>>>>>
>>>>> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
>>>>> +---------------------------------------------
>>>>> +
>>>>> +:Parameters: in kvm_device_attr.addr the address for the limit of
>>>>> concurrent
>>>>> +             events is a pointer to an int
>>>>> +
>>>>> +:Returns:
>>>>> +
>>>>> +      =======  ======================================================
>>>>> +      -ENODEV: PMUv3 not supported
>>>>> +      -EBUSY:  PMUv3 already initialized
>>>>> +      -EINVAL: Too large number of events
>>>>> +      =======  ======================================================
>>>>> +
>>>>> +Reconfigure the limit of concurrent PMU events that the guest can
>>>>> monitor.
>>>>> +This number is directly exposed as part of the PMCR_EL0 register.
>>>>> +
>>>>> +On vcpu creation, this attribute is set to the hardware limit of the
>>>>> current
>>>>> +platform. If you need to determine the hardware limit, you can read
>>>>> this
>>>>> +attribute before setting it.
>>>>> +
>>>>> +Restrictions: The default value for this property is the number of
>>>>> hardware
>>>>> +supported events. Only values that are smaller than the hardware 
>>>>> limit
>>>>> can
>>>>> +be set.
>>>>> +
>>>>>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>>>>  =================================
>>>>>
>>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>>>> b/arch/arm64/include/uapi/asm/kvm.h
>>>>> index 7b1511d6ce44..db025c0b5a40 100644
>>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>>> @@ -342,9 +342,10 @@ struct kvm_vcpu_events {
>>>>>
>>>>>  /* Device Control API on vcpu fd */
>>>>>  #define KVM_ARM_VCPU_PMU_V3_CTRL     0
>>>>> -#define   KVM_ARM_VCPU_PMU_V3_IRQ    0
>>>>> -#define   KVM_ARM_VCPU_PMU_V3_INIT   1
>>>>> -#define   KVM_ARM_VCPU_PMU_V3_FILTER 2
>>>>> +#define   KVM_ARM_VCPU_PMU_V3_IRQ            0
>>>>> +#define   KVM_ARM_VCPU_PMU_V3_INIT           1
>>>>> +#define   KVM_ARM_VCPU_PMU_V3_FILTER         2
>>>>> +#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS     3
>>>>>  #define KVM_ARM_VCPU_TIMER_CTRL              1
>>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER              0
>>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER              1
>>>>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>>>>> index 0458860bade2..c7915b95fec0 100644
>>>>> --- a/arch/arm64/kvm/pmu-emul.c
>>>>> +++ b/arch/arm64/kvm/pmu-emul.c
>>>>> @@ -253,6 +253,8 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
>>>>>
>>>>>       for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
>>>>>               pmu->pmc[i].idx = i;
>>>>> +
>>>>> +     pmu->num_events = perf_num_counters() - 1;
>>>>>  }
>>>>>
>>>>>  /**
>>>>> @@ -978,6 +980,25 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu
>>>>> *vcpu, struct kvm_device_attr *attr)
>>>>>
>>>>>               return 0;
>>>>>       }
>>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>>>> +             u64 mask = ARMV8_PMU_PMCR_N_MASK <<
>>>>> ARMV8_PMU_PMCR_N_SHIFT;
>>>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>>>> +             u32 num_events;
>>>>> +
>>>>> +             if (get_user(num_events, uaddr))
>>>>> +                     return -EFAULT;
>>>>> +
>>>>> +             if (num_events >= perf_num_counters())
>>>>> +                     return -EINVAL;
>>>>> +
>>>>> +             vcpu->arch.pmu.num_events = num_events;
>>>>> +
>>>>> +             num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
>>>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
>>>>> +             __vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
>>>>> +
>>>>> +             return 0;
>>>>> +     }
>>>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>>>               return kvm_arm_pmu_v3_init(vcpu);
>>>>>       }
>>>>> @@ -1004,6 +1025,16 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu
>>>>> *vcpu, struct kvm_device_attr *attr)
>>>>>               irq = vcpu->arch.pmu.irq_num;
>>>>>               return put_user(irq, uaddr);
>>>>>       }
>>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
>>>>> +             int __user *uaddr = (int __user *)(long)attr->addr;
>>>>> +             u32 num_events;
>>>>> +
>>>>> +             if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>>> +                     return -ENODEV;
>>>>> +
>>>>> +             num_events = vcpu->arch.pmu.num_events;
>>>>> +             return put_user(num_events, uaddr);
>>>>> +     }
>>>>>       }
>>>>>
>>>>>       return -ENXIO;
>>>>> @@ -1015,6 +1046,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu
>>>>> *vcpu, struct kvm_device_attr *attr)
>>>>>       case KVM_ARM_VCPU_PMU_V3_IRQ:
>>>>>       case KVM_ARM_VCPU_PMU_V3_INIT:
>>>>>       case KVM_ARM_VCPU_PMU_V3_FILTER:
>>>>> +     case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
>>>>>               if (kvm_arm_support_pmu_v3() &&
>>>>>                   test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
>>>>>                       return 0;
>>>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>>>> index 20ab2a7d37ca..d51e39600bbd 100644
>>>>> --- a/arch/arm64/kvm/sys_regs.c
>>>>> +++ b/arch/arm64/kvm/sys_regs.c
>>>>> @@ -672,6 +672,11 @@ static void reset_pmcr(struct kvm_vcpu *vcpu,
>>>>> const struct sys_reg_desc *r)
>>>>>              | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) &
>>>>> (~ARMV8_PMU_PMCR_E);
>>>>>       if (!system_supports_32bit_el0())
>>>>>               val |= ARMV8_PMU_PMCR_LC;
>>>>> +
>>>>> +     /* Override number of event selectors */
>>>>> +     val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
>>>>> +     val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
>>>>> +
>>>>>       __vcpu_sys_reg(vcpu, r->reg) = val;
>>>>>  }
>>>>>
>>>>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>>>>> index 98cbfe885a53..ea3fc96a37d9 100644
>>>>> --- a/include/kvm/arm_pmu.h
>>>>> +++ b/include/kvm/arm_pmu.h
>>>>> @@ -27,6 +27,7 @@ struct kvm_pmu {
>>>>>       bool ready;
>>>>>       bool created;
>>>>>       bool irq_level;
>>>>> +     u8 num_events;
>>>>>  };
>>>>>
>>>>>  #define kvm_arm_pmu_v3_ready(v)              ((v)->arch.pmu.ready)
>>>>
>>>> I see several problems with this approach:
>>>>
>>>> - userspace doesn't really have a good way to retrieve the number of
>>>>    counters.
>>> It does with v2, because it can then just read the register ;). I agree
>>> that it's clunky though.
>>>
>>>>
>>>> - Limiting the number of counters for the guest doesn't mean anything
>>>>    when it comes to the actual use of the HW counters, given that we
>>>>    don't allocate them ourselves (it's all perf doing the actual work).
>>>
>>> We do cap the number of actively requestable counters via perf by the
>>> PMCR.N limit. So in a way, it does mean something.
>>>
>>>> - If you want to "pin" counters for the host, why don't you just do
>>>>    that before starting the guest?
>>>
>>> You can do that. Imagine I have 10 counters. I pin 4 of them to the
>>> host. I still tell my guest that it can use 6. That means perf will then
>>> time slice and juggle 10 guest event counters on those remaining 6
>>> hardware counters. That juggling heavily reduces accuracy.
>>>
>>>> I think you need to look at the bigger picture: how to limit the use
>>>> of physical counter usage for a given userspace task. This needs
>>>> to happen in perf itself, and not in KVM.
>>>
>>> That's definitely another way to look at it that I agree with.
>>>
>>> What we really want is to expose the number of counters the guest has
>>> available, not the number of counters hardware can support at maximum.
>>>
>>> So in theory it would be enough to ask perf how many counters it does
>>> have free for me to consume without overcommitting. But that would
>>> potentially change between multiple invocations of KVM and thus break
>>> things like live migration, no?
>>>
>>> Maybe what we really want is an interface to perf from user space to say
>>> "how many counters can you dedicate to me?" and "reserve them for me".
>>> Then user space could reserve them as dedicated counters and KVM would
>>> just need to either probe for the reservation or get told by user space
>>> what to expose via ONE_REG as Drew suggested. It'd be up to user space
>>> to ensure that the reservation matches the number of exposed counters 
>>> then.
>>
>> Note that if the aim is to avoid the guest seeing unexpectedly weird
>> behaviour, then it's not just the *number* of counters that matters, but
>> the underlying physical allocation too, thanks to the possibility of
>> chained events.
> 
> Ideally, wouldn't guest chaining propagate into host chaining as well? 
> I'd have to double-check whether it does, but in my naive thinking, if I 
> reserve 4 hardware counters for the guest and the guest ends up using 4 
> hardware counters regardless of their chaining attributes, I'd still be 
> able to fit them all?

It depends what you mean by "reserve" - if you merely tell the guest 
that the vPMU only has 4 counters, and at any given time your (n-4) host 
events happen to have ended up scheduled such that no physical even/odd 
pair of counters is free, then even a single chained event from the 
guest might trigger context rotation despite there being 4 or more 
counters free in total.

If you want truly deterministic results then you'll probably need some 
brain surgery on the PMU driver to actually partition the physical 
counters rather than simply relying on perf's scheduling.

Robin.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 203b91e93151..1a1c8d8c8b1d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -102,6 +102,31 @@  isn't strictly speaking an event. Filtering the cycle counter is possible
 using event 0x11 (CPU_CYCLES).
 
 
+1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_NUM_EVENTS
+---------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for the limit of concurrent
+             events is a pointer to an int
+
+:Returns:
+
+	 =======  ======================================================
+	 -ENODEV: PMUv3 not supported
+	 -EBUSY:  PMUv3 already initialized
+	 -EINVAL: Too large number of events
+	 =======  ======================================================
+
+Reconfigure the limit of concurrent PMU events that the guest can monitor.
+This number is directly exposed as part of the PMCR_EL0 register.
+
+On vcpu creation, this attribute is set to the hardware limit of the current
+platform. If you need to determine the hardware limit, you can read this
+attribute before setting it.
+
+Restrictions: The default value for this property is the number of hardware
+supported events. Only values that are smaller than the hardware limit can
+be set.
+
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =================================
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 7b1511d6ce44..db025c0b5a40 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -342,9 +342,10 @@  struct kvm_vcpu_events {
 
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL	0
-#define   KVM_ARM_VCPU_PMU_V3_IRQ	0
-#define   KVM_ARM_VCPU_PMU_V3_INIT	1
-#define   KVM_ARM_VCPU_PMU_V3_FILTER	2
+#define   KVM_ARM_VCPU_PMU_V3_IRQ		0
+#define   KVM_ARM_VCPU_PMU_V3_INIT		1
+#define   KVM_ARM_VCPU_PMU_V3_FILTER		2
+#define   KVM_ARM_VCPU_PMU_V3_NUM_EVENTS	3
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 0458860bade2..c7915b95fec0 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -253,6 +253,8 @@  void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu)
 
 	for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
 		pmu->pmc[i].idx = i;
+
+	pmu->num_events = perf_num_counters() - 1;
 }
 
 /**
@@ -978,6 +980,25 @@  int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return 0;
 	}
+	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
+		u64 mask = ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT;
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		u32 num_events;
+
+		if (get_user(num_events, uaddr))
+			return -EFAULT;
+
+		if (num_events >= perf_num_counters())
+			return -EINVAL;
+
+		vcpu->arch.pmu.num_events = num_events;
+
+		num_events <<= ARMV8_PMU_PMCR_N_SHIFT;
+		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) &= ~mask;
+		__vcpu_sys_reg(vcpu, SYS_PMCR_EL0) |= num_events;
+
+		return 0;
+	}
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 		return kvm_arm_pmu_v3_init(vcpu);
 	}
@@ -1004,6 +1025,16 @@  int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 		irq = vcpu->arch.pmu.irq_num;
 		return put_user(irq, uaddr);
 	}
+	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		u32 num_events;
+
+		if (!test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
+			return -ENODEV;
+
+		num_events = vcpu->arch.pmu.num_events;
+		return put_user(num_events, uaddr);
+	}
 	}
 
 	return -ENXIO;
@@ -1015,6 +1046,7 @@  int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	case KVM_ARM_VCPU_PMU_V3_IRQ:
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 	case KVM_ARM_VCPU_PMU_V3_FILTER:
+	case KVM_ARM_VCPU_PMU_V3_NUM_EVENTS:
 		if (kvm_arm_support_pmu_v3() &&
 		    test_bit(KVM_ARM_VCPU_PMU_V3, vcpu->arch.features))
 			return 0;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 20ab2a7d37ca..d51e39600bbd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -672,6 +672,11 @@  static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 	       | (ARMV8_PMU_PMCR_MASK & 0xdecafbad)) & (~ARMV8_PMU_PMCR_E);
 	if (!system_supports_32bit_el0())
 		val |= ARMV8_PMU_PMCR_LC;
+
+	/* Override number of event selectors */
+	val &= ~(ARMV8_PMU_PMCR_N_MASK << ARMV8_PMU_PMCR_N_SHIFT);
+	val |= (u32)vcpu->arch.pmu.num_events << ARMV8_PMU_PMCR_N_SHIFT;
+
 	__vcpu_sys_reg(vcpu, r->reg) = val;
 }
 
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 98cbfe885a53..ea3fc96a37d9 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -27,6 +27,7 @@  struct kvm_pmu {
 	bool ready;
 	bool created;
 	bool irq_level;
+	u8 num_events;
 };
 
 #define kvm_arm_pmu_v3_ready(v)		((v)->arch.pmu.ready)