
KVM: VMX: CPU frequency scaling for intel x86_64 KVM guests

Message ID 20220531105925.27676-1-jalliste@amazon.com (mailing list archive)
State New, archived

Commit Message

Jack Allister May 31, 2022, 10:59 a.m. UTC
A VMM can control a vCPU's CPU frequency by interfacing with KVM via
the vCPU file descriptor to enable/set CPU frequency scaling for a
guest. Instead of creating a separate IOCTL to do this, the KVM
capabilities are extended with a new capability called
KVM_CAP_CPU_FREQ_SCALING.

A generic set_cpu_freq_scaling interface is added to kvm_x86_ops
to allow for vendor-independent (AMD/Intel) CPU frequency scaling.

For Intel platforms, Hardware-Controlled Performance States (HWP) are
used to implement CPU scaling within the guest. Further information on
this mechanism can be seen in Intel SDM Vol 3B (section 14.4). The CPU
frequency is set as soon as this function is called and is kept running
until explicitly reset or set again.

Currently the AMD frequency setting interface is left unimplemented.

Please note that CPU frequency scaling will have an effect on host
processing in its current form. To change back to full performance
when running in host context, an IOCTL with a frequency value of 0
is needed to return to uncapped speed.

Signed-off-by: Jack Allister <jalliste@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/vmx/vmx.c          | 91 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 16 ++++++
 include/uapi/linux/kvm.h        |  1 +
 4 files changed, 110 insertions(+)

Comments

Kaya, Metin May 31, 2022, 11:43 a.m. UTC | #1
Thanks, Jack.

Reviewed-by: Metin Kaya <metikaya@amazon.co.uk>
Peter Zijlstra May 31, 2022, 1:38 p.m. UTC | #2
On Tue, May 31, 2022 at 10:59:25AM +0000, Jack Allister wrote:
> A VMM can control a vCPU's CPU frequency by interfacing with KVM via
> the vCPU file descriptor to enable/set CPU frequency scaling for a
> guest. Instead of creating a separate IOCTL to this this, KVM capabil-
> ities are extended to include a capability called
> KVM_CAP_CPU_FREQ_SCALING.
> 
> A generic set_cpu_freq interface is added to kvm_x86_ops
> to allow for architecture (AMD/Intel) independent CPU frequency
> scaling setting.
> 
> For Intel platforms, Hardware-Controlled Performance States (HWP) are
> used to implement CPU scaling within the guest. Further information on
> this mechanism can be seen in Intel SDM Vol 3B (section 14.4). The CPU
> frequency is set as soon as this function is called and is kept running
> until explicitly reset or set again.
> 
> Currently the AMD frequency setting interface is left unimplemented.
> 
> Please note that CPU frequency scaling will have an effect on host
> processing in it's current form. To change back to full performance
> when running in host context an IOCTL with a frequency value of 0
> is needed to run back at uncapped speed.

Nowhere does this explain *WHY* we would want to do this?
Jack Allister May 31, 2022, 2:02 p.m. UTC | #3
The reasoning behind this is that you may want to run a guest at a
lower CPU frequency for the purposes of trying to match performance
parity between a host of an older CPU type to a newer faster one.
Peter Zijlstra May 31, 2022, 2:44 p.m. UTC | #4
On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> The reasoning behind this is that you may want to run a guest at a
> lower CPU frequency for the purposes of trying to match performance
> parity between a host of an older CPU type to a newer faster one.

That's quite ludicrous. Also, then it should be the host enforcing the
cpufreq, not the guest.
Durrant, Paul May 31, 2022, 2:52 p.m. UTC | #5
> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: 31 May 2022 15:44
> To: Allister, Jack <jalliste@amazon.com>
> Cc: bp@alien8.de; diapop@amazon.co.uk; hpa@zytor.com; jmattson@google.com; joro@8bytes.org;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; metikaya@amazon.co.uk; mingo@redhat.com;
> pbonzini@redhat.com; rkrcmar@redhat.com; sean.j.christopherson@intel.com; tglx@linutronix.de;
> vkuznets@redhat.com; wanpengli@tencent.com; x86@kernel.org
> Subject: RE: [EXTERNAL]...
> 
> 
> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > The reasoning behind this is that you may want to run a guest at a
> > lower CPU frequency for the purposes of trying to match performance
> > parity between a host of an older CPU type to a newer faster one.
> 
> That's quite ludicrus. Also, then it should be the host enforcing the
> cpufreq, not the guest.

I'll bite... What's ludicrous about wanting to run a guest at a lower CPU freq to minimize observable change in whatever workload it is running?

  Paul
Paolo Bonzini May 31, 2022, 2:52 p.m. UTC | #6
On Tue, May 31, 2022 at 4:45 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > The reasoning behind this is that you may want to run a guest at a
> > lower CPU frequency for the purposes of trying to match performance
> > parity between a host of an older CPU type to a newer faster one.
>
> That's quite ludicrus. Also, then it should be the host enforcing the
> cpufreq, not the guest.

It is a weird use case indeed, but actually it *is* enforced by the
host in Jack's patch.

On the other hand doing it by writing random MSRs, with no save and
load when the task is scheduled or migrated, will not fly outside
Amazon datacenters. The patch also has traces of the Amazon kernel and
doesn't apply upstream.

Paolo
Vitaly Kuznetsov May 31, 2022, 3:01 p.m. UTC | #7
Jack Allister <jalliste@amazon.com> writes:

> A VMM can control a vCPU's CPU frequency by interfacing with KVM via
> the vCPU file descriptor to enable/set CPU frequency scaling for a
> guest. Instead of creating a separate IOCTL to this this, KVM capabil-
> ities are extended to include a capability called
> KVM_CAP_CPU_FREQ_SCALING.
>
> A generic set_cpu_freq interface is added to kvm_x86_ops
> to allow for architecture (AMD/Intel) independent CPU frequency
> scaling setting.
>
> For Intel platforms, Hardware-Controlled Performance States (HWP) are
> used to implement CPU scaling within the guest. Further information on
> this mechanism can be seen in Intel SDM Vol 3B (section 14.4). The CPU
> frequency is set as soon as this function is called and is kept running
> until explicitly reset or set again.
>
> Currently the AMD frequency setting interface is left unimplemented.
>
> Please note that CPU frequency scaling will have an effect on host
> processing in it's current form. To change back to full performance
> when running in host context an IOCTL with a frequency value of 0
> is needed to run back at uncapped speed.

I may have missed something but it looks like it will also have an
effect on other guests running on the same CPU. Also, nothing guarantees
that the guest vCPU will not get migrated to another CPU. This looks
like a very hard to use interface.

What if you introduce a per-vCPU setting for the desired frequency and
set it every time the vCPU is loaded on a pCPU and then restore back the
original frequency when it is unloaded (and assuming nobody else touched
it while the vCPU was running)?

>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +
>  arch/x86/kvm/vmx/vmx.c          | 91 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              | 16 ++++++
>  include/uapi/linux/kvm.h        |  1 +
>  4 files changed, 110 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ae220f88f00d..d2efc2ce624f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1169,6 +1169,8 @@ struct kvm_x86_ops {
>  	bool (*rdtscp_supported)(void);
>  	bool (*invpcid_supported)(void);
>  
> +	int (*set_cpu_freq_scaling)(struct kvm_vcpu *vcpu, u8 freq_100mhz);
> +
>  	void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
>  
>  	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6499f371de58..beee39b57b13 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1699,6 +1699,95 @@ static bool vmx_invpcid_supported(void)
>  	return cpu_has_vmx_invpcid();
>  }
>  
> +static int vmx_query_cpu_freq_valid_freq(u8 freq)
> +{
> +#define MASK_PERF 0xFF
> +#define CAP_HIGHEST_SHIFT 0
> +#define CAP_LOWEST_SHIFT 24
> +#define CAP_HIGHEST_MASK (MASK_PERF << CAP_HIGHEST_SHIFT)
> +#define CAP_LOWEST_MASK (MASK_PERF << CAP_LOWEST_SHIFT)
> +	u64 cap_msr;
> +	u8 highest, lowest;
> +
> +	/* Query highest and lowest supported scaling. */
> +	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
> +	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
> +	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
> +
> +	if (freq < lowest || freq > highest)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static void vmx_set_cpu_freq_uncapped(void)
> +{
> +#define SHIFT_DESIRED_PERF 16
> +#define SHIFT_MAX_PERF 8
> +#define SHIFT_MIN_PERF 0
> +
> +	u64 cap_msr, req_msr;
> +	u8 highest, lowest;
> +
> +	/* Query the capabilities. */
> +	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
> +	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
> +	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
> +
> +	/* Set the desired to highest performance. */
> +	req_msr = ((highest & MASK_PERF) << SHIFT_DESIRED_PERF) |
> +		((highest & MASK_PERF) << SHIFT_MAX_PERF) |
> +		((lowest & MASK_PERF) << SHIFT_MIN_PERF);
> +	wrmsrl(MSR_HWP_REQUEST, req_msr);
> +}
> +
> +static void vmx_set_cpu_freq_capped(u8 freq_100mhz)
> +{
> +	u64 req_msr;
> +
> +	/* Populate the variable used for setting the HWP request. */
> +	req_msr = ((freq_100mhz & MASK_PERF) << SHIFT_DESIRED_PERF) |
> +		((freq_100mhz & MASK_PERF) << SHIFT_MAX_PERF) |
> +		((freq_100mhz & MASK_PERF) << SHIFT_MIN_PERF);
> +
> +	wrmsrl(MSR_HWP_REQUEST, req_msr);
> +}
> +
> +static int vmx_set_cpu_freq_scaling(struct kvm_vcpu *vcpu, u8 freq_100mhz)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 pm_before, req_msr;
> +	int rc;
> +
> +	/* Is HWP scaling supported? */
> +	if (!this_cpu_has(X86_FEATURE_HWP))
> +		return -ENODEV;
> +
> +	/*
> +	 * HWP needs to be enabled to query & use capabilities.
> +	 * This bit is W1Once so cannot be cleared after.
> +	 */
> +	rdmsrl(MSR_PM_ENABLE, pm_before);
> +	if ((pm_before & 1) == 0)
> +		wrmsrl(MSR_PM_ENABLE, pm_before | 1);
> +
> +	/*
> +	 * Check if setting to a specific value, if being set
> +	 * to zero this means return to uncapped frequency.
> +	 */
> +	if (freq_100mhz) {
> +		rc = vmx_query_cpu_freq_valid_freq(freq_100mhz);
> +
> +		if (rc)
> +			return rc;
> +
> +		vmx_set_cpu_freq_capped(freq_100mhz);
> +	} else
> +		vmx_set_cpu_freq_uncapped();
> +
> +	return 0;
> +}
> +
>  /*
>   * Swap MSR entry in host/guest MSR entry array.
>   */
> @@ -8124,6 +8213,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.rdtscp_supported = vmx_rdtscp_supported,
>  	.invpcid_supported = vmx_invpcid_supported,
>  
> +	.set_cpu_freq_scaling = vmx_set_cpu_freq_scaling,
> +
>  	.set_supported_cpuid = vmx_set_supported_cpuid,
>  
>  	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c33423a1a13d..9ae2ab102e01 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3669,6 +3669,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_SET_VAR_MTRR_COUNT:
>  	case KVM_CAP_X86_USER_SPACE_MSR:
>  	case KVM_CAP_X86_MSR_FILTER:
> +	case KVM_CAP_CPU_FREQ_SCALING:
>  		r = 1;
>  		break;
>  #ifdef CONFIG_KVM_XEN
> @@ -4499,6 +4500,19 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
>  	return r;
>  }
>  
> +static int kvm_cap_set_cpu_freq(struct kvm_vcpu *vcpu,
> +				       struct kvm_enable_cap *cap)
> +{
> +	u8 freq = (u8)cap->args[0];
> +
> +	/* Query whether this platform (Intel or AMD) support setting. */
> +	if (!kvm_x86_ops.set_cpu_freq_scaling)
> +		return -ENODEV;
> +
> +	/* Attempt to set to the frequency specified. */
> +	return kvm_x86_ops.set_cpu_freq_scaling(vcpu, freq);
> +}
> +
>  /*
>   * kvm_set_guest_paused() indicates to the guest kernel that it has been
>   * stopped by the hypervisor.  This function will be called from the host only.
> @@ -4553,6 +4567,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  		return kvm_x86_ops.enable_direct_tlbflush(vcpu);
>  	case KVM_CAP_SET_VAR_MTRR_COUNT:
>  		return kvm_mtrr_set_var_mtrr_count(vcpu, cap->args[0]);
> +	case KVM_CAP_CPU_FREQ_SCALING:
> +		return kvm_cap_set_cpu_freq(vcpu, cap);
>  
>  	default:
>  		return -EINVAL;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 831be0d2d5e4..273a3ab5590e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -874,6 +874,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_NO_POLL_ON_HLT 100003
>  #define KVM_CAP_MMU_USE_VMA_CAPMEM 100004
>  #define KVM_CAP_MMU_SUPPORT_DYNAMIC_CAPMEM 100005
> +#define KVM_CAP_CPU_FREQ_SCALING 100006
>  
>  #define KVM_CAP_IRQCHIP	  0
>  #define KVM_CAP_HLT	  1
Peter Zijlstra May 31, 2022, 3:27 p.m. UTC | #8
On Tue, May 31, 2022 at 04:52:56PM +0200, Paolo Bonzini wrote:
> On Tue, May 31, 2022 at 4:45 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > The reasoning behind this is that you may want to run a guest at a
> > > lower CPU frequency for the purposes of trying to match performance
> > > parity between a host of an older CPU type to a newer faster one.
> >
> > That's quite ludicrus. Also, then it should be the host enforcing the
> > cpufreq, not the guest.
> 
> It is a weird usecase indeed, but actually it *is* enforced by the
> host in Jack's patch.

Clearly I don't understand KVM much; I was thinking that since it was
mucking with the vmx code it was some guest interface.

If it is host control, then it's even more insane, since the host has
plenty existing interfaces for cpufreq control. No need to add more
crazy hacks like this.
Paolo Bonzini May 31, 2022, 3:51 p.m. UTC | #9
On 5/31/22 16:52, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Peter Zijlstra <peterz@infradead.org>
>> Sent: 31 May 2022 15:44
>> To: Allister, Jack <jalliste@amazon.com>
>> Cc: bp@alien8.de; diapop@amazon.co.uk; hpa@zytor.com; jmattson@google.com; joro@8bytes.org;
>> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; metikaya@amazon.co.uk; mingo@redhat.com;
>> pbonzini@redhat.com; rkrcmar@redhat.com; sean.j.christopherson@intel.com; tglx@linutronix.de;
>> vkuznets@redhat.com; wanpengli@tencent.com; x86@kernel.org
>> Subject: RE: [EXTERNAL]...
>>
>>
>> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
>>> The reasoning behind this is that you may want to run a guest at a
>>> lower CPU frequency for the purposes of trying to match performance
>>> parity between a host of an older CPU type to a newer faster one.
>>
>> That's quite ludicrus. Also, then it should be the host enforcing the
>> cpufreq, not the guest.
> 
> I'll bite... What's ludicrous about wanting to run a guest at a lower CPU freq to minimize observable change in whatever workload it is running?

Well, the right API is cpufreq, there's no need to make it a KVM 
functionality.

Paolo
Paolo Bonzini May 31, 2022, 6:11 p.m. UTC | #10
On 5/31/22 13:43, Metin Kaya wrote:
> Thanks, Jack.
> 
> Reviewed-by: Metin Kaya <metikaya@amazon.co.uk>
> 

Please try a bit harder.  "Reviewed-by" is neither "this matches what's 
been forever in the Amazon kernel" nor "I guarantee that Jack is a nice 
guy and doesn't screw up".  I'm sure he is but everybody screws up, and 
in this case the patch:

- does not even *apply* to the upstream kernel, because it uses 
(presumably Amazon-specific) CAP numbers above 10000

- does not work if the vCPU is moved from one physical CPU to another

- does not work if the intel_pstate driver writes to MSR_HWP_REQUEST

- does not include documentation for the new capability

- does not include a selftest

- is unacceptable anyway because, as mentioned in the cover letter, it 
isn't undone when the process exits

Jack, please understand that I am not really blaming you in any way, and 
ask some of your colleagues with upstream kernel experience (Alex Graf, 
David Woodhouse, Filippo Sironi, Jan Schoenherr, Amit Shah are the ones 
I know) which patches could be good targets for including upstream.

Paolo
Peter Zijlstra June 1, 2022, 6:52 a.m. UTC | #11
On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Peter Zijlstra <peterz@infradead.org>
> > Sent: 31 May 2022 15:44
> > To: Allister, Jack <jalliste@amazon.com>
> > Cc: bp@alien8.de; diapop@amazon.co.uk; hpa@zytor.com; jmattson@google.com; joro@8bytes.org;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; metikaya@amazon.co.uk; mingo@redhat.com;
> > pbonzini@redhat.com; rkrcmar@redhat.com; sean.j.christopherson@intel.com; tglx@linutronix.de;
> > vkuznets@redhat.com; wanpengli@tencent.com; x86@kernel.org
> > Subject: RE: [EXTERNAL]...
> > 
> > 
> > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > The reasoning behind this is that you may want to run a guest at a
> > > lower CPU frequency for the purposes of trying to match performance
> > > parity between a host of an older CPU type to a newer faster one.
> > 
> > That's quite ludicrus. Also, then it should be the host enforcing the
> > cpufreq, not the guest.
> 
> I'll bite... What's ludicrous about wanting to run a guest at a lower
> CPU freq to minimize observable change in whatever workload it is
> running?

*why* would you want to do that? Everybody wants their stuff done
faster.

If this is some hare-brained money scheme (mustn't give them what they
didn't pay for) then I really don't care.

On top of that, you can't hide uarch differences with cpufreq capping.

Also, it is probably more power efficient to let it run faster and idle
more, so you're not being environmental either.
Vitaly Kuznetsov June 1, 2022, 7:57 a.m. UTC | #12
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 5/31/22 16:52, Durrant, Paul wrote:
>>> -----Original Message-----
>>> From: Peter Zijlstra <peterz@infradead.org>
>>> Sent: 31 May 2022 15:44
>>> To: Allister, Jack <jalliste@amazon.com>
>>> Cc: bp@alien8.de; diapop@amazon.co.uk; hpa@zytor.com; jmattson@google.com; joro@8bytes.org;
>>> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; metikaya@amazon.co.uk; mingo@redhat.com;
>>> pbonzini@redhat.com; rkrcmar@redhat.com; sean.j.christopherson@intel.com; tglx@linutronix.de;
>>> vkuznets@redhat.com; wanpengli@tencent.com; x86@kernel.org
>>> Subject: RE: [EXTERNAL]...
>>>
>>>
>>> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
>>>> The reasoning behind this is that you may want to run a guest at a
>>>> lower CPU frequency for the purposes of trying to match performance
>>>> parity between a host of an older CPU type to a newer faster one.
>>>
>>> That's quite ludicrus. Also, then it should be the host enforcing the
>>> cpufreq, not the guest.
>> 
>> I'll bite... What's ludicrous about wanting to run a guest at a lower CPU freq to minimize observable change in whatever workload it is running?
>
> Well, the right API is cpufreq, there's no need to make it a KVM 
> functionality.

KVM may probably use the cpufreq API to run each vCPU at the desired
frequency: I don't quite see how this can be done with a VMM today when
it's not a 1-vCPU-per-1-pCPU setup.
Vitaly Kuznetsov June 1, 2022, 8:03 a.m. UTC | #13
Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:

...

>> 
>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>> CPU freq to minimize observable change in whatever workload it is
>> running?
>
> *why* would you want to do that? Everybody wants their stuff done
> faster.
>

FWIW, I can see a valid use-case: imagine you're running some software
which calibrates itself in the beginning to run at some desired real
time speed but then the VM running it has to be migrated to a host with
faster (newer) CPUs. I don't have a real-world example off the top of my
head but I remember some old DOS-era games were impossible to play on
newer CPUs because everything was happening too fast. Maybe that's the
case :-)
Christophe de Dinechin June 1, 2022, 8:25 a.m. UTC | #14
> On 1 Jun 2022, at 10:03, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
>> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> 
> ...
> 
>>> 
>>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>>> CPU freq to minimize observable change in whatever workload it is
>>> running?
>> 
>> *why* would you want to do that? Everybody wants their stuff done
>> faster.
>> 
> 
> FWIW, I can see a valid use-case: imagine you're running some software
> which calibrates itself in the beginning to run at some desired real
> time speed but then the VM running it has to be migrated to a host with
> faster (newer) CPUs. I don't have a real world examples out of top of my
> head but I remember some old DOS era games were impossible to play on
> newer CPUs because everything was happenning too fast. Maybe that's the
> case :-)

The PC version of Alpha Waves was such an example, but Frederick Raynal,
who did the port, said it was the last time he made the mistake. That was 1990 :-)

More seriously, what about mitigating timing-based remote attacks by
arbitrarily changing the CPU frequency and injecting noise in the timing?
That could be a valid use case, no? Although I can think of about a
million other ways of doing this more efficiently…


> 
> -- 
> Vitaly
>
Durrant, Paul June 1, 2022, 8:54 a.m. UTC | #15
> -----Original Message-----
[snip]
> >>
> >> I'll bite... What's ludicrous about wanting to run a guest at a lower
> >> CPU freq to minimize observable change in whatever workload it is
> >> running?
> >
> > *why* would you want to do that? Everybody wants their stuff done
> > faster.
> >
> 
> FWIW, I can see a valid use-case: imagine you're running some software
> which calibrates itself in the beginning to run at some desired real
> time speed but then the VM running it has to be migrated to a host with
> faster (newer) CPUs. I don't have a real world examples out of top of my
> head but I remember some old DOS era games were impossible to play on
> newer CPUs because everything was happenning too fast. Maybe that's the
> case :-)
> 

That is exactly the case. This is not 'some hare-brained money scheme'; there is genuine concern that moving a VM from old h/w to new h/w may cause it to run 'too fast', breaking any such calibration done by the guest OS/application. I also don't have any real-world examples, but bugs may well be reported and having a lever to address them is IMO a good idea.
However, I also agree with Paolo that KVM doesn't really need to be doing this when the VMM could do the job using cpufreq, so we'll pursue that option instead. (FWIW the reason for involving KVM was to do the freq adjustment right before entering the guest and then remove the cap right after VMEXIT).

  Paul
Paolo Bonzini June 1, 2022, 8:57 a.m. UTC | #16
On 6/1/22 10:54, Durrant, Paul wrote:
> That is exactly the case. This is not 'some hare-brained money
> scheme'; there is genuine concern that moving a VM from old h/w to
> new h/w may cause it to run 'too fast', breaking any such calibration
> done by the guest OS/application. I also don't have any real-world
> examples, but bugs may well be reported and having a lever to address
> them is IMO a good idea. However, I also agree with Paolo that KVM
> doesn't really need to be doing this when the VMM could do the job
> using cpufreq, so we'll pursue that option instead. (FWIW the reason
> for involving KVM was to do the freq adjustment right before entering
> the guest and then remove the cap right after VMEXIT).

But if so, you still would submit the full feature, wouldn't you?

Paul, thanks for chiming in, and sorry for leaving you out of the list 
of people that can help Jack with his upstreaming efforts.  :)

Paolo
Paolo Bonzini June 1, 2022, 8:59 a.m. UTC | #17
On 6/1/22 09:57, Vitaly Kuznetsov wrote:
>>> I'll bite... What's ludicrous about wanting to run a guest at a lower CPU freq to minimize observable change in whatever workload it is running?
>> Well, the right API is cpufreq, there's no need to make it a KVM
>> functionality.
> KVM may probably use the cpufreq API to run each vCPU at the desired
> frequency: I don't quite see how this can be done with a VMM today when
> it's not a 1-vCPU-per-1-pCPU setup.

True, but then there's also a policy issue, in that KVM shouldn't be 
allowed to *bump* the frequency if userspace would ordinarily not have 
access to the cpufreq files in sysfs.

All in all, I think it's simpler to let privileged userspace (which 
knows when it has a 1:1 mapping of vCPU to pCPU) handle it with cpufreq.

Paolo
Durrant, Paul June 1, 2022, 9:20 a.m. UTC | #18
> -----Original Message-----
> From: Paolo Bonzini <pbonzini@redhat.com>
> Sent: 01 June 2022 09:57
> To: Durrant, Paul <pdurrant@amazon.co.uk>; Vitaly Kuznetsov <vkuznets@redhat.com>; Peter Zijlstra
> <peterz@infradead.org>
> Cc: Allister, Jack <jalliste@amazon.com>; bp@alien8.de; diapop@amazon.co.uk; hpa@zytor.com;
> jmattson@google.com; joro@8bytes.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> metikaya@amazon.co.uk; mingo@redhat.com; rkrcmar@redhat.com; sean.j.christopherson@intel.com;
> tglx@linutronix.de; wanpengli@tencent.com; x86@kernel.org
> Subject: RE: [EXTERNAL]...
> 
> CAUTION: This email originated from outside of the organization. Do not click links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 6/1/22 10:54, Durrant, Paul wrote:
> > That is exactly the case. This is not 'some hare-brained money
> > scheme'; there is genuine concern that moving a VM from old h/w to
> > new h/w may cause it to run 'too fast', breaking any such calibration
> > done by the guest OS/application. I also don't have any real-world
> > examples, but bugs may well be reported and having a lever to address
> > them is IMO a good idea. However, I also agree with Paolo that KVM
> > doesn't really need to be doing this when the VMM could do the job
> > using cpufreq, so we'll pursue that option instead. (FWIW the reason
> > for involving KVM was to do the freq adjustment right before entering
> > the guest and then remove the cap right after VMEXIT).
> 
> But if so, you still would submit the full feature, wouldn't you?
> 

Yes; the commit message should have at least said that we'd follow up... but a full series would have been a better idea.

> Paul, thanks for chiming in, and sorry for leaving you out of the list
> of people that can help Jack with his upstreaming efforts.  :)
> 

NP.

  Paul

> Paolo
Amit Shah June 1, 2022, 9:43 a.m. UTC | #19
On Wed, 2022-06-01 at 08:52 +0200, Peter Zijlstra wrote:
> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> > > 
> > > 
> > > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > > The reasoning behind this is that you may want to run a guest at a
> > > > lower CPU frequency for the purposes of trying to match performance
> > > > parity between a host of an older CPU type to a newer faster one.
> > > 
> > > That's quite ludicrus. Also, then it should be the host enforcing the
> > > cpufreq, not the guest.
> > 
> > I'll bite... What's ludicrous about wanting to run a guest at a lower
> > CPU freq to minimize observable change in whatever workload it is
> > running?
> 
> *why* would you want to do that? Everybody wants their stuff done
> faster.

We're running out of older hardware on which VMs have been started, and
these have to be moved to newer hardware.

We want the customer experience to stay as close to the current
situation as possible (i.e. no surprises), as this is just a live-
migration event for these instances.

Live migration events happen today as well, within the same hardware
and hypervisor cluster.  But this hw deprecation thing is going to be
new -- meaning customers and workloads aren't used to having hw
characteristics change as part of LM events.

> If this is some hare-brained money scheme; must not give them if they
> didn't pay up then I really don't care.

Many workloads that are still tied to the older generation instances we
offer are there for a reason.  EC2's newer instance generations have a
better price and performance than the older ones; yet folks use the
older ones.  We don't want to guess as to why that is.  We just want
these workloads to continue running w/o changes or w/o customers having
to even think about these things, while running on supported hardware.

So as infrastructure providers, we're doing everything possible behind-
the-scenes to ensure there's as little disruption to existing workloads
as possible.

> On top of that, you can't hide uarch differences with cpufreq capping.

Yes, this move (old hw -> new hw) isn't supposed to be "hide from the
instances we're doing this".  It's rather "try to match the
capabilities to the older hw as much as possible".

Some software will adapt to these changes; some software won't.  We're
aiming to be ready for both scenarios as far as software allows us.

> Also, it is probably more power efficient to let it run faster and idle
> more, so you're not being environmental either.

Agreed; I can chat about that quite a bit, but that doesn't apply to
this context.

		Amit
Peter Zijlstra June 1, 2022, 10:19 a.m. UTC | #20
On Wed, Jun 01, 2022 at 10:59:17AM +0200, Paolo Bonzini wrote:
> On 6/1/22 09:57, Vitaly Kuznetsov wrote:
> > > > I'll bite... What's ludicrous about wanting to run a guest at a lower CPU freq to minimize observable change in whatever workload it is running?
> > > Well, the right API is cpufreq, there's no need to make it a KVM
> > > functionality.
> > KVM may probably use the cpufreq API to run each vCPU at the desired
> > frequency: I don't quite see how this can be done with a VMM today when
> > it's not a 1-vCPU-per-1-pCPU setup.
> 
> True, but then there's also a policy issue, in that KVM shouldn't be allowed
> to *bump* the frequency if userspace would ordinarily not have access to the
> cpufreq files in sysfs.

So, when using schedutil (which requires intel_pstate in passive mode),
there's the option to use per-task uclamps, which are somewhat
complicated but also affect cpufreq.
David Woodhouse June 1, 2022, 12:55 p.m. UTC | #21
On Tue, 2022-05-31 at 20:11 +0200, Paolo Bonzini wrote:
> On 5/31/22 13:43, Metin Kaya wrote:
> > Thanks, Jack.
> > 
> > Reviewed-by: Metin Kaya <metikaya@amazon.co.uk>
> > 
> 
> Please try a bit harder.  "Reviewed-by" is neither "this matches what's 
> been forever in the Amazon kernel" 

Oh, it hasn't made it to the Amazon kernel yet. I have started
threatening to *eat* people who even submit stuff to the internal
kernel for review without first posting it upstream.

That does mean you get to see it sooner, which is a mixed blessing :)

> - does not even *apply* to the upstream kernel, because it uses 
> (presumably Amazon-specific) CAP numbers above 10000
> 
> - does not work if the vCPU is moved from one physical CPU to another
> 
> - does not work if the intel_pstate driver writes to MSR_HWP_REQUEST
> 
> - does not include documentation for the new capability
> 
> - does not include a selftest
> 
> - is unacceptable anyway because, as mentioned in the cover letter, it 
> isn't undone when the process exits

That's a useful summary; thank you. I think we should definitely focus
on integrating with cpufreq too.

Is this even a KVM thing?

I don't think we really want to do the change on every vmexit/enter. We
could *maybe* make a case for doing it on entry and then removing the
limit when either returning to *userspace* or scheduling?

But I think even that probably isn't needed. Just let the VMM run at
the same frequency too, as a per-process setting.

> Jack, please understand that I am not really blaming you in any way, and 
> ask some of your colleagues with upstream kernel experience (Alex Graf, 
> David Woodhouse, Filippo Sironi, Jan Schoenherr, Amit Shah are the ones 
> I know) which patches could be good targets for including upstream.

Indeed. As a straw man there are worse ways to start the discussion.
Thanks, Jack.
David Woodhouse June 1, 2022, 1:14 p.m. UTC | #22
On Wed, 2022-06-01 at 08:52 +0200, Peter Zijlstra wrote:
> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> > > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > > The reasoning behind this is that you may want to run a guest at a
> > > > lower CPU frequency for the purposes of trying to match performance
> > > > parity between a host of an older CPU type to a newer faster one.
> > > 
> > > That's quite ludicrous. Also, then it should be the host enforcing the
> > > cpufreq, not the guest.
> > 
> > I'll bite... What's ludicrous about wanting to run a guest at a lower
> > CPU freq to minimize observable change in whatever workload it is
> > running?
> 
> *why* would you want to do that? Everybody wants their stuff done
> faster.
> 

Nah, lots of customers have existing workloads and they want them to
run the *same*. They don't want them to run *faster* because that could
expose existing bugs and race conditions in guest code that has worked
perfectly fine for years. They don't want us to stress-test it when it
was working fine before.

Hell, we are implementing guest transparent live migration to KVM from
*actual* Xen in order to let stuff "just continue to run as it did
before", when for many it would "just" be a case of rebuilding their
guest with new NVMe and network drivers.

> If this is some hare-brained money scheme; must not give them if they
> didn't pay up then I really don't care.

It's actually the other way round. The older instance types were more
expensive; prices generally went down over time, especially $/perf.

None of that eliminates customer inertia.

> On top of that, you can't hide uarch differences with cpufreq capping.

No, but you can bring the performance envelope within spitting
distance. This isn't about hiding the fact that they are now running on
Linux and on newer CPUs; it's about not *breaking* things too much.

> Also, it is probably more power efficient to let it run faster and idle
> more, so you're not being environmental either.

Not sure about that. I thought I saw analysis that something like the
last 5% of turbo performance cost 30% of the power budget in practice.

And running these Xen guests even scaled down on modern hardware is
still much more power efficient than running them on the original
hardware that we're migrating them from.
Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae220f88f00d..d2efc2ce624f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1169,6 +1169,8 @@  struct kvm_x86_ops {
 	bool (*rdtscp_supported)(void);
 	bool (*invpcid_supported)(void);
 
+	int (*set_cpu_freq_scaling)(struct kvm_vcpu *vcpu, u8 freq_100mhz);
+
 	void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
 
 	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6499f371de58..beee39b57b13 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1699,6 +1699,95 @@  static bool vmx_invpcid_supported(void)
 	return cpu_has_vmx_invpcid();
 }
 
+static int vmx_query_cpu_freq_valid_freq(u8 freq)
+{
+#define MASK_PERF 0xFF
+#define CAP_HIGHEST_SHIFT 0
+#define CAP_LOWEST_SHIFT 24
+#define CAP_HIGHEST_MASK (MASK_PERF << CAP_HIGHEST_SHIFT)
+#define CAP_LOWEST_MASK (MASK_PERF << CAP_LOWEST_SHIFT)
+	u64 cap_msr;
+	u8 highest, lowest;
+
+	/* Query highest and lowest supported scaling. */
+	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
+	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
+	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
+
+	if (freq < lowest || freq > highest)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void vmx_set_cpu_freq_uncapped(void)
+{
+#define SHIFT_DESIRED_PERF 16
+#define SHIFT_MAX_PERF 8
+#define SHIFT_MIN_PERF 0
+
+	u64 cap_msr, req_msr;
+	u8 highest, lowest;
+
+	/* Query the capabilities. */
+	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
+	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
+	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
+
+	/* Set the desired to highest performance. */
+	req_msr = ((highest & MASK_PERF) << SHIFT_DESIRED_PERF) |
+		((highest & MASK_PERF) << SHIFT_MAX_PERF) |
+		((lowest & MASK_PERF) << SHIFT_MIN_PERF);
+	wrmsrl(MSR_HWP_REQUEST, req_msr);
+}
+
+static void vmx_set_cpu_freq_capped(u8 freq_100mhz)
+{
+	u64 req_msr;
+
+	/* Populate the variable used for setting the HWP request. */
+	req_msr = ((freq_100mhz & MASK_PERF) << SHIFT_DESIRED_PERF) |
+		((freq_100mhz & MASK_PERF) << SHIFT_MAX_PERF) |
+		((freq_100mhz & MASK_PERF) << SHIFT_MIN_PERF);
+
+	wrmsrl(MSR_HWP_REQUEST, req_msr);
+}
+
+static int vmx_set_cpu_freq_scaling(struct kvm_vcpu *vcpu, u8 freq_100mhz)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 pm_before, req_msr;
+	int rc;
+
+	/* Is HWP scaling supported? */
+	if (!this_cpu_has(X86_FEATURE_HWP))
+		return -ENODEV;
+
+	/*
+	 * HWP needs to be enabled to query & use capabilities.
+	 * This bit is W1Once so cannot be cleared after.
+	 */
+	rdmsrl(MSR_PM_ENABLE, pm_before);
+	if ((pm_before & 1) == 0)
+		wrmsrl(MSR_PM_ENABLE, pm_before | 1);
+
+	/*
+	 * A non-zero value caps the frequency; a value of
+	 * zero means a return to the uncapped frequency.
+	 */
+	if (freq_100mhz) {
+		rc = vmx_query_cpu_freq_valid_freq(freq_100mhz);
+
+		if (rc)
+			return rc;
+
+		vmx_set_cpu_freq_capped(freq_100mhz);
+	} else
+		vmx_set_cpu_freq_uncapped();
+
+	return 0;
+}
+
 /*
  * Swap MSR entry in host/guest MSR entry array.
  */
@@ -8124,6 +8213,8 @@  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.rdtscp_supported = vmx_rdtscp_supported,
 	.invpcid_supported = vmx_invpcid_supported,
 
+	.set_cpu_freq_scaling = vmx_set_cpu_freq_scaling,
+
 	.set_supported_cpuid = vmx_set_supported_cpuid,
 
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c33423a1a13d..9ae2ab102e01 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3669,6 +3669,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_VAR_MTRR_COUNT:
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
+	case KVM_CAP_CPU_FREQ_SCALING:
 		r = 1;
 		break;
 #ifdef CONFIG_KVM_XEN
@@ -4499,6 +4500,19 @@  static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
 	return r;
 }
 
+static int kvm_cap_set_cpu_freq(struct kvm_vcpu *vcpu,
+				       struct kvm_enable_cap *cap)
+{
+	u8 freq = (u8)cap->args[0];
+
+	/* Query whether this platform (Intel or AMD) supports setting. */
+	if (!kvm_x86_ops.set_cpu_freq_scaling)
+		return -ENODEV;
+
+	/* Attempt to set to the frequency specified. */
+	return kvm_x86_ops.set_cpu_freq_scaling(vcpu, freq);
+}
+
 /*
  * kvm_set_guest_paused() indicates to the guest kernel that it has been
  * stopped by the hypervisor.  This function will be called from the host only.
@@ -4553,6 +4567,8 @@  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		return kvm_x86_ops.enable_direct_tlbflush(vcpu);
 	case KVM_CAP_SET_VAR_MTRR_COUNT:
 		return kvm_mtrr_set_var_mtrr_count(vcpu, cap->args[0]);
+	case KVM_CAP_CPU_FREQ_SCALING:
+		return kvm_cap_set_cpu_freq(vcpu, cap);
 
 	default:
 		return -EINVAL;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 831be0d2d5e4..273a3ab5590e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -874,6 +874,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_NO_POLL_ON_HLT 100003
 #define KVM_CAP_MMU_USE_VMA_CAPMEM 100004
 #define KVM_CAP_MMU_SUPPORT_DYNAMIC_CAPMEM 100005
+#define KVM_CAP_CPU_FREQ_SCALING 100006
 
 #define KVM_CAP_IRQCHIP	  0
 #define KVM_CAP_HLT	  1