Message ID | 20220531105925.27676-1-jalliste@amazon.com (mailing list archive)
---|---
State | New, archived
Series | KVM: VMX: CPU frequency scaling for intel x86_64 KVM guests
Thanks, Jack.
Reviewed-by: Metin Kaya <metikaya@amazon.co.uk>
On Tue, May 31, 2022 at 10:59:25AM +0000, Jack Allister wrote:
> A VMM can control a vCPU's CPU frequency by interfacing with KVM via
> the vCPU file descriptor to enable/set CPU frequency scaling for a
> guest. Instead of creating a separate IOCTL to this this, KVM
> capabilities are extended to include a capability called
> KVM_CAP_CPU_FREQ_SCALING.
>
> A generic set_cpu_freq interface is added to kvm_x86_ops
> to allow for architecture (AMD/Intel) independent CPU frequency
> scaling setting.
>
> For Intel platforms, Hardware-Controlled Performance States (HWP) are
> used to implement CPU scaling within the guest. Further information on
> this mechanism can be seen in Intel SDM Vol 3B (section 14.4). The CPU
> frequency is set as soon as this function is called and is kept running
> until explicitly reset or set again.
>
> Currently the AMD frequency setting interface is left unimplemented.
>
> Please note that CPU frequency scaling will have an effect on host
> processing in it's current form. To change back to full performance
> when running in host context an IOCTL with a frequency value of 0
> is needed to run back at uncapped speed.

Nowhere does this explain *WHY* we would want to do this?
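For context, a minimal sketch of how a VMM would drive the interface this patch proposes: the cap is enabled per vCPU via KVM_ENABLE_CAP with args[0] carrying the frequency in 100 MHz units (0 meaning uncapped). Note KVM_CAP_CPU_FREQ_SCALING is not an upstream capability number; the value below merely mirrors the one in the patch.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_CPU_FREQ_SCALING
#define KVM_CAP_CPU_FREQ_SCALING 100006	/* out-of-tree value from the patch */
#endif

/* Build the KVM_ENABLE_CAP argument; args[0] is the cap in 100 MHz units. */
static struct kvm_enable_cap make_freq_cap(unsigned char freq_100mhz)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_CPU_FREQ_SCALING;
	cap.args[0] = freq_100mhz;
	return cap;
}

/* Issued against a vCPU fd; freq_100mhz == 0 restores uncapped operation. */
static int set_vcpu_freq(int vcpu_fd, unsigned char freq_100mhz)
{
	struct kvm_enable_cap cap = make_freq_cap(freq_100mhz);

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
```

The ioctl path is exactly the standard per-vCPU KVM_ENABLE_CAP mechanism; only the capability number and argument encoding come from the patch.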
The reasoning behind this is that you may want to run a guest at a lower CPU frequency for the purposes of trying to match performance parity between a host of an older CPU type to a newer faster one.
On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> The reasoning behind this is that you may want to run a guest at a
> lower CPU frequency for the purposes of trying to match performance
> parity between a host of an older CPU type to a newer faster one.

That's quite ludicrus. Also, then it should be the host enforcing the
cpufreq, not the guest.
> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: 31 May 2022 15:44
> To: Allister, Jack <jalliste@amazon.com>
> Cc: bp@alien8.de; diapop@amazon.co.uk; hpa@zytor.com; jmattson@google.com; joro@8bytes.org;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; metikaya@amazon.co.uk; mingo@redhat.com;
> pbonzini@redhat.com; rkrcmar@redhat.com; sean.j.christopherson@intel.com; tglx@linutronix.de;
> vkuznets@redhat.com; wanpengli@tencent.com; x86@kernel.org
> Subject: RE: [EXTERNAL]...
>
> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > The reasoning behind this is that you may want to run a guest at a
> > lower CPU frequency for the purposes of trying to match performance
> > parity between a host of an older CPU type to a newer faster one.
>
> That's quite ludicrus. Also, then it should be the host enforcing the
> cpufreq, not the guest.

I'll bite... What's ludicrous about wanting to run a guest at a lower
CPU freq to minimize observable change in whatever workload it is
running?

  Paul
On Tue, May 31, 2022 at 4:45 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > The reasoning behind this is that you may want to run a guest at a
> > lower CPU frequency for the purposes of trying to match performance
> > parity between a host of an older CPU type to a newer faster one.
>
> That's quite ludicrus. Also, then it should be the host enforcing the
> cpufreq, not the guest.

It is a weird usecase indeed, but actually it *is* enforced by the host
in Jack's patch. On the other hand doing it by writing random MSRs, with
no save and load when the task is scheduled or migrated, will not fly
outside Amazon datacenters. The patch also has traces of the Amazon
kernel and doesn't apply upstream.

Paolo
Jack Allister <jalliste@amazon.com> writes:
> A VMM can control a vCPU's CPU frequency by interfacing with KVM via
> the vCPU file descriptor to enable/set CPU frequency scaling for a
> guest. Instead of creating a separate IOCTL to this this, KVM
> capabilities are extended to include a capability called
> KVM_CAP_CPU_FREQ_SCALING.
>
> A generic set_cpu_freq interface is added to kvm_x86_ops
> to allow for architecture (AMD/Intel) independent CPU frequency
> scaling setting.
>
> For Intel platforms, Hardware-Controlled Performance States (HWP) are
> used to implement CPU scaling within the guest. Further information on
> this mechanism can be seen in Intel SDM Vol 3B (section 14.4). The CPU
> frequency is set as soon as this function is called and is kept running
> until explicitly reset or set again.
>
> Currently the AMD frequency setting interface is left unimplemented.
>
> Please note that CPU frequency scaling will have an effect on host
> processing in it's current form. To change back to full performance
> when running in host context an IOCTL with a frequency value of 0
> is needed to run back at uncapped speed.

I may have missed something but it looks like it will also have an
effect on other guests running on the same CPU. Also, nothing guarantees
that the guest vCPU will not get migrated to another CPU. This looks
like a very hard to use interface.

What if you introduce a per-vCPU setting for the desired frequency and
set it every time the vCPU is loaded on a pCPU and then restore back the
original frequency when it is unloaded (and assuming nobody else touched
it while the vCPU was running)?
>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +
>  arch/x86/kvm/vmx/vmx.c          | 91 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              | 16 ++++++
>  include/uapi/linux/kvm.h        |  1 +
>  4 files changed, 110 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ae220f88f00d..d2efc2ce624f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1169,6 +1169,8 @@ struct kvm_x86_ops {
> 	bool (*rdtscp_supported)(void);
> 	bool (*invpcid_supported)(void);
>
> +	int (*set_cpu_freq_scaling)(struct kvm_vcpu *vcpu, u8 freq_100mhz);
> +
> 	void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
>
> 	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6499f371de58..beee39b57b13 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1699,6 +1699,95 @@ static bool vmx_invpcid_supported(void)
> 	return cpu_has_vmx_invpcid();
> }
>
> +static int vmx_query_cpu_freq_valid_freq(u8 freq)
> +{
> +#define MASK_PERF 0xFF
> +#define CAP_HIGHEST_SHIFT 0
> +#define CAP_LOWEST_SHIFT 24
> +#define CAP_HIGHEST_MASK (MASK_PERF << CAP_HIGHEST_SHIFT)
> +#define CAP_LOWEST_MASK (MASK_PERF << CAP_LOWEST_SHIFT)
> +	u64 cap_msr;
> +	u8 highest, lowest;
> +
> +	/* Query highest and lowest supported scaling. */
> +	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
> +	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
> +	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
> +
> +	if (freq < lowest || freq > highest)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static void vmx_set_cpu_freq_uncapped(void)
> +{
> +#define SHIFT_DESIRED_PERF 16
> +#define SHIFT_MAX_PERF 8
> +#define SHIFT_MIN_PERF 0
> +
> +	u64 cap_msr, req_msr;
> +	u8 highest, lowest;
> +
> +	/* Query the capabilities. */
> +	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
> +	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
> +	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
> +
> +	/* Set the desired to highest performance. */
> +	req_msr = ((highest & MASK_PERF) << SHIFT_DESIRED_PERF) |
> +		  ((highest & MASK_PERF) << SHIFT_MAX_PERF) |
> +		  ((lowest & MASK_PERF) << SHIFT_MIN_PERF);
> +	wrmsrl(MSR_HWP_REQUEST, req_msr);
> +}
> +
> +static void vmx_set_cpu_freq_capped(u8 freq_100mhz)
> +{
> +	u64 req_msr;
> +
> +	/* Populate the variable used for setting the HWP request. */
> +	req_msr = ((freq_100mhz & MASK_PERF) << SHIFT_DESIRED_PERF) |
> +		  ((freq_100mhz & MASK_PERF) << SHIFT_MAX_PERF) |
> +		  ((freq_100mhz & MASK_PERF) << SHIFT_MIN_PERF);
> +
> +	wrmsrl(MSR_HWP_REQUEST, req_msr);
> +}
> +
> +static int vmx_set_cpu_freq_scaling(struct kvm_vcpu *vcpu, u8 freq_100mhz)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 pm_before, req_msr;
> +	int rc;
> +
> +	/* Is HWP scaling supported? */
> +	if (!this_cpu_has(X86_FEATURE_HWP))
> +		return -ENODEV;
> +
> +	/*
> +	 * HWP needs to be enabled to query & use capabilities.
> +	 * This bit is W1Once so cannot be cleared after.
> +	 */
> +	rdmsrl(MSR_PM_ENABLE, pm_before);
> +	if ((pm_before & 1) == 0)
> +		wrmsrl(MSR_PM_ENABLE, pm_before | 1);
> +
> +	/*
> +	 * Check if setting to a specific value, if being set
> +	 * to zero this means return to uncapped frequency.
> +	 */
> +	if (freq_100mhz) {
> +		rc = vmx_query_cpu_freq_valid_freq(freq_100mhz);
> +
> +		if (rc)
> +			return rc;
> +
> +		vmx_set_cpu_freq_capped(freq_100mhz);
> +	} else
> +		vmx_set_cpu_freq_uncapped();
> +
> +	return 0;
> +}
> +
> /*
>  * Swap MSR entry in host/guest MSR entry array.
>  */
> @@ -8124,6 +8213,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
> 	.rdtscp_supported = vmx_rdtscp_supported,
> 	.invpcid_supported = vmx_invpcid_supported,
>
> +	.set_cpu_freq_scaling = vmx_set_cpu_freq_scaling,
> +
> 	.set_supported_cpuid = vmx_set_supported_cpuid,
>
> 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c33423a1a13d..9ae2ab102e01 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3669,6 +3669,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> 	case KVM_CAP_SET_VAR_MTRR_COUNT:
> 	case KVM_CAP_X86_USER_SPACE_MSR:
> 	case KVM_CAP_X86_MSR_FILTER:
> +	case KVM_CAP_CPU_FREQ_SCALING:
> 		r = 1;
> 		break;
> #ifdef CONFIG_KVM_XEN
> @@ -4499,6 +4500,19 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
> 	return r;
> }
>
> +static int kvm_cap_set_cpu_freq(struct kvm_vcpu *vcpu,
> +				struct kvm_enable_cap *cap)
> +{
> +	u8 freq = (u8)cap->args[0];
> +
> +	/* Query whether this platform (Intel or AMD) support setting. */
> +	if (!kvm_x86_ops.set_cpu_freq_scaling)
> +		return -ENODEV;
> +
> +	/* Attempt to set to the frequency specified. */
> +	return kvm_x86_ops.set_cpu_freq_scaling(vcpu, freq);
> +}
> +
> /*
>  * kvm_set_guest_paused() indicates to the guest kernel that it has been
>  * stopped by the hypervisor. This function will be called from the host only.
> @@ -4553,6 +4567,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> 		return kvm_x86_ops.enable_direct_tlbflush(vcpu);
> 	case KVM_CAP_SET_VAR_MTRR_COUNT:
> 		return kvm_mtrr_set_var_mtrr_count(vcpu, cap->args[0]);
> +	case KVM_CAP_CPU_FREQ_SCALING:
> +		return kvm_cap_set_cpu_freq(vcpu, cap);
>
> 	default:
> 		return -EINVAL;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 831be0d2d5e4..273a3ab5590e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -874,6 +874,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_NO_POLL_ON_HLT 100003
> #define KVM_CAP_MMU_USE_VMA_CAPMEM 100004
> #define KVM_CAP_MMU_SUPPORT_DYNAMIC_CAPMEM 100005
> +#define KVM_CAP_CPU_FREQ_SCALING 100006
>
> #define KVM_CAP_IRQCHIP 0
> #define KVM_CAP_HLT 1
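Vitaly's suggested scheme — program the cap when the vCPU is loaded on a pCPU and restore the previous MSR_HWP_REQUEST when it is unloaded — can be sketched in plain userspace C with mocked MSR accessors. The hook names and the fake rdmsr/wrmsr stand in for kernel code; the bit layout matches the patch (bits 0-7 minimum, 8-15 maximum, 16-23 desired performance).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the pCPU's MSR_HWP_REQUEST; in the kernel this is rdmsrl/wrmsrl. */
static uint64_t fake_hwp_request;

static uint64_t rd_hwp_request(void) { return fake_hwp_request; }
static void wr_hwp_request(uint64_t v) { fake_hwp_request = v; }

/* Pack min/max/desired performance (100 MHz units) per the patch's layout. */
static uint64_t hwp_request(uint8_t min, uint8_t max, uint8_t desired)
{
	return ((uint64_t)desired << 16) | ((uint64_t)max << 8) | min;
}

struct vcpu {
	uint8_t freq_100mhz;	/* 0 means "no cap requested" */
	uint64_t saved_req;	/* host value captured at load time */
};

/* Hypothetical vcpu_load hook: save host state, apply the per-vCPU cap. */
static void vcpu_load_freq(struct vcpu *v)
{
	v->saved_req = rd_hwp_request();
	if (v->freq_100mhz)
		wr_hwp_request(hwp_request(v->freq_100mhz, v->freq_100mhz,
					   v->freq_100mhz));
}

/* Hypothetical vcpu_put hook: restore whatever the host had before. */
static void vcpu_put_freq(struct vcpu *v)
{
	wr_hwp_request(v->saved_req);
}
```

The save/restore pair is what the posted patch lacks: it addresses vCPU migration across pCPUs and leaves the host value intact after the vCPU is descheduled (though, as noted later in the thread, it would still race with intel_pstate writing MSR_HWP_REQUEST).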
On Tue, May 31, 2022 at 04:52:56PM +0200, Paolo Bonzini wrote:
> On Tue, May 31, 2022 at 4:45 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > The reasoning behind this is that you may want to run a guest at a
> > > lower CPU frequency for the purposes of trying to match performance
> > > parity between a host of an older CPU type to a newer faster one.
> >
> > That's quite ludicrus. Also, then it should be the host enforcing the
> > cpufreq, not the guest.
>
> It is a weird usecase indeed, but actually it *is* enforced by the
> host in Jack's patch.

Clearly I don't understand KVM much; I was thinking that since it was
mucking with the vmx code it was some guest interface.

If it is host control, then it's even more insane, since the host has
plenty of existing interfaces for cpufreq control. No need to add more
crazy hacks like this.
On 5/31/22 16:52, Durrant, Paul wrote:
>> -----Original Message-----
[snip]
>> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
>>> The reasoning behind this is that you may want to run a guest at a
>>> lower CPU frequency for the purposes of trying to match performance
>>> parity between a host of an older CPU type to a newer faster one.
>>
>> That's quite ludicrus. Also, then it should be the host enforcing the
>> cpufreq, not the guest.
>
> I'll bite... What's ludicrous about wanting to run a guest at a lower
> CPU freq to minimize observable change in whatever workload it is
> running?

Well, the right API is cpufreq, there's no need to make it a KVM
functionality.

Paolo
On 5/31/22 13:43, Metin Kaya wrote:
> Thanks, Jack.
>
> Reviewed-by: Metin Kaya <metikaya@amazon.co.uk>

Please try a bit harder. "Reviewed-by" is neither "this matches what's
been forever in the Amazon kernel" nor "I guarantee that Jack is a nice
guy and doesn't screw up". I'm sure he is but everybody screws up, and
in this case the patch:

- does not even *apply* to the upstream kernel, because it uses
  (presumably Amazon-specific) CAP numbers above 10000

- does not work if the vCPU is moved from one physical CPU to another

- does not work if the intel_pstate driver writes to MSR_HWP_REQUEST

- does not include documentation for the new capability

- does not include a selftest

- is unacceptable anyway because, as mentioned in the cover letter, it
  isn't undone when the process exits

Jack, please understand that I am not really blaming you in any way, and
ask some of your colleagues with upstream kernel experience (Alex Graf,
David Woodhouse, Filippo Sironi, Jan Schoenherr, Amit Shah are the ones
I know) which patches could be good targets for including upstream.

Paolo
On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
[snip]
> > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > The reasoning behind this is that you may want to run a guest at a
> > > lower CPU frequency for the purposes of trying to match performance
> > > parity between a host of an older CPU type to a newer faster one.
> >
> > That's quite ludicrus. Also, then it should be the host enforcing the
> > cpufreq, not the guest.
>
> I'll bite... What's ludicrous about wanting to run a guest at a lower
> CPU freq to minimize observable change in whatever workload it is
> running?

*why* would you want to do that? Everybody wants their stuff done
faster.

If this is some hare-brained money scheme; must not give them if they
didn't pay up then I really don't care.

On top of that, you can't hide uarch differences with cpufreq capping.

Also, it is probably more power efficient to let it run faster and idle
more, so you're not being environmental either.
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 5/31/22 16:52, Durrant, Paul wrote:
>>> -----Original Message-----
[snip]
>>> On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
>>>> The reasoning behind this is that you may want to run a guest at a
>>>> lower CPU frequency for the purposes of trying to match performance
>>>> parity between a host of an older CPU type to a newer faster one.
>>>
>>> That's quite ludicrus. Also, then it should be the host enforcing the
>>> cpufreq, not the guest.
>>
>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>> CPU freq to minimize observable change in whatever workload it is
>> running?
>
> Well, the right API is cpufreq, there's no need to make it a KVM
> functionality.

KVM may probably use the cpufreq API to run each vCPU at the desired
frequency: I don't quite see how this can be done with a VMM today when
it's not a 1-vCPU-per-1-pCPU setup.
Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:

...

>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>> CPU freq to minimize observable change in whatever workload it is
>> running?
>
> *why* would you want to do that? Everybody wants their stuff done
> faster.

FWIW, I can see a valid use-case: imagine you're running some software
which calibrates itself in the beginning to run at some desired real
time speed but then the VM running it has to be migrated to a host with
faster (newer) CPUs. I don't have a real world examples out of top of my
head but I remember some old DOS era games were impossible to play on
newer CPUs because everything was happenning too fast. Maybe that's the
case :-)
> On 1 Jun 2022, at 10:03, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
>
> ...
>
>>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>>> CPU freq to minimize observable change in whatever workload it is
>>> running?
>>
>> *why* would you want to do that? Everybody wants their stuff done
>> faster.
>
> FWIW, I can see a valid use-case: imagine you're running some software
> which calibrates itself in the beginning to run at some desired real
> time speed but then the VM running it has to be migrated to a host with
> faster (newer) CPUs. I don't have a real world examples out of top of my
> head but I remember some old DOS era games were impossible to play on
> newer CPUs because everything was happenning too fast. Maybe that's the
> case :-)

The PC version of Alpha Waves was such an example, but Frederick Raynal,
who did the port, said it was the last time he made the mistake. That
was 1990 :-)

More seriously, what about mitigating timing-based remote attacks by
arbitrarily changing the CPU frequency and injecting noise in the
timing? That could be a valid use case, no? Although I can think of
about a million other ways of doing this more efficiently…
> -----Original Message-----
[snip]
>>>
>>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>>> CPU freq to minimize observable change in whatever workload it is
>>> running?
>>
>> *why* would you want to do that? Everybody wants their stuff done
>> faster.
>>
>
> FWIW, I can see a valid use-case: imagine you're running some software
> which calibrates itself in the beginning to run at some desired real
> time speed but then the VM running it has to be migrated to a host with
> faster (newer) CPUs. I don't have a real world examples out of top of my
> head but I remember some old DOS era games were impossible to play on
> newer CPUs because everything was happenning too fast. Maybe that's the
> case :-)

That is exactly the case. This is not 'some hare-brained money scheme';
there is genuine concern that moving a VM from old h/w to new h/w may
cause it to run 'too fast', breaking any such calibration done by the
guest OS/application. I also don't have any real-world examples, but
bugs may well be reported and having a lever to address them is IMO a
good idea. However, I also agree with Paolo that KVM doesn't really need
to be doing this when the VMM could do the job using cpufreq, so we'll
pursue that option instead. (FWIW the reason for involving KVM was to do
the freq adjustment right before entering the guest and then remove the
cap right after VMEXIT).

  Paul
On 6/1/22 10:54, Durrant, Paul wrote:
> That is exactly the case. This is not 'some hare-brained money
> scheme'; there is genuine concern that moving a VM from old h/w to
> new h/w may cause it to run 'too fast', breaking any such calibration
> done by the guest OS/application. I also don't have any real-world
> examples, but bugs may well be reported and having a lever to address
> them is IMO a good idea. However, I also agree with Paolo that KVM
> doesn't really need to be doing this when the VMM could do the job
> using cpufreq, so we'll pursue that option instead. (FWIW the reason
> for involving KVM was to do the freq adjustment right before entering
> the guest and then remove the cap right after VMEXIT).

But if so, you still would submit the full feature, wouldn't you?

Paul, thanks for chiming in, and sorry for leaving you out of the list
of people that can help Jack with his upstreaming efforts. :)

Paolo
On 6/1/22 09:57, Vitaly Kuznetsov wrote:
>>> I'll bite... What's ludicrous about wanting to run a guest at a lower
>>> CPU freq to minimize observable change in whatever workload it is
>>> running?
>> Well, the right API is cpufreq, there's no need to make it a KVM
>> functionality.
> KVM may probably use the cpufreq API to run each vCPU at the desired
> frequency: I don't quite see how this can be done with a VMM today when
> it's not a 1-vCPU-per-1-pCPU setup.

True, but then there's also a policy issue, in that KVM shouldn't be
allowed to *bump* the frequency if userspace would ordinarily not have
access to the cpufreq files in sysfs.

All in all, I think it's simpler to let privileged userspace (which
knows when it has a 1:1 mapping of vCPU to pCPU) handle it with cpufreq.

Paolo
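A rough sketch of the userspace approach suggested here: a privileged VMM that has pinned vCPU N 1:1 onto pCPU N can cap that core through the standard cpufreq sysfs interface instead of having KVM touch MSRs. The sysfs path and the `scaling_max_freq` attribute are the stock cpufreq ones; error handling is minimal and the 1:1 pinning is assumed, not shown.

```c
#include <stdio.h>

/* Build the sysfs path for a pCPU's scaling_max_freq knob. */
static int max_freq_path(char *buf, size_t len, int pcpu)
{
	return snprintf(buf, len,
			"/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq",
			pcpu);
}

/* Cap one pCPU in kHz; write cpuinfo_max_freq's value back to uncap. */
static int set_max_freq_khz(int pcpu, unsigned long khz)
{
	char path[128];
	FILE *f;

	max_freq_path(path, sizeof(path), pcpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lu\n", khz);
	return fclose(f);
}
```

Unlike the MSR-poking patch, this survives vCPU preemption on that core and is undone simply by writing the original maximum back, though it of course caps everything sharing the pCPU, which is exactly why the 1:1 mapping matters.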
> -----Original Message-----
[snip]
>
> On 6/1/22 10:54, Durrant, Paul wrote:
> > That is exactly the case. This is not 'some hare-brained money
> > scheme'; there is genuine concern that moving a VM from old h/w to
> > new h/w may cause it to run 'too fast', breaking any such calibration
> > done by the guest OS/application. I also don't have any real-world
> > examples, but bugs may well be reported and having a lever to address
> > them is IMO a good idea. However, I also agree with Paolo that KVM
> > doesn't really need to be doing this when the VMM could do the job
> > using cpufreq, so we'll pursue that option instead. (FWIW the reason
> > for involving KVM was to do the freq adjustment right before entering
> > the guest and then remove the cap right after VMEXIT).
>
> But if so, you still would submit the full feature, wouldn't you?

Yes; the commit message should have at least said that we'd follow up...
but a full series would have been a better idea.

> Paul, thanks for chiming in, and sorry for leaving you out of the list
> of people that can help Jack with his upstreaming efforts. :)

NP.

  Paul

> Paolo
On Wed, 2022-06-01 at 08:52 +0200, Peter Zijlstra wrote:
> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> > > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > > The reasoning behind this is that you may want to run a guest at a
> > > > lower CPU frequency for the purposes of trying to match performance
> > > > parity between a host of an older CPU type to a newer faster one.
> > >
> > > That's quite ludicrus. Also, then it should be the host enforcing the
> > > cpufreq, not the guest.
> >
> > I'll bite... What's ludicrous about wanting to run a guest at a lower
> > CPU freq to minimize observable change in whatever workload it is
> > running?
>
> *why* would you want to do that? Everybody wants their stuff done
> faster.

We're running out of older hardware on which VMs have been started, and
these have to be moved to newer hardware. We want the customer
experience to stay as close to the current situation as possible
(i.e. no surprises), as this is just a live-migration event for these
instances.

Live migration events happen today as well, within the same hardware and
hypervisor cluster. But this hw deprecation thing is going to be new --
meaning customers and workloads aren't used to having hw characteristics
change as part of LM events.

> If this is some hare-brained money scheme; must not give them if they
> didn't pay up then I really don't care.

Many workloads that are still tied to the older generation instances we
offer are there for a reason. EC2's newer instance generations have a
better price and performance than the older ones; yet folks use the
older ones. We don't want to guess as to why that is. We just want these
workloads to continue running w/o changes or w/o customers having to
even think about these things, while running on supported hardware.

So as infrastructure providers, we're doing everything possible
behind-the-scenes to ensure there's as little disruption to existing
workloads as possible.

> On top of that, you can't hide uarch differences with cpufreq capping.

Yes, this move (old hw -> new hw) isn't supposed to be "hide from the
instances we're doing this". It's rather "try to match the capabilities
to the older hw as much as possible". Some software will adapt to these
changes; some software won't. We're aiming to be ready for both
scenarios as far as software allows us.

> Also, it is probably more power efficient to let it run faster and idle
> more, so you're not being environmental either.

Agreed; I can chat about that quite a bit, but that doesn't apply to
this context.

Amit
On Wed, Jun 01, 2022 at 10:59:17AM +0200, Paolo Bonzini wrote:
> On 6/1/22 09:57, Vitaly Kuznetsov wrote:
> > > > I'll bite... What's ludicrous about wanting to run a guest at a lower
> > > > CPU freq to minimize observable change in whatever workload it is
> > > > running?
> > > Well, the right API is cpufreq, there's no need to make it a KVM
> > > functionality.
> > KVM may probably use the cpufreq API to run each vCPU at the desired
> > frequency: I don't quite see how this can be done with a VMM today when
> > it's not a 1-vCPU-per-1-pCPU setup.
>
> True, but then there's also a policy issue, in that KVM shouldn't be allowed
> to *bump* the frequency if userspace would ordinarily not have access to the
> cpufreq files in sysfs.

So, when using schedutil (which requires intel_pstate in passive mode),
then there's the option to use per-task uclamps which are somewhat
complicated but also affect cpufreq.
On Tue, 2022-05-31 at 20:11 +0200, Paolo Bonzini wrote:
> On 5/31/22 13:43, Metin Kaya wrote:
> > Thanks, Jack.
> >
> > Reviewed-by: Metin Kaya <metikaya@amazon.co.uk>
>
> Please try a bit harder. "Reviewed-by" is neither "this matches what's
> been forever in the Amazon kernel"

Oh, it hasn't made it to the Amazon kernel yet. I have started
threatening to *eat* people who even submit stuff to the internal kernel
for review without first posting it upstream. That does mean you get to
see it sooner, which is a mixed blessing :)

> - does not even *apply* to the upstream kernel, because it uses
>   (presumably Amazon-specific) CAP numbers above 10000
>
> - does not work if the vCPU is moved from one physical CPU to another
>
> - does not work if the intel_pstate driver writes to MSR_HWP_REQUEST
>
> - does not include documentation for the new capability
>
> - does not include a selftest
>
> - is unacceptable anyway because, as mentioned in the cover letter, it
>   isn't undone when the process exits

That's a useful summary; thank you. I think we should definitely focus
on integrating with cpufreq too.

Is this even a KVM thing? I don't think we really want to do the change
on every vmexit/enter. We could *maybe* make a case for doing it on
entry and then removing the limit when either returning to *userspace*
or scheduling? But I think even that probably isn't needed. Just let the
VMM run at the same frequency too, as a per-process setting.

> Jack, please understand that I am not really blaming you in any way, and
> ask some of your colleagues with upstream kernel experience (Alex Graf,
> David Woodhouse, Filippo Sironi, Jan Schoenherr, Amit Shah are the ones
> I know) which patches could be good targets for including upstream.

Indeed. As a straw man there are worse ways to start the discussion.

Thanks, Jack.
On Wed, 2022-06-01 at 08:52 +0200, Peter Zijlstra wrote:
> On Tue, May 31, 2022 at 02:52:04PM +0000, Durrant, Paul wrote:
> > > On Tue, May 31, 2022 at 02:02:36PM +0000, Jack Allister wrote:
> > > > The reasoning behind this is that you may want to run a guest at a
> > > > lower CPU frequency for the purposes of trying to match performance
> > > > parity between a host of an older CPU type to a newer faster one.
> > >
> > > That's quite ludicrus. Also, then it should be the host enforcing the
> > > cpufreq, not the guest.
> >
> > I'll bite... What's ludicrous about wanting to run a guest at a lower
> > CPU freq to minimize observable change in whatever workload it is
> > running?
>
> *why* would you want to do that? Everybody wants their stuff done
> faster.

Nah, lots of customers have existing workloads and they want them to run
the *same*. They don't want them to run *faster* because that could
expose existing bugs and race conditions in guest code that has worked
perfectly fine for years. They don't want us to stress-test it when it
was working fine before.

Hell, we are implementing guest transparent live migration to KVM from
*actual* Xen in order to let stuff "just continue to run as it did
before", when for many it would "just" be a case of rebuilding their
guest with new NVMe and network drivers.

> If this is some hare-brained money scheme; must not give them if they
> didn't pay up then I really don't care.

It's actually the other way round. The older instance types were more
expensive; prices generally went down over time, especially $/perf.
None of that eliminates customer inertia.

> On top of that, you can't hide uarch differences with cpufreq capping.

No, but you can bring the performance envelope within spitting distance.
This isn't about hiding the fact that they are now running on Linux and
on newer CPUs; it's about not *breaking* things too much.

> Also, it is probably more power efficient to let it run faster and idle
> more, so you're not being environmental either.

Not sure about that. I thought I saw analysis that something like the
last 5% of turbo performance cost 30% of the power budget in practice.

And running these Xen guests even scaled down on modern hardware is
still much more power efficient than running them on the original
hardware that we're migrating them from.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae220f88f00d..d2efc2ce624f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1169,6 +1169,8 @@ struct kvm_x86_ops {
 	bool (*rdtscp_supported)(void);
 	bool (*invpcid_supported)(void);
 
+	int (*set_cpu_freq_scaling)(struct kvm_vcpu *vcpu, u8 freq_100mhz);
+
 	void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
 	void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6499f371de58..beee39b57b13 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1699,6 +1699,95 @@ static bool vmx_invpcid_supported(void)
 	return cpu_has_vmx_invpcid();
 }
 
+static int vmx_query_cpu_freq_valid_freq(u8 freq)
+{
+#define MASK_PERF		0xFF
+#define CAP_HIGHEST_SHIFT	0
+#define CAP_LOWEST_SHIFT	24
+#define CAP_HIGHEST_MASK	(MASK_PERF << CAP_HIGHEST_SHIFT)
+#define CAP_LOWEST_MASK		(MASK_PERF << CAP_LOWEST_SHIFT)
+	u64 cap_msr;
+	u8 highest, lowest;
+
+	/* Query highest and lowest supported scaling. */
+	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
+	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
+	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
+
+	if (freq < lowest || freq > highest)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void vmx_set_cpu_freq_uncapped(void)
+{
+#define SHIFT_DESIRED_PERF	16
+#define SHIFT_MAX_PERF		8
+#define SHIFT_MIN_PERF		0
+	u64 cap_msr, req_msr;
+	u8 highest, lowest;
+
+	/* Query the capabilities. */
+	rdmsrl(MSR_HWP_CAPABILITIES, cap_msr);
+	highest = (u8)(cap_msr & CAP_HIGHEST_MASK);
+	lowest = (u8)((cap_msr & CAP_LOWEST_MASK) >> CAP_LOWEST_SHIFT);
+
+	/* Set the desired to highest performance. */
+	req_msr = ((highest & MASK_PERF) << SHIFT_DESIRED_PERF) |
+		  ((highest & MASK_PERF) << SHIFT_MAX_PERF) |
+		  ((lowest & MASK_PERF) << SHIFT_MIN_PERF);
+	wrmsrl(MSR_HWP_REQUEST, req_msr);
+}
+
+static void vmx_set_cpu_freq_capped(u8 freq_100mhz)
+{
+	u64 req_msr;
+
+	/* Populate the variable used for setting the HWP request. */
+	req_msr = ((freq_100mhz & MASK_PERF) << SHIFT_DESIRED_PERF) |
+		  ((freq_100mhz & MASK_PERF) << SHIFT_MAX_PERF) |
+		  ((freq_100mhz & MASK_PERF) << SHIFT_MIN_PERF);
+
+	wrmsrl(MSR_HWP_REQUEST, req_msr);
+}
+
+static int vmx_set_cpu_freq_scaling(struct kvm_vcpu *vcpu, u8 freq_100mhz)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 pm_before, req_msr;
+	int rc;
+
+	/* Is HWP scaling supported? */
+	if (!this_cpu_has(X86_FEATURE_HWP))
+		return -ENODEV;
+
+	/*
+	 * HWP needs to be enabled to query & use capabilities.
+	 * This bit is W1Once so cannot be cleared after.
+	 */
+	rdmsrl(MSR_PM_ENABLE, pm_before);
+	if ((pm_before & 1) == 0)
+		wrmsrl(MSR_PM_ENABLE, pm_before | 1);
+
+	/*
+	 * Check if setting to a specific value, if being set
+	 * to zero this means return to uncapped frequency.
+	 */
+	if (freq_100mhz) {
+		rc = vmx_query_cpu_freq_valid_freq(freq_100mhz);
+		if (rc)
+			return rc;
+
+		vmx_set_cpu_freq_capped(freq_100mhz);
+	} else
+		vmx_set_cpu_freq_uncapped();
+
+	return 0;
+}
+
 /*
  * Swap MSR entry in host/guest MSR entry array.
  */
@@ -8124,6 +8213,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.rdtscp_supported = vmx_rdtscp_supported,
 	.invpcid_supported = vmx_invpcid_supported,
 
+	.set_cpu_freq_scaling = vmx_set_cpu_freq_scaling,
+
 	.set_supported_cpuid = vmx_set_supported_cpuid,
 
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c33423a1a13d..9ae2ab102e01 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3669,6 +3669,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_VAR_MTRR_COUNT:
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
+	case KVM_CAP_CPU_FREQ_SCALING:
 		r = 1;
 		break;
 #ifdef CONFIG_KVM_XEN
@@ -4499,6 +4500,19 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
 	return r;
 }
 
+static int kvm_cap_set_cpu_freq(struct kvm_vcpu *vcpu,
+				struct kvm_enable_cap *cap)
+{
+	u8 freq = (u8)cap->args[0];
+
+	/* Query whether this platform (Intel or AMD) support setting. */
+	if (!kvm_x86_ops.set_cpu_freq_scaling)
+		return -ENODEV;
+
+	/* Attempt to set to the frequency specified. */
+	return kvm_x86_ops.set_cpu_freq_scaling(vcpu, freq);
+}
+
 /*
  * kvm_set_guest_paused() indicates to the guest kernel that it has been
  * stopped by the hypervisor. This function will be called from the host only.
@@ -4553,6 +4567,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		return kvm_x86_ops.enable_direct_tlbflush(vcpu);
 	case KVM_CAP_SET_VAR_MTRR_COUNT:
 		return kvm_mtrr_set_var_mtrr_count(vcpu, cap->args[0]);
+	case KVM_CAP_CPU_FREQ_SCALING:
+		return kvm_cap_set_cpu_freq(vcpu, cap);
 	default:
 		return -EINVAL;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 831be0d2d5e4..273a3ab5590e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -874,6 +874,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_NO_POLL_ON_HLT 100003
 #define KVM_CAP_MMU_USE_VMA_CAPMEM 100004
 #define KVM_CAP_MMU_SUPPORT_DYNAMIC_CAPMEM 100005
+#define KVM_CAP_CPU_FREQ_SCALING 100006
 
 #define KVM_CAP_IRQCHIP	  0
 #define KVM_CAP_HLT	  1
A VMM can control a vCPU's CPU frequency by interfacing with KVM via
the vCPU file descriptor to enable/set CPU frequency scaling for a
guest. Instead of creating a separate IOCTL to do this, KVM
capabilities are extended to include a capability called
KVM_CAP_CPU_FREQ_SCALING.

A generic set_cpu_freq interface is added to kvm_x86_ops to allow for
architecture-independent (AMD/Intel) CPU frequency scaling setting.

For Intel platforms, Hardware-Controlled Performance States (HWP) are
used to implement CPU scaling within the guest. Further information on
this mechanism can be found in Intel SDM Vol 3B (section 14.4). The CPU
frequency is set as soon as this function is called and stays in effect
until explicitly reset or set again.

Currently the AMD frequency setting interface is left unimplemented.

Please note that in its current form CPU frequency scaling also affects
host processing. To change back to full performance when running in
host context, an IOCTL with a frequency value of 0 is needed to run
back at uncapped speed.

Signed-off-by: Jack Allister <jalliste@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/vmx/vmx.c          | 91 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 16 ++++++
 include/uapi/linux/kvm.h        |  1 +
 4 files changed, 110 insertions(+)