Message ID | c858817d3e3be246a1a2278e3b42d06284e615e5.1700766316.git.maciej.szmigiero@oracle.com (mailing list archive)
State | New, archived
Series | KVM: x86: Allow XSAVES on CPUs where host doesn't use it due to an errata
On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>
> Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
> kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
>
> Since KVM CPU caps are initialized from the kernel boot CPU features this
> makes the XSAVES feature also unavailable for KVM guests in this case, even
> though they might want to decide on their own whether they are affected by
> this errata.
>
> Allow KVM guests to make such decision by setting the XSAVES KVM CPU
> capability bit based on the actual CPU capability

This is not generally safe, as the guest can make such a decision if and only if
the Family/Model/Stepping information is reasonably accurate.

> This fixes booting Hyper-V enabled Windows Server 2016 VMs with more than
> one vCPU on Zen1/2 CPUs.

How/why does lack of XSAVES break a multi-vCPU setup?  Is Windows blindly doing
XSAVES based on FMS?
On 27.11.2023 18:24, Sean Christopherson wrote:
> On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
>> kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
>>
>> Since KVM CPU caps are initialized from the kernel boot CPU features this
>> makes the XSAVES feature also unavailable for KVM guests in this case, even
>> though they might want to decide on their own whether they are affected by
>> this errata.
>>
>> Allow KVM guests to make such decision by setting the XSAVES KVM CPU
>> capability bit based on the actual CPU capability
>
> This is not generally safe, as the guest can make such a decision if and only if
> the Family/Model/Stepping information is reasonably accurate.

If one lies to the guest about the CPU it is running on then obviously
things may work non-optimally.

>> This fixes booting Hyper-V enabled Windows Server 2016 VMs with more than
>> one vCPU on Zen1/2 CPUs.
>
> How/why does lack of XSAVES break a multi-vCPU setup? Is Windows blindly doing
> XSAVES based on FMS?

The hypercall from L2 Windows to L1 Hyper-V asking to boot the first AP
returns HV_STATUS_CPUID_XSAVE_FEATURE_VALIDATION_ERROR.

It's apparently a "should never happen" scenario for Windows since it
crashes soon after.

That's why uniprocessor configurations aren't broken - the BSP doesn't
need to be specifically booted by the L2 guest.

Unfortunately, Windows Server 2016 mainstream support ended in Jan 2022
so it is only getting security updates.
And you can't really break into an OS that you can't even start.

Thanks,
Maciej
On Mon, Nov 27, 2023, Maciej S. Szmigiero wrote:
> On 27.11.2023 18:24, Sean Christopherson wrote:
> > On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > >
> > > Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
> > > kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
> > >
> > > Since KVM CPU caps are initialized from the kernel boot CPU features this
> > > makes the XSAVES feature also unavailable for KVM guests in this case, even
> > > though they might want to decide on their own whether they are affected by
> > > this errata.
> > >
> > > Allow KVM guests to make such decision by setting the XSAVES KVM CPU
> > > capability bit based on the actual CPU capability
> >
> > This is not generally safe, as the guest can make such a decision if and only if
> > the Family/Model/Stepping information is reasonably accurate.
>
> If one lies to the guest about the CPU it is running on then obviously
> things may work non-optimally.

But this isn't about running optimally, it's about functional correctness.  And
"lying" to the guest about F/M/S is extremely common.

> > > This fixes booting Hyper-V enabled Windows Server 2016 VMs with more than
> > > one vCPU on Zen1/2 CPUs.
> >
> > How/why does lack of XSAVES break a multi-vCPU setup? Is Windows blindly doing
> > XSAVES based on FMS?
>
> The hypercall from L2 Windows to L1 Hyper-V asking to boot the first AP
> returns HV_STATUS_CPUID_XSAVE_FEATURE_VALIDATION_ERROR.

If it's just about CPUID enumeration, then userspace can simply stuff the XSAVES
feature flag.  This is not something that belongs in KVM, because this is safe if
and only if F/M/S is accurate and the guest is actually aware of the erratum (or
will not actually use XSAVES for other reasons), neither of which KVM can guarantee.
On 28.11.2023 17:48, Sean Christopherson wrote:
> On Mon, Nov 27, 2023, Maciej S. Szmigiero wrote:
>> On 27.11.2023 18:24, Sean Christopherson wrote:
>>> On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>>
>>>> Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
>>>> kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
>>>>
>>>> Since KVM CPU caps are initialized from the kernel boot CPU features this
>>>> makes the XSAVES feature also unavailable for KVM guests in this case, even
>>>> though they might want to decide on their own whether they are affected by
>>>> this errata.
>>>>
>>>> Allow KVM guests to make such decision by setting the XSAVES KVM CPU
>>>> capability bit based on the actual CPU capability
>>>
>>> This is not generally safe, as the guest can make such a decision if and only if
>>> the Family/Model/Stepping information is reasonably accurate.
>>
>> If one lies to the guest about the CPU it is running on then obviously
>> things may work non-optimally.
>
> But this isn't about running optimally, it's about functional correctness. And
> "lying" to the guest about F/M/S is extremely common.
>
>>>> This fixes booting Hyper-V enabled Windows Server 2016 VMs with more than
>>>> one vCPU on Zen1/2 CPUs.
>>>
>>> How/why does lack of XSAVES break a multi-vCPU setup? Is Windows blindly doing
>>> XSAVES based on FMS?
>>
>> The hypercall from L2 Windows to L1 Hyper-V asking to boot the first AP
>> returns HV_STATUS_CPUID_XSAVE_FEATURE_VALIDATION_ERROR.
>
> If it's just about CPUID enumeration, then userspace can simply stuff the XSAVES
> feature flag. This is not something that belongs in KVM, because this is safe if
> and only if F/M/S is accurate and the guest is actually aware of the erratum (or
> will not actually use XSAVES for other reasons), neither of which KVM can guarantee.

In other words, your suggestion is that QEMU (or other VMM), not KVM,
should be the one setting the XSAVES CPUID bit back, correct?

I don't think this would work with the current KVM code since it seems
to make various decisions depending on the presence of the XSAVES bit in
KVM caps rather than the guest CPUID and on boot_cpu_has(XSAVES) - one
such code block was even modified by this patch.

It even says in the comment above that code that it is not possible to
actually disable XSAVES without disabling all other variants on SVM so
this has to be enabled if the CPU supports it to switch the XSS MSR at
guest entry/exit (in this case it looks harmless since Zen1/2
supposedly don't support any supervisor extended states).

So it looks like we would need changes to *both* KVM and QEMU to
restore the XSAVES support this way.

Thanks,
Maciej
On Tue, Nov 28, 2023, Maciej S. Szmigiero wrote:
> On 28.11.2023 17:48, Sean Christopherson wrote:
> > On Mon, Nov 27, 2023, Maciej S. Szmigiero wrote:
> > > On 27.11.2023 18:24, Sean Christopherson wrote:
> > > > On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > > > >
> > > > > Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
> > > > > kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
> > > > >
> > > > > Since KVM CPU caps are initialized from the kernel boot CPU features this
> > > > > makes the XSAVES feature also unavailable for KVM guests in this case, even
> > > > > though they might want to decide on their own whether they are affected by
> > > > > this errata.
> > > > >
> > > > > Allow KVM guests to make such decision by setting the XSAVES KVM CPU
> > > > > capability bit based on the actual CPU capability
> > > >
> > > > This is not generally safe, as the guest can make such a decision if and only if
> > > > the Family/Model/Stepping information is reasonably accurate.
> > >
> > > If one lies to the guest about the CPU it is running on then obviously
> > > things may work non-optimally.
> >
> > But this isn't about running optimally, it's about functional correctness. And
> > "lying" to the guest about F/M/S is extremely common.
> >
> > > > > This fixes booting Hyper-V enabled Windows Server 2016 VMs with more than
> > > > > one vCPU on Zen1/2 CPUs.
> > > >
> > > > How/why does lack of XSAVES break a multi-vCPU setup? Is Windows blindly doing
> > > > XSAVES based on FMS?
> > >
> > > The hypercall from L2 Windows to L1 Hyper-V asking to boot the first AP
> > > returns HV_STATUS_CPUID_XSAVE_FEATURE_VALIDATION_ERROR.
> >
> > If it's just about CPUID enumeration, then userspace can simply stuff the XSAVES
> > feature flag.  This is not something that belongs in KVM, because this is safe if
> > and only if F/M/S is accurate and the guest is actually aware of the erratum (or
> > will not actually use XSAVES for other reasons), neither of which KVM can guarantee.
>
> In other words, your suggestion is that QEMU (or other VMM), not KVM,
> should be the one setting the XSAVES CPUID bit back, correct?
>
> I don't think this would work with the current KVM code since it seems
> to make various decisions depending on the presence of the XSAVES bit in
> KVM caps rather than the guest CPUID and on boot_cpu_has(XSAVES) - one
> such code block was even modified by this patch.
>
> It even says in the comment above that code that it is not possible to
> actually disable XSAVES without disabling all other variants on SVM so
> this has to be enabled if the CPU supports it to switch the XSS MSR at
> guest entry/exit (in this case it looks harmless since Zen1/2
> supposedly don't support any supervisor extended states).
>
> So it looks like we would need changes to *both* KVM and QEMU to
> restore the XSAVES support this way.

I'm not suggesting we restore XSAVES support, I'm suggesting that _if_ someone
wants to hack their setup to let the guest use broken hardware, then they should
do that in userspace or in a private kernel, not in upstream KVM.
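For reference, "stuffing" the feature from userspace boils down to setting one CPUID bit: XSAVES is enumerated in CPUID.(EAX=0Dh, ECX=1):EAX, bit 3. A minimal sketch of that bit manipulation (the helper name is hypothetical; the real QEMU/KVM_SET_CPUID2 plumbing is more involved):

```c
#include <assert.h>
#include <stdint.h>

/* XSAVES enumeration bit: CPUID leaf 0xD, sub-leaf 1, EAX bit 3. */
#define XSAVES_BIT (1u << 3)

/*
 * Hypothetical helper: take the EAX value of CPUID.(0xD, 1) that the VMM
 * would otherwise pass to KVM_SET_CPUID2, and force the XSAVES bit on.
 */
uint32_t stuff_xsaves(uint32_t leaf_d_1_eax)
{
	return leaf_d_1_eax | XSAVES_BIT;
}
```

This is the whole extent of the "CPUID enumeration" side; whether the guest then executes XSAVES safely is exactly the part KVM cannot guarantee.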
On Mon, 2023-11-27 at 09:24 -0800, Sean Christopherson wrote:
> On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
> > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> >
> > Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
> > kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
> >
> > Since KVM CPU caps are initialized from the kernel boot CPU features this
> > makes the XSAVES feature also unavailable for KVM guests in this case, even
> > though they might want to decide on their own whether they are affected by
> > this errata.
> >
> > Allow KVM guests to make such decision by setting the XSAVES KVM CPU
> > capability bit based on the actual CPU capability
>
> This is not generally safe, as the guest can make such a decision if and only if
> the Family/Model/Stepping information is reasonably accurate.

Another thing that really worries me is that the XSAVES erratum is a really
nasty one - AFAIK it silently corrupts some registers.

Is it better to let a broken CPU boot a broken OS (an OS which demands XSAVES
blindly), and let silent data corruption happen, than to refuse to boot it
completely?

I mean I understand that it is technically the OS's fault in this case
(assuming that we do provide it the correct CPU family info), but still this
seems like the wrong thing to do.

I guess this is one of those few cases when it makes sense for the userspace
to override KVM's CPUID caps and force a feature - in this case at least that
won't be KVM's fault.

Best regards,
	Maxim Levitsky

> > This fixes booting Hyper-V enabled Windows Server 2016 VMs with more than
> > one vCPU on Zen1/2 CPUs.
>
> How/why does lack of XSAVES break a multi-vCPU setup? Is Windows blindly doing
> XSAVES based on FMS?
>
On 30.11.2023 18:24, Maxim Levitsky wrote:
> On Mon, 2023-11-27 at 09:24 -0800, Sean Christopherson wrote:
>> On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
>>> kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
>>>
>>> Since KVM CPU caps are initialized from the kernel boot CPU features this
>>> makes the XSAVES feature also unavailable for KVM guests in this case, even
>>> though they might want to decide on their own whether they are affected by
>>> this errata.
>>>
>>> Allow KVM guests to make such decision by setting the XSAVES KVM CPU
>>> capability bit based on the actual CPU capability
>>
>> This is not generally safe, as the guest can make such a decision if and only if
>> the Family/Model/Stepping information is reasonably accurate.
>
> Another thing that really worries me is that the XSAVES erratum is a really
> nasty one - AFAIK it silently corrupts some registers.

It's not unconditional state corruption, but corruption under a specific set
of conditions, all of which have to be true for it to occur [1]:
* All XMM registers were restored to the initialization value by the most
  recent XRSTORS instruction because the XSTATE_BV[SSE] bit was clear.
* The state save area for the XMM registers does not contain the
  initialization state.
* The value in the XMM registers matches the initialization value when the
  XSAVES instruction is executed.
* The MXCSR register has been modified to a value different from the
  initialization value since the most recent XRSTORS instruction.

According to [2] this issue was fixed in the microcode update released on
2022-08-09.
[2] also says it is not present anymore in (at least) version 0x08301055.

> Is it better to let a broken CPU boot a broken OS (an OS which demands XSAVES
> blindly), and let silent data corruption happen, than to refuse to boot it
> completely?

It is possible that, for example, Windows only uses a safe subset of this
instruction or just verifies its presence but doesn't actually use it - it's
Hyper-V (L1) that throws this HV_STATUS_CPUID_XSAVE_FEATURE_VALIDATION_ERROR
but I presume it's the Windows (L2) kernel which chooses which XSAVE-family
variant to actually use.

At least in the Linux guest case the guest won't use XSAVES anyway due to
this erratum.
For other guests we also don't make the situation any worse than on bare
metal - if they would use XSAVES anyway in KVM they would do it when running
on bare metal too.

> I mean I understand that it is technically the OS's fault in this case
> (assuming that we do provide it the correct CPU family info), but still this
> seems like the wrong thing to do.
>
> I guess this is one of those few cases when it makes sense for the userspace
> to override KVM's CPUID caps and force a feature - in this case at least that
> won't be KVM's fault.

I am not against making the decision in QEMU instead of doing this in KVM,
but as I said to Sean it looks like this will still require some KVM changes
since KVM seems to make various decisions depending on the presence of the
XSAVES bit in KVM caps and boot_cpu_has(XSAVES) rather than exclusively
based on what the VMM has set in CPUID.
That's why some KVM changes will be necessary even if the actual decision
logic will be in QEMU.

My point is to make this work out-of-the-box for QEMU "-cpu host" and
similar CPU models that have support for XSAVES.

The reason for this is simple: it's not like such Windows guests throw a
big error screen saying "Please enable XSAVES or disable XSAVEC for
successful boot".
Instead, they simply hang at boot, leaving the user wondering what could
be wrong.

Users can get very frustrated with situations like this since they don't
know what to do - on the Intel side of things, look for example at how
people are unable to boot recent Windows versions (both server and client)
on Intel Core 12th gen or later in KVM and how people still try random
things to fix it [3] (it's Hyper-V being picky about extended topology
information in CPUID btw).

Cloud providers often know which guest type is going to be launched in a VM
so there's no problem adding an extra QEMU "-cpu" flag for a particular
cloud VM.
But for individual users, having to guess which magic flags are necessary
to make a particular KVM guest work makes for a miserable user experience.

I think that if a particular guest works on bare metal it should work on
"-cpu host" too - no tinkering should be required for such basic
functionality as being able to successfully finish booting.

>
> Best regards,
> Maxim Levitsky

Thanks,
Maciej

[1]: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/revision-guides/56323-PUB_1_01.pdf
[2]: https://google.github.io/security-research/pocs/cpus/errata/amd/1386/
[3]: https://lore.kernel.org/kvm/MN2PR12MB3023F67FF37889AB3E8885F2A0849@MN2PR12MB3023.namprd12.prod.outlook.com/
     https://bugzilla.kernel.org/show_bug.cgi?id=217307
     https://forums.unraid.net/topic/131838-windows-11-virtual-machine-platform-wsl2-boot-loop/
On Thu, Nov 30, 2023 at 2:00 PM Maciej S. Szmigiero
<mail@maciej.szmigiero.name> wrote:
> I think that if a particular guest works on bare metal it should work on
> "-cpu host" too - no tinkering should be required for such basic
> functionality as being able to successfully finish booting.

I disagree.  Let's not focus on one particular erratum.  If, for
whatever reason, the host kernel is booted with "noxsaves," I don't
think KVM should allow a guest to bypass that directive.
On 1.12.2023 00:57, Jim Mattson wrote:
> On Thu, Nov 30, 2023 at 2:00 PM Maciej S. Szmigiero
> <mail@maciej.szmigiero.name> wrote:
>> I think that if a particular guest works on bare metal it should work on
>> "-cpu host" too - no tinkering should be required for such basic
>> functionality as being able to successfully finish booting.
>
> I disagree. Let's not focus on one particular erratum. If, for
> whatever reason, the host kernel is booted with "noxsaves," I don't
> think KVM should allow a guest to bypass that directive.

Good point, I agree that if the user explicitly disabled XSAVES on the
host via this parameter then it should remain disabled for the whole
host kernel, including KVM.

This could be achieved by either adding a special "noxsaves" flag
or by setting X86_BUG_XSAVES_AVOID instead of clearing
X86_FEATURE_XSAVES on these CPUs.

Then the core kernel XSAVES code would check for lack of
X86_BUG_XSAVES_AVOID (in addition to checking for
presence of X86_FEATURE_XSAVES) while KVM would keep using
only X86_FEATURE_XSAVES.

Thanks,
Maciej
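The proposed split can be modeled with plain booleans (note X86_BUG_XSAVES_AVOID is a hypothetical flag suggested in this thread, not an existing kernel define): the core kernel honors both the feature bit and the avoidance bit, while KVM keys off the raw feature bit alone, so an explicit "noxsaves" (which clears the feature bit itself) still disables XSAVES everywhere. A minimal sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the relevant cpufeature state on one CPU. */
struct xsaves_state {
	bool feature_xsaves;   /* X86_FEATURE_XSAVES: CPU support, cleared by "noxsaves" */
	bool bug_xsaves_avoid; /* hypothetical X86_BUG_XSAVES_AVOID: erratum workaround */
};

/* Core kernel: use XSAVES only if supported AND not flagged as avoid. */
bool kernel_uses_xsaves(const struct xsaves_state *s)
{
	return s->feature_xsaves && !s->bug_xsaves_avoid;
}

/* KVM: advertise XSAVES based on the raw feature bit alone. */
bool kvm_advertises_xsaves(const struct xsaves_state *s)
{
	return s->feature_xsaves;
}
```

Under this model a Zen1/2 host (feature set, avoid-bug set) keeps XSAVES off in the host kernel but available to guests, while "noxsaves" (feature cleared) disables it for both.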
On Fri, Dec 1, 2023 at 5:05 PM Maciej S. Szmigiero
<mail@maciej.szmigiero.name> wrote:
>
> On 1.12.2023 00:57, Jim Mattson wrote:
> > On Thu, Nov 30, 2023 at 2:00 PM Maciej S. Szmigiero
> > <mail@maciej.szmigiero.name> wrote:
> >> I think that if a particular guest works on bare metal it should work on
> >> "-cpu host" too - no tinkering should be required for such basic
> >> functionality as being able to successfully finish booting.
> >
> > I disagree. Let's not focus on one particular erratum. If, for
> > whatever reason, the host kernel is booted with "noxsaves," I don't
> > think KVM should allow a guest to bypass that directive.
>
> This could be achieved by either adding a special "noxsaves" flag
> or by setting X86_BUG_XSAVES_AVOID instead of clearing
> X86_FEATURE_XSAVES on these CPUs.
>
> Then the core kernel XSAVES code would check for lack of
> X86_BUG_XSAVES_AVOID (in addition to checking for
> presence of X86_FEATURE_XSAVES) while KVM would keep using
> only X86_FEATURE_XSAVES.

This is feasible; on the other hand, the erratum is pretty bad.  Since the
workaround is easy (just disable XSAVEC; and maybe XGETBV1 as well?), you
could just do a printk_ratelimited() on the write to HV_X64_MSR_GUEST_OS_ID,
in particular if:

1) guest CPUID has XSAVEC and SVM

2) host CPUID does not have XSAVES

3) guest OS id indicates Windows Server 10.x (which is 2016 to 2022), which
should be 0x0001040A00??????.

Paolo
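Condition 3 above amounts to a mask-and-compare on the guest OS ID value: the upper ten hex digits are fixed per the 0x0001040A00?????? pattern quoted above, and the low 24 bits are left as a wildcard. A sketch of that predicate (the mask/value split is my reading of the pattern, not verified against the Hyper-V TLFS):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Match a Hyper-V guest OS ID of the form 0x0001040A00??????, i.e. the
 * Windows Server 10.x pattern quoted in the thread.  The fixed upper 40
 * bits are compared; the low 24 bits are treated as don't-care.
 */
#define WS10X_GOS_ID_MASK  0xFFFFFFFFFF000000ULL
#define WS10X_GOS_ID_VALUE 0x0001040A00000000ULL

bool guest_os_id_is_ws10x(uint64_t guest_os_id)
{
	return (guest_os_id & WS10X_GOS_ID_MASK) == WS10X_GOS_ID_VALUE;
}
```

In the suggested scheme, the MSR-write handler would call this (together with the two CPUID checks) before emitting the printk_ratelimited() hint.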
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dda6fc4cfae8..a8820460163a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -679,6 +679,22 @@ void kvm_set_cpu_caps(void)
 		F(AMX_COMPLEX)
 	);
 
+	/*
+	 * It is possible that CPU supports XSAVES but the host kernel decided
+	 * not to use it, for example due to AMD Erratum 1386, and cleared the
+	 * relevant CPU feature bit.
+	 *
+	 * In such case let the guest decide on it own whether to make use of
+	 * this feature.
+	 */
+	if (boot_cpu_data.cpuid_level >= XSTATE_CPUID) {
+		unsigned int eax, ebx, ecx, edx;
+
+		cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
+		if (eax & F(XSAVES))
+			kvm_cpu_cap_set(X86_FEATURE_XSAVES);
+	}
+
 	kvm_cpu_cap_mask(CPUID_D_1_EAX,
 		F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | f_xfd
 	);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 712146312358..3cc36710eb21 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4306,9 +4306,12 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	 * whether it's advertised to the guest so that KVM context switches
 	 * XSS on VM-Enter/VM-Exit.  Failure to do so would effectively give
 	 * the guest read/write access to the host's XSS.
+	 *
+	 * Make sure to check for XSAVES in KVM CPU capabilities, since the
+	 * boot CPU feature bit might be disabled due to Erratum 1386.
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVE) &&
-	    boot_cpu_has(X86_FEATURE_XSAVES) &&
+	    kvm_cpu_cap_has(X86_FEATURE_XSAVES) &&
 	    guest_cpuid_has(vcpu, X86_FEATURE_XSAVE))
 		kvm_governed_feature_set(vcpu, X86_FEATURE_XSAVES);
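The cpuid.c hunk above queries the hardware directly instead of trusting boot_cpu_has(). The same raw probe can be mirrored from userspace to see what the CPU itself reports, independent of what the kernel cleared; a minimal x86-only sketch using the GCC/Clang `__get_cpuid_count` wrapper (assumed toolchain builtin, not a kernel API):

```c
#include <assert.h>
#include <cpuid.h>
#include <stdbool.h>

/*
 * Query CPUID leaf 0xD (XSTATE) sub-leaf 1 directly and test EAX bit 3
 * (XSAVES), mirroring the patch's cpuid_count() probe.  This bypasses
 * any feature-bit clearing done by the running kernel.
 */
bool cpu_reports_xsaves(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Returns 0 if the leaf is not supported by this CPU. */
	if (!__get_cpuid_count(0xD, 1, &eax, &ebx, &ecx, &edx))
		return false;
	return (eax & (1u << 3)) != 0;
}
```

On an affected Zen1/2 host with an older kernel, this would return true even while /proc/cpuinfo omits the xsaves flag, which is exactly the gap the patch exploits.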