Message ID | 20211214024948.048572883@linutronix.de (mailing list archive) |
---|---|
State | New, archived |
Series | x86/fpu: Preparatory changes for guest AMX support |
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 10:50 AM
>
> KVM can require fpstate expansion due to updates to XCR0 and to the XFD
> MSR. In both cases it is required to check whether:
>
> - the requested values are correct or permitted
>
> - the resulting xfeature mask which is relevant for XSAVES is a subset of
>   the guest's fpstate xfeature mask for which the register buffer is sized.
>
> If the feature mask does not fit into the guest's fpstate then
> reallocation is required.
>
> Provide a common update function which utilizes the existing XFD
> enablement mechanics and two wrapper functions, one for XCR0 and one
> for XFD.
>
> These wrappers have to be invoked from XSETBV emulation and the XFD
> MSR write emulation.
>
> XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
> success.

Currently XCR0 is modified right before entering the guest with preemption
disabled (see kvm_load_guest_xsave_state()), so this assumption is met.

> XFD modification is done by the FPU core code as it requires to update the
> software state as well.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

[...]

> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0,
> +				       u64 xfd)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> +}

No need to pass in xcr0. It can be fetched from vcpu->arch.xcr0.

> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0,
> +				u64 xfd)
> +{
> +	u64 expand, requested;
> +
> +	lockdep_assert_preemption_enabled();
> +
> +	/* Only permitted features are allowed in XCR0 */
> +	if (xcr0 & ~guest_fpu->perm)
> +		return -EPERM;
> +
> +	/* Check for unsupported XFD values */
> +	if (xfd & ~XFEATURE_MASK_USER_DYNAMIC || xfd & ~fpu_user_cfg.max_features)
> +		return -ENOTSUPP;
> +
> +	if (!IS_ENABLED(CONFIG_X86_64))
> +		return 0;

This could be checked first.

Thanks
Kevin
On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
> KVM can require fpstate expansion due to updates to XCR0 and to the XFD
> MSR. In both cases it is required to check whether:
>
> - the requested values are correct or permitted
>
> - the resulting xfeature mask which is relevant for XSAVES is a subset of
>   the guests fpstate xfeature mask for which the register buffer is sized.
>
> If the feature mask does not fit into the guests fpstate then
> reallocation is required.
>
> Provide a common update function which utilizes the existing XFD
> enablement mechanics and two wrapper functions, one for XCR0 and one
> for XFD.
>
> These wrappers have to be invoked from XSETBV emulation and the XFD
> MSR write emulation.
>
> XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
> success.
>
> XFD modification is done by the FPU core code as it requires to update the
> software state as well.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> New version to handle the restore case and XCR0 updates correctly.
> ---
>  arch/x86/include/asm/fpu/api.h | 57 ++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/fpu/core.c     | 57 ++++++++++++++++++++++++++++++++++
>  2 files changed, 114 insertions(+)
>
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -136,6 +136,63 @@ extern void fpstate_clear_xstate_compone
>  extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
>  extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
>  extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
> +extern int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd);
> +
> +/**
> + * fpu_update_guest_xcr0 - Update guest XCR0 from XSETBV emulation
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Requested guest XCR0
> + *
> + * Has to be invoked before making the guest XCR0 update effective. The
> + * function validates that the requested features are permitted and ensures
> + * that @guest_fpu->fpstate is properly sized taking @guest_fpu->fpstate->xfd
> + * into account.
> + *
> + * If @guest_fpu->fpstate is not the current tasks active fpstate then the
> + * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
> + * use, i.e. the guest restore case.
> + *
> + * Return:
> + * 0		- Success
> + * -EPERM	- Feature(s) not permitted
> + * -EFAULT	- Resizing of fpstate failed
> + */
> +static inline int fpu_update_guest_xcr0(struct fpu_guest *guest_fpu, u64 xcr0)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, guest_fpu->fpstate->xfd);
> +}
> +
> +/**
> + * fpu_update_guest_xfd - Update guest XFD from MSR write emulation
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Current guest XCR0
> + * @xfd:	Requested XFD value
> + *
> + * Has to be invoked to make the guest XFD update effective. The function
> + * validates the XFD value and ensures that @guest_fpu->fpstate is properly
> + * sized by taking @xcr0 into account.
> + *
> + * The caller must not modify @guest_fpu->fpstate->xfd or the XFD MSR
> + * directly.
> + *
> + * If @guest_fpu->fpstate is not the current tasks active fpstate then the
> + * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
> + * use, i.e. the guest restore case.
> + *
> + * On success the buffer size is valid, @guest_fpu->fpstate.xfd == @xfd and
> + * if the guest fpstate is active then MSR_IA32_XFD == @xfd.
> + *
> + * On failure the previous state is retained.
> + *
> + * Return:
> + * 0		- Success
> + * -ENOTSUPP	- XFD value not supported
> + * -EFAULT	- Resizing of fpstate failed
> + */
> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> +}
>
>  extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
>  extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
>
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -261,6 +261,63 @@ void fpu_free_guest_fpstate(struct fpu_g
>  }
>  EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
>
> +/**
> + * __fpu_update_guest_features - Validate and enable guest XCR0 and XFD updates
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Guest XCR0
> + * @xfd:	Guest XFD
> + *
> + * Note: @xcr0 and @xfd must either be the already validated values or the
> + * requested values (guest emulation or host write on restore).
> + *
> + * Do not invoke directly. Use the provided wrappers fpu_validate_guest_xcr0()
> + * and fpu_update_guest_xfd() instead.
> + *
> + * Return: 0 on success, error code otherwise
> + */
> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{

I think there would be one issue for the "host write on restore" case.
The current QEMU based host restore uses the following sequence:
1) restore xsave
2) restore xcr0
3) restore XFD MSR

At the time of "1) restore xsave", KVM already needs fpstate expansion
before restoring the xsave data, so the two APIs here might not be usable
for this case. Our current solution to fpstate expansion at KVM_SET_XSAVE
(i.e. step 1) above) is:

	kvm_load_guest_fpu(vcpu);
	guest_fpu->realloc_request = realloc_request;
	kvm_put_guest_fpu(vcpu);

"realloc_request" above is generated from the "xstate_header" received
from userspace.

Thanks,
Wei
On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
> On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>> + * Return: 0 on success, error code otherwise
>> + */
>> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>> +{
>
> I think there would be one issue for the "host write on restore" case.
> The current QEMU based host restore uses the following sequence:
> 1) restore xsave
> 2) restore xcr0
> 3) restore XFD MSR

This needs to be fixed. The ordering clearly needs to be:

	XFD, XCR0, XSTATE

> At the time of "1) restore xsave", KVM already needs fpstate expansion
> before restoring the xsave data.
> So the 2 APIs here might not be usable for this usage.
> Our current solution to fpstate expansion at KVM_SET_XSAVE (i.e. step 1) above) is:
>
>	kvm_load_guest_fpu(vcpu);
>	guest_fpu->realloc_request = realloc_request;
>	kvm_put_guest_fpu(vcpu);
>
> "realloc_request" above is generated from the "xstate_header" received from userspace.

That's a horrible hack. Please fix the ordering in QEMU. Trying to
accommodate nonsensical use cases in the kernel is just wrong. That's like
expecting the following to work:

	u8 *p = mmap(NULL, 4096, ....);
	p[8192] = x;

It rightfully explodes in your face and you can keep the pieces. Having
ordering constraints vs. these 3 involved parts is just sensible.

Thanks,

        tglx
On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
>> I think there would be one issue for the "host write on restore" case.
>> The current QEMU based host restore uses the following sequence:
>> 1) restore xsave
>> 2) restore xcr0
>> 3) restore XFD MSR
>
> This needs to be fixed. Ordering clearly needs to be:
>
> XFD, XCR0, XSTATE

Sorry, just to clarify: the ordering in QEMU wasn't introduced by us for
this specific XFD enabling. It has been there for a long time for the
general restore of all the XCRs and MSRs (if you are interested, FYI:
https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168):

- kvm_put_xsave()
- kvm_put_xcrs()
- kvm_put_msrs()

We need to check with the QEMU migration maintainers (Dave and Juan CC-ed)
whether changing that ordering would be OK. (In general, I think there are
no hard rules documented for this ordering.)

Thanks,
Wei
Wei,

On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
> On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
>> This needs to be fixed. Ordering clearly needs to be:
>>
>> XFD, XCR0, XSTATE
>
> Sorry, just to clarify that the ordering in QEMU isn't made by us for
> this specific XFD enabling. It has been there for long time for the
> general restoring of all the XCRs and MSRs.
> - kvm_put_xsave()
> - kvm_put_xcrs()
> - kvm_put_msrs()
>
> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
> if changing that ordering would be OK.
> (In general, I think there are no hard rules documented for this ordering)

There haven't been ordering requirements so far, but with dynamic feature
enablement there are.

I really want to avoid getting to the point of deducing it from the
xstate:xfeatures bitmap, which is just backwards; QEMU has all the
required information already.

Thanks,

        tglx
Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>> if changing that ordering would be OK.
>> (In general, I think there are no hard rules documented for this ordering)
>
> There haven't been ordering requirements so far, but with dynamic
> feature enablement there are.
>
> I really want to avoid going to the point to deduce it from the
> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
> required information already.

Hi,

First of all, I claim ZERO knowledge about low-level x86_64.

That said, this doesn't matter for QEMU migration. The code is at
target/i386/kvm/kvm.c:kvm_arch_put_registers():

    ret = kvm_put_xsave(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_xcrs(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    /* must be before kvm_put_msrs */
    ret = kvm_inject_mce_oldstyle(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_msrs(x86_cpu, level);
    if (ret < 0) {
        return ret;
    }

If it needs to be done in any other order, that is completely independent
of whatever is inside the migration stream.

I guess that Paolo will shed some light here.

Later, Juan.
Juan,

On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> There haven't been ordering requirements so far, but with dynamic
>> feature enablement there are.
>>
>> I really want to avoid going to the point to deduce it from the
>> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
>> required information already.
>
> First of all, I claim ZERO knowledge about low level x86_64.

Lucky you.

> Once told that, this don't matter for qemu migration, code is at

Once, that was at the time where rubber boots were still made of wood,
right? :)

> target/i386/kvm/kvm.c:kvm_arch_put_registers()
>
>     ret = kvm_put_xsave(x86_cpu);
>     if (ret < 0) {
>         return ret;
>     }
>     ret = kvm_put_xcrs(x86_cpu);
>     if (ret < 0) {
>         return ret;
>     }
>     /* must be before kvm_put_msrs */
>     ret = kvm_inject_mce_oldstyle(x86_cpu);

So this already has ordering requirements.

>     if (ret < 0) {
>         return ret;
>     }
>     ret = kvm_put_msrs(x86_cpu, level);
>     if (ret < 0) {
>         return ret;
>     }
>
> If it needs to be done in any other order, it is completely independent
> of whatever is inside the migration stream.

From the migration data perspective that's correct, but I have the
nagging feeling that this is not that simple.

> I guess that Paolo will put some light here.

I fear shining light on that will unearth quite a few skeletons :)

Thanks,

        tglx
Thomas Gleixner <tglx@linutronix.de> wrote:

Hi Thomas,

> On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
>> First of all, I claim ZERO knowledge about low level x86_64.
>
> Lucky you.

Well, that is true until I have to debug some bug; at that point I miss
the knowledge O:-)

>> Once told that, this don't matter for qemu migration, code is at
>
> Once, that was at the time where rubber boots were still made of wood,
> right? :)

I forgot to add: "famous last words".

>> /* must be before kvm_put_msrs */
>> ret = kvm_inject_mce_oldstyle(x86_cpu);
>
> So this has already ordering requirements.
>
>> If it needs to be done in any other order, it is completely independent
>> of whatever is inside the migration stream.
>
> From the migration data perspective that's correct, but I have the
> nagging feeling that this in not that simple.

Oh, I was not implying that it was simple at all. We have backward
compatibility baggage on x86_64 that is grotesque at this point. We don't
send a single MSR (by that name) on the main migration stream, and then we
send them based on "conditions".

So the trick is something like:
- QEMU reads the MSRs from KVM in some order
- it stores them in a list of MSRs
- on migration pre_save we "mangle" those MSRs and other CPU state into
  the main (decades old) state
- then we send the main state
- then we send the variable state conditionally

On the receiving side:
- we receive everything that the sending side has sent
- we "demangle" it in pre_load
- then we create the list of MSRs that need to be transferred
- at that point we send them to KVM in another random order

So yes, I fully agree that it is not _that_ simple O:-)

>> I guess that Paolo will put some light here.
>
> I fear shining light on that will unearth quite a few skeletons :)

It is quite probable. When a bugzilla starts with "We found a bug while we
were trying to migrate during (BIOS) boot", I just run for the hills O:-)

Later, Juan.
Hi Thomas,

On Wednesday, December 15, 2021 5:36 AM, Juan Quintela wrote:
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
>>> If it needs to be done in any other order, it is completely
>>> independent of whatever is inside the migration stream.
>>
>> From the migration data perspective that's correct, but I have the
>> nagging feeling that this in not that simple.
>
> Oh, I was not meaning that it was simple at all.

It seems to be a consensus that the ordering constraint wouldn't be that
easy. Would you think that our current solution (the 3 parts shared
earlier to do fpstate expansion at KVM_SET_XSAVE) is acceptable as the
1st version?

Thanks,
Wei
> From: Tian, Kevin
> Sent: Tuesday, December 14, 2021 2:26 PM
>
>> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0,
>> +				       u64 xfd)
>> +{
>> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
>> +}
>
> no need to pass in xcr0. It can be fetched from vcpu->arch.xcr0.

This was a silly comment. There is no vcpu to be referenced here...
On Wed, Dec 15 2021 at 02:17, Wei W. Wang wrote:
> On Wednesday, December 15, 2021 5:36 AM, Juan Quintela wrote:
>> Oh, I was not meaning that it was simple at all.
>
> It seems to be a consensus that the ordering constraint wouldn't be
> that easy.

Right, but what is easy in this context? Not easy does not mean it is
impossible.

> Would you think that our current solution (the 3 parts shared earlier
> to do fpstate expansion at KVM_SET_XSAVE) is acceptable as the 1st
> version?

That is really the wrong question in the context of a user space ABI. The
point is that if we go and add that hack, then the hack has to be
supported forever, so there is no "1st version".

I'm not at all a fan of this kind of approach because it puts the burden
at the wrong end and it carefully avoids sitting down and really thinking
it through. That way we just pile hacks on hacks forever, up to the point
where the hacks end up in a circular dependency which becomes
unresolvable. That's not a KVM/QEMU specific problem; it's a problem in
general, and we've been bitten by it again and again.

The worst part is that the people who try to sell their quick and dirty
hack in the first place are not the ones who end up dealing with the
consequences some time down the road.

Let's assume the restore order is XSTATE, XCR0, XFD:

  XSTATE has everything in init state, which means the default
         buffer is good enough.

  XCR0   has everything enabled including AMX, so the buffer is
         expanded.

  XFD    has AMX disable set, which means the buffer expansion was
         pointless.

If we go there, then we can just use a fully expanded buffer for KVM
unconditionally and be done with it. That spares a lot of code.

Thanks,

        tglx
On 12/15/21 11:09, Thomas Gleixner wrote:
> Lets assume the restore order is XSTATE, XCR0, XFD:
>
> XSTATE has everything in init state, which means the default
>        buffer is good enough
>
> XCR0   has everything enabled including AMX, so the buffer is
>        expanded
>
> XFD    has AMX disable set, which means the buffer expansion was
>        pointless
>
> If we go there, then we can just use a full expanded buffer for KVM
> unconditionally and be done with it. That spares a lot of code.

If we decide to use a fully expanded buffer as soon as KVM_SET_CPUID2 is
done, that would work for me. Basically KVM_SET_CPUID2 would:

- check bits from CPUID[0xD] against the prctl requested with GUEST_PERM

- return with -ENXIO or whatever if any dynamic bits were not requested

- otherwise call fpstate_realloc if there are any dynamic bits requested

Considering that in practice all Linux guests with AMX would have XFD
passthrough (because if there's no prctl, Linux keeps AMX disabled in
XFD), this removes the need to do all the #NM handling too. Just make
XFD passthrough if it can ever be set to a nonzero value. This costs an
RDMSR per vmexit even if neither the host nor the guest ever uses AMX.

That said, if we don't want to use a fully expanded buffer, I don't
expect any issue with requiring XFD first, then XCR0, then XSAVE. As
Juan said, QEMU first gets everything from the migration stream and then
restores it, so although the QEMU code is complicated and messy, we can
change the order without breaking migration from old to new QEMU.

QEMU also forbids migration if there's any CPUID feature that it does
not understand, so old QEMU versions that don't understand AMX won't
migrate it (with no possibility to override).

Paolo
On 12/15/21 11:27, Paolo Bonzini wrote:
> On 12/15/21 11:09, Thomas Gleixner wrote:
>> If we go there, then we can just use a full expanded buffer for KVM
>> unconditionally and be done with it. That spares a lot of code.
>
> If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is
> done, that would work for me.

Off-list, Thomas mentioned doing it even at vCPU creation as long as the
prctl has been called. That is also okay, and even simpler.

There's also another important thing that hasn't been mentioned so far:
KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
CPUID[0xD] if they have not been requested with prctl. It's okay to
return the AMX bit, but not the bits in CPUID[0xD].

Paolo
> From: Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:41 PM
>
> Off-list, Thomas mentioned doing it even at vCPU creation as long as the
> prctl has been called. That is also okay and even simpler.

Makes sense. It also avoids the #GP thing in the emulation path being
caused merely by a reallocation error. We'll follow this direction and do
a quick update/test.

> There's also another important thing that hasn't been mentioned so far:
> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
> CPUID[0xD] if they have not been requested with prctl. It's okay to
> return the AMX bit, but not the bit in CPUID[0xD].

Will do.

Thanks,
Kevin
> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:28 PM
>
> Considering that in practice all Linux guests with AMX would have XFD
> passthrough (because if there's no prctl, Linux keeps AMX disabled in
> XFD), this removes the need to do all the #NM handling too. Just make

The #NM trap is for XFD_ERR and thus still required.

> XFD passthrough if it can ever be set to a nonzero value. This costs an
> RDMSR per vmexit even if neither the host nor the guest ever use AMX.

Well, we can still trap WRMSR(XFD) in the beginning and then disable the
interception after the 1st trap.

Thanks,
Kevin
> From: Tian, Kevin
> Sent: Thursday, December 16, 2021 9:00 AM
>
>> Off-list, Thomas mentioned doing it even at vCPU creation as long as the
>> prctl has been called. That is also okay and even simpler.
>
> Makes sense. It also avoids the #GP thing in the emulation path if just
> due to a reallocation error.
>
> We'll follow this direction to do a quick update/test.

After some study there are three opens which we'd like to sync here. Once
they are closed we'll send out a new version very soon (hopefully
tomorrow).

1) Have a fully expanded buffer at vCPU creation

There are two options. One is to directly allocate a big-enough buffer
based on guest_fpu::perm in fpu_alloc_guest_fpstate(). There is no
reallocation per se, thus most changes in this series are not required.

The other is to keep the reallocation concept (thus all previous patches
are kept) and still call a wrapper around __xfd_enable_feature() even at
vCPU creation (e.g. right after fpu_init_guest_permissions()). This
matches the FPU core assumption that fpstate for xfd features is
dynamically allocated, though the actual calling point may not be dynamic.
It also allows us to do the expansion in KVM_SET_CPUID2 (see next).

2) Do the expansion at vCPU creation or at KVM_SET_CPUID2?

If the reallocation concept is kept, then we feel doing the expansion in
KVM_SET_CPUID2 makes slightly more sense. There is no functional
difference between the two options since the guest is not running at this
point, and in general QEMU should set the prctl according to the CPUID
bits. But since we anyway still need to check guest CPUID against guest
perm in KVM_SET_CPUID2, it reads clearer to expand the buffer only after
this check has passed.

If this approach is agreed on, then we may replace the helper functions in
this patch with a new one:

+/*
+ * fpu_update_guest_perm_features - Enable xfeatures according to guest perm
+ * @guest_fpu:	Pointer to the guest FPU container
+ *
+ * Enable all dynamic xfeatures according to guest perm. Invoked if the
+ * caller wants to conservatively expand the fpstate buffer instead of
+ * waiting until a given feature is accessed.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
+{
+	u64 expand;
+
+	lockdep_assert_preemption_enabled();
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return 0;
+
+	expand = guest_fpu->perm & ~guest_fpu->xfeatures;
+	if (!expand)
+		return 0;
+
+	return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);

3) Always disable WRMSR(XFD) interception, or only after the 1st trap?

Once we choose to have a fully expanded buffer before the guest runs, the
point of intercepting WRMSR(IA32_XFD) becomes less obvious since no
dynamic reallocation is required. One option is to always disable WRMSR
interception once KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per
vm-exit. But doing so affects legacy OSes which have no XFD logic at all.
The other option is to continue the current policy, i.e. disable write
emulation only after the 1st interception of setting XFD to a non-zero
value. Then the RDMSR cost is added only for guests which support XFD.

In either case we need another helper to update guest_fpu->fpstate->xfd
as required in the restore path. For the 2nd option we further want to
update the MSR if guest_fpu is currently in use:

+void xfd_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
+{
+	fpregs_lock();
+	guest_fpu->fpstate->xfd = xfd;
+	if (guest_fpu->fpstate->in_use)
+		xfd_update_state(guest_fpu->fpstate);
+	fpregs_unlock();
+}

Thoughts?

--
p.s. currently we're working on a quick prototype based on:
- expand the buffer in KVM_SET_CPUID2
- disable write emulation after the first interception
to check for any oversight.

Thanks,
Kevin
On Thu, Dec 16 2021 at 01:04, Kevin Tian wrote:
>> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
>> Considering that in practice all Linux guests with AMX would have XFD
>> passthrough (because if there's no prctl, Linux keeps AMX disabled in
>> XFD), this removes the need to do all the #NM handling too. Just make
>
> #NM trap is for XFD_ERR thus still required.
>
>> XFD passthrough if it can ever be set to a nonzero value. This costs an
>> RDMSR per vmexit even if neither the host nor the guest ever use AMX.
>
> Well, we can still trap WRMSR(XFD) in the start and then disable
> interception after the 1st trap.

If we go for buffer expansion at vcpu_create() or CPUID2 then I think
you don't need a trap at all.

XFD_ERR: Always 0 on the host. Guest state needs to be preserved on
         VMEXIT and restored on VMENTER.

         This can be done simply with the MSR entry/exit controls. No trap
         required, neither for #NM nor for XFD_ERR.

         VMENTER loads guest state. VMEXIT saves guest state and loads
         host state (0).

XFD:     Always guest state.

         So VMENTER does nothing and VMEXIT either saves guest state and
         the sync function uses the automatically saved value, or you keep
         the sync function which does the rdmsrl() as is.

Hmm?

Thanks,

        tglx
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 16, 2021 5:35 PM
>
> On Thu, Dec 16 2021 at 01:04, Kevin Tian wrote:
> >> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
> >> Considering that in practice all Linux guests with AMX would have XFD
> >> passthrough (because if there's no prctl, Linux keeps AMX disabled in
> >> XFD), this removes the need to do all the #NM handling too. Just make
> >
> > #NM trap is for XFD_ERR thus still required.
> >
> >> XFD passthrough if it can ever be set to a nonzero value. This costs an
> >> RDMSR per vmexit even if neither the host nor the guest ever use AMX.
> >
> > Well, we can still trap WRMSR(XFD) in the start and then disable
> > interception after the 1st trap.
>
> If we go for buffer expansion at vcpu_create() or CPUID2 then I think
> you don't need a trap at all.
>
> XFD_ERR: Always 0 on the host. Guest state needs to be preserved on
> VMEXIT and restored on VMENTER
>
> This can be done simply with the MSR entry/exit controls. No trap
> required, neither for #NM nor for XFD_ERR.
>
> VMENTER loads guest state. VMEXIT saves guest state and loads host state
> (0)

This implies three MSR operations for every vm-exit.

With the trap we only need one RDMSR in the host #NM handler and one
RDMSR exit/one WRMSR exit for the guest #NM handler, which are both rare,
plus one RDMSR/one WRMSR per vm-exit only if the saved xfd_err is
non-zero, which is again rare.

>
> XFD: Always guest state
>
> So VMENTER does nothing and VMEXIT either saves guest state and the sync
> function uses the automatically saved value or you keep the sync
> function which does the rdmsrl() as is.
>

Yes, this is the 3rd open that I asked about in another reply. The only
restriction with this approach is that the sync cost is added also for a
legacy OS which doesn't touch xfd at all.

Thanks
Kevin
> From: Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:41 PM
>
> There's also another important thing that hasn't been mentioned so far:
> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
> CPUID[0xD] if they have not been requested with prctl. It's okay to
> return the AMX bit, but not the bit in CPUID[0xD].

There is no vcpu in this ioctl, thus we cannot check
vcpu->arch.guest_fpu.perm.

This then requires exposing xstate_get_guest_group_perm() to KVM.

Thomas, are you OK with this change given Paolo's ask? v1 included this
change but it was not necessary at that point:

  https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/

And Paolo, do we want to document that prctl() must be done before
calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

Thanks
Kevin
On 12/16/21 11:21, Tian, Kevin wrote:
>> From: Paolo Bonzini
>> Sent: Wednesday, December 15, 2021 6:41 PM
>>
>> There's also another important thing that hasn't been mentioned so far:
>> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
>> CPUID[0xD] if they have not been requested with prctl. It's okay to
>> return the AMX bit, but not the bit in CPUID[0xD].
>
> There is no vcpu in this ioctl, thus we cannot check vcpu->arch.guest_fpu.perm.
>
> This then requires exposing xstate_get_guest_group_perm() to KVM.

Right, this is a generic /dev/kvm ioctl, therefore it has to check the
process state.

> Thomas, are you OK with this change given Paolo's ask? v1 included
> this change but it was not necessary at the moment:
>
> https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/
>
> and Paolo, do we want to document that prctl() must be done before
> calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

You can document it under the KVM_GET_SUPPORTED_CPUID ioctl.

(The reason for this ordering is backwards compatibility: otherwise a
process could pass KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2 directly,
and the resulting VM would not be able to use AMX because it hasn't been
requested. Likewise, userspace needs to know that if it uses prctl then
it also needs to allocate >4K for the xstate and use KVM_GET_XSAVE2 to
retrieve it.)

Paolo
Hi, Paolo/Thomas,

Any comments on the following opens? In general doing static buffer
expansion definitely simplifies the logic, but we still need your help to
finalize its impact on the overall design.
On Thu, Dec 16 2021 at 09:59, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> This can be done simply with the MSR entry/exit controls. No trap
>> required, neither for #NM nor for XFD_ERR.
>>
>> VMENTER loads guest state. VMEXIT saves guest state and loads host state
>> (0)
>
> This implies three MSR operations for every vm-exit.
>
> With the trap we only need one RDMSR in the host #NM handler and one
> RDMSR/one WRMSR exit in the guest #NM handler, which are both rare,
> plus one RDMSR/one WRMSR per vm-exit only if the saved xfd_err is
> non-zero, which is again rare.

Fair enough.

>> XFD: Always guest state
>>
>> So VMENTER does nothing and VMEXIT either saves guest state and the sync
>> function uses the automatically saved value or you keep the sync
>> function which does the rdmsrl() as is.
>
> Yes, this is the 3rd open that I asked about in another reply. The only
> restriction with this approach is that the sync cost is added also for a
> legacy OS which doesn't touch xfd at all.

You still can make that conditional on the guest XCR0. If the guest never
enables the extended bit then neither the #NM trap nor the XFD sync are
required.

But yes, there are too many moving parts here :)

Thanks,

        tglx
On 12/16/21 06:36, Tian, Kevin wrote:
> 2) Do expansion at vCPU creation or KVM_SET_CPUID2?
>
> If the reallocation concept is still kept, then we feel doing expansion in
> KVM_SET_CPUID2 makes slightly more sense. There is no functional
> difference between the two options since the guest is not running at this
> point. And in general Qemu should set prctl according to the cpuid bits.
> But since we anyway still need to check guest cpuid against guest perm in
> KVM_SET_CPUID2, it reads clearer to expand the buffer only after this
> check has passed.

Yes, that makes sense to me as well. In principle userspace could call
prctl only after KVM_CREATE_VCPU.

> One option is to always disable WRMSR interception once
> KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per vm-exit.
> But doing so affects a legacy OS which has no XFD logic at all.
>
> The other option is to continue the current policy, i.e. disable write
> emulation only after the 1st interception of setting XFD to a non-zero
> value. Then the RDMSR cost is added only for a guest which supports XFD.

For this I suggest implementing the current policy, but placing it at the
end of the series so it's easy to drop.

Paolo
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 16, 2021 10:13 PM
>
> > Yes, this is the 3rd open that I asked about in another reply. The only
> > restriction with this approach is that the sync cost is added also for a
> > legacy OS which doesn't touch xfd at all.
>
> You still can make that conditional on the guest XCR0. If the guest never
> enables the extended bit then neither the #NM trap nor the XFD sync
> are required.
>
> But yes, there are too many moving parts here :)

Yes. There are many moving parts, but in general it's getting cleaner and
simpler.
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -136,6 +136,63 @@ extern void fpstate_clear_xstate_compone
 extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
 extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
 extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
+extern int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd);
+
+/**
+ * fpu_update_guest_xcr0 - Update guest XCR0 from XSETBV emulation
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Requested guest XCR0
+ *
+ * Has to be invoked before making the guest XCR0 update effective. The
+ * function validates that the requested features are permitted and ensures
+ * that @guest_fpu->fpstate is properly sized taking @guest_fpu->fpstate->xfd
+ * into account.
+ *
+ * If @guest_fpu->fpstate is not the current task's active fpstate then the
+ * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
+ * use, i.e. the guest restore case.
+ *
+ * Return:
+ * 0		- Success
+ * -EPERM	- Feature(s) not permitted
+ * -EFAULT	- Resizing of fpstate failed
+ */
+static inline int fpu_update_guest_xcr0(struct fpu_guest *guest_fpu, u64 xcr0)
+{
+	return __fpu_update_guest_features(guest_fpu, xcr0, guest_fpu->fpstate->xfd);
+}
+
+/**
+ * fpu_update_guest_xfd - Update guest XFD from MSR write emulation
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Current guest XCR0
+ * @xfd:	Requested XFD value
+ *
+ * Has to be invoked to make the guest XFD update effective. The function
+ * validates the XFD value and ensures that @guest_fpu->fpstate is properly
+ * sized by taking @xcr0 into account.
+ *
+ * The caller must not modify @guest_fpu->fpstate->xfd or the XFD MSR
+ * directly.
+ *
+ * If @guest_fpu->fpstate is not the current task's active fpstate then the
+ * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
+ * use, i.e. the guest restore case.
+ *
+ * On success the buffer size is valid, @guest_fpu->fpstate->xfd == @xfd and,
+ * if the guest fpstate is active, MSR_IA32_XFD == @xfd.
+ *
+ * On failure the previous state is retained.
+ *
+ * Return:
+ * 0		- Success
+ * -ENOTSUPP	- XFD value not supported
+ * -EFAULT	- Resizing of fpstate failed
+ */
+static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
+{
+	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
+}
 
 extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,6 +261,63 @@ void fpu_free_guest_fpstate(struct fpu_g
 }
 EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
 
+/**
+ * __fpu_update_guest_features - Validate and enable guest XCR0 and XFD updates
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Guest XCR0
+ * @xfd:	Guest XFD
+ *
+ * Note: @xcr0 and @xfd must either be the already validated values or the
+ * requested values (guest emulation or host write on restore).
+ *
+ * Do not invoke directly. Use the provided wrappers fpu_update_guest_xcr0()
+ * and fpu_update_guest_xfd() instead.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
+{
+	u64 expand, requested;
+
+	lockdep_assert_preemption_enabled();
+
+	/* Only permitted features are allowed in XCR0 */
+	if (xcr0 & ~guest_fpu->perm)
+		return -EPERM;
+
+	/* Check for unsupported XFD values */
+	if (xfd & ~XFEATURE_MASK_USER_DYNAMIC || xfd & ~fpu_user_cfg.max_features)
+		return -ENOTSUPP;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return 0;
+
+	/*
+	 * The requested features are set in XCR0 and not set in XFD.
+	 * Feature expansion is required when the requested features are
+	 * not a subset of the xfeatures for which @guest_fpu->fpstate
+	 * is sized.
+	 */
+	requested = xcr0 & ~xfd;
+	expand = requested & ~guest_fpu->xfeatures;
+	if (!expand) {
+		/*
+		 * fpstate size is correct, update the XFD value in fpstate
+		 * and, if @guest_fpu->fpstate is active, also the per CPU
+		 * cache and the MSR.
+		 */
+		fpregs_lock();
+		guest_fpu->fpstate->xfd = xfd;
+		if (guest_fpu->fpstate->in_use)
+			xfd_update_state(guest_fpu->fpstate);
+		fpregs_unlock();
+		return 0;
+	}
+
+	return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(__fpu_update_guest_features);
+
 int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 {
 	struct fpstate *guest_fps = guest_fpu->fpstate;
KVM can require fpstate expansion due to updates to XCR0 and to the XFD
MSR. In both cases it is required to check whether:

  - the requested values are correct or permitted

  - the resulting xfeature mask which is relevant for XSAVES is a subset
    of the guest's fpstate xfeature mask for which the register buffer is
    sized.

If the feature mask does not fit into the guest's fpstate then reallocation
is required.

Provide a common update function which utilizes the existing XFD enablement
mechanics and two wrapper functions, one for XCR0 and one for XFD. These
wrappers have to be invoked from XSETBV emulation and the XFD MSR write
emulation.

XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
success.

XFD modification is done by the FPU core code as it requires updating the
software state as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
New version to handle the restore case and XCR0 updates correctly.
---
 arch/x86/include/asm/fpu/api.h |   57 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/fpu/core.c     |   57 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+)