[5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()

Message ID: 20211214024948.048572883@linutronix.de
State: New, archived
Series: x86/fpu: Preparatory changes for guest AMX support

Commit Message

Thomas Gleixner Dec. 14, 2021, 2:50 a.m. UTC
KVM can require fpstate expansion due to updates to XCR0 and to the XFD
MSR. In both cases it is required to check whether:

  - the requested values are correct or permitted

  - the resulting xfeature mask which is relevant for XSAVES is a subset of
    the guest's fpstate xfeature mask for which the register buffer is sized.

    If the feature mask does not fit into the guest's fpstate then
    reallocation is required.

Provide a common update function which utilizes the existing XFD enablement
mechanics and two wrapper functions, one for XCR0 and one for XFD.

These wrappers have to be invoked from XSETBV emulation and the XFD MSR
write emulation.

XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
success.

XFD modification is done by the FPU core code as it requires updating the
software state as well.
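
For illustration, a minimal sketch of how the XCR0 wrapper could sit in
KVM's XSETBV emulation (the hook point and the simplified body are
assumptions for illustration, not part of this patch):

	static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr0)
	{
		/* ... existing XCR0 validity checks elided ... */

		/* Reject the write (guest gets #GP) unless the requested
		 * features are permitted and the fpstate buffer fits. */
		if (fpu_update_guest_xcr0(&vcpu->arch.guest_fpu, xcr0))
			return 1;

		vcpu->arch.xcr0 = xcr0;
		return 0;
	}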

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
New version to handle the restore case and XCR0 updates correctly.
---
 arch/x86/include/asm/fpu/api.h |   57 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/fpu/core.c     |   57 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+)

Comments

Tian, Kevin Dec. 14, 2021, 6:25 a.m. UTC | #1
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 10:50 AM
> 
> KVM can require fpstate expansion due to updates to XCR0 and to the XFD
> MSR. In both cases it is required to check whether:
> 
>   - the requested values are correct or permitted
> 
>   - the resulting xfeature mask which is relevant for XSAVES is a subset of
>     the guest's fpstate xfeature mask for which the register buffer is sized.
> 
>     If the feature mask does not fit into the guest's fpstate then
>     reallocation is required.
> 
> Provide a common update function which utilizes the existing XFD
> enablement mechanics and two wrapper functions, one for XCR0 and one
> for XFD.
> 
> These wrappers have to be invoked from XSETBV emulation and the XFD
> MSR write emulation.
> 
> XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
> success.

Currently XCR0 is modified right before entering the guest with preemption
disabled (see kvm_load_guest_xsave_state()). So this assumption is met.
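
For reference, a paraphrase of the relevant snippet in KVM's
kvm_load_guest_xsave_state() (details approximate):

	/* Runs with preemption disabled, right before VM entry: */
	if (vcpu->arch.xcr0 != host_xcr0)
		xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);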

> 
> XFD modification is done by the FPU core code as it requires updating the
> software state as well.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

[...]
> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> +}

no need to pass in xcr0. It can be fetched from vcpu->arch.xcr0.

> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{
> +	u64 expand, requested;
> +
> +	lockdep_assert_preemption_enabled();
> +
> +	/* Only permitted features are allowed in XCR0 */
> +	if (xcr0 & ~guest_fpu->perm)
> +		return -EPERM;
> +
> +	/* Check for unsupported XFD values */
> +	if (xfd & ~XFEATURE_MASK_USER_DYNAMIC || xfd & ~fpu_user_cfg.max_features)
> +		return -ENOTSUPP;
> +
> +	if (!IS_ENABLED(CONFIG_X86_64))
> +		return 0;

this could be checked first.

Thanks
Kevin
Wang, Wei W Dec. 14, 2021, 3:09 p.m. UTC | #2
On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
> KVM can require fpstate expansion due to updates to XCR0 and to the XFD
> MSR. In both cases it is required to check whether:
> 
>   - the requested values are correct or permitted
> 
>   - the resulting xfeature mask which is relevant for XSAVES is a subset of
>     the guest's fpstate xfeature mask for which the register buffer is sized.
> 
>     If the feature mask does not fit into the guest's fpstate then
>     reallocation is required.
> 
> Provide a common update function which utilizes the existing XFD
> enablement mechanics and two wrapper functions, one for XCR0 and one
> for XFD.
> 
> These wrappers have to be invoked from XSETBV emulation and the XFD
> MSR write emulation.
> 
> XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
> success.
> 
> XFD modification is done by the FPU core code as it requires updating the
> software state as well.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> New version to handle the restore case and XCR0 updates correctly.
> ---
>  arch/x86/include/asm/fpu/api.h |   57 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/fpu/core.c     |   57 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 114 insertions(+)
> 
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -136,6 +136,63 @@ extern void fpstate_clear_xstate_compone
>  extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
>  extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
>  extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
> +extern int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd);
> +
> +/**
> + * fpu_update_guest_xcr0 - Update guest XCR0 from XSETBV emulation
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Requested guest XCR0
> + *
> + * Has to be invoked before making the guest XCR0 update effective. The
> + * function validates that the requested features are permitted and ensures
> + * that @guest_fpu->fpstate is properly sized taking @guest_fpu->fpstate->xfd
> + * into account.
> + *
> + * If @guest_fpu->fpstate is not the current task's active fpstate then the
> + * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
> + * use, i.e. the guest restore case.
> + *
> + * Return:
> + * 0		- Success
> + * -EPERM	- Feature(s) not permitted
> + * -EFAULT	- Resizing of fpstate failed
> + */
> +static inline int fpu_update_guest_xcr0(struct fpu_guest *guest_fpu, u64 xcr0)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, guest_fpu->fpstate->xfd);
> +}
> +
> +/**
> + * fpu_update_guest_xfd - Update guest XFD from MSR write emulation
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Current guest XCR0
> + * @xfd:	Requested XFD value
> + *
> + * Has to be invoked to make the guest XFD update effective. The function
> + * validates the XFD value and ensures that @guest_fpu->fpstate is properly
> + * sized by taking @xcr0 into account.
> + *
> + * The caller must not modify @guest_fpu->fpstate->xfd or the XFD MSR
> + * directly.
> + *
> + * If @guest_fpu->fpstate is not the current task's active fpstate then the
> + * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
> + * use, i.e. the guest restore case.
> + *
> + * On success the buffer size is valid, @guest_fpu->fpstate.xfd == @xfd and
> + * if the guest fpstate is active then MSR_IA32_XFD == @xfd.
> + *
> + * On failure the previous state is retained.
> + *
> + * Return:
> + * 0		- Success
> + * -ENOTSUPP	- XFD value not supported
> + * -EFAULT	- Resizing of fpstate failed
> + */
> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> +}
> 
>  extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
>  extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -261,6 +261,63 @@ void fpu_free_guest_fpstate(struct fpu_g
>  }
>  EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
> 
> +/**
> + * __fpu_update_guest_features - Validate and enable guest XCR0 and XFD
> updates
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Guest XCR0
> + * @xfd:	Guest XFD
> + *
> + * Note: @xcr0 and @xfd must either be the already validated values or the
> + * requested values (guest emulation or host write on restore).
> + *
> + * Do not invoke directly. Use the provided wrappers fpu_update_guest_xcr0()
> + * and fpu_update_guest_xfd() instead.
> + *
> + * Return: 0 on success, error code otherwise
> + */
> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{

I think there would be one issue for the "host write on restore" case.
The current QEMU based host restore uses the following sequence:
1) restore xsave
2) restore xcr0
3) restore XFD MSR

At the time of "1) restore xsave", KVM already needs fpstate expansion before restoring the xsave data.
So the two APIs here might not be usable for this case.
Our current solution to fpstate expansion at KVM_SET_XSAVE (i.e. step 1) above) is:

kvm_load_guest_fpu(vcpu);
guest_fpu->realloc_request = realloc_request;
kvm_put_guest_fpu(vcpu);

"realloc_request" above is generated from the "xstate_header" received from userspace.

Thanks,
Wei
Thomas Gleixner Dec. 14, 2021, 3:40 p.m. UTC | #3
On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
> On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>> + * Return: 0 on success, error code otherwise
>> + */
>> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>> +{
>
> I think there would be one issue for the "host write on restore" case.
> The current QEMU based host restore uses the following sequence:
> 1) restore xsave
> 2) restore xcr0
> 3) restore XFD MSR

This needs to be fixed. Ordering clearly needs to be:

  XFD, XCR0, XSTATE

> At the time of "1) restore xsave", KVM already needs fpstate expansion
> before restoring the xsave data.

> So the 2 APIs here might not be usable for this usage.
> Our current solution to fpstate expansion at KVM_SET_XSAVE (i.e. step 1) above) is:
>
> kvm_load_guest_fpu(vcpu);
> guest_fpu->realloc_request = realloc_request;
> kvm_put_guest_fpu(vcpu);
>
> "realloc_request" above is generated from the "xstate_header" received from userspace.

That's a horrible hack. Please fix the ordering in QEMU. Trying to
accommodate nonsensical use cases in the kernel is just wrong.

That's like you expect the following to work:

       u8 *p = mmap(NULL, 4096, ....);

       p[8192] = x;

It rightfully explodes in your face and you can keep the pieces.

Having ordering constraints vs. these 3 involved parts is just sensible.
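
For illustration, a sketch of a restore order in QEMU's
kvm_arch_put_registers() that satisfies the XFD -> XCR0 -> XSTATE
constraint. kvm_put_xcrs()/kvm_put_xsave() follow QEMU's existing naming;
kvm_put_xfd_msr() is a hypothetical helper that restores only MSR_IA32_XFD
ahead of the bulk MSR restore:

    ret = kvm_put_xfd_msr(x86_cpu);     /* 1. XFD (hypothetical helper) */
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_xcrs(x86_cpu);        /* 2. XCR0 */
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_xsave(x86_cpu);       /* 3. XSTATE */
    if (ret < 0) {
        return ret;
    }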

Thanks,

        tglx
Wang, Wei W Dec. 14, 2021, 4:11 p.m. UTC | #4
On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
> > On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
> >> + * Return: 0 on success, error code otherwise
> >> + */
> >> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> >> +{
> >
> > I think there would be one issue for the "host write on restore" case.
> > The current QEMU based host restore uses the following sequence:
> > 1) restore xsave
> > 2) restore xcr0
> > 3) restore XFD MSR
> 
> This needs to be fixed. Ordering clearly needs to be:
> 
>   XFD, XCR0, XSTATE

Sorry, just to clarify that the ordering in QEMU isn't made by us for this specific XFD enabling.
It has been there for a long time for the general restoring of all the XCRs and MSRs.
(if you are interested..FYI: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168).
- kvm_put_xsave()
- kvm_put_xcrs()
- kvm_put_msrs()

We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
if changing that ordering would be OK.
(In general, I think there are no hard rules documented for this ordering)

Thanks,
Wei
Thomas Gleixner Dec. 14, 2021, 6:04 p.m. UTC | #5
Wei,

On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
> On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
>> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
>> > On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>> >> + * Return: 0 on success, error code otherwise
>> >> + */
>> >> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>> >> +{
>> >
>> > I think there would be one issue for the "host write on restore" case.
>> > The current QEMU based host restore uses the following sequence:
>> > 1) restore xsave
>> > 2) restore xcr0
>> > 3) restore XFD MSR
>> 
>> This needs to be fixed. Ordering clearly needs to be:
>> 
>>   XFD, XCR0, XSTATE
>
> Sorry, just to clarify that the ordering in QEMU isn't made by us for this specific XFD enabling.
> It has been there for long time for the general restoring of all the XCRs and MSRs.
> (if you are interested..FYI: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168).
> - kvm_put_xsave()
> - kvm_put_xcrs()
> - kvm_put_msrs()
>
> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
> if changing that ordering would be OK.
> (In general, I think there are no hard rules documented for this ordering)

There haven't been ordering requirements so far, but with dynamic
feature enablement there are.

I really want to avoid going to the point to deduce it from the
xstate:xfeatures bitmap, which is just backwards and Qemu has all the
required information already.

Thanks,

        tglx
Juan Quintela Dec. 14, 2021, 7:07 p.m. UTC | #6
Thomas Gleixner <tglx@linutronix.de> wrote:
> Wei,
>
> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>> On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
>>> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
>>> > On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>>> >> + * Return: 0 on success, error code otherwise
>>> >> + */
>>> >> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>>> >> +{
>>> >
>>> > I think there would be one issue for the "host write on restore" case.
>>> > The current QEMU based host restore uses the following sequence:
>>> > 1) restore xsave
>>> > 2) restore xcr0
>>> > 3) restore XFD MSR
>>> 
>>> This needs to be fixed. Ordering clearly needs to be:
>>> 
>>>   XFD, XCR0, XSTATE
>>
>> Sorry, just to clarify that the ordering in QEMU isn't made by us
>> for this specific XFD enabling.
>> It has been there for long time for the general restoring of all the
>> XCRs and MSRs.
>> (if you are interested..FYI:
>> https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168).
>> - kvm_put_xsave()
>> - kvm_put_xcrs()
>> - kvm_put_msrs()
>>
>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>> if changing that ordering would be OK.
>> (In general, I think there are no hard rules documented for this ordering)
>
> There haven't been ordering requirements so far, but with dynamic
> feature enablement there are.
>
> I really want to avoid going to the point to deduce it from the
> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
> required information already.

Hi

First of all, I claim ZERO knowledge about low level x86_64.

Once told that, this doesn't matter for qemu migration; the code is at

target/i386/kvm/kvm.c:kvm_arch_put_registers()


    ret = kvm_put_xsave(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_xcrs(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    /* must be before kvm_put_msrs */
    ret = kvm_inject_mce_oldstyle(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_msrs(x86_cpu, level);
    if (ret < 0) {
        return ret;
    }

If it needs to be done in any other order, it is completely independent
of whatever is inside the migration stream.

I guess that Paolo will put some light here.

Later, Juan.
Thomas Gleixner Dec. 14, 2021, 8:28 p.m. UTC | #7
Juan,

On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>>> if changing that ordering would be OK.
>>> (In general, I think there are no hard rules documented for this ordering)
>>
>> There haven't been ordering requirements so far, but with dynamic
>> feature enablement there are.
>>
>> I really want to avoid going to the point to deduce it from the
>> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
>> required information already.
>
> First of all, I claim ZERO knowledge about low level x86_64.

Lucky you.

> Once told that, this doesn't matter for qemu migration; the code is at

Once, that was at the time where rubber boots were still made of wood,
right? :)

> target/i386/kvm/kvm.c:kvm_arch_put_registers()
>
>
>     ret = kvm_put_xsave(x86_cpu);
>     if (ret < 0) {
>         return ret;
>     }
>     ret = kvm_put_xcrs(x86_cpu);
>     if (ret < 0) {
>         return ret;
>     }
>     /* must be before kvm_put_msrs */
>     ret = kvm_inject_mce_oldstyle(x86_cpu);

So this has already ordering requirements.

>     if (ret < 0) {
>         return ret;
>     }
>     ret = kvm_put_msrs(x86_cpu, level);
>     if (ret < 0) {
>         return ret;
>     }
>
> If it needs to be done in any other order, it is completely independent
> of whatever is inside the migration stream.

From the migration data perspective that's correct, but I have the
nagging feeling that this is not that simple.

> I guess that Paolo will put some light here.

I fear shining light on that will unearth quite a few skeletons :)

Thanks,

        tglx
Juan Quintela Dec. 14, 2021, 9:35 p.m. UTC | #8
Thomas Gleixner <tglx@linutronix.de> wrote:

Hi Thomas

> On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>>>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>>>> if changing that ordering would be OK.
>>>> (In general, I think there are no hard rules documented for this ordering)
>>>
>>> There haven't been ordering requirements so far, but with dynamic
>>> feature enablement there are.
>>>
>>> I really want to avoid going to the point to deduce it from the
>>> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
>>> required information already.
>>
>> First of all, I claim ZERO knowledge about low level x86_64.
>
> Lucky you.

Well, that is true until I have to debug some bug, at that time I miss
the knowledge O:-)

>> Once told that, this doesn't matter for qemu migration; the code is at
>
> Once, that was at the time where rubber boots were still made of wood,
> right? :)

I forgot to add: "famous last words".

>> target/i386/kvm/kvm.c:kvm_arch_put_registers()
>>
>>
>>     ret = kvm_put_xsave(x86_cpu);
>>     if (ret < 0) {
>>         return ret;
>>     }
>>     ret = kvm_put_xcrs(x86_cpu);
>>     if (ret < 0) {
>>         return ret;
>>     }
>>     /* must be before kvm_put_msrs */
>>     ret = kvm_inject_mce_oldstyle(x86_cpu);
>
> So this has already ordering requirements.
>
>>     if (ret < 0) {
>>         return ret;
>>     }
>>     ret = kvm_put_msrs(x86_cpu, level);
>>     if (ret < 0) {
>>         return ret;
>>     }
>>
>> If it needs to be done in any other order, it is completely independent
>> of whatever is inside the migration stream.
>
> From the migration data perspective that's correct, but I have the
> nagging feeling that this is not that simple.

Oh, I was not meaning that it was simple at all.

We have backward compatibility baggage on x86_64 that is grotesque at
this point.  We don't send a single msr (by that name) on the main
migration stream.  And then we send them based on "conditions".  So the
trick is something like:

- qemu reads the msrs from kvm in some order
- it stores them in a list of MSRs
- On migration pre_save we "mangle" those MSRs and other cpu state
  into the main (decades old) state
- then we send the main state
- then we send conditionally the variable state

on reception side:

- we receive everything that the sending side have sent
- we "demangle" it on pre_load
- then we create the list of MSRs that need to be transferred
- at that point we send them to kvm in another random order

So yes, I fully agree that it is not _that_ simple O:-)


>> I guess that Paolo will put some light here.
>
> I fear shining light on that will unearth quite a few skeletons :)

It is quite probable.

When a bugzilla starts with: We found a bug while we were trying to
migrate during (BIOS) boot, I just ran for the hills O:-)

Later, Juan.
Wang, Wei W Dec. 15, 2021, 2:17 a.m. UTC | #9
Hi Thomas,

On Wednesday, December 15, 2021 5:36 AM, Juan Quintela wrote:
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Hi Thomas
> 
> > On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
> >> Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
> >>>> We need to check with the QEMU migration maintainer (Dave and Juan
> >>>> CC-ed) if changing that ordering would be OK.
> >>>> (In general, I think there are no hard rules documented for this
> >>>> ordering)
> >>>
> >>> There haven't been ordering requirements so far, but with dynamic
> >>> feature enablement there are.
> >>>
> >>> I really want to avoid going to the point to deduce it from the
> >>> xstate:xfeatures bitmap, which is just backwards and Qemu has all
> >>> the required information already.
> >>
> >> First of all, I claim ZERO knowledge about low level x86_64.
> >
> > Lucky you.
> 
> Well, that is true until I have to debug some bug, at that time I miss the
> knowledge O:-)
> 
> >> Once told that, this doesn't matter for qemu migration; the code is at
> >
> > Once, that was at the time where rubber boots were still made of wood,
> > right? :)
> 
> I forgot to add: "famous last words".
> 
> >> target/i386/kvm/kvm.c:kvm_arch_put_registers()
> >>
> >>
> >>     ret = kvm_put_xsave(x86_cpu);
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>     ret = kvm_put_xcrs(x86_cpu);
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>     /* must be before kvm_put_msrs */
> >>     ret = kvm_inject_mce_oldstyle(x86_cpu);
> >
> > So this has already ordering requirements.
> >
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>     ret = kvm_put_msrs(x86_cpu, level);
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>
> >> If it needs to be done in any other order, it is completely
> >> independent of whatever is inside the migration stream.
> >
> > From the migration data perspective that's correct, but I have the
> > nagging feeling that this is not that simple.
> 
> Oh, I was not meaning that it was simple at all.

It seems to be a consensus that the ordering constraint wouldn't be that easy.
Would you think that our current solution (the 3 parts shared earlier to do fpstate expansion at KVM_SET_XSAVE) is acceptable as the 1st version?

Thanks,
Wei
Tian, Kevin Dec. 15, 2021, 6:14 a.m. UTC | #10
> From: Tian, Kevin
> Sent: Tuesday, December 14, 2021 2:26 PM
> > +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64
> xcr0,
> > u64 xfd)
> > +{
> > +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> > +}
> 
> no need to pass in xcr0. It can be fetched from vcpu->arch.xcr0.
> 

This is a silly comment. There is no vcpu to be referenced...
Thomas Gleixner Dec. 15, 2021, 10:09 a.m. UTC | #11
On Wed, Dec 15 2021 at 02:17, Wei W. Wang wrote:
> On Wednesday, December 15, 2021 5:36 AM, Juan Quintela wrote:
>> >> If it needs to be done in any other order, it is completely
>> >> independent of whatever is inside the migration stream.
>> >
>> > From the migration data perspective that's correct, but I have the
>> > nagging feeling that this is not that simple.
>> 
>> Oh, I was not meaning that it was simple at all.
>
> It seems to be a consensus that the ordering constraint wouldn't be
> that easy.

Right, but what is easy in this context? Not easy does not mean it is
impossible.

> Would you think that our current solution (the 3 parts shared earlier
> to do fpstate expansion at KVM_SET_XSAVE) is acceptable as the 1st
> version?

This is really the wrong question in the context of a user space ABI.

The point is that if we go and add that hack, then the hack has to be
supported forever. So there is no "as the 1st version".

I'm not at all a fan of this kind of approach because it puts the burden
at the wrong end and it carefully avoids sitting down and really thinking
it through.

That way we just pile hacks on hacks forever up to the point where the
hacks end up in a circular dependency which becomes unresolvable.

That's not a KVM/Qemu specific problem. That's a problem in general and
we've been bitten by that again and again.

The worst about this is that those people who try to sell their quick
and dirty hack in the first place are not those who end up dealing with
the consequences some time down the road.

Lets assume the restore order is XSTATE, XCR0, XFD:

     XSTATE has everything in init state, which means the default
     buffer is good enough

     XCR0 has everything enabled including AMX, so the buffer is
     expanded

     XFD has AMX disable set, which means the buffer expansion was
     pointless

If we go there, then we can just use a full expanded buffer for KVM
unconditionally and be done with it. That spares a lot of code.
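
For illustration, a minimal sketch of that unconditional variant inside
fpu_alloc_guest_fpstate() (the function and xstate_calculate_size() exist
in the fpu core; wiring them up this way is an assumption, not something
this series does):

	u64 perm = xstate_get_guest_group_perm(); /* per-process guest perm */
	unsigned int size;

	/* Size the buffer for every permitted xfeature up front, so no
	 * reallocation is ever needed at XCR0/XFD emulation time. The
	 * 'true' selects the compacted (XSAVES) format. */
	size = xstate_calculate_size(perm, true) +
	       ALIGN(offsetof(struct fpstate, regs), 64);
	fpstate = vzalloc(size);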

Thanks,

        tglx
Paolo Bonzini Dec. 15, 2021, 10:27 a.m. UTC | #12
On 12/15/21 11:09, Thomas Gleixner wrote:
> Lets assume the restore order is XSTATE, XCR0, XFD:
> 
>       XSTATE has everything in init state, which means the default
>       buffer is good enough
> 
>       XCR0 has everything enabled including AMX, so the buffer is
>       expanded
> 
>       XFD has AMX disable set, which means the buffer expansion was
>       pointless
> 
> If we go there, then we can just use a full expanded buffer for KVM
> unconditionally and be done with it. That spares a lot of code.

If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is 
done, that would work for me.  Basically KVM_SET_CPUID2 would:

- check bits from CPUID[0xD] against the prctl requested with GUEST_PERM

- return with -ENXIO or whatever if any dynamic bits were not requested

- otherwise call fpstate_realloc if there are any dynamic bits requested

Considering that in practice all Linux guests with AMX would have XFD 
passthrough (because if there's no prctl, Linux keeps AMX disabled in 
XFD), this removes the need to do all the #NM handling too.  Just make 
XFD passthrough if it can ever be set to a nonzero value.  This costs an 
RDMSR per vmexit even if neither the host nor the guest ever use AMX.
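
A minimal sketch of those KVM_SET_CPUID2 checks (the KVM-side helper and
hook point are assumptions; __xfd_enable_feature() is the existing fpu
core function used by this series):

	static int kvm_check_cpuid_xstate(struct kvm_vcpu *vcpu, u64 cpuid_xcr0)
	{
		u64 dynamic = cpuid_xcr0 & XFEATURE_MASK_USER_DYNAMIC;
		u64 expand;

		/* Dynamic bits in CPUID[0xD] must have been requested
		 * via the GUEST_PERM prctl beforehand. */
		if (dynamic & ~vcpu->arch.guest_fpu.perm)
			return -ENXIO;

		/* Expand the fpstate buffer up front for all dynamic
		 * bits which the buffer is not yet sized for. */
		expand = dynamic & ~vcpu->arch.guest_fpu.xfeatures;
		if (expand)
			return __xfd_enable_feature(expand, &vcpu->arch.guest_fpu);

		return 0;
	}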

That said, if we don't want to use a full expanded buffer, I don't 
expect any issue with requiring XFD first then XCR0 then XSAVE.  As Juan 
said, QEMU first gets everything from the migration stream and then 
restores it.  So yes, the QEMU code is complicated and messy but we can 
change the order without breaking migration from old to new QEMU.  QEMU 
also forbids migration if there's any CPUID feature that it does not 
understand, so old QEMU versions that don't understand AMX won't migrate 
AMX (with no possibility to override).

Paolo
Paolo Bonzini Dec. 15, 2021, 10:41 a.m. UTC | #13
On 12/15/21 11:27, Paolo Bonzini wrote:
> On 12/15/21 11:09, Thomas Gleixner wrote:
>> Lets assume the restore order is XSTATE, XCR0, XFD:
>>
>>       XSTATE has everything in init state, which means the default
>>       buffer is good enough
>>
>>       XCR0 has everything enabled including AMX, so the buffer is
>>       expanded
>>
>>       XFD has AMX disable set, which means the buffer expansion was
>>       pointless
>>
>> If we go there, then we can just use a full expanded buffer for KVM
>> unconditionally and be done with it. That spares a lot of code.
> 
> If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is 
> done, that would work for me. 

Off-list, Thomas mentioned doing it even at vCPU creation as long as the 
prctl has been called.  That is also okay and even simpler.

There's also another important thing that hasn't been mentioned so far: 
KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in 
CPUID[0xD] if they have not been requested with prctl.  It's okay to 
return the AMX bit, but not the bit in CPUID[0xD].

Paolo
Tian, Kevin Dec. 16, 2021, 1 a.m. UTC | #14
> From: Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:41 PM
> 
> On 12/15/21 11:27, Paolo Bonzini wrote:
> > On 12/15/21 11:09, Thomas Gleixner wrote:
> >> Lets assume the restore order is XSTATE, XCR0, XFD:
> >>
> >>       XSTATE has everything in init state, which means the default
> >>       buffer is good enough
> >>
> >>       XCR0 has everything enabled including AMX, so the buffer is
> >>       expanded
> >>
> >>       XFD has AMX disable set, which means the buffer expansion was
> >>       pointless
> >>
> >> If we go there, then we can just use a full expanded buffer for KVM
> >> unconditionally and be done with it. That spares a lot of code.
> >
> > If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is
> > done, that would work for me.
> 
> Off-list, Thomas mentioned doing it even at vCPU creation as long as the
> prctl has been called.  That is also okay and even simpler.

Makes sense. It also avoids the #GP thing in the emulation path if it's
just due to a reallocation error.

We'll follow this direction to do a quick update/test. 

> 
> There's also another important thing that hasn't been mentioned so far:
> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
> CPUID[0xD] if they have not been requested with prctl.  It's okay to
> return the AMX bit, but not the bit in CPUID[0xD].
> 

will do.

Thanks
Kevin
Tian, Kevin Dec. 16, 2021, 1:04 a.m. UTC | #15
> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:28 PM
> 
> On 12/15/21 11:09, Thomas Gleixner wrote:
> > Lets assume the restore order is XSTATE, XCR0, XFD:
> >
> >       XSTATE has everything in init state, which means the default
> >       buffer is good enough
> >
> >       XCR0 has everything enabled including AMX, so the buffer is
> >       expanded
> >
> >       XFD has AMX disable set, which means the buffer expansion was
> >       pointless
> >
> > If we go there, then we can just use a full expanded buffer for KVM
> > unconditionally and be done with it. That spares a lot of code.
> 
> If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is
> done, that would work for me.  Basically KVM_SET_CPUID2 would:
> 
> - check bits from CPUID[0xD] against the prctl requested with GUEST_PERM
> 
> - return with -ENXIO or whatever if any dynamic bits were not requested
> 
> - otherwise call fpstate_realloc if there are any dynamic bits requested
> 
> Considering that in practice all Linux guests with AMX would have XFD
> passthrough (because if there's no prctl, Linux keeps AMX disabled in
> XFD), this removes the need to do all the #NM handling too.  Just make

#NM trap is for XFD_ERR thus still required.

> XFD passthrough if it can ever be set to a nonzero value.  This costs an
> RDMSR per vmexit even if neither the host nor the guest ever use AMX.

Well, we can still trap WRMSR(XFD) at the start and then disable interception
after the 1st trap.
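
A sketch of that policy in the XFD MSR write emulation
(vmx_disable_intercept_for_msr() follows existing VMX helper naming; the
exact signature and hook point here are assumptions):

	case MSR_IA32_XFD:
		ret = fpu_update_guest_xfd(&vcpu->arch.guest_fpu,
					   vcpu->arch.xcr0, data);
		/* After the first successful non-zero write, stop
		 * intercepting further guest writes to IA32_XFD. */
		if (!ret && data)
			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD,
						      MSR_TYPE_RW);
		break;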

Thanks
Kevin
Tian, Kevin Dec. 16, 2021, 5:36 a.m. UTC | #16
> From: Tian, Kevin
> Sent: Thursday, December 16, 2021 9:00 AM
> >
> > Off-list, Thomas mentioned doing it even at vCPU creation as long as the
> > prctl has been called.  That is also okay and even simpler.
> 
> Makes sense. It also avoids the #GP thing in the emulation path if it's
> just due to a reallocation error.
> 
> We'll follow this direction to do a quick update/test.
> 

After some study there are three open questions which we'd like to sync here. Once
they are closed we'll send out a new version very soon (hopefully tomorrow).

1) Have a full expanded buffer at vCPU creation

There are two options.

One is to directly allocate a big-enough buffer based on guest_fpu::perm in
fpu_alloc_guest_fpstate(). There is no reallocation per-se thus most changes
in this series are not required.

The other is to keep the reallocation concept (thus all previous patches are
kept) and still call a wrapper around __xfd_enable_feature() even at vCPU
creation (e.g. right after fpu_init_guest_permissions()). This matches the
fpu core assumption that fpstate for xfd features are dynamically allocated,
though the actual calling point may not be dynamic. This also allows us
to exploit doing expansion in KVM_SET_CPUID2 (see next).

2) Do expansion at vCPU creation or KVM_SET_CPUID2?

If the reallocation concept is still kept, then we feel doing expansion in
KVM_SET_CPUID2 makes slightly more sense. There is no functional 
difference between two options since the guest is not running at this 
point. And in general Qemu should set prctl according to the cpuid bits. 
But since anyway we still need to check guest cpuid against guest perm in 
KVM_SET_CPUID2, it reads clearer to expand the buffer only after this 
check is passed.

If this approach is agreed, then we may replace the helper functions in
this patch with a new one:

/*
 * fpu_update_guest_perm_features - Enable xfeatures according to guest perm
 * @guest_fpu:		Pointer to the guest FPU container
 *
 * Enable all dynamic xfeatures according to guest perm. Invoked if the
 * caller wants to conservatively expand the fpstate buffer instead of waiting
 * until a given feature is accessed.
 *
 * Return: 0 on success, error code otherwise
 */
+int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
+{
+	u64 expand;
+
+	lockdep_assert_preemption_enabled();
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return 0;
+
+	expand = guest_fpu->perm & ~guest_fpu->xfeatures;
+	if (!expand)
+		return 0;
+
+	return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);

3) Always disable interception, or disable after 1st interception?

Once we choose to have a full expanded buffer before guest runs, the
point of intercepting WRMSR(IA32_XFD) becomes less obvious since
no dynamic reallocation is required.

One option is to always disable WRMSR interception once 
KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per vm-exit. 
But doing so affects even a legacy OS which has no XFD logic at all.

The other option is to continue the current policy i.e. disable write 
emulation only after the 1st interception of setting XFD to a non-zero 
value. Then the RDMSR cost is added only for guest which supports XFD.

In either case we need another helper to update guest_fpu->fpstate->xfd
as required in the restore path. For the 2nd option we further want
to update the MSR if guest_fpu is currently in use:

+void xfd_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
+{
+	fpregs_lock();
+	guest_fpu->fpstate->xfd = xfd;
+	if (guest_fpu->fpstate->in_use)
+		xfd_update_state(guest_fpu->fpstate);
+	fpregs_unlock();
+}
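
A sketch of the assumed caller on the restore path (the MSR emulation
hook point is an assumption):

	case MSR_IA32_XFD:
		/* Host-initiated write (restore): the buffer is already
		 * fully expanded, only the software state and, if the
		 * guest fpstate is active, the MSR need updating. */
		if (msr_info->host_initiated)
			xfd_update_guest_xfd(&vcpu->arch.guest_fpu, data);
		break;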

Thoughts?
--
p.s. currently we're working on a quick prototype based on:
  - Expand buffer in KVM_SET_CPUID2
  - Disable write emulation after first interception
to check for any oversights. 

Thanks
Kevin
Thomas Gleixner Dec. 16, 2021, 9:34 a.m. UTC | #17
On Thu, Dec 16 2021 at 01:04, Kevin Tian wrote:
>> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
>> Considering that in practice all Linux guests with AMX would have XFD
>> passthrough (because if there's no prctl, Linux keeps AMX disabled in
>> XFD), this removes the need to do all the #NM handling too.  Just make
>
> #NM trap is for XFD_ERR thus still required.
>
>> XFD passthrough if it can ever be set to a nonzero value.  This costs an
>> RDMSR per vmexit even if neither the host nor the guest ever use AMX.
>
> Well, we can still trap WRMSR(XFD) at the start and then disable interception
> after the 1st trap.

If we go for buffer expansion at vcpu_create() or CPUID2 then I think
you don't need a trap at all.

XFD_ERR:  Always 0 on the host. Guest state needs to be preserved on
          VMEXIT and restored on VMENTER

This can be done simply with the MSR entry/exit controls. No trap is
required, neither for #NM nor for XFD_ERR.

VMENTER loads guest state. VMEXIT saves guest state and loads host state
(0)

XFD:     Always guest state

So VMENTER does nothing and VMEXIT either saves guest state and the sync
function uses the automatically saved value or you keep the sync
function which does the rdmsrl() as is.

Hmm?

Thanks,

        tglx
Tian, Kevin Dec. 16, 2021, 9:59 a.m. UTC | #18
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 16, 2021 5:35 PM
> 
> On Thu, Dec 16 2021 at 01:04, Kevin Tian wrote:
> >> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo
> Bonzini
> >> Considering that in practice all Linux guests with AMX would have XFD
> >> passthrough (because if there's no prctl, Linux keeps AMX disabled in
> >> XFD), this removes the need to do all the #NM handling too.  Just make
> >
> > #NM trap is for XFD_ERR thus still required.
> >
> >> XFD passthrough if it can ever be set to a nonzero value.  This costs an
> >> RDMSR per vmexit even if neither the host nor the guest ever use AMX.
> >
> > Well, we can still trap WRMSR(XFD) at the start and then disable
> > interception after the 1st trap.
> 
> If we go for buffer expansion at vcpu_create() or CPUID2 then I think
> you don't need a trap at all.
> 
> XFD_ERR:  Always 0 on the host. Guest state needs to be preserved on
>           VMEXIT and restored on VMENTER
> 
> This can be done simply with the MSR entry/exit controls. No trap
> required neither for #NM for for XFD_ERR.
> 
> VMENTER loads guest state. VMEXIT saves guest state and loads host state
> (0)

This implies three MSR operations for every vm-exit.

With the trap we only need one RDMSR in the host #NM handler and one 
RDMSR/one WRMSR exit in the guest #NM handler, which are both rare,
plus one RDMSR/one WRMSR per vm-exit only if the saved xfd_err is 
non-zero, which is again rare.

> 
> XFD:     Always guest state
> 
> So VMENTER does nothing and VMEXIT either saves guest state and the sync
> function uses the automatically saved value or you keep the sync
> function which does the rdmsrl() as is.
> 

Yes, this is the 3rd open question that I asked in another reply. The only
restriction with this approach is that the sync cost is added even for a
legacy OS which doesn't touch XFD at all.

Thanks
Kevin
Tian, Kevin Dec. 16, 2021, 10:21 a.m. UTC | #19
> From: Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:41 PM
> 
> There's also another important thing that hasn't been mentioned so far:
> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
> CPUID[0xD] if they have not been requested with prctl.  It's okay to
> return the AMX bit, but not the bit in CPUID[0xD].

There is no vcpu in this ioctl, thus we cannot check vcpu->arch.guest_fpu.perm.

This then requires exposing xstate_get_guest_group_perm() to KVM. Thomas, 
are you OK with this change given Paolo's ask? v1 included this change but
it was not necessary at the moment:

	https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/

and Paolo, do we want to document that prctl() must be done before
calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

Thanks
Kevin
Paolo Bonzini Dec. 16, 2021, 10:26 a.m. UTC | #21
On 12/16/21 11:21, Tian, Kevin wrote:
>> From: Paolo Bonzini
>> Sent: Wednesday, December 15, 2021 6:41 PM
>>
>> There's also another important thing that hasn't been mentioned so far:
>> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
>> CPUID[0xD] if they have not been requested with prctl.  It's okay to
>> return the AMX bit, but not the bit in CPUID[0xD].
> 
> There is no vcpu in this ioctl, thus we cannot check vcpu->arch.guest_fpu.perm.
> 
> This then requires exposing xstate_get_guest_group_perm() to KVM.

Right, this is a generic /dev/kvm ioctl therefore it has to check the 
process state.

> Thomas, are you OK with this change given Paolo's ask? v1 included
> this change but it was not necessary at the moment:
> 
> 	https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/
> 
> and Paolo, do we want to document that prctl() must be done before
> calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

You can document it under the KVM_GET_SUPPORTED_CPUID ioctl.

(The reason for this ordering is backwards compatibility: otherwise a 
process could pass KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2 directly, 
and the resulting VM would not be able to use AMX because it hasn't been 
requested.  Likewise, userspace needs to know that if you use prctl then 
you also need to allocate >4K for the xstate and use KVM_GET_XSAVE2 to 
retrieve it).
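
For illustration, the userspace ordering that implies (the prctl name and
constants follow this series; treat the exact values as assumptions):

	#include <err.h>
	#include <sys/prctl.h>

	#define ARCH_REQ_XCOMP_GUEST_PERM	0x1025	/* from this series */
	#define XFEATURE_XTILEDATA		18

	/* Request AMX guest permission before querying supported CPUID,
	 * so KVM_GET_SUPPORTED_CPUID reports the dynamic CPUID[0xD] bits. */
	if (prctl(ARCH_REQ_XCOMP_GUEST_PERM, XFEATURE_XTILEDATA))
		err(1, "ARCH_REQ_XCOMP_GUEST_PERM");

	ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid);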

Paolo
Tian, Kevin Dec. 16, 2021, 1 p.m. UTC | #22
Hi, Paolo/Thomas,

Any comment on the following open questions? In general doing static buffer 
expansion definitely simplifies the logic, but still need your help to 
finalize its impact on the overall design. 
Thomas Gleixner Dec. 16, 2021, 2:12 p.m. UTC | #23
On Thu, Dec 16 2021 at 09:59, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> This can be done simply with the MSR entry/exit controls. No trap
>> required neither for #NM for for XFD_ERR.
>> 
>> VMENTER loads guest state. VMEXIT saves guest state and loads host state
>> (0)
>
> This implies three MSR operations for every vm-exit.
>
> With trap we only need one RDMSR in host #NM handler, one 
> RDMSR/one WRMSR exit in guest #NM handler, which are both rare.
> plus one RDMSR/one WRMSR per vm-exit only if saved xfd_err is 
> non-zero which is again rare.

Fair enough.

>> XFD:     Always guest state
>> 
>> So VMENTER does nothing and VMEXIT either saves guest state and the sync
>> function uses the automatically saved value or you keep the sync
>> function which does the rdmsrl() as is.
>> 
>
> Yes, this is the 3rd open question that I asked in another reply. The only
> restriction with this approach is that the sync cost is added even for a
> legacy OS which doesn't touch XFD at all.

You still can make that conditional on the guest XCR0. If the guest never
enables the extended bit then neither the #NM trap nor the XFD sync
is required.

But yes, there are too many moving parts here :)

Thanks,

        tglx
Paolo Bonzini Dec. 16, 2021, 9:07 p.m. UTC | #24
On 12/16/21 06:36, Tian, Kevin wrote:
> 2) Do expansion at vCPU creation or KVM_SET_CPUID2?
> 
> If the reallocation concept is still kept, then we feel doing expansion in
> KVM_SET_CPUID2 makes slightly more sense. There is no functional
> difference between the two options since the guest is not running at this
> point. And in general Qemu should set prctl according to the cpuid bits.
> But since anyway we still need to check guest cpuid against guest perm in
> KVM_SET_CPUID2, it reads clearer to expand the buffer only after this
> check is passed.

Yes, that makes sense to me as well.  In principle userspace could call 
prctl only after KVM_CREATE_VCPU.

> 
> One option is to always disable WRMSR interception once 
> KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per vm-exit. 
> But doing so affects even a legacy OS which has no XFD logic at all.
> 
> The other option is to continue the current policy i.e. disable write 
> emulation only after the 1st interception of setting XFD to a non-zero 
> value. Then the RDMSR cost is added only for guest which supports XFD.

For this I suggest to implement the current policy, but place it at the 
end of the series so it's easy to drop it.

Paolo
Tian, Kevin Dec. 17, 2021, 3:33 p.m. UTC | #25
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 16, 2021 10:13 PM
> >
> > Yes, this is the 3rd open question that I asked in another reply. The only
> > restriction with this approach is that the sync cost is added even for a
> > legacy OS which doesn't touch XFD at all.
> 
> You still can make that conditional on the guest XCR0. If the guest never
> enables the extended bit then neither the #NM trap nor the XFD sync
> is required.
> 
> But yes, there are too many moving parts here :)
> 

Yes. Many moving parts, but in general it's getting cleaner and simpler. 

Patch

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -136,6 +136,63 @@  extern void fpstate_clear_xstate_compone
 extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
 extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
 extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
+extern int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd);
+
+/**
+ * fpu_update_guest_xcr0 - Update guest XCR0 from XSETBV emulation
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Requested guest XCR0
+ *
+ * Has to be invoked before making the guest XCR0 update effective. The
+ * function validates that the requested features are permitted and ensures
+ * that @guest_fpu->fpstate is properly sized taking @guest_fpu->fpstate->xfd
+ * into account.
+ *
+ * If @guest_fpu->fpstate is not the current task's active fpstate then the
+ * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
+ * use, i.e. the guest restore case.
+ *
+ * Return:
+ * 0		- Success
+ * -EPERM	- Feature(s) not permitted
+ * -EFAULT	- Resizing of fpstate failed
+ */
+static inline int fpu_update_guest_xcr0(struct fpu_guest *guest_fpu, u64 xcr0)
+{
+	return __fpu_update_guest_features(guest_fpu, xcr0, guest_fpu->fpstate->xfd);
+}
+
+/**
+ * fpu_update_guest_xfd - Update guest XFD from MSR write emulation
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Current guest XCR0
+ * @xfd:	Requested XFD value
+ *
+ * Has to be invoked to make the guest XFD update effective. The function
+ * validates the XFD value and ensures that @guest_fpu->fpstate is properly
+ * sized by taking @xcr0 into account.
+ *
+ * The caller must not modify @guest_fpu->fpstate->xfd or the XFD MSR
+ * directly.
+ *
+ * If @guest_fpu->fpstate is not the current task's active fpstate then the
+ * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
+ * use, i.e. the guest restore case.
+ *
+ * On success the buffer size is valid, @guest_fpu->fpstate.xfd == @xfd and
+ * if the guest fpstate is active then MSR_IA32_XFD == @xfd.
+ *
+ * On failure the previous state is retained.
+ *
+ * Return:
+ * 0		- Success
+ * -ENOTSUPP	- XFD value not supported
+ * -EFAULT	- Resizing of fpstate failed
+ */
+static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
+{
+	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
+}
 
 extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,6 +261,63 @@  void fpu_free_guest_fpstate(struct fpu_g
 }
 EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
 
+/**
+ * __fpu_update_guest_features - Validate and enable guest XCR0 and XFD updates
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Guest XCR0
+ * @xfd:	Guest XFD
+ *
+ * Note: @xcr0 and @xfd must either be the already validated values or the
+ * requested values (guest emulation or host write on restore).
+ *
+ * Do not invoke directly. Use the provided wrappers fpu_update_guest_xcr0()
+ * and fpu_update_guest_xfd() instead.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
+{
+	u64 expand, requested;
+
+	lockdep_assert_preemption_enabled();
+
+	/* Only permitted features are allowed in XCR0 */
+	if (xcr0 & ~guest_fpu->perm)
+		return -EPERM;
+
+	/* Check for unsupported XFD values */
+	if (xfd & ~XFEATURE_MASK_USER_DYNAMIC || xfd & ~fpu_user_cfg.max_features)
+		return -ENOTSUPP;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return 0;
+
+	/*
+	 * The requested features are set in XCR0 and not set in XFD.
+	 * Feature expansion is required when the requested features are
+	 * not a subset of the xfeatures for which @guest_fpu->fpstate
+	 * is sized.
+	 */
+	requested = xcr0 & ~xfd;
+	expand = requested & ~guest_fpu->xfeatures;
+	if (!expand) {
+		/*
+		 * fpstate size is correct, update the XFD value in fpstate
+		 * and if @guest_fpu->fpstate is active also the per CPU
+		 * cache and the MSR.
+		 */
+		fpregs_lock();
+		guest_fpu->fpstate->xfd = xfd;
+		if (guest_fpu->fpstate->in_use)
+			xfd_update_state(guest_fpu->fpstate);
+		fpregs_unlock();
+		return 0;
+	}
+
+	return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(__fpu_update_guest_features);
+
 int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 {
 	struct fpstate *guest_fps = guest_fpu->fpstate;