
KVM: x86: add hint to skip hidden rdpkru under kvm_load_host_xsave_state

Message ID 20210507164456.1033-1-jon@nutanix.com (mailing list archive)
State New, archived

Commit Message

Jon Kohler May 7, 2021, 4:44 p.m. UTC
kvm_load_host_xsave_state handles xsave on vm exit, part of which is
managing memory protection key state. The latest arch.pkru is updated
with a rdpkru, and if that doesn't match the base host_pkru (which it
doesn't about 70% of the time), we issue a __write_pkru.

__write_pkru issues another rdpkru internally to try to avoid the
wrpkru, so we're reading the same value back to back even though we
know, 100% of the time, that we need to go directly to wrpkru. This is
a guaranteed branch miss and extra work that can be skipped.

To improve performance, add a hint to __write_pkru so that callers
can specify whether they've already done the check themselves. The
compiler seems to filter this branch out completely when the hint
is true, so incrementally tighter code is generated as well.
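
As a minimal illustration (hypothetical helper name, not part of this
patch), a compile-time-constant hint lets the compiler fold the branch
away entirely:

	/*
	 * Sketch: because the helper is inlined, a constant-true
	 * skip_comparison makes "!skip_comparison && ..." constant
	 * false, so the rdpkru() and compare drop out and the call
	 * compiles down to a bare WRPKRU.
	 */
	static __always_inline void write_pkru_hinted(u32 pkru,
						      bool skip_comparison)
	{
		if (!skip_comparison && pkru == rdpkru())
			return;
		wrpkru(pkru);
	}

	/* caller that has already compared the values: */
	write_pkru_hinted(vcpu->arch.host_pkru, true);	/* branch elided */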

While we're in this section of code, optimize the if-condition
ordering prior to __write_pkru in both kvm_load_{guest|host}_xsave_state.

For both functions, flip the ordering of the || condition so that
arch.xcr0 & XFEATURE_MASK_PKRU is checked first, which, when
instrumented in our environment, appeared to be always true and is
less overall work than kvm_read_cr4_bits.

For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
one position. When instrumented, I saw this be true roughly ~70% of
the time, vs the other conditions, which were almost always true.
With this change, we will avoid the 3rd condition check ~30% of the
time.

Cc: Babu Moger <babu.moger@amd.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 arch/x86/include/asm/fpu/internal.h  |  2 +-
 arch/x86/include/asm/pgtable.h       |  2 +-
 arch/x86/include/asm/special_insns.h | 10 ++++++----
 arch/x86/kvm/x86.c                   | 14 +++++++-------
 4 files changed, 15 insertions(+), 13 deletions(-)

--
2.30.1 (Apple Git-130)

Comments

Paolo Bonzini May 7, 2021, 4:52 p.m. UTC | #1
Nice one.  The patch can be made simpler though (I think).

On 07/05/21 18:44, Jon Kohler wrote:
> @@ -122,7 +124,7 @@ static inline u32 rdpkru(void)
>   	return 0;
>   }
> 
> -static inline void __write_pkru(u32 pkru)
> +static inline void __write_pkru(u32 pkru, bool skip_comparison)
>   {
>   }
>   #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cebdaa1e3cf5..cd95adbd140c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>   	}
> 
>   	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
> -		__write_pkru(vcpu->arch.pkru);
> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
> +		__write_pkru(vcpu->arch.pkru, false);

This can be optimized as well, can't it?  This means that the only case 
that needs the rdpkru is in switch_fpu_finish, and __write_pkru can be 
removed completely:

- do the rdpkru+wrpkru in switch_fpu_finish

- just use wrpkru in KVM

Paolo


>   }
>   EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
> 
> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>   		return;
> 
>   	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>   		vcpu->arch.pkru = rdpkru();
>   		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> -			__write_pkru(vcpu->arch.host_pkru);
> +			__write_pkru(vcpu->arch.host_pkru, true);
>   	}
> 
>   	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
> --
> 2.30.1 (Apple Git-130)
>
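
A rough sketch of the direction Paolo suggests above (not the posted
v2; it assumes wrpkru() stays visible to this code via special_insns.h):

	/* kvm_load_host_xsave_state(), sketch: the compare is already
	 * done here, so call wrpkru() directly and skip __write_pkru(). */
	vcpu->arch.pkru = rdpkru();
	if (vcpu->arch.pkru != vcpu->arch.host_pkru)
		wrpkru(vcpu->arch.host_pkru);

	/* switch_fpu_finish(), sketch: keep the only rdpkru()+compare
	 * here, which lets __write_pkru() and its hint go away. */
	if (pkru_val != rdpkru())
		wrpkru(pkru_val);
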
Dave Hansen May 7, 2021, 4:58 p.m. UTC | #2
On 5/7/21 9:52 AM, Paolo Bonzini wrote:
> This can be optimized as well, can't it?  This means that the only case
> that needs the rdpkru is in switch_fpu_finish, and __write_pkru can be
> removed completely:
> 
> - do the rdpkru+wrpkru in switch_fpu_finish
> 
> - just use wrpkru in KVM

I was going to suggest exactly the same thing.  It doesn't require the
compiler to be smart, and wrpkru() is available in the same header as
__write_pkru().

I also detest the mysterious true/false arguments to functions where you
have no clue what they're doing at the call site without comments.
Jon Kohler May 7, 2021, 5:13 p.m. UTC | #3
> On May 7, 2021, at 12:58 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 5/7/21 9:52 AM, Paolo Bonzini wrote:
>> This can be optimized as well, can't it?  This means that the only case
>> that needs the rdpkru is in switch_fpu_finish, and __write_pkru can be
>> removed completely:
>> 
>> - do the rdpkru+wrpkru in switch_fpu_finish
>> 
>> - just use wrpkru in KVM
> 
> I was going to suggest exactly the same thing.  It doesn't require the
> compiler to be smart, and wrpkru() is available in the same header as
> __write_pkru().

Thanks Paolo, Dave. Good suggestion, I’ll work up a v2 and send it out.

> 
> I also detest the mysterious true/false arguments to functions where you
> have no clue what they're doing at the call site without comments.

Noted, my apologies. That’s a good point, I won’t make that mistake again.
Andy Lutomirski May 14, 2021, 5:11 a.m. UTC | #4
On Fri, May 7, 2021 at 9:45 AM Jon Kohler <jon@nutanix.com> wrote:
>
> kvm_load_host_xsave_state handles xsave on vm exit, part of which is
> managing memory protection key state. The latest arch.pkru is updated
> with a rdpkru, and if that doesn't match the base host_pkru (which
> about 70% of the time), we issue a __write_pkru.

This thread caused me to read the code, and I don't get how it's even
remotely correct.

First, kvm_load_guest_fpu() has this delight:

    /*
     * Guests with protected state can't have it set by the hypervisor,
     * so skip trying to set it.
     */
    if (vcpu->arch.guest_fpu)
        /* PKRU is separately restored in kvm_x86_ops.run. */
        __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
                    ~XFEATURE_MASK_PKRU);

That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
that bit is live, and the only way to restore it to 0 is with
XRSTOR(S).

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cebdaa1e3cf5..cd95adbd140c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>         }
>
>         if (static_cpu_has(X86_FEATURE_PKU) &&
> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
> -           vcpu->arch.pkru != vcpu->arch.host_pkru)
> -               __write_pkru(vcpu->arch.pkru);
> +           vcpu->arch.pkru != vcpu->arch.host_pkru &&
> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
> +               __write_pkru(vcpu->arch.pkru, false);

Please tell me I'm missing something (e.g. KVM very cleverly managing
the PKRU register using intercepts) that makes this reliably load the
guest value.  An innocent or malicious guest could easily make that
condition evaluate to false, thus allowing the host PKRU value to be
live in guest mode.  (Or is something fancy going on here?)

I don't even want to think about what happens if a perf NMI hits and
accesses host user memory while the guest PKRU is live (on VMX -- I
think this can't happen on SVM).

>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>
> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>                 return;
>
>         if (static_cpu_has(X86_FEATURE_PKU) &&
> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>                 vcpu->arch.pkru = rdpkru();
>                 if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> -                       __write_pkru(vcpu->arch.host_pkru);
> +                       __write_pkru(vcpu->arch.host_pkru, true);
>         }

Suppose the guest writes to PKRU and then, without exiting, sets PKE =
0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
happen except on SEV where maybe SEV magic makes the problem go away?)

I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
the rest of guest XSTATE.
Jon Kohler May 17, 2021, 2:50 a.m. UTC | #5
> On May 14, 2021, at 1:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, May 7, 2021 at 9:45 AM Jon Kohler <jon@nutanix.com> wrote:
>> 
>> kvm_load_host_xsave_state handles xsave on vm exit, part of which is
>> managing memory protection key state. The latest arch.pkru is updated
>> with a rdpkru, and if that doesn't match the base host_pkru (which
>> about 70% of the time), we issue a __write_pkru.
> 
> This thread caused me to read the code, and I don't get how it's even
> remotely correct.
> 
> First, kvm_load_guest_fpu() has this delight:
> 
>    /*
>     * Guests with protected state can't have it set by the hypervisor,
>     * so skip trying to set it.
>     */
>    if (vcpu->arch.guest_fpu)
>        /* PKRU is separately restored in kvm_x86_ops.run. */
>        __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
>                    ~XFEATURE_MASK_PKRU);
> 
> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
> that bit is live, and the only way to restore it to 0 is with
> XRSTOR(S).

Thanks, Andy - adding Tom Lendacky to this thread, as that
particular snippet was last touched in ~5.11 in ed02b213098a.

> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cebdaa1e3cf5..cd95adbd140c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>        }
>> 
>>        if (static_cpu_has(X86_FEATURE_PKU) &&
>> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>> -           vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -               __write_pkru(vcpu->arch.pkru);
>> +           vcpu->arch.pkru != vcpu->arch.host_pkru &&
>> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>> +               __write_pkru(vcpu->arch.pkru, false);
> 
> Please tell me I'm missing something (e.g. KVM very cleverly managing
> the PKRU register using intercepts) that makes this reliably load the
> guest value.  An innocent or malicious guest could easily make that
> condition evaluate to false, thus allowing the host PKRU value to be
> live in guest mode.  (Or is something fancy going on here?)
> 
> I don't even want to think about what happens if a perf NMI hits and
> accesses host user memory while the guest PKRU is live (on VMX -- I
> think this can't happen on SVM).

Perhaps Babu / Dave can comment on the exiting logic here? Babu did
some additional work to fix save/restore on 37486135d3a.

From this change’s perspective, I’m just trying to get to the resultant
evaluation a bit quicker, which this change (and the v3) seems to do
ok with; however, if I’ve run my ship into a larger iceberg, happy to
collaborate to make it better overall.

> 
>> }
>> EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>> 
>> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>                return;
>> 
>>        if (static_cpu_has(X86_FEATURE_PKU) &&
>> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>                vcpu->arch.pkru = rdpkru();
>>                if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -                       __write_pkru(vcpu->arch.host_pkru);
>> +                       __write_pkru(vcpu->arch.host_pkru, true);
>>        }
> 
> Suppose the guest writes to PKRU and then, without exiting, sets PKE =
> 0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
> happen except on SEV where maybe SEV magic makes the problem go away?)
> 
> I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
> the rest of guest XSTATE.

I’ll defer to Dave/Babu here. This code has been this way for a bit now,
I’m not sure at first glance if that situation could occur intentionally
or accidentally, but that would evaluate the same both before and
after this change, no?
Paolo Bonzini May 17, 2021, 7:46 a.m. UTC | #6
On 14/05/21 07:11, Andy Lutomirski wrote:
> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
> that bit is live, and the only way to restore it to 0 is with
> XRSTOR(S).

The manual says "It is possible for XINUSE[i] to be 1 even when state 
component i is in its initial configuration" so this is architecturally 
valid.  Does the XINUSE optimization matter for PKRU, which is a single word?

>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cebdaa1e3cf5..cd95adbd140c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>          }
>>
>>          if (static_cpu_has(X86_FEATURE_PKU) &&
>> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>> -           vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -               __write_pkru(vcpu->arch.pkru);
>> +           vcpu->arch.pkru != vcpu->arch.host_pkru &&
>> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>> +               __write_pkru(vcpu->arch.pkru, false);
> 
> Please tell me I'm missing something (e.g. KVM very cleverly managing
> the PKRU register using intercepts) that makes this reliably load the
> guest value.  An innocent or malicious guest could easily make that
> condition evaluate to false, thus allowing the host PKRU value to be
> live in guest mode.  (Or is something fancy going on here?)

RDPKRU/WRPKRU cannot be used unless CR4.PKE=1, but PKRU can still be 
accessed using XSAVE/XRSTOR.  However both CR4 and XCR0 have their 
writes trapped, so the guest will not use the host PKRU value before the 
next vmexit if CR4.PKE=0 and XCR0.PKRU=0.

> I don't even want to think about what happens if a perf NMI hits and
> accesses host user memory while the guest PKRU is live (on VMX -- I
> think this can't happen on SVM).

This is indeed a problem, which indeed cannot happen on SVM but is there 
on VMX.  Note that the function above is not handling all of the xstate, 
it's handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the 
window is small, but it's there.

Is it solvable at all, without having PKRU fields in the VMCS (and 
without masking NMIs in the LAPIC which would be too expensive)?  Dave, 
Sean, what do you think?

>>   }
>>   EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>
>> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>                  return;
>>
>>          if (static_cpu_has(X86_FEATURE_PKU) &&
>> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>                  vcpu->arch.pkru = rdpkru();
>>                  if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -                       __write_pkru(vcpu->arch.host_pkru);
>> +                       __write_pkru(vcpu->arch.host_pkru, true);
>>          }
> 
> Suppose the guest writes to PKRU and then, without exiting, sets PKE =
> 0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
> happen except on SEV where maybe SEV magic makes the problem go away?)

Yes, see above.  KVM needs to trap CR4 and XCR0 anyway (CR4 because you 
don't want the guest to clear e.g. MCE, XCR0 to forbid setting bits that 
the host kernel does not have in its own xstate).

> I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
> the rest of guest XSTATE.

Because the rest of the guest XSTATE is live too early.  The problem you
mention above with respect to perf, where you access host memory with 
the guest PKRU, would be very much amplified.

It is basically the opposite problem of what you have in 
switch_fpu_finish, which loads PKRU eagerly while delaying the rest to 
the switch to userland.

Paolo
Dave Hansen May 17, 2021, 1:54 p.m. UTC | #7
On 5/13/21 10:11 PM, Andy Lutomirski wrote:
> I don't even want to think about what happens if a perf NMI hits and
> accesses host user memory while the guest PKRU is live (on VMX -- I
> think this can't happen on SVM).

What's the relevant difference between SVM and VMX here?  I'm missing
something.
Tom Lendacky May 17, 2021, 4:35 p.m. UTC | #8
On 5/16/21 9:50 PM, Jon Kohler wrote:
> 
> 
>> On May 14, 2021, at 1:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Fri, May 7, 2021 at 9:45 AM Jon Kohler <jon@nutanix.com> wrote:
>>>
>>> kvm_load_host_xsave_state handles xsave on vm exit, part of which is
>>> managing memory protection key state. The latest arch.pkru is updated
>>> with a rdpkru, and if that doesn't match the base host_pkru (which
>>> about 70% of the time), we issue a __write_pkru.
>>
>> This thread caused me to read the code, and I don't get how it's even
>> remotely correct.
>>
>> First, kvm_load_guest_fpu() has this delight:
>>
>>    /*
>>     * Guests with protected state can't have it set by the hypervisor,
>>     * so skip trying to set it.
>>     */
>>    if (vcpu->arch.guest_fpu)
>>        /* PKRU is separately restored in kvm_x86_ops.run. */
>>        __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
>>                    ~XFEATURE_MASK_PKRU);
>>
>> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
>> that bit is live, and the only way to restore it to 0 is with
>> XRSTOR(S).
> 
> Thanks, Andy - Adding Tom Lendacky to this thread as that
> Particular snippet was last touched in ~5.11 in ed02b213098a.

It sounds like Andy's comment is separate from the changes associated with
commit ed02b213098a, right?

Commit ed02b213098a just added the check for vcpu->arch.guest_fpu to
support SEV-ES guests. Since the hypervisor can't save/restore those
registers directly when running under SEV-ES, we skip this. The state will
be restored when VMRUN is performed.

Thanks,
Tom

> 
>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index cebdaa1e3cf5..cd95adbd140c 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>>        }
>>>
>>>        if (static_cpu_has(X86_FEATURE_PKU) &&
>>> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>>> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>>> -           vcpu->arch.pkru != vcpu->arch.host_pkru)
>>> -               __write_pkru(vcpu->arch.pkru);
>>> +           vcpu->arch.pkru != vcpu->arch.host_pkru &&
>>> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>>> +               __write_pkru(vcpu->arch.pkru, false);
>>
>> Please tell me I'm missing something (e.g. KVM very cleverly managing
>> the PKRU register using intercepts) that makes this reliably load the
>> guest value.  An innocent or malicious guest could easily make that
>> condition evaluate to false, thus allowing the host PKRU value to be
>> live in guest mode.  (Or is something fancy going on here?)
>>
>> I don't even want to think about what happens if a perf NMI hits and
>> accesses host user memory while the guest PKRU is live (on VMX -- I
>> think this can't happen on SVM).
> 
> Perhaps Babu / Dave can comment on the exiting logic here? Babu did
> some additional work to fix save/restore on 37486135d3a.
> 
> From this changes perspective, I’m just trying to get to the resultant
> evaluation a bit quicker, which this change (and the v3) seems to do
> ok with; however, if I’ve ran my ship into a larger ice berg, happy to
> collaborate to make it better overall.
> 
>>
>>> }
>>> EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>>
>>> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>>                return;
>>>
>>>        if (static_cpu_has(X86_FEATURE_PKU) &&
>>> -           (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>>> -            (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>>> +           ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>> +            kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>>                vcpu->arch.pkru = rdpkru();
>>>                if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>> -                       __write_pkru(vcpu->arch.host_pkru);
>>> +                       __write_pkru(vcpu->arch.host_pkru, true);
>>>        }
>>
>> Suppose the guest writes to PKRU and then, without exiting, sets PKE =
>> 0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
>> happen except on SEV where maybe SEV magic makes the problem go away?)
>>
>> I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
>> the rest of guest XSTATE.
> 
> I’ll defer to Dave/Babu here. This code has been this way for a bit now,
> I’m not sure at first glance if that situation could occur intentionally
> or accidentally, but that would evaluate the same both before and
> after this change, no?
>
Paolo Bonzini May 17, 2021, 4:43 p.m. UTC | #9
On 17/05/21 15:54, Dave Hansen wrote:
> On 5/13/21 10:11 PM, Andy Lutomirski wrote:
>> I don't even want to think about what happens if a perf NMI hits and
>> accesses host user memory while the guest PKRU is live (on VMX -- I
>> think this can't happen on SVM).
> 
> What's the relevant difference between SVM and VMX here?  I'm missing
> something.

SVM has the global interrupt flag that blocks NMIs and SMIs, and PKRU is 
loaded while GIF=0.

Paolo
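
A simplified sketch of that ordering in svm_vcpu_run() (paraphrased,
not the literal source):

	clgi();					/* GIF=0: NMIs and SMIs blocked  */
	kvm_load_guest_xsave_state(vcpu);	/* guest PKRU becomes live       */
	svm_vcpu_enter_exit(vcpu);		/* VMRUN ... #VMEXIT             */
	kvm_load_host_xsave_state(vcpu);	/* host PKRU restored            */
	stgi();					/* GIF=1: pending NMIs delivered */
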
Sean Christopherson May 17, 2021, 5:39 p.m. UTC | #10
On Mon, May 17, 2021, Paolo Bonzini wrote:
> On 14/05/21 07:11, Andy Lutomirski wrote:
> > I don't even want to think about what happens if a perf NMI hits and
> > accesses host user memory while the guest PKRU is live (on VMX -- I
> > think this can't happen on SVM).
> 
> This is indeed a problem, which indeed cannot happen on SVM but is there on
> VMX.  Note that the function above is not handling all of the xstate, it's
> handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the window is
> small, but it's there.
> 
> Is it solvable at all, without having PKRU fields in the VMCS (and without
> masking NMIs in the LAPIC which would be too expensive)?  Dave, Sean, what
> do you think?

The least awful solution would be to have the NMI handler restore the host's
PKRU.  The NMI handler would need to save/restore the register, a la CR2, but the
whole thing could be optimized to run if and only if the NMI lands in the window
where the guest's PKRU is loaded.
Dave Hansen May 17, 2021, 5:55 p.m. UTC | #11
On 5/17/21 10:39 AM, Sean Christopherson wrote:
> On Mon, May 17, 2021, Paolo Bonzini wrote:
>> On 14/05/21 07:11, Andy Lutomirski wrote:
>>> I don't even want to think about what happens if a perf NMI hits and
>>> accesses host user memory while the guest PKRU is live (on VMX -- I
>>> think this can't happen on SVM).
>> This is indeed a problem, which indeed cannot happen on SVM but is there on
>> VMX.  Note that the function above is not handling all of the xstate, it's
>> handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the window is
>> small, but it's there.
>>
>> Is it solvable at all, without having PKRU fields in the VMCS (and without
>> masking NMIs in the LAPIC which would be too expensive)?  Dave, Sean, what
>> do you think?
> The least awful solution would be to have the NMI handler restore the host's
> PKRU.  The NMI handler would need to save/restore the register, a la CR2, but the
> whole thing could be optimized to run if and only if the NMI lands in the window
> where the guest's PKRU is loaded.

What were you thinking about?  Something like:

	*this_cpu_ptr(&need_nmi_wpkru) = 1
	// Enter Guest
	__write_pkru(vcpu->arch.pkru);
	*this_cpu_ptr(&need_nmi_wpkru) = 0

And then in the NMI handler:

	u32 pkru;

	if (*this_cpu_ptr(&need_nmi_wpkru)) {
		pkru = rdpkru();
		__write_pkru(vcpu->arch.pkru);
	}
	...
	copy_*_user_nmi(... whatever ...);
	...
	if (*this_cpu_ptr(&need_nmi_wpkru))
		__write_pkru(pkru);

?

I was thinking we could just blow away PKRU without saving/restoring it
in the NMI handler, but that might clobber PKRU in the window between
need_nmi_wpkru=1 and entering the guest.

But, the save/restore seems doable especially since we can do it in C
and don't have to mess with the NMI stack or anything.
Dave Hansen May 17, 2021, 5:59 p.m. UTC | #12
On 5/17/21 10:49 AM, Andy Lutomirski wrote:
>> The least awful solution would be to have the NMI handler restore
>> the host's PKRU.  The NMI handler would need to save/restore the
>> register, a la CR2, but the whole thing could be optimized to run
>> if and only if the NMI lands in the window where the guest's PKRU
>> is loaded.
> 
> Or set a flag causing nmi_uaccess_ok() to return false.

Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
essentially a context switch.

Will widening the window where nmi_uaccess_okay()==false anger any of
the perf folks?  It looks like perf knows how to handle it nicely.
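
A sketch of what that flag could look like (hypothetical names; nothing
like this exists yet):

	DECLARE_PER_CPU(bool, guest_pkru_loaded);	/* hypothetical */

	static inline bool nmi_uaccess_okay(void)
	{
		/* guest PKRU live: host uaccess would be unsafe */
		if (this_cpu_read(guest_pkru_loaded))
			return false;
		/* ... existing loaded_mm / current->mm checks ... */
		return true;
	}

	/* KVM side, bracketing the vmenter/vmexit window: */
	this_cpu_write(guest_pkru_loaded, true);
	__write_pkru(vcpu->arch.pkru, false);
	/* ... enter guest ... exit guest ... */
	vcpu->arch.pkru = rdpkru();
	__write_pkru(vcpu->arch.host_pkru, true);
	this_cpu_write(guest_pkru_loaded, false);
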
Sean Christopherson May 17, 2021, 6:02 p.m. UTC | #13
On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 10:39 AM, Sean Christopherson wrote:
> > On Mon, May 17, 2021, Paolo Bonzini wrote:
> >> On 14/05/21 07:11, Andy Lutomirski wrote:
> >>> I don't even want to think about what happens if a perf NMI hits and
> >>> accesses host user memory while the guest PKRU is live (on VMX -- I
> >>> think this can't happen on SVM).
> >> This is indeed a problem, which indeed cannot happen on SVM but is there on
> >> VMX.  Note that the function above is not handling all of the xstate, it's
> >> handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the window is
> >> small, but it's there.
> >>
> >> Is it solvable at all, without having PKRU fields in the VMCS (and without
> >> masking NMIs in the LAPIC which would be too expensive)?  Dave, Sean, what
> >> do you think?
> > The least awful solution would be to have the NMI handler restore the host's
> > PKRU.  The NMI handler would need to save/restore the register, a la CR2, but the
> > whole thing could be optimized to run if and only if the NMI lands in the window
> > where the guest's PKRU is loaded.
> 
> What were you thinking about?  Something like:
> 
> 	*this_cpu_ptr(&need_nmi_wpkru) = 1
> 	// Enter Guest
> 	__write_pkru(vcpu->arch.pkru);
> 	*this_cpu_ptr(&need_nmi_wpkru) = 0
> 
> And then in the NMI handler:
> 
> 	u32 pkru;
> 
> 	if (*this_cpu_ptr(&need_nmi_wpkru)) {
> 		pkru = rdpku();
> 		__write_pkru(vcpu->arch.pkru);

This isn't correct, vcpu->arch.pkru holds the guest value, the NMI handler needs
to load the host value.  I was thinking KVM would stash away the current host
value, and the NMI handler would check the saved value against rdpkru().

> 	}
> 	...
> 	copy_*_user_nmi(... whatever ...);
> 	...
> 	if (*this_cpu_ptr(&need_nmi_wpkru))
> 		__write_pkru(pkru);
> 
> ?
> 
> I was thinking we could just blow away PKRU without saving/restoring it
> in the NMI handler, but that might clobber PKRU in the window between
> need_nmi_wpkru=1 and entering the guest.

Yep.  It would also blow away the guest's value if the guest did WRPKRU while it
was running since KVM would never get a chance to read/save the guest's new value.
 
> But, the save/restore seems doable especially since we can do it in C
> and don't have to mess with the NMI stack or anything.
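
Folding that correction into the earlier pseudocode gives something
like this (hypothetical and untested):

	/* KVM: stash the host value before loading the guest value */
	this_cpu_write(kvm_host_pkru, vcpu->arch.host_pkru);
	this_cpu_write(need_nmi_wpkru, true);
	__write_pkru(vcpu->arch.pkru, false);
	/* ... enter guest ... exit guest ... */
	vcpu->arch.pkru = rdpkru();
	__write_pkru(vcpu->arch.host_pkru, true);
	this_cpu_write(need_nmi_wpkru, false);

	/* NMI handler: restore the *host* value, a la CR2 */
	u32 pkru = rdpkru();

	if (this_cpu_read(need_nmi_wpkru) &&
	    pkru != this_cpu_read(kvm_host_pkru))
		wrpkru(this_cpu_read(kvm_host_pkru));
	/* ... copy_*_user_nmi(... whatever ...) ... */
	if (this_cpu_read(need_nmi_wpkru))
		wrpkru(pkru);	/* put back whatever was live */
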
Sean Christopherson May 17, 2021, 6:04 p.m. UTC | #14
On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 10:49 AM, Andy Lutomirski wrote:
> >> The least awful solution would be to have the NMI handler restore
> >> the host's PKRU.  The NMI handler would need to save/restore the
> >> register, a la CR2, but the whole thing could be optimized to run
> >> if and only if the NMI lands in the window where the guest's PKRU
> >> is loaded.
> > 
> > Or set a flag causing nmi_uaccess_ok() to return false.
> 
> Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
> essentially a context switch.

I like that idea, too.

The flag might also be useful to fix the issue where the NMI handler activates
PEBS after KVM disables it.  Jim?

> Will widening the window where nmi_uaccess_okay()==false anger any of
> the perf folks?  It looks like perf knows how to handle it nicely.
Jim Mattson May 17, 2021, 6:15 p.m. UTC | #15
On Mon, May 17, 2021 at 11:05 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 17, 2021, Dave Hansen wrote:
> > On 5/17/21 10:49 AM, Andy Lutomirski wrote:
> > >> The least awful solution would be to have the NMI handler restore
> > >> the host's PKRU.  The NMI handler would need to save/restore the
> > >> register, a la CR2, but the whole thing could be optimized to run
> > >> if and only if the NMI lands in the window where the guest's PKRU
> > >> is loaded.
> > >
> > > Or set a flag causing nmi_uaccess_ok() to return false.
> >
> > Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
> > essentially a context switch.
>
> I like that idea, too.
>
> The flag might also be useful to fix the issue where the NMI handler activates
> PEBS after KVM disables it.  Jim?

The issue is actually that the NMI handler *clears* IA32_PEBS_ENABLE
bits after giving out the host value of the MSR to KVM. If we were to
block the NMI handler from modifying IA32_PEBS_ENABLE until after the
next VM-exit, that could solve this issue. I don't know if it makes
sense to piggyback on nmi_uaccess(), though.

> > Will widening the window where nmi_uaccess_okay()==false anger any of
> > the perf folks?  It looks like perf knows how to handle it nicely.
Sean Christopherson May 17, 2021, 6:34 p.m. UTC | #16
On Mon, May 17, 2021, Jim Mattson wrote:
> On Mon, May 17, 2021 at 11:05 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, May 17, 2021, Dave Hansen wrote:
> > > On 5/17/21 10:49 AM, Andy Lutomirski wrote:
> > > >> The least awful solution would be to have the NMI handler restore
> > > >> the host's PKRU.  The NMI handler would need to save/restore the
> > > >> register, a la CR2, but the whole thing could be optimized to run
> > > >> if and only if the NMI lands in the window where the guest's PKRU
> > > >> is loaded.
> > > >
> > > > Or set a flag causing nmi_uaccess_ok() to return false.
> > >
> > > Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
> > > essentially a context switch.
> >
> > I like that idea, too.
> >
> > The flag might also be useful to fix the issue where the NMI handler activates
> > PEBS after KVM disables it.  Jim?
> 
> The issue is actually that the NMI handler *clears* IA32_PEBS_ENABLE
> bits after giving out the host value of the MSR to KVM. If we were to
> block the NMI handler from modifying IA32_PEBS_ENABLE until after the
> next VM-exit, that could solve this issue. I don't know if it makes
> sense to piggyback on nmi_uaccess(), though.

I wasn't thinking about using nmi_uaccess_okay(), but rather whatever flag is
added, so that KVM can inform the NMI handler that KVM is in the middle of
its version of a context switch.

> > > Will widening the window where nmi_uaccess_okay()==false anger any of
> > > the perf folks?  It looks like perf knows how to handle it nicely.
Dave Hansen May 19, 2021, 10:44 p.m. UTC | #17
On 5/17/21 12:46 AM, Paolo Bonzini wrote:
> On 14/05/21 07:11, Andy Lutomirski wrote:
>> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
>> that bit is live, and the only way to restore it to 0 is with
>> XRSTOR(S).
> 
> The manual says "It is possible for XINUSE[i] to be 1 even when state
> component i is in its initial configuration" so this is architecturally
> valid.  Does the XINUSE optimization matter for PKRU which is a single
> word?

In Linux with normal userspace, virtually never.

The hardware defaults PKRU to 0x0 which means "no restrictions on any
keys".  Linux defaults PKRU via 'init_pkru_value' to the most
restrictive value.  This ensures that new non-zero-pkey-assigned memory
is protected by default.

But, that also means PKRU is virtually never in its init state in Linux.
An app would probably need to manipulate PKRU with XRSTOR to get
XINUSE[PKRU]=0.

It would only even *possibly* be useful if running a KVM guest that had
PKRU=0x0 (sorry I don't consider things using KVM "normal userspace" :P ).
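
For reference, the default Dave describes lives in arch/x86/mm/pkeys.c
(quoted from memory for kernels of this era; every key except pkey 0
starts access-disabled):

	u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
			      PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
			      PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
			      PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
			      PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);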
Andy Lutomirski May 19, 2021, 11:15 p.m. UTC | #18
On Wed, May 19, 2021, at 3:44 PM, Dave Hansen wrote:
> On 5/17/21 12:46 AM, Paolo Bonzini wrote:
> > On 14/05/21 07:11, Andy Lutomirski wrote:
> >> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
> >> that bit is live, and the only way to restore it to 0 is with
> >> XRSTOR(S).
> > 
> > The manual says "It is possible for XINUSE[i] to be 1 even when state
> > component i is in its initial configuration" so this is architecturally
> > valid.  Does the XINUSE optimization matter for PKRU which is a single
> > word?
> 
> In Linux with normal userspace, virtually never.
> 
> The hardware defaults PKRU to 0x0 which means "no restrictions on any
> keys".  Linux defaults PKRU via 'init_pkru_value' to the most
> restrictive value.  This ensures that new non-zero-pkey-assigned memory
> is protected by default.
> 
> But, that also means PKRU is virtually never in its init state in Linux.
>  An app would probably need to manipulate PKRU with XRSTOR to get
> XINUSE[PKRU]=0.
> 
> It would only even *possibly* be useful if running a KVM guest that had
> PKRU=0x0 (sorry I don't consider things using KVM "normal userspace" :P ).
> 

There was at least one report from the rr camp of glibc behaving differently depending on the result of XGETBV(1).  It's at least impolite to change the XINUSE register for a guest behind its back.

Admittedly that particular report wasn't about PKRU, and *Linux* guests won't run with XINUSE[PKRU]=0 under normal circumstances, but non-Linux guests could certainly do so.
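
For the curious: XINUSE is observable via XGETBV with ECX=1 (supported
when CPUID.(EAX=0DH,ECX=1):EAX[2] is set), which returns XCR0 & XINUSE.
A userspace sketch:

	static inline unsigned long long xgetbv1(void)
	{
		unsigned int eax, edx;

		/* XGETBV, ECX=1: returns XCR0 & XINUSE */
		asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (1));
		return eax | ((unsigned long long)edx << 32);
	}

	/* XINUSE[PKRU] is bit 9 (XFEATURE_MASK_PKRU): */
	int pkru_in_use = !!(xgetbv1() & (1ull << 9));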

Patch

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 8d33ad80704f..0860bbadb422 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -583,7 +583,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 		if (pk)
 			pkru_val = pk->pkru;
 	}
-	__write_pkru(pkru_val);
+	__write_pkru(pkru_val, false);

 	/*
 	 * Expensive PASID MSR write will be avoided in update_pasid() because
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..8c8d4796b864 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -151,7 +151,7 @@ static inline void write_pkru(u32 pkru)
 	fpregs_lock();
 	if (pk)
 		pk->pkru = pkru;
-	__write_pkru(pkru);
+	__write_pkru(pkru, false);
 	fpregs_unlock();
 }

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2acd6cb62328..f4ac8ec3f096 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -104,13 +104,15 @@ static inline void wrpkru(u32 pkru)
 		     : : "a" (pkru), "c"(ecx), "d"(edx));
 }

-static inline void __write_pkru(u32 pkru)
+static inline void __write_pkru(u32 pkru, bool skip_comparison)
 {
 	/*
 	 * WRPKRU is relatively expensive compared to RDPKRU.
-	 * Avoid WRPKRU when it would not change the value.
+	 * Avoid WRPKRU when it would not change the value,
+	 * except when the caller has already done the
+	 * comparison, in which case skip redundant RDPKRU.
 	 */
-	if (pkru == rdpkru())
+	if (!skip_comparison && pkru == rdpkru())
 		return;

 	wrpkru(pkru);
@@ -122,7 +124,7 @@ static inline u32 rdpkru(void)
 	return 0;
 }

-static inline void __write_pkru(u32 pkru)
+static inline void __write_pkru(u32 pkru, bool skip_comparison)
 {
 }
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cebdaa1e3cf5..cd95adbd140c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 	}

 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
-	    vcpu->arch.pkru != vcpu->arch.host_pkru)
-		__write_pkru(vcpu->arch.pkru);
+	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
+		__write_pkru(vcpu->arch.pkru, false);
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);

@@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 		return;

 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
-			__write_pkru(vcpu->arch.host_pkru);
+			__write_pkru(vcpu->arch.host_pkru, true);
 	}

 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {