| Message ID | 20210507164456.1033-1-jon@nutanix.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: x86: add hint to skip hidden rdpkru under kvm_load_host_xsave_state |
Nice one. The patch can be made simpler though (I think).

On 07/05/21 18:44, Jon Kohler wrote:
> @@ -122,7 +124,7 @@ static inline u32 rdpkru(void)
>  	return 0;
>  }
>
> -static inline void __write_pkru(u32 pkru)
> +static inline void __write_pkru(u32 pkru, bool skip_comparison)
>  {
>  }
>  #endif
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cebdaa1e3cf5..cd95adbd140c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  	}
>
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
> -		__write_pkru(vcpu->arch.pkru);
> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
> +		__write_pkru(vcpu->arch.pkru, false);

This can be optimized as well, can't it?  This means that the only case
that needs the rdpkru is in switch_fpu_finish, and __write_pkru can be
removed completely:

- do the rdpkru+wrpkru in switch_fpu_finish

- just use wrpkru in KVM

Paolo

>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>
> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  		return;
>
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>  		vcpu->arch.pkru = rdpkru();
>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> -			__write_pkru(vcpu->arch.host_pkru);
> +			__write_pkru(vcpu->arch.host_pkru, true);
>  	}
>
>  	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
> --
> 2.30.1 (Apple Git-130)
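Roughly, the suggested shape is the following (a sketch only, not code from
the thread; it reuses the helpers and field names already visible in the
patch above):

	/* switch_fpu_finish(): keep the read-compare-write, open-coded. */
	if (pkru_val != rdpkru())
		wrpkru(pkru_val);

	/*
	 * kvm_load_guest_xsave_state(): the caller has already compared
	 * against host_pkru, so it can write unconditionally.
	 */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
		wrpkru(vcpu->arch.pkru);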
On 5/7/21 9:52 AM, Paolo Bonzini wrote:
> This can be optimized as well, can't it?  This means that the only case
> that needs the rdpkru is in switch_fpu_finish, and __write_pkru can be
> removed completely:
>
> - do the rdpkru+wrpkru in switch_fpu_finish
>
> - just use wrpkru in KVM

I was going to suggest exactly the same thing.  It doesn't require the
compiler to be smart, and wrpkru() is available in the same header as
__write_pkru().

I also detest the mysterious true/false arguments to functions where you
have no clue what they're doing at the call site without comments.
> On May 7, 2021, at 12:58 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/7/21 9:52 AM, Paolo Bonzini wrote:
>> This can be optimized as well, can't it?  This means that the only case
>> that needs the rdpkru is in switch_fpu_finish, and __write_pkru can be
>> removed completely:
>>
>> - do the rdpkru+wrpkru in switch_fpu_finish
>>
>> - just use wrpkru in KVM

Thanks Paolo, Dave. Good suggestion, I'll work up a v2 and send it out.

> I was going to suggest exactly the same thing.  It doesn't require the
> compiler to be smart, and wrpkru() is available in the same header as
> __write_pkru().
>
> I also detest the mysterious true/false arguments to functions where you
> have no clue what they're doing at the call site without comments.

Noted, my apologies. That's a good point, I won't make that mistake again.
On Fri, May 7, 2021 at 9:45 AM Jon Kohler <jon@nutanix.com> wrote:
>
> kvm_load_host_xsave_state handles xsave on vm exit, part of which is
> managing memory protection key state. The latest arch.pkru is updated
> with a rdpkru, and if that doesn't match the base host_pkru (which it
> doesn't, about 70% of the time), we issue a __write_pkru.

This thread caused me to read the code, and I don't get how it's even
remotely correct.

First, kvm_load_guest_fpu() has this delight:

	/*
	 * Guests with protected state can't have it set by the hypervisor,
	 * so skip trying to set it.
	 */
	if (vcpu->arch.guest_fpu)
		/* PKRU is separately restored in kvm_x86_ops.run. */
		__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
					~XFEATURE_MASK_PKRU);

That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
that bit is live, and the only way to restore it to 0 is with XRSTOR(S).

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cebdaa1e3cf5..cd95adbd140c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  	}
>
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
> -		__write_pkru(vcpu->arch.pkru);
> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
> +		__write_pkru(vcpu->arch.pkru, false);

Please tell me I'm missing something (e.g. KVM very cleverly managing
the PKRU register using intercepts) that makes this reliably load the
guest value.  An innocent or malicious guest could easily make that
condition evaluate to false, thus allowing the host PKRU value to be
live in guest mode.  (Or is something fancy going on here?)

I don't even want to think about what happens if a perf NMI hits and
accesses host user memory while the guest PKRU is live (on VMX -- I
think this can't happen on SVM).

>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>
> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  		return;
>
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>  		vcpu->arch.pkru = rdpkru();
>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
> -			__write_pkru(vcpu->arch.host_pkru);
> +			__write_pkru(vcpu->arch.host_pkru, true);
>  	}

Suppose the guest writes to PKRU and then, without exiting, sets PKE =
0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
happen except on SEV where maybe SEV magic makes the problem go away?)

I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
the rest of guest XSTATE.
> On May 14, 2021, at 1:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, May 7, 2021 at 9:45 AM Jon Kohler <jon@nutanix.com> wrote:
>>
>> kvm_load_host_xsave_state handles xsave on vm exit, part of which is
>> managing memory protection key state. The latest arch.pkru is updated
>> with a rdpkru, and if that doesn't match the base host_pkru (which it
>> doesn't, about 70% of the time), we issue a __write_pkru.
>
> This thread caused me to read the code, and I don't get how it's even
> remotely correct.
>
> First, kvm_load_guest_fpu() has this delight:
>
>	/*
>	 * Guests with protected state can't have it set by the hypervisor,
>	 * so skip trying to set it.
>	 */
>	if (vcpu->arch.guest_fpu)
>		/* PKRU is separately restored in kvm_x86_ops.run. */
>		__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
>					~XFEATURE_MASK_PKRU);
>
> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
> that bit is live, and the only way to restore it to 0 is with XRSTOR(S).

Thanks, Andy - adding Tom Lendacky to this thread, as that particular
snippet was last touched in ~5.11 in ed02b213098a.

>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cebdaa1e3cf5..cd95adbd140c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>  	}
>>
>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -		__write_pkru(vcpu->arch.pkru);
>> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>> +		__write_pkru(vcpu->arch.pkru, false);
>
> Please tell me I'm missing something (e.g. KVM very cleverly managing
> the PKRU register using intercepts) that makes this reliably load the
> guest value.  An innocent or malicious guest could easily make that
> condition evaluate to false, thus allowing the host PKRU value to be
> live in guest mode.  (Or is something fancy going on here?)
>
> I don't even want to think about what happens if a perf NMI hits and
> accesses host user memory while the guest PKRU is live (on VMX -- I
> think this can't happen on SVM).

Perhaps Babu / Dave can comment on the exiting logic here? Babu did
some additional work to fix save/restore in 37486135d3a.

From this change's perspective, I'm just trying to get to the resultant
evaluation a bit quicker, which this change (and the v3) seems to do ok
with; however, if I've run my ship into a larger iceberg, I'm happy to
collaborate to make it better overall.

>>  }
>>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>
>> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>  		return;
>>
>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>  		vcpu->arch.pkru = rdpkru();
>>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -			__write_pkru(vcpu->arch.host_pkru);
>> +			__write_pkru(vcpu->arch.host_pkru, true);
>>  	}
>
> Suppose the guest writes to PKRU and then, without exiting, sets PKE =
> 0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
> happen except on SEV where maybe SEV magic makes the problem go away?)
>
> I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
> the rest of guest XSTATE.

I'll defer to Dave/Babu here. This code has been this way for a bit now;
I'm not sure at first glance if that situation could occur intentionally
or accidentally, but that would evaluate the same both before and after
this change, no?
On 14/05/21 07:11, Andy Lutomirski wrote:
> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
> that bit is live, and the only way to restore it to 0 is with XRSTOR(S).

The manual says "It is possible for XINUSE[i] to be 1 even when state
component i is in its initial configuration", so this is architecturally
valid.  Does the XINUSE optimization matter for PKRU, which is a single
word?

>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cebdaa1e3cf5..cd95adbd140c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>  	}
>>
>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -		__write_pkru(vcpu->arch.pkru);
>> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>> +		__write_pkru(vcpu->arch.pkru, false);
>
> Please tell me I'm missing something (e.g. KVM very cleverly managing
> the PKRU register using intercepts) that makes this reliably load the
> guest value.  An innocent or malicious guest could easily make that
> condition evaluate to false, thus allowing the host PKRU value to be
> live in guest mode.  (Or is something fancy going on here?)

RDPKRU/WRPKRU cannot be used unless CR4.PKE=1, but PKRU can still be
accessed using XSAVE/XRSTOR.  However, both CR4 and XCR0 have their
writes trapped, so the guest will not use the host PKRU value before
the next vmexit if CR4.PKE=0 and XCR0.PKRU=0.

> I don't even want to think about what happens if a perf NMI hits and
> accesses host user memory while the guest PKRU is live (on VMX -- I
> think this can't happen on SVM).

This is indeed a problem, which indeed cannot happen on SVM but is there
on VMX.  Note that the function above is not handling all of the xstate,
it's handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the
window is small, but it's there.

Is it solvable at all, without having PKRU fields in the VMCS (and
without masking NMIs in the LAPIC, which would be too expensive)?  Dave,
Sean, what do you think?

>>  }
>>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>
>> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>  		return;
>>
>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>  		vcpu->arch.pkru = rdpkru();
>>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>> -			__write_pkru(vcpu->arch.host_pkru);
>> +			__write_pkru(vcpu->arch.host_pkru, true);
>>  	}
>
> Suppose the guest writes to PKRU and then, without exiting, sets PKE =
> 0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
> happen except on SEV where maybe SEV magic makes the problem go away?)

Yes, see above.  KVM needs to trap CR4 and XCR0 anyway (CR4 because you
don't want the guest to clear e.g. MCE, XCR0 to forbid setting bits that
the host kernel does not have in its own xstate).

> I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
> the rest of guest XSTATE.

Because the rest of the guest XSTATE is live too early.  The problem you
mention above with respect to perf, where you access host memory with
the guest PKRU, would be very much amplified.

It is basically the opposite problem of what you have in
switch_fpu_finish, which loads PKRU eagerly while delaying the rest to
the switch to userland.

Paolo
On 5/13/21 10:11 PM, Andy Lutomirski wrote:
> I don't even want to think about what happens if a perf NMI hits and
> accesses host user memory while the guest PKRU is live (on VMX -- I
> think this can't happen on SVM).

What's the relevant difference between SVM and VMX here?  I'm missing
something.
On 5/16/21 9:50 PM, Jon Kohler wrote:
>
>> On May 14, 2021, at 1:11 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Fri, May 7, 2021 at 9:45 AM Jon Kohler <jon@nutanix.com> wrote:
>>>
>>> kvm_load_host_xsave_state handles xsave on vm exit, part of which is
>>> managing memory protection key state. The latest arch.pkru is updated
>>> with a rdpkru, and if that doesn't match the base host_pkru (which it
>>> doesn't, about 70% of the time), we issue a __write_pkru.
>>
>> This thread caused me to read the code, and I don't get how it's even
>> remotely correct.
>>
>> First, kvm_load_guest_fpu() has this delight:
>>
>>	/*
>>	 * Guests with protected state can't have it set by the hypervisor,
>>	 * so skip trying to set it.
>>	 */
>>	if (vcpu->arch.guest_fpu)
>>		/* PKRU is separately restored in kvm_x86_ops.run. */
>>		__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
>>					~XFEATURE_MASK_PKRU);
>>
>> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
>> that bit is live, and the only way to restore it to 0 is with XRSTOR(S).
>
> Thanks, Andy - adding Tom Lendacky to this thread, as that particular
> snippet was last touched in ~5.11 in ed02b213098a.

It sounds like Andy's comment is separate from the changes associated
with commit ed02b213098a, right?

Commit ed02b213098a just added the check for vcpu->arch.guest_fpu to
support SEV-ES guests.  Since the hypervisor can't save/restore those
registers directly when running under SEV-ES, we skip this.  The state
will be restored when VMRUN is performed.

Thanks,
Tom

>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index cebdaa1e3cf5..cd95adbd140c 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>>  	}
>>>
>>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>>> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
>>> -		__write_pkru(vcpu->arch.pkru);
>>> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>>> +		__write_pkru(vcpu->arch.pkru, false);
>>
>> Please tell me I'm missing something (e.g. KVM very cleverly managing
>> the PKRU register using intercepts) that makes this reliably load the
>> guest value.  An innocent or malicious guest could easily make that
>> condition evaluate to false, thus allowing the host PKRU value to be
>> live in guest mode.  (Or is something fancy going on here?)
>>
>> I don't even want to think about what happens if a perf NMI hits and
>> accesses host user memory while the guest PKRU is live (on VMX -- I
>> think this can't happen on SVM).
>
> Perhaps Babu / Dave can comment on the exiting logic here? Babu did
> some additional work to fix save/restore in 37486135d3a.
>
> From this change's perspective, I'm just trying to get to the resultant
> evaluation a bit quicker, which this change (and the v3) seems to do ok
> with; however, if I've run my ship into a larger iceberg, I'm happy to
> collaborate to make it better overall.
>
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>>
>>> @@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>>  		return;
>>>
>>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>>  		vcpu->arch.pkru = rdpkru();
>>>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>> -			__write_pkru(vcpu->arch.host_pkru);
>>> +			__write_pkru(vcpu->arch.host_pkru, true);
>>>  	}
>>
>> Suppose the guest writes to PKRU and then, without exiting, sets PKE =
>> 0 and XCR0[PKRU] = 0.  (Or are the intercepts such that this can't
>> happen except on SEV where maybe SEV magic makes the problem go away?)
>>
>> I admit I'm fairly mystified as to why KVM doesn't handle PKRU like
>> the rest of guest XSTATE.
>
> I'll defer to Dave/Babu here. This code has been this way for a bit now;
> I'm not sure at first glance if that situation could occur intentionally
> or accidentally, but that would evaluate the same both before and after
> this change, no?
On 17/05/21 15:54, Dave Hansen wrote:
> On 5/13/21 10:11 PM, Andy Lutomirski wrote:
>> I don't even want to think about what happens if a perf NMI hits and
>> accesses host user memory while the guest PKRU is live (on VMX -- I
>> think this can't happen on SVM).
>
> What's the relevant difference between SVM and VMX here?  I'm missing
> something.

SVM has the global interrupt flag that blocks NMIs and SMIs, and PKRU is
loaded while GIF=0.

Paolo
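For reference, the SVM run loop orders things roughly like this (a
simplified sketch of svm_vcpu_run() around the time of this thread, not
verbatim kernel code):

	clgi();					/* GIF=0: NMIs/SMIs blocked */
	kvm_load_guest_xsave_state(vcpu);	/* guest PKRU goes live */

	/* VMRUN ... #VMEXIT */

	kvm_load_host_xsave_state(vcpu);	/* host PKRU restored */
	stgi();					/* GIF=1: pending NMIs delivered */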
On Mon, May 17, 2021, Paolo Bonzini wrote:
> On 14/05/21 07:11, Andy Lutomirski wrote:
>> I don't even want to think about what happens if a perf NMI hits and
>> accesses host user memory while the guest PKRU is live (on VMX -- I
>> think this can't happen on SVM).
>
> This is indeed a problem, which indeed cannot happen on SVM but is there
> on VMX.  Note that the function above is not handling all of the xstate,
> it's handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the
> window is small, but it's there.
>
> Is it solvable at all, without having PKRU fields in the VMCS (and
> without masking NMIs in the LAPIC, which would be too expensive)?  Dave,
> Sean, what do you think?

The least awful solution would be to have the NMI handler restore the
host's PKRU.  The NMI handler would need to save/restore the register, a
la CR2, but the whole thing could be optimized to run if and only if the
NMI lands in the window where the guest's PKRU is loaded.
On 5/17/21 10:39 AM, Sean Christopherson wrote:
> On Mon, May 17, 2021, Paolo Bonzini wrote:
>> On 14/05/21 07:11, Andy Lutomirski wrote:
>>> I don't even want to think about what happens if a perf NMI hits and
>>> accesses host user memory while the guest PKRU is live (on VMX -- I
>>> think this can't happen on SVM).
>> This is indeed a problem, which indeed cannot happen on SVM but is there
>> on VMX.  Note that the function above is not handling all of the xstate,
>> it's handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the
>> window is small, but it's there.
>>
>> Is it solvable at all, without having PKRU fields in the VMCS (and
>> without masking NMIs in the LAPIC, which would be too expensive)?  Dave,
>> Sean, what do you think?
> The least awful solution would be to have the NMI handler restore the host's
> PKRU.  The NMI handler would need to save/restore the register, a la CR2,
> but the whole thing could be optimized to run if and only if the NMI lands
> in the window where the guest's PKRU is loaded.

What were you thinking about?  Something like:

	*this_cpu_ptr(&need_nmi_wpkru) = 1
	// Enter Guest
	__write_pkru(vcpu->arch.pkru);
	*this_cpu_ptr(&need_nmi_wpkru) = 0

And then in the NMI handler:

	u32 pkru;

	if (*this_cpu_ptr(&need_nmi_wpkru)) {
		pkru = rdpkru();
		__write_pkru(vcpu->arch.pkru);
	}
	...
	copy_*_user_nmi(... whatever ...);
	...
	if (*this_cpu_ptr(&need_nmi_wpkru))
		__write_pkru(pkru);

?

I was thinking we could just blow away PKRU without saving/restoring it
in the NMI handler, but that might clobber PKRU in the window between
need_nmi_wpkru=1 and entering the guest.

But, the save/restore seems doable especially since we can do it in C
and don't have to mess with the NMI stack or anything.
On 5/17/21 10:49 AM, Andy Lutomirski wrote:
>> The least awful solution would be to have the NMI handler restore
>> the host's PKRU.  The NMI handler would need to save/restore the
>> register, a la CR2, but the whole thing could be optimized to run
>> if and only if the NMI lands in the window where the guest's PKRU
>> is loaded.
>
> Or set a flag causing nmi_uaccess_ok() to return false.

Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
essentially a context switch.

Will widening the window where nmi_uaccess_okay()==false anger any of
the perf folks?  It looks like perf knows how to handle it nicely.
On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 10:39 AM, Sean Christopherson wrote:
>> On Mon, May 17, 2021, Paolo Bonzini wrote:
>>> On 14/05/21 07:11, Andy Lutomirski wrote:
>>>> I don't even want to think about what happens if a perf NMI hits and
>>>> accesses host user memory while the guest PKRU is live (on VMX -- I
>>>> think this can't happen on SVM).
>>> This is indeed a problem, which indeed cannot happen on SVM but is there
>>> on VMX.  Note that the function above is not handling all of the xstate,
>>> it's handling the *XSAVE state*, that is XCR0, XSS and PKRU.  Thus the
>>> window is small, but it's there.
>>>
>>> Is it solvable at all, without having PKRU fields in the VMCS (and
>>> without masking NMIs in the LAPIC, which would be too expensive)?  Dave,
>>> Sean, what do you think?
>> The least awful solution would be to have the NMI handler restore the host's
>> PKRU.  The NMI handler would need to save/restore the register, a la CR2,
>> but the whole thing could be optimized to run if and only if the NMI lands
>> in the window where the guest's PKRU is loaded.
>
> What were you thinking about?  Something like:
>
>	*this_cpu_ptr(&need_nmi_wpkru) = 1
>	// Enter Guest
>	__write_pkru(vcpu->arch.pkru);
>	*this_cpu_ptr(&need_nmi_wpkru) = 0
>
> And then in the NMI handler:
>
>	u32 pkru;
>
>	if (*this_cpu_ptr(&need_nmi_wpkru)) {
>		pkru = rdpkru();
>		__write_pkru(vcpu->arch.pkru);

This isn't correct: vcpu->arch.pkru holds the guest value; the NMI
handler needs to load the host value.  I was thinking KVM would stash
away the current host value, and the NMI handler would check the saved
value against rdpkru().

>	}
>	...
>	copy_*_user_nmi(... whatever ...);
>	...
>	if (*this_cpu_ptr(&need_nmi_wpkru))
>		__write_pkru(pkru);
>
> ?
>
> I was thinking we could just blow away PKRU without saving/restoring it
> in the NMI handler, but that might clobber PKRU in the window between
> need_nmi_wpkru=1 and entering the guest.

Yep.  It would also blow away the guest's value if the guest did WRPKRU
while it was running, since KVM would never get a chance to read/save
the guest's new value.

> But, the save/restore seems doable especially since we can do it in C
> and don't have to mess with the NMI stack or anything.
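Applying that correction to Dave's sketch would give roughly the
following (illustrative only; "saved_host_pkru" is an invented per-cpu
variable, and need_nmi_wpkru is carried over from the sketch above):

	/* KVM, before entering the guest: stash the live host value. */
	*this_cpu_ptr(&saved_host_pkru) = rdpkru();
	*this_cpu_ptr(&need_nmi_wpkru) = 1;
	wrpkru(vcpu->arch.pkru);

	/* NMI handler, around the host uaccess: */
	u32 pkru = rdpkru();

	if (*this_cpu_ptr(&need_nmi_wpkru) &&
	    pkru != *this_cpu_ptr(&saved_host_pkru))
		wrpkru(*this_cpu_ptr(&saved_host_pkru));
	...
	copy_*_user_nmi(... whatever ...);
	...
	if (*this_cpu_ptr(&need_nmi_wpkru))
		wrpkru(pkru);	/* put back whatever was live, guest or host */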
On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 10:49 AM, Andy Lutomirski wrote:
>>> The least awful solution would be to have the NMI handler restore
>>> the host's PKRU.  The NMI handler would need to save/restore the
>>> register, a la CR2, but the whole thing could be optimized to run
>>> if and only if the NMI lands in the window where the guest's PKRU
>>> is loaded.
>>
>> Or set a flag causing nmi_uaccess_ok() to return false.
>
> Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
> essentially a context switch.

I like that idea, too.

The flag might also be useful to fix the issue where the NMI handler
activates PEBS after KVM disables it.  Jim?

> Will widening the window where nmi_uaccess_okay()==false anger any of
> the perf folks?  It looks like perf knows how to handle it nicely.
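A sketch of the flag idea (illustrative only; "kvm_xsave_switching" is an
invented name, while nmi_uaccess_okay() is the existing helper in
arch/x86/mm/tlb.c):

	DEFINE_PER_CPU(bool, kvm_xsave_switching);

	/* KVM, around the window where the guest PKRU may be live: */
	this_cpu_write(kvm_xsave_switching, true);
	kvm_load_guest_xsave_state(vcpu);
	/* ... VM-entry / VM-exit ... */
	kvm_load_host_xsave_state(vcpu);
	this_cpu_write(kvm_xsave_switching, false);

	/* In nmi_uaccess_okay(): refuse uaccess during that window. */
	if (this_cpu_read(kvm_xsave_switching))
		return false;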
On Mon, May 17, 2021 at 11:05 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 17, 2021, Dave Hansen wrote:
> > On 5/17/21 10:49 AM, Andy Lutomirski wrote:
> > >> The least awful solution would be to have the NMI handler restore
> > >> the host's PKRU.  The NMI handler would need to save/restore the
> > >> register, a la CR2, but the whole thing could be optimized to run
> > >> if and only if the NMI lands in the window where the guest's PKRU
> > >> is loaded.
> > >
> > > Or set a flag causing nmi_uaccess_ok() to return false.
> >
> > Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
> > essentially a context switch.
>
> I like that idea, too.
>
> The flag might also be useful to fix the issue where the NMI handler
> activates PEBS after KVM disables it.  Jim?

The issue is actually that the NMI handler *clears* IA32_PEBS_ENABLE
bits after giving out the host value of the MSR to KVM. If we were to
block the NMI handler from modifying IA32_PEBS_ENABLE until after the
next VM-exit, that could solve this issue. I don't know if it makes
sense to piggyback on nmi_uaccess(), though.

> > Will widening the window where nmi_uaccess_okay()==false anger any of
> > the perf folks?  It looks like perf knows how to handle it nicely.
On Mon, May 17, 2021, Jim Mattson wrote:
> On Mon, May 17, 2021 at 11:05 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, May 17, 2021, Dave Hansen wrote:
> > > On 5/17/21 10:49 AM, Andy Lutomirski wrote:
> > > >> The least awful solution would be to have the NMI handler restore
> > > >> the host's PKRU.  The NMI handler would need to save/restore the
> > > >> register, a la CR2, but the whole thing could be optimized to run
> > > >> if and only if the NMI lands in the window where the guest's PKRU
> > > >> is loaded.
> > > >
> > > > Or set a flag causing nmi_uaccess_ok() to return false.
> > >
> > > Oh, that doesn't sound too bad.  The VMENTER/EXIT paths are also
> > > essentially a context switch.
> >
> > I like that idea, too.
> >
> > The flag might also be useful to fix the issue where the NMI handler
> > activates PEBS after KVM disables it.  Jim?
>
> The issue is actually that the NMI handler *clears* IA32_PEBS_ENABLE
> bits after giving out the host value of the MSR to KVM. If we were to
> block the NMI handler from modifying IA32_PEBS_ENABLE until after the
> next VM-exit, that could solve this issue. I don't know if it makes
> sense to piggyback on nmi_uaccess(), though.

I wasn't thinking about using nmi_uaccess_okay(), but rather whatever
flag is added, so that KVM can inform the NMI handler that KVM is in the
middle of its version of a context switch.

> > > Will widening the window where nmi_uaccess_okay()==false anger any of
> > > the perf folks?  It looks like perf knows how to handle it nicely.
On 5/17/21 12:46 AM, Paolo Bonzini wrote:
> On 14/05/21 07:11, Andy Lutomirski wrote:
>> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
>> that bit is live, and the only way to restore it to 0 is with XRSTOR(S).
>
> The manual says "It is possible for XINUSE[i] to be 1 even when state
> component i is in its initial configuration", so this is architecturally
> valid.  Does the XINUSE optimization matter for PKRU, which is a single
> word?

In Linux with normal userspace, virtually never.

The hardware defaults PKRU to 0x0, which means "no restrictions on any
keys".  Linux defaults PKRU via 'init_pkru_value' to the most
restrictive value.  This ensures that new non-zero-pkey-assigned memory
is protected by default.

But that also means PKRU is virtually never in its init state in Linux.
An app would probably need to manipulate PKRU with XRSTOR to get
XINUSE[PKRU]=0.

It would only even *possibly* be useful if running a KVM guest that had
PKRU=0x0 (sorry, I don't consider things using KVM "normal userspace" :P).
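Concretely (illustrative values; PKRU holds 2 bits per pkey, AD =
access-disable, WD = write-disable, and the Linux constant is my reading
of arch/x86/mm/pkeys.c around this time, so treat it as an assumption):

	PKRU at hardware reset:	0x00000000	/* all 16 keys permissive */
	Linux init_pkru_value:	0x55555554	/* AD for keys 1-15, key 0 open */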
On Wed, May 19, 2021, at 3:44 PM, Dave Hansen wrote:
> On 5/17/21 12:46 AM, Paolo Bonzini wrote:
>> On 14/05/21 07:11, Andy Lutomirski wrote:
>>> That's nice, but it fails to restore XINUSE[PKRU].  As far as I know,
>>> that bit is live, and the only way to restore it to 0 is with XRSTOR(S).
>>
>> The manual says "It is possible for XINUSE[i] to be 1 even when state
>> component i is in its initial configuration", so this is architecturally
>> valid.  Does the XINUSE optimization matter for PKRU, which is a single
>> word?
>
> In Linux with normal userspace, virtually never.
>
> The hardware defaults PKRU to 0x0, which means "no restrictions on any
> keys".  Linux defaults PKRU via 'init_pkru_value' to the most
> restrictive value.  This ensures that new non-zero-pkey-assigned memory
> is protected by default.
>
> But that also means PKRU is virtually never in its init state in Linux.
> An app would probably need to manipulate PKRU with XRSTOR to get
> XINUSE[PKRU]=0.
>
> It would only even *possibly* be useful if running a KVM guest that had
> PKRU=0x0 (sorry, I don't consider things using KVM "normal userspace" :P).

There was at least one report from the rr camp of glibc behaving
differently depending on the result of XGETBV(1).  It's at least
impolite to change the XINUSE register for a guest behind its back.
Admittedly that particular report wasn't about PKRU, and *Linux* guests
won't run with XINUSE[PKRU]=0 under normal circumstances, but non-Linux
guests could certainly do so.
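For reference, XGETBV with ECX=1 returns XCR0 & XINUSE on CPUs that
enumerate the XGETBV1 feature; a minimal userspace sketch of such a
check (illustrative, not from the thread) would be:

	static inline unsigned long long xgetbv(unsigned int index)
	{
		unsigned int eax, edx;

		/* XGETBV: EDX:EAX := XCR[ECX]; ECX=1 reads XCR0 & XINUSE. */
		asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (index));
		return eax | ((unsigned long long)edx << 32);
	}

	/* PKRU is XSAVE state component 9. */
	int pkru_in_init_state = !(xgetbv(1) & (1ULL << 9));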
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 8d33ad80704f..0860bbadb422 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -583,7 +583,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 		if (pk)
 			pkru_val = pk->pkru;
 	}
-	__write_pkru(pkru_val);
+	__write_pkru(pkru_val, false);
 
 	/*
 	 * Expensive PASID MSR write will be avoided in update_pasid() because
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..8c8d4796b864 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -151,7 +151,7 @@ static inline void write_pkru(u32 pkru)
 	fpregs_lock();
 	if (pk)
 		pk->pkru = pkru;
-	__write_pkru(pkru);
+	__write_pkru(pkru, false);
 	fpregs_unlock();
 }
 
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2acd6cb62328..f4ac8ec3f096 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -104,13 +104,15 @@ static inline void wrpkru(u32 pkru)
 		     : : "a" (pkru), "c"(ecx), "d"(edx));
 }
 
-static inline void __write_pkru(u32 pkru)
+static inline void __write_pkru(u32 pkru, bool skip_comparison)
 {
 	/*
 	 * WRPKRU is relatively expensive compared to RDPKRU.
-	 * Avoid WRPKRU when it would not change the value.
+	 * Avoid WRPKRU when it would not change the value,
+	 * except when the caller has already done the
+	 * comparison, in which case skip redundant RDPKRU.
 	 */
-	if (pkru == rdpkru())
+	if (!skip_comparison && pkru == rdpkru())
 		return;
 
 	wrpkru(pkru);
@@ -122,7 +124,7 @@ static inline u32 rdpkru(void)
 	return 0;
 }
 
-static inline void __write_pkru(u32 pkru)
+static inline void __write_pkru(u32 pkru, bool skip_comparison)
 {
 }
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cebdaa1e3cf5..cd95adbd140c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -912,10 +912,10 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 	}
 
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
-	    vcpu->arch.pkru != vcpu->arch.host_pkru)
-		__write_pkru(vcpu->arch.pkru);
+	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
+		__write_pkru(vcpu->arch.pkru, false);
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
@@ -925,11 +925,11 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 		return;
 
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
-			__write_pkru(vcpu->arch.host_pkru);
+			__write_pkru(vcpu->arch.host_pkru, true);
 	}
 
 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
kvm_load_host_xsave_state handles xsave on vm exit, part of which is
managing memory protection key state. The latest arch.pkru is updated
with a rdpkru, and if that doesn't match the base host_pkru (which it
doesn't, about 70% of the time), we issue a __write_pkru.

__write_pkru issues another rdpkru internally to try to avoid the
wrpkru, so we're reading the same value back to back when we know, 100%
of the time, that we need to go directly to wrpkru. This is a guaranteed
branch miss and extra work that can be skipped.

To improve performance, add a hint to __write_pkru so that callers can
specify if they've already done the check themselves. The compiler seems
to filter this branch out completely when the hint is true, so
incrementally tighter code is generated as well.

While we're in this section of code, optimize the if-condition ordering
ahead of __write_pkru in both kvm_load_{guest|host}_xsave_state.

For both functions, flip the ordering of the || condition so that
arch.xcr0 & XFEATURE_MASK_PKRU is checked first, which, when instrumented
in our environment, appeared to be always true and is less overall work
than kvm_read_cr4_bits.

For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead one
position. When instrumented, I saw this be true roughly ~70% of the time,
vs the other conditions, which were almost always true. With this change,
we will avoid the 3rd condition check ~30% of the time.

Cc: Babu Moger <babu.moger@amd.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 arch/x86/include/asm/fpu/internal.h  |  2 +-
 arch/x86/include/asm/pgtable.h       |  2 +-
 arch/x86/include/asm/special_insns.h | 10 ++++++----
 arch/x86/kvm/x86.c                   | 14 +++++++-------
 4 files changed, 15 insertions(+), 13 deletions(-)

--
2.30.1 (Apple Git-130)