Message ID | 20220324004439.6709-1-jon@nutanix.com (mailing list archive)
State      | New, archived
Series     | KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state
Queued, thanks. Paolo

On Wed, Mar 23, 2022, Jon Kohler wrote:
> kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
> part of which is managing memory protection key state. The latest
> arch.pkru is updated with a rdpkru, and if that doesn't match the base
> host_pkru (which is the case about 70% of the time), we issue a __write_pkru.
>
> To improve performance, implement the following optimizations:
> 1. Reorder if conditions prior to wrpkru in both
>    kvm_load_{guest|host}_xsave_state.
>
>    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>    checked first, which when instrumented in our environment appeared
>    to be always true and less overall work than kvm_read_cr4_bits.

If it's always true, then it should be checked last, not first. And if
kvm_read_cr4_bits() is more expensive then we should address that issue, not
try to micro-optimize this stuff. X86_CR4_PKE can't be guest-owned, and so
kvm_read_cr4_bits() should be optimized down to:

	return vcpu->arch.cr4 & X86_CR4_PKE;

If the compiler isn't smart enough to do that on its own then we should rework
kvm_read_cr4_bits() as that will benefit multiple code paths.

>    For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
>    one position. When instrumented, I saw this be true roughly ~70% of
>    the time vs the other conditions which were almost always true.
>    With this change, we will avoid 3rd condition check ~30% of the time.

If the guest uses PKRU... If PKRU is used by the host but not the guest, the
early comparison is pure overhead because it will almost always be true (guest
will be zero, host will be non-zero).

> 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
>    as if the user compiles out this feature, we should not have
>    these branches at all.

Not that it really matters, since static_cpu_has() will patch out all the branches,
and in practice who cares about a JMP or NOP(s)? But...

#ifdeffery is the wrong way to handle this. Replace static_cpu_has() with
cpu_feature_enabled(); that'll boil down to a '0' and omit all the code when
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n, without the #ifdef ugliness.
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
>  arch/x86/kvm/x86.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6db3a506b402..2b00123a5d50 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -950,11 +950,13 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  			wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
>  	}
>
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>  		write_pkru(vcpu->arch.pkru);
> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>
> @@ -963,13 +965,15 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  	if (vcpu->arch.guest_state_protected)
>  		return;
>
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>  		vcpu->arch.pkru = rdpkru();
>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>  			write_pkru(vcpu->arch.host_pkru);
>  	}
> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>
>  	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
>
> --
> 2.30.1 (Apple Git-130)
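
For illustration, a rough sketch of how the guest-entry PKU block could look
with Sean's cpu_feature_enabled() suggestion applied; this is only a sketch of
the suggestion, not the v2 patch that was eventually posted:

	/*
	 * cpu_feature_enabled() compiles to a constant 0 when
	 * CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n, so the compiler drops
	 * the whole block without any #ifdef; on PKU-enabled kernels it is
	 * patched at boot just like static_cpu_has().
	 */
	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
		write_pkru(vcpu->arch.pkru);

The host-restore side in kvm_load_host_xsave_state() would change the same
way: swap static_cpu_has() for cpu_feature_enabled() and drop the #ifdef pair.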

> On Mar 25, 2022, at 8:15 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 23, 2022, Jon Kohler wrote:
>> kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
>> part of which is managing memory protection key state. The latest
>> arch.pkru is updated with a rdpkru, and if that doesn't match the base
>> host_pkru (which is the case about 70% of the time), we issue a __write_pkru.
>>
>> To improve performance, implement the following optimizations:
>> 1. Reorder if conditions prior to wrpkru in both
>>    kvm_load_{guest|host}_xsave_state.
>>
>>    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>>    checked first, which when instrumented in our environment appeared
>>    to be always true and less overall work than kvm_read_cr4_bits.
>
> If it's always true, then it should be checked last, not first. And if

Sean, thanks for the review. This would be a left handed || short circuit, so
wouldn’t we want always true to be first?

> kvm_read_cr4_bits() is more expensive then we should address that issue, not
> try to micro-optimize this stuff. X86_CR4_PKE can't be guest-owned, and so
> kvm_read_cr4_bits() should be optimized down to:
>
> 	return vcpu->arch.cr4 & X86_CR4_PKE;

Ok good tip, thank you. I’ll dig into that a bit more and see what I can find.
To be fair, kvm_read_cr4_bits isn’t super expensive, it’s just more expensive
than the code I proposed. That said, I’ll see what I can do to simplify this.

>
> If the compiler isn't smart enough to do that on its own then we should rework
> kvm_read_cr4_bits() as that will benefit multiple code paths.
>
>> For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
>> one position. When instrumented, I saw this be true roughly ~70% of
>> the time vs the other conditions which were almost always true.
>> With this change, we will avoid 3rd condition check ~30% of the time.
>
> If the guest uses PKRU... If PKRU is used by the host but not the guest, the
> early comparison is pure overhead because it will almost always be true (guest
> will be zero, host will be non-zero).

In a multi-tenant environment with a variety of hosts and customer-controlled
guests, we’ve seen this be a mixed bag. I’d be ok moving this back to the way
it is currently, I can do that on the next revision to simplify this.

>
>> 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
>>    as if the user compiles out this feature, we should not have
>>    these branches at all.
>
> Not that it really matters, since static_cpu_has() will patch out all the branches,
> and in practice who cares about a JMP or NOP(s)? But...

The reason I’ve been pursuing this is that the guest+host xsave adds up to
a bit over ~1% as measured by perf top in an exit heavy workload. This is
the first in a few patches we’ve drummed up to get it back towards zero.
I’ll send the rest out next week.

>
> #ifdeffery is the wrong way to handle this. Replace static_cpu_has() with
> cpu_feature_enabled(); that'll boil down to a '0' and omit all the code when
> CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n, without the #ifdef ugliness.

Great idea, thanks. I’ll tune this up and send a v2 patch.
>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>>  arch/x86/kvm/x86.c | 14 +++++++++-----
>>  1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6db3a506b402..2b00123a5d50 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -950,11 +950,13 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>  			wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
>>  	}
>>
>> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
>> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>>  		write_pkru(vcpu->arch.pkru);
>> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>>
>> @@ -963,13 +965,15 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>>  	if (vcpu->arch.guest_state_protected)
>>  		return;
>>
>> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>>  	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>>  		vcpu->arch.pkru = rdpkru();
>>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>>  			write_pkru(vcpu->arch.host_pkru);
>>  	}
>> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>>
>>  	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
>>
>> --
>> 2.30.1 (Apple Git-130)
>>
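
To make Sean's "can't be guest-owned" point concrete, this is roughly what the
helper looks like, paraphrased from arch/x86/kvm/kvm_cache_regs.h of that era
(details may differ; check the actual source): only bits in
KVM_POSSIBLE_CR4_GUEST_BITS can be guest-owned, and X86_CR4_PKE is not among
them, so for a constant X86_CR4_PKE mask the vendor callback is statically
dead and the call should fold to a plain vcpu->arch.cr4 & X86_CR4_PKE.

	static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
	{
		/*
		 * Only guest-owned bits can be stale in the cache; with a
		 * constant X86_CR4_PKE mask, tmask is 0, the branch below
		 * disappears, and only the load-and-mask remains.
		 */
		ulong tmask = mask & KVM_POSSIBLE_CR4_GUEST_BITS;

		if (tmask & vcpu->arch.cr4_guest_owned_bits)
			static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR4);

		return vcpu->arch.cr4 & mask;
	}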

On 3/26/22 02:37, Jon Kohler wrote:
>>> Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>>> checked first, which when instrumented in our environment appeared
>>> to be always true and less overall work than kvm_read_cr4_bits.
>>
>> If it's always true, then it should be checked last, not first. And if
>
> Sean, thanks for the review. This would be a left handed || short circuit, so
> wouldn’t we want always true to be first?

Yes.

>> Not that it really matters, since static_cpu_has() will patch out all the branches,
>> and in practice who cares about a JMP or NOP(s)? But...
>
> The reason I’ve been pursuing this is that the guest+host xsave adds up to
> a bit over ~1% as measured by perf top in an exit heavy workload. This is
> the first in a few patches we’ve drummed up to get it back towards zero.
> I’ll send the rest out next week.

Can you add a testcase to x86/vmexit.c in kvm-unit-tests, too?

Thanks,

Paolo

> On Mar 27, 2022, at 6:43 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/26/22 02:37, Jon Kohler wrote:
>>>> Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>>>> checked first, which when instrumented in our environment appeared
>>>> to be always true and less overall work than kvm_read_cr4_bits.
>>>
>>> If it's always true, then it should be checked last, not first. And if
>> Sean, thanks for the review. This would be a left handed || short circuit, so
>> wouldn’t we want always true to be first?
>
> Yes.

Ack, thanks.

>
>>> Not that it really matters, since static_cpu_has() will patch out all the branches,
>>> and in practice who cares about a JMP or NOP(s)? But...
>> The reason I’ve been pursuing this is that the guest+host xsave adds up to
>> a bit over ~1% as measured by perf top in an exit heavy workload. This is
>> the first in a few patches we’ve drummed up to get it back towards zero.
>> I’ll send the rest out next week.
>
> Can you add a testcase to x86/vmexit.c in kvm-unit-tests, too?

Sure, I’ll check that out and see what I can do.

Here’s a preview of the larger issue: we’re seeing a regression on SKX/CLX
hosts when supporting live migration from these older hosts to ICX hosts in
the same logical compute cluster. Our control plane automatically masks
features to the lowest common denominator. In such cases, MPX is masked away
from the guests on the older systems, causing the xsave state that the guest
sees to be different from the host's.

When that happens, on each guest-state or host-state load we spend quite a
bit of time on xsetbv. This happens any time the host and guest don't match.
It is even more expensive when the guest sees PKU, as we spend time on xsetbv
for the MPX mismatch and spend time on all the rdpkru/wrpkru stuff.

Turns out, the xsave bringup code only checks running features, as we see the
same behavior if we compile out PKU too, so the patches we’ve created add a
knob for MPX in disabled-features and add the ability for that early code to
respect the disabled features mask. That way it's easy to make the host and
guest match without doing BIOS level things.

In between those patches and this one, these two code paths are pretty cheap
now. I’ll make sure to copy the list when those patches go out, and have a
detailed cover letter for the issue.

>
> Thanks,
>
> Paolo
>
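
As a stopgap before a proper x86/vmexit.c testcase, one way to eyeball the
cost being discussed is a tiny guest-side userspace loop: dirty PKRU via a
protection key so the guest value likely differs from the host's, then time
CPUID, which always causes a VM exit and therefore crosses
kvm_load_{guest|host}_xsave_state on every iteration. This is only a
hypothetical sketch, not part of kvm-unit-tests or of this series, and it
assumes a guest with PKU exposed plus glibc's pkey_* wrappers.

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>     /* pkey_alloc(), pkey_free(), PKEY_DISABLE_ACCESS */
	#include <x86intrin.h>    /* __rdtsc() */
	#include <cpuid.h>        /* __get_cpuid() */

	int main(void)
	{
		/* Dirty PKRU so the guest value likely differs from the host's. */
		int pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
		if (pkey < 0) {
			perror("pkey_alloc (PKU not exposed to this guest?)");
			return 1;
		}

		const unsigned long iters = 1000000;
		unsigned int eax, ebx, ecx, edx;
		unsigned long long start = __rdtsc();

		for (unsigned long i = 0; i < iters; i++)
			__get_cpuid(0, &eax, &ebx, &ecx, &edx);  /* CPUID always exits */

		printf("%.1f cycles per CPUID round trip\n",
		       (double)(__rdtsc() - start) / iters);

		pkey_free(pkey);
		return 0;
	}

Comparing the per-exit cost with and without the pkey_alloc() call, and across
kernels with and without the patch, gives a rough read on the PKU overhead.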

On Mon, Mar 28, 2022, Jon Kohler wrote:
>
>
> > On Mar 27, 2022, at 6:43 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 3/26/22 02:37, Jon Kohler wrote:
> >>>> Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
> >>>> checked first, which when instrumented in our environment appeared
> >>>> to be always true and less overall work than kvm_read_cr4_bits.
> >>>
> >>> If it's always true, then it should be checked last, not first. And if
> >> Sean, thanks for the review. This would be a left handed || short circuit, so
> >> wouldn’t we want always true to be first?
> >
> > Yes.
>
> Ack, thanks.

Yeah, I lost track of whether it was a || or &&.
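
For the record, the ordering rule settled here is just C's left-to-right
short-circuit evaluation. A toy standalone example, purely illustrative:

	#include <stdio.h>

	static int right_side_calls;

	static int usually_redundant_check(void)
	{
		right_side_calls++;   /* count how often the right-hand operand runs */
		return 1;
	}

	int main(void)
	{
		int taken = 0;

		/* ||: left operand true ~90% of the time, so the right side is skipped then. */
		right_side_calls = 0;
		for (int i = 0; i < 100; i++)
			if (i % 10 != 0 || usually_redundant_check())
				taken++;
		printf("||: branch taken %d times, right side evaluated %d times\n",
		       taken, right_side_calls);

		/* &&: left operand false ~90% of the time, so the right side is skipped then. */
		taken = 0;
		right_side_calls = 0;
		for (int i = 0; i < 100; i++)
			if (i % 10 == 0 && usually_redundant_check())
				taken++;
		printf("&&: branch taken %d times, right side evaluated %d times\n",
		       taken, right_side_calls);

		return 0;
	}

So for ||, the cheap and usually-true check goes first; for &&, the cheap and
usually-false check goes first.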

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6db3a506b402..2b00123a5d50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -950,11 +950,13 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 			wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
 	}
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
-	    vcpu->arch.pkru != vcpu->arch.host_pkru)
+	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
 		write_pkru(vcpu->arch.pkru);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
 
@@ -963,13 +965,15 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_state_protected)
 		return;
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
 			write_pkru(vcpu->arch.host_pkru);
 	}
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {

kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
part of which is managing memory protection key state. The latest
arch.pkru is updated with a rdpkru, and if that doesn't match the base
host_pkru (which is the case about 70% of the time), we issue a
__write_pkru.

To improve performance, implement the following optimizations:
1. Reorder if conditions prior to wrpkru in both
   kvm_load_{guest|host}_xsave_state.

   Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
   checked first, which when instrumented in our environment appeared
   to be always true and less overall work than kvm_read_cr4_bits.

   For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
   one position. When instrumented, I saw this be true roughly ~70% of
   the time vs the other conditions which were almost always true.
   With this change, we will avoid the 3rd condition check ~30% of the
   time.

2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
   as if the user compiles out this feature, we should not have
   these branches at all.

Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 arch/x86/kvm/x86.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

--
2.30.1 (Apple Git-130)