
KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state

Message ID 20220324004439.6709-1-jon@nutanix.com (mailing list archive)
State New, archived
Series KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state

Commit Message

Jon Kohler March 24, 2022, 12:44 a.m. UTC
kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
part of which is managing memory protection key state. The latest
arch.pkru is updated with a rdpkru, and if that doesn't match the base
host_pkru (which happens about 70% of the time), we issue a __write_pkru.

To improve performance, implement the following optimizations:
 1. Reorder if conditions prior to wrpkru in both
    kvm_load_{guest|host}_xsave_state.

    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
    checked first, which when instrumented in our environment appeared
    to be always true and is less overall work than kvm_read_cr4_bits.

    For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
    one position. When instrumented, I saw this be true roughly ~70% of
    the time, vs the other conditions, which were almost always true.
    With this change, we will avoid the 3rd condition check ~30% of the time.

 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS, so
    that if the user compiles out this feature, these branches are not
    present at all.

Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 arch/x86/kvm/x86.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

--
2.30.1 (Apple Git-130)

Comments

Paolo Bonzini March 25, 2022, 11:01 a.m. UTC | #1
Queued, thanks.

Paolo
Sean Christopherson March 26, 2022, 12:15 a.m. UTC | #2
On Wed, Mar 23, 2022, Jon Kohler wrote:
> kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
> part of which is managing memory protection key state. The latest
> arch.pkru is updated with a rdpkru, and if that doesn't match the base
> host_pkru (which about 70% of the time), we issue a __write_pkru.
> 
> To improve performance, implement the following optimizations:
>  1. Reorder if conditions prior to wrpkru in both
>     kvm_load_{guest|host}_xsave_state.
> 
>     Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>     checked first, which when instrumented in our environment appeared
>     to be always true and less overall work than kvm_read_cr4_bits.

If it's always true, then it should be checked last, not first.  And if
kvm_read_cr4_bits() is more expensive, then we should address that issue, not
try to micro-optimize this stuff.  X86_CR4_PKE can't be guest-owned, and so
kvm_read_cr4_bits() should be optimized down to:

	return vcpu->arch.cr4 & X86_CR4_PKE;

If the compiler isn't smart enough to do that on its own then we should rework
kvm_read_cr4_bits() as that will benefit multiple code paths.
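
For reference, kvm_read_cr4_bits() in arch/x86/kvm/kvm_cache_regs.h is
roughly the following -- a simplified sketch, not the verbatim helper
(the real one also consults the register-availability cache before
falling back to the vendor hook):

	static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
	{
		ulong tmask = mask & KVM_POSSIBLE_CR4_GUEST_BITS;

		/*
		 * Only guest-owned bits can be stale in the cache.  X86_CR4_PKE
		 * is not in KVM_POSSIBLE_CR4_GUEST_BITS, so for a constant
		 * X86_CR4_PKE mask the compiler can prove tmask == 0, drop the
		 * branch, and reduce the call to vcpu->arch.cr4 & X86_CR4_PKE.
		 */
		if (tmask & vcpu->arch.cr4_guest_owned_bits)
			static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR4);

		return vcpu->arch.cr4 & mask;
	}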

>     For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
>     one position. When instrumented, I saw this be true roughly ~70% of
>     the time vs the other conditions which were almost always true.
>     With this change, we will avoid 3rd condition check ~30% of the time.

If the guest uses PKRU...  If PKRU is used by the host but not the guest, the
early comparison is pure overhead because it will almost always be true (guest
will be zero, host will be non-zero).

>  2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
>     as if the user compiles out this feature, we should not have
>     these branches at all.

Not that it really matters, since static_cpu_has() will patch out all the branches,
and in practice who cares about a JMP or NOP(s)?  But...

#ifdeffery is the wrong way to handle this.  Replace static_cpu_has() with
cpu_feature_enabled(); that'll boil down to a '0' and omit all the code when
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n, without the #ifdef ugliness.
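
Concretely, the guest-load check would become something like the sketch
below (leaving aside the CR4/XCR0 ordering question discussed above).
cpu_feature_enabled(X86_FEATURE_PKU) compiles to a constant 0 when
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n, because PKU is in the
disabled-features mask, so the whole block is dead-code eliminated:

	/* Sketch of the suggested form; no #ifdef needed. */
	if (cpu_feature_enabled(X86_FEATURE_PKU) &&
	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
		write_pkru(vcpu->arch.pkru);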

> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
>  arch/x86/kvm/x86.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6db3a506b402..2b00123a5d50 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -950,11 +950,13 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>  			wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
>  	}
> 
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>  		write_pkru(vcpu->arch.pkru);
> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>  }
>  EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
> 
> @@ -963,13 +965,15 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>  	if (vcpu->arch.guest_state_protected)
>  		return;
> 
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  	if (static_cpu_has(X86_FEATURE_PKU) &&
> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>  		vcpu->arch.pkru = rdpkru();
>  		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>  			write_pkru(vcpu->arch.host_pkru);
>  	}
> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
> 
>  	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
> 
> --
> 2.30.1 (Apple Git-130)
>
Jon Kohler March 26, 2022, 1:37 a.m. UTC | #3
> On Mar 25, 2022, at 8:15 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Wed, Mar 23, 2022, Jon Kohler wrote:
>> kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
>> part of which is managing memory protection key state. The latest
>> arch.pkru is updated with a rdpkru, and if that doesn't match the base
>> host_pkru (which about 70% of the time), we issue a __write_pkru.
>> 
>> To improve performance, implement the following optimizations:
>> 1. Reorder if conditions prior to wrpkru in both
>>    kvm_load_{guest|host}_xsave_state.
>> 
>>    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>>    checked first, which when instrumented in our environment appeared
>>    to be always true and less overall work than kvm_read_cr4_bits.
> 
> If it's always true, then it should be checked last, not first.  And if

Sean, thanks for the review. This would be a left-handed || short circuit, so
wouldn’t we want always true to be first?

> kvm_read_cr4_bits() is more expensive then we should address that issue, not
> try to micro-optimize this stuff.  X86_CR4_PKE can't be guest-owned, and so
> kvm_read_cr4_bits() should be optimized down to:
> 
> 	return vcpu->arch.cr4 & X86_CR4_PKE;

Ok, good tip, thank you. I’ll dig into that a bit more and see what I can find. To be
fair, kvm_read_cr4_bits isn’t super expensive; it’s just more expensive than the
code I proposed. That said, I’ll see what I can do to simplify this.

> 
> If the compiler isn't smart enough to do that on its own then we should rework
> kvm_read_cr4_bits() as that will benefit multiple code paths.
> 
>>    For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
>>    one position. When instrumented, I saw this be true roughly ~70% of
>>    the time vs the other conditions which were almost always true.
>>    With this change, we will avoid 3rd condition check ~30% of the time.
> 
> If the guest uses PKRU...  If PKRU is used by the host but not the guest, the
> early comparison is pure overhead because it will almost always be true (guest
> will be zero, host will non-zero), 

In a multi-tenant environment with a variety of hosts and customer-controlled
guests, we’ve seen this be a mixed bag. I’d be ok moving this back to the way
it is currently; I can do that on the next revision to simplify this.

> 
>> 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
>>    as if the user compiles out this feature, we should not have
>>    these branches at all.
> 
> Not that it really matters, since static_cpu_has() will patch out all the branches,
> and in practice who cares about a JMP or NOP(s)?  But...

The reason I’ve been pursuing this is that the guest+host xsave adds up to
a bit over ~1% as measured by perf top in an exit-heavy workload. This is
the first in a few patches we’ve drummed up to get it back towards zero.
I’ll send the rest out next week.

> 
> #ifdeffery is the wrong way to handle this.  Replace static_cpu_has() with
> cpu_feature_enabled(); that'll boil down to a '0' and omit all the code when
> CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n, without the #ifdef ugliness.

Great idea, thanks. I’ll tune this up and send a v2 patch.

> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> arch/x86/kvm/x86.c | 14 +++++++++-----
>> 1 file changed, 9 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6db3a506b402..2b00123a5d50 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -950,11 +950,13 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>> 			wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
>> 	}
>> 
>> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>> 	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
>> -	    vcpu->arch.pkru != vcpu->arch.host_pkru)
>> +	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
>> 		write_pkru(vcpu->arch.pkru);
>> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>> }
>> EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
>> 
>> @@ -963,13 +965,15 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
>> 	if (vcpu->arch.guest_state_protected)
>> 		return;
>> 
>> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>> 	if (static_cpu_has(X86_FEATURE_PKU) &&
>> -	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> -	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
>> +	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
>> +	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
>> 		vcpu->arch.pkru = rdpkru();
>> 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
>> 			write_pkru(vcpu->arch.host_pkru);
>> 	}
>> +#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
>> 
>> 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
>> 
>> --
>> 2.30.1 (Apple Git-130)
>>
Paolo Bonzini March 27, 2022, 10:43 a.m. UTC | #4
On 3/26/22 02:37, Jon Kohler wrote:
>>>     Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>>>     checked first, which when instrumented in our environment appeared
>>>     to be always true and less overall work than kvm_read_cr4_bits.
>>
>> If it's always true, then it should be checked last, not first.  And if
> 
> Sean thanks for the review. This would be a left handed || short circuit, so
> wouldn’t we want always true to be first?

Yes.

>> Not that it really matters, since static_cpu_has() will patch out all the branches,
>> and in practice who cares about a JMP or NOP(s)?  But...
> 
> The reason I’ve been pursuing this is that the guest+host xsave adds up to
> a bit over ~1% as measured by perf top in an exit heavy workload. This is
> the first in a few patch we’ve drummed up to to get it back towards zero.
> I’ll send the rest out next week.

Can you add a testcase to x86/vmexit.c in kvm-unit-tests, too?

Thanks,

Paolo
Jon Kohler March 28, 2022, 12:53 a.m. UTC | #5
> On Mar 27, 2022, at 6:43 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 3/26/22 02:37, Jon Kohler wrote:
>>>>    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
>>>>    checked first, which when instrumented in our environment appeared
>>>>    to be always true and less overall work than kvm_read_cr4_bits.
>>> 
>>> If it's always true, then it should be checked last, not first.  And if
>> Sean thanks for the review. This would be a left handed || short circuit, so
>> wouldn’t we want always true to be first?
> 
> Yes.

Ack, thanks.

> 
>>> Not that it really matters, since static_cpu_has() will patch out all the branches,
>>> and in practice who cares about a JMP or NOP(s)?  But...
>> The reason I’ve been pursuing this is that the guest+host xsave adds up to
>> a bit over ~1% as measured by perf top in an exit heavy workload. This is
>> the first in a few patch we’ve drummed up to to get it back towards zero.
>> I’ll send the rest out next week.
> 
> Can you add a testcase to x86/vmexit.c in kvm-unit-tests, too?

Sure, I’ll check that out and see what I can do. 
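
Something along these lines could be a starting point for the measurement --
this is not the x86/vmexit.c harness itself, just a standalone guest-userspace
sketch that times CPUID-forced exits under two different guest PKRU values, so
the PKRU save/restore path is exercised on every exit/entry:

	/* Build: gcc -O2 -o pkru_exit_cost pkru_exit_cost.c; run inside a PKU-capable guest. */
	#include <stdio.h>
	#include <stdint.h>
	#include <x86intrin.h>

	static inline void wrpkru(uint32_t val)
	{
		/* WRPKRU requires ECX = EDX = 0; keep key 0 accessible so we keep running. */
		asm volatile(".byte 0x0f,0x01,0xef" :: "a" (val), "c" (0), "d" (0));
	}

	static inline void cpuid_exit(void)
	{
		uint32_t eax = 0, ebx, ecx, edx;

		/* CPUID is always intercepted, so it is a cheap exit generator. */
		asm volatile("cpuid"
			     : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
			     :: "memory");
	}

	static uint64_t cycles_per_exit(int iters)
	{
		uint64_t start = __rdtsc();
		int i;

		for (i = 0; i < iters; i++)
			cpuid_exit();
		return (__rdtsc() - start) / iters;
	}

	int main(void)
	{
		const int iters = 1000000;

		/*
		 * Which of these values mismatches KVM's host_pkru depends on the
		 * host kernel; the point is only to compare the two cases.
		 */
		wrpkru(0);
		printf("cycles/exit, PKRU=0:          %llu\n",
		       (unsigned long long)cycles_per_exit(iters));

		wrpkru(0x55555554);	/* non-zero; Linux's default init PKRU value */
		printf("cycles/exit, PKRU=0x55555554: %llu\n",
		       (unsigned long long)cycles_per_exit(iters));

		wrpkru(0);
		return 0;
	}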

Here’s a preview of the larger issue: we’re seeing a regression on SKX/CLX
hosts when supporting live migration from these older hosts to ICX hosts
in the same logical compute cluster. Our control plane automatically masks
features to the lowest common denominator. In such cases, MPX is masked
away from the guests on the older systems, causing the xsave state that
the guest sees to be different from the host’s. When that happens, on each
load of guest or host state, we spend quite a bit of time on xsetbv. This
happens any time the host and guest don’t match. This is even more
expensive when the guest sees PKU, as we spend time on xsetbv for MPX not
matching and spend time on all the rdpkru/wrpkru stuff.

Turns out, the xsave bringup code only checks running features (we see
the same behavior if we compile out PKU too), so the patches we’ve
created add a knob for MPX in disabled-features and add the ability for
that early code to respect the disabled-features mask. That way it’s easy
to make the host and guest match without doing BIOS-level things.
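
Those patches aren’t posted yet, so purely as an illustration of the shape of
the knob (the CONFIG name below is hypothetical), following the existing
DISABLE_PKU/DISABLE_OSPKE pattern in arch/x86/include/asm/disabled-features.h:

	/* Hypothetical sketch; CONFIG_X86_GUEST_MPX is a made-up name. */
	#ifdef CONFIG_X86_GUEST_MPX
	# define DISABLE_MPX	0
	#else
	# define DISABLE_MPX	(1 << (X86_FEATURE_MPX & 31))
	#endif

	/* ...and OR DISABLE_MPX into the DISABLED_MASK word that holds X86_FEATURE_MPX. */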

In between those patches and this one, these two code paths are pretty
cheap now. I’ll make sure to copy the list when those patches go out, and
have a detailed cover letter for the issue.

> 
> Thanks,
> 
> Paolo
>
Sean Christopherson March 28, 2022, 2:51 p.m. UTC | #6
On Mon, Mar 28, 2022, Jon Kohler wrote:
> 
> 
> > On Mar 27, 2022, at 6:43 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > On 3/26/22 02:37, Jon Kohler wrote:
> >>>>    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
> >>>>    checked first, which when instrumented in our environment appeared
> >>>>    to be always true and less overall work than kvm_read_cr4_bits.
> >>> 
> >>> If it's always true, then it should be checked last, not first.  And if
> >> Sean thanks for the review. This would be a left handed || short circuit, so
> >> wouldn’t we want always true to be first?
> > 
> > Yes.
> 
> Ack, thanks.

Yeah, I lost track of whether it was a || or &&.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6db3a506b402..2b00123a5d50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -950,11 +950,13 @@  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 			wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
 	}

+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
-	    vcpu->arch.pkru != vcpu->arch.host_pkru)
+	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
 		write_pkru(vcpu->arch.pkru);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);

@@ -963,13 +965,15 @@  void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_state_protected)
 		return;

+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
 			write_pkru(vcpu->arch.host_pkru);
 	}
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */

 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {