
[v4] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS

Message ID 20230531144128.73814-1-jon@nutanix.com (mailing list archive)
State New, archived
Series: [v4] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS

Commit Message

Jon Kohler May 31, 2023, 2:41 p.m. UTC
Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
the MSR bitmap. Note: this logic is only for eIBRS, as Intel's guidance
has long been that eIBRS only needs to be set once, so most guests with
eIBRS awareness should behave nicely. We would not want to accidentally
regress misbehaving guests on pre-eIBRS systems, who might be spamming
IBRS MSR without the hypervisor being able to see it today.

eIBRS enabled guests using just IBRS will only write the SPEC_CTRL MSR
once or twice per vCPU at boot, so it is far better to take those
VM exits at boot than to read and save this MSR on every
single VM exit forever. This outcome was suggested in Andrea's commit
2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
however, since interception is still unilaterally disabled, the rdmsr
tax is still there even after that commit.

This is a significant win for eIBRS enabled systems, as this rdmsr
accounts for roughly 50% of the time in vmx_vcpu_run() as observed
by perf top disassembly, and is in the critical path for all
VM-Exits, including fastpath exits.

Opportunistically update comments for both MSR_IA32_SPEC_CTRL and
MSR_IA32_PRED_CMD to make it clear how L1 vs L2 handling works.

Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Waiman Long <longman@redhat.com>
---
v1
 - https://lore.kernel.org/all/20220512174427.3608-1-jon@nutanix.com/
v1 -> v2:
 - https://lore.kernel.org/all/20220520195303.58692-1-jon@nutanix.com/
 - Addressed comments on approach from Sean.
v2 -> v3:
 - https://lore.kernel.org/kvm/20220520204115.67580-1-jon@nutanix.com/
 - Addressed comments on approach from Sean.
v3 -> v4:
 - Fixed inline code comments from Sean.

 arch/x86/kvm/vmx/vmx.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

--
2.30.1 (Apple Git-130)

Comments

Waiman Long May 31, 2023, 5:02 p.m. UTC | #1
On 5/31/23 10:41, Jon Kohler wrote:
> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
> the MSR bitmap. Note: this logic is only for eIBRS, as Intel's guidance
> has long been that eIBRS only needs to be set once, so most guests with
> eIBRS awareness should behave nicely. We would not want to accidentally
> regress misbehaving guests on pre-eIBRS systems, who might be spamming
> IBRS MSR without the hypervisor being able to see it today.
>
> eIBRS enabled guests using just IBRS will only write the SPEC_CTRL MSR
> once or twice per vCPU at boot, so it is far better to take those
> VM exits at boot than to read and save this MSR on every
> single VM exit forever. This outcome was suggested in Andrea's commit
> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
> however, since interception is still unilaterally disabled, the rdmsr
> tax is still there even after that commit.
>
> This is a significant win for eIBRS enabled systems, as this rdmsr
> accounts for roughly 50% of the time in vmx_vcpu_run() as observed
> by perf top disassembly, and is in the critical path for all
> VM-Exits, including fastpath exits.
>
> Opportunistically update comments for both MSR_IA32_SPEC_CTRL and
> MSR_IA32_PRED_CMD to make it clear how L1 vs L2 handling works.
>
> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> ---
> v1
>   - https://lore.kernel.org/all/20220512174427.3608-1-jon@nutanix.com/
> v1 -> v2:
>   - https://lore.kernel.org/all/20220520195303.58692-1-jon@nutanix.com/
>   - Addressed comments on approach from Sean.
> v2 -> v3:
>   - https://lore.kernel.org/kvm/20220520204115.67580-1-jon@nutanix.com/
>   - Addressed comments on approach from Sean.
> v3 -> v4:
>   - Fixed inline code comments from Sean.
>
>   arch/x86/kvm/vmx/vmx.c | 35 ++++++++++++++++++++++++-----------
>   1 file changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 44fb619803b8..5e643ac897bc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2260,20 +2260,33 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   			return 1;
>
>   		vmx->spec_ctrl = data;
> -		if (!data)
> +
> +		/*
> +		 * Disable interception on the first non-zero write, except if
> +		 * eIBRS is advertised to the guest and the guest is enabling
> +		 * _only_ IBRS.  On eIBRS systems, kernels typically set IBRS
> +		 * once at boot and never touch it post-boot.  All other bits,
> +		 * and IBRS on non-eIBRS systems, are often set on a per-task
> +		 * basis, i.e. change frequently, so the benefit of avoiding
> +		 * VM-exits during guest context switches outweighs the cost of
> +		 * RDMSR on every VM-Exit to save the guest's value.
> +		 */
> +		if (!data ||
> +		    (data == SPEC_CTRL_IBRS &&
> +		     (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
>   			break;
>
>   		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging. We update the vmcs01 here for L1 as well
> -		 * since it will end up touching the MSR anyway now.
> +		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
> +		 * interception for the vCPU on the first write regardless of
> +		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
> +		 * combination of vmcs01 and vmcs12 bitmaps, and will be
> +		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
> +		 * nested VM-Enter.  Note, this does mean that future WRMSRs
> +		 * from L2 will be intercepted until the next nested VM-Exit if
> +		 * L2 was the first to write, but L1 exposing the MSR to L2
> +		 * without first writing it is unlikely and not worth the
> +		 * extra bit of complexity.
>   		 */
>   		vmx_disable_intercept_for_msr(vcpu,
>   					      MSR_IA32_SPEC_CTRL,

I have 2 comments.

1) Besides the IBRS, STIBP & SSBD bits, the SPEC_CTRL MSR may have the 
RRSBA_DIS_S bit set in the future. I am not aware of any current Intel 
processors having this capability yet, but a future Intel processor may 
have this and the above patch will have to be modified accordingly. It 
looks like the RRSBA_DIS_S bit will be set once.

2) AMD has their own AutoIBRS capability in Genoa which is similar to 
eIBRS but is not identified by the ARCH_CAP_IBRS_ALL bit. Instead it is 
identified by the AUTOIBRS bit in MSR_EFER. Are you planning to support 
that?
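
If RRSBA_DIS_S does turn out to be another set-once bit, the fixup anticipated in point 1 could be small. Purely as a hypothetical illustration (not part of the patch — the mask and helper below are assumptions; RRSBA_DIS_S is bit 6 of IA32_SPEC_CTRL per Intel's documentation), the single-value comparison could grow into a mask of set-once bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch only.  Bits of IA32_SPEC_CTRL that eIBRS-aware
 * kernels are expected to write once at boot and never touch again.
 */
#define SPEC_CTRL_IBRS          (1ULL << 0)
#define SPEC_CTRL_STIBP         (1ULL << 1)
#define SPEC_CTRL_SSBD          (1ULL << 2)
#define SPEC_CTRL_RRSBA_DIS_S   (1ULL << 6)
#define SPEC_CTRL_SET_ONCE_BITS (SPEC_CTRL_IBRS | SPEC_CTRL_RRSBA_DIS_S)

/*
 * Return true when KVM should keep intercepting SPEC_CTRL writes,
 * i.e. when the write is zero, or the guest sees eIBRS and is setting
 * only set-once bits.  Mirrors the "if (!data || ...)" check in the
 * patch, generalized from one value to a mask.
 */
bool spec_ctrl_keep_intercept(uint64_t data, bool guest_has_eibrs)
{
	if (!data)
		return true;
	return guest_has_eibrs && !(data & ~SPEC_CTRL_SET_ONCE_BITS);
}
```

With such a mask, adding a future set-once bit would be a one-line change rather than a new comparison.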

Cheers,
Longman
Jon Kohler May 31, 2023, 5:13 p.m. UTC | #2
> On May 31, 2023, at 1:02 PM, Waiman Long <longman@redhat.com> wrote:
> 
> On 5/31/23 10:41, Jon Kohler wrote:
>> [...]
> 
> I have 2 comments.
> 
> 1) Besides the IBRS, STIBP & SSBD bits, the SPEC_CTRL MSR may have the RRSBA_DIS_S bit set in the future. I am not aware of any current Intel processors having this capability yet, but a future Intel processor may have this and the above patch will have to be modified accordingly. It looks like the RRSBA_DIS_S bit will be set once.

Agreed. Once that becomes public with future processors, this code can be fixed up in a fairly trivial manner. I don’t have any access to said future processors, so I’d like to keep it as-is today rather than project it out too far. Is that ok?

> 
> 2) AMD has their own AutoIBRS capability in Genoa which is similar to eIBRS but is not identified by the ARCH_CAP_IBRS_ALL bit. Instead it is identified by the AUTOIBRS bit in MSR_EFER. Are you planning to support that?

This is a vmx-only optimization. AMD optimizations can certainly be picked up separately if they suffer from the same interception-type issue that is called out here, but that isn’t something I’m looking at right now.

> 
> Cheers,
> Longman
Waiman Long May 31, 2023, 5:21 p.m. UTC | #3
On 5/31/23 13:13, Jon Kohler wrote:
>
>> On May 31, 2023, at 1:02 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> On 5/31/23 10:41, Jon Kohler wrote:
>>> [...]
>> I have 2 comments.
>>
>> 1) Besides the IBRS, STIBP & SSBD bits, the SPEC_CTRL MSR may have the RRSBA_DIS_S bit set in the future. I am not aware of any current Intel processors having this capability yet, but a future Intel processor may have this and the above patch will have to be modified accordingly. It looks like the RRSBA_DIS_S bit will be set once.
> Agreed. Once that becomes public with future processors, this code can be fixed up in a fairly trivial manner. I don’t have any access to said future processors, so I’d like to keep it as-is today rather than project it out too far. Is that ok?
That is certainly OK. I am just raising a question here.
>> 2) AMD has their own AutoIBRS capability in Genoa which is similar to eIBRS but is not identified by the ARCH_CAP_IBRS_ALL bit. Instead it is identified by the AUTOIBRS bit in MSR_EFER. Are you planning to support that?
> This is a vmx-only optimization. AMD optimizations can certainly be picked up separately if they suffer from the same interception-type issue that is called out here, but that isn’t something I’m looking at right now.

Right, I forgot kvm has a different set of code for AMD CPUs. Since this 
is Intel only, the patch looks sane to me.

Thanks,
Longman
Jim Mattson May 31, 2023, 6:08 p.m. UTC | #4
On Wed, May 31, 2023 at 10:22 AM Waiman Long <longman@redhat.com> wrote:
>
> On 5/31/23 13:13, Jon Kohler wrote:
> >
> >> On May 31, 2023, at 1:02 PM, Waiman Long <longman@redhat.com> wrote:
> >>
> >> On 5/31/23 10:41, Jon Kohler wrote:
> >>> [...]
> >> I have 2 comments.
> >>
> >> 1) Besides the IBRS, STIBP & SSBD bits, the SPEC_CTRL MSR may have the RRSBA_DIS_S bit set in the future. I am not aware of any current Intel processors having this capability yet, but a future Intel processor may have this and the above patch will have to be modified accordingly. It looks like the RRSBA_DIS_S bit will be set once.
> > Agreed. Once that becomes public with future processors, this code can be fixed up in a fairly trivial manner. I don’t have any access to said future processors, so I’d like to keep it as-is today rather than project it out too far. Is that ok?
> That is certainly OK. I am just raising a question here.

How difficult would it be to do a back-of-the-envelope cost/benefit
analysis, rather than relying on heuristics based on today's typical
guest behavior?

Say that it's a minimum of 1000 cycles to intercept this WRMSR. The
tradeoff is the cost of a RDMSR on every VM-exit. How long does a
RDMSR take these days? On the order of 50 cycles? So, if the guest
consistently writes IA32_SPEC_CTRL more often than once every 20
VM-exits, it's better not to intercept it.

Most of the bookkeeping could be handled in the WRMSR(IA32_SPEC_CTRL)
intercept. The only overhead we'd incur on every VM-exit would be the
cost of incrementing a VM-exit counter.

It's a bit more complicated, but it directly addresses the issue, and
it's more future-proof.
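
The bookkeeping described above can be sketched in a few lines. This is a back-of-the-envelope model only — the cycle counts are the illustrative numbers from the email (~1000 cycles per intercepted WRMSR, ~50 cycles per RDMSR), not measurements, and the struct and function names are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative costs from the discussion above, not measured values. */
#define WRMSR_INTERCEPT_CYCLES 1000 /* min cost of one intercepted WRMSR */
#define RDMSR_PER_EXIT_CYCLES    50 /* cost of RDMSR on every VM-exit */

struct spec_ctrl_stats {
	uint64_t exits;       /* VM-exits since interception was enabled */
	uint64_t wrmsr_exits; /* intercepted SPEC_CTRL writes in that window */
};

/*
 * Called from the WRMSR intercept: stop intercepting once the cycles
 * burned on intercepts exceed what per-exit RDMSRs would have cost,
 * i.e. once the guest writes more often than ~once per 20 exits.
 */
bool spec_ctrl_should_pass_through(const struct spec_ctrl_stats *s)
{
	return s->wrmsr_exits * WRMSR_INTERCEPT_CYCLES >
	       s->exits * RDMSR_PER_EXIT_CYCLES;
}
```

The only cost paid on the hot path is incrementing `exits`; the comparison runs only inside the (already expensive) WRMSR intercept.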
Josh Poimboeuf May 31, 2023, 10:48 p.m. UTC | #5
On Wed, May 31, 2023 at 10:41:28AM -0400, Jon Kohler wrote:
> [...]
> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
> Signed-off-by: Jon Kohler <jon@nutanix.com>

Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Sean Christopherson June 1, 2023, 5:43 p.m. UTC | #6
On Wed, May 31, 2023, Jon Kohler wrote:
> 
> > On May 31, 2023, at 2:30 PM, Jim Mattson <jmattson@google.com> wrote:
> > 
> > On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
> >> Yea, I thought about it. On one hand, simplicity is king, and on the other
> >> hand, not having to think about this again is nice too.
> >> 
> >> The challenge in my mind is that on setups where this truly is static, we’d
> >> be taking some incremental amount of memory to keep the counter around,

Not really.  The vCPU structures are already order-2 allocations; increasing the
size by 8-16 bytes doesn't affect the actual memory usage in practice.  Death by
a thousand cuts is a potential problem, but we're a ways away from crossing back
over into order-3 allocations.

> >> just to have the same outcome each time. Doesn’t feel right (to me) unless that is
> >> also used for “other” stuff as some sort of general purpose/common counter.

...

> Yes, there are places this could be stuffed, I’m sure. Still feels a bit
> heavy-handed for the same-outcome-every-time situations though.

There's no guarantee the outcome will be the same.  You're assuming that (a) the
guest is eIBRS aware, (b) SPEC_CTRL doesn't get extended for future mitigations,
and (c) that if L1 is running VMs of its own, that L1 is advertising eIBRS to L2
and that the L2 kernel is also aware of eIBRS.

> >> RE Cost: I can’t put my finger on it, but I swear that RDMSR for *this*
> >> specific MSR is more expensive than any other RDMSR I’ve come across
> >> for run-of-the-mill random MSRs. I flipped thru the SDM and the mitigations
> >> documentation, and it only ever mentions that there is a notable cost to
> >> do WRMSR IA32_SPEC_CTRL, but nothing about the RDMSR side.
> >> 
> >> If anyone happens to know from an Intel-internals perspective, I’d be quite
> >> interested to know why it just “feels” so darn costly. i.e. is the proc also doing
> >> special things under the covers, similar to what the processor does on
> >> writes to this one?
> > 
> > What do you mean by "feels"? Have you measured it?
> 
> There are plenty of rdmsr’s scattered around the entry and exit paths that get
> hit every time, but this is far and away always the most expensive one when
> profiling with perf top. I haven’t measured it separately from the existing code,
> but rather noted during profiling that it appears to be nastier than others.
> 
> I’m more curious than anything else, but it doesn’t matter all that much going
> forward since this commit will nuke it from orbit for the run of the mill 
> eIBRS-only use cases.

As above, you're making multiple assumptions that may or may not hold true.  I
agree with Jim, reacting to what the guest is actually doing is more robust than
assuming the guest will do XYZ based on the vCPU model or some other heuristic.

The code isn't that complex, and KVM can even reuse the number of exits snapshot
to periodically re-enable the intercept, e.g. to avoid unnecessary RDMSRs if the
vCPU stops writing MSR_IA32_SPEC_CTRL for whatever reason.

Needs actual testing and informed magic numbers, but I think this captures the
gist of what Jim is suggesting.

---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/svm/svm.c          | 22 ++++++++--------------
 arch/x86/kvm/vmx/vmx.c          | 28 ++++++++++------------------
 arch/x86/kvm/x86.h              | 24 ++++++++++++++++++++++++
 4 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..3fdb6048cd58 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -966,6 +966,9 @@ struct kvm_vcpu_arch {
 	/* Host CPU on which VM-entry was most recently attempted */
 	int last_vmentry_cpu;
 
+	u32 nr_quick_spec_ctrl_writes;
+	u64 spec_ctrl_nr_exits_snapshot;
+
 	/* AMD MSRC001_0015 Hardware Configuration */
 	u64 msr_hwcr;
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca32389f3c36..f749613204d3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2959,21 +2959,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 			svm->vmcb->save.spec_ctrl = data;
 		else
 			svm->spec_ctrl = data;
-		if (!data)
-			break;
 
-		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_svm_vmrun_msrpm.
-		 * We update the L1 MSR bit as well since it will end up
-		 * touching the MSR anyway now.
-		 */
-		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+		if (!msr->host_initiated &&
+		    kvm_account_msr_spec_ctrl_write(vcpu))
+			set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 		break;
 	case MSR_AMD64_VIRT_SPEC_CTRL:
 		if (!msr->host_initiated &&
@@ -4158,6 +4147,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	svm_complete_interrupts(vcpu);
 
+	if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
+	    !spec_ctrl_intercepted &&
+	    kvm_account_msr_spec_ctrl_passthrough(vcpu))
+		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);
+
 	if (is_guest_mode(vcpu))
 		return EXIT_FASTPATH_NONE;
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..4f4a2c3507bc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2260,24 +2260,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 
 		vmx->spec_ctrl = data;
-		if (!data)
-			break;
 
-		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_vmx_prepare_msr_bitmap. We should not touch the
-		 * vmcs02.msr_bitmap here since it gets completely overwritten
-		 * in the merging. We update the vmcs01 here for L1 as well
-		 * since it will end up touching the MSR anyway now.
-		 */
-		vmx_disable_intercept_for_msr(vcpu,
-					      MSR_IA32_SPEC_CTRL,
-					      MSR_TYPE_RW);
+		if (!msr_info->host_initiated &&
+		    kvm_account_msr_spec_ctrl_write(vcpu))
+			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL,
+						      MSR_TYPE_RW);
 		break;
 	case MSR_IA32_TSX_CTRL:
 		if (!msr_info->host_initiated &&
@@ -7192,6 +7179,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned int run_flags = __vmx_vcpu_run_flags(vmx);
 	unsigned long cr3, cr4;
 
 	/* Record the guest's net vcpu time for enforced NMI injections. */
@@ -7280,7 +7268,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	kvm_wait_lapic_expire(vcpu);
 
 	/* The actual VMENTER/EXIT is in the .noinstr.text section. */
-	vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx));
+	vmx_vcpu_enter_exit(vcpu, run_flags);
 
 	/* All fields are clean at this point */
 	if (kvm_is_using_evmcs()) {
@@ -7346,6 +7334,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	vmx_recover_nmi_blocking(vmx);
 	vmx_complete_interrupts(vmx);
 
+	if ((run_flags & VMX_RUN_SAVE_SPEC_CTRL) &&
+	    kvm_account_msr_spec_ctrl_passthrough(vcpu))
+		vmx_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
+
 	if (is_guest_mode(vcpu))
 		return EXIT_FASTPATH_NONE;
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..454bcbf5b543 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)
 
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
+
 int kvm_spec_ctrl_test_value(u64 value);
+
+static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
+{
+	if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
+		vcpu->arch.nr_quick_spec_ctrl_writes++;
+	else
+		vcpu->arch.nr_quick_spec_ctrl_writes = 0;
+
+	vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
+
+	return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;
+}
+
+static inline bool kvm_account_msr_spec_ctrl_passthrough(struct kvm_vcpu *vcpu)
+{
+	if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 100000)
+		return false;
+
+	vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
+	vcpu->arch.nr_quick_spec_ctrl_writes = 0;
+	return true;
+}
+
 bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
 			      struct x86_exception *e);

base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
--
Jon Kohler June 1, 2023, 7:28 p.m. UTC | #7
> On Jun 1, 2023, at 1:43 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Wed, May 31, 2023, Jon Kohler wrote:
>> 
>>> On May 31, 2023, at 2:30 PM, Jim Mattson <jmattson@google.com> wrote:
>>> 
>>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
>>>> Yea, I thought about it. On one hand, simplicity is king, and on the other
>>>> hand, not having to think about this again is nice too.
>>>> 
>>>> The challenge in my mind is that on setups where this truly is static, we’d
>>>> be taking some incremental amount of memory to keep the counter around,
> 
> Not really.  The vCPU structures are already order-2 allocations, increasing the
> size by 8-16 bytes doesn't affect the actual memory usage in practice.  Death by
> a thousand cuts is a potential problem, but we're a ways away from crossing back
> over into order-3 allocations.
> 
>>>> just to have the same outcome each time. Doesn’t feel right (to me) unless that is
>>>> also used for “other” stuff as some sort of general purpose/common counter.
> 
> ...
> 
>> Yes, there are places this could be stuffed, I’m sure. Still feels a bit heavy
>> handed for the same-outcome-every-time situations though.
> 
> There's no guarantee the outcome will be the same.  You're assuming that (a) the
> guest is eIBRS aware, (b) SPEC_CTRL doesn't get extended for future mitigations,
> and (c) that if L1 is running VMs of its own, that L1 is advertising eIBRS to L2
> and that the L2 kernel is also aware of eIBRS.
> 
>>>> RE Cost: I can’t put my finger on it, but I swear that RDMSR for *this*
>>>> specific MSR is more expensive than any other RDMSR I’ve come across
>>>> for run-of-the-mill random MSRs. I flipped thru the SDM and the mitigations
>>>> documentation, and it only ever mentions that there is a notable cost to
>>>> do WRMSR IA32_SPEC_CTRL, but nothing about the RDMSR side.
>>>> 
>>>> If anyone happens to know from an Intel-internals perspective, I’d be quite
>>>> interested to know why it just “feels” so darn costly. i.e. is the proc also doing
>>>> special things under the covers, similar to what the processor does on
>>>> writes to this one?
>>> 
>>> What do you mean by "feels"? Have you measured it?
>> 
>> There are plenty of rdmsr’s scattered around the entry and exit paths that get
>> hit every time, but this is far and away always the most expensive one when
>> profiling with perf top. I haven’t measured it separately from the existing code,
>> but rather noted during profiling that it appears to be nastier than others. 
>> 
>> I’m more curious than anything else, but it doesn’t matter all that much going
>> forward since this commit will nuke it from orbit for the run of the mill 
>> eIBRS-only use cases.
> 
> As above, you're making multiple assumptions that may or may not hold true.  I
> agree with Jim, reacting to what the guest is actually doing is more robust than
> assuming the guest will do XYZ based on the vCPU model or some other heuristic.
> 
> The code isn't that complex, and KVM can even reuse the number of exits snapshot
> to periodically re-enable the intercept, e.g. to avoid unnecessary RDMSRs if the
> vCPU stops writing MSR_IA32_SPEC_CTRL for whatever reason.
> 
> Needs actual testing and informed magic numbers, but I think this captures the
> gist of what Jim is suggesting.

Thanks, Sean, Jim. I agree that having something robust and lightweight would be
real nice here. Thanks, Sean for the suggested code. I’ll take that, do some
testing, and report back.

> 
> ---
> arch/x86/include/asm/kvm_host.h |  3 +++
> arch/x86/kvm/svm/svm.c          | 22 ++++++++--------------
> arch/x86/kvm/vmx/vmx.c          | 28 ++++++++++------------------
> arch/x86/kvm/x86.h              | 24 ++++++++++++++++++++++++
> 4 files changed, 45 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fb9d1f2d6136..3fdb6048cd58 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -966,6 +966,9 @@ struct kvm_vcpu_arch {
> 	/* Host CPU on which VM-entry was most recently attempted */
> 	int last_vmentry_cpu;
> 
> +	u32 nr_quick_spec_ctrl_writes;
> +	u64 spec_ctrl_nr_exits_snapshot;
> +
> 	/* AMD MSRC001_0015 Hardware Configuration */
> 	u64 msr_hwcr;
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ca32389f3c36..f749613204d3 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2959,21 +2959,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> 			svm->vmcb->save.spec_ctrl = data;
> 		else
> 			svm->spec_ctrl = data;
> -		if (!data)
> -			break;
> 
> -		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_svm_vmrun_msrpm.
> -		 * We update the L1 MSR bit as well since it will end up
> -		 * touching the MSR anyway now.
> -		 */
> -		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
> +		if (!msr->host_initiated &&
> +		    kvm_account_msr_spec_ctrl_write(vcpu))
> +			set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
> 		break;
> 	case MSR_AMD64_VIRT_SPEC_CTRL:
> 		if (!msr->host_initiated &&
> @@ -4158,6 +4147,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
> 
> 	svm_complete_interrupts(vcpu);
> 
> +	if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
> +	    !spec_ctrl_intercepted &&
> +	    kvm_account_msr_spec_ctrl_passthrough(vcpu))
> +		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);
> +
> 	if (is_guest_mode(vcpu))
> 		return EXIT_FASTPATH_NONE;
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 44fb619803b8..4f4a2c3507bc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2260,24 +2260,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 			return 1;
> 
> 		vmx->spec_ctrl = data;
> -		if (!data)
> -			break;
> 
> -		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging. We update the vmcs01 here for L1 as well
> -		 * since it will end up touching the MSR anyway now.
> -		 */
> -		vmx_disable_intercept_for_msr(vcpu,
> -					      MSR_IA32_SPEC_CTRL,
> -					      MSR_TYPE_RW);
> +		if (!msr_info->host_initiated &&
> +		    kvm_account_msr_spec_ctrl_write(vcpu))
> +			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL,
> +						      MSR_TYPE_RW);
> 		break;
> 	case MSR_IA32_TSX_CTRL:
> 		if (!msr_info->host_initiated &&
> @@ -7192,6 +7179,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> {
> 	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	unsigned int run_flags = __vmx_vcpu_run_flags(vmx);
> 	unsigned long cr3, cr4;
> 
> 	/* Record the guest's net vcpu time for enforced NMI injections. */
> @@ -7280,7 +7268,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	kvm_wait_lapic_expire(vcpu);
> 
> 	/* The actual VMENTER/EXIT is in the .noinstr.text section. */
> -	vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx));
> +	vmx_vcpu_enter_exit(vcpu, run_flags);
> 
> 	/* All fields are clean at this point */
> 	if (kvm_is_using_evmcs()) {
> @@ -7346,6 +7334,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	vmx_recover_nmi_blocking(vmx);
> 	vmx_complete_interrupts(vmx);
> 
> +	if ((run_flags & VMX_RUN_SAVE_SPEC_CTRL) &&
> +	    kvm_account_msr_spec_ctrl_passthrough(vcpu))
> +		vmx_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
> +
> 	if (is_guest_mode(vcpu))
> 		return EXIT_FASTPATH_NONE;
> 
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..454bcbf5b543 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)
> 
> void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
> void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> +
> int kvm_spec_ctrl_test_value(u64 value);
> +
> +static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
> +{
> +	if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
> +		vcpu->arch.nr_quick_spec_ctrl_writes++;
> +	else
> +		vcpu->arch.nr_quick_spec_ctrl_writes = 0;
> +
> +	vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
> +
> +	return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;
> +}
> +
> +static inline bool kvm_account_msr_spec_ctrl_passthrough(struct kvm_vcpu *vcpu)
> +{
> +	if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 100000)
> +		return false;
> +
> +	vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
> +	vcpu->arch.nr_quick_spec_ctrl_writes = 0;
> +	return true;
> +}
> +
> bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> 			      struct x86_exception *e);
> 
> base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
> --
Jim Mattson June 1, 2023, 8:20 p.m. UTC | #8
On Thu, Jun 1, 2023 at 12:28 PM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On Jun 1, 2023, at 1:43 PM, Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, May 31, 2023, Jon Kohler wrote:
> >>
> >>> On May 31, 2023, at 2:30 PM, Jim Mattson <jmattson@google.com> wrote:
> >>>
> >>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@nutanix.com> wrote:
> >>>> Yea, I thought about it. On one hand, simplicity is king, and on the other
> >>>> hand, not having to think about this again is nice too.
> >>>>
> >>>> The challenge in my mind is that on setups where this truly is static, we’d
> >>>> be taking some incremental amount of memory to keep the counter around,
> >
> > Not really.  The vCPU structures are already order-2 allocations, increasing the
> > size by 8-16 bytes doesn't affect the actual memory usage in practice.  Death by
> > a thousand cuts is a potential problem, but we're a ways away from crossing back
> > over into order-3 allocations.
> >
> >>>> just to have the same outcome each time. Doesn’t feel right (to me) unless that is
> >>>> also used for “other” stuff as some sort of general purpose/common counter.
> >
> > ...
> >
>> Yes, there are places this could be stuffed, I’m sure. Still feels a bit heavy
> >> handed for the same-outcome-every-time situations though.
> >
> > There's no guarantee the outcome will be the same.  You're assuming that (a) the
> > guest is eIBRS aware, (b) SPEC_CTRL doesn't get extended for future mitigations,
> > and (c) that if L1 is running VMs of its own, that L1 is advertising eIBRS to L2
> > and that the L2 kernel is also aware of eIBRS.
> >
> >>>> RE Cost: I can’t put my finger on it, but I swear that RDMSR for *this*
> >>>> specific MSR is more expensive than any other RDMSR I’ve come across
> >>>> for run-of-the-mill random MSRs. I flipped thru the SDM and the mitigations
> >>>> documentation, and it only ever mentions that there is a notable cost to
> >>>> do WRMSR IA32_SPEC_CTRL, but nothing about the RDMSR side.
> >>>>
> >>>> If anyone happens to know from an Intel-internals perspective, I’d be quite
> >>>> interested to know why it just “feels” so darn costly. i.e. is the proc also doing
> >>>> special things under the covers, similar to what the processor does on
> >>>> writes to this one?
> >>>
> >>> What do you mean by "feels"? Have you measured it?
> >>
> >> There are plenty of rdmsr’s scattered around the entry and exit paths that get
> >> hit every time, but this is far and away always the most expensive one when
> >> profiling with perf top. I haven’t measured it separately from the existing code,
>> but rather noted during profiling that it appears to be nastier than others.
> >>
> >> I’m more curious than anything else, but it doesn’t matter all that much going
> >> forward since this commit will nuke it from orbit for the run of the mill
> >> eIBRS-only use cases.
> >
> > As above, you're making multiple assumptions that may or may not hold true.  I
> > agree with Jim, reacting to what the guest is actually doing is more robust than
> > assuming the guest will do XYZ based on the vCPU model or some other heuristic.
> >
> > The code isn't that complex, and KVM can even reuse the number of exits snapshot
> > to periodically re-enable the intercept, e.g. to avoid unnecessary RDMSRs if the
> > vCPU stops writing MSR_IA32_SPEC_CTRL for whatever reason.

Bonus! It's nice to be able to respond to a state change. For
instance, L1 might write IA32_SPEC_CTRL like crazy while running a
certain L2, and then it might go back to not writing it at all.

> > Needs actual testing and informed magic numbers, but I think this captures the
> > gist of what Jim is suggesting.
> Thanks, Sean, Jim. I agree that having something robust and lightweight would be
> real nice here. Thanks, Sean for the suggested code. I’ll take that, do some
> testing, and report back.
> >
> > ---
> > arch/x86/include/asm/kvm_host.h |  3 +++
> > arch/x86/kvm/svm/svm.c          | 22 ++++++++--------------
> > arch/x86/kvm/vmx/vmx.c          | 28 ++++++++++------------------
> > arch/x86/kvm/x86.h              | 24 ++++++++++++++++++++++++
> > 4 files changed, 45 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index fb9d1f2d6136..3fdb6048cd58 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -966,6 +966,9 @@ struct kvm_vcpu_arch {
> >       /* Host CPU on which VM-entry was most recently attempted */
> >       int last_vmentry_cpu;
> >
> > +     u32 nr_quick_spec_ctrl_writes;
> > +     u64 spec_ctrl_nr_exits_snapshot;
> > +
> >       /* AMD MSRC001_0015 Hardware Configuration */
> >       u64 msr_hwcr;
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index ca32389f3c36..f749613204d3 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2959,21 +2959,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> >                       svm->vmcb->save.spec_ctrl = data;
> >               else
> >                       svm->spec_ctrl = data;
> > -             if (!data)
> > -                     break;
> >
> > -             /*
> > -              * For non-nested:
> > -              * When it's written (to non-zero) for the first time, pass
> > -              * it through.
> > -              *
> > -              * For nested:
> > -              * The handling of the MSR bitmap for L2 guests is done in
> > -              * nested_svm_vmrun_msrpm.
> > -              * We update the L1 MSR bit as well since it will end up
> > -              * touching the MSR anyway now.
> > -              */
> > -             set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
> > +             if (!msr->host_initiated &&
> > +                 kvm_account_msr_spec_ctrl_write(vcpu))
> > +                     set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
> >               break;
> >       case MSR_AMD64_VIRT_SPEC_CTRL:
> >               if (!msr->host_initiated &&
> > @@ -4158,6 +4147,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
> >
> >       svm_complete_interrupts(vcpu);
> >
> > +     if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
> > +         !spec_ctrl_intercepted &&
> > +         kvm_account_msr_spec_ctrl_passthrough(vcpu))
> > +             set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);
> > +
> >       if (is_guest_mode(vcpu))
> >               return EXIT_FASTPATH_NONE;
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 44fb619803b8..4f4a2c3507bc 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -2260,24 +2260,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >                       return 1;
> >
> >               vmx->spec_ctrl = data;
> > -             if (!data)
> > -                     break;
> >
> > -             /*
> > -              * For non-nested:
> > -              * When it's written (to non-zero) for the first time, pass
> > -              * it through.
> > -              *
> > -              * For nested:
> > -              * The handling of the MSR bitmap for L2 guests is done in
> > -              * nested_vmx_prepare_msr_bitmap. We should not touch the
> > -              * vmcs02.msr_bitmap here since it gets completely overwritten
> > -              * in the merging. We update the vmcs01 here for L1 as well
> > -              * since it will end up touching the MSR anyway now.
> > -              */
> > -             vmx_disable_intercept_for_msr(vcpu,
> > -                                           MSR_IA32_SPEC_CTRL,
> > -                                           MSR_TYPE_RW);
> > +                 if (!msr_info->host_initiated &&
> > +                 kvm_account_msr_spec_ctrl_write(vcpu))
> > +                     vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL,
> > +                                                   MSR_TYPE_RW);
> >               break;
> >       case MSR_IA32_TSX_CTRL:
> >               if (!msr_info->host_initiated &&
> > @@ -7192,6 +7179,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > {
> >       struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +     unsigned int run_flags = __vmx_vcpu_run_flags(vmx);
> >       unsigned long cr3, cr4;
> >
> >       /* Record the guest's net vcpu time for enforced NMI injections. */
> > @@ -7280,7 +7268,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >       kvm_wait_lapic_expire(vcpu);
> >
> >       /* The actual VMENTER/EXIT is in the .noinstr.text section. */
> > -     vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx));
> > +     vmx_vcpu_enter_exit(vcpu, run_flags);
> >
> >       /* All fields are clean at this point */
> >       if (kvm_is_using_evmcs()) {
> > @@ -7346,6 +7334,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >       vmx_recover_nmi_blocking(vmx);
> >       vmx_complete_interrupts(vmx);
> >
> > +     if ((run_flags & VMX_RUN_SAVE_SPEC_CTRL) &&
> > +         kvm_account_msr_spec_ctrl_passthrough(vcpu))
> > +             vmx_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
> > +
> >       if (is_guest_mode(vcpu))
> >               return EXIT_FASTPATH_NONE;
> >
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index c544602d07a3..454bcbf5b543 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)
> >
> > void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
> > void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> > +
> > int kvm_spec_ctrl_test_value(u64 value);
> > +
> > +static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
> > +{
> > +     if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)

I think you mean 200 here. If it's bad to have more than 1
WRMSR(IA32_SPEC_CTRL) VM-exit in 20 VM-exits, then more than 10 such
VM-exits in 200 VM-exits represents sustained badness.

(Although, as Sean noted, these numbers are just placeholders.)

> > +             vcpu->arch.nr_quick_spec_ctrl_writes++;
> > +     else
> > +             vcpu->arch.nr_quick_spec_ctrl_writes = 0;
> > +
> > +     vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
> > +
> > +     return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;
> > +}
> > +
> > +static inline bool kvm_account_msr_spec_ctrl_passthrough(struct kvm_vcpu *vcpu)
> > +{
> > +     if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 100000)
> > +             return false;
> > +
> > +     vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
> > +     vcpu->arch.nr_quick_spec_ctrl_writes = 0;
> > +     return true;
> > +}
> > +
> > bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> > int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> >                             struct x86_exception *e);
> >
> > base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
> > --
>
Sean Christopherson June 1, 2023, 8:35 p.m. UTC | #9
On Thu, Jun 01, 2023, Jim Mattson wrote:
> On Thu, Jun 1, 2023 at 12:28 PM Jon Kohler <jon@nutanix.com> wrote:
> > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > > index c544602d07a3..454bcbf5b543 100644
> > > --- a/arch/x86/kvm/x86.h
> > > +++ b/arch/x86/kvm/x86.h
> > > @@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)
> > >
> > > void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
> > > void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> > > +
> > > int kvm_spec_ctrl_test_value(u64 value);
> > > +
> > > +static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
> > > +{
> > > +     if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
> 
> I think you mean 200 here. If it's bad to have more than 1
> WRMSR(IA32_SPEC_CTRL) VM-exit in 20 VM-exits, then more than 10 such
> VM-exits in 200 VM-exits represents sustained badness.

No?  The snapshot is updated on every write, i.e. this check is whether or not
the last wrmsr(SPEC_CTRL) was less than 20 exits ago.

       if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
               vcpu->arch.nr_quick_spec_ctrl_writes++;
       else
               vcpu->arch.nr_quick_spec_ctrl_writes = 0;

       vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;  <= new snapshot

       return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;

> (Although, as Sean noted, these numbers are just placeholders.)

And the logic is very off-the-cuff, e.g. it may be better to have a rolling 200-exit
window instead of 10 somewhat independent 20-exit windows.
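
The disagreement over 20 vs. 200 comes down to the snapshot being refreshed on every write, which is easy to check with a user-space model of the two helpers (the 20/10/100000 constants are the placeholder values from the suggested patch; the struct and function names here are illustrative only):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space model of the proposed per-vCPU accounting state. */
struct vcpu_model {
	uint64_t exits;            /* vcpu->stat.exits */
	uint64_t snapshot;         /* spec_ctrl_nr_exits_snapshot */
	uint32_t nr_quick_writes;  /* nr_quick_spec_ctrl_writes */
};

/* Mirrors kvm_account_msr_spec_ctrl_write(): the streak grows only if
 * this write lands within 20 exits of the *previous* write, and the
 * intercept is disabled once 10 such writes occur back to back. */
static bool account_write(struct vcpu_model *v)
{
	if (v->exits - v->snapshot < 20)
		v->nr_quick_writes++;
	else
		v->nr_quick_writes = 0;

	v->snapshot = v->exits;

	return v->nr_quick_writes >= 10;
}

/* Mirrors kvm_account_msr_spec_ctrl_passthrough(): re-arm interception
 * after a 100000-exit quiet period with no SPEC_CTRL writes. */
static bool account_passthrough(struct vcpu_model *v)
{
	if (v->exits - v->snapshot < 100000)
		return false;

	v->snapshot = v->exits;
	v->nr_quick_writes = 0;
	return true;
}
```

Ten writes spaced 5 exits apart trip the threshold, while a single 50-exit gap zeroes the streak, which is the "too aggressive" reset Jim objects to below.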
Jim Mattson June 1, 2023, 9:23 p.m. UTC | #10
On Thu, Jun 1, 2023 at 1:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jun 01, 2023, Jim Mattson wrote:
> > On Thu, Jun 1, 2023 at 12:28 PM Jon Kohler <jon@nutanix.com> wrote:
> > > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > > > index c544602d07a3..454bcbf5b543 100644
> > > > --- a/arch/x86/kvm/x86.h
> > > > +++ b/arch/x86/kvm/x86.h
> > > > @@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)
> > > >
> > > > void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
> > > > void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> > > > +
> > > > int kvm_spec_ctrl_test_value(u64 value);
> > > > +
> > > > +static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
> > > > +{
> > > > +     if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
> >
> > I think you mean 200 here. If it's bad to have more than 1
> > WRMSR(IA32_SPEC_CTRL) VM-exit in 20 VM-exits, then more than 10 such
> > VM-exits in 200 VM-exits represents sustained badness.
>
> No?  The snapshot is updated on every write, i.e. this check is whether or not
> the last wrmsr(SPEC_CTRL) was less than 20 cycles ago.
>
>        if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
>                vcpu->arch.nr_quick_spec_ctrl_writes++;
>        else
>                vcpu->arch.nr_quick_spec_ctrl_writes = 0;

Okay, then maybe this else clause is too aggressive. Just because we
went 21 VM-exits without seeing a WRMSR(IA32_SPEC_CTRL), we don't want
to clobber all of our history.

>        vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;  <= new snapshot
>
>        return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;
>
> > (Although, as Sean noted, these numbers are just placeholders.)
>
> And the logic is very off-the-cuff, e.g. it may be better to have a rolling 200-exit
> window instead of 10 somewhat independent 20-exit windows.
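
For comparison, the rolling-window variant could track the exit count at each of the last 10 writes in a small ring buffer and trigger passthrough only when all 10 landed within 200 exits, so one quiet stretch no longer clobbers the history. A hypothetical user-space sketch (none of these names or constants come from the patch, and the per-write cost is a 10-slot buffer instead of two scalars):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_EXITS	200
#define WINDOW_WRITES	10

/* Ring buffer of vcpu->stat.exits sampled at the last 10 writes. */
struct spec_ctrl_window {
	uint64_t write_exits[WINDOW_WRITES];
	uint32_t head;
	uint32_t count;
};

/* Record a SPEC_CTRL write at 'exits' and report whether the last
 * WINDOW_WRITES writes all occurred within WINDOW_EXITS exits. */
static bool window_account_write(struct spec_ctrl_window *w, uint64_t exits)
{
	w->write_exits[w->head] = exits;
	w->head = (w->head + 1) % WINDOW_WRITES;

	if (w->count < WINDOW_WRITES) {
		w->count++;
		if (w->count < WINDOW_WRITES)
			return false;
	}

	/* After the wrap, w->head indexes the oldest of the last
	 * WINDOW_WRITES writes, including the one just recorded. */
	return exits - w->write_exits[w->head] < WINDOW_EXITS;
}
```

With this shape, 10 writes spread over 45 exits trigger passthrough, while 10 writes spread over 4500 exits do not, regardless of any individual gap.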
Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..5e643ac897bc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2260,20 +2260,33 @@  static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;

 		vmx->spec_ctrl = data;
-		if (!data)
+
+		/*
+		 * Disable interception on the first non-zero write, except if
+		 * eIBRS is advertised to the guest and the guest is enabling
+		 * _only_ IBRS.  On eIBRS systems, kernels typically set IBRS
+		 * once at boot and never touch it post-boot.  All other bits,
+		 * and IBRS on non-eIBRS systems, are often set on a per-task
+		 * basis, i.e. change frequently, so the benefit of avoiding
+		 * VM-exits during guest context switches outweighs the cost of
+		 * RDMSR on every VM-Exit to save the guest's value.
+		 */
+		if (!data ||
+		    (data == SPEC_CTRL_IBRS &&
+		     (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
 			break;

 		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_vmx_prepare_msr_bitmap. We should not touch the
-		 * vmcs02.msr_bitmap here since it gets completely overwritten
-		 * in the merging. We update the vmcs01 here for L1 as well
-		 * since it will end up touching the MSR anyway now.
+		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
+		 * interception for the vCPU on the first write regardless of
+		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
+		 * combination of vmcs01 and vmcs12 bitmaps, and will be
+		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
+		 * nested VM-Enter.  Note, this does mean that future WRMSRs
+		 * from L2 will be intercepted until the next nested VM-Exit if
+		 * L2 was the first to write, but L1 exposing the MSR to L2
+		 * without first writing it is unlikely and not worth the
+		 * extra bit of complexity.
 		 */
 		vmx_disable_intercept_for_msr(vcpu,
 					      MSR_IA32_SPEC_CTRL,