[v2,04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected

Message ID 20240831001538.336683-5-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86: Fix multiple #PF RO infinite loop bugs

Commit Message

Sean Christopherson Aug. 31, 2024, 12:15 a.m. UTC
When doing "fast unprotection" of nested TDP page tables, skip emulation
if and only if at least one gfn was unprotected, i.e. continue with
emulation if simply resuming is likely to hit the same fault and risk
putting the vCPU into an infinite loop.

Note, it's entirely possible to get a false negative, e.g. if a different
vCPU faults on the same gfn and unprotects the gfn first, but that's a
relatively rare edge case, and emulating is still functionally ok, i.e.
saving a few cycles by avoiding emulation isn't worth the risk of putting
the vCPU into an infinite loop.

Opportunistically rewrite the relevant comment to document in gory detail
exactly what scenario the "fast unprotect" logic is handling.

Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
Cc: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)
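
The skip-vs-emulate decision hinges on the return value of kvm_mmu_unprotect_page(), which reports whether any shadow pages for the gfn were actually zapped. A simplified sketch of its shape, approximating the mmu.c implementation (not verbatim; locking and zap bookkeeping abbreviated):

        int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
        {
                struct kvm_mmu_page *sp;
                LIST_HEAD(invalid_list);
                int r = 0;

                write_lock(&kvm->mmu_lock);
                /* Zap every shadow page that shadows gPTEs located in this gfn. */
                for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
                        r = 1;
                        kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
                }
                kvm_mmu_commit_zap_page(kvm, &invalid_list);
                write_unlock(&kvm->mmu_lock);

                /* Nonzero iff at least one SP was zapped, i.e. the gfn was unprotected. */
                return r;
        }

That return value is what lets the fault handler distinguish "the gfn really was write-protected for shadowing" from "the write fault has some other cause entirely, so emulation is required".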

Comments

Yuan Yao Sept. 6, 2024, 6:22 a.m. UTC | #1
On Fri, Aug 30, 2024 at 05:15:19PM -0700, Sean Christopherson wrote:
> When doing "fast unprotection" of nested TDP page tables, skip emulation
> if and only if at least one gfn was unprotected, i.e. continue with
> emulation if simply resuming is likely to hit the same fault and risk
> putting the vCPU into an infinite loop.
>
> Note, it's entirely possible to get a false negative, e.g. if a different
> vCPU faults on the same gfn and unprotects the gfn first, but that's a
> relatively rare edge case, and emulating is still functionally ok, i.e.
> saving a few cycles by avoiding emulation isn't worth the risk of putting
> the vCPU into an infinite loop.
>
> Opportunistically rewrite the relevant comment to document in gory detail
> exactly what scenario the "fast unprotect" logic is handling.
>
> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: Yuan Yao <yuan.yao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 57692d873f76..6b5f80f38a95 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5959,16 +5959,37 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	bool direct = vcpu->arch.mmu->root_role.direct;
>
>  	/*
> -	 * Before emulating the instruction, check if the error code
> -	 * was due to a RO violation while translating the guest page.
> -	 * This can occur when using nested virtualization with nested
> -	 * paging in both guests. If true, we simply unprotect the page
> -	 * and resume the guest.
> +	 * Before emulating the instruction, check to see if the access was due
> +	 * to a read-only violation while the CPU was walking non-nested NPT
> +	 * page tables, i.e. for a direct MMU, for _guest_ page tables in L1.
> +	 * If L1 is sharing (a subset of) its page tables with L2, e.g. by
> +	 * having nCR3 share lower level page tables with hCR3, then when KVM
> +	 * (L0) write-protects the nested NPTs, i.e. npt12 entries, KVM is also
> +	 * unknowingly write-protecting L1's guest page tables, which KVM isn't
> +	 * shadowing.

Hi Sean,

Sorry I didn't reply to v1 in time.

Now it's very clear to me why we had this code path here.
Thank you for the excellent explanation of this interesting behavior!

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

> +	 *
> +	 * Because the CPU (by default) walks NPT page tables using a write
> +	 * access (to ensure the CPU can do A/D updates), page walks in L1 can
> +	 * trigger write faults for the above case even when L1 isn't modifying
> +	 * PTEs.  As a result, KVM will unnecessarily emulate (or at least, try
> +	 * to emulate) an excessive number of L1 instructions; because L1's MMU
> +	 * isn't shadowed by KVM, there is no need to write-protect L1's gPTEs
> +	 * and thus no need to emulate in order to guarantee forward progress.
> +	 *
> +	 * Try to unprotect the gfn, i.e. zap any shadow pages, so that L1 can
> +	 * proceed without triggering emulation.  If one or more shadow pages
> +	 * was zapped, skip emulation and resume L1 to let it natively execute
> +	 * the instruction.  If no shadow pages were zapped, then the write-
> +	 * fault is due to something else entirely, i.e. KVM needs to emulate,
> +	 * as resuming the guest will put it into an infinite loop.
> +	 *
> +	 * Note, this code also applies to Intel CPUs, even though it is *very*
> +	 * unlikely that an L1 will share its page tables (IA32/PAE/paging64
> +	 * format) with L2's page tables (EPT format).
>  	 */
> -	if (direct && is_write_to_guest_page_table(error_code)) {
> -		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> +	if (direct && is_write_to_guest_page_table(error_code) &&
> +	    kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
>  		return RET_PF_RETRY;
> -	}
>
>  	/*
>  	 * The gfn is write-protected, but if emulation fails we can still
> --
> 2.46.0.469.g59c65b2a67-goog
>
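
For reference, is_write_to_guest_page_table(), the helper used in the new condition (added earlier in this series), checks that the fault was a present, write access taken while the CPU was walking guest page tables. Roughly:

        static inline bool is_write_to_guest_page_table(u64 error_code)
        {
                const u64 mask = PFERR_GUEST_PAGE_MASK | PFERR_WRITE_MASK |
                                 PFERR_PRESENT_MASK;

                return (error_code & mask) == mask;
        }
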
Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 57692d873f76..6b5f80f38a95 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5959,16 +5959,37 @@  static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	bool direct = vcpu->arch.mmu->root_role.direct;
 
 	/*
-	 * Before emulating the instruction, check if the error code
-	 * was due to a RO violation while translating the guest page.
-	 * This can occur when using nested virtualization with nested
-	 * paging in both guests. If true, we simply unprotect the page
-	 * and resume the guest.
+	 * Before emulating the instruction, check to see if the access was due
+	 * to a read-only violation while the CPU was walking non-nested NPT
+	 * page tables, i.e. for a direct MMU, for _guest_ page tables in L1.
+	 * If L1 is sharing (a subset of) its page tables with L2, e.g. by
+	 * having nCR3 share lower level page tables with hCR3, then when KVM
+	 * (L0) write-protects the nested NPTs, i.e. npt12 entries, KVM is also
+	 * unknowingly write-protecting L1's guest page tables, which KVM isn't
+	 * shadowing.
+	 *
+	 * Because the CPU (by default) walks NPT page tables using a write
+	 * access (to ensure the CPU can do A/D updates), page walks in L1 can
+	 * trigger write faults for the above case even when L1 isn't modifying
+	 * PTEs.  As a result, KVM will unnecessarily emulate (or at least, try
+	 * to emulate) an excessive number of L1 instructions; because L1's MMU
+	 * isn't shadowed by KVM, there is no need to write-protect L1's gPTEs
+	 * and thus no need to emulate in order to guarantee forward progress.
+	 *
+	 * Try to unprotect the gfn, i.e. zap any shadow pages, so that L1 can
+	 * proceed without triggering emulation.  If one or more shadow pages
+	 * was zapped, skip emulation and resume L1 to let it natively execute
+	 * the instruction.  If no shadow pages were zapped, then the write-
+	 * fault is due to something else entirely, i.e. KVM needs to emulate,
+	 * as resuming the guest will put it into an infinite loop.
+	 *
+	 * Note, this code also applies to Intel CPUs, even though it is *very*
+	 * unlikely that an L1 will share its page tables (IA32/PAE/paging64
+	 * format) with L2's page tables (EPT format).
 	 */
-	if (direct && is_write_to_guest_page_table(error_code)) {
-		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
+	if (direct && is_write_to_guest_page_table(error_code) &&
+	    kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
 		return RET_PF_RETRY;
-	}
 
 	/*
 	 * The gfn is write-protected, but if emulation fails we can still
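
The PFERR_* bits behind that check come from the SVM #NPF error code, where hardware distinguishes a fault on the final translation from a fault encountered while walking the guest's page tables; this is the distinction that 147277540bbc originally plumbed into KVM. Abbreviated, with exact macro spellings varying by kernel version (see arch/x86/include/asm/kvm_host.h):

        #define PFERR_PRESENT_MASK      BIT(0)      /* fault on a present PTE */
        #define PFERR_WRITE_MASK        BIT(1)      /* fault was a write */
        #define PFERR_GUEST_FINAL_MASK  BIT_ULL(32) /* fault on the final GPA */
        #define PFERR_GUEST_PAGE_MASK   BIT_ULL(33) /* fault during guest page-table walk */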