Message ID | 20230605004334.1930091-1-mizhang@google.com (mailing list archive)
---|---
State | New, archived
Series | KVM: x86/mmu: Remove KVM MMU write lock when accessing indirect_shadow_pages
On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > page role is direct because this counter value is used as a coarse-grained > heuristics to check if there is nested guest active. Racing with this > heuristics without mmu lock will be harmless because the corresponding > indirect shadow sptes for the GPA will either be zapped by this thread or > some other thread who has previously zapped all indirect shadow pages and > makes the value to 0. > > Because of that, remove the KVM MMU write lock pair to potentially reduce > the lock contension and improve the performance of nested VM. In addition > opportunistically change the comment of 'direct mmu' to make the > description consistent with other places. > > Reported-by: Jim Mattson <jmattson@google.com> > Signed-off-by: Mingwei Zhang <mizhang@google.com> > --- > arch/x86/kvm/x86.c | 10 ++-------- > 1 file changed, 2 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 5ad55ef71433..97cfa5a00ff2 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > kvm_release_pfn_clean(pfn); > > - /* The instructions are well-emulated on direct mmu. */ > + /* The instructions are well-emulated on Direct MMUs. */ > if (vcpu->arch.mmu->root_role.direct) { > - unsigned int indirect_shadow_pages; > - > - write_lock(&vcpu->kvm->mmu_lock); > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > - write_unlock(&vcpu->kvm->mmu_lock); > - > - if (indirect_shadow_pages) > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) I don't understand the need for READ_ONCE() here. That implies that there is something tricky going on, and I don't think that's the case. > kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > > return true; > > base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7 > -- > 2.41.0.rc0.172.g3f132b7071-goog >
On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote: > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > > page role is direct because this counter value is used as a coarse-grained > > heuristics to check if there is nested guest active. Racing with this > > heuristics without mmu lock will be harmless because the corresponding > > indirect shadow sptes for the GPA will either be zapped by this thread or > > some other thread who has previously zapped all indirect shadow pages and > > makes the value to 0. > > > > Because of that, remove the KVM MMU write lock pair to potentially reduce > > the lock contension and improve the performance of nested VM. In addition > > opportunistically change the comment of 'direct mmu' to make the > > description consistent with other places. > > > > Reported-by: Jim Mattson <jmattson@google.com> > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > --- > > arch/x86/kvm/x86.c | 10 ++-------- > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 5ad55ef71433..97cfa5a00ff2 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > kvm_release_pfn_clean(pfn); > > > > - /* The instructions are well-emulated on direct mmu. */ > > + /* The instructions are well-emulated on Direct MMUs. */ Nit: Internally within Google, on older kernels, we have the "Direct MMU" which was the precursor to the TDP MMU we all know and love. This comment however does not refer to the Direct MMU. Direct here just refers to the direct role bit being set. Since it's just descriptive, direct should not be capitalized in this comment, so no reason to change this line. > > if (vcpu->arch.mmu->root_role.direct) { > > - unsigned int indirect_shadow_pages; > > - > > - write_lock(&vcpu->kvm->mmu_lock); > > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > > - write_unlock(&vcpu->kvm->mmu_lock); > > - > > - if (indirect_shadow_pages) > > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) > > I don't understand the need for READ_ONCE() here. That implies that > there is something tricky going on, and I don't think that's the case. Agree this doesn't need a READ_ONCE. Just a read is fine. kvm_mmu_unprotect_page starts by acquiring the MMU lock, so there's not much room to reorder anything anyway. Thanks for sending a patch to fix this. The critical section of the MMU lock here is small, but any lock acquisition in write mode can mess up performance of otherwise happy read-mode uses. > > > kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > > > > return true; > > > > base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7 > > -- > > 2.41.0.rc0.172.g3f132b7071-goog > >
On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote: > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > > page role is direct because this counter value is used as a coarse-grained > > heuristics to check if there is nested guest active. Racing with this > > heuristics without mmu lock will be harmless because the corresponding > > indirect shadow sptes for the GPA will either be zapped by this thread or > > some other thread who has previously zapped all indirect shadow pages and > > makes the value to 0. > > > > Because of that, remove the KVM MMU write lock pair to potentially reduce > > the lock contension and improve the performance of nested VM. In addition > > opportunistically change the comment of 'direct mmu' to make the > > description consistent with other places. > > > > Reported-by: Jim Mattson <jmattson@google.com> > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > --- > > arch/x86/kvm/x86.c | 10 ++-------- > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 5ad55ef71433..97cfa5a00ff2 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > kvm_release_pfn_clean(pfn); > > > > - /* The instructions are well-emulated on direct mmu. */ > > + /* The instructions are well-emulated on Direct MMUs. */ > > if (vcpu->arch.mmu->root_role.direct) { > > - unsigned int indirect_shadow_pages; > > - > > - write_lock(&vcpu->kvm->mmu_lock); > > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > > - write_unlock(&vcpu->kvm->mmu_lock); > > - > > - if (indirect_shadow_pages) > > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) > > I don't understand the need for READ_ONCE() here. That implies that > there is something tricky going on, and I don't think that's the case. READ_ONCE() is just telling the compiler not to remove the read. Since this is reading a global variable, the compiler might just read a previous copy if the value has already been read into a local variable. But that is not the case here... Note I see there is another READ_ONCE for kvm->arch.indirect_shadow_pages, so I am reusing the same thing. I did check the reordering issue but it should be fine because when 'we' see indirect_shadow_pages as 0, the shadow pages must have already been zapped. Not only because of the locking, but also the program order in __kvm_mmu_prepare_zap_page() shows that it will zap shadow pages first before updating the stats.
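For context, a simplified userspace sketch of what READ_ONCE() boils down to; the in-tree macro adds type-checking machinery on top of this, so treat READ_ONCE_SKETCH() below as an illustrative stand-in rather than the kernel's exact definition:

```c
#include <stdio.h>

/*
 * Simplified stand-in for the kernel's READ_ONCE(): a volatile access that
 * the compiler may not elide, duplicate, or tear.  It only illustrates the
 * point under discussion, namely forcing a fresh load instead of letting
 * the compiler reuse a previously read value.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

static unsigned int indirect_shadow_pages;

int main(void)
{
	/* Each use below emits its own load; two plain reads could legally
	 * be folded into one by the compiler. */
	unsigned int a = READ_ONCE_SKETCH(indirect_shadow_pages);
	unsigned int b = READ_ONCE_SKETCH(indirect_shadow_pages);

	printf("%u %u\n", a, b);
	return 0;
}
```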
On Mon, Jun 5, 2023 at 10:17 AM Ben Gardon <bgardon@google.com> wrote: > > On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote: > > > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > > > page role is direct because this counter value is used as a coarse-grained > > > heuristics to check if there is nested guest active. Racing with this > > > heuristics without mmu lock will be harmless because the corresponding > > > indirect shadow sptes for the GPA will either be zapped by this thread or > > > some other thread who has previously zapped all indirect shadow pages and > > > makes the value to 0. > > > > > > Because of that, remove the KVM MMU write lock pair to potentially reduce > > > the lock contension and improve the performance of nested VM. In addition > > > opportunistically change the comment of 'direct mmu' to make the > > > description consistent with other places. > > > > > > Reported-by: Jim Mattson <jmattson@google.com> > > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > > --- > > > arch/x86/kvm/x86.c | 10 ++-------- > > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 5ad55ef71433..97cfa5a00ff2 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > > > kvm_release_pfn_clean(pfn); > > > > > > - /* The instructions are well-emulated on direct mmu. */ > > > + /* The instructions are well-emulated on Direct MMUs. */ > > Nit: Internally within Google, on older kernels, we have the "Direct > MMU" which was the precursor to the TDP MMU we all know and love. This > comment however does not refer to the Direct MMU. Direct here just > refers to the direct role bit being set. Since it's just descriptive, > direct should not be capitalized in this comment, so no reason to > change this line. You are right., it is incorrect to uppercase the 'direct', since that generates confusions with our internal MMU implementation. So, I will just uppercase the 'mmu' here in the next version.
On Mon, Jun 5, 2023 at 10:42 AM Mingwei Zhang <mizhang@google.com> wrote: > > On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote: > > > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > > > page role is direct because this counter value is used as a coarse-grained > > > heuristics to check if there is nested guest active. Racing with this > > > heuristics without mmu lock will be harmless because the corresponding > > > indirect shadow sptes for the GPA will either be zapped by this thread or > > > some other thread who has previously zapped all indirect shadow pages and > > > makes the value to 0. > > > > > > Because of that, remove the KVM MMU write lock pair to potentially reduce > > > the lock contension and improve the performance of nested VM. In addition > > > opportunistically change the comment of 'direct mmu' to make the > > > description consistent with other places. > > > > > > Reported-by: Jim Mattson <jmattson@google.com> > > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > > --- > > > arch/x86/kvm/x86.c | 10 ++-------- > > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 5ad55ef71433..97cfa5a00ff2 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > > > kvm_release_pfn_clean(pfn); > > > > > > - /* The instructions are well-emulated on direct mmu. */ > > > + /* The instructions are well-emulated on Direct MMUs. */ > > > if (vcpu->arch.mmu->root_role.direct) { > > > - unsigned int indirect_shadow_pages; > > > - > > > - write_lock(&vcpu->kvm->mmu_lock); > > > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > > > - write_unlock(&vcpu->kvm->mmu_lock); > > > - > > > - if (indirect_shadow_pages) > > > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) > > > > I don't understand the need for READ_ONCE() here. That implies that > > there is something tricky going on, and I don't think that's the case. > > READ_ONCE() is just telling the compiler not to remove the read. Since > this is reading a global variable, the compiler might just read a > previous copy if the value has already been read into a local > variable. But that is not the case here... Not a global variable, actually, but that's not relevant. What would be wrong with using a previously read copy? We don't always wrap reads in READ_ONCE(). It's actually pretty rare. So, there should be an explicit and meaningful reason. > Note I see there is another READ_ONCE for > kvm->arch.indirect_shadow_pages, so I am reusing the same thing. That's not a good reason. "If all of your friends jumped off a cliff, would you?" > I did check the reordering issue but it should be fine because when > 'we' see indirect_shadow_pages as 0, the shadow pages must have > already been zapped. Not only because of the locking, but also the > program order in __kvm_mmu_prepare_zap_page() shows that it will zap > shadow pages first before updating the stats.
On Mon, Jun 5, 2023 at 11:12 AM Jim Mattson <jmattson@google.com> wrote: > > On Mon, Jun 5, 2023 at 10:42 AM Mingwei Zhang <mizhang@google.com> wrote: > > > > On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote: > > > > > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > > > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > > > > page role is direct because this counter value is used as a coarse-grained > > > > heuristics to check if there is nested guest active. Racing with this > > > > heuristics without mmu lock will be harmless because the corresponding > > > > indirect shadow sptes for the GPA will either be zapped by this thread or > > > > some other thread who has previously zapped all indirect shadow pages and > > > > makes the value to 0. > > > > > > > > Because of that, remove the KVM MMU write lock pair to potentially reduce > > > > the lock contension and improve the performance of nested VM. In addition > > > > opportunistically change the comment of 'direct mmu' to make the > > > > description consistent with other places. > > > > > > > > Reported-by: Jim Mattson <jmattson@google.com> > > > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > > > --- > > > > arch/x86/kvm/x86.c | 10 ++-------- > > > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > index 5ad55ef71433..97cfa5a00ff2 100644 > > > > --- a/arch/x86/kvm/x86.c > > > > +++ b/arch/x86/kvm/x86.c > > > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > > > > > kvm_release_pfn_clean(pfn); > > > > > > > > - /* The instructions are well-emulated on direct mmu. */ > > > > + /* The instructions are well-emulated on Direct MMUs. */ > > > > if (vcpu->arch.mmu->root_role.direct) { > > > > - unsigned int indirect_shadow_pages; > > > > - > > > > - write_lock(&vcpu->kvm->mmu_lock); > > > > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > > > > - write_unlock(&vcpu->kvm->mmu_lock); > > > > - > > > > - if (indirect_shadow_pages) > > > > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) > > > > > > I don't understand the need for READ_ONCE() here. That implies that > > > there is something tricky going on, and I don't think that's the case. > > > > READ_ONCE() is just telling the compiler not to remove the read. Since > > this is reading a global variable, the compiler might just read a > > previous copy if the value has already been read into a local > > variable. But that is not the case here... > > Not a global variable, actually, but that's not relevant. What would > be wrong with using a previously read copy? Nothing will be wrong I think since this is already just a heuristic. > > We don't always wrap reads in READ_ONCE(). It's actually pretty rare. > So, there should be an explicit and meaningful reason. > > > Note I see there is another READ_ONCE for > > kvm->arch.indirect_shadow_pages, so I am reusing the same thing. > > That's not a good reason. "If all of your friends jumped off a cliff, > would you?" :) > > > I did check the reordering issue but it should be fine because when > > 'we' see indirect_shadow_pages as 0, the shadow pages must have > > already been zapped. Not only because of the locking, but also the > > program order in __kvm_mmu_prepare_zap_page() shows that it will zap > > shadow pages first before updating the stats. 
Yeah, I forgot to mention that removing READ_ONCE() is OK with me.
On Mon, Jun 05, 2023, Mingwei Zhang wrote: > On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote: > > > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when > > > page role is direct because this counter value is used as a coarse-grained > > > heuristics to check if there is nested guest active. Racing with this > > > heuristics without mmu lock will be harmless because the corresponding > > > indirect shadow sptes for the GPA will either be zapped by this thread or > > > some other thread who has previously zapped all indirect shadow pages and > > > makes the value to 0. > > > > > > Because of that, remove the KVM MMU write lock pair to potentially reduce > > > the lock contension and improve the performance of nested VM. In addition > > > opportunistically change the comment of 'direct mmu' to make the > > > description consistent with other places. > > > > > > Reported-by: Jim Mattson <jmattson@google.com> > > > Signed-off-by: Mingwei Zhang <mizhang@google.com> > > > --- > > > arch/x86/kvm/x86.c | 10 ++-------- > > > 1 file changed, 2 insertions(+), 8 deletions(-) > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 5ad55ef71433..97cfa5a00ff2 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > > > kvm_release_pfn_clean(pfn); > > > > > > - /* The instructions are well-emulated on direct mmu. */ > > > + /* The instructions are well-emulated on Direct MMUs. */ > > > if (vcpu->arch.mmu->root_role.direct) { > > > - unsigned int indirect_shadow_pages; > > > - > > > - write_lock(&vcpu->kvm->mmu_lock); > > > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > > > - write_unlock(&vcpu->kvm->mmu_lock); > > > - > > > - if (indirect_shadow_pages) > > > + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) > > > > I don't understand the need for READ_ONCE() here. That implies that > > there is something tricky going on, and I don't think that's the case. > > READ_ONCE() is just telling the compiler not to remove the read. Since > this is reading a global variable, the compiler might just read a > previous copy if the value has already been read into a local > variable. But that is not the case here... > > Note I see there is another READ_ONCE for > kvm->arch.indirect_shadow_pages, so I am reusing the same thing. I agree with Jim, using READ_ONCE() doesn't make any sense. I suspect it may have been a misguided attempt to force the memory read to be as close to the write_lock() as possible, e.g. to minimize the chance of a false negative. > I did check the reordering issue but it should be fine because when > 'we' see indirect_shadow_pages as 0, the shadow pages must have > already been zapped. Not only because of the locking, but also the > program order in __kvm_mmu_prepare_zap_page() shows that it will zap > shadow pages first before updating the stats. I don't think zapping, i.e. the 1=>0 transition, is a concern. KVM is dropping the SPTE, so racing with kvm_mmu_pte_write() is a non-issue because the guest will either see the old value, or will fault after the SPTE is zapped, i.e. KVM won't run with a stale even if kvm_mmu_pte_write() sees '0' before TLBs are flushed. I believe the 0=>1 transition on the other hand doesn't have a *very* theoretical bug. 
KVM needs to ensure that either kvm_mmu_pte_write() sees an elevated count, or that a page fault task sees the updated guest PTE, i.e. the emulated write. The READ_ONCE() likely serves this purpose in practice, though technically it's insufficient. So I think this? --- arch/x86/kvm/mmu.h | 14 ++++++++++++++ arch/x86/kvm/mmu/mmu.c | 13 ++++++++++++- arch/x86/kvm/x86.c | 8 +------- 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 92d5a1924fc1..9cd105ccb1d4 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -264,6 +264,20 @@ static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm); } +static inline bool kvm_mmu_has_indirect_shadow_pages(struct kvm *kvm) +{ + /* + * When emulating guest writes, ensure the written value is visible to + * any task that is handling page faults before checking whether or not + * KVM is shadowing a guest PTE. This ensures either KVM will create + * the correct SPTE in the page fault handler, or this task will see + * a non-zero indirect_shadow_pages. Pairs with the smp_mb() in + * account_shadowed() and unaccount_shadowed(). + */ + smp_mb(); + return kvm->arch.indirect_shadow_pages; +} + static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) { /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */ diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c8961f45e3b1..1735bee3f653 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -830,6 +830,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) gfn_t gfn; kvm->arch.indirect_shadow_pages++; + + /* + * Ensure indirect_shadow_pages is elevated prior to re-reading guest + * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight + * emulated writes are visible before re-reading guest PTEs, or that + * an emulated write will see the elevated count and acquire mmu_lock + * to update SPTEs. Pairs with the smp_mb() in + * kvm_mmu_has_indirect_shadow_pages(). + */ + smp_mb(); + gfn = sp->gfn; slots = kvm_memslots_for_spte_role(kvm, sp->role); slot = __gfn_to_memslot(slots, gfn); @@ -5692,7 +5703,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, * If we don't have indirect shadow pages, it means no page is * write-protected, so we can exit simply. */ - if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) + if (!kvm_mmu_has_indirect_shadow_pages(vcpu->kvm)) return; pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index abfba3cae0ba..22c226f5f4f8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8588,13 +8588,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, /* The instructions are well-emulated on direct mmu. */ if (vcpu->arch.mmu->root_role.direct) { - unsigned int indirect_shadow_pages; - - write_lock(&vcpu->kvm->mmu_lock); - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; - write_unlock(&vcpu->kvm->mmu_lock); - - if (indirect_shadow_pages) + if (kvm_mmu_has_indirect_shadow_pages(vcpu->kvm)) kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); return true; base-commit: 69b4e5b82fec7195c79c939ce25789b16a133f3a --
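Since the diff above is dense, here is a condensed sketch of the pairing it proposes, with the unrelated KVM plumbing omitted; this merely restates the proposal above and is not merged code:

```c
/* Emulated-write side, called from kvm_mmu_pte_write() and
 * reexecute_instruction() in the proposal above. */
static inline bool kvm_mmu_has_indirect_shadow_pages(struct kvm *kvm)
{
	/*
	 * Order the already-performed emulated guest PTE write before the
	 * lockless read of the count; pairs with the barrier below.
	 */
	smp_mb();
	return kvm->arch.indirect_shadow_pages;
}

/* Page-fault side, where the first indirect shadow page is created. */
static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	kvm->arch.indirect_shadow_pages++;

	/*
	 * Order the increment before any subsequent re-read of guest PTEs
	 * (e.g. in FNAME(gpte_changed)); pairs with the barrier above.
	 */
	smp_mb();

	/* ... rest of account_shadowed() unchanged ... */
}
```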
On 6/5/23 19:17, Ben Gardon wrote: >>> - if (indirect_shadow_pages) >>> + if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) >> I don't understand the need for READ_ONCE() here. That implies that >> there is something tricky going on, and I don't think that's the case. > Agree this doesn't need a READ_ONCE. Just a read is fine. > kvm_mmu_unprotect_page starts by acquiring the MMU lock, so there's > not much room to reorder anything anyway. I agree that "there's not much room to reorder anything", but it's not a particularly formal/reassuring statement :) and READ_ONCE/WRITE_ONCE have documentation value too. Or maybe it's because I've become used to the C11 memory model. Either way, I like the idea to use READ_ONCE/WRITE_ONCE more explicitly whenever there can be concurrent accesses (which would be data races in the C11 model), and I think Linux is moving in that direction too. Paolo
> > > > > > I don't understand the need for READ_ONCE() here. That implies that > > > there is something tricky going on, and I don't think that's the case. > > > > READ_ONCE() is just telling the compiler not to remove the read. Since > > this is reading a global variable, the compiler might just read a > > previous copy if the value has already been read into a local > > variable. But that is not the case here... > > > > Note I see there is another READ_ONCE for > > kvm->arch.indirect_shadow_pages, so I am reusing the same thing. > > I agree with Jim, using READ_ONCE() doesn't make any sense. I suspect it may have > been a misguided attempt to force the memory read to be as close to the write_lock() > as possible, e.g. to minimize the chance of a false negative. Sean :) Your suggestion is the opposite with Jim. He is suggesting doing nothing, but your suggestion is doing way more than READ_ONCE(). > > > I did check the reordering issue but it should be fine because when > > 'we' see indirect_shadow_pages as 0, the shadow pages must have > > already been zapped. Not only because of the locking, but also the > > program order in __kvm_mmu_prepare_zap_page() shows that it will zap > > shadow pages first before updating the stats. > > I don't think zapping, i.e. the 1=>0 transition, is a concern. KVM is dropping > the SPTE, so racing with kvm_mmu_pte_write() is a non-issue because the guest > will either see the old value, or will fault after the SPTE is zapped, i.e. KVM > won't run with a stale even if kvm_mmu_pte_write() sees '0' before TLBs are > flushed. Agree. > > I believe the 0=>1 transition on the other hand doesn't have a *very* theoretical > bug. KVM needs to ensure that either kvm_mmu_pte_write() sees an elevated count, > or that a page fault task sees the updated guest PTE, i.e. the emulated write. > The READ_ONCE() likely serves this purpose in practice, though technically it's > insufficient. Agree. > > So I think this? Hmm. I agree with both points above, but below, the change seems too heavyweight. smp_wb() is a mfence(), i.e., serializing all loads/stores before the instruction. Doing that for every shadow page creation and destruction seems a lot. In fact, the case that only matters is '0->1' which may potentially confuse kvm_mmu_pte_write() when it reads 'indirect_shadow_count', but the majority of the cases are 'X => X + 1' where X != 0. So, those cases do not matter. So, if we want to add barriers, we only need it for 0->1. Maybe creating a new variable and not blocking account_shadow() and unaccount_shadow() is a better idea? Regardless, the above problem is related to interactions among account_shadow(), unaccount_shadow() and kvm_mmu_pte_write(). It has nothing to do with the 'reexecute_instruction()', which is what this patch is about. So, I think having a READ_ONCE() for reexecute_instruction() should be good enough. What do you think. 
> > --- > arch/x86/kvm/mmu.h | 14 ++++++++++++++ > arch/x86/kvm/mmu/mmu.c | 13 ++++++++++++- > arch/x86/kvm/x86.c | 8 +------- > 3 files changed, 27 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 92d5a1924fc1..9cd105ccb1d4 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -264,6 +264,20 @@ static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) > return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm); > } > > +static inline bool kvm_mmu_has_indirect_shadow_pages(struct kvm *kvm) > +{ > + /* > + * When emulating guest writes, ensure the written value is visible to > + * any task that is handling page faults before checking whether or not > + * KVM is shadowing a guest PTE. This ensures either KVM will create > + * the correct SPTE in the page fault handler, or this task will see > + * a non-zero indirect_shadow_pages. Pairs with the smp_mb() in > + * account_shadowed() and unaccount_shadowed(). > + */ > + smp_mb(); > + return kvm->arch.indirect_shadow_pages; > +} > + > static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) > { > /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */ > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index c8961f45e3b1..1735bee3f653 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -830,6 +830,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) > gfn_t gfn; > > kvm->arch.indirect_shadow_pages++; > + > + /* > + * Ensure indirect_shadow_pages is elevated prior to re-reading guest > + * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight > + * emulated writes are visible before re-reading guest PTEs, or that > + * an emulated write will see the elevated count and acquire mmu_lock > + * to update SPTEs. Pairs with the smp_mb() in > + * kvm_mmu_has_indirect_shadow_pages(). > + */ > + smp_mb(); > + > gfn = sp->gfn; > slots = kvm_memslots_for_spte_role(kvm, sp->role); > slot = __gfn_to_memslot(slots, gfn); > @@ -5692,7 +5703,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > * If we don't have indirect shadow pages, it means no page is > * write-protected, so we can exit simply. > */ > - if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) > + if (!kvm_mmu_has_indirect_shadow_pages(vcpu->kvm)) > return; > > pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index abfba3cae0ba..22c226f5f4f8 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -8588,13 +8588,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > /* The instructions are well-emulated on direct mmu. */ > if (vcpu->arch.mmu->root_role.direct) { > - unsigned int indirect_shadow_pages; > - > - write_lock(&vcpu->kvm->mmu_lock); > - indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > - write_unlock(&vcpu->kvm->mmu_lock); > - > - if (indirect_shadow_pages) > + if (kvm_mmu_has_indirect_shadow_pages(vcpu->kvm)) > kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > > return true; > > base-commit: 69b4e5b82fec7195c79c939ce25789b16a133f3a > -- >
> > > > So I think this? > > Hmm. I agree with both points above, but below, the change seems too > heavyweight. smp_wb() is a mfence(), i.e., serializing all > loads/stores before the instruction. Doing that for every shadow page > creation and destruction seems a lot. > typo: smp_wb() => smp_mb().
On Tue, Jun 06, 2023, Mingwei Zhang wrote: > > > > > > > > I don't understand the need for READ_ONCE() here. That implies that > > > > there is something tricky going on, and I don't think that's the case. > > > > > > READ_ONCE() is just telling the compiler not to remove the read. Since > > > this is reading a global variable, the compiler might just read a > > > previous copy if the value has already been read into a local > > > variable. But that is not the case here... > > > > > > Note I see there is another READ_ONCE for > > > kvm->arch.indirect_shadow_pages, so I am reusing the same thing. > > > > I agree with Jim, using READ_ONCE() doesn't make any sense. I suspect it may have > > been a misguided attempt to force the memory read to be as close to the write_lock() > > as possible, e.g. to minimize the chance of a false negative. > > Sean :) Your suggestion is the opposite with Jim. He is suggesting > doing nothing, but your suggestion is doing way more than READ_ONCE(). Not really. Jim is asserting that the READ_ONCE() is pointless, and I completely agree. I am also saying that I think there is a real memory ordering issue here, and that it was being papered over by the READ_ONCE() in kvm_mmu_pte_write(). > > So I think this? > > Hmm. I agree with both points above, but below, the change seems too > heavyweight. smp_wb() is a mfence(), i.e., serializing all > loads/stores before the instruction. Doing that for every shadow page > creation and destruction seems a lot. No, the smp_*b() variants are just compiler barriers on x86. > In fact, the case that only matters is '0->1' which may potentially > confuse kvm_mmu_pte_write() when it reads 'indirect_shadow_count', but > the majority of the cases are 'X => X + 1' where X != 0. So, those > cases do not matter. So, if we want to add barriers, we only need it > for 0->1. Maybe creating a new variable and not blocking > account_shadow() and unaccount_shadow() is a better idea? > > Regardless, the above problem is related to interactions among > account_shadow(), unaccount_shadow() and kvm_mmu_pte_write(). It has > nothing to do with the 'reexecute_instruction()', which is what this > patch is about. So, I think having a READ_ONCE() for > reexecute_instruction() should be good enough. What do you think. The reexecute_instruction() case should be fine without any fanciness, it's nothing more than a heuristic, i.e. neither a false positive nor a false negative will impact functional correctness, and nothing changes regardless of how many times the compiler reads the variable outside of mmu_lock. I was thinking that it would be better to have a single helper to locklessly access indirect_shadow_pages, but I agree that applying the barriers to reexecute_instruction() introduces a different kind of confusion. Want to post a v2 of yours without a READ_ONCE(), and I'll post a separate fix for the theoretical kvm_mmu_pte_write() race? And then Paolo can tell me that there's no race and school me on lockless programming once more ;-)
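For reference, a sketch of what the requested v2 hunk in reexecute_instruction() might look like: a plain lockless read, with no READ_ONCE() and no mmu_lock. This reflects the direction agreed on here, not necessarily the patch as eventually posted:

```c
	/* The instructions are well-emulated on direct mmu. */
	if (vcpu->arch.mmu->root_role.direct) {
		/*
		 * Lockless, plain read: indirect_shadow_pages is only a
		 * heuristic here, so seeing a stale value is harmless.
		 */
		if (vcpu->kvm->arch.indirect_shadow_pages)
			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));

		return true;
	}
```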
> > Hmm. I agree with both points above, but below, the change seems too > > heavyweight. smp_wb() is a mfence(), i.e., serializing all > > loads/stores before the instruction. Doing that for every shadow page > > creation and destruction seems a lot. > > No, the smp_*b() variants are just compiler barriers on x86. hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE") So this means smp_mb() is not a free lunch and we need to be a little bit careful. > > > In fact, the case that only matters is '0->1' which may potentially > > confuse kvm_mmu_pte_write() when it reads 'indirect_shadow_count', but > > the majority of the cases are 'X => X + 1' where X != 0. So, those > > cases do not matter. So, if we want to add barriers, we only need it > > for 0->1. Maybe creating a new variable and not blocking > > account_shadow() and unaccount_shadow() is a better idea? > > > > Regardless, the above problem is related to interactions among > > account_shadow(), unaccount_shadow() and kvm_mmu_pte_write(). It has > > nothing to do with the 'reexecute_instruction()', which is what this > > patch is about. So, I think having a READ_ONCE() for > > reexecute_instruction() should be good enough. What do you think. > > The reexecute_instruction() case should be fine without any fanciness, it's > nothing more than a heuristic, i.e. neither a false positive nor a false negative > will impact functional correctness, and nothing changes regardless of how many > times the compiler reads the variable outside of mmu_lock. > > I was thinking that it would be better to have a single helper to locklessly > access indirect_shadow_pages, but I agree that applying the barriers to > reexecute_instruction() introduces a different kind of confusion. > > Want to post a v2 of yours without a READ_ONCE(), and I'll post a separate fix > for the theoretical kvm_mmu_pte_write() race? And then Paolo can tell me that > there's no race and school me on lockless programming once more ;-) yeah, that works for me.
On Tue, Jun 06, 2023, Mingwei Zhang wrote: > > > Hmm. I agree with both points above, but below, the change seems too > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all > > > loads/stores before the instruction. Doing that for every shadow page > > > creation and destruction seems a lot. > > > > No, the smp_*b() variants are just compiler barriers on x86. > > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE") > > So this means smp_mb() is not a free lunch and we need to be a little > bit careful. Oh, those sneaky macros. x86 #defines __smp_mb(), not the outer helper. I'll take a closer look before posting to see if there's a way to avoid the runtime barrier.
On Tue, Jun 6, 2023 at 5:28 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Jun 06, 2023, Mingwei Zhang wrote: > > > > Hmm. I agree with both points above, but below, the change seems too > > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all > > > > loads/stores before the instruction. Doing that for every shadow page > > > > creation and destruction seems a lot. > > > > > > No, the smp_*b() variants are just compiler barriers on x86. > > > > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c > > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE") > > > > So this means smp_mb() is not a free lunch and we need to be a little > > bit careful. > > Oh, those sneaky macros. x86 #defines __smp_mb(), not the outer helper. I'll > take a closer look before posting to see if there's a way to avoid the runtime > barrier. Checked again, I think using smp_wmb() and smp_rmb() should be fine as those are just compiler barriers. We don't need a full barrier here. Thanks. -Mingwei
On Thu, Jun 15, 2023 at 4:58 PM Mingwei Zhang <mizhang@google.com> wrote: > > On Tue, Jun 6, 2023 at 5:28 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Tue, Jun 06, 2023, Mingwei Zhang wrote: > > > > > Hmm. I agree with both points above, but below, the change seems too > > > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all > > > > > loads/stores before the instruction. Doing that for every shadow page > > > > > creation and destruction seems a lot. > > > > > > > > No, the smp_*b() variants are just compiler barriers on x86. > > > > > > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c > > > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE") > > > > > > So this means smp_mb() is not a free lunch and we need to be a little > > > bit careful. > > > > Oh, those sneaky macros. x86 #defines __smp_mb(), not the outer helper. I'll > > take a closer look before posting to see if there's a way to avoid the runtime > > barrier. > > Checked again, I think using smp_wmb() and smp_rmb() should be fine as > those are just compiler barriers. We don't need a full barrier here. That seems adequate. > Thanks. > -Mingwei
On Mon, Jun 26, 2023, Jim Mattson wrote: > On Thu, Jun 15, 2023 at 4:58 PM Mingwei Zhang <mizhang@google.com> wrote: > > > > On Tue, Jun 6, 2023 at 5:28 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Tue, Jun 06, 2023, Mingwei Zhang wrote: > > > > > > Hmm. I agree with both points above, but below, the change seems too > > > > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all > > > > > > loads/stores before the instruction. Doing that for every shadow page > > > > > > creation and destruction seems a lot. > > > > > > > > > > No, the smp_*b() variants are just compiler barriers on x86. > > > > > > > > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c > > > > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE") > > > > > > > > So this means smp_mb() is not a free lunch and we need to be a little > > > > bit careful. > > > > > > Oh, those sneaky macros. x86 #defines __smp_mb(), not the outer helper. I'll > > > take a closer look before posting to see if there's a way to avoid the runtime > > > barrier. > > > > Checked again, I think using smp_wmb() and smp_rmb() should be fine as > > those are just compiler barriers. We don't need a full barrier here. > > That seems adequate. Strictly speaking, no, because neither FNAME(fetch) nor kvm_mmu_pte_write() are pure readers or writers. FNAME(fetch) reads guest memory (guest PTEs) and writes indirect_shadow_pages. kvm_mmu_pte_write() writes guest memory (guest PTEs) and reads indirect_shadow_pages (it later writes indirect_shadow_pages too, but that write isn't relevant to the ordering we care about here).
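To make the store-then-load shape of both sides concrete, here is a standalone userspace illustration using C11 atomics as stand-ins for the kernel primitives (the variable and function names only mirror the KVM discussion). With the seq_cst fences, the analogue of smp_mb(), the outcome where both threads read the initial zero is forbidden; with only release/acquire ordering, roughly what an smp_wmb()/smp_rmb() pairing would give, it is not:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int guest_pte;              /* stands in for the written gPTE */
static atomic_int indirect_shadow_pages;  /* stands in for the counter */
static int r_fault_side, r_emul_side;

/* Models the page-fault path: bump the count, then re-read the guest PTE. */
static void *fault_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&indirect_shadow_pages, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() analogue */
	r_fault_side = atomic_load_explicit(&guest_pte, memory_order_relaxed);
	return NULL;
}

/* Models kvm_mmu_pte_write(): write the guest PTE, then check the count. */
static void *emul_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&guest_pte, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() analogue */
	r_emul_side = atomic_load_explicit(&indirect_shadow_pages,
					   memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, fault_side, NULL);
	pthread_create(&b, NULL, emul_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* With the full fences, "PTE=0, count=0" cannot be observed. */
	printf("fault side saw PTE=%d, emul side saw count=%d\n",
	       r_fault_side, r_emul_side);
	return 0;
}
```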
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5ad55ef71433..97cfa5a00ff2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	kvm_release_pfn_clean(pfn);
 
-	/* The instructions are well-emulated on direct mmu. */
+	/* The instructions are well-emulated on Direct MMUs. */
 	if (vcpu->arch.mmu->root_role.direct) {
-		unsigned int indirect_shadow_pages;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		write_unlock(&vcpu->kvm->mmu_lock);
-
-		if (indirect_shadow_pages)
+		if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 		return true;
Remove the KVM MMU write lock when accessing the indirect_shadow_pages counter
while the page role is direct, because this counter value is used only as a
coarse-grained heuristic to check whether a nested guest is active. Racing with
this heuristic without the mmu lock is harmless because the corresponding
indirect shadow sptes for the GPA will either be zapped by this thread or by
some other thread that has previously zapped all indirect shadow pages and
brought the value to 0.

Because of that, remove the KVM MMU write lock pair to potentially reduce the
lock contention and improve the performance of nested VMs. In addition,
opportunistically change the comment of 'direct mmu' to make the description
consistent with other places.

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7