
[v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG

Message ID 20240402213656.3068504-1-dmatlack@google.com
State New, archived
Series [v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG

Commit Message

David Matlack April 2, 2024, 9:36 p.m. UTC
Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
blocking other threads (e.g. vCPUs taking page faults) for too long.

Specifically, change kvm_clear_dirty_log_protect() to acquire/release
mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
rather than around the entire for loop. This ensures that KVM will only
hold mmu_lock for the time it takes the architecture-specific code to
process up to 64 pages, rather than holding mmu_lock for log->num_pages,
which is controllable by userspace. This also avoids holding mmu_lock
when processing parts of the dirty_bitmap that are zero (i.e. when there
is nothing to clear).

Moving the acquire/release points for mmu_lock should be safe since
dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
is already accessed with atomic_long_fetch_andnot(). And at least on x86
holding mmu_lock doesn't even serialize access to the memslot dirty
bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
mmu_lock.

This change eliminates dips in guest performance during live migration
in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
would also reduce contention on mmu_lock, but doing so will increase the
rate of remote TLB flushing. And there's really no reason to punt this
problem to userspace since KVM can just drop and reacquire mmu_lock more
frequently.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Tianrui Zhao <zhaotianrui@loongson.cn>
Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anup Patel <anup@brainfault.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
v2:
 - Rebase onto kvm/queue [Marc]

v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/

 virt/kvm/kvm_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402
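For reference, a condensed sketch of the loop in kvm_clear_dirty_log_protect()
after this patch (the full hunks are quoted in the replies below; the mask
computation shown here is simplified from the upstream code and included only
for context):

	for (offset = log->first_page, i = offset / BITS_PER_LONG,
	     n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
	     i++, offset += BITS_PER_LONG) {
		unsigned long mask = *dirty_bitmap_buffer++;
		atomic_long_t *p = (atomic_long_t *) &dirty_bitmap[i];

		if (!mask)
			continue;

		/* Atomically clear the bits in the memslot's dirty_bitmap. */
		mask &= atomic_long_fetch_andnot(mask, p);

		/*
		 * Take mmu_lock per 64-page mask, and only when there is
		 * something to clear.
		 */
		if (mask) {
			flush = true;
			KVM_MMU_LOCK(kvm);
			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
								offset, mask);
			KVM_MMU_UNLOCK(kvm);
		}
	}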

Comments

bibo mao April 3, 2024, 1:50 a.m. UTC | #1
On 2024/4/3 5:36 AM, David Matlack wrote:
> Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
> blocking other threads (e.g. vCPUs taking page faults) for too long.
> 
> Specifically, change kvm_clear_dirty_log_protect() to acquire/release
> mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
> rather than around the entire for loop. This ensures that KVM will only
> hold mmu_lock for the time it takes the architecture-specific code to
> process up to 64 pages, rather than holding mmu_lock for log->num_pages,
> which is controllable by userspace. This also avoids holding mmu_lock
> when processing parts of the dirty_bitmap that are zero (i.e. when there
> is nothing to clear).
> 
> Moving the acquire/release points for mmu_lock should be safe since
> dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
> is already accessed with atomic_long_fetch_andnot(). And at least on x86
> holding mmu_lock doesn't even serialize access to the memslot dirty
> bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
> mmu_lock.
> 
> This change eliminates dips in guest performance during live migration
> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
Frequently dropping and reacquiring mmu_lock will cause the userspace 
migration process issuing CLEAR ioctls to contend with the 160 vCPUs, and 
migration speed may become slower. In theory the priority of the userspace 
migration thread should be higher than that of the vCPU threads.

Dropping and reacquiring mmu_lock every 64 pages may be a little too small; 
in general it could be one huge page size. However it should be decided by 
the framework maintainer :)

Regards
Bibo Mao
> would also reduce contention on mmu_lock, but doing so will increase the
> rate of remote TLB flushing. And there's really no reason to punt this
> problem to userspace since KVM can just drop and reacquire mmu_lock more
> frequently.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: Tianrui Zhao <zhaotianrui@loongson.cn>
> Cc: Bibo Mao <maobibo@loongson.cn>
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Anup Patel <anup@brainfault.org>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
> v2:
>   - Rebase onto kvm/queue [Marc]
> 
> v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/
> 
>   virt/kvm/kvm_main.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..0a8b25a52c15 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2386,7 +2386,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>   	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>   		return -EFAULT;
>   
> -	KVM_MMU_LOCK(kvm);
>   	for (offset = log->first_page, i = offset / BITS_PER_LONG,
>   		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
>   	     i++, offset += BITS_PER_LONG) {
> @@ -2405,11 +2404,12 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>   		*/
>   		if (mask) {
>   			flush = true;
> +			KVM_MMU_LOCK(kvm);
>   			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
>   								offset, mask);
> +			KVM_MMU_UNLOCK(kvm);
>   		}
>   	}
> -	KVM_MMU_UNLOCK(kvm);
>   
>   	if (flush)
>   		kvm_flush_remote_tlbs_memslot(kvm, memslot);
> 
> base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402
>
Sean Christopherson April 3, 2024, 5:20 p.m. UTC | #2
On Wed, Apr 03, 2024, maobibo wrote:
> 
> On 2024/4/3 上午5:36, David Matlack wrote:
> > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
> > blocking other threads (e.g. vCPUs taking page faults) for too long.
> > 
> > Specifically, change kvm_clear_dirty_log_protect() to acquire/release
> > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
> > rather than around the entire for loop. This ensures that KVM will only
> > hold mmu_lock for the time it takes the architecture-specific code to
> > process up to 64 pages, rather than holding mmu_lock for log->num_pages,
> > which is controllable by userspace. This also avoids holding mmu_lock
> > when processing parts of the dirty_bitmap that are zero (i.e. when there
> > is nothing to clear).
> > 
> > Moving the acquire/release points for mmu_lock should be safe since
> > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
> > is already accessed with atomic_long_fetch_andnot(). And at least on x86
> > holding mmu_lock doesn't even serialize access to the memslot dirty
> > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
> > mmu_lock.
> > 
> > This change eliminates dips in guest performance during live migration
> > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> Frequently drop/reacquire mmu_lock will cause userspace migration process
> issuing CLEAR ioctls to contend with 160 vCPU, migration speed maybe become
> slower.

Only if vCPUs actually acquire mmu_lock.  E.g. on x86, KVM fixes/handles faults
due to write-protection for dirty logging without acquiring mmu_lock.  So for x86,
taking faults that need to acquire mmu_lock while dirty logging should be fairly
uncommon (and if vCPUs are taking lots of faults, performance is likely going to
be bad no matter what).
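For illustration, a simplified sketch (not verbatim kernel code, and the
helper name is made up) of the lockless mechanism referred to above: a
write-protection fault that exists only because of dirty logging is fixed by
atomically restoring the writable bit on the SPTE, so mmu_lock is never taken.

	/*
	 * Hypothetical condensation of what fast_page_fault() does for a
	 * dirty-logging write-protection fault; the cmpxchg resolves races
	 * with concurrent SPTE updaters.
	 */
	static bool fast_fix_dirty_log_wp(struct kvm *kvm,
					  const struct kvm_memory_slot *slot,
					  u64 *sptep, u64 old_spte, gfn_t gfn)
	{
		u64 new_spte = old_spte | PT_WRITABLE_MASK;

		if (!try_cmpxchg64(sptep, &old_spte, new_spte))
			return false;	/* Lost the race, let the vCPU retry. */

		mark_page_dirty_in_slot(kvm, slot, gfn);
		return true;
	}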

> In theory priority of userspace migration thread should be higher than vcpu
> thread.

That's very debatable.  And it's not an apples-to-apples comparison, because
CLEAR_DIRTY_LOG can hold mmu_lock for a very long time, probably orders of
magnitude longer than a vCPU will hold mmu_lock when handling a page fault.

And on x86 and ARM, page faults can be resolved while holding mmu_lock for read.  As
a result, holding mmu_lock in CLEAR_DIRTY_LOG (for write) is effectively more
costly than holding it in vCPUs.

> Drop and reacquire mmu_lock with 64-pages may be a little too smaller,
> in generic it is one huge page size. However it should be decided by
> framework maintainer:)

We could tweak the batching, but my very strong preference would be to do that
only as a last resort, i.e. if and only if some magic batch number provides waaay
better performance in all scenarios.

Maintaining code with arbitrary magic numbers is a pain, e.g. KVM x86's MMU has
arbitrary batching in kvm_zap_obsolete_pages(), and the justification for the
number is extremely handwavy (paraphrasing the changelog):

      Zap at least 10 shadow pages before releasing mmu_lock to reduce the
      overhead associated with re-acquiring the lock.
    
      Note: "10" is an arbitrary number, speculated to be high enough so
      that a vCPU isn't stuck zapping obsolete pages for an extended period,
      but small enough so that other vCPUs aren't starved waiting for
      mmu_lock.

I.e. we're stuck with someone's best guess from years ago without any real idea
if 10 is a good number, let alone optimal.

Obviously that doesn't mean 64 pages is optimal, but it's at least not arbitrary,
it's just an artifact of how KVM processes the bitmap.

To be clear, I'm not opposed to per-arch behavior, nor do I think x86 should
dictate how all other architectures should behave.  But I would strongly prefer
to avoid per-arch behavior unless it's actually necessary (doubly so for batching).

In other words, if we do need per-arch behavior, e.g. because aggressively dropping
mmu_lock causes performance issues on other architectures that need to take mmu_lock
for write to handle faults, I would prefer to have the arch knob control whether
the lock+unlock is outside versus inside the loop, not control an arbitrary batch
size.
David Matlack April 4, 2024, 4:29 p.m. UTC | #3
On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
>
>
>
> On 2024/4/3 上午5:36, David Matlack wrote:
> > Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
> > blocking other threads (e.g. vCPUs taking page faults) for too long.
> >
> > Specifically, change kvm_clear_dirty_log_protect() to acquire/release
> > mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
> > rather than around the entire for loop. This ensures that KVM will only
> > hold mmu_lock for the time it takes the architecture-specific code to
> > process up to 64 pages, rather than holding mmu_lock for log->num_pages,
> > which is controllable by userspace. This also avoids holding mmu_lock
> > when processing parts of the dirty_bitmap that are zero (i.e. when there
> > is nothing to clear).
> >
> > Moving the acquire/release points for mmu_lock should be safe since
> > dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
> > is already accessed with atomic_long_fetch_andnot(). And at least on x86
> > holding mmu_lock doesn't even serialize access to the memslot dirty
> > bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
> > mmu_lock.
> >
> > This change eliminates dips in guest performance during live migration
> > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> Frequently drop/reacquire mmu_lock will cause userspace migration
> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
> maybe become slower.

In practice we have not found this to be the case. With this patch
applied we see a significant improvement in guest workload throughput
while userspace is issuing CLEAR ioctls without any change to the
overall migration duration.

For example, here[1] is a graph showing the effect of this patch on a
160 vCPU VM being Live Migrated while running a MySQL workload.
Y-Axis is transactions, blue line is without the patch, red line is
with the patch.

[1] https://drive.google.com/file/d/1zSKtc6wOQqfQrAQ4O9xlFmuyJ2-Iah0k/view

> In theory priority of userspace migration thread
> should be higher than vcpu thread.

I don't think it's black and white. If prioritizing migration threads
causes vCPU starvation (which is the type of issue this patch is
fixing), then that's probably not the right trade-off. IMO the
ultimate goal of live migration is that it's completely transparent to
the guest workload, i.e. we should aim to minimize guest disruption as
much as possible. A change that makes migration go 2x as fast but reduces
vCPU performance by 10x during the migration is probably not worth it.
And the reverse is also true, a change that increases vCPU performance
by 10% during migration but makes the migration take 10x longer is
also probably not worth it.

In the case of this patch, there doesn't seem to be a trade-off. We
see an improvement to vCPU performance without any regression in
migration duration or other metrics.

>
> Drop and reacquire mmu_lock with 64-pages may be a little too smaller,
> in generic it is one huge page size. However it should be decided by
> framework maintainer:)
>
> Regards
> Bibo Mao
> > would also reduce contention on mmu_lock, but doing so will increase the
> > rate of remote TLB flushing. And there's really no reason to punt this
> > problem to userspace since KVM can just drop and reacquire mmu_lock more
> > frequently.
> >
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Oliver Upton <oliver.upton@linux.dev>
> > Cc: Tianrui Zhao <zhaotianrui@loongson.cn>
> > Cc: Bibo Mao <maobibo@loongson.cn>
> > Cc: Huacai Chen <chenhuacai@kernel.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Anup Patel <anup@brainfault.org>
> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> > Cc: Janosch Frank <frankja@linux.ibm.com>
> > Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> > v2:
> >   - Rebase onto kvm/queue [Marc]
> >
> > v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/
> >
> >   virt/kvm/kvm_main.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fb49c2a60200..0a8b25a52c15 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2386,7 +2386,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
> >       if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
> >               return -EFAULT;
> >
> > -     KVM_MMU_LOCK(kvm);
> >       for (offset = log->first_page, i = offset / BITS_PER_LONG,
> >                n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
> >            i++, offset += BITS_PER_LONG) {
> > @@ -2405,11 +2404,12 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
> >               */
> >               if (mask) {
> >                       flush = true;
> > +                     KVM_MMU_LOCK(kvm);
> >                       kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
> >                                                               offset, mask);
> > +                     KVM_MMU_UNLOCK(kvm);
> >               }
> >       }
> > -     KVM_MMU_UNLOCK(kvm);
> >
> >       if (flush)
> >               kvm_flush_remote_tlbs_memslot(kvm, memslot);
> >
> > base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402
> >
>
Sean Christopherson April 4, 2024, 5:10 p.m. UTC | #4
On Thu, Apr 04, 2024, David Matlack wrote:
> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
> > > This change eliminates dips in guest performance during live migration
> > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> > Frequently drop/reacquire mmu_lock will cause userspace migration
> > process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
> > maybe become slower.
> 
> In practice we have not found this to be the case. With this patch
> applied we see a significant improvement in guest workload throughput
> while userspace is issuing CLEAR ioctls without any change to the
> overall migration duration.

...

> In the case of this patch, there doesn't seem to be a trade-off. We
> see an improvement to vCPU performance without any regression in
> migration duration or other metrics.

For x86.  We need to keep in mind that not all architectures have x86's optimization
around dirty logging faults, or around faults in general. E.g. LoongArch's (which
I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
And if the fault can't be handled in the fast path, KVM will actually acquire
mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
the mmu_seq and fault-in pfn stuff).

So for x86, I think we can comfortably state that this change is a net positive
for all scenarios.  But for other architectures, that might not be the case.
I'm not saying this isn't a good change for other architectures, just that we
don't have relevant data to really know for sure.

Absent performance data for other architectures, which is likely going to be
difficult/slow to get, it might make sense to have this be opt-in to start.  We
could even do it with minimal #ifdeffery, e.g. something like the below would allow
x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
(I assume we want to give kvm_get_dirty_log_protect() similar treatment?).

I don't love the idea of adding more arch specific MMU behavior (going the wrong
direction), but it doesn't seem like an unreasonable approach in this case.

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 86d267db87bb..5eb1ce83f29d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
        if (!memslot || (offset + __fls(mask)) >= memslot->npages)
                return;
 
-       KVM_MMU_LOCK(kvm);
+       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
        kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
-       KVM_MMU_UNLOCK(kvm);
+       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 }
 
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1fd9cb5d037..74ae844e4ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
                dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
                memset(dirty_bitmap_buffer, 0, n);
 
-               KVM_MMU_LOCK(kvm);
+               KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
                for (i = 0; i < n / sizeof(long); i++) {
                        unsigned long mask;
                        gfn_t offset;
@@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
                        kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
                                                                offset, mask);
                }
-               KVM_MMU_UNLOCK(kvm);
+               KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
        }
 
        if (flush)
@@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
        if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
                return -EFAULT;
 
-       KVM_MMU_LOCK(kvm);
+       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
        for (offset = log->first_page, i = offset / BITS_PER_LONG,
                 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
             i++, offset += BITS_PER_LONG) {
@@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
                                                                offset, mask);
                }
        }
-       KVM_MMU_UNLOCK(kvm);
+       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
 
        if (flush)
                kvm_flush_remote_tlbs_memslot(kvm, memslot);
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index ecefc7ec51af..39d8b809c303 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,6 +20,11 @@
 #define KVM_MMU_UNLOCK(kvm)            spin_unlock(&(kvm)->mmu_lock)
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
+#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT
+#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT     KVM_MMU_LOCK
+#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT   KVM_MMU_UNLOCK
+#endif
+
 kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
                     bool *async, bool write_fault, bool *writable);
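For illustration only (not part of the posted diff, and the exact placement of
the define is an assumption): an architecture that wants the aggressive
drop/reacquire behavior could override the new macros to no-ops and instead
take mmu_lock itself inside kvm_arch_mmu_enable_log_dirty_pt_masked(), around
each 64-page mask.

	/* Hypothetical arch-side override, e.g. from asm/kvm_host.h. */
	#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm)	do { } while (0)
	#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm)	do { } while (0)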
David Matlack April 4, 2024, 6:12 p.m. UTC | #5
On Thu, Apr 4, 2024 at 10:10 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 04, 2024, David Matlack wrote:
> > On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
> > > > This change eliminates dips in guest performance during live migration
> > > > in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> > > > 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> > > Frequently drop/reacquire mmu_lock will cause userspace migration
> > > process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
> > > maybe become slower.
> >
> > In practice we have not found this to be the case. With this patch
> > applied we see a significant improvement in guest workload throughput
> > while userspace is issuing CLEAR ioctls without any change to the
> > overall migration duration.
>
> ...
>
> > In the case of this patch, there doesn't seem to be a trade-off. We
> > see an improvement to vCPU performance without any regression in
> > migration duration or other metrics.
>
> For x86.  We need to keep in mind that not all architectures have x86's optimization
> around dirty logging faults, or around faults in general. E.g. LoongArch's (which
> I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
> And if the fault can't be handled in the fast path, KVM will actually acquire
> mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
> the mmu_seq and fault-in pfn stuff).
>
> So for x86, I think we can comfortably state that this change is a net positive
> for all scenarios.  But for other architectures, that might not be the case.
> I'm not saying this isn't a good change for other architectures, just that we
> don't have relevant data to really know for sure.

I do not have data for other architectures, but may be able to get
data on ARM in the next few weeks. I believe we saw similar benefits
when testing on ARM.

>
> Absent performance data for other architectures, which is likely going to be
> difficult/slow to get, it might make sense to have this be opt-in to start.  We
> could even do it with minimal #ifdeffery, e.g. something like the below would allow
> x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
> (I assume we want to give kvm_get_dirty_log_protect() similar treatment?).

I don't see any reason not to give kvm_get_dirty_log_protect() the
same treatment, but it's less important since
kvm_get_dirty_log_protect() does not take the mmu_lock at all when
manual-protect is enabled.

>
> I don't love the idea of adding more arch specific MMU behavior (going the wrong
> direction), but it doesn't seem like an unreasonable approach in this case.

I wonder if this is being overly cautious. I would expect only more
benefit on architectures that more aggressively take the mmu_lock on
vCPU threads during faults. The more lock acquisition on vCPU threads,
the more this patch will help reduce vCPU starvation during
CLEAR_DIRTY_LOG.

Hm, perhaps testing with ept=N (which will use the write-lock even for
dirty logging faults) would be a way to increase confidence in the
effect on other architectures?

>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 86d267db87bb..5eb1ce83f29d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>         if (!memslot || (offset + __fls(mask)) >= memslot->npages)
>                 return;
>
> -       KVM_MMU_LOCK(kvm);
> +       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>         kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> -       KVM_MMU_UNLOCK(kvm);
> +       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>  }
>
>  int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..74ae844e4ed0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                 dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
>                 memset(dirty_bitmap_buffer, 0, n);
>
> -               KVM_MMU_LOCK(kvm);
> +               KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>                 for (i = 0; i < n / sizeof(long); i++) {
>                         unsigned long mask;
>                         gfn_t offset;
> @@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                         kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
>                                                                 offset, mask);
>                 }
> -               KVM_MMU_UNLOCK(kvm);
> +               KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>         }
>
>         if (flush)
> @@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>         if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>                 return -EFAULT;
>
> -       KVM_MMU_LOCK(kvm);
> +       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>         for (offset = log->first_page, i = offset / BITS_PER_LONG,
>                  n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
>              i++, offset += BITS_PER_LONG) {
> @@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>                                                                 offset, mask);
>                 }
>         }
> -       KVM_MMU_UNLOCK(kvm);
> +       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>
>         if (flush)
>                 kvm_flush_remote_tlbs_memslot(kvm, memslot);
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index ecefc7ec51af..39d8b809c303 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,6 +20,11 @@
>  #define KVM_MMU_UNLOCK(kvm)            spin_unlock(&(kvm)->mmu_lock)
>  #endif /* KVM_HAVE_MMU_RWLOCK */
>
> +#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT
> +#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT     KVM_MMU_LOCK
> +#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT   KVM_MMU_UNLOCK
> +#endif
> +
>  kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>                      bool *async, bool write_fault, bool *writable);
>
>
Sean Christopherson April 4, 2024, 6:17 p.m. UTC | #6
On Thu, Apr 04, 2024, David Matlack wrote:
> > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > direction), but it doesn't seem like an unreasonable approach in this case.
> 
> I wonder if this is being overly cautious.

Probably.  "Lazy" is another word for it ;-)

> I would expect only more benefit on architectures that more aggressively take
> the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU
> threads, the more this patch will help reduce vCPU starvation during
> CLEAR_DIRTY_LOG.
> 
> Hm, perhaps testing with ept=N (which will use the write-lock for even
> dirty logging faults) would be a way to increase confidence in the
> effect on other architectures?

Turning off the TDP MMU would be more representative, just manually disable the
fast-path, e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 992e651540e8..532c24911f39 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3371,7 +3371,7 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
         * Note, instruction fetches and writes are mutually exclusive, ignore
         * the "exec" flag.
         */
-       return fault->write;
+       return false;//fault->write;
 }
 
 /*
bibo mao April 7, 2024, 1:36 a.m. UTC | #7
On 2024/4/5 12:29 AM, David Matlack wrote:
> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
>>
>>
>>
>> On 2024/4/3 上午5:36, David Matlack wrote:
>>> Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG to avoid
>>> blocking other threads (e.g. vCPUs taking page faults) for too long.
>>>
>>> Specifically, change kvm_clear_dirty_log_protect() to acquire/release
>>> mmu_lock only when calling kvm_arch_mmu_enable_log_dirty_pt_masked(),
>>> rather than around the entire for loop. This ensures that KVM will only
>>> hold mmu_lock for the time it takes the architecture-specific code to
>>> process up to 64 pages, rather than holding mmu_lock for log->num_pages,
>>> which is controllable by userspace. This also avoids holding mmu_lock
>>> when processing parts of the dirty_bitmap that are zero (i.e. when there
>>> is nothing to clear).
>>>
>>> Moving the acquire/release points for mmu_lock should be safe since
>>> dirty_bitmap_buffer is already protected by slots_lock, and dirty_bitmap
>>> is already accessed with atomic_long_fetch_andnot(). And at least on x86
>>> holding mmu_lock doesn't even serialize access to the memslot dirty
>>> bitmap, as vCPUs can call mark_page_dirty_in_slot() without holding
>>> mmu_lock.
>>>
>>> This change eliminates dips in guest performance during live migration
>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
>> Frequently drop/reacquire mmu_lock will cause userspace migration
>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
>> maybe become slower.
> 
> In practice we have not found this to be the case. With this patch
> applied we see a significant improvement in guest workload throughput
> while userspace is issuing CLEAR ioctls without any change to the
> overall migration duration.
> 
> For example, here[1] is a graph showing the effect of this patch on
> a160 vCPU VM being Live Migrated while running a MySQL workload.
> Y-Axis is transactions, blue line is without the patch, red line is
> with the patch.
> 
> [1] https://drive.google.com/file/d/1zSKtc6wOQqfQrAQ4O9xlFmuyJ2-Iah0k/view
It is great that there is an obvious improvement with the workload.
> 
>> In theory priority of userspace migration thread
>> should be higher than vcpu thread.
> 
> I don't think it's black and white. If prioritizing migration threads
> causes vCPU starvation (which is the type of issue this patch is
> fixing), then that's probably not the right trade-off. IMO the
> ultimate goal of live migration is that it's completely transparent to
> the guest workload, i.e. we should aim to minimize guest disruption as
> much as possible. A change that migration go 2x as fast but reduces
> vCPU performance by 10x during the migration is probably not worth it.
> And the reverse is also true, a change that increases vCPU performance
> by 10% during migration but makes the migration take 10x longer is
> also probably not worth it.
I am not a migration expert, but IIRC there is a migration timeout value in 
the cloud stack. If the migration timeout is reached, the migration will be 
aborted.
> 
> In the case of this patch, there doesn't seem to be a trade-off. We
> see an improvement to vCPU performance without any regression in
> migration duration or other metrics.
I will be OK with this patch if there is no regression in migration. If 
migration takes more time with the patch on x86, I suggest the migration 
experts give some comments.

Regards
Bibo Mao
> 
>>
>> Drop and reacquire mmu_lock with 64-pages may be a little too smaller,
>> in generic it is one huge page size. However it should be decided by
>> framework maintainer:)
>>
>> Regards
>> Bibo Mao
>>> would also reduce contention on mmu_lock, but doing so will increase the
>>> rate of remote TLB flushing. And there's really no reason to punt this
>>> problem to userspace since KVM can just drop and reacquire mmu_lock more
>>> frequently.
>>>
>>> Cc: Marc Zyngier <maz@kernel.org>
>>> Cc: Oliver Upton <oliver.upton@linux.dev>
>>> Cc: Tianrui Zhao <zhaotianrui@loongson.cn>
>>> Cc: Bibo Mao <maobibo@loongson.cn>
>>> Cc: Huacai Chen <chenhuacai@kernel.org>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> Cc: Anup Patel <anup@brainfault.org>
>>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
>>> Cc: Janosch Frank <frankja@linux.ibm.com>
>>> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>> Cc: Sean Christopherson <seanjc@google.com>
>>> Signed-off-by: David Matlack <dmatlack@google.com>
>>> ---
>>> v2:
>>>    - Rebase onto kvm/queue [Marc]
>>>
>>> v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/
>>>
>>>    virt/kvm/kvm_main.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index fb49c2a60200..0a8b25a52c15 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -2386,7 +2386,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>>>        if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>>>                return -EFAULT;
>>>
>>> -     KVM_MMU_LOCK(kvm);
>>>        for (offset = log->first_page, i = offset / BITS_PER_LONG,
>>>                 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
>>>             i++, offset += BITS_PER_LONG) {
>>> @@ -2405,11 +2404,12 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>>>                */
>>>                if (mask) {
>>>                        flush = true;
>>> +                     KVM_MMU_LOCK(kvm);
>>>                        kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
>>>                                                                offset, mask);
>>> +                     KVM_MMU_UNLOCK(kvm);
>>>                }
>>>        }
>>> -     KVM_MMU_UNLOCK(kvm);
>>>
>>>        if (flush)
>>>                kvm_flush_remote_tlbs_memslot(kvm, memslot);
>>>
>>> base-commit: 9bc60f733839ab6fcdde0d0b15cbb486123e6402
>>>
>>
bibo mao April 7, 2024, 2:25 a.m. UTC | #8
On 2024/4/5 1:10 AM, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Matlack wrote:
>> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
>>>> This change eliminates dips in guest performance during live migration
>>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
>>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
>>> Frequently drop/reacquire mmu_lock will cause userspace migration
>>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
>>> maybe become slower.
>>
>> In practice we have not found this to be the case. With this patch
>> applied we see a significant improvement in guest workload throughput
>> while userspace is issuing CLEAR ioctls without any change to the
>> overall migration duration.
> 
> ...
> 
>> In the case of this patch, there doesn't seem to be a trade-off. We
>> see an improvement to vCPU performance without any regression in
>> migration duration or other metrics.
> 
> For x86.  We need to keep in mind that not all architectures have x86's optimization
> around dirty logging faults, or around faults in general. E.g. LoongArch's (which
> I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
> And if the fault can't be handled in the fast path, KVM will actually acquire
> mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
> the mmu_seq and fault-in pfn stuff).
Yes, there is no lock in fast_page_fault() on x86; however the mmu 
spinlock is held on the fast path on LoongArch. I did not notice this; 
originally I thought that the x86 fast path takes a read_lock for pte 
modification and the write lock for page table modification.
> 
> So for x86, I think we can comfortably state that this change is a net positive
> for all scenarios.  But for other architectures, that might not be the case.
> I'm not saying this isn't a good change for other architectures, just that we
> don't have relevant data to really know for sure.
From the code, there is contention between the migration assistant thread 
and vCPU threads in the slow path, where huge pages need to be split if 
dirty logging is enabled on the memslot; however there is no contention 
between the migration assistant thread and the vCPU fault fast path. If 
that is right, the negative effect is small on x86.

> 
> Absent performance data for other architectures, which is likely going to be
> difficult/slow to get, it might make sense to have this be opt-in to start.  We
> could even do it with minimal #ifdeffery, e.g. something like the below would allow
> x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
> (I assume we want to give kvm_get_dirty_log_protect() similar treatment?).
> 
> I don't love the idea of adding more arch specific MMU behavior (going the wrong
> direction), but it doesn't seem like an unreasonable approach in this case.
No special modification for this; it is a little strange. The LoongArch 
page fault fast path can be improved later.

Regards
Bibo Mao
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 86d267db87bb..5eb1ce83f29d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -66,9 +66,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>          if (!memslot || (offset + __fls(mask)) >= memslot->npages)
>                  return;
>   
> -       KVM_MMU_LOCK(kvm);
> +       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>          kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> -       KVM_MMU_UNLOCK(kvm);
> +       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>   }
>   
>   int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..74ae844e4ed0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2279,7 +2279,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                  dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
>                  memset(dirty_bitmap_buffer, 0, n);
>   
> -               KVM_MMU_LOCK(kvm);
> +               KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>                  for (i = 0; i < n / sizeof(long); i++) {
>                          unsigned long mask;
>                          gfn_t offset;
> @@ -2295,7 +2295,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>                          kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
>                                                                  offset, mask);
>                  }
> -               KVM_MMU_UNLOCK(kvm);
> +               KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>          }
>   
>          if (flush)
> @@ -2390,7 +2390,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>          if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
>                  return -EFAULT;
>   
> -       KVM_MMU_LOCK(kvm);
> +       KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>          for (offset = log->first_page, i = offset / BITS_PER_LONG,
>                   n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
>               i++, offset += BITS_PER_LONG) {
> @@ -2413,7 +2413,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>                                                                  offset, mask);
>                  }
>          }
> -       KVM_MMU_UNLOCK(kvm);
> +       KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT(kvm);
>   
>          if (flush)
>                  kvm_flush_remote_tlbs_memslot(kvm, memslot);
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index ecefc7ec51af..39d8b809c303 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,6 +20,11 @@
>   #define KVM_MMU_UNLOCK(kvm)            spin_unlock(&(kvm)->mmu_lock)
>   #endif /* KVM_HAVE_MMU_RWLOCK */
>   
> +#ifndef KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT
> +#define KVM_MMU_LOCK_FOR_DIRTY_LOG_PROTECT     KVM_MMU_LOCK
> +#define KVM_MMU_UNLOCK_FOR_DIRTY_LOG_PROTECT   KVM_MMU_UNLOCK
> +#endif
> +
>   kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>                       bool *async, bool write_fault, bool *writable);
>   
>
David Matlack April 12, 2024, 4:12 p.m. UTC | #9
On Sat, Apr 6, 2024 at 7:26 PM maobibo <maobibo@loongson.cn> wrote:
>
>
>
> On 2024/4/5 上午1:10, Sean Christopherson wrote:
> > On Thu, Apr 04, 2024, David Matlack wrote:
> >> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
> >>>> This change eliminates dips in guest performance during live migration
> >>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
> >>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
> >>> Frequently drop/reacquire mmu_lock will cause userspace migration
> >>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
> >>> maybe become slower.
> >>
> >> In practice we have not found this to be the case. With this patch
> >> applied we see a significant improvement in guest workload throughput
> >> while userspace is issuing CLEAR ioctls without any change to the
> >> overall migration duration.
> >
> > ...
> >
> >> In the case of this patch, there doesn't seem to be a trade-off. We
> >> see an improvement to vCPU performance without any regression in
> >> migration duration or other metrics.
> >
> > For x86.  We need to keep in mind that not all architectures have x86's optimization
> > around dirty logging faults, or around faults in general. E.g. LoongArch's (which
> > I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
> > And if the fault can't be handled in the fast path, KVM will actually acquire
> > mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
> > the mmu_seq and fault-in pfn stuff).
> yes, there is no lock in function fast_page_fault on x86, however mmu
> spinlock is held on fast path on LoongArch. I do not notice this,
> originally I think that there is read_lock on x86 fast path for pte
> modification, write lock about page table modification.

Most[*] vCPU faults on KVM/x86 are handled as follows:

- vCPU write-protection and access-tracking faults are handled
locklessly (fast_page_fault()).
- All other vCPU faults are handled under read_lock(&kvm->mmu_lock).

[*] The locking is different when nested virtualization is in use, TDP
(i.e. stage-2 hardware) is disabled, and/or kvm.tdp_mmu=N.

> >
> > So for x86, I think we can comfortably state that this change is a net positive
> > for all scenarios.  But for other architectures, that might not be the case.
> > I'm not saying this isn't a good change for other architectures, just that we
> > don't have relevant data to really know for sure.
>  From the code there is contention between migration assistant thread
> and vcpu thread in slow path where huge page need be split if memslot is
> dirty log enabled, however there is no contention between migration
> assistant thread and vcpu fault fast path. If it is right, negative
> effect is small on x86.

Right, there is no contention between CLEAR_DIRTY_LOG and vCPU
write-protection faults on KVM/x86. KVM/arm64 does write-protection
faults under the read-lock.

KVM/x86 and KVM/arm64 also both have eager page splitting, where the
huge pages are split during CLEAR_DIRTY_LOG, so there are also no vCPU
faults to split huge pages.

>
> >
> > Absent performance data for other architectures, which is likely going to be
> > difficult/slow to get, it might make sense to have this be opt-in to start.  We
> > could even do it with minimal #ifdeffery, e.g. something like the below would allow
> > x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
> > (I assume we want to give kvm_get_dirty_log_protect() similar treatment?).
> >
> > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > direction), but it doesn't seem like an unreasonable approach in this case.
> No special modification for this, it is a little strange. LoongArch page
> fault fast path can improve later.

Sorry, I don't follow exactly what you mean here. Are you saying
Sean's patch is not required for LoongArch and, instead, LoongArch
should/will avoid acquiring the mmu_lock when handling
write-protection faults?
David Matlack April 12, 2024, 4:14 p.m. UTC | #10
On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 04, 2024, David Matlack wrote:
> > > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > > direction), but it doesn't seem like an unreasonable approach in this case.
> >
> > I wonder if this is being overly cautious.
>
> Probably.  "Lazy" is another word for it ;-)
>
> > I would expect only more benefit on architectures that more aggressively take
> > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU
> > threads, the more this patch will help reduce vCPU starvation during
> > CLEAR_DIRTY_LOG.
> >
> > Hm, perhaps testing with ept=N (which will use the write-lock for even
> > dirty logging faults) would be a way to increase confidence in the
> > effect on other architectures?
>
> Turning off the TDP MMU would be more representative, just manually disable the
> fast-path, e.g.

Good idea. I'm actually throwing in some writable module parameters
too to make it easy to toggle between configurations.

I'll report back when I have some data.
bibo mao April 15, 2024, 1:21 a.m. UTC | #11
On 2024/4/13 12:12 AM, David Matlack wrote:
> On Sat, Apr 6, 2024 at 7:26 PM maobibo <maobibo@loongson.cn> wrote:
>>
>>
>>
>> On 2024/4/5 上午1:10, Sean Christopherson wrote:
>>> On Thu, Apr 04, 2024, David Matlack wrote:
>>>> On Tue, Apr 2, 2024 at 6:50 PM maobibo <maobibo@loongson.cn> wrote:
>>>>>> This change eliminates dips in guest performance during live migration
>>>>>> in a 160 vCPU VM when userspace is issuing CLEAR ioctls (tested with
>>>>>> 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which
>>>>> Frequently drop/reacquire mmu_lock will cause userspace migration
>>>>> process issuing CLEAR ioctls to contend with 160 vCPU, migration speed
>>>>> maybe become slower.
>>>>
>>>> In practice we have not found this to be the case. With this patch
>>>> applied we see a significant improvement in guest workload throughput
>>>> while userspace is issuing CLEAR ioctls without any change to the
>>>> overall migration duration.
>>>
>>> ...
>>>
>>>> In the case of this patch, there doesn't seem to be a trade-off. We
>>>> see an improvement to vCPU performance without any regression in
>>>> migration duration or other metrics.
>>>
>>> For x86.  We need to keep in mind that not all architectures have x86's optimization
>>> around dirty logging faults, or around faults in general. E.g. LoongArch's (which
>>> I assume is Bibo Mao's primary interest) kvm_map_page_fast() still acquires mmu_lock.
>>> And if the fault can't be handled in the fast path, KVM will actually acquire
>>> mmu_lock twice (mmu_lock is dropped after the fast-path, then reacquired after
>>> the mmu_seq and fault-in pfn stuff).
>> yes, there is no lock in function fast_page_fault on x86, however mmu
>> spinlock is held on fast path on LoongArch. I do not notice this,
>> originally I think that there is read_lock on x86 fast path for pte
>> modification, write lock about page table modification.
> 
> Most[*] vCPU faults on KVM/x86 are handled as follows:
> 
> - vCPU write-protection and access-tracking faults are handled
> locklessly (fast_page_fault()).
> - All other vCPU faults are handled under read_lock(&kvm->mmu_lock).
> 
> [*] The locking is different when nested virtualization is in use, TDP
> (i.e. stage-2 hardware) is disabled, and/or kvm.tdp_mmu=N.
> 
>>>
>>> So for x86, I think we can comfortably state that this change is a net positive
>>> for all scenarios.  But for other architectures, that might not be the case.
>>> I'm not saying this isn't a good change for other architectures, just that we
>>> don't have relevant data to really know for sure.
>>   From the code there is contention between migration assistant thread
>> and vcpu thread in slow path where huge page need be split if memslot is
>> dirty log enabled, however there is no contention between migration
>> assistant thread and vcpu fault fast path. If it is right, negative
>> effect is small on x86.
> 
> Right there is no contention between CLEAR_DIRTY_LOG and vCPU
> write-protection faults on KVM/x86. KVM/arm64 does write-protection
> faults under the read-lock.
> 
> KVM/x86 and KVM/arm64 also both have eager page splitting, where the
> huge pages are split during CLEAR_DIRTY_LOG, so there are also no vCPU
> faults to split huge pages.
> 
>>
>>>
>>> Absent performance data for other architectures, which is likely going to be
>>> difficult/slow to get, it might make sense to have this be opt-in to start.  We
>>> could even do it with minimal #ifdeffery, e.g. something like the below would allow
>>> x86 to do whatever locking it wants in kvm_arch_mmu_enable_log_dirty_pt_masked()
>>> (I assume we want to give kvm_get_dirty_log_protect() similar treatment?).
>>>
>>> I don't love the idea of adding more arch specific MMU behavior (going the wrong
>>> direction), but it doesn't seem like an unreasonable approach in this case.
>> No special modification for this, it is a little strange. LoongArch page
>> fault fast path can improve later.
> 
> Sorry, I don't follow exactly what you mean here. Are you saying
> Sean's patch is not required for LoongArch and, instead, LoongArch
> should/will avoid acquiring the mmu_lock when handling
> write-protection faults?
> 
Yes, LoongArch needs some optimization in the secondary MMU fast path, such 
as an RCU lock, a read lock on mmu_lock, or a page-table spin lock on the 
pte table page. It is not decided yet. If the patch works on the x86 
platform, just go ahead; LoongArch KVM will improve later.

Regards
Bibo Mao
David Matlack April 15, 2024, 5:20 p.m. UTC | #12
On 2024-04-12 09:14 AM, David Matlack wrote:
> On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 04, 2024, David Matlack wrote:
> > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > > > direction), but it doesn't seem like an unreasonable approach in this case.
> > >
> > > I wonder if this is being overly cautious.
> >
> > Probably.  "Lazy" is another word for it ;-)
> >
> > > I would expect only more benefit on architectures that more aggressively take
> > > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU
> > > threads, the more this patch will help reduce vCPU starvation during
> > > CLEAR_DIRTY_LOG.
> > >
> > > Hm, perhaps testing with ept=N (which will use the write-lock for even
> > > dirty logging faults) would be a way to increase confidence in the
> > > effect on other architectures?
> >
> > Turning off the TDP MMU would be more representative, just manually disable the
> > fast-path, e.g.
> 
> Good idea. I'm actually throwing in some writable module parameters
> too to make it easy to toggle between configurations.
> 
> I'll report back when I have some data.

tl;dr

 * My patch likely _will_ regress migration performance on other architectures.
   Thank you Bibo and Sean for keeping me honest here.

 * I suspect the original issue my patch is trying to fix is actually specific
   to the way the TDP MMU does eager page splitting and a more targeted fix is
   warranted.

---

To evaluate my patch I tested on x86 with different mmu_lock configurations
to simulate other architectures.

 Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
 Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
 Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N

Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to
add a read_lock/unlock() in fast_page_fault().
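
A minimal sketch of what such a knob could look like (the actual non-upstream
patch is not shown in this thread, so the parameter name, placement, and the
__fast_page_fault() split below are assumptions):

	/* Hypothetical module parameter mirroring the description above. */
	static bool __read_mostly fast_page_fault_read_lock;
	module_param(fast_page_fault_read_lock, bool, 0644);

	static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
	{
		int ret;

		/* Simulate architectures that take mmu_lock for read on this path. */
		if (fast_page_fault_read_lock)
			read_lock(&vcpu->kvm->mmu_lock);

		ret = __fast_page_fault(vcpu, fault);	/* the existing lockless body */

		if (fast_page_fault_read_lock)
			read_unlock(&vcpu->kvm->mmu_lock);

		return ret;
	}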

Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates
LoongArch if LoongArch added support for lockless write-protection fault
handling.

The test I ran was a Live Migration of a 16 vCPU 64GB VM running an aggressive
write-heavy workload. To compare runs I evaluated 3 metrics:

 * Duration of pre-copy.
 * Amount of dirty memory going into post-copy.
 * Total CPU usage of CLEAR_DIRTY_LOG.

The following table shows how each metric changed after adding my patch to drop
mmu_lock during CLEAR_DIRTY_LOG.

          | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
 ---------|------------------|-----------------|---------------------
 Config 1 | -1%              | -1%             | +6%
 Config 2 | -1%              | +1%             | +123%
 Config 3 | +32%             | +158%           | +5200%

Config 2 and 3 both show regressions, with Config 3 severely regressed in all 3
dimensions.

Given these regressions, I started rethinking the original issue this patch is
trying to fix.

The dips in guest performance during CLEAR_DIRTY_LOG occurred during the first
pre-copy pass but not during subsequent passes. One thing unique to the first
pass is eager page splitting.

Ah ha, a theory! The TDP MMU allocates shadow pages while holding the mmu_lock
during eager page splitting and only drops the lock if need_resched=True or a
GFP_NOWAIT allocation fails. If neither occurs, CLEAR_DIRTY_LOG could potentially
hold mmu_lock in write-mode for a long time.

Second, the host platform where we saw the dips has nx_huge_pages=Y. I suspect
the long CLEAR_DIRTY_LOG calls are blocking vCPUs taking steady-state
faults for NX Huge Pages, causing the dips in performance.

This theory also explains why we (Google) haven't seen similar drops in guest
performance when using !manual-protect, as in that configuration the TDP MMU
does eager page splitting under the read-lock instead of write-lock.

If this is all true, then a better / more targeted fix for this issue would be
to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
could limit the "allocate under lock" behavior to only when the read-lock is
held, e.g.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7dfdc49a6ade..ea34f8232d97 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
         * If this allocation fails we drop the lock and retry with reclaim
         * allowed.
         */
-       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
-       if (sp)
-               return sp;
+       if (shared) {
+               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
+               if (sp)
+                       return sp;
+       }

        rcu_read_unlock();

I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
allocate page tables. So I suspect no fix is needed there and this is, in fact,
purely an x86-specific issue.
Sean Christopherson April 15, 2024, 8 p.m. UTC | #13
On Mon, Apr 15, 2024, David Matlack wrote:
>  * I suspect the original issue my patch is trying to fix is actually specific
>    to the way the TDP MMU does eager page splitting and a more targeted fix is
>    warranted.
> 
> ---
> 
> To evaluate my patch I tested on x86 with different mmu_lock configurations
> to simulate other architectures.
> 
>  Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
>  Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
>  Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N
> 
> Note: "fast_page_fault_read_lock" is a non-upstream parameter I added to
> add a read_lock/unlock() in fast_page_fault().
> 
> Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates
> LoongArch if LoongArch added support for lockless write-protection fault
> handling.
> 
> The test I ran was a Live Migration of a 16VCPU 64GB VM running an aggressive
> write-heavy workload. To compare runs I evaluated 3 metrics:
> 
>  * Duration of pre-copy.
>  * Amount of dirty memory going into post-copy.
>  * Total CPU usage of CLEAR_DIRTY_LOG.
> 
> The following table shows how each metric changed after adding my patch to drop
> mmu_lock during CLEAR_DIRTY_LOG.
> 
>           | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
>  ---------|------------------|-----------------|---------------------
>  Config 1 | -1%              | -1%             | +6%
>  Config 2 | -1%              | +1%             | +123%
>  Config 3 | +32%             | +158%           | +5200%
> 
> Config 2 and 3 both show regressions, with Config 3 severely regressed in all 3
> dimensions.

...

> If this is all true, then a better / more targeted fix for this issue would be
> to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> could limit the "allocate under lock" behavior to only when the read-lock is
> held, e.g.
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7dfdc49a6ade..ea34f8232d97 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>          * If this allocation fails we drop the lock and retry with reclaim
>          * allowed.
>          */
> -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> -       if (sp)
> -               return sp;
> +       if (shared) {
> +               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> +               if (sp)
> +                       return sp;
> +       }
> 
>         rcu_read_unlock();
> 
> I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> allocate page tables. So I suspect no fix is needed there and this is, in fact,
> purely and x86-specific issue.

Hmm, it'd be nice if we didn't have to rely on random arch flows to coincidentally
do the optimal thing for eager page splitting.  Not sure how best to document the
"best known practice" though.

As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock for
read is measurable, I would prefer to unconditionally drop mmu_lock, and delete
the GFP_NOWAIT allocation.  There can be lock contention when mmu_lock is held
for read, it's just less common.
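
Roughly, an untested sketch of that simplification (keeping the existing
__tdp_mmu_alloc_sp_for_split() helper and the iter->yielded handling as they are
today, and taking nid as a parameter purely for illustration):

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
						       struct tdp_iter *iter,
						       bool shared, int nid)
{
	struct kvm_mmu_page *sp;

	/*
	 * Unconditionally drop RCU and mmu_lock so the allocation can block,
	 * e.g. on direct reclaim, and mark the iterator as having yielded so
	 * the caller restarts its walk.
	 */
	rcu_read_unlock();

	if (shared)
		read_unlock(&kvm->mmu_lock);
	else
		write_unlock(&kvm->mmu_lock);

	iter->yielded = true;
	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);

	if (shared)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);

	rcu_read_lock();

	return sp;
}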

On a related topic, I think we should take a hard look at the rwlock_needbreak()
usage in tdp_mmu_iter_cond_resched().  Because dropping when allocating is really
just speculatively dropping mmu_lock because it _might_ be contended, but doing
so at a batch size that provides a good balance between doing enough work under
mmu_lock and providing low latency for vCPUs.  I.e. in theory, we should be able
to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
near smart enough and it's currently only for preemptible kernels (or at least,
it's supposed to be only for preemptible kernels).

Simply yielding on contention is not at all optimal, as evidenced by the whole
dynamic preemption debacle[1][2].  The immediate issue was "fixed" by having vCPUs
avoid taking mmu_lock, but KVM really shouldn't get into a situation where KVM is
pathologically dropping mmu_lock to the point where a non-vCPU action grinds to
a halt.

The contention logic fails to take into account many things:

 (1) Is the other task higher priority?

 (2) Is the other task a vCPU, or something else?

 (3) Will yielding actually allow the other task to make forward progress?

 (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?

 (5) What is the expected duration this task is expected to hold mmu_lock?

 (6) Has this task made enough progress for yielding to be a decent choice?

and probably many more than that.

As we discussed off-list, I think there are two viable approaches:

 (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
     contention logic, and improve the logic (especially the forward progress
     guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
     in all cases.

 (b) We completely remove the rwlock_needbreak() checks from
     tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
     mmu_lock in flows where doing so provides the best overall balance, e.g. as
     in the eager page split case.

I don't have a strong preference between (a) and (b), though I think I'd lean
towards (b), because it's simpler.  My guess is that we can achieve similar
performance results with both.  E.g. odds are decent that the "best" batch size
(see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
in the noise when it's not contended.

The main argument I see for (b) is that it's simpler, as only code that actually
has a justified need to drop mmu_lock does so.  The advantage I see with (a) is
that it would provide structure and documentation for choosing when to drop
mmu_lock (or not).

E.g. dropping in the eager page split path makes sense because KVM does so at a
large batch size, odds are good that the contending task is a vCPU, there's no
TLB flush required, the total hold time of mmu_lock is high, and we know that
dropping mmu_lock will allow vCPUs to make forward progress.  (a) would do a much
better job of capturing all that in code, albeit with quite a bit more complexity.
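
To make (a) slightly more concrete, a purely hypothetical sketch of a stronger
forward progress check (the helper name and batch size below are made up, and
assume the existing yielded_gfn/next_last_level_gfn bookkeeping in struct
tdp_iter):

/* Hypothetical: minimum number of GFNs to process before considering a yield. */
#define TDP_MMU_MIN_GFNS_BEFORE_YIELD	512

static bool tdp_mmu_made_enough_progress(struct tdp_iter *iter)
{
	/* Only consider yielding after a reasonable batch of work. */
	return iter->next_last_level_gfn >=
	       iter->yielded_gfn + TDP_MMU_MIN_GFNS_BEFORE_YIELD;
}

That obviously only covers #6; folding in the other factors (priority of the
contending task, TLB flush cost, etc.) is where the extra complexity comes in.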

Regardless of which option we choose, I think we should drop the preemptible kernel
dependency from the lock contention logic in tdp_mmu_iter_cond_resched(), i.e.
manually check if mmu_lock is contended instead of bouncing through rwlock_needbreak().
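
Untested, but in tdp_mmu_iter_cond_resched() that would look something like this
(paraphrasing the current logic from memory, with only the contention check
changed):

	/* Ensure forward progress has been made since the last yield. */
	if (iter->next_last_level_gfn == iter->yielded_gfn)
		return false;

	/*
	 * Check contention directly so the behavior doesn't silently depend
	 * on the kernel's preemption model the way rwlock_needbreak() does.
	 */
	if (need_resched() || rwlock_is_contended(&kvm->mmu_lock)) {
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		rcu_read_unlock();

		if (shared)
			cond_resched_rwlock_read(&kvm->mmu_lock);
		else
			cond_resched_rwlock_write(&kvm->mmu_lock);

		rcu_read_lock();
		iter->yielded = true;
	}

	return iter->yielded;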

The current approach essentially means that there's zero testing of the performance
of the yield-on-contention logic.  E.g. the complaints about the TDP MMU yielding
too aggressively only popped up when commit c597bfddc9e9 ("sched: Provide Kconfig
support for default dynamic preempt mode") unintentionally enabled
rwlock_needbreak() by default.

That's definitely easier said than done though, as I suspect that if we switched
to rwlock_is_contended(), i.e. dropped the preemptible requirement, without also
enhancing tdp_mmu_iter_cond_resched() to make it smarter as above, we'd see a lot
of performance regressions.

[1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@google.com
[2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@google.com
David Matlack April 18, 2024, 6:50 p.m. UTC | #14
On Mon, Apr 15, 2024 at 1:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 15, 2024, David Matlack wrote:
>
> > If this is all true, then a better / more targeted fix for this issue would be
> > to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
> > could limit the "allocate under lock" behavior to only when the read-lock is
> > held, e.g.
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 7dfdc49a6ade..ea34f8232d97 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >          * If this allocation fails we drop the lock and retry with reclaim
> >          * allowed.
> >          */
> > -       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> > -       if (sp)
> > -               return sp;
> > +       if (shared) {
> > +               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
> > +               if (sp)
> > +                       return sp;
> > +       }
> >
> >         rcu_read_unlock();
> >
> > I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
> > allocate page tables. So I suspect no fix is needed there and this is, in fact,
> > purely and x86-specific issue.
>
> Hmm, it'd be nice if we didn't have to rely on random arch flows to coincidentally
> do the optimal thing for eager page splitting.  Not sure how best to document the
> "best known practice" though.

We discussed upstreaming Google's mmu_lock timing stats yesterday.
Once that's in place, KVM could WARN_ON_ONCE() if the mmu_lock is held
for an excessive amount of time to help flag these kinds of issues.
That might trigger false positives though, as holding mmu_lock for a
long time is no big deal if there's no contention.
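
For example, something like the following (purely a sketch; the names and
threshold are made up, and the real thing would presumably hang off the timing
stats rather than open-coding ktime):

#include <linux/ktime.h>

/* Hypothetical threshold; would need tuning, a Kconfig, or a module param. */
#define MMU_LOCK_HOLD_WARN_NS	(20 * NSEC_PER_MSEC)

static inline u64 kvm_mmu_lock_timing_start(void)
{
	return ktime_get_ns();
}

static inline void kvm_mmu_lock_timing_end(u64 start_ns)
{
	/* Flag suspiciously long mmu_lock critical sections. */
	WARN_ON_ONCE(ktime_get_ns() - start_ns > MMU_LOCK_HOLD_WARN_NS);
}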

>
> As for the TDP MMU code, unless the cost of dropping and reacquiring mmu_lock for
> read is measurable, I would prefer to unconditionally drop mmu_lock, and delete
> the GFP_NOWAIT allocation.  There can be lock contention when mmu_lock is held
> for read, it's just less common.

SGTM. I'll do some testing. Unfortunately, the original MySQL workload
that led to this patch has bitrotted so I'm having some trouble
reproducing the original results and confirming the TDP MMU fix. Sigh.

>
> On a related topic, I think we should take a hard look at the rwlock_needbreak()
> usage in tdp_mmu_iter_cond_resched().  Because dropping when allocating is really
> just speculatively dropping mmu_lock because it _might_ be contended, but doing
> so at a batch size that provides a good balance between doing enough work under
> mmu_lock and providing low latency for vCPUs.  I.e. in theory, we should be able
> to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
> near smart enough and it's currently only for preemptible kernels (or at least,
> it's supposed to be only for preemptible kernels).

Dropping the lock around the allocation is also about dropping GFP_NOWAIT, i.e.
allowing direct reclaim and other blocking operations. This is valuable
for "cry-for-help" type migrations where the host is under intense
memory pressure. I'd rather do the reclaim on the eager page splitting
thread than a vCPU.
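
In GFP terms (same calls as the existing code, shown side by side just to
illustrate the difference):

	/* While holding mmu_lock: can't sleep, so no direct reclaim. */
	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);

	/* After dropping mmu_lock: may sleep, direct reclaim is allowed. */
	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);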

But I agree that the rwlock_needbreak() checks are pretty much
untested and likely super nonoptimal.

>
> Simply yielding on contention is not at all optimal, as evidenced by the whole
> dynamic preemption debacle[1][2].  The immediate issue was "fixed" by having vCPUs
> avoid taking mmu_lock, but KVM really shouldn't get into a situation where KVM is
> pathologically dropping mmu_lock to the point where a non-vCPU action grinds to
> a halt.
>
> The contention logic fails to take into account many things:
>
>  (1) Is the other task higher priority?
>
>  (2) Is the other task a vCPU, or something else?
>
>  (3) Will yielding actually allow the other task to make forward progress?
>
>  (4) What is the cost of dropping mmu_lock, e.g. is a remote TLB flush needed?
>
>  (5) What is the expected duration this task is expected to hold mmu_lock?
>
>  (6) Has this task made enough progress for yielding to be a decent choice?
>
> and probably many more than that.
>
> As we discussed off-list, I think there are two viable approaches:
>
>  (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
>      contention logic, and improve the logic (especially the forward progress
>      guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
>      in all cases.

The only way I can think of to universally measure forward progress
would be by wall time. Again that becomes more possible with the
mmu_lock timing stats. But we'll have to hand-pick some thresholds and
that feels wrong...

>
>  (b) We completely remove the rwlock_needbreak() checks from
>      tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
>      mmu_lock in flows where doing so provides the best overall balance, e.g. as
>      in the eager page split case.
>
> I don't have a strong preference between (a) and (b), though I think I'd lean
> towards (b), because it's simpler.  My guess is that we can achieve similar
> performance results with both.  E.g. odds are decent that the "best" batch size
> (see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
> in the noise when it's not contented.
>
> The main argument I see for (b) is that it's simpler, as only code that actually
> has a justified need to drop mmu_lock does so.  The advantage I see with (a) is
> that it would provide structure and documentation for choosing when to drop
> mmu_lock (or not).

I need to think it through more, but I'm leaning toward (b) and using the
mmu_lock stats to flag flows that are holding the lock too
long. With (b) we can make each flow incrementally better and don't
have to pick any magic numbers.

>
> E.g. dropping in the eager page split path makes sense because KVM does so at a
> large batch size, odds are good that the contending task is a vCPU, there's no
> TLB flush required, the total hold time of mmu_lock is high, and we know that
> dropping mmu_lock will allow vCPUs to make forward progress.  (a) would do a much
> better job of capturing all that in code, albeit with quite a bit more complexity.
>
> Regardless of which option we choose, I think we should drop the preemptible kernel
> dependency from the lock contention logic in tdp_mmu_iter_cond_resched(), i.e.
> manually check if mmu_lock is contented instead of bouncing through rwlock_needbreak().

+1

>
> The current approach essentially means that there's zero testing of the performance
> of the yield-on-contention logic.  E.g. the complaints about the TDP MMU yielding
> too aggressively only popped up when commit c597bfddc9e9 ("sched: Provide Kconfig
> support for default dynamic preempt mode") unintentionally enabled
> rwlock_needbreak() by default.
>
> That's definitely easier said then done though, as I suspect that if we switched
> to rwlock_is_contended(), i.e. dropped the preemptible requirement, without also
> enhancing tdp_mmu_iter_cond_resched() to make it smarter as above, we'd see a lot
> of performance regressions.
>
> [1] https://lore.kernel.org/all/20240312193911.1796717-3-seanjc@google.com
> [2] https://lore.kernel.org/all/20240222012640.2820927-1-seanjc@google.com
Sean Christopherson April 18, 2024, 7:39 p.m. UTC | #15
On Thu, Apr 18, 2024, David Matlack wrote:
> On Mon, Apr 15, 2024 at 1:00 PM Sean Christopherson <seanjc@google.com> wrote:
> > On a related topic, I think we should take a hard look at the rwlock_needbreak()
> > usage in tdp_mmu_iter_cond_resched().  Because dropping when allocating is really
> > just speculatively dropping mmu_lock because it _might_ be contended, but doing
> > so at a batch size that provides a good balance between doing enough work under
> > mmu_lock and providing low latency for vCPUs.  I.e. in theory, we should be able
> > to handle this fully in tdp_mmu_iter_cond_resched(), but that code is nowhere
> > near smart enough and it's currently only for preemptible kernels (or at least,
> > it's supposed to be only for preemptible kernels).
> 
> Dropping the lock for allocating is also to drop GFP_NOWAIT, i.e. to
> allow direct reclaim and other blocking operations. This is valuable
> for "cry-for-help" type migrations where the host is under intense
> memory pressure. I'd rather do the reclaim on the eager page splitting
> thread than a vCPU.

...

> >  (a) We drop the preemption dependency from tdp_mmu_iter_cond_resched()'s lock
> >      contention logic, and improve the logic (especially the forward progress
> >      guarantees) so that tdp_mmu_iter_cond_resched() provides solid performance
> >      in all cases.
> 
> The only way I can think of to universally measure forward progress
> would be by wall time. Again that becomes more possible with the
> mmu_lock timing stats. But we'll have to hand-pick some thresholds and
> that feels wrong...

Yeah, hard pass on anything based on arbitrary time thresholds.  And I think we
should have the mmu_lock stats be buried behind a Kconfig, i.e. we shouldn't use
them in KVM to guide behavior in any way.

> >  (b) We completely remove the rwlock_needbreak() checks from
> >      tdp_mmu_iter_cond_resched(), and instead rely on unconditionally dropping
> >      mmu_lock in flows where doing so provides the best overall balance, e.g. as
> >      in the eager page split case.
> >
> > I don't have a strong preference between (a) and (b), though I think I'd lean
> > towards (b), because it's simpler.  My guess is that we can achieve similar
> > performance results with both.  E.g. odds are decent that the "best" batch size
> > (see #6) is large enough that the cost of dropping and reacquiring mmu_lock is
> > in the noise when it's not contented.
> >
> > The main argument I see for (b) is that it's simpler, as only code that actually
> > has a justified need to drop mmu_lock does so.  The advantage I see with (a) is
> > that it would provide structure and documentation for choosing when to drop
> > mmu_lock (or not).
> 
> I need to think it through more but I'm leaning toward (b) and use the
> mmu_lock stats to flag potential flows that are holding the lock too
> long. With (b) we can make each flow incrementally better and don't
> have to pick any magic numbers.

I'm fully in the (b) boat given the GFP_NOWAIT => direct reclaim piece of the
puzzle.

  1. TDP MMU now frees and allocates roots with mmu_lock held for read
  2. The "zap everything" MTRR trainwreck likely on its way out the door[1]
  3. The change_pte() mmu_notifier is gone[2]
  4. We're working on aging GFNs with mmu_lock held for read, or not at all[3]

As a result, the only paths that take mmu_lock for write are "zap all",
mmu_notifier invalidations events, and APICv inhibit toggling, which does
kvm_zap_gfn_range() on a single page, i.e. won't ever need to yield.

The "slow" zap all is mutually exclusive with anything interesting because it
only runs when the process is exiting.

The "fast" zap all should never yield when it holds mmu_lock for write, because
the part that runs with mmu_lock held for write is <drum roll> fast.  Maaaybe
the slow part that runs with mmu_lock held for read could drop mmu_lock on
contention.

That just leaves mmu_notifier invalidations, and the mess with dynamic preemption
that resulted in KVM bouncing mmu_lock between invalidations and vCPUs is pretty
strong evidence that yielding on mmu_lock contention when zapping SPTEs is a
terrible idea.

And yielding from most flows that hold mmu_lock for read is also a net negative,
as (a) *all* readers need to go away, (b) the majority of such flows are
short-lived and operate in vCPU context, and (c) the remaining flows that take
mmu_lock are either slow paths (zap all, mmu_notifier invalidations), or rare
(APICv toggling).

In short, I highly doubt we'll ever have more than a few flows where yielding
because mmu_lock is contended is actually desirable.  As a result, odds are good
that the reasons for dropping mmu_lock in the middle of a sequence are going to
be unique each and every time.

E.g. eager page splitting wants to drop it in order to do direct reclaim, and
because it can be super long running, *and* doesn't need to flush TLBs when yielding.

kvm_tdp_mmu_zap_invalidated_roots() is the only flow I can think of where dropping
mmu_lock on contention might make sense, e.g. to allow an mmu_notifier invalidation
to go through.  Again, that's a very long-running flow that doesn't need to flush
TLBs, is (hopefully) not being done from vCPU context, and could put KVM into a
scenario where it blocks an mmu_notifier, which in turn blocks vCPUs that are
trying to rebuild SPTEs.

[1] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com
[2] https://lore.kernel.org/all/20240405115815.3226315-1-pbonzini@redhat.com
[3] https://lore.kernel.org/all/20240401232946.1837665-1-jthoughton@google.com

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..0a8b25a52c15 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2386,7 +2386,6 @@  static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
-	KVM_MMU_LOCK(kvm);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
@@ -2405,11 +2404,12 @@  static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 		*/
 		if (mask) {
 			flush = true;
+			KVM_MMU_LOCK(kvm);
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
+			KVM_MMU_UNLOCK(kvm);
 		}
 	}
-	KVM_MMU_UNLOCK(kvm);
 
 	if (flush)
 		kvm_flush_remote_tlbs_memslot(kvm, memslot);