[RFC,1/1] kvm: Use a new spinlock to avoid atomic operations in kvm_get_dirty_log_protect

Message ID 20220805073544.555261-1-leobras@redhat.com (mailing list archive)
State New, archived
Series [RFC,1/1] kvm: Use a new spinlock to avoid atomic operations in kvm_get_dirty_log_protect

Commit Message

Leonardo Bras Aug. 5, 2022, 7:35 a.m. UTC
<The objective is to get feedback on the idea before working on a proper
 version. Please provide feedback! >

kvm_get_dirty_log_protect is very slow, and it takes a while to get the
dirty bitmap when 'manual_dirty_log_protect' is not being used.
The current flow is roughly this (a condensed sketch follows the list):
- Memset the 'second_dirty_bitmap' with zeros,
- Take KVM_MMU_LOCK and scan the whole bitmap for dirty-bits:
  - If a long has any dirty-bit set, xchg() it into second_dirty_bitmap,
  - Reprotect the related memory pages,
- Flush remote tlbs,
- Copy the dirty-bitmap to the user, and return.
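
For reference, this is roughly the code the patch removes for that path
(condensed from virt/kvm/kvm_main.c; declarations and error handling
elided):

	dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
	memset(dirty_bitmap_buffer, 0, n);

	KVM_MMU_LOCK(kvm);
	for (i = 0; i < n / sizeof(long); i++) {
		unsigned long mask;
		gfn_t offset;

		if (!dirty_bitmap[i])
			continue;

		flush = true;
		/* Atomically grab-and-clear the dirty bits of this long. */
		mask = xchg(&dirty_bitmap[i], 0);
		dirty_bitmap_buffer[i] = mask;

		offset = i * BITS_PER_LONG;
		/* Re-protect these pages so new writes are logged again. */
		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
							offset, mask);
	}
	KVM_MMU_UNLOCK(kvm);

	if (flush)
		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);

	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
		return -EFAULT;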

The proposal is to remove the memset and the atomics by using the
second_dirty_bitmap as a real dirty-bitmap instead of a simple buffer
(condensed sketch after the list):
- When kvm_get_dirty_log_protect runs, before copying anything, it switches
  to the other dirty-bitmap (primary->secondary or secondary->primary).
- (After that, mark_page_dirty_in_slot() will start writing dirty-bits to
  the other bitmap, so nothing else is using the retired one, and it's fine
  to modify it without atomics),
- Then, copy the retired bitmap to userspace,
- Then take the KVM_MMU_LOCK, scan the copied bitmap, and both reprotect the
  pages and clear the dirty bits in the same loop,
- Flush remote tlbs, and return.
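
Condensed from the patch at the bottom of this mail, the new
non-manual-protect path then looks roughly like this (declarations
elided):

	/* Switch bitmaps under the new spinlock; work on the retired one. */
	spin_lock(&memslot->bm_switch_lock);
	if (memslot->use_secondary)
		dirty_bitmap = kvm_second_dirty_bitmap(memslot);
	else
		dirty_bitmap = memslot->dirty_bitmap;

	memslot->use_secondary = !memslot->use_secondary;
	spin_unlock(&memslot->bm_switch_lock);

	if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
		return -EFAULT;

	KVM_MMU_LOCK(kvm);
	for (i = 0; i < n / sizeof(long); i++) {
		if (!dirty_bitmap[i])
			continue;

		flush = true;
		/* No xchg(): nothing else writes to the retired bitmap. */
		mask = dirty_bitmap[i];
		dirty_bitmap[i] = 0;

		/* Reprotect the pages and clear the bits in the same pass. */
		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
							i * BITS_PER_LONG, mask);
	}
	KVM_MMU_UNLOCK(kvm);

	if (flush)
		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);

The xchg() disappears because, once the switch is published under the
spinlock, only this path touches the retired bitmap.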

The bitmap switch (primary <-> secondary) is protected by a spinlock,
which is also taken in mark_page_dirty_in_slot() to make sure it does
not write to the wrong bitmap. Since this spinlock protects just a few
instructions, it should not introduce much delay.

Results:
1 - Average clock count spent in kvm_get_dirty_log_protect() goes from
    13608798 to 6734244, around 50% fewer clock cycles.
2 - Average clock count spent in mark_page_dirty_in_slot() goes from
    462 to 471, around a 2% increase; since this is just a few cycles,
    the measurement may be imprecise.

Known limitations:
0 - It's preliminary. It has been tested, but there is a lot of rough stuff
that needs to be fixed if the idea turns out to be interesting.

1 - Spin-Locking in mark_page_dirty_in_slot():
I understand this function is called a lot while the guest runs, and it
should probably be as fast as possible, so introducing a lock here seems
counter-productive. To be fair, though, I could not see it getting any
slower than a couple of cycles in my current setup (x86_64 machine).

2 - Qemu will use 'manual_dirty_log_protect'
I understand that more recent qemu versions will use
'manual_dirty_log_protect' when available, so this approach will not
benefit that use case, which is quite common.
A counter-argument would be that there are other hypervisors that could
benefit from it, and it also applies to older qemu versions.

3 - On top of that, the overhead from (1) is unnecessary in case (2)
That could be removed by testing for 'manual_dirty_log_protect' in
'mark_page_dirty_in_slot()' and skipping the spinlock, but this was left out
just to keep the RFC simpler (a rough sketch follows).
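
A rough sketch of what that skip could look like in
mark_page_dirty_in_slot() (hypothetical, not part of this RFC; it assumes
kvm->manual_dirty_log_protect can simply be tested here):

	if (kvm->dirty_ring_size) {
		kvm_dirty_ring_push(&vcpu->dirty_ring, slot, rel_gfn);
	} else if (kvm->manual_dirty_log_protect) {
		/* Manual protect never switches bitmaps: no lock needed. */
		set_bit_le(rel_gfn, memslot->dirty_bitmap);
	} else {
		spin_lock(&memslot->bm_switch_lock);
		if (memslot->use_secondary)
			set_bit_le(rel_gfn, kvm_second_dirty_bitmap(memslot));
		else
			set_bit_le(rel_gfn, memslot->dirty_bitmap);
		spin_unlock(&memslot->bm_switch_lock);
	}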

Performance tests were done using:
- x86_64 VM with 32 vcpus and 16GB RAM
- Systemtap probes in function enter and function return, measuring the
  number of clock cycles spent between those two, and printing the average.
- A synthetic VM load that keeps dirtying memory at a fixed rate of 8 Gbps.
- VM Migration between hosts.

Further info:
- I am also trying to think of improvements for the
  'manual_dirty_log_protect' use case, which seems to be very hard to
  improve. For starters, using the same approach to remove the atomics
  does not seem to bring any relevant speedup.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/kvm_host.h |  3 ++
 virt/kvm/kvm_main.c      | 75 ++++++++++++++++++++++++++--------------
 2 files changed, 52 insertions(+), 26 deletions(-)

Comments

Paolo Bonzini Aug. 5, 2022, 9:33 a.m. UTC | #1
On 8/5/22 09:35, Leonardo Bras wrote:
> 1 - Spin-Locking in mark_page_dirty_in_slot():
> I understand this function is called a lot while the guest runs, and it
> should probably be as fast as possible, so introducing a lock here seems
> counter-productive. To be fair, though, I could not see it getting any
> slower than a couple of cycles in my current setup (x86_64 machine).

Maybe too small workload?  32 vcpus at 8 GB/s would mean 256 MB/s/vCPU, 
i.e. 16384 pages/second *at most*.  That might not create too much 
contention.

One possibility here is to use a global (for all VMs) 
percpu_rw_semaphore, or perhaps even RCU.  The write critical section is 
so short that it could be a win nevertheless.
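
A minimal, purely hypothetical sketch of that idea, reusing the names from
this patch and a single KVM-wide lock (not from any posted code):

	/* Hypothetical: one global lock; dirtying vcpus are the readers. */
	static DEFINE_STATIC_PERCPU_RWSEM(kvm_dirty_bitmap_sem);

	/* mark_page_dirty_in_slot(): cheap per-cpu read side, hot path. */
	percpu_down_read(&kvm_dirty_bitmap_sem);
	if (memslot->use_secondary)
		set_bit_le(rel_gfn, kvm_second_dirty_bitmap(memslot));
	else
		set_bit_le(rel_gfn, memslot->dirty_bitmap);
	percpu_up_read(&kvm_dirty_bitmap_sem);

	/* kvm_get_dirty_log_protect(): rare, short write side for the switch. */
	percpu_down_write(&kvm_dirty_bitmap_sem);
	memslot->use_secondary = !memslot->use_secondary;
	percpu_up_write(&kvm_dirty_bitmap_sem);

The write side only covers the flag flip, so it stays short even though it
briefly stops every dirtying vcpu.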

However...

> 2 - Qemu will use 'manual_dirty_log_protect'
> I understand that more recent qemu versions will use
> 'manual_dirty_log_protect' when available, so this approach will not
> benefit that use case, which is quite common.
> A counter-argument would be that there are other hypervisors that could
> benefit from it, and it also applies to older qemu versions.

... that was my first thought indeed.  I would just consider the old API 
to be legacy and not bother with it.  Mostly because of the ability to 
clear a small part of the bitmap(*), and the initially-all-set 
optimization, manual dirty log ought to be superior even if 
CLEAR_DIRTY_LOG has to use atomics.

> - I am also trying to think of improvements for the
>   'manual_dirty_log_protect' use case, which seems to be very hard to
>   improve. For starters, using the same approach to remove the atomics
>   does not seem to bring any relevant speedup.

Yes, there are two issues:

1) CLEAR_DIRTY_LOG does not clear all bits, only those passed in by 
userspace.  This means that the inactive bitmap still has some bits set

2) the ability to clear part of the bitmap makes it hard to do a 
wholesale switch in CLEAR_DIRTY_LOG; this is the dealbreaker

Thanks,

Paolo

(*) for the old API, that requires a workaround of using multiple small 
memslots, e.g. 1-32G in size

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 90a45ef7203bd..bee3809e85e93 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -571,6 +571,9 @@  struct kvm_memory_slot {
 	gfn_t base_gfn;
 	unsigned long npages;
 	unsigned long *dirty_bitmap;
+	/* Protects use_secondary, which tells which bitmap to use */
+	spinlock_t bm_switch_lock;
+	bool use_secondary;
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
 	u32 flags;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a49df8988cd6a..5ea543d8cfec2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1560,6 +1560,9 @@  static int kvm_prepare_memory_region(struct kvm *kvm,
 
 	r = kvm_arch_prepare_memory_region(kvm, old, new, change);
 
+	if (new)
+		spin_lock_init(&new->bm_switch_lock);
+
 	/* Free the bitmap on failure if it was allocated above. */
 	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
 		kvm_destroy_dirty_bitmap(new);
@@ -2053,7 +2056,6 @@  static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	int i, as_id, id;
 	unsigned long n;
 	unsigned long *dirty_bitmap;
-	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
 	/* Dirty ring tracking is exclusive to dirty log tracking */
@@ -2070,12 +2072,11 @@  static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 	if (!memslot || !memslot->dirty_bitmap)
 		return -ENOENT;
 
-	dirty_bitmap = memslot->dirty_bitmap;
-
 	kvm_arch_sync_dirty_log(kvm, memslot);
 
 	n = kvm_dirty_bitmap_bytes(memslot);
 	flush = false;
+
 	if (kvm->manual_dirty_log_protect) {
 		/*
 		 * Unlike kvm_get_dirty_log, we always return false in *flush,
@@ -2085,35 +2086,49 @@  static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 		 * transition to kvm_get_dirty_log_protect and kvm_get_dirty_log
 		 * can be eliminated.
 		 */
-		dirty_bitmap_buffer = dirty_bitmap;
-	} else {
-		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
-		memset(dirty_bitmap_buffer, 0, n);
+		if (copy_to_user(log->dirty_bitmap, memslot->dirty_bitmap, n))
+			return -EFAULT;
 
-		KVM_MMU_LOCK(kvm);
-		for (i = 0; i < n / sizeof(long); i++) {
-			unsigned long mask;
-			gfn_t offset;
+		return 0;
+	}
 
-			if (!dirty_bitmap[i])
-				continue;
+	/*
+	 * Switches between primary and secondary dirty_bitmap.
+	 * After unlocking, all new dirty bits are written to an empty bitmap.
+	 */
+	spin_lock(&memslot->bm_switch_lock);
+	if (memslot->use_secondary)
+		dirty_bitmap = kvm_second_dirty_bitmap(memslot);
+	else
+		dirty_bitmap = memslot->dirty_bitmap;
 
-			flush = true;
-			mask = xchg(&dirty_bitmap[i], 0);
-			dirty_bitmap_buffer[i] = mask;
+	memslot->use_secondary = !memslot->use_secondary;
+	spin_unlock(&memslot->bm_switch_lock);
 
-			offset = i * BITS_PER_LONG;
-			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
-								offset, mask);
-		}
-		KVM_MMU_UNLOCK(kvm);
+	if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
+		return -EFAULT;
+
+	KVM_MMU_LOCK(kvm);
+	for (i = 0; i < n / sizeof(long); i++) {
+		unsigned long mask;
+		gfn_t offset;
+
+		if (!dirty_bitmap[i])
+			continue;
+
+		flush = true;
+		mask = dirty_bitmap[i];
+		dirty_bitmap[i] = 0;
+
+		offset = i * BITS_PER_LONG;
+		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
+							offset, mask);
 	}
+	KVM_MMU_UNLOCK(kvm);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 
-	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
-		return -EFAULT;
 	return 0;
 }
 
@@ -3203,11 +3218,19 @@  void mark_page_dirty_in_slot(struct kvm *kvm,
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size) {
 			kvm_dirty_ring_push(&vcpu->dirty_ring,
 					    slot, rel_gfn);
-		else
-			set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		} else {
+			spin_lock(&memslot->bm_switch_lock);
+			if (memslot->use_secondary)
+				set_bit_le(rel_gfn,
+					   kvm_second_dirty_bitmap(memslot));
+			else
+				set_bit_le(rel_gfn, memslot->dirty_bitmap);
+
+			spin_unlock(&memslot->bm_switch_lock);
+		}
 	}
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);