
[v2,4/4] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock

Message ID 20240829191135.2041489-5-vipinsh@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Run NX huge page recovery under MMU read lock

Commit Message

Vipin Sharma Aug. 29, 2024, 7:11 p.m. UTC
Use the MMU read lock to recover TDP MMU NX huge pages. Iterate over the
huge pages list under tdp_mmu_pages_lock protection and unaccount each
page before dropping the lock.

Convert kvm_tdp_mmu_zap_sp() into tdp_mmu_zap_possible_nx_huge_page() as
there are no other users of it. Skip the zap if any of the following
conditions is true:
- It is a root page.
- The parent SPTE is:
  - Pointing to a different page table.
  - Pointing to a huge page.
  - Not present.

Warn if zapping the SPTE fails while the current SPTE still points to the
same page table; this should never happen.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable small
page. Unaccounting sooner during the list traversal increases the window
of that race. Functionally, it is okay because accounting doesn't protect
against the iTLB multi-hit bug; it is there purely to prevent KVM from
bouncing a gfn between two page sizes. The only downside is that a vCPU
will end up doing more work tearing down all the child SPTEs. That race
should be very rare.

Zapping under the MMU read lock unblocks vCPUs that are waiting for the
MMU read lock. This optimization solves a guest jitter issue on a Windows
VM that was observing an increase in network latency. The test workload
sets up two Windows VMs and uses the latte.exe[1] binary to run a network
latency benchmark. Running NX huge page recovery under the MMU write lock
caused latency to increase by up to 30 ms because vCPUs were stuck waiting
for the MMU lock.

Running the benchmark with NX huge page recovery under the MMU read lock
removed the jitter completely and also reduced the MMU lock wait time for
vCPUs.

Command used for testing:

Server:
latte.exe -udp -a 192.168.100.1:9000 -i 10000000

Client:
latte.exe -c -udp -a 192.168.100.1:9000 -i 10000000 -hist -hl 1000 -hc 30

Output from the latency tool on client:

Before
------

Protocol      UDP
SendMethod    Blocking
ReceiveMethod Blocking
SO_SNDBUF     Default
SO_RCVBUF     Default
MsgSize(byte) 4
Iterations    10000000
Latency(usec) 69.98
CPU(%)        2.8
CtxSwitch/sec 32783     (2.29/iteration)
SysCall/sec   99948     (6.99/iteration)
Interrupt/sec 55164     (3.86/iteration)

Interval(usec)   Frequency
      0          9999967
   1000          14
   2000          0
   3000          5
   4000          1
   5000          0
   6000          0
   7000          0
   8000          0
   9000          0
  10000          0
  11000          0
  12000          2
  13000          2
  14000          4
  15000          2
  16000          2
  17000          0
  18000          1

After
-----

Protocol      UDP
SendMethod    Blocking
ReceiveMethod Blocking
SO_SNDBUF     Default
SO_RCVBUF     Default
MsgSize(byte) 4
Iterations    10000000
Latency(usec) 67.66
CPU(%)        1.6
CtxSwitch/sec 32869     (2.22/iteration)
SysCall/sec   69366     (4.69/iteration)
Interrupt/sec 50693     (3.43/iteration)

Interval(usec)   Frequency
      0          9999972
   1000          27
   2000          1

[1] https://github.com/microsoft/latte

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 11 ++++--
 arch/x86/kvm/mmu/tdp_mmu.c | 76 +++++++++++++++++++++++++++++---------
 arch/x86/kvm/mmu/tdp_mmu.h |  1 -
 3 files changed, 67 insertions(+), 21 deletions(-)

Comments

Sean Christopherson Aug. 29, 2024, 8:06 p.m. UTC | #1
On Thu, Aug 29, 2024, Vipin Sharma wrote:
> @@ -1806,7 +1832,7 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
>  	struct kvm_mmu_page *sp;
>  	bool flush = false;
>  
> -	lockdep_assert_held_write(&kvm->mmu_lock);
> +	lockdep_assert_held_read(&kvm->mmu_lock);
>  	/*
>  	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
>  	 * be done under RCU protection, because the pages are freed via RCU
> @@ -1815,8 +1841,11 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
>  	rcu_read_lock();
>  
>  	for ( ; to_zap; --to_zap) {
> -		if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages))
> +		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +		if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) {
> +			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  			break;
> +		}
>  
>  		/*
>  		 * We use a separate list instead of just using active_mmu_pages
> @@ -1832,16 +1861,29 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
>  		WARN_ON_ONCE(!sp->role.direct);
>  
>  		/*
> -		 * Unaccount and do not attempt to recover any NX Huge Pages
> -		 * that are being dirty tracked, as they would just be faulted
> -		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
> +		 * Unaccount the shadow page before zapping its SPTE so as to
> +		 * avoid bouncing tdp_mmu_pages_lock more than is necessary.
> +		 * Clearing nx_huge_page_disallowed before zapping is safe, as
> +		 * the flag doesn't protect against iTLB multi-hit, it's there
> +		 * purely to prevent bouncing the gfn between an NX huge page
> +		 * and an X small page. A vCPU could get stuck tearing down
> +		 * the shadow page, e.g. if it happens to fault on the region
> +		 * before the SPTE is zapped, replaces the shadow page with an
> +		 * NX huge page, and ends up tearing down all the child SPTEs,
> +		 * but that is a rare race, i.e. shouldn't impact performance.
> +		 */
> +		unaccount_nx_huge_page(kvm, sp);
> +		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +
> +		/*
> +		 * Don't bother zapping shadow pages if the memslot is being
> +		 * dirty logged, as the relevant pages would just be faulted back
> +		 * in as 4KiB pages. Potential NX Huge Pages in this slot will be
>  		 * recovered, along with all the other huge pages in the slot,
>  		 * when dirty logging is disabled.
>  		 */
> -		if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> -			unaccount_nx_huge_page(kvm, sp);
> -		else
> -			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
> +		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> +			flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
>  		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
>  
>  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {

Tsk, tsk.  There's a cond_resched_rwlock_write() lurking here.

Which is a decent argument/segue for my main comment: I would very, very strongly
prefer to have a single flow for the control logic.  Almost all of the code is
copy+pasted to the TDP MMU, and there are unnecessary and confusing differences between
the two flows.  E.g. the TDP MMU does unaccount_nx_huge_page() preemptively, while
the shadow MMU does not.

The TDP MMU has a few extra "locks" (rcu_read_lock() + tdp_mmu_pages_lock), but
I don't see any harm in taking those in the shadow MMU flow.  KVM holds a spinlock
and so RCU practically is meaningless, and tdp_mmu_pages_lock will never be
contended since it's only ever taken under mmu_lock.

If we _really_ wanted to bury tdp_mmu_pages_lock in the TDP MMU, it could be
passed in as an optional parameter.  I'm not super opposed to that, it just looks
ugly, and I'm not convinced it's worth the effort.  Maybe a middle ground would
be to provide a helper so that tdp_mmu_pages_lock can stay 64-bit only, but this
code doesn't need #ifdefs?

Anyways, if the helper is __always_inline, there shouldn't be an indirect call
to recover_page().  Completely untested, but this is the basic gist.

---
typedef bool (*nx_recover_page_t)(struct kvm *kvm, struct kvm_mmu_page *sp,
				  struct list_head *invalid_pages);

static __always_inline void kvm_recover_nx_huge_pages(struct kvm *kvm,
						      bool shared,
						      spinlock_t *list_lock,
						      struct list_head *pages,
						      unsigned long nr_pages,
						      nx_recover_page_t recover_page)
{
	unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
	unsigned long to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int rcu_idx;
	bool flush = false;

	kvm_lockdep_assert_mmu_lock_held(kvm, shared);

	rcu_idx = srcu_read_lock(&kvm->srcu);
	rcu_read_lock();

	for ( ; to_zap; --to_zap) {
		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		if (list_empty(pages)) {
			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
			break;
		}

		sp = list_first_entry(pages, struct kvm_mmu_page,
				      possible_nx_huge_page_link);
		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
		WARN_ON_ONCE(!sp->role.direct);

		unaccount_nx_huge_page(kvm, sp);
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
			flush |= recover_page(kvm, sp, &invalid_list);

		WARN_ON_ONCE(sp->nx_huge_page_disallowed);

		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

			rcu_read_unlock();

			if (shared)
				cond_resched_rwlock_read(&kvm->mmu_lock);
			else
				cond_resched_rwlock_write(&kvm->mmu_lock);

			rcu_read_lock();
		}
	}

	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

	rcu_read_unlock();
	srcu_read_unlock(&kvm->srcu, rcu_idx);
}
Vipin Sharma Sept. 3, 2024, 8:23 p.m. UTC | #2
On 2024-08-29 13:06:55, Sean Christopherson wrote:
> On Thu, Aug 29, 2024, Vipin Sharma wrote:
> >  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> 
> Tsk, tsk.  There's a cond_resched_rwlock_write() lurking here.

I'm gonna order a dunce cap.

> 
> Which is a decent argument/segue for my main comment: I would very, very strongly
> prefer to have a single flow for the control logic.  Almost all of the code is
> copy+pasted to the TDP MMU, and there are unnecessary and confusing differences between
> the two flows.  E.g. the TDP MMU does unaccount_nx_huge_page() preemptively, while
> the shadow MMU does not.
> 
> The TDP MMU has a few extra "locks" (rcu_read_lock() + tdp_mmu_pages_lock), but
> I don't see any harm in taking those in the shadow MMU flow.  KVM holds a spinlock
> and so RCU practically is meaningless, and tdp_mmu_pages_lock will never be
> contended since it's only ever taken under mmu_lock.
> 
> If we _really_ wanted to bury tdp_mmu_pages_lock in the TDP MMU, it could be
> passed in as an optional parameter.  I'm not super opposed to that, it just looks
> ugly, and I'm not convinced it's worth the effort.  Maybe a middle ground would
> be to provide a helper so that tdp_mmu_pages_lock can stay 64-bit only, but this
> code doesn't need #ifdefs?

How about declaring it in tdp_mmu.h and providing different definitions,
similar to is_tdp_mmu_page(), i.e. a no-op for 32-bit builds and the
actual lock usage for 64-bit builds?
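
Rough sketch of what I have in mind (helper names are made up), mirroring
the is_tdp_mmu_page() pattern so the common flow stays #ifdef-free:

#ifdef CONFIG_X86_64
static inline void tdp_mmu_pages_list_lock(struct kvm *kvm)
{
	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
}

static inline void tdp_mmu_pages_list_unlock(struct kvm *kvm)
{
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
#else
/* 32-bit builds have no tdp_mmu_pages_lock, so these compile to nothing. */
static inline void tdp_mmu_pages_list_lock(struct kvm *kvm) { }
static inline void tdp_mmu_pages_list_unlock(struct kvm *kvm) { }
#endif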

> 
> Anyways, if the helper is __always_inline, there shouldn't be an indirect call
> to recover_page().  Completely untested, but this is the basic gist.
> 

Another way I can do this is to reuse the "shared" bool to decide which
recovery function to call. Something like:

static __always_inline void kvm_recover_nx_huge_pages(...)
{
...
    if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
	if (shared)
		flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
	else
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

}

tdp_mmu_zap_possible_nx_huge_page() can WARN_ONCE() and return if a
shadow MMU page is passed to it. One downside is that the "shared" bool
and the "pages" list now depend on each other.

This way we don't need to rely on __always_inline to do the right thing,
and we avoid a function pointer call.

I think the usage of the bool is easier to understand than why we need
__always_inline. What do you think?
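
For completeness, the guard I'm thinking of at the top of
tdp_mmu_zap_possible_nx_huge_page() would be something like this
(untested sketch):

	/* Only TDP MMU pages should ever reach the read-lock zap path. */
	if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
		return false;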


> ---
> typedef bool (*nx_recover_page_t)(struct kvm *kvm, struct kvm_mmu_page *sp,
> 				  struct list_head *invalid_pages);
> 
> static __always_inline void kvm_recover_nx_huge_pages(struct kvm *kvm,
> 						      bool shared,
> 						      spinlock_t *list_lock,
> 						      struct list_head *pages,
> 						      unsigned long nr_pages,
> 						      nx_recover_page_t recover_page)

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d636850c6929..cda6b07d4cda 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7436,14 +7436,19 @@  static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
 			return 0;
 
 		rcu_idx = srcu_read_lock(&kvm->srcu);
-		write_lock(&kvm->mmu_lock);
 
-		kvm_mmu_recover_nx_huge_pages(kvm);
+		if (kvm_memslots_have_rmaps(kvm)) {
+			write_lock(&kvm->mmu_lock);
+			kvm_mmu_recover_nx_huge_pages(kvm);
+			write_unlock(&kvm->mmu_lock);
+		}
+
 		if (tdp_mmu_enabled) {
+			read_lock(&kvm->mmu_lock);
 			kvm_tdp_mmu_recover_nx_huge_pages(kvm);
+			read_unlock(&kvm->mmu_lock);
 		}
 
-		write_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, rcu_idx);
 	}
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 179cfd67609a..95aa829b856f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -818,23 +818,49 @@  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_unlock();
 }
 
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+static bool tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+					      struct kvm_mmu_page *sp)
 {
-	u64 old_spte;
+	struct tdp_iter iter = {
+		.old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
+		.sptep = sp->ptep,
+		.level = sp->role.level + 1,
+		.gfn = sp->gfn,
+		.as_id = kvm_mmu_page_as_id(sp),
+	};
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	/*
-	 * This helper intentionally doesn't allow zapping a root shadow page,
-	 * which doesn't have a parent page table and thus no associated entry.
+	 * Root shadow pages don't have a parent page table and thus no
+	 * associated entry, but they can never be possible NX huge pages.
 	 */
 	if (WARN_ON_ONCE(!sp->ptep))
 		return false;
 
-	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
-	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
+	/*
+	 * Since mmu_lock is held in read mode, it's possible another task has
+	 * already modified the SPTE. Zap the SPTE if and only if the SPTE
+	 * points at the SP's page table, as checking shadow-present isn't
+	 * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
+	 * another SP. Note, spte_to_child_pt() also checks that the SPTE is
+	 * shadow-present, i.e. guards against zapping a frozen SPTE.
+	 */
+	if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
 		return false;
 
-	tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
-			 SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
+	/*
+	 * If a different task modified the SPTE, then it should be impossible
+	 * for the SPTE to still be used for the to-be-zapped SP. Non-leaf
+	 * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
+	 * creating non-leaf SPTEs, and all other bits are immutable for non-
+	 * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
+	 * zapping and replacement.
+	 */
+	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
+		WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
+		return false;
+	}
 
 	return true;
 }
@@ -1806,7 +1832,7 @@  void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
 	struct kvm_mmu_page *sp;
 	bool flush = false;
 
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	lockdep_assert_held_read(&kvm->mmu_lock);
 	/*
 	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
 	 * be done under RCU protection, because the pages are freed via RCU
@@ -1815,8 +1841,11 @@  void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
 	rcu_read_lock();
 
 	for ( ; to_zap; --to_zap) {
-		if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages))
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+		if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) {
+			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 			break;
+		}
 
 		/*
 		 * We use a separate list instead of just using active_mmu_pages
@@ -1832,16 +1861,29 @@  void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
 		WARN_ON_ONCE(!sp->role.direct);
 
 		/*
-		 * Unaccount and do not attempt to recover any NX Huge Pages
-		 * that are being dirty tracked, as they would just be faulted
-		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
+		 * Unaccount the shadow page before zapping its SPTE so as to
+		 * avoid bouncing tdp_mmu_pages_lock more than is necessary.
+		 * Clearing nx_huge_page_disallowed before zapping is safe, as
+		 * the flag doesn't protect against iTLB multi-hit, it's there
+		 * purely to prevent bouncing the gfn between an NX huge page
+		 * and an X small page. A vCPU could get stuck tearing down
+		 * the shadow page, e.g. if it happens to fault on the region
+		 * before the SPTE is zapped, replaces the shadow page with an
+		 * NX huge page, and ends up tearing down all the child SPTEs,
+		 * but that is a rare race, i.e. shouldn't impact performance.
+		 */
+		unaccount_nx_huge_page(kvm, sp);
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+
+		/*
+		 * Don't bother zapping shadow pages if the memslot is being
+		 * dirty logged, as the relevant pages would just be faulted back
+		 * in as 4KiB pages. Potential NX Huge Pages in this slot will be
 		 * recovered, along with all the other huge pages in the slot,
 		 * when dirty logging is disabled.
 		 */
-		if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
-			unaccount_nx_huge_page(kvm, sp);
-		else
-			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
+		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
+			flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 86c1065a672d..57683b5dca9d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,7 +20,6 @@  __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);