From patchwork Thu Aug 29 19:11:32 2024
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 13783608
AGHT+IFtQm6SOImi/GBCeUp1CgED7o6t8UN6ormVIiNZKFA5OfBLYTJZt5tyRGWIdzLcadhZhe92KAw78s6c X-Received: from vipin.c.googlers.com ([34.105.13.176]) (user=vipinsh job=sendgmr) by 2002:a17:902:e54c:b0:1f8:44f4:efd9 with SMTP id d9443c01a7336-2050c22c5bamr1788125ad.2.1724958701130; Thu, 29 Aug 2024 12:11:41 -0700 (PDT) Date: Thu, 29 Aug 2024 12:11:32 -0700 In-Reply-To: <20240829191135.2041489-1-vipinsh@google.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240829191135.2041489-1-vipinsh@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240829191135.2041489-2-vipinsh@google.com> Subject: [PATCH v2 1/4] KVM: x86/mmu: Track TDP MMU NX huge pages separately From: Vipin Sharma To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma Create separate list for storing TDP MMU NX huge pages and provide counter for it. Use this list in NX huge page recovery worker along with the existing NX huge pages list. Use old NX huge pages list for storing only non-TDP MMU pages and provide separate counter for it. Separate list will allow to optimize TDP MMU NX huge page recovery in future patches by using MMU read lock. Suggested-by: Sean Christopherson Suggested-by: David Matlack Signed-off-by: Vipin Sharma --- arch/x86/include/asm/kvm_host.h | 13 ++++++- arch/x86/kvm/mmu/mmu.c | 62 +++++++++++++++++++++++++-------- arch/x86/kvm/mmu/mmu_internal.h | 1 + arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++ arch/x86/kvm/mmu/tdp_mmu.h | 2 ++ 5 files changed, 72 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 950a03e0181e..e6e7026bb8e4 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1318,8 +1318,12 @@ struct kvm_arch { * guarantee an NX huge page will be created in its stead, e.g. if the * guest attempts to execute from the region then KVM obviously can't * create an NX huge page (without hanging the guest). + * + * This list only contains shadow and legacy MMU pages. TDP MMU pages + * are stored separately in tdp_mmu_possible_nx_huge_pages. */ struct list_head possible_nx_huge_pages; + u64 possible_nx_huge_pages_count; #ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING struct kvm_page_track_notifier_head track_notifier_head; #endif @@ -1474,7 +1478,7 @@ struct kvm_arch { * is held in read mode: * - tdp_mmu_roots (above) * - the link field of kvm_mmu_page structs used by the TDP MMU - * - possible_nx_huge_pages; + * - tdp_mmu_possible_nx_huge_pages; * - the possible_nx_huge_page_link field of kvm_mmu_page structs used * by the TDP MMU * Because the lock is only taken within the MMU lock, strictly @@ -1483,6 +1487,13 @@ struct kvm_arch { * the code to do so. */ spinlock_t tdp_mmu_pages_lock; + + /* + * Similar to possible_nx_huge_pages list but this one stores only TDP + * MMU pages. + */ + struct list_head tdp_mmu_possible_nx_huge_pages; + u64 tdp_mmu_possible_nx_huge_pages_count; #endif /* CONFIG_X86_64 */ /* diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 901be9e420a4..0bda372b13a5 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -65,9 +65,9 @@ int __read_mostly nx_huge_pages = -1; static uint __read_mostly nx_huge_pages_recovery_period_ms; #ifdef CONFIG_PREEMPT_RT /* Recovery can cause latency spikes, disable it for PREEMPT_RT. 
*/ -static uint __read_mostly nx_huge_pages_recovery_ratio = 0; +unsigned int __read_mostly nx_huge_pages_recovery_ratio; #else -static uint __read_mostly nx_huge_pages_recovery_ratio = 60; +unsigned int __read_mostly nx_huge_pages_recovery_ratio = 60; #endif static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp); @@ -871,8 +871,17 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) return; ++kvm->stat.nx_lpage_splits; - list_add_tail(&sp->possible_nx_huge_page_link, - &kvm->arch.possible_nx_huge_pages); + if (is_tdp_mmu_page(sp)) { +#ifdef CONFIG_X86_64 + ++kvm->arch.tdp_mmu_possible_nx_huge_pages_count; + list_add_tail(&sp->possible_nx_huge_page_link, + &kvm->arch.tdp_mmu_possible_nx_huge_pages); +#endif + } else { + ++kvm->arch.possible_nx_huge_pages_count; + list_add_tail(&sp->possible_nx_huge_page_link, + &kvm->arch.possible_nx_huge_pages); + } } static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp, @@ -906,6 +915,13 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) return; --kvm->stat.nx_lpage_splits; + if (is_tdp_mmu_page(sp)) { +#ifdef CONFIG_X86_64 + --kvm->arch.tdp_mmu_possible_nx_huge_pages_count; +#endif + } else { + --kvm->arch.possible_nx_huge_pages_count; + } list_del_init(&sp->possible_nx_huge_page_link); } @@ -7311,16 +7327,15 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel return err; } -static void kvm_recover_nx_huge_pages(struct kvm *kvm) +static void kvm_recover_nx_huge_pages(struct kvm *kvm, + struct list_head *nx_huge_pages, + unsigned long to_zap) { - unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits; struct kvm_memory_slot *slot; int rcu_idx; struct kvm_mmu_page *sp; - unsigned int ratio; LIST_HEAD(invalid_list); bool flush = false; - ulong to_zap; rcu_idx = srcu_read_lock(&kvm->srcu); write_lock(&kvm->mmu_lock); @@ -7332,10 +7347,8 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm) */ rcu_read_lock(); - ratio = READ_ONCE(nx_huge_pages_recovery_ratio); - to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0; for ( ; to_zap; --to_zap) { - if (list_empty(&kvm->arch.possible_nx_huge_pages)) + if (list_empty(nx_huge_pages)) break; /* @@ -7345,7 +7358,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm) * the total number of shadow pages. And because the TDP MMU * doesn't use active_mmu_pages. */ - sp = list_first_entry(&kvm->arch.possible_nx_huge_pages, + sp = list_first_entry(nx_huge_pages, struct kvm_mmu_page, possible_nx_huge_page_link); WARN_ON_ONCE(!sp->nx_huge_page_disallowed); @@ -7417,10 +7430,19 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time) : MAX_SCHEDULE_TIMEOUT; } +static unsigned long nx_huge_pages_to_zap(struct kvm *kvm) +{ + unsigned long pages = READ_ONCE(kvm->arch.possible_nx_huge_pages_count); + unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio); + + return ratio ? 
DIV_ROUND_UP(pages, ratio) : 0; +} + static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) { - u64 start_time; + unsigned long to_zap; long remaining_time; + u64 start_time; while (true) { start_time = get_jiffies_64(); @@ -7438,7 +7460,19 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) if (kthread_should_stop()) return 0; - kvm_recover_nx_huge_pages(kvm); + to_zap = nx_huge_pages_to_zap(kvm); + kvm_recover_nx_huge_pages(kvm, + &kvm->arch.possible_nx_huge_pages, + to_zap); + + if (tdp_mmu_enabled) { +#ifdef CONFIG_X86_64 + to_zap = kvm_tdp_mmu_nx_huge_pages_to_zap(kvm); + kvm_recover_nx_huge_pages(kvm, + &kvm->arch.tdp_mmu_possible_nx_huge_pages, + to_zap); +#endif + } } } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 1721d97743e9..8deed808592b 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -354,4 +354,5 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +extern unsigned int nx_huge_pages_recovery_ratio; #endif /* __KVM_X86_MMU_INTERNAL_H */ diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index c7dc49ee7388..6415c2c7e936 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -15,6 +15,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm) { INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots); + INIT_LIST_HEAD(&kvm->arch.tdp_mmu_possible_nx_huge_pages); spin_lock_init(&kvm->arch.tdp_mmu_pages_lock); } @@ -1796,3 +1797,11 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn, */ return rcu_dereference(sptep); } + +unsigned long kvm_tdp_mmu_nx_huge_pages_to_zap(struct kvm *kvm) +{ + unsigned long pages = READ_ONCE(kvm->arch.tdp_mmu_possible_nx_huge_pages_count); + unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio); + + return ratio ? 
DIV_ROUND_UP(pages, ratio) : 0; +} diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 1b74e058a81c..95290fd6154e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -67,6 +67,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *spte); +unsigned long kvm_tdp_mmu_nx_huge_pages_to_zap(struct kvm *kvm); + #ifdef CONFIG_X86_64 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; } #else From patchwork Thu Aug 29 19:11:33 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vipin Sharma X-Patchwork-Id: 13783609 Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7C7661B9B28 for ; Thu, 29 Aug 2024 19:11:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724958706; cv=none; b=AVnin68Mo74AuOw38EMZfzO5FrSVyUqaceZCFfGFHw1WIhO/qvokuEIf7ms5dXhk0gRgnTIHU+h3jH3wwFOEcbfk4PGeswAkoI8kn15ZWNGu9yyIB2ukhG8dQIqwSI9RQSd6G+eI7U504BMY04tRwdFCeCtxDGplfIZC11U2CvE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724958706; c=relaxed/simple; bh=hXqTCy+IwWuvE13nTODm+KR/gKFsVDZ85WkOLUllTk8=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=GLAjv6YOKqkbE127Z8R0QFrUH6SkatI9ny89PILuMx7qHwur8aPXFxxIsu5xBvjbwwTc/M5I23zE8XOlQbIc6KWGdNYc7IwH9rt5Fu4gFr2AWtzHggUoPal5n/tRcD6L1FaPpy+k5dvZLVXC4GXccpQHKkndIcP+4Y3KY4taooA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=Bp8j8auW; arc=none smtp.client-ip=209.85.215.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="Bp8j8auW" Received: by mail-pg1-f201.google.com with SMTP id 41be03b00d2f7-7c6b192a39bso825223a12.2 for ; Thu, 29 Aug 2024 12:11:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1724958704; x=1725563504; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Wl1T4+l//WebT6Vv9ktUpu/Y9AZqw4mpHt+k/DcseZg=; b=Bp8j8auWIwnU+8aqJhOPFfGL60tjfex38R1CGKGgNJSGgLsT03dEm7bVyEbcZe8Axk x//AjOQ5uGVhXnM623qtGbscdHZCAWpGFSpJnn+2vZUpBu2o6XRdwqfAK45Hpq8ZNnhP w4Y8ErcWwYVQDByxomOecL7sfmFUVMNq5Dt/CwS9gLaeiQxLX9DLHF7CZwJjy4ybWBs2 IlUNQgq0dHqAH4YVVejUQL5+R7zLxUzX1nd635OyDWkDtZmM1aXhnPeQER86LeMDfFwJ g4Bny0I1WFkGR0kWKBGgn250CEFB1GDKvYCPM4e1yiIdMNssNB2P833WtRCeQTFWM6El DtBA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724958704; x=1725563504; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; 
bh=Wl1T4+l//WebT6Vv9ktUpu/Y9AZqw4mpHt+k/DcseZg=; b=mGgue56eHG89cZHpbvrKQHG57sDJq79TC0d352WTkKSjA8wtiXTxsIpmYh0d9zkvxu tg8RLLMZv6dnNfFWNbDYMDHmn6h/r4aL00c8HVas3X2EumAJ/eSZdPDg9jB8rZlccG/b QI8NoF1k7d+yjb143qeCX9DXeBoEagX8dmdkYPfGja1ktRVYtZ3mXHnHxW0HC45Tf4Yw C0aASCsSYQ17oGoSvRCUJ9rjAxJKKIpWGEQ9YmSA36O9TCTMZnvo6D8A7o6nAE7Hes1v L2ZTO1r5iEXZqwk7wd12wOAdEEvUfPamglJ3/+HqT5YSKdg2vyqu55NHejmYBnsbkLOz R5Bg== X-Gm-Message-State: AOJu0YzeAHAWV2hxX4NCDns9wL8JJ3397Ctv20mSirKEJ1Rcu2GXAL2g BkT1yLrwzn9fboFJkSLDMkvBtp2Jj7wUKctiLEbTZdQI1bh+h47pqJfpqYSruoBGoJm8z7SmUyS J9QBvJQ== X-Google-Smtp-Source: AGHT+IGjtXxEEKFKY7G6VoFP2r45mwUTJ9UygqPJnCeYLndEwxjGtfi3gqP+J6ppU01pSc1JIiLZJ71zDgXW X-Received: from vipin.c.googlers.com ([34.105.13.176]) (user=vipinsh job=sendgmr) by 2002:a65:63c9:0:b0:7a1:db97:d6b2 with SMTP id 41be03b00d2f7-7d22c4767d5mr8230a12.1.1724958703643; Thu, 29 Aug 2024 12:11:43 -0700 (PDT) Date: Thu, 29 Aug 2024 12:11:33 -0700 In-Reply-To: <20240829191135.2041489-1-vipinsh@google.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240829191135.2041489-1-vipinsh@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240829191135.2041489-3-vipinsh@google.com> Subject: [PATCH v2 2/4] KVM: x86/mmu: Extract out TDP MMU NX huge page recovery code From: Vipin Sharma To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma Create separate function for TDP MMU NX huge page recovery. In the new TDP MMU function remove code related to "prepare and commit" zap pages of legacy MMU as there will be no legacy MMU pages. Similarly, remove TDP MMU zap related logic from legacy MMU NX huge page recovery code. Extract out dirty logging check as it is common to both. Rename kvm_recover_nx_huge_pages() to kvm_mmu_recover_nx_huge_pages(). Separate code allows to change TDP MMU NX huge page recovery independently of legacy MMU. Signed-off-by: Vipin Sharma --- arch/x86/kvm/mmu/mmu.c | 93 ++++++++++++++------------------- arch/x86/kvm/mmu/mmu_internal.h | 2 + arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++ arch/x86/kvm/mmu/tdp_mmu.h | 3 ++ 4 files changed, 113 insertions(+), 53 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0bda372b13a5..c8c64df979e3 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -925,7 +925,7 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) list_del_init(&sp->possible_nx_huge_page_link); } -static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) { sp->nx_huge_page_disallowed = false; @@ -7327,26 +7327,44 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel return err; } -static void kvm_recover_nx_huge_pages(struct kvm *kvm, - struct list_head *nx_huge_pages, - unsigned long to_zap) +bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + struct kvm_memory_slot *slot = NULL; + + /* + * Since gfn_to_memslot() is relatively expensive, it helps to skip it if + * it the test cannot possibly return true. On the other hand, if any + * memslot has logging enabled, chances are good that all of them do, in + * which case unaccount_nx_huge_page() is much cheaper than zapping the + * page. 
+ * + * If a memslot update is in progress, reading an incorrect value of + * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming + * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming + * nonzero, the page will be zapped unnecessarily. Either way, this only + * affects efficiency in racy situations, and not correctness. + */ + if (atomic_read(&kvm->nr_memslots_dirty_logging)) { + struct kvm_memslots *slots; + + slots = kvm_memslots_for_spte_role(kvm, sp->role); + slot = __gfn_to_memslot(slots, sp->gfn); + WARN_ON_ONCE(!slot); + } + return slot && kvm_slot_dirty_track_enabled(slot); +} + +static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm, + struct list_head *nx_huge_pages, + unsigned long to_zap) { - struct kvm_memory_slot *slot; int rcu_idx; struct kvm_mmu_page *sp; LIST_HEAD(invalid_list); - bool flush = false; rcu_idx = srcu_read_lock(&kvm->srcu); write_lock(&kvm->mmu_lock); - /* - * Zapping TDP MMU shadow pages, including the remote TLB flush, must - * be done under RCU protection, because the pages are freed via RCU - * callback. - */ - rcu_read_lock(); - for ( ; to_zap; --to_zap) { if (list_empty(nx_huge_pages)) break; @@ -7370,50 +7388,19 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm, * back in as 4KiB pages. The NX Huge Pages in this slot will be * recovered, along with all the other huge pages in the slot, * when dirty logging is disabled. - * - * Since gfn_to_memslot() is relatively expensive, it helps to - * skip it if it the test cannot possibly return true. On the - * other hand, if any memslot has logging enabled, chances are - * good that all of them do, in which case unaccount_nx_huge_page() - * is much cheaper than zapping the page. - * - * If a memslot update is in progress, reading an incorrect value - * of kvm->nr_memslots_dirty_logging is not a problem: if it is - * becoming zero, gfn_to_memslot() will be done unnecessarily; if - * it is becoming nonzero, the page will be zapped unnecessarily. - * Either way, this only affects efficiency in racy situations, - * and not correctness. 
*/ - slot = NULL; - if (atomic_read(&kvm->nr_memslots_dirty_logging)) { - struct kvm_memslots *slots; - - slots = kvm_memslots_for_spte_role(kvm, sp->role); - slot = __gfn_to_memslot(slots, sp->gfn); - WARN_ON_ONCE(!slot); - } - - if (slot && kvm_slot_dirty_track_enabled(slot)) + if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) unaccount_nx_huge_page(kvm, sp); - else if (is_tdp_mmu_page(sp)) - flush |= kvm_tdp_mmu_zap_sp(kvm, sp); else kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); WARN_ON_ONCE(sp->nx_huge_page_disallowed); if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { - kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); - rcu_read_unlock(); - + kvm_mmu_commit_zap_page(kvm, &invalid_list); cond_resched_rwlock_write(&kvm->mmu_lock); - flush = false; - - rcu_read_lock(); } } - kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); - - rcu_read_unlock(); + kvm_mmu_commit_zap_page(kvm, &invalid_list); write_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, rcu_idx); @@ -7461,16 +7448,16 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) return 0; to_zap = nx_huge_pages_to_zap(kvm); - kvm_recover_nx_huge_pages(kvm, - &kvm->arch.possible_nx_huge_pages, - to_zap); + kvm_mmu_recover_nx_huge_pages(kvm, + &kvm->arch.possible_nx_huge_pages, + to_zap); if (tdp_mmu_enabled) { #ifdef CONFIG_X86_64 to_zap = kvm_tdp_mmu_nx_huge_pages_to_zap(kvm); - kvm_recover_nx_huge_pages(kvm, - &kvm->arch.tdp_mmu_possible_nx_huge_pages, - to_zap); + kvm_tdp_mmu_recover_nx_huge_pages(kvm, + &kvm->arch.tdp_mmu_possible_nx_huge_pages, + to_zap); #endif } } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 8deed808592b..83b165077d97 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -353,6 +353,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm, struct kvm_mmu_page *sp); extern unsigned int nx_huge_pages_recovery_ratio; #endif /* __KVM_X86_MMU_INTERNAL_H */ diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6415c2c7e936..f0b4341264fd 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1805,3 +1805,71 @@ unsigned long kvm_tdp_mmu_nx_huge_pages_to_zap(struct kvm *kvm) return ratio ? DIV_ROUND_UP(pages, ratio) : 0; } + +void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, + struct list_head *nx_huge_pages, + unsigned long to_zap) +{ + int rcu_idx; + struct kvm_mmu_page *sp; + bool flush = false; + + rcu_idx = srcu_read_lock(&kvm->srcu); + write_lock(&kvm->mmu_lock); + + /* + * Zapping TDP MMU shadow pages, including the remote TLB flush, must + * be done under RCU protection, because the pages are freed via RCU + * callback. + */ + rcu_read_lock(); + + for ( ; to_zap; --to_zap) { + if (list_empty(nx_huge_pages)) + break; + + /* + * We use a separate list instead of just using active_mmu_pages + * because the number of shadow pages that be replaced with an + * NX huge page is expected to be relatively small compared to + * the total number of shadow pages. And because the TDP MMU + * doesn't use active_mmu_pages. 
+ */ + sp = list_first_entry(nx_huge_pages, + struct kvm_mmu_page, + possible_nx_huge_page_link); + WARN_ON_ONCE(!sp->nx_huge_page_disallowed); + WARN_ON_ONCE(!sp->role.direct); + + /* + * Unaccount and do not attempt to recover any NX Huge Pages + * that are being dirty tracked, as they would just be faulted + * back in as 4KiB pages. The NX Huge Pages in this slot will be + * recovered, along with all the other huge pages in the slot, + * when dirty logging is disabled. + */ + if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) + unaccount_nx_huge_page(kvm, sp); + else + flush |= kvm_tdp_mmu_zap_sp(kvm, sp); + WARN_ON_ONCE(sp->nx_huge_page_disallowed); + + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { + if (flush) + kvm_flush_remote_tlbs(kvm); + rcu_read_unlock(); + + cond_resched_rwlock_write(&kvm->mmu_lock); + flush = false; + + rcu_read_lock(); + } + } + + if (flush) + kvm_flush_remote_tlbs(kvm); + rcu_read_unlock(); + + write_unlock(&kvm->mmu_lock); + srcu_read_unlock(&kvm->srcu, rcu_idx); +} diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 95290fd6154e..4036552f40cd 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -68,6 +68,9 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *spte); unsigned long kvm_tdp_mmu_nx_huge_pages_to_zap(struct kvm *kvm); +void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, + struct list_head *nx_huge_pages, + unsigned long to_zap); #ifdef CONFIG_X86_64 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; } From patchwork Thu Aug 29 19:11:34 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vipin Sharma X-Patchwork-Id: 13783610 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E6A9A1B9B56 for ; Thu, 29 Aug 2024 19:11:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724958708; cv=none; b=Bu9jSdsLrfh6M5mus5zcY2/wHRujnOjP6prloU4nHMn/VWNyTKXBcF6hxNQuvUUzqOwT62aS78yg7EhDj/9xRDvsm/iX23GF/luDcJ0fpmWSJpwfWjtEkmW9tbwwQPLxagtk2FEoUQE1niXV3+xPJHg9wKJSVnEIvYGj3u9SBSg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724958708; c=relaxed/simple; bh=rrJ9r8ms6kWklefGCuisxnO/txgebqDl2+mcGhArhVk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=XWiUgjNk131kWcTxxL5aQVkZWJwuyS78Rf+Awbu8hoCgRwdlMnPr2nCI8IBWjhxr7+w1fMHR2mV6nSNaVFV2svyqcWIHDgsBU/awDattaodxxbwW1wNciPaBxrLNB8OQnMyX7reqj17KW554UlCKpql0Rr3ngw8hv0MI3+wO8IA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=FbnuBNOP; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="FbnuBNOP" Received: by mail-yw1-f201.google.com with SMTP id 
Date: Thu, 29 Aug 2024 12:11:34 -0700
In-Reply-To: <20240829191135.2041489-1-vipinsh@google.com>
References: <20240829191135.2041489-1-vipinsh@google.com>
Message-ID: <20240829191135.2041489-4-vipinsh@google.com>
Subject: [PATCH v2 3/4] KVM: x86/mmu: Rearrange locks and to_zap count for NX huge page recovery
From: Vipin Sharma
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Extract the lock acquisition out of the TDP and legacy MMU NX huge page
recovery flows and take the locks at a common place inside the recovery
worker. Also, move the to_zap calculation into the respective recovery
functions.

Hoisting the locks allows the TDP flow to acquire and use the MMU lock in
the same way as other TDP APIs, i.e. take the read lock and then call the
TDP API. This will be utilized when TDP MMU NX huge page recovery switches
to using the read lock.

Calculating to_zap outside the recovery code was needed when the same code
served both the TDP and legacy MMU. Now that the two flows are separate,
there is no need to compute it at a common place; let the respective
functions handle it.
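For illustration, here is a minimal, self-contained user-space sketch of the
per-flow to_zap computation that each recovery function now performs locally.
This is not kernel code: DIV_ROUND_UP is reproduced with its usual kernel
semantics, and the page counts and ratio values are made up.

/*
 * Sketch of the "zap 1/ratio of the possible NX huge pages per run"
 * calculation. A ratio of 0 disables recovery entirely.
 */
#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static unsigned long nx_pages_to_zap(unsigned long pages, unsigned int ratio)
{
	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
}

int main(void)
{
	/* Hypothetical counts for the legacy and TDP MMU lists. */
	unsigned long legacy_pages = 1000, tdp_pages = 4096;
	unsigned int ratio = 60;	/* default when !CONFIG_PREEMPT_RT */

	printf("legacy to_zap = %lu\n", nx_pages_to_zap(legacy_pages, ratio));
	printf("tdp    to_zap = %lu\n", nx_pages_to_zap(tdp_pages, ratio));
	printf("disabled      = %lu\n", nx_pages_to_zap(tdp_pages, 0));
	return 0;
}

Because each flow now computes its own to_zap from its own counter, the two
recovery paths can evolve independently, which is what the next patch relies on.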
Signed-off-by: Vipin Sharma --- arch/x86/kvm/mmu/mmu.c | 45 +++++++++++++------------------------- arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++-------------- arch/x86/kvm/mmu/tdp_mmu.h | 5 +---- 3 files changed, 21 insertions(+), 52 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c8c64df979e3..d636850c6929 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7354,19 +7354,18 @@ bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm, struct kvm_mmu_page *sp) return slot && kvm_slot_dirty_track_enabled(slot); } -static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm, - struct list_head *nx_huge_pages, - unsigned long to_zap) +static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm) { - int rcu_idx; + unsigned long pages = READ_ONCE(kvm->arch.possible_nx_huge_pages_count); + unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio); + unsigned long to_zap = ratio ? DIV_ROUND_UP(pages, ratio) : 0; struct kvm_mmu_page *sp; LIST_HEAD(invalid_list); - rcu_idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); + lockdep_assert_held_write(&kvm->mmu_lock); for ( ; to_zap; --to_zap) { - if (list_empty(nx_huge_pages)) + if (list_empty(&kvm->arch.possible_nx_huge_pages)) break; /* @@ -7376,7 +7375,7 @@ static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm, * the total number of shadow pages. And because the TDP MMU * doesn't use active_mmu_pages. */ - sp = list_first_entry(nx_huge_pages, + sp = list_first_entry(&kvm->arch.possible_nx_huge_pages, struct kvm_mmu_page, possible_nx_huge_page_link); WARN_ON_ONCE(!sp->nx_huge_page_disallowed); @@ -7401,9 +7400,6 @@ static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm, } } kvm_mmu_commit_zap_page(kvm, &invalid_list); - - write_unlock(&kvm->mmu_lock); - srcu_read_unlock(&kvm->srcu, rcu_idx); } static long get_nx_huge_page_recovery_timeout(u64 start_time) @@ -7417,19 +7413,11 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time) : MAX_SCHEDULE_TIMEOUT; } -static unsigned long nx_huge_pages_to_zap(struct kvm *kvm) -{ - unsigned long pages = READ_ONCE(kvm->arch.possible_nx_huge_pages_count); - unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio); - - return ratio ? 
DIV_ROUND_UP(pages, ratio) : 0; -} - static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) { - unsigned long to_zap; long remaining_time; u64 start_time; + int rcu_idx; while (true) { start_time = get_jiffies_64(); @@ -7447,19 +7435,16 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) if (kthread_should_stop()) return 0; - to_zap = nx_huge_pages_to_zap(kvm); - kvm_mmu_recover_nx_huge_pages(kvm, - &kvm->arch.possible_nx_huge_pages, - to_zap); + rcu_idx = srcu_read_lock(&kvm->srcu); + write_lock(&kvm->mmu_lock); + kvm_mmu_recover_nx_huge_pages(kvm); if (tdp_mmu_enabled) { -#ifdef CONFIG_X86_64 - to_zap = kvm_tdp_mmu_nx_huge_pages_to_zap(kvm); - kvm_tdp_mmu_recover_nx_huge_pages(kvm, - &kvm->arch.tdp_mmu_possible_nx_huge_pages, - to_zap); -#endif + kvm_tdp_mmu_recover_nx_huge_pages(kvm); } + + write_unlock(&kvm->mmu_lock); + srcu_read_unlock(&kvm->srcu, rcu_idx); } } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index f0b4341264fd..179cfd67609a 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1798,25 +1798,15 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn, return rcu_dereference(sptep); } -unsigned long kvm_tdp_mmu_nx_huge_pages_to_zap(struct kvm *kvm) +void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm) { unsigned long pages = READ_ONCE(kvm->arch.tdp_mmu_possible_nx_huge_pages_count); unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio); - - return ratio ? DIV_ROUND_UP(pages, ratio) : 0; -} - -void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, - struct list_head *nx_huge_pages, - unsigned long to_zap) -{ - int rcu_idx; + unsigned long to_zap = ratio ? DIV_ROUND_UP(pages, ratio) : 0; struct kvm_mmu_page *sp; bool flush = false; - rcu_idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); - + lockdep_assert_held_write(&kvm->mmu_lock); /* * Zapping TDP MMU shadow pages, including the remote TLB flush, must * be done under RCU protection, because the pages are freed via RCU @@ -1825,7 +1815,7 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, rcu_read_lock(); for ( ; to_zap; --to_zap) { - if (list_empty(nx_huge_pages)) + if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) break; /* @@ -1835,7 +1825,7 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, * the total number of shadow pages. And because the TDP MMU * doesn't use active_mmu_pages. 
*/ - sp = list_first_entry(nx_huge_pages, + sp = list_first_entry(&kvm->arch.tdp_mmu_possible_nx_huge_pages, struct kvm_mmu_page, possible_nx_huge_page_link); WARN_ON_ONCE(!sp->nx_huge_page_disallowed); @@ -1869,7 +1859,4 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, if (flush) kvm_flush_remote_tlbs(kvm); rcu_read_unlock(); - - write_unlock(&kvm->mmu_lock); - srcu_read_unlock(&kvm->srcu, rcu_idx); } diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 4036552f40cd..86c1065a672d 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -67,10 +67,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *spte); -unsigned long kvm_tdp_mmu_nx_huge_pages_to_zap(struct kvm *kvm); -void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, - struct list_head *nx_huge_pages, - unsigned long to_zap); +void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm); #ifdef CONFIG_X86_64 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; } From patchwork Thu Aug 29 19:11:35 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vipin Sharma X-Patchwork-Id: 13783611 Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 22B261BA268 for ; Thu, 29 Aug 2024 19:11:47 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724958711; cv=none; b=iw6whex2yLaZuNtNxpSDCq15CaqM7SSPindqo5cocTg/8rvBCP55BDH5JxOWVmyRXlvohWqxUNZemR/IuufkU4dbnrdgx54fNvvDWHtnYfvhamlTX+v/HxQof5CchRIgT2PA0OHZK1cKrg5JGDc/2KvtI3ZMPrytq1vpUyJD/zE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724958711; c=relaxed/simple; bh=zll6c5CUF/h98F5nNJpdFephBj+sdNsC47XY2M2pANo=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=MqaFQapk4UDeUuBAUMarZkPNbeMP1J/ubZrlmQ8cVuxQint12NSc4N52HN552OA13gISlDnsgEsKRfE0RYWt9Iy5nn2fGfBFGimqZQeU385boufVxYheNlhHPB812fKrtdyNQ9OYtNB1CoO3aLVc6b8eTZA1m13NwsV3F5tjIEo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=pWY6N87T; arc=none smtp.client-ip=209.85.214.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="pWY6N87T" Received: by mail-pl1-f202.google.com with SMTP id d9443c01a7336-201e318ac63so9359395ad.1 for ; Thu, 29 Aug 2024 12:11:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1724958707; x=1725563507; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=UWn/R6jmGvzbPhOlWXh5KeXbWVap9lx0Vq9PBd9EnIo=; b=pWY6N87TtznfR036JArSwEEEaFdlxJa3osANaRiZF4ECqkcxwFTWzaGPiha2xXj7Z9 
Date: Thu, 29 Aug 2024 12:11:35 -0700
In-Reply-To: <20240829191135.2041489-1-vipinsh@google.com>
References: <20240829191135.2041489-1-vipinsh@google.com>
Message-ID: <20240829191135.2041489-5-vipinsh@google.com>
Subject: [PATCH v2 4/4] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
From: Vipin Sharma
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Use the MMU read lock to recover TDP MMU NX huge pages. Iterate over the
huge pages list under tdp_mmu_pages_lock protection and unaccount the page
before dropping the lock.

Convert kvm_tdp_mmu_zap_sp() into tdp_mmu_zap_possible_nx_huge_page() as
there are no other users of it. Ignore the zap if any of the following
conditions is true:
- It is a root page.
- The parent is pointing to:
  - A different page table.
  - A huge page.
  - A non-present entry.

Warn if zapping the SPTE fails while the current SPTE still points to the
same page table; this should never happen.

There is always a race between dirty logging, vCPU faults, and NX huge page
recovery for backing a gfn by an NX huge page or an executable small page.
Unaccounting sooner during the list traversal increases the window of that
race. Functionally, it is okay, because accounting doesn't protect against
the iTLB multi-hit bug; it is there purely to prevent KVM from bouncing a
gfn between two page sizes. The only downside is that a vCPU will end up
doing more work tearing down all the child SPTEs. This should be a very
rare race.

Zapping under the MMU read lock unblocks vCPUs which are waiting for the
MMU read lock. This optimization is done to solve a guest jitter issue on
Windows VMs which were observing an increase in network latency. The test
workload sets up two Windows VMs and uses the latte.exe[1] binary to run a
network latency benchmark. Running NX huge page recovery under the MMU lock
was causing latency to increase up to 30 ms because vCPUs were waiting for
the MMU lock.
Running the tool on VMs using MMU read lock NX huge page recovery removed the jitter issue completely and MMU lock wait time by vCPUs was also reduced. Command used for testing: Server: latte.exe -udp -a 192.168.100.1:9000 -i 10000000 Client: latte.exe -c -udp -a 192.168.100.1:9000 -i 10000000 -hist -hl 1000 -hc 30 Output from the latency tool on client: Before ------ Protocol UDP SendMethod Blocking ReceiveMethod Blocking SO_SNDBUF Default SO_RCVBUF Default MsgSize(byte) 4 Iterations 10000000 Latency(usec) 69.98 CPU(%) 2.8 CtxSwitch/sec 32783 (2.29/iteration) SysCall/sec 99948 (6.99/iteration) Interrupt/sec 55164 (3.86/iteration) Interval(usec) Frequency 0 9999967 1000 14 2000 0 3000 5 4000 1 5000 0 6000 0 7000 0 8000 0 9000 0 10000 0 11000 0 12000 2 13000 2 14000 4 15000 2 16000 2 17000 0 18000 1 After ----- Protocol UDP SendMethod Blocking ReceiveMethod Blocking SO_SNDBUF Default SO_RCVBUF Default MsgSize(byte) 4 Iterations 10000000 Latency(usec) 67.66 CPU(%) 1.6 CtxSwitch/sec 32869 (2.22/iteration) SysCall/sec 69366 (4.69/iteration) Interrupt/sec 50693 (3.43/iteration) Interval(usec) Frequency 0 9999972 1000 27 2000 1 [1] https://github.com/microsoft/latte Suggested-by: Sean Christopherson Signed-off-by: Vipin Sharma --- arch/x86/kvm/mmu/mmu.c | 11 ++++-- arch/x86/kvm/mmu/tdp_mmu.c | 76 +++++++++++++++++++++++++++++--------- arch/x86/kvm/mmu/tdp_mmu.h | 1 - 3 files changed, 67 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index d636850c6929..cda6b07d4cda 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7436,14 +7436,19 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) return 0; rcu_idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); - kvm_mmu_recover_nx_huge_pages(kvm); + if (kvm_memslots_have_rmaps(kvm)) { + write_lock(&kvm->mmu_lock); + kvm_mmu_recover_nx_huge_pages(kvm); + write_unlock(&kvm->mmu_lock); + } + if (tdp_mmu_enabled) { + read_lock(&kvm->mmu_lock); kvm_tdp_mmu_recover_nx_huge_pages(kvm); + read_unlock(&kvm->mmu_lock); } - write_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, rcu_idx); } } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 179cfd67609a..95aa829b856f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -818,23 +818,49 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, rcu_read_unlock(); } -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) +static bool tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm, + struct kvm_mmu_page *sp) { - u64 old_spte; + struct tdp_iter iter = { + .old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0, + .sptep = sp->ptep, + .level = sp->role.level + 1, + .gfn = sp->gfn, + .as_id = kvm_mmu_page_as_id(sp), + }; + + lockdep_assert_held_read(&kvm->mmu_lock); /* - * This helper intentionally doesn't allow zapping a root shadow page, - * which doesn't have a parent page table and thus no associated entry. + * Root shadow pages don't a parent page table and thus no associated + * entry, but they can never be possible NX huge pages. */ if (WARN_ON_ONCE(!sp->ptep)) return false; - old_spte = kvm_tdp_mmu_read_spte(sp->ptep); - if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) + /* + * Since mmu_lock is held in read mode, it's possible another task has + * already modified the SPTE. Zap the SPTE if and only if the SPTE + * points at the SP's page table, as checking shadow-present isn't + * sufficient, e.g. 
the SPTE could be replaced by a leaf SPTE, or even + * another SP. Note, spte_to_child_pt() also checks that the SPTE is + * shadow-present, i.e. guards against zapping a frozen SPTE. + */ + if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level)) return false; - tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, - SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1); + /* + * If a different task modified the SPTE, then it should be impossible + * for the SPTE to still be used for the to-be-zapped SP. Non-leaf + * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when + * creating non-leaf SPTEs, and all other bits are immutable for non- + * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are + * zapping and replacement. + */ + if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) { + WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level)); + return false; + } return true; } @@ -1806,7 +1832,7 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm) struct kvm_mmu_page *sp; bool flush = false; - lockdep_assert_held_write(&kvm->mmu_lock); + lockdep_assert_held_read(&kvm->mmu_lock); /* * Zapping TDP MMU shadow pages, including the remote TLB flush, must * be done under RCU protection, because the pages are freed via RCU @@ -1815,8 +1841,11 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm) rcu_read_lock(); for ( ; to_zap; --to_zap) { - if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) + spin_lock(&kvm->arch.tdp_mmu_pages_lock); + if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) { + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); break; + } /* * We use a separate list instead of just using active_mmu_pages @@ -1832,16 +1861,29 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm) WARN_ON_ONCE(!sp->role.direct); /* - * Unaccount and do not attempt to recover any NX Huge Pages - * that are being dirty tracked, as they would just be faulted - * back in as 4KiB pages. The NX Huge Pages in this slot will be + * Unaccount the shadow page before zapping its SPTE so as to + * avoid bouncing tdp_mmu_pages_lock more than is necessary. + * Clearing nx_huge_page_disallowed before zapping is safe, as + * the flag doesn't protect against iTLB multi-hit, it's there + * purely to prevent bouncing the gfn between an NX huge page + * and an X small spage. A vCPU could get stuck tearing down + * the shadow page, e.g. if it happens to fault on the region + * before the SPTE is zapped and replaces the shadow page with + * an NX huge page and get stuck tearing down the child SPTEs, + * but that is a rare race, i.e. shouldn't impact performance. + */ + unaccount_nx_huge_page(kvm, sp); + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); + + /* + * Don't bother zapping shadow pages if the memslot is being + * dirty logged, as the relevant pages would just be faulted back + * in as 4KiB pages. Potential NX Huge Pages in this slot will be * recovered, along with all the other huge pages in the slot, * when dirty logging is disabled. 
*/ - if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) - unaccount_nx_huge_page(kvm, sp); - else - flush |= kvm_tdp_mmu_zap_sp(kvm, sp); + if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) + flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp); WARN_ON_ONCE(sp->nx_huge_page_disallowed); if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 86c1065a672d..57683b5dca9d 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -20,7 +20,6 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root); bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush); -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
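To make the locking structure of the final patch easier to see outside of
KVM, below is a small, self-contained user-space model of the recovery loop.
It is only a sketch under stated assumptions, not the actual KVM code: a
pthread mutex stands in for the tdp_mmu_pages_lock spinlock, a hand-rolled
doubly-linked list stands in for list_head, and the page IDs and counts are
made up. It shows only the "unaccount under the small lock, do the expensive
zap after dropping it" pattern described in the commit message.

/*
 * The shared list is only touched under pages_lock, the entry is removed
 * ("unaccounted") before the lock is dropped, and the expensive zap runs
 * outside of it so other threads (the stand-in for vCPUs taking the MMU
 * lock for read) are not blocked for long.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct nx_page {
	struct nx_page *prev, *next;	/* doubly-linked list, like list_head */
	int id;
};

static struct nx_page list_head = { &list_head, &list_head, -1 };
static pthread_mutex_t pages_lock = PTHREAD_MUTEX_INITIALIZER;

static void list_add_tail_entry(struct nx_page *p)
{
	p->prev = list_head.prev;
	p->next = &list_head;
	list_head.prev->next = p;
	list_head.prev = p;
}

static struct nx_page *pop_first_locked(void)
{
	struct nx_page *p = list_head.next;

	if (p == &list_head)
		return NULL;		/* list is empty */
	p->prev->next = p->next;
	p->next->prev = p->prev;
	return p;
}

static void zap_page(struct nx_page *p)
{
	/* Placeholder for the expensive work done outside pages_lock. */
	printf("zapping page %d\n", p->id);
	free(p);
}

int main(void)
{
	unsigned long to_zap = 3;

	for (int i = 0; i < 5; i++) {
		struct nx_page *p = malloc(sizeof(*p));

		if (!p)
			return 1;
		p->id = i;
		list_add_tail_entry(p);
	}

	for ( ; to_zap; --to_zap) {
		struct nx_page *p;

		pthread_mutex_lock(&pages_lock);
		p = pop_first_locked();	/* "unaccount" while holding the lock */
		pthread_mutex_unlock(&pages_lock);
		if (!p)
			break;
		zap_page(p);		/* heavy work done after dropping the lock */
	}
	return 0;
}

Keeping the teardown outside the per-list lock is what keeps contention with
the other threads short, which mirrors why the series moves the actual zap
under the MMU read lock and bounces tdp_mmu_pages_lock only briefly per page.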