From patchwork Mon Aug 12 17:13:40 2024
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 13760875
Date: Mon, 12 Aug 2024 10:13:40 -0700
In-Reply-To: <20240812171341.1763297-1-vipinsh@google.com>
References: <20240812171341.1763297-1-vipinsh@google.com>
Message-ID: <20240812171341.1763297-2-vipinsh@google.com>
Subject: [PATCH 1/2] KVM: x86/mmu: Split NX hugepage recovery flow into TDP and non-TDP flow
From: Vipin Sharma
To: seanjc@google.com, pbonzini@redhat.com
Cc: dmatlack@google.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Prepare to run NX hugepage recovery for TDP MMU pages under the MMU read
lock. Split the NX hugepage recovery code into two separate functions: one
for TDP MMU pages and another for non-TDP MMU pages. Run both flows under
the MMU write lock, the same as the code did before the split.

Both flows traverse the common list kvm->arch.possible_nx_huge_pages; the
TDP flow processes only TDP MMU shadow pages and the non-TDP flow processes
only non-TDP MMU pages. Starvation is avoided by maintaining the invariant
that at most the first 'to_zap' pages from the list are zapped. If there
are 'x' (x <= to_zap) TDP MMU pages among the first 'to_zap' pages of the
list, then the TDP flow zaps only those 'x' pages and the non-TDP flow gets
'to_zap - x' iterations to zap.

In the TDP flow, zap using kvm_tdp_mmu_zap_sp() and flush remote TLBs under
the RCU read lock. In the non-TDP flow, use kvm_mmu_prepare_zap_page() and
kvm_mmu_commit_zap_page(); no RCU read lock is needed there.

Side effects of this split are:

1. Separate TLB flushes for the TDP and non-TDP flows instead of the single
   combined flush in the existing approach. In practice this should rarely
   matter, as most of the time the pages in the recovery worker's list will
   be of one type, not both.

2. kvm->arch.possible_nx_huge_pages is traversed twice, once per flow. This
   should not cause a slowdown, as traversing a list is generally fast.

Signed-off-by: Vipin Sharma
---
 arch/x86/kvm/mmu/mmu.c          | 164 +++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h |   6 ++
 arch/x86/kvm/mmu/tdp_mmu.c      |  59 ++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   2 +
 4 files changed, 164 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..5534fcc9d1b5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -909,7 +909,7 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
         list_del_init(&sp->possible_nx_huge_page_link);
 }
 
-static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
         sp->nx_huge_page_disallowed = false;
 
@@ -7311,98 +7311,128 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
         return err;
 }
 
-static void kvm_recover_nx_huge_pages(struct kvm *kvm)
+/*
+ * Get the first shadow mmu page of desired type from the NX huge pages list.
+ * Return NULL if list doesn't have the needed page with in the first max pages.
+ */ +struct kvm_mmu_page *kvm_mmu_possible_nx_huge_page(struct kvm *kvm, bool tdp_mmu, + ulong max) { - unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits; - struct kvm_memory_slot *slot; - int rcu_idx; - struct kvm_mmu_page *sp; - unsigned int ratio; - LIST_HEAD(invalid_list); - bool flush = false; - ulong to_zap; + struct kvm_mmu_page *sp = NULL; + ulong i = 0; - rcu_idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); + /* + * We use a separate list instead of just using active_mmu_pages because + * the number of shadow pages that be replaced with an NX huge page is + * expected to be relatively small compared to the total number of shadow + * pages. And because the TDP MMU doesn't use active_mmu_pages. + */ + list_for_each_entry(sp, &kvm->arch.possible_nx_huge_pages, possible_nx_huge_page_link) { + if (i++ >= max) + break; + if (is_tdp_mmu_page(sp) == tdp_mmu) + return sp; + } + + return NULL; +} + +static struct kvm_mmu_page *shadow_mmu_nx_huge_page_to_zap(struct kvm *kvm, ulong max) +{ + return kvm_mmu_possible_nx_huge_page(kvm, /*tdp_mmu=*/false, max); +} + +bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + struct kvm_memory_slot *slot = NULL; /* - * Zapping TDP MMU shadow pages, including the remote TLB flush, must - * be done under RCU protection, because the pages are freed via RCU - * callback. + * Since gfn_to_memslot() is relatively expensive, it helps to skip it if + * it the test cannot possibly return true. On the other hand, if any + * memslot has logging enabled, chances are good that all of them do, in + * which case unaccount_nx_huge_page() is much cheaper than zapping the + * page. + * + * If a memslot update is in progress, reading an incorrect value of + * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming + * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming + * nonzero, the page will be zapped unnecessarily. Either way, this only + * affects efficiency in racy situations, and not correctness. */ - rcu_read_lock(); + if (atomic_read(&kvm->nr_memslots_dirty_logging)) { + struct kvm_memslots *slots; - ratio = READ_ONCE(nx_huge_pages_recovery_ratio); - to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0; - for ( ; to_zap; --to_zap) { - if (list_empty(&kvm->arch.possible_nx_huge_pages)) + slots = kvm_memslots_for_spte_role(kvm, sp->role); + slot = __gfn_to_memslot(slots, sp->gfn); + WARN_ON_ONCE(!slot); + } + + return slot && kvm_slot_dirty_track_enabled(slot); +} + +static void shadow_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap) +{ + struct kvm_mmu_page *sp; + LIST_HEAD(invalid_list); + + lockdep_assert_held_write(&kvm->mmu_lock); + + while (to_zap) { + sp = shadow_mmu_nx_huge_page_to_zap(kvm, to_zap); + if (!sp) break; - /* - * We use a separate list instead of just using active_mmu_pages - * because the number of shadow pages that be replaced with an - * NX huge page is expected to be relatively small compared to - * the total number of shadow pages. And because the TDP MMU - * doesn't use active_mmu_pages. - */ - sp = list_first_entry(&kvm->arch.possible_nx_huge_pages, - struct kvm_mmu_page, - possible_nx_huge_page_link); WARN_ON_ONCE(!sp->nx_huge_page_disallowed); WARN_ON_ONCE(!sp->role.direct); /* - * Unaccount and do not attempt to recover any NX Huge Pages - * that are being dirty tracked, as they would just be faulted - * back in as 4KiB pages. 
The NX Huge Pages in this slot will be + * Unaccount and do not attempt to recover any NX Huge Pages that + * are being dirty tracked, as they would just be faulted back in + * as 4KiB pages. The NX Huge Pages in this slot will be * recovered, along with all the other huge pages in the slot, * when dirty logging is disabled. - * - * Since gfn_to_memslot() is relatively expensive, it helps to - * skip it if it the test cannot possibly return true. On the - * other hand, if any memslot has logging enabled, chances are - * good that all of them do, in which case unaccount_nx_huge_page() - * is much cheaper than zapping the page. - * - * If a memslot update is in progress, reading an incorrect value - * of kvm->nr_memslots_dirty_logging is not a problem: if it is - * becoming zero, gfn_to_memslot() will be done unnecessarily; if - * it is becoming nonzero, the page will be zapped unnecessarily. - * Either way, this only affects efficiency in racy situations, - * and not correctness. */ - slot = NULL; - if (atomic_read(&kvm->nr_memslots_dirty_logging)) { - struct kvm_memslots *slots; - - slots = kvm_memslots_for_spte_role(kvm, sp->role); - slot = __gfn_to_memslot(slots, sp->gfn); - WARN_ON_ONCE(!slot); - } - - if (slot && kvm_slot_dirty_track_enabled(slot)) + if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) unaccount_nx_huge_page(kvm, sp); - else if (is_tdp_mmu_page(sp)) - flush |= kvm_tdp_mmu_zap_sp(kvm, sp); else kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); + WARN_ON_ONCE(sp->nx_huge_page_disallowed); if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { - kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); - rcu_read_unlock(); - + kvm_mmu_commit_zap_page(kvm, &invalid_list); cond_resched_rwlock_write(&kvm->mmu_lock); - flush = false; - - rcu_read_lock(); } + to_zap--; } - kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); - rcu_read_unlock(); + kvm_mmu_commit_zap_page(kvm, &invalid_list); +} + +static void kvm_recover_nx_huge_pages(struct kvm *kvm) +{ + unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits; + unsigned int ratio; + ulong to_zap; + int rcu_idx; + + ratio = READ_ONCE(nx_huge_pages_recovery_ratio); + to_zap = ratio ? 
DIV_ROUND_UP(nx_lpage_splits, ratio) : 0; + + rcu_idx = srcu_read_lock(&kvm->srcu); + + if (to_zap && tdp_mmu_enabled) { + write_lock(&kvm->mmu_lock); + to_zap = kvm_tdp_mmu_recover_nx_huge_pages(kvm, to_zap); + write_unlock(&kvm->mmu_lock); + } + + if (to_zap && kvm_memslots_have_rmaps(kvm)) { + write_lock(&kvm->mmu_lock); + shadow_mmu_recover_nx_huge_pages(kvm, to_zap); + write_unlock(&kvm->mmu_lock); + } - write_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, rcu_idx); } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 1721d97743e9..246b1bc0319b 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -354,4 +354,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +struct kvm_mmu_page *kvm_mmu_possible_nx_huge_page(struct kvm *kvm, bool tdp_mmu, + ulong max); + +bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm, struct kvm_mmu_page *sp); +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); + #endif /* __KVM_X86_MMU_INTERNAL_H */ diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index c7dc49ee7388..933bb8b11c9f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1796,3 +1796,62 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn, */ return rcu_dereference(sptep); } + +static struct kvm_mmu_page *tdp_mmu_nx_huge_page_to_zap(struct kvm *kvm, ulong max) +{ + return kvm_mmu_possible_nx_huge_page(kvm, /*tdp_mmu=*/true, max); +} + +ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap) +{ + struct kvm_mmu_page *sp; + bool flush = false; + + lockdep_assert_held_write(&kvm->mmu_lock); + /* + * Zapping TDP MMU shadow pages, including the remote TLB flush, must + * be done under RCU protection, because the pages are freed via RCU + * callback. + */ + rcu_read_lock(); + + while (to_zap) { + sp = tdp_mmu_nx_huge_page_to_zap(kvm, to_zap); + + if (!sp) + break; + + WARN_ON_ONCE(!sp->nx_huge_page_disallowed); + WARN_ON_ONCE(!sp->role.direct); + + /* + * Unaccount and do not attempt to recover any NX Huge Pages that + * are being dirty tracked, as they would just be faulted back in + * as 4KiB pages. The NX Huge Pages in this slot will be + * recovered, along with all the other huge pages in the slot, + * when dirty logging is disabled. 
+ */
+                if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
+                        unaccount_nx_huge_page(kvm, sp);
+                else
+                        flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
+
+                WARN_ON_ONCE(sp->nx_huge_page_disallowed);
+
+                if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+                        if (flush) {
+                                kvm_flush_remote_tlbs(kvm);
+                                flush = false;
+                        }
+                        rcu_read_unlock();
+                        cond_resched_rwlock_write(&kvm->mmu_lock);
+                        rcu_read_lock();
+                }
+                to_zap--;
+        }
+
+        if (flush)
+                kvm_flush_remote_tlbs(kvm);
+        rcu_read_unlock();
+        return to_zap;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1b74e058a81c..7d68c2ddf78c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -52,6 +52,8 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm, gfn_t start, gfn_t end,
                                       int target_level, bool shared);
 
+ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap);
+
 static inline void kvm_tdp_mmu_walk_lockless_begin(void)
 {
         rcu_read_lock();

From patchwork Mon Aug 12 17:13:41 2024
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 13760876
Date: Mon, 12 Aug 2024 10:13:41 -0700
In-Reply-To: <20240812171341.1763297-1-vipinsh@google.com>
References: <20240812171341.1763297-1-vipinsh@google.com>
Message-ID: <20240812171341.1763297-3-vipinsh@google.com>
Subject: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock
From: Vipin Sharma
To: seanjc@google.com, pbonzini@redhat.com
Cc: dmatlack@google.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Use the MMU read lock to recover NX huge pages belonging to the TDP MMU.
Iterate through kvm->arch.possible_nx_huge_pages while holding
kvm->arch.tdp_mmu_pages_lock.

Rename kvm_tdp_mmu_zap_sp() to tdp_mmu_zap_sp() and make it static, as
there are no callers outside of the TDP MMU.

Skip the zap if any of the following is true for the parent PTE:
- It points to some other page table.
- It points to a huge page.
- It is not present.

These scenarios can happen because recovery now runs under the MMU read
lock.

Zapping under the MMU read lock unblocks vCPUs that are also waiting to
take the MMU lock for read. This optimization was created to solve a guest
jitter issue on a Windows VM that was observing an increase in network
latency.

The test workload sets up two Windows VMs and uses the latte.exe[1] binary
to run a network latency benchmark. Running NX huge page recovery under the
MMU write lock was causing latency to increase by up to 30 ms because vCPUs
were waiting for the MMU lock. Running the tool on VMs using MMU read lock
NX huge page recovery removed the jitter completely and also reduced the
MMU lock wait time for vCPUs.
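For reference, a condensed sketch of kvm_recover_nx_huge_pages() as it ends
up after both patches, assembled from the diffs in this series with
explanatory comments added (the _sketch suffix only marks this block as an
illustration, not a hunk from the patch):

/*
 * Sketch of the recovery worker after both patches: the TDP MMU pass runs
 * under mmu_lock held for read, the shadow MMU pass under mmu_lock held
 * for write, and the two passes share a single 'to_zap' budget.
 */
static void kvm_recover_nx_huge_pages_sketch(struct kvm *kvm)
{
        unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
        unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
        ulong to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
        int rcu_idx = srcu_read_lock(&kvm->srcu);

        if (to_zap && tdp_mmu_enabled) {
                /*
                 * Zap TDP MMU pages under the read lock so vCPU faults,
                 * which also take mmu_lock for read, are not blocked. The
                 * helper returns the unused part of the budget.
                 */
                read_lock(&kvm->mmu_lock);
                to_zap = kvm_tdp_mmu_recover_nx_huge_pages(kvm, to_zap);
                read_unlock(&kvm->mmu_lock);
        }

        if (to_zap && kvm_memslots_have_rmaps(kvm)) {
                /* Shadow MMU pages still require the write lock. */
                write_lock(&kvm->mmu_lock);
                shadow_mmu_recover_nx_huge_pages(kvm, to_zap);
                write_unlock(&kvm->mmu_lock);
        }

        srcu_read_unlock(&kvm->srcu, rcu_idx);
}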
Command used for testing: Server: latte.exe -udp -a 192.168.100.1:9000 -i 10000000 Client: latte.exe -c -udp -a 192.168.100.1:9000 -i 10000000 -hist -hl 1000 -hc 30 Output from the latency tool on client: Before ------ Protocol UDP SendMethod Blocking ReceiveMethod Blocking SO_SNDBUF Default SO_RCVBUF Default MsgSize(byte) 4 Iterations 10000000 Latency(usec) 70.41 CPU(%) 1.7 CtxSwitch/sec 31125 (2.19/iteration) SysCall/sec 62184 (4.38/iteration) Interrupt/sec 48319 (3.40/iteration) Interval(usec) Frequency 0 9999964 1000 12 2000 3 3000 0 4000 0 5000 0 6000 0 7000 1 8000 1 9000 1 10000 2 11000 1 12000 0 13000 4 14000 1 15000 1 16000 4 17000 1 18000 2 19000 0 20000 0 21000 1 22000 0 23000 0 24000 1 After ----- Protocol UDP SendMethod Blocking ReceiveMethod Blocking SO_SNDBUF Default SO_RCVBUF Default MsgSize(byte) 4 Iterations 10000000 Latency(usec) 70.98 CPU(%) 1.3 CtxSwitch/sec 28967 (2.06/iteration) SysCall/sec 48988 (3.48/iteration) Interrupt/sec 47280 (3.36/iteration) Interval(usec) Frequency 0 9999996 1000 4 [1] https://github.com/microsoft/latte Signed-off-by: Vipin Sharma --- arch/x86/kvm/mmu/mmu.c | 10 +++++--- arch/x86/kvm/mmu/tdp_mmu.c | 48 ++++++++++++++++++++++++++------------ arch/x86/kvm/mmu/tdp_mmu.h | 1 - 3 files changed, 40 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 5534fcc9d1b5..d95770d5303a 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7321,6 +7321,7 @@ struct kvm_mmu_page *kvm_mmu_possible_nx_huge_page(struct kvm *kvm, bool tdp_mmu struct kvm_mmu_page *sp = NULL; ulong i = 0; + spin_lock(&kvm->arch.tdp_mmu_pages_lock); /* * We use a separate list instead of just using active_mmu_pages because * the number of shadow pages that be replaced with an NX huge page is @@ -7330,10 +7331,13 @@ struct kvm_mmu_page *kvm_mmu_possible_nx_huge_page(struct kvm *kvm, bool tdp_mmu list_for_each_entry(sp, &kvm->arch.possible_nx_huge_pages, possible_nx_huge_page_link) { if (i++ >= max) break; - if (is_tdp_mmu_page(sp) == tdp_mmu) + if (is_tdp_mmu_page(sp) == tdp_mmu) { + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); return sp; + } } + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); return NULL; } @@ -7422,9 +7426,9 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm) rcu_idx = srcu_read_lock(&kvm->srcu); if (to_zap && tdp_mmu_enabled) { - write_lock(&kvm->mmu_lock); + read_lock(&kvm->mmu_lock); to_zap = kvm_tdp_mmu_recover_nx_huge_pages(kvm, to_zap); - write_unlock(&kvm->mmu_lock); + read_unlock(&kvm->mmu_lock); } if (to_zap && kvm_memslots_have_rmaps(kvm)) { diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 933bb8b11c9f..7c7d207ee590 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -817,9 +817,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, rcu_read_unlock(); } -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) +static bool tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) { - u64 old_spte; + struct tdp_iter iter = {}; + + lockdep_assert_held_read(&kvm->mmu_lock); /* * This helper intentionally doesn't allow zapping a root shadow page, @@ -828,12 +830,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) if (WARN_ON_ONCE(!sp->ptep)) return false; - old_spte = kvm_tdp_mmu_read_spte(sp->ptep); - if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) + iter.old_spte = kvm_tdp_mmu_read_spte(sp->ptep); + iter.sptep = sp->ptep; + iter.level = sp->role.level + 1; + iter.gfn = sp->gfn; + iter.as_id = 
kvm_mmu_page_as_id(sp); + +retry: + /* + * Since mmu_lock is held in read mode, it's possible to race with + * another CPU which can remove sp from the page table hierarchy. + * + * No need to re-read iter.old_spte as tdp_mmu_set_spte_atomic() will + * update it in the case of failure. + */ + if (sp->spt != spte_to_child_pt(iter.old_spte, iter.level)) return false; - tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, - SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1); + if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) + goto retry; return true; } @@ -1807,7 +1822,7 @@ ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap) struct kvm_mmu_page *sp; bool flush = false; - lockdep_assert_held_write(&kvm->mmu_lock); + lockdep_assert_held_read(&kvm->mmu_lock); /* * Zapping TDP MMU shadow pages, including the remote TLB flush, must * be done under RCU protection, because the pages are freed via RCU @@ -1821,7 +1836,6 @@ ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap) if (!sp) break; - WARN_ON_ONCE(!sp->nx_huge_page_disallowed); WARN_ON_ONCE(!sp->role.direct); /* @@ -1831,12 +1845,17 @@ ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap) * recovered, along with all the other huge pages in the slot, * when dirty logging is disabled. */ - if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) + if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) { + spin_lock(&kvm->arch.tdp_mmu_pages_lock); unaccount_nx_huge_page(kvm, sp); - else - flush |= kvm_tdp_mmu_zap_sp(kvm, sp); - - WARN_ON_ONCE(sp->nx_huge_page_disallowed); + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); + to_zap--; + WARN_ON_ONCE(sp->nx_huge_page_disallowed); + } else if (tdp_mmu_zap_sp(kvm, sp)) { + flush = true; + to_zap--; + WARN_ON_ONCE(sp->nx_huge_page_disallowed); + } if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { if (flush) { @@ -1844,10 +1863,9 @@ ulong kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, ulong to_zap) flush = false; } rcu_read_unlock(); - cond_resched_rwlock_write(&kvm->mmu_lock); + cond_resched_rwlock_read(&kvm->mmu_lock); rcu_read_lock(); } - to_zap--; } if (flush) diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 7d68c2ddf78c..e0315cce6798 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -20,7 +20,6 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root); bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush); -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);