From patchwork Wed May 15 00:59:37 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664493 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 01/16] KVM: x86: Add a VM type define for TDX Date: Tue, 14 May 2024
17:59:37 -0700 Message-Id: <20240515005952.3410568-2-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Add a VM type define for TDX. Future changes will need to lay the ground work for TDX support by making some behavior conditional on the VM being a TDX guest. Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - New patch, split from main series --- arch/x86/include/uapi/asm/kvm.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 988b5204d636..4dea0cfeee51 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -922,5 +922,6 @@ struct kvm_hyperv_eventfd { #define KVM_X86_SEV_VM 2 #define KVM_X86_SEV_ES_VM 3 #define KVM_X86_SNP_VM 4 +#define KVM_X86_TDX_VM 5 #endif /* _ASM_X86_KVM_H */ From patchwork Wed May 15 00:59:38 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664495 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ABDE263B9; Wed, 15 May 2024 01:00:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734807; cv=none; b=dNuFR22e1vS+VkWiD9NdfDYEW1gHUAFDt/1I04HDu4j9AMJjEmk0fsjPEGQkWOeQ8Z5Uni+hiL3u8ZPKKos+Z8+YB8YCS8zT6lYD4U/6i4c03clJTzJyuqBvHTbOlF/yKulK+x3RUoDBL73qUUzlRJhri6aBNHN1m+ZwPALDYrU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734807; c=relaxed/simple; bh=3IiuZ0F0+HeMw0aL1gEKx9S6ntoFeuhcsgyANUKYq88=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=jO9lwsTp9XKpBDw2V2PfhGMDlbxeebulwm3ell8GwXme8PGeSuKSHWXYK9wQprgghKa4wFiJmaOcUcdoed5Tv3+O0Fi5Z1bnBZ0ydnOn0yTNFn2FBb/Vcwzm1aKbirOTo0iJreb8vz+fjl2aNN/ChiHQLQooZH8UeCTSu5RGXbg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=dpixqwNZ; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="dpixqwNZ" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734805; x=1747270805; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=3IiuZ0F0+HeMw0aL1gEKx9S6ntoFeuhcsgyANUKYq88=; b=dpixqwNZOwgWeE1e8+SQBumiC3H5qyXlAcct7c/InUWYXKR92h5o4YGR BZaS5Uj+XTr/Jw16j3A9BDEbzJlfIejzhiJ96x0J3otek/rORv6WeLj2m N+AAxnFXT/pMQGgvisiEYYX5YR+tc3IcFH8a3jnN7UE9DWkexto6Q8a8J lSh+6wjwdHLeyP3NySQFcOjR+R3eQ/z2e2VTDqsw0W/IH/xBKhu78C1Bf mbsv0ufBcwM39Ba5qxrhYg5qtx7/IMHHIUqxLVZKom5c2768Z/BGaoJE+ pb0WPCvdR++ZuCHrJs9EuS1hQMW1V6SuWrRvvM6+3Citg5/M7k/PCxERh g==; X-CSE-ConnectionGUID: 
Z5zfnomhSu+PRRz+z4btOA== From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 02/16] KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion Date: Tue, 14 May 2024 17:59:38 -0700 Message-Id: <20240515005952.3410568-3-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org MIME-Version: 1.0

From: Yan Zhao

Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zapping only leaf SPTEs when deleting a memslot.

Today, "zapping only memslot leaf SPTEs" on memslot deletion is not done. Instead, KVM invalidates all old TDPs (i.e. EPT for Intel or NPT for AMD) and generates fresh new TDPs based on the new memslot layout. This is because zapping and re-generating TDPs is low overhead for most use cases, and more importantly because of a bug [1] which caused VM instability when a VM had an Nvidia GeForce GPU assigned.

There was a previous attempt [2] to introduce a per-VM flag to work around bug [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs. However, [2] was not merged due to the lack of a clear explanation of exactly what is broken [3], and because it's not wise to "have a bug that is known to happen when you enable the capability".

However, for some specific scenarios, e.g. TDX, invalidating and re-generating a new page table is not viable for the following reasons:
- TDX requires that the root page of the private page table remain unaltered throughout the TD life cycle.
- TDX mandates that leaf entries in the private page table be zapped prior to non-leaf entries.

So, Sean reconsidered introducing a per-VM flag or a per-memslot flag for VMs like TDX. [4]

This patch implements the per-memslot flag. Compared to the per-VM flag approach:

Pros:
(1) Allowing userspace to control the zapping behavior at fine granularity means optimizations for specific use cases can be developed without future kernel changes.
(2) Allows developing new zapping behaviors without risking regressions by changing KVM behavior, as seen previously.

Cons:
(1) Users need to ensure all necessary memslots have the KVM_MEM_ZAP_LEAFS_ONLY flag set, e.g. QEMU needs to ensure every GUEST_MEMFD memslot has the ZAP_LEAFS_ONLY flag for a TDX VM (see the illustrative sketch below).
(2) Opens up the possibility that userspace could configure memslots for a normal VM in such a way that bug [1] is seen.
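To make Cons (1) concrete, here is a rough userspace sketch of how a VMM such as QEMU might register a guest_memfd-backed memslot for a TDX VM with the proposed flag set. It assumes this patch's uAPI (KVM_MEM_ZAP_LEAFS_ONLY) is applied on top of the existing KVM_SET_USER_MEMORY_REGION2 / KVM_MEM_GUEST_MEMFD interface; the function name, slot layout, and lack of error handling are purely illustrative and not part of this series.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: register a private (guest_memfd) slot that KVM will
 * zap leaf-only on deletion, as TDX requires with this patch applied. */
static int set_tdx_private_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
				__u64 uaddr, int gmem_fd)
{
	struct kvm_userspace_memory_region2 region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = KVM_MEM_GUEST_MEMFD | KVM_MEM_ZAP_LEAFS_ONLY;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = uaddr;	/* backing for the shared alias */
	region.guest_memfd = gmem_fd;
	region.guest_memfd_offset = 0;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}

When such a slot is later deleted, kvm_arch_flush_shadow_memslot() takes the new kvm_mmu_zap_memslot_leafs() path below instead of kvm_mmu_zap_all_fast().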
However, one thing deserves noting for TDX, is that TDX may potentially meet bug [1] for either per-memslot flag or per-VM flag approach, since there's a usage in radar to assign an untrusted & passthrough GPU device in TDX. If that happens, it can be treated as a bug (not regression) and fixed accordingly. An alternative approach we can also consider is to always invalidate & rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored and private page tables on memslot deletion. This approach could exempt TDX from bug [1] when "untrusted & passthrough" devices are involved. But downside is that this approach requires creating new very specific KVM zapping ABI that could limit future changes in the same way that the bug did for normal VMs. Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3] Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@google.com [4] Signed-off-by: Yan Zhao Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu/mmu.c | 30 +++++++++++++++++++++++++++++- arch/x86/kvm/x86.c | 17 +++++++++++++++++ include/uapi/linux/kvm.h | 1 + virt/kvm/kvm_main.c | 5 ++++- 4 files changed, 51 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 61982da8c8b2..4a8e819794db 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm) kvm_mmu_zap_all(kvm); } +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm)) + return; + + write_lock(&kvm->mmu_lock); + + /* + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst + * case scenario we'll have unused shadow pages lying around until they + * are recycled due to age or when the VM is destroyed. + */ + struct kvm_gfn_range range = { + .slot = slot, + .start = slot->base_gfn, + .end = slot->base_gfn + slot->npages, + .may_block = true, + }; + + if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false)) + kvm_flush_remote_tlbs(kvm); + + write_unlock(&kvm->mmu_lock); +} + void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { - kvm_mmu_zap_all_fast(kvm); + if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY) + kvm_mmu_zap_memslot_leafs(kvm, slot); + else + kvm_mmu_zap_all_fast(kvm); } void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7c593a081eba..4b3ec2ec79e9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn()) return -EINVAL; + /* + * Since TDX private pages requires re-accepting after zap, + * and TDX private root page should not be zapped, TDX requires + * memslots for private memory must have flag + * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of + * the deleting memslot will be zapped and SPTEs in other + * memslots would not be affected. 
+ */ + if (kvm->arch.vm_type == KVM_X86_TDX_VM && + (new->flags & KVM_MEM_GUEST_MEMFD) && + !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY)) + return -EINVAL; + + /* zap-leafs-only works only when TDP MMU is enabled for now */ + if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled) + return -EINVAL; + return kvm_alloc_memslot_metadata(kvm, new); } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index aee67912e71c..d53648c19b26 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 { #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) #define KVM_MEM_GUEST_MEMFD (1UL << 2) +#define KVM_MEM_ZAP_LEAFS_ONLY (1UL << 3) /* for KVM_IRQ_LINE */ struct kvm_irq_level { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 81b90bf03f2f..1b1ffb6fc786 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1568,6 +1568,8 @@ static int check_memory_region_flags(struct kvm *kvm, if (kvm_arch_has_private_mem(kvm)) valid_flags |= KVM_MEM_GUEST_MEMFD; + valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY; + /* Dirty logging private memory is not currently supported. */ if (mem->flags & KVM_MEM_GUEST_MEMFD) valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES; @@ -2052,7 +2054,8 @@ int __kvm_set_memory_region(struct kvm *kvm, return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || - ((mem->flags ^ old->flags) & KVM_MEM_READONLY)) + ((mem->flags ^ old->flags) & + (KVM_MEM_READONLY | KVM_MEM_ZAP_LEAFS_ONLY))) return -EINVAL; if (base_gfn != old->base_gfn) From patchwork Wed May 15 00:59:39 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664494 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DBD6963D5; Wed, 15 May 2024 01:00:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734806; cv=none; b=XIMFTvwmJmrwo5E9thpGqs9Lx7cEOD5AA5jyyaqSeOvSCJTM7sjBkO1NZ1wp3Ozmpes4eIassZifuG/+XnygzGBMBxsUv4GFKOn8TdNiupLVd3+4BCdB9M6SLBul2wKHZ5y/AarWhMnfjcuFsGUBRLnqnVLY3uuDW7OUnteCeRI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734806; c=relaxed/simple; bh=v6Cuz4zx2wjPGlGh14VUElcDcI+9AmfYj5Tp5b9rlU8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Gu3eYArLO+EW/j+0jPlOvUnbdvCL82fZ4FwtPw4LxJMMZ4tbNOgIlSmIFJyTODhcICo+nF2bi2nbPrX3EmtZqnuAPZ0SZJCSKiIk29jAxDTz7YiUDZ91YaD6PmwNfy8vSm4i/ZYJtMxJuNEvISUkGN21mtv2AZ5fZBkecagAvdw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HdA5UoLg; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HdA5UoLg" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734805; x=1747270805; 
h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=v6Cuz4zx2wjPGlGh14VUElcDcI+9AmfYj5Tp5b9rlU8=; b=HdA5UoLgQMGvRLAevm/KcR0Utgh3n3L+8mPLcEu3ge/7dke4D2TvxjhN 7hCdOHVCAHO8bjtEPNRy7yyO4ewk2rZ8ukY8bZL4qMFir7jj5ly56yAeF h1dDeW6XI/Xa6UP1liP2xGATGJJxLKfLZVkCHxU8cPBTfOecBt2S3TE9U J+aeuvC8uYeEh5PiBfCZsFYka7UmN/cHHVaaK32hzyLQBI1anrfokvnFQ gejJAbSB6FNqdrclE+xWTSOvlCQ/CphcyNEvLiO8IBSN1wPHwulLBGjGt XgnbuRxYLbI0uMljrh053N0UFozJifsgn9wA9kIlNh/WqRi4iXp2R6GE+ A==; X-CSE-ConnectionGUID: p5VVk2WYSwGHh7zTb+AW6g== X-CSE-MsgGUID: BKKy142LTn2ESNJbs6qfxQ== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613937" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613937" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:03 -0700 X-CSE-ConnectionGUID: KkRdh+qpQ4GfL6f/kt9Bug== X-CSE-MsgGUID: TU/ohH1tRra8vedU9iOw9Q== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942719" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:02 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 03/16] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU Date: Tue, 14 May 2024 17:59:39 -0700 Message-Id: <20240515005952.3410568-4-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Export a function to walk down the TDP without modifying it. Future changes will support pre-populating TDX private memory. In order to implement this KVM will need to check if a given GFN is already pre-populated in the mirrored EPT, and verify the populated private memory PFN matches the current one.[1] There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within the KVM MMU that almost does what is required. However, to make sense of the results, MMU internal PTE helpers are needed. Refactor the code to provide a helper that can be used outside of the KVM MMU code. Refactoring the KVM page fault handler to support this lookup usage was also considered, but it was an awkward fit. Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1] Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- This helper will be used in the future change that implements KVM_TDX_INIT_MEM_REGION. 
Please refer to the following commit for the usage: https://github.com/intel/tdx/commit/2832c6d87a4e6a46828b193173550e80b31240d4 TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu.h | 3 +++ arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++++++++++---- 2 files changed, 36 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index dc80e72e4848..3c7a88400cbb 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -275,6 +275,9 @@ extern bool tdp_mmu_enabled; #define tdp_mmu_enabled false #endif +int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa, + kvm_pfn_t *pfn); + static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) { return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1259dd63defc..1086e3b2aa5c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1772,16 +1772,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm, * * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}. */ -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, - int *root_level) +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, + bool is_private) { struct tdp_iter iter; struct kvm_mmu *mmu = vcpu->arch.mmu; gfn_t gfn = addr >> PAGE_SHIFT; int leaf = -1; - *root_level = vcpu->arch.mmu->root_role.level; - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { leaf = iter.level; sptes[leaf] = iter.old_spte; @@ -1790,6 +1788,37 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, return leaf; } +int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, + int *root_level) +{ + *root_level = vcpu->arch.mmu->root_role.level; + + return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false); +} + +int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa, + kvm_pfn_t *pfn) +{ + u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte; + int leaf; + + lockdep_assert_held(&vcpu->kvm->mmu_lock); + + rcu_read_lock(); + leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true); + rcu_read_unlock(); + if (leaf < 0) + return -ENOENT; + + spte = sptes[leaf]; + if (!(is_shadow_present_pte(spte) && is_last_spte(spte, leaf))) + return -ENOENT; + + *pfn = spte_to_pfn(spte); + return leaf; +} +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn); + /* * Returns the last level spte pointer of the shadow page walk for the given * gpa, and sets *spte to the spte value. This spte may be non-preset. 
If no From patchwork Wed May 15 00:59:40 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664496 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4EBEB79C8; Wed, 15 May 2024 01:00:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734808; cv=none; b=DchnOauOSx5Oj/40XJ8HH6pC1+BECVDdfk9evILnk2WG91qIfJuTA+SOeaOP20iWsedf9YmwwwRLaHzeM1XSktdZZOSKH2tMffclelXsBLfBqnQLvdmTjsqG1UAUIBepmyynb1oHfHs2vfjaSUbveElaBYzrVYMyPt38R8AYSZA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734808; c=relaxed/simple; bh=EiPyaz12Hs/yvu2vb2xAWA/RYdi5EWxAwUrWQYyN7dw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=omXYjLZAJZNkVKgFuc27xWr/qpePJh5RuQC+A5VCpKBUiyTLGGRf0aBAWJoPmkXx9JJ10zEJ7V6TZ8sI6AJ0Hz3mSnbg1rwsedW+zT1UcUMN6CVBh0kzfo0CgyzkAdcS8s1Ae1mDg4JP04fr8NeNEWxTMviU3tpEed3kwtOa+Jg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ELClGgbz; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ELClGgbz" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734805; x=1747270805; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=EiPyaz12Hs/yvu2vb2xAWA/RYdi5EWxAwUrWQYyN7dw=; b=ELClGgbzEWPPrOQH8LYy5ogKNpc3jx+tSlMjye+JiqFExsvgHzzG5Ct9 qitUmcVoDVhkpetFmKXVQPz5vbo3K958kRmRMhGVT03kWxl2yOMXpRV1l 8LbUHuVOAbukSalUjQQG+pKGdz0qOk53LdC849VmgH4RwWTxnPM+c11ou m9S6XowJ3sIy4Kexg/CBE4bx12e/9BNqlro6U4IbhtA6j4mgHkZFGQ+PL wHHZ5P2WBZXY+5wXa8IAPV48xnOhAOvnlndsOA4s65ZWz6dVJMGh0TysD dVgIKg510yGYw1RLTnUyDBprb/sYn7JLzwlRuHSPAPF4+93uxjxdsxfBv A==; X-CSE-ConnectionGUID: JscFULhURNOu5LsnGHxd2A== X-CSE-MsgGUID: KLCznybqRSyZBVQ8OZh0XA== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613944" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613944" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:04 -0700 X-CSE-ConnectionGUID: 59cLFGsYRK+7uA0a878tEw== X-CSE-MsgGUID: G3nbzKu1SHaBw238yzeYqA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942724" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:02 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 04/16] KVM: x86/mmu: Add address conversion functions for TDX shared 
bit of GPA Date: Tue, 14 May 2024 17:59:40 -0700 Message-Id: <20240515005952.3410568-5-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Introduce a "gfn_shared_mask" field in the kvm_arch structure to record GPA shared bit and provide address conversion helpers for TDX shared bit of GPA. TDX designates a specific GPA bit as the shared bit, which can be either bit 51 or bit 47 based on configuration. This GPA shared bit indicates whether the corresponding physical page is shared (if shared bit set) or private (if shared bit cleared). - GPAs with shared bit set will be mapped by VMM into conventional EPT, which is pointed by shared EPTP in TDVMCS, resides in host VMM memory and is managed by VMM. - GPAs with shared bit cleared will be mapped by VMM firstly into a mirrored EPT, which resides in host VMM memory. Changes of the mirrored EPT are then propagated into a private EPT, which resides outside of host VMM memory and is managed by TDX module. Add the "gfn_shared_mask" field to the kvm_arch structure for each VM with a default value of 0. It will be set to the position of the GPA shared bit in GFN through TD specific initialization code. Provide helpers to utilize the gfn_shared_mask to determine whether a GPA is shared or private, retrieve the GPA shared bit value, and insert/strip shared bit to/from a GPA. Signed-off-by: Isaku Yamahata Co-developed-by: Rick Edgecombe Signed-off-by: Rick Edgecombe Reviewed-by: Binbin Wu --- TDX MMU Part 1: - Update commit log (Yan) - Fix documentation on kvm_is_private_gpa() (Binbin) v19: - Add comment on default vm case. - Added behavior table in the commit message - drop CONFIG_KVM_MMU_PRIVATE v18: - Added Reviewed-by Binbin --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/mmu.h | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index aabf1648a56a..d2f924f1d579 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1519,6 +1519,8 @@ struct kvm_arch { */ #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) struct kvm_mmu_memory_cache split_desc_cache; + + gfn_t gfn_shared_mask; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 3c7a88400cbb..dac13a2d944f 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -321,4 +321,37 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu, return gpa; return translate_nested_gpa(vcpu, gpa, access, exception); } + +/* + * default or SEV-SNP TDX: where S = (47 or 51) - 12 + * gfn_shared_mask 0 S bit + * is_private_gpa() always false true if GPA has S bit clear + * gfn_to_shared() nop set S bit + * gfn_to_private() nop clear S bit + * + * fault.is_private means that host page should be gotten from guest_memfd + * is_private_gpa() means that KVM MMU should invoke private MMU hooks. 
+ */ +static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm) +{ + return kvm->arch.gfn_shared_mask; +} + +static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn) +{ + return gfn | kvm_gfn_shared_mask(kvm); +} + +static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn) +{ + return gfn & ~kvm_gfn_shared_mask(kvm); +} + +static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa) +{ + gfn_t mask = kvm_gfn_shared_mask(kvm); + + return mask && !(gpa_to_gfn(gpa) & mask); +} + #endif From patchwork Wed May 15 00:59:41 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664497 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B3EF7EEBF; Wed, 15 May 2024 01:00:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734808; cv=none; b=h+AjV5FpgyWvCXe/4ch+NAI6qC6lVWYbUlk4LV0nZWNv25SUNxnCZiMF/szIJ3np5GVgiU9aPxmMudrYvuN3H1Nr3+93pYobyv8GGTFB9KLMQDnReP4t+aggy4eSK1So6HmDWndSy7AvxjWoE2cFSq9xK3mQd+ub03/dh1PxiTk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734808; c=relaxed/simple; bh=SwveWfaPqRHgaQcN/Q/O3ad7FH2+3nwsm35+tH34STs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=OWroOX0uro7G2TmNrweZdoZeLrmGssYrE0vxxk7CO89YdCj/BA2X0Rz7F34fPbDmHxcmkGs50AmIfkspSpDCxp2ZpYlYUtJXL1y71dcPkiRTkT7/+OPRKgkTGVlc9oYpFtdEdO0HVZ6INYGmggMBnP8a9schdC6DLrvrZGutVD8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=eAJjHG3z; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="eAJjHG3z" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734807; x=1747270807; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=SwveWfaPqRHgaQcN/Q/O3ad7FH2+3nwsm35+tH34STs=; b=eAJjHG3zUUYMDJsFvETIskXE7lYy47dZqZ96lp2re5rvkhOPUXml/ovy rWc/9japptXM7Ufkoeamtux3fNE3yJDmQ025T0Bg4OTvYnMFj1c7kVK4k myZeISle/7eyj8zZLM3PsU14hpSqO7mKZkGO3UoMLkQFl4T/YkDdIqFHn k3w3XAm8S3P2WiWe89eddWkouldRctKtvPVc/KsM4qq1I1ot8S86IZWno I4X7oAqrNDkckTnidWGkxMs/O3jiz5ojUD63AP2j714YIAJl1sXEXE8aN 61WAEq6m4+n8IQw56Rnq6bHP8bqVKmphFnFKmuaf42i2T/hENogNI/PEu w==; X-CSE-ConnectionGUID: fI0JRhpNQquzviuX5xtk2Q== X-CSE-MsgGUID: 0nSd47Y1RCiLlmpr7JWJbA== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613950" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613950" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:04 -0700 X-CSE-ConnectionGUID: 6R0YLLi+Q1+DSOYzLITwJQ== X-CSE-MsgGUID: 98kC1KSSTjW6O21Op7AO9g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942735" Received: 
from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:03 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 05/16] KVM: Add member to struct kvm_gfn_range for target alias Date: Tue, 14 May 2024 17:59:41 -0700 Message-Id: <20240515005952.3410568-6-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org MIME-Version: 1.0

From: Isaku Yamahata

Add a new member to struct kvm_gfn_range to indicate which mapping (private vs. shared) to operate on: enum kvm_process process. Update the core zapping operations to set it appropriately.

TDX utilizes two GPA aliases for the same memslots: one for private memory and one for shared memory. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations, such as guest-driven conversion of memory between private and shared, should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared), so zapping operations will by default zap both aliases. Add a field to struct kvm_gfn_range that allows callers to specify which aliases to target, so they can operate on only the aliases appropriate for their specific operation.

There was feedback that target aliases should be specified such that the default value (0) operates on both aliases. Several options were considered, including several variations of separate bools defined such that the default behavior was to process both aliases; they either allowed nonsensical configurations or were confusing for the caller. A simple enum was also explored and came close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value, and catch ranges that didn't have the target aliases specified by looking for that specific value.

Set the target alias appropriately for these MMU operations:
- For KVM's MMU notifier callbacks, zap shared pages only, because private pages won't have a userspace mapping.
- For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata
Co-developed-by: Rick Edgecombe
Signed-off-by: Rick Edgecombe
---
TDX MMU Part 1:
- Replaced KVM_PROCESS_BASED_ON_ARG with BUGGY_KVM_INVALIDATION to follow the original suggestion and not populate kvm_handle_gfn_range(). Also add a WARN_ON_ONCE().
- Move attribute specific logic into kvm_vm_set_mem_attributes() - Drop Sean's suggested-by tag as the solution has changed - Re-write commit log v18: - rebased to kvm-next v3: - Drop the KVM_GFN_RANGE flags - Updated struct kvm_gfn_range - Change kvm_arch_set_memory_attributes() to return bool for flush - Added set_memory_attributes x86 op for vendor backends - Refined commit message to describe TDX care concretely v2: - consolidate KVM_GFN_RANGE_FLAGS_GMEM_{PUNCH_HOLE, RELEASE} into KVM_GFN_RANGE_FLAGS_GMEM. - Update the commit message to describe TDX more. Drop SEV_SNP. --- arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++ include/linux/kvm_host.h | 8 ++++++++ virt/kvm/guest_memfd.c | 2 ++ virt/kvm/kvm_main.c | 14 ++++++++++++++ 4 files changed, 36 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4a8e819794db..1998267a330e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6979,6 +6979,12 @@ static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *s .start = slot->base_gfn, .end = slot->base_gfn + slot->npages, .may_block = true, + + /* + * All private and shared page should be zapped on memslot + * deletion. + */ + .process = KVM_PROCESS_PRIVATE_AND_SHARED, }; if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false)) @@ -7479,6 +7485,12 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) return false; + /* Unmmap the old attribute page. */ + if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) + range->process = KVM_PROCESS_SHARED; + else + range->process = KVM_PROCESS_PRIVATE; + return kvm_unmap_gfn_range(kvm, range); } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c3c922bf077f..f92c8b605b03 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -260,11 +260,19 @@ union kvm_mmu_notifier_arg { unsigned long attributes; }; +enum kvm_process { + BUGGY_KVM_INVALIDATION = 0, + KVM_PROCESS_SHARED = BIT(0), + KVM_PROCESS_PRIVATE = BIT(1), + KVM_PROCESS_PRIVATE_AND_SHARED = KVM_PROCESS_SHARED | KVM_PROCESS_PRIVATE, +}; + struct kvm_gfn_range { struct kvm_memory_slot *slot; gfn_t start; gfn_t end; union kvm_mmu_notifier_arg arg; + enum kvm_process process; bool may_block; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 9714add38852..e5ff6fde2db3 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -109,6 +109,8 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start, .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff, .slot = slot, .may_block = true, + /* guest memfd is relevant to only private mappings. */ + .process = KVM_PROCESS_PRIVATE, }; if (!found_memslot) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1b1ffb6fc786..cc434c7509f1 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -635,6 +635,11 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, */ gfn_range.arg = range->arg; gfn_range.may_block = range->may_block; + /* + * HVA-based notifications aren't relevant to private + * mappings as they don't have a userspace mapping. 
+ */ + gfn_range.process = KVM_PROCESS_SHARED; /* * {gfn(page) | page intersects with [hva_start, hva_end)} = @@ -2453,6 +2458,14 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm, gfn_range.arg = range->arg; gfn_range.may_block = range->may_block; + /* + * If/when KVM supports more attributes beyond private .vs shared, this + * _could_ set exclude_{private,shared} appropriately if the entire target + * range already has the desired private vs. shared state (it's unclear + * if that is a net win). For now, KVM reaches this point if and only + * if the private flag is being toggled, i.e. all mappings are in play. + */ + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { slots = __kvm_memslots(kvm, i); @@ -2509,6 +2522,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, struct kvm_mmu_notifier_range pre_set_range = { .start = start, .end = end, + .arg.attributes = attributes, .handler = kvm_pre_set_memory_attributes, .on_lock = kvm_mmu_invalidate_begin, .flush_on_ret = true, From patchwork Wed May 15 00:59:42 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664498 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 203E0107A6; Wed, 15 May 2024 01:00:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734808; cv=none; b=Z54P11ETZ2eLP+QsmavJTNbjPiXDh8jCLL6ZV19V6lZEazCcXZZoo+DpbNmnVhmlFXk7ETslbXSgZcIaajncfYh9CMO5YheOV61oofWvTCkmaRSyvXlIXcijQr5zL8mQ5uUZYO9ycPcI3djP75nhffbPP7IcLDkttmgWgzTWXjc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734808; c=relaxed/simple; bh=3riEeBhx8HzcTBEGj7bVXK60LuEYP5+ldqG1Lza6N1w=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=u09Xs/CSPtiFaz3KO1b7K8AB+m7WuPqJv7PsYxRKhKeW4KeGwgoAI3Q7VmapigESVGbraxFkZyojKoAg32gXFoV/cZLK3mVc/hJkpMQ3VTsrbP44/GdQ3zHXoyYPkkxllM4wsolNwdHHd4dFtAWUen3j3mUTzbzDxyd8UH5iTFo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=kis8Slub; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="kis8Slub" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734807; x=1747270807; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=3riEeBhx8HzcTBEGj7bVXK60LuEYP5+ldqG1Lza6N1w=; b=kis8SlubXlqWRPw0CjW3BqFRbt9q9lANPwWHjQHLqXv+zwwi1LeD8cEA bUw/mYu3ki4UdgDngF8adGDq44jxibmZxuTzDesXM3Guvm9sJfNPwhjlR ycpSza3zY6yoiw5w2NydDTFdxlAtaUAypqZskJGmzP4ALPvhB4cDOe+yy nFKVv25BgudvMCW3EKpIIWHzj1/E78bp8Gc19Q3aDT++4fLeMrwtkTij5 BfwvYmsvNjIMh+STOuh6E75Kvejpcgu6UKVK2dWjxNR5aZiAWmdiCakLZ HNefuZ3/x6vcmj85eLqEZ3eQvoD4ikx+57hHE3Q8OTduYLAFZpUt4UWK/ Q==; X-CSE-ConnectionGUID: Q1eCECccTP2qSzmiFe906g== 
X-CSE-MsgGUID: /g9g7oVwSbCnOhZA/HoWTg== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613955" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613955" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:04 -0700 X-CSE-ConnectionGUID: hLuB4nMQRHKGeaLU1S3wxQ== X-CSE-MsgGUID: jkgl2X01RX2y2ngdP0369g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942747" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:03 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 06/16] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role Date: Tue, 14 May 2024 17:59:42 -0700 Message-Id: <20240515005952.3410568-7-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Introduce a "is_private" member to the kvm_mmu_page_role union to identify SPTEs associated with the mirrored EPT. The TDX module maintains the private half of the EPT mapped in the TD in its protected memory. KVM keeps a copy of the private GPAs in a mirrored EPT tree within host memory, recording the root page HPA in each vCPU's mmu->private_root_hpa. This "is_private" attribute enables vCPUs to find and get the root page of mirrored EPT from the MMU root list for a guest TD. This also allows KVM MMU code to detect changes in mirrored EPT according to the "is_private" mmu page role and propagate the changes to the private EPT managed by TDX module. Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - Remove warning and NULL check in is_private_sptep() (Rick) - Update commit log (Yan) v19: - Fix is_private_sptep() when NULL case. - drop CONFIG_KVM_MMU_PRIVATE --- arch/x86/include/asm/kvm_host.h | 13 ++++++++++++- arch/x86/kvm/mmu/mmu_internal.h | 5 +++++ arch/x86/kvm/mmu/spte.h | 5 +++++ 3 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d2f924f1d579..13119d4e44e5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -351,7 +351,8 @@ union kvm_mmu_page_role { unsigned ad_disabled:1; unsigned guest_mode:1; unsigned passthrough:1; - unsigned :5; + unsigned is_private:1; + unsigned :4; /* * This is left at the top of the word so that @@ -363,6 +364,16 @@ union kvm_mmu_page_role { }; }; +static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role) +{ + return !!role.is_private; +} + +static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role) +{ + role->is_private = 1; +} + /* * kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties * relevant to the current MMU configuration. 
When loading CR0, CR4, or EFER, diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 706f0ce8784c..b114589a595a 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -145,6 +145,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp) return kvm_mmu_role_as_id(sp->role); } +static inline bool is_private_sp(const struct kvm_mmu_page *sp) +{ + return kvm_mmu_page_role_is_private(sp->role); +} + static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp) { /* diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 5dd5405fa07a..d0df691ced5c 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -265,6 +265,11 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root) return spte_to_child_sp(root); } +static inline bool is_private_sptep(u64 *sptep) +{ + return is_private_sp(sptep_to_sp(sptep)); +} + static inline bool is_mmio_spte(struct kvm *kvm, u64 spte) { return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value && From patchwork Wed May 15 00:59:43 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664501 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CD9801EEE3; Wed, 15 May 2024 01:00:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734812; cv=none; b=VHcGs0Vfh7PKv1cekngrzuYVEjPr2qXlk5TeFjQeUY5EaRqNWwpuNfJRZpyAnEIjNkTWgj3PQKBAogqKW4Oyum3UCLdA1L/PllAQCbwaB4gpbjfQLMqMXaSOipJrktgD3icBUx08XuYh4w5M6J5BVr1o2gfUwPIedH4tgGbkL3k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734812; c=relaxed/simple; bh=ZhLlZRSc1N8HWBGh0Hwv4AHosc/nMbevb/CrbuiWxcc=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=XkTOpelE2FXTVl+QWHOX+gibbz3I0xXcUeBaa/YIqeVaFl2FA3ahUHhvanrOX+3ml8Os1RlY0klOv5bD6BTfpEYthfrVQQGUdZNSl5D2NvxfB4z6X7G92FA5PyfeRHBSSsNwc8w1H7HGB73llZ/kIK595v9AQth4fdNFb3UeH5o= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=TnRe3VTF; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="TnRe3VTF" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734809; x=1747270809; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ZhLlZRSc1N8HWBGh0Hwv4AHosc/nMbevb/CrbuiWxcc=; b=TnRe3VTF/zs7alAOsZK9NV7y9HuQk5Z011VFxNlPpjD4hTcIHRGbK7m5 w1GpO7dJKKFj3HYX1zDL4sUvtUIoh2UAcm0ZjkquZZ4J3P+NhZLwqTe+F roNNP6sp+Y8+B9o7OUkpGMnz9jKeMXU89ZsTeKfSG6JdyxuF+GEN6g3/A 7lkvJ+5E3fpkoFcsntzBK9ytLMmMOR8cIRlQl3j1HleUf9YCYMCucsRdf cnGA2gwjUQvzht5N/RIxRMz1ZBTMtXnEjFweYZtYhnzXFtNEbtuP5pP7a NO7PjZfqtq4eGtAD3dJTXcUW+z4DeuA3rN9SPbF00PvnpZEIOTK2+dxtp w==; X-CSE-ConnectionGUID: 
4g2iXPXRR7it1SjX3qdpGg== X-CSE-MsgGUID: 80cxgbCfQP+fU8k+eIMuJQ== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613959" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613959" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:05 -0700 X-CSE-ConnectionGUID: cY4EQ/NfSsqzZ1SkRhSwGg== X-CSE-MsgGUID: Dii6f8PhTQi6/zFUhC+cPQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942754" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:04 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 07/16] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page Date: Tue, 14 May 2024 17:59:43 -0700 Message-Id: <20240515005952.3410568-8-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Add a private pointer to struct kvm_mmu_page for the private page table and add helper functions to allocate/initialize/free a private page table page. Because KVM TDP MMU doesn't use unsync_children and write_flooding_count, pack them to have room for a pointer and use a union to avoid memory overhead. For private GPA, CPU refers to a private page table whose contents are encrypted. The dedicated APIs to operate on it (e.g. updating/reading its PTE entry) are used, and their cost is expensive. When KVM resolves the KVM page fault, it walks the page tables. To reuse the existing KVM MMU code and mitigate the heavy cost of directly walking the private page table, allocate one more page for the mirrored page table for the KVM MMU code to directly walk. Resolve the KVM page fault with the existing code, and do additional operations necessary for the private page table. To distinguish such cases, the existing KVM page table is called a shared page table (i.e., not associated with a private page table), and the page table with a private page table is called a mirrored page table. The relationship is depicted below. KVM page fault | | | V | -------------+---------- | | | | V V | shared GPA private GPA | | | | V V | shared PT root mirrored PT root | private PT root | | | | V V | V shared PT mirrored PT ----propagate----> private PT | | | | | \-----------------+------\ | | | | | V | V V shared guest page | private guest page | non-encrypted memory | encrypted memory | PT: Page table Shared PT: visible to KVM, and the CPU uses it for shared mappings. Private PT: the CPU uses it, but it is invisible to KVM. TDX module updates this table to map private guest pages. Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it to propagate PT change to the actual private PT. Co-developed-by: Yan Zhao Signed-off-by: Yan Zhao Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe Reviewed-by: Binbin Wu --- TDX MMU Part 1: - Rename terminology, dummy PT => mirror PT. and updated the commit message By Rick and Kai. 
- Added a comment on union of private_spt by Rick. - Don't handle the root case in kvm_mmu_alloc_private_spt(), it will not be needed in future patches. (Rick) - Update comments (Yan) - Remove kvm_mmu_init_private_spt(), open code it in later patches (Yan) v19: - typo in the comment in kvm_mmu_alloc_private_spt() - drop CONFIG_KVM_MMU_PRIVATE --- arch/x86/include/asm/kvm_host.h | 5 +++++ arch/x86/kvm/mmu/mmu.c | 7 +++++++ arch/x86/kvm/mmu/mmu_internal.h | 36 +++++++++++++++++++++++++++++---- arch/x86/kvm/mmu/tdp_mmu.c | 1 + 4 files changed, 45 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 13119d4e44e5..d010ca5c7f44 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -828,6 +828,11 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_shadow_page_cache; struct kvm_mmu_memory_cache mmu_shadowed_info_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; + /* + * This cache is to allocate private page table. E.g. private EPT used + * by the TDX module. + */ + struct kvm_mmu_memory_cache mmu_private_spt_cache; /* * QEMU userspace and the guest each have their own FPU state. diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 1998267a330e..d5cf5b15a10e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM); if (r) return r; + if (kvm_gfn_shared_mask(vcpu->kvm)) { + r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache, + PT64_ROOT_MAX_LEVEL); + if (r) + return r; + } r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL); if (r) @@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu) kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache); kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache); kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache); + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache); kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache); } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index b114589a595a..0f1a9d733d9e 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -101,7 +101,22 @@ struct kvm_mmu_page { int root_count; refcount_t tdp_mmu_root_count; }; - unsigned int unsync_children; + union { + /* Those two members aren't used for TDP MMU */ + struct { + unsigned int unsync_children; + /* + * Number of writes since the last time traversal + * visited this page. + */ + atomic_t write_flooding_count; + }; + /* + * Page table page of private PT. + * Passed to TDX module, not accessed by KVM. + */ + void *private_spt; + }; union { struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ tdp_ptep_t ptep; @@ -124,9 +139,6 @@ struct kvm_mmu_page { int clear_spte_count; #endif - /* Number of writes since the last time traversal visited this page. */ - atomic_t write_flooding_count; - #ifdef CONFIG_X86_64 /* Used for freeing the page asynchronously if it is a TDP MMU page. 
*/ struct rcu_head rcu_head; @@ -150,6 +162,22 @@ static inline bool is_private_sp(const struct kvm_mmu_page *sp) return kvm_mmu_page_role_is_private(sp->role); } +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp) +{ + return sp->private_spt; +} + +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) +{ + /* + * private_spt is allocated for TDX module to hold private EPT mappings, + * TDX module will initialize the page by itself. + * Therefore, KVM does not need to initialize or access private_spt. + * KVM only interacts with sp->spt for mirrored EPT operations. + */ + sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache); +} + static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp) { /* diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1086e3b2aa5c..6fa910b017d1 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) static void tdp_mmu_free_sp(struct kvm_mmu_page *sp) { + free_page((unsigned long)sp->private_spt); free_page((unsigned long)sp->spt); kmem_cache_free(mmu_page_header_cache, sp); } From patchwork Wed May 15 00:59:44 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664499 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B2BAB1EB26; Wed, 15 May 2024 01:00:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734810; cv=none; b=kKAiCysAczkc2spuOD3iynWvWjFgYPQUFMb+PDB36FyHQ90jt2tZQV1sb3KpqnKfS2fWlBZ4xNe2B8TW0nY0kTp/8tMtpR7LfO2upXRu1tCfj85zhvw1nNgrSz2HltM0WXjSiLi1VQiDcotfFK7VfrZgHZAU5yXCBFoj3dfl9Vw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734810; c=relaxed/simple; bh=+6tj0kEKZAuUZugRM5HgXHgBCG9pzGLg+BLWoV5ezpM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=bOPRQNbGqCOkFEdgsuK712/Vb7sdAfkucoMH8hFOSIXtyj3I9W2a70SVogjyw/cFnSnHBZj7K/1bpJ1gqGv6BO1Ctfiu4hJJW2nm2wdzoDTu9wUUTgKfAZVV6TrloKwIEk6E0nMpR9TmrU612Bb4KLNVlQFEvBs5w01ZJAfyK2c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=nOy4ugC1; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="nOy4ugC1" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734809; x=1747270809; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=+6tj0kEKZAuUZugRM5HgXHgBCG9pzGLg+BLWoV5ezpM=; b=nOy4ugC1lSOA3L5juhZaKHEFM0JrooZMkKrefZuLDxx/7bP/NKmDih6q 93xuDPn9SZcKy7UijJwZqlMzdeLKBDNqflMyXn2RbKzXoErfdq+Z7ud8t pRoS3+L7WH9jNXqAUC2LTezh+joU1MXsmQxLi6+GCuiwfwVbmcT4yes6T 
zb8dHEHTCsGSl/ThrD3FQXuhv+RYG7z+gKfYE2AsJoj19g12ZvDax/gHN lLfYKDgWBA6pYaIgGAn+rAD+cv79z3p0ip6agFLdBCQorprUI7pFcau3k k3tBofCh3557UhYOzkUQUz3NDAYxbcu2M6gLbs2umEotX2dImd1octJfv A==; X-CSE-ConnectionGUID: wODc1ON9TVCTOF3fHbs+lg== X-CSE-MsgGUID: LUCFnbToQb6dY0cP9D6wBw== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613963" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613963" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:05 -0700 X-CSE-ConnectionGUID: bbyX+Y9bRnCQGjviptusEw== X-CSE-MsgGUID: GMk516bPQ/inx3OLKItl4A== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942765" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:04 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 08/16] KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX Date: Tue, 14 May 2024 17:59:44 -0700 Message-Id: <20240515005952.3410568-9-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap guest mappings so they can be faulted in with different PTE properties. For TDX private memory this technique is fundamentally not possible. Remapping private memory requires the guest to "accept" it, and also the needed PTE properties are not currently supported by TDX for private memory. These CPU features are: 1) MTRR update 2) CR0.CD update 3) Non-coherent DMA status update 4) APICV update Since they cannot be supported, they should be blocked from being exercised by a TD. In the case of CRO.CD, the feature is fundamentally not supported for TDX as it cannot see the guest registers. For APICV inhibit it in future changes. Guest MTRR support is more of an interesting case. Supported versions of the TDX module fix the MTRR CPUID bit to 1, but as previously discussed, it is not possible to fully support the feature. This leaves KVM with a few options: - Support a modified version of the architecture where the caching attributes are ignored for private memory. - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module bug. With the additional consideration that likely guest MTRR support in KVM will be going away, the later option is the best. Prevent MTRR MSR writes from calling kvm_zap_gfn_range() in future changes. Lastly, the most interesting case is non-coherent DMA status updates. There isn't a way to reject the call. KVM is just notified that there is a non-coherent DMA device attached, and expected to act accordingly. For normal VMs today, that means to start respecting guest PAT. However, recently there has been a proposal to avoid doing this on selfsnoop CPUs (see link). On such CPUs it should not be problematic to simply always configure the EPT to honor guest PAT. 
In future changes TDX can enforce this behavior for shared memory, resulting in shared memory always respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be called in this case either. Unfortunately, this will result in different cache attributes between private and shared memory, as private memory is always WB and cannot be changed by the VMM on current TDX modules. But it can't really be helped while also supporting non-coherent DMA devices. Since all callers will be prevented from calling kvm_zap_gfn_range() in future changes, report a bug and terminate the guest if other future changes to KVM result in triggering kvm_zap_gfn_range() for a TD. For lack of a better method currently, use kvm_gfn_shared_mask() to determine if private memory cannot be zapped (as in TDX, the only VM type that sets it). Link: https://lore.kernel.org/all/20240309010929.1403984-6-seanjc@google.com/ Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory" - Add this KVM_BUG_ON() instead --- arch/x86/kvm/mmu/mmu.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index d5cf5b15a10e..808805b3478d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end); - if (tdp_mmu_enabled) + if (tdp_mmu_enabled) { + /* + * kvm_zap_gfn_range() is used when MTRR or PAT memory + * type was changed. TDX can't handle zapping the private + * mapping, but it's ok because KVM doesn't support either of + * those features for TDX. In case a new caller appears, BUG + * the VM if it's called for solutions with private aliases. 
+ */ + KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm); flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush); + } if (flush) kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start); From patchwork Wed May 15 00:59:45 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664500 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 170C9208CA; Wed, 15 May 2024 01:00:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734810; cv=none; b=b1vCsnPZcwNO0ScMairq4uhrL+ABdJylPRcDllP8IEY+ei5z8pZwCazFF3FRF5r7znVHDtWzx3tTH+v+YHNakGO0ekpLWmPN0Yzm521xZJryzv/96XWv7G9zDwGCr0extwbs8Zn2m/iWTbTKqm0fzhAIdyM4l3qy6ZljGIKtpaI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734810; c=relaxed/simple; bh=F8pMCkdedaDEThDQRu19WnnjslknaUfMOKX2A+N1yxQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=nD8S8lczp9Rfquuhtim4MfeNOXiy+h4ugXeZTdItiS4nLRWja3UtHiMgUoOX3zwhQ35N0nfPBiiQTKDRb6BuKJXslnnFX5fvIF6ppqpb82KK0tSvLGT289APcVL+U+wQMYKeVQpZtBOBLnL4ugxbekYlSmwx6L2qq/21uw7vW7E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=iq4SykKi; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="iq4SykKi" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734809; x=1747270809; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=F8pMCkdedaDEThDQRu19WnnjslknaUfMOKX2A+N1yxQ=; b=iq4SykKiBCkQRz22jrQ8/Fg8/A2SipiWvWkvsku1XngWBLP9Z2ylxqos SwxC9ZBECrxA5toFzxEfa7L0luihlFJVGr1oEJTdUEiAtHRm2ERiQvHR9 HYzUle236moJVhbfGdQzkWIXjfUle8A7gkPx7luh4geyDLMm32Ebuz4Dj Lw7NKIPEkMeP4pr5HKL1zD2gbSj4R5tvJjBq5FVB1BtZM/fdTi+5XWKpc Er48fqj46pZNW4zptZj3OuHQdz9cqFyYY1RpqaclQyxPctd8yQaXv3wqD djomzv1WJeNbQe+IwzBhCUCjQfo5pDmJ/ehMQgGMuIk9IlT96ya65iGLm A==; X-CSE-ConnectionGUID: aucpdI9cT6emLVw5xIz7CA== X-CSE-MsgGUID: 5K97AqX5Sxq2COkilqo6rA== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613967" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613967" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:06 -0700 X-CSE-ConnectionGUID: AOJ4G5LsTwGNnoW5TOQ9hw== X-CSE-MsgGUID: puttENqYQQS+WPln4B9VHA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942777" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:05 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, 
erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 09/16] KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void Date: Tue, 14 May 2024 17:59:45 -0700 Message-Id: <20240515005952.3410568-10-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 The kvm_tdp_mmu_alloc_root() function currently always returns 0. This allows for the caller, mmu_alloc_direct_roots(), to call kvm_tdp_mmu_alloc_root() and also return 0 in one line: return kvm_tdp_mmu_alloc_root(vcpu); So it is useful even though the return value of kvm_tdp_mmu_alloc_root() is always the same. However, in future changes, kvm_tdp_mmu_alloc_root() will be called twice in mmu_alloc_direct_roots(). This will force the first call to either awkwardly handle the return value that will always be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void. Do it in a separate change so the future change will be cleaner. Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu/mmu.c | 6 ++++-- arch/x86/kvm/mmu/tdp_mmu.c | 3 +-- arch/x86/kvm/mmu/tdp_mmu.h | 2 +- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 808805b3478d..76f92cb37a96 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3700,8 +3700,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) unsigned i; int r; - if (tdp_mmu_enabled) - return kvm_tdp_mmu_alloc_root(vcpu); + if (tdp_mmu_enabled) { + kvm_tdp_mmu_alloc_root(vcpu); + return 0; + } write_lock(&vcpu->kvm->mmu_lock); r = make_mmu_pages_available(vcpu); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6fa910b017d1..0d6d96d86703 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp, tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role); } -int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; union kvm_mmu_page_role role = mmu->root_role; @@ -285,7 +285,6 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) */ mmu->root.hpa = __pa(root->spt); mmu->root.pgd = 0; - return 0; } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 58b55e61bd33..437ddd4937a9 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -10,7 +10,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm); void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm); -int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu); +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu); __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) { From patchwork Wed May 15 00:59:46 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664503 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B28712B9C4; Wed, 15 May 2024 01:00:10 
+0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734813; cv=none; b=SEVEQBg08uT0QgIxWfwTADqIWRrMynMQNZNlHL7d1KWya1sflfu3oQWR5k0D3idwaAa4IfO322WuYADA94iwV0i+9dbPCClVNlOaMHn9ertJt6Je1P+cVNkRKec4Kyq/PKpYwH3vKDOp6czO8tbMfmHissafAOSpBnasg5li8cU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734813; c=relaxed/simple; bh=Ts0Q7br9qwoNR0tYNUlPsMXwR1VYg0z28EqH1RigXkg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=s/ReA4XD5LKmE6ThF6OsgnBRDG3siD//d3PlS4EMzcini7aZE1zDjid/1gIcEYGQZKYwfYizCCLfQhRWA5NdyLP4r6lWnptcR5IpuUEhM0dP3xfVF5YRY21jxCHjBtedaUnAtyjND5cNCjL/kAx4D06AjgcQoEl6Ms7KWVu3nDs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=OR6iHeN3; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="OR6iHeN3" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734811; x=1747270811; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Ts0Q7br9qwoNR0tYNUlPsMXwR1VYg0z28EqH1RigXkg=; b=OR6iHeN3gP1WEX/VcXP9ABpboyDmJbOZejxPDEALNseXeuDwwVhCILHO 16rChk3c+HXt6QH5dhLGwN8L+CQ0uC/X39UWPp698YhaXAW0dyulfxgQX M20YOCgDG7kIRk83lRm4BFf3XtGiP437J15W/DmX74g1XmfRtFclaoAyU mv75byGK202tS4uS6T6vamsfu6jdBN+4W42as7bKWzwIWezZMIKCkx8GU 2jdq9gC6baSPFClKOaHqVA5xTiEDS/q01Av56oUPUtEUWxwRYywpX6e4l hrw3/5HPZn0Fs+PSBTrqcMsxsbzRzlP+mkYFquFJbiLsVXt56SmGJ/qUv g==; X-CSE-ConnectionGUID: E8uWYvgARoCFGu++sO2ANg== X-CSE-MsgGUID: VBgeHVCbTPCEhJgJLg86Jg== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613973" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613973" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:06 -0700 X-CSE-ConnectionGUID: Ub/KEAqETpSJu78Qo3L98w== X-CSE-MsgGUID: 2Fpj+nkiQiqzX5Isi6cI8g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942788" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:05 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU Date: Tue, 14 May 2024 17:59:46 -0700 Message-Id: <20240515005952.3410568-11-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Allocate mirrored page table for the private page table and 
implement MMU hooks to operate on the private page table. To handle a page fault to a private GPA, KVM walks the mirrored page table in unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate changes from the mirrored page table to the private page table.

       private KVM page fault          |
                  |                    |
                  V                    |
             private GPA               |      CPU protected EPTP
                  |                    |              |
                  V                    |              V
          mirrored PT root             |       private PT root
                  |                    |              |
                  V                    |              V
            mirrored PT --hook to propagate-->private PT
                  |                    |              |
                  \--------------------+------\       |
                                       |      |       |
                                       |      V       V
                                       |     private guest page
                                       |
       non-encrypted memory            |      encrypted memory
                                       |

PT: page table
Private PT: the CPU uses it, but it is invisible to KVM. The TDX module manages this table to map private guest pages.
Mirrored PT: it is visible to KVM, but the CPU doesn't use it. KVM uses it to propagate PT changes to the actual private PT.

SPTEs in the mirrored page table (referred to as mirrored SPTEs hereafter) can be modified atomically with mmu_lock held for read; however, the MMU hooks to the private page table are not atomic operations. To address this, a special REMOVED_SPTE is introduced and the sequence below is used when mirrored SPTEs are updated atomically.

1. The mirrored SPTE is first atomically written to REMOVED_SPTE.
2. The successful updater of the mirrored SPTE in step 1 proceeds with the following steps.
3. Invoke MMU hooks to modify the private page table with the target value.
4. (a) If the hook succeeds, update the mirrored SPTE to the target value.
   (b) If the hook fails, restore the mirrored SPTE to the original value.

The KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE. This sequence also applies when SPTEs are atomically updated from non-present to present in order to prevent potential conflicts when multiple vCPUs attempt to set private SPTEs to a different page size simultaneously, though only 4K page size is currently supported for the private page table. 2M page support can be done in future patches.

Signed-off-by: Isaku Yamahata
Co-developed-by: Kai Huang
Signed-off-by: Kai Huang
Co-developed-by: Yan Zhao
Signed-off-by: Yan Zhao
Co-developed-by: Rick Edgecombe
Signed-off-by: Rick Edgecombe
---
TDX MMU Part 1:
- Remove unnecessary gfn, access twist in tdp_mmu_map_handle_target_level(). (Chao Gao)
- Open code call to kvm_mmu_alloc_private_spt() instead of doing it in tdp_mmu_alloc_sp()
- Update comment in set_private_spte_present() (Yan)
- Open code call to kvm_mmu_init_private_spt() (Yan)
- Add comments on TDX MMU hooks (Yan)
- Fix various whitespace alignment (Yan)
- Remove pointless warnings and conditionals in handle_removed_private_spte() (Yan)
- Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
- Remove incorrect comment in handle_changed_spte() (Yan)
- Remove unneeded kvm_pfn_to_refcounted_page() and is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
- Do kvm_gfn_for_root() branchless (Rick)
- Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
- Add comment for stripping shared bit for fault.gfn (Chao)
v19:
- drop CONFIG_KVM_MMU_PRIVATE
v18:
- Rename freezed => frozen
v14 -> v15:
- Refined is_private condition check in kvm_tdp_mmu_map(). Add kvm_gfn_shared_mask() check.
- catch up for struct kvm_range change --- arch/x86/include/asm/kvm-x86-ops.h | 5 + arch/x86/include/asm/kvm_host.h | 25 +++ arch/x86/kvm/mmu/mmu.c | 13 +- arch/x86/kvm/mmu/mmu_internal.h | 19 +- arch/x86/kvm/mmu/tdp_iter.h | 2 +- arch/x86/kvm/mmu/tdp_mmu.c | 269 +++++++++++++++++++++++++---- arch/x86/kvm/mmu/tdp_mmu.h | 2 +- 7 files changed, 293 insertions(+), 42 deletions(-) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index 566d19b02483..d13cb4b8fce6 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr) KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr) KVM_X86_OP_OPTIONAL_RET0(get_mt_mask) KVM_X86_OP(load_mmu_pgd) +KVM_X86_OP_OPTIONAL(link_private_spt) +KVM_X86_OP_OPTIONAL(free_private_spt) +KVM_X86_OP_OPTIONAL(set_private_spte) +KVM_X86_OP_OPTIONAL(remove_private_spte) +KVM_X86_OP_OPTIONAL(zap_private_spte) KVM_X86_OP(has_wbinvd_exit) KVM_X86_OP(get_l2_tsc_offset) KVM_X86_OP(get_l2_tsc_multiplier) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d010ca5c7f44..20fa8fa58692 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -470,6 +470,7 @@ struct kvm_mmu { int (*sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i); struct kvm_mmu_root_info root; + hpa_t private_root_hpa; union kvm_cpu_role cpu_role; union kvm_mmu_page_role root_role; @@ -1747,6 +1748,30 @@ struct kvm_x86_ops { void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level); + /* Add a page as page table page into private page table */ + int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + void *private_spt); + /* + * Free a page table page of private page table. + * Only expected to be called when guest is not active, specifically + * during VM destruction phase. 
+ */ + int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + void *private_spt); + + /* Add a guest private page into private page table */ + int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + kvm_pfn_t pfn); + + /* Remove a guest private page from private page table*/ + int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, + kvm_pfn_t pfn); + /* + * Keep a guest private page mapped in private page table, but clear its + * present bit + */ + int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level); + bool (*has_wbinvd_exit)(void); u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 76f92cb37a96..2506d6277818 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) int r; if (tdp_mmu_enabled) { - kvm_tdp_mmu_alloc_root(vcpu); + if (kvm_gfn_shared_mask(vcpu->kvm)) + kvm_tdp_mmu_alloc_root(vcpu, true); + kvm_tdp_mmu_alloc_root(vcpu, false); return 0; } @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) { for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) { int page_num = KVM_PAGES_PER_HPAGE(fault->max_level); - gfn_t base = gfn_round_for_level(fault->gfn, + gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr), fault->max_level); if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num)) @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) mmu->root.hpa = INVALID_PAGE; mmu->root.pgd = 0; + mmu->private_root_hpa = INVALID_PAGE; for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID; @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void) void kvm_mmu_destroy(struct kvm_vcpu *vcpu) { kvm_mmu_unload(vcpu); + if (tdp_mmu_enabled) { + read_lock(&vcpu->kvm->mmu_lock); + mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa, + NULL); + read_unlock(&vcpu->kvm->mmu_lock); + } free_mmu_pages(&vcpu->arch.root_mmu); free_mmu_pages(&vcpu->arch.guest_mmu); mmu_free_memory_caches(vcpu); diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 0f1a9d733d9e..3a7fe9261e23 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -6,6 +6,8 @@ #include #include +#include "mmu.h" + #ifdef CONFIG_KVM_PROVE_MMU #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x) #else @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache); } +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root, + gfn_t gfn) +{ + gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn); + + /* Set shared bit if not private */ + gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm); + return gfn_for_root; +} + static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp) { /* @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp int r; if (vcpu->arch.mmu->root_role.direct) { - fault.gfn = fault.addr >> PAGE_SHIFT; + /* + * Things like memslots don't understand the concept of a shared + * bit. Strip it so that the GFN can be used like normal, and the + * fault.addr can be used when the shared bit is needed. 
+ */ + fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm); fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn); } diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index fae559559a80..8a64bcef9deb 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -91,7 +91,7 @@ struct tdp_iter { tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL]; /* A pointer to the current SPTE */ tdp_ptep_t sptep; - /* The lowest GFN mapped by the current SPTE */ + /* The lowest GFN (shared bits included) mapped by the current SPTE */ gfn_t gfn; /* The level of the root page given to the iterator */ int root_level; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 0d6d96d86703..810d552e9bf6 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -224,7 +224,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp, tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role); } -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private) { struct kvm_mmu *mmu = vcpu->arch.mmu; union kvm_mmu_page_role role = mmu->root_role; @@ -232,6 +232,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) struct kvm *kvm = vcpu->kvm; struct kvm_mmu_page *root; + if (private) + kvm_mmu_page_role_set_private(&role); + /* * Check for an existing root before acquiring the pages lock to avoid * unnecessary serialization if multiple vCPUs are loading a new root. @@ -283,13 +286,17 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu) * and actually consuming the root if it's invalidated after dropping * mmu_lock, and the root can't be freed as this vCPU holds a reference. */ - mmu->root.hpa = __pa(root->spt); - mmu->root.pgd = 0; + if (private) { + mmu->private_root_hpa = __pa(root->spt); + } else { + mmu->root.hpa = __pa(root->spt); + mmu->root.pgd = 0; + } } static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level, - bool shared); + u64 old_spte, u64 new_spte, + union kvm_mmu_page_role role, bool shared); static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) { @@ -416,12 +423,124 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) REMOVED_SPTE, level); } handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, - old_spte, REMOVED_SPTE, level, shared); + old_spte, REMOVED_SPTE, sp->role, + shared); + } + + if (is_private_sp(sp) && + WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level, + kvm_mmu_private_spt(sp)))) { + /* + * Failed to free page table page in private page table and + * there is nothing to do further. + * Intentionally leak the page to prevent the kernel from + * accessing the encrypted page. 
+ */ + sp->private_spt = NULL; } call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); } +static void *get_private_spt(gfn_t gfn, u64 new_spte, int level) +{ + if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) { + struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte))); + void *private_spt = kvm_mmu_private_spt(sp); + + WARN_ON_ONCE(!private_spt); + WARN_ON_ONCE(sp->role.level + 1 != level); + WARN_ON_ONCE(sp->gfn != gfn); + return private_spt; + } + + return NULL; +} + +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn, + u64 old_spte, u64 new_spte, + int level) +{ + bool was_present = is_shadow_present_pte(old_spte); + bool was_leaf = was_present && is_last_spte(old_spte, level); + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); + int ret; + + /* + * Allow only leaf page to be zapped. Reclaim non-leaf page tables page + * at destroying VM. + */ + if (!was_leaf) + return; + + /* Zapping leaf spte is allowed only when write lock is held. */ + lockdep_assert_held_write(&kvm->mmu_lock); + ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level); + /* Because write lock is held, operation should success. */ + if (KVM_BUG_ON(ret, kvm)) + return; + + ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn); + KVM_BUG_ON(ret, kvm); +} + +static int __must_check __set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep, + gfn_t gfn, u64 old_spte, + u64 new_spte, int level) +{ + bool was_present = is_shadow_present_pte(old_spte); + bool is_present = is_shadow_present_pte(new_spte); + bool is_leaf = is_present && is_last_spte(new_spte, level); + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); + int ret = 0; + + lockdep_assert_held(&kvm->mmu_lock); + /* TDP MMU doesn't change present -> present */ + KVM_BUG_ON(was_present, kvm); + + /* + * Use different call to either set up middle level + * private page table, or leaf. + */ + if (is_leaf) { + ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn); + } else { + void *private_spt = get_private_spt(gfn, new_spte, level); + + KVM_BUG_ON(!private_spt, kvm); + ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt); + } + + return ret; +} + +static int __must_check set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep, + gfn_t gfn, u64 old_spte, + u64 new_spte, int level) +{ + int ret; + + /* + * For private page table, callbacks are needed to propagate SPTE + * change into the private page table. In order to atomically update + * both the SPTE and the private page tables with callbacks, utilize + * freezing SPTE. + * - Freeze the SPTE. Set entry to REMOVED_SPTE. + * - Trigger callbacks for private page tables. + * - Unfreeze the SPTE. Set the entry to new_spte. 
+ */ + lockdep_assert_held(&kvm->mmu_lock); + if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE)) + return -EBUSY; + + ret = __set_private_spte_present(kvm, sptep, gfn, old_spte, new_spte, level); + if (ret) + __kvm_tdp_mmu_write_spte(sptep, old_spte); + else + __kvm_tdp_mmu_write_spte(sptep, new_spte); + return ret; +} + /** * handle_changed_spte - handle bookkeeping associated with an SPTE change * @kvm: kvm instance @@ -429,7 +548,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) * @gfn: the base GFN that was mapped by the SPTE * @old_spte: The value of the SPTE before the change * @new_spte: The value of the SPTE after the change - * @level: the level of the PT the SPTE is part of in the paging structure + * @role: the role of the PT the SPTE is part of in the paging structure * @shared: This operation may not be running under the exclusive use of * the MMU lock and the operation must synchronize with other * threads that might be modifying SPTEs. @@ -439,14 +558,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared) * and fast_pf_fix_direct_spte()). */ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, - u64 old_spte, u64 new_spte, int level, - bool shared) + u64 old_spte, u64 new_spte, + union kvm_mmu_page_role role, bool shared) { + bool is_private = kvm_mmu_page_role_is_private(role); + int level = role.level; bool was_present = is_shadow_present_pte(old_spte); bool is_present = is_shadow_present_pte(new_spte); bool was_leaf = was_present && is_last_spte(old_spte, level); bool is_leaf = is_present && is_last_spte(new_spte, level); - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte); + kvm_pfn_t old_pfn = spte_to_pfn(old_spte); + kvm_pfn_t new_pfn = spte_to_pfn(new_spte); + bool pfn_changed = old_pfn != new_pfn; WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL); WARN_ON_ONCE(level < PG_LEVEL_4K); @@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, if (was_leaf && is_dirty_spte(old_spte) && (!is_present || !is_dirty_spte(new_spte) || pfn_changed)) - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); + kvm_set_pfn_dirty(old_pfn); /* * Recursively handle child PTs if the change removed a subtree from @@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, * pages are kernel allocations and should never be migrated. */ if (was_present && !was_leaf && - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) { + KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)), + kvm); handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared); + } + + if (is_private && !is_present) + handle_removed_private_spte(kvm, gfn, old_spte, new_spte, role.level); if (was_leaf && is_accessed_spte(old_spte) && (!is_present || !is_accessed_spte(new_spte) || pfn_changed)) kvm_set_pfn_accessed(spte_to_pfn(old_spte)); } -static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte) +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte) { u64 *sptep = rcu_dereference(iter->sptep); @@ -542,15 +671,42 @@ static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte) */ WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte)); - /* - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and - * does not hold the mmu_lock. On failure, i.e. 
if a different logical - * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with - * the current value, so the caller operates on fresh data, e.g. if it - * retries tdp_mmu_set_spte_atomic() - */ - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) - return -EBUSY; + if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) { + int ret; + + if (is_shadow_present_pte(new_spte)) { + /* + * Populating case. + * - set_private_spte_present() implements + * 1) Freeze SPTE + * 2) call hooks to update private page table, + * 3) update SPTE to new_spte + * - handle_changed_spte() only updates stats. + */ + ret = set_private_spte_present(kvm, iter->sptep, iter->gfn, + iter->old_spte, new_spte, iter->level); + if (ret) + return ret; + } else { + /* + * Zapping case. + * Zap is only allowed when write lock is held + */ + if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte))) + return -EBUSY; + } + } else { + /* + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs + * and does not hold the mmu_lock. On failure, i.e. if a + * different logical CPU modified the SPTE, try_cmpxchg64() + * updates iter->old_spte with the current value, so the caller + * operates on fresh data, e.g. if it retries + * tdp_mmu_set_spte_atomic() + */ + if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) + return -EBUSY; + } return 0; } @@ -576,23 +732,24 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte) { + u64 *sptep = rcu_dereference(iter->sptep); int ret; lockdep_assert_held_read(&kvm->mmu_lock); - ret = __tdp_mmu_set_spte_atomic(iter, new_spte); + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte); if (ret) return ret; handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, - new_spte, iter->level, true); - + new_spte, sptep_to_sp(sptep)->role, true); return 0; } static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, struct tdp_iter *iter) { + union kvm_mmu_page_role role; int ret; lockdep_assert_held_read(&kvm->mmu_lock); @@ -605,7 +762,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, * Delay processing of the zapped SPTE until after TLBs are flushed and * the REMOVED_SPTE is replaced (see below). */ - ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE); + ret = __tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE); if (ret) return ret; @@ -619,6 +776,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, */ __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE); + + role = sptep_to_sp(iter->sptep)->role; /* * Process the zapped SPTE after flushing TLBs, and after replacing * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are @@ -626,7 +785,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, * SPTEs. 
*/ handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, - 0, iter->level, true); + SHADOW_NONPRESENT_VALUE, role, true); return 0; } @@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, u64 old_spte, u64 new_spte, gfn_t gfn, int level) { + union kvm_mmu_page_role role; + lockdep_assert_held_write(&kvm->mmu_lock); /* @@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte)); old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level); + if (is_private_sptep(sptep) && !is_removed_spte(new_spte) && + is_shadow_present_pte(new_spte)) { + /* Because write spin lock is held, no race. It should success. */ + KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte, + new_spte, level), kvm); + } - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); + role = sptep_to_sp(sptep)->role; + role.level = level; + handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false); return old_spte; } @@ -684,8 +853,11 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter, continue; \ else -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \ - for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end) +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end) \ + for_each_tdp_pte(_iter, \ + root_to_sp((_private) ? _mmu->private_root_hpa : \ + _mmu->root.hpa), \ + _start, _end) /* * Yield if the MMU lock is contended or this thread needs to return control @@ -853,6 +1025,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, lockdep_assert_held_write(&kvm->mmu_lock); + /* + * start and end doesn't have GFN shared bit. This function zaps + * a region including alias. Adjust shared bit of [start, end) if the + * root is shared. + */ + start = kvm_gfn_for_root(kvm, root, start); + end = kvm_gfn_for_root(kvm, root, end); + rcu_read_lock(); for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) { @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); else wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn, - fault->pfn, iter->old_spte, fault->prefetch, true, - fault->map_writable, &new_spte); + fault->pfn, iter->old_spte, fault->prefetch, true, + fault->map_writable, &new_spte); if (new_spte == iter->old_spte) ret = RET_PF_SPURIOUS; @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) struct kvm *kvm = vcpu->kvm; struct tdp_iter iter; struct kvm_mmu_page *sp; + gfn_t raw_gfn; + bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm); int ret = RET_PF_RETRY; kvm_mmu_hugepage_adjust(vcpu, fault); @@ -1116,7 +1298,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) rcu_read_lock(); - tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) { + raw_gfn = gpa_to_gfn(fault->addr); + + tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) { int r; if (fault->nx_huge_page_workaround_enabled) @@ -1142,14 +1326,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) * needs to be split. 
*/ sp = tdp_mmu_alloc_sp(vcpu); + if (kvm_is_private_gpa(kvm, raw_gfn << PAGE_SHIFT)) + kvm_mmu_alloc_private_spt(vcpu, sp); tdp_mmu_init_child_sp(sp, &iter); sp->nx_huge_page_disallowed = fault->huge_page_disallowed; - if (is_shadow_present_pte(iter.old_spte)) + if (is_shadow_present_pte(iter.old_spte)) { + /* + * TODO: large page support. + * Doesn't support large page for TDX now + */ + KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm); r = tdp_mmu_split_huge_page(kvm, &iter, sp, true); - else + } else { r = tdp_mmu_link_sp(kvm, &iter, sp, true); + } /* * Force the guest to retry if installing an upper level SPTE @@ -1780,7 +1972,7 @@ static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, gfn_t gfn = addr >> PAGE_SHIFT; int leaf = -1; - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { + tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) { leaf = iter.level; sptes[leaf] = iter.old_spte; } @@ -1838,7 +2030,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr, gfn_t gfn = addr >> PAGE_SHIFT; tdp_ptep_t sptep = NULL; - tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { + /* fast page fault for private GPA isn't supported. */ + WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr)); + + tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) { *spte = iter.old_spte; sptep = iter.sptep; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 437ddd4937a9..ac350c51bc18 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -10,7 +10,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm); void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm); -void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu); +void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private); __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) { From patchwork Wed May 15 00:59:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664502 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D95602BB0A; Wed, 15 May 2024 01:00:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734812; cv=none; b=QIfMh2syZJKHVWZRXm80npvQ0yqvoSrunVtDs5jAEDKef6TOvJeSFmqAbfG8kjKGW6Q+U7jjl7XJjgFUXuUViIB1GMU+Lw2xUsPiraVj4ah52Rlz86uad3IZh+YjLtJKCEDPvO38gBJU7VFitCrsuwzXF/y2adGWAWQ9cgPJiCI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734812; c=relaxed/simple; bh=cTrLZlU5/ZsvuUDUsSy+jSAeCzN3jetOaTy8YedbsKY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=hFehqK7e4pVrkthz2yPiLXHCn3LCBUhrViq5mujy+1YzJYNT66MNTQLTakJ+FkHeowNTZI8P88a2imrywgY2QHxBUhFwI3ESzGcFuTxbH9MWwctEvxCTHWvFGV0ZMGtef714+YOZXYgdHqtSaxYAIFenYLCo7IzmuL3M6eL7g44= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=AFoK7OyV; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="AFoK7OyV" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734811; x=1747270811; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=cTrLZlU5/ZsvuUDUsSy+jSAeCzN3jetOaTy8YedbsKY=; b=AFoK7OyVwszSq/NaygsTlXgz4FubPqVXCMSBsiTP24quWepFWmyfN0r8 2jjZL+SNxWp1mg23f5Wc31P9QjxsQmymxyf4b/Tgqy0iWWaOuJNuE0ffj l7XBPlxyGemynYxfFtUQ3POh66/EG6crTXBvB5rbgb9ZKkHJgI/BfQkiK AoyOA05tp/UGnGcAjgfg8IXzHRGf+9QQqnw4dTsmw1+e35PWKvPRI7zta 7A4PHElULibQSvEhtGJjJf8ZW7xiLBPyj9UKw5e0hKZMMepKr3QqBPdNl KwEMpTQg7yCYt98+UFC69loMlIZAf8h9aNfPj8rPKDW+JqXpaTUpw5ihV g==; X-CSE-ConnectionGUID: ji/7WyQsQ1ieWBEVEEc7FA== X-CSE-MsgGUID: DS9xSeS9TPqPR9B02H7Ltg== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613977" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613977" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:07 -0700 X-CSE-ConnectionGUID: tzIbvwIoRTeatS/4eNNkEg== X-CSE-MsgGUID: loBjX6+QQAOazgFTj34OcA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942799" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:06 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 11/16] KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root() Date: Tue, 14 May 2024 17:59:47 -0700 Message-Id: <20240515005952.3410568-12-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Extract tdp_mmu_root_match() to check if the root has given types and use it for the root page table iterator. It checks only_invalid now. TDX KVM operates on a shared page table only (Shared-EPT), a mirrored page table only (Secure-EPT), or both based on the operation. KVM MMU notifier operations only on shared page table. KVM guest_memfd invalidation operations only on mirrored page table, and so on. Introduce a centralized matching function instead of open coding matching logic in the iterator. 
The next step is to extend the function to check whether the page is shared or private Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 810d552e9bf6..a0b7c43e843d 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -92,6 +92,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root) call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); } +static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid) +{ + if (only_valid && root->role.invalid) + return false; + + return true; +} + /* * Returns the next root after @prev_root (or the first root if @prev_root is * NULL). A reference to the returned root is acquired, and the reference to @@ -125,7 +133,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, typeof(*next_root), link); while (next_root) { - if ((!only_valid || !next_root->role.invalid) && + if (tdp_mmu_root_match(next_root, only_valid) && kvm_tdp_mmu_get_root(next_root)) break; @@ -176,7 +184,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \ ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \ - ((_only_valid) && (_root)->role.invalid))) { \ + !tdp_mmu_root_match((_root), (_only_valid)))) { \ } else #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ From patchwork Wed May 15 00:59:48 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664504 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 20C7D364AE; Wed, 15 May 2024 01:00:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734813; cv=none; b=WcZ2uI6wphBIhEUjjwFQQzY+812yYlP8aMLjZKcMgj/b4Y0ecjTaC2+2UE5sLqDEHB3z/MG5r0nIylx6ayhGMv4LsnsgfnJS8JmdCG3lf28bWFVz7Vod8EeqfOcmWFOODF3Ejmc+/whwLDS+ojh2dxOnXUIgjPFDBudnkAsfegg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734813; c=relaxed/simple; bh=hwT7Hnp4s9Rtxpw8GmgoIARXgPDTRvl14/AMhCiqMB0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=LhHuIF/uawS0Q6EoO13WxzQEICcdE/KATpa7I+8ZE6qQWqB0plDMqi5ujg5GBbhBMNsegtSxMsZFwvzL7yJUvb+DwImhFZoZOtQh0Bm37TxnV5Wsi5cGjxT+GbIWHy3V22P+bgoXmdWGniqV8q+78u0Kky5UX+AT17FMA/QhAxI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=ilYMBWQp; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="ilYMBWQp" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; 
d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734812; x=1747270812; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=hwT7Hnp4s9Rtxpw8GmgoIARXgPDTRvl14/AMhCiqMB0=; b=ilYMBWQpZp+ecPXQJNPTj3T1t9ErvQBDGfxncmpVPIWogmyfAavvmMqc ZwvTqnO2fK1lVy1jbIhul2qXG7KOwdFefCVEKzct+IHT2VBctAM2pLsm3 kmMuTV4CDxs51SlgHwtgSHFGDxsyAOXvoKru4DJX01x/lBUsIkfobz5RK pkucL9SPR5QOUj30hbC6ha+W9cavBSuFB7LCHr31ppmgxdQEb508G2sd7 u8zdsSZeemi28r3GHIwaClQVejCwUQmHaEzYG28eBxyD/TlfgsTGekd3g RHsBH0BD1jhqZS2rkX81PAa+h2p9EkcXqZRebX47v3cMFD8g/haBPxmMx g==; X-CSE-ConnectionGUID: 9xW0yRnRSxu/XRoz1glDsw== X-CSE-MsgGUID: XCMbw4nFS6yAZpDMY9CMkw== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613981" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613981" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:07 -0700 X-CSE-ConnectionGUID: TUPxIWNtQLi04SmwZHLs4g== X-CSE-MsgGUID: aLwVOWGzReuTVOfZnQUmNQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942814" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:06 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 12/16] KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type Date: Tue, 14 May 2024 17:59:48 -0700 Message-Id: <20240515005952.3410568-13-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1] so that the iterator on the root page table can consistently filter the root page table type instead of only_valid. TDX KVM will operate on KVM page tables with specified types. Shared page table, private page table, or both. Introduce an enum instead of bool only_valid so that we can easily enhance page table types applicable to shared, private, or both in addition to valid or not. Replace only_valid=false with KVM_ANY_ROOTS and only_valid=true with KVM_ANY_VALID_ROOTS. Use KVM_ANY_ROOTS and KVM_ANY_VALID_ROOTS to wrap KVM_VALID_ROOTS to avoid further code churn when shared and private are introduced. Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ [1] Suggested-by: Sean Christopherson Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - Newly introduced. 
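For readers skimming the diff below, here is a minimal stand-alone sketch of the filtering idea in user-space C. Only the enum values mirror the patch; struct root, root_match() and main() are simplified stand-ins, not KVM code. The point is that callers pass an OR-able mask instead of a bool, so new root classes can be added later without touching every call site.

  #include <stdbool.h>
  #include <stdio.h>

  enum kvm_tdp_mmu_root_types {
          KVM_VALID_ROOTS     = 1 << 0,
          KVM_ANY_ROOTS       = 0,
          KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS,
  };

  /* Simplified stand-in for struct kvm_mmu_page. */
  struct root { bool invalid; };

  /* Mirrors tdp_mmu_root_match(): reject invalid roots only when asked to. */
  static bool root_match(const struct root *root, enum kvm_tdp_mmu_root_types types)
  {
          if ((types & KVM_VALID_ROOTS) && root->invalid)
                  return false;
          return true;
  }

  int main(void)
  {
          struct root stale = { .invalid = true };

          /* Prints "1 0": ANY_ROOTS keeps the stale root, ANY_VALID_ROOTS skips it. */
          printf("%d %d\n", root_match(&stale, KVM_ANY_ROOTS),
                 root_match(&stale, KVM_ANY_VALID_ROOTS));
          return 0;
  }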
--- arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++------------------- arch/x86/kvm/mmu/tdp_mmu.h | 7 +++++++ 2 files changed, 27 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index a0b7c43e843d..7af395073e92 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -92,9 +92,10 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root) call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); } -static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid) +static bool tdp_mmu_root_match(struct kvm_mmu_page *root, + enum kvm_tdp_mmu_root_types types) { - if (only_valid && root->role.invalid) + if ((types & KVM_VALID_ROOTS) && root->role.invalid) return false; return true; @@ -102,17 +103,17 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root, bool only_valid) /* * Returns the next root after @prev_root (or the first root if @prev_root is - * NULL). A reference to the returned root is acquired, and the reference to - * @prev_root is released (the caller obviously must hold a reference to - * @prev_root if it's non-NULL). + * NULL) that matches with @types. A reference to the returned root is + * acquired, and the reference to @prev_root is released (the caller obviously + * must hold a reference to @prev_root if it's non-NULL). * - * If @only_valid is true, invalid roots are skipped. + * Roots that doesn't match with @types are skipped. * * Returns NULL if the end of tdp_mmu_roots was reached. */ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, struct kvm_mmu_page *prev_root, - bool only_valid) + enum kvm_tdp_mmu_root_types types) { struct kvm_mmu_page *next_root; @@ -133,7 +134,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, typeof(*next_root), link); while (next_root) { - if (tdp_mmu_root_match(next_root, only_valid) && + if (tdp_mmu_root_match(next_root, types) && kvm_tdp_mmu_get_root(next_root)) break; @@ -158,20 +159,20 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * If shared is set, this function is operating under the MMU lock in read * mode. */ -#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \ - for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \ +#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types) \ + for (_root = tdp_mmu_next_root(_kvm, NULL, _types); \ ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \ - _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \ + _root = tdp_mmu_next_root(_kvm, _root, _types)) \ if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \ } else #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \ - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, true) + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_ANY_VALID_ROOTS) #define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \ - for (_root = tdp_mmu_next_root(_kvm, NULL, false); \ + for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ANY_ROOTS); \ ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \ - _root = tdp_mmu_next_root(_kvm, _root, false)) + _root = tdp_mmu_next_root(_kvm, _root, KVM_ANY_ROOTS)) /* * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write, @@ -180,18 +181,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * Holding mmu_lock for write obviates the need for RCU protection as the list * is guaranteed to be stable. 
*/ -#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \ +#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _types) \ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \ ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \ - !tdp_mmu_root_match((_root), (_only_valid)))) { \ + !tdp_mmu_root_match((_root), (_types)))) { \ } else #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ - __for_each_tdp_mmu_root(_kvm, _root, _as_id, false) + __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ANY_ROOTS) #define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \ - __for_each_tdp_mmu_root(_kvm, _root, _as_id, true) + __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_ANY_VALID_ROOTS) static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu) { @@ -1389,7 +1390,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, { struct kvm_mmu_page *root; - __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) + __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ANY_ROOTS) flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end, range->may_block, flush); diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index ac350c51bc18..30f2ab88a642 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -19,6 +19,13 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root); +enum kvm_tdp_mmu_root_types { + KVM_VALID_ROOTS = BIT(0), + + KVM_ANY_ROOTS = 0, + KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS, +}; + bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush); bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); From patchwork Wed May 15 00:59:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664505 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A6AAE374F5; Wed, 15 May 2024 01:00:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734814; cv=none; b=qiMJi0MRjZxVm+wj8nEcmYYowne7GNTphCNMb8S4xZdQpfi/ZkEnnec8Dvvu1pyuiquVLcrgRM8CFtTzUUv9wQ+lS48VeXbr7iE1AT2rGh5L98JSoZ/wJM+nMeyg75mJF00E/Mp85MR3IoD0tgx8ulJjiT7X/V8TAFvZLOT4ncQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734814; c=relaxed/simple; bh=Qkp1GU5j+YcdfSEFw2npWErCnczzFIzBQgNqWdkwsPI=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Iqov8m4AvY0KXAxPgf3H0CHl1mHCb5kBD+JFCB8iCv6rmc4MoAHnJbtfdADR29wUXcIFtrULY5D32ZSnyWyh+ZqXSJb8lasuJXN/t4Q7k0bQSME7+tISKn/1RNEv+OGQWEBvXnMWcoTF95KoXR2ubv04Fa7jKVLIr7hFfqetehk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=CYE3EWNf; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass 
smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="CYE3EWNf" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734813; x=1747270813; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Qkp1GU5j+YcdfSEFw2npWErCnczzFIzBQgNqWdkwsPI=; b=CYE3EWNfzC/Aj6mXMu0M2pN+C4Sz7TzhUgQlEfTHTXm97i3K/rlktZFO xr3c5xxcPQgauF0f9p5BsXmS5Rc0Jbu2HsAuh5jlYaLz3jGLByrvONTvZ SWvL5v7CJEGNNky7lcIviWHMX8BWpRgrIiFpWq5WWwPnC1Qf6NYaVhkpZ b7WbD985SSaEI0Ig3BZhhr+bcnplnkcquifxM44/m+cZQDKTJT5loYq5c 8MuScPqQ/3fAVVO4VfPSverBO7I5aeTda9hRMihYuJWDBwXfK1le3UBAF obddVj8dXvm2lp04AU+2JA2GweB8sF9Z4ZpEKO6xsA0b0LW286dUfMB0+ g==; X-CSE-ConnectionGUID: ZkkOwBu+SdGhZLK+gS796A== X-CSE-MsgGUID: sRYoHgc3TxSJTh0QCVjNgA== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613986" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613986" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:08 -0700 X-CSE-ConnectionGUID: i110ikQjRa6RJzaukZ0gxQ== X-CSE-MsgGUID: SK7QFYhpSkuSIdoCM5jSfw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942825" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:07 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 13/16] KVM: x86/tdp_mmu: Introduce shared, private KVM MMU root types Date: Tue, 14 May 2024 17:59:49 -0700 Message-Id: <20240515005952.3410568-14-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Add shared and private types to enum kvm_tdp_mmu_root_types to specify KVM MMU roots [1] so that the root iterators can consistently filter on the root page table type. TDX KVM will operate on KVM page tables of a specified type: shared, private, or both. Extend the enum to cover those page table types and make the iterators take the requested root types, i.e. valid or not, combined with shared, private, or both. Enhance tdp_mmu_root_match() to understand private vs shared.
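For illustration, here is a minimal stand-alone sketch of how these root-type bits are meant to compose and filter roots. It mirrors the core of the new tdp_mmu_root_match() (the WARN_ON_ONCE() sanity checks are omitted); the BIT(0)/BIT(1) values and the mock_root fields are assumptions standing in for KVM_PROCESS_SHARED/KVM_PROCESS_PRIVATE and the struct kvm_mmu_page state used in the series:

#include <stdbool.h>
#include <stdio.h>

#define BIT(n) (1u << (n))

/* Assumed values; in the series the first two come from KVM_PROCESS_*. */
enum kvm_tdp_mmu_root_types {
        KVM_SHARED_ROOTS    = BIT(0),
        KVM_PRIVATE_ROOTS   = BIT(1),
        KVM_VALID_ROOTS     = BIT(2),
        KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS,
        KVM_ANY_ROOTS       = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS,
};

/* Stand-in for struct kvm_mmu_page: only the state the filter looks at. */
struct mock_root {
        bool invalid;    /* root->role.invalid */
        bool is_private; /* is_private_sp(root) */
};

/* Mirrors the filtering logic of tdp_mmu_root_match(). */
static bool root_match(const struct mock_root *root, enum kvm_tdp_mmu_root_types types)
{
        if ((types & KVM_VALID_ROOTS) && root->invalid)
                return false;
        if ((types & KVM_SHARED_ROOTS) && !root->is_private)
                return true;
        if ((types & KVM_PRIVATE_ROOTS) && root->is_private)
                return true;
        return false;
}

int main(void)
{
        struct mock_root shared_valid = { .invalid = false, .is_private = false };
        struct mock_root private_invalid = { .invalid = true, .is_private = true };

        /* A valid shared root matches KVM_ANY_VALID_ROOTS; an invalid private root does not. */
        printf("%d %d\n",
               root_match(&shared_valid, KVM_ANY_VALID_ROOTS),
               root_match(&private_invalid, KVM_ANY_VALID_ROOTS));
        return 0;
}

This composition is what lets existing call sites keep passing KVM_ANY_ROOTS or KVM_ANY_VALID_ROOTS unchanged, while TDX-aware code can narrow an operation to KVM_SHARED_ROOTS or KVM_PRIVATE_ROOTS.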
Suggested-by: Sean Christopherson Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ [1] Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++- arch/x86/kvm/mmu/tdp_mmu.h | 14 ++++++++++---- 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7af395073e92..8914c5b0d5ab 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -95,10 +95,20 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root) static bool tdp_mmu_root_match(struct kvm_mmu_page *root, enum kvm_tdp_mmu_root_types types) { + if (WARN_ON_ONCE(types == BUGGY_KVM_ROOTS)) + return false; + if (WARN_ON_ONCE(!(types & (KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS)))) + return false; + if ((types & KVM_VALID_ROOTS) && root->role.invalid) return false; - return true; + if ((types & KVM_SHARED_ROOTS) && !is_private_sp(root)) + return true; + if ((types & KVM_PRIVATE_ROOTS) && is_private_sp(root)) + return true; + + return false; } /* diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 30f2ab88a642..6a65498b481c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -20,12 +20,18 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root); enum kvm_tdp_mmu_root_types { - KVM_VALID_ROOTS = BIT(0), - - KVM_ANY_ROOTS = 0, - KVM_ANY_VALID_ROOTS = KVM_VALID_ROOTS, + BUGGY_KVM_ROOTS = BUGGY_KVM_INVALIDATION, + KVM_SHARED_ROOTS = KVM_PROCESS_SHARED, + KVM_PRIVATE_ROOTS = KVM_PROCESS_PRIVATE, + KVM_VALID_ROOTS = BIT(2), + KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS, + KVM_ANY_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS, }; +static_assert(!(KVM_SHARED_ROOTS & KVM_VALID_ROOTS)); +static_assert(!(KVM_PRIVATE_ROOTS & KVM_VALID_ROOTS)); +static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1)); + bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush); bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); From patchwork Wed May 15 00:59:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664506 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8FAED383BE; Wed, 15 May 2024 01:00:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734815; cv=none; b=jAFp81kyV/2VTG1nVT1n2dIL4ewSSKjzeZ1+DZc4iMUWjqs3fQV1hT27EU7D/GnvFngRogoMVgzBjPquK5/sNkkJQhgkGbJ1wMHlSRlRL7fJ63E35ziaBgYvr/33EArwu6KYeR2kcTYYrYisQYArtcDc5TwuIF7r2zP+GrEyMWo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734815; c=relaxed/simple; bh=YcYrPvm62cI+4A1/6iEWc+sBYrpX+wYdWWAThta5Qro=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=mu+ejgMSIc0m3S1pwwvVtUgPgneshzVyRUhN4GxxtqajzXrT7bMsw+I+kgQbbYQrxuIf670X/HPvqkd3DigIf8eWNQ45oOS8tbLg6Co2WMrs2+qCpullv8tQKSkLirjS6UsuIY48TMg3EVGXMErngPn403vUppw5t1XUFX8Knts= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=N/832aU/; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="N/832aU/" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734814; x=1747270814; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YcYrPvm62cI+4A1/6iEWc+sBYrpX+wYdWWAThta5Qro=; b=N/832aU/jx5M5XU4K0OWyY7yufcqL8XkRl/GGhRZLBS2S+Y9bIiHMdOt 570c059o3WKshg50Ed9KC08TK5IKuteWXys+HGtV5zSiMhgo68sDyGANt lTAY1Y14/jE3ipLboyeKr2s1HXKalQd3ZlpkxWNyH6Qtk07VQaMN9nFoP KTpBj37RGNfjRBSgVT8Kt8W1IK+qP5YCzUHGidsoYXOhP5nwOJDpzUKOg z8u5sZJOqRCPmTft2yfF8p/6DJ7QLFiaXCCqAGpk3EIUsfS/XrdGoV1xd 2R2LhL2UaEvVTXkR7Ub8yv61Lt4TgCO2wi8WOKdZqkKuBcaR8EtDen4Ah w==; X-CSE-ConnectionGUID: 74Hcq4I6Qg6EDRa60m1qFQ== X-CSE-MsgGUID: rL4IxInCRKKRGbknql24oA== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613990" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613990" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:08 -0700 X-CSE-ConnectionGUID: kG7EbLaVSrepTn16J4iMxg== X-CSE-MsgGUID: 4BA517K8RgukFZNAhdnQcQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942837" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:07 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 14/16] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Date: Tue, 14 May 2024 17:59:50 -0700 Message-Id: <20240515005952.3410568-15-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Rename kvm_tdp_mmu_invalidate_all_roots() to kvm_tdp_mmu_invalidate_roots(), and make it enum kvm_tdp_mmu_root_types as an argument. Have the callers only invalidate the required roots instead of all roots. Suggested-by: Chao Gao Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - New patch --- arch/x86/kvm/mmu/mmu.c | 9 +++++++-- arch/x86/kvm/mmu/tdp_mmu.c | 5 +++-- arch/x86/kvm/mmu/tdp_mmu.h | 3 ++- 3 files changed, 12 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2506d6277818..338628094ad7 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6414,8 +6414,13 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) * write and in the same critical section as making the reload request, * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield. 
*/ - if (tdp_mmu_enabled) - kvm_tdp_mmu_invalidate_all_roots(kvm); + if (tdp_mmu_enabled) { + /* + * The private page tables doesn't support fast zapping. The + * caller should handle it by other way. + */ + kvm_tdp_mmu_invalidate_roots(kvm, KVM_SHARED_ROOTS); + } /* * Notify all vcpus to reload its shadow page table and flush TLB. diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 8914c5b0d5ab..eb88af48c8f0 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) * for zapping and thus puts the TDP MMU's reference to each root, i.e. * ultimately frees all roots. */ - kvm_tdp_mmu_invalidate_all_roots(kvm); + kvm_tdp_mmu_invalidate_roots(kvm, KVM_ANY_ROOTS); kvm_tdp_mmu_zap_invalidated_roots(kvm); WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages)); @@ -1170,7 +1170,8 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference. * See kvm_tdp_mmu_alloc_root(). */ -void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm) +void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm, + enum kvm_tdp_mmu_root_types types) { struct kvm_mmu_page *root; diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 6a65498b481c..b8a967426fac 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -35,7 +35,8 @@ static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1)); bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush); bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); -void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); +void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm, + enum kvm_tdp_mmu_root_types types); void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm); int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); From patchwork Wed May 15 00:59:51 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664507 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0CEB239847; Wed, 15 May 2024 01:00:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734815; cv=none; b=NL0n0AEL5re1ERlbALnxky0vcsAFB45RfTYPBkBYK0BW4xduT792ZCrTN1ZEDnRlmuZ5ZJqdFlnUmDVh+d3Xd9VPWtG202g6/0+LnjnPGELdmWPwPF3TUOib6jXl5m3dLQp+BXebTcCqxe2D7xtSqy3qlDegdactnsq8QUDk+sE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734815; c=relaxed/simple; bh=ncuxWk4n+cq8Op23zgOE/XZqoYu8BbW/SQHwcjWe9Yk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=I2lodamewsTpymNXLWxzHNT8/kn53yvd/lgKDaE0SCLpfppAsVusrTZxHhCejoHD19714ft1auNnhq1vWPx9H+a8E5lSCD8jiNtI3evk12RS/JkuWXcs6iDg+yG2q8RlWhcp5O0Lfe8wKd/BLxYEmPAvVKkc6ZGAbS45FCBQOVQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=OUaaLWMV; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) 
header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="OUaaLWMV" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734814; x=1747270814; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ncuxWk4n+cq8Op23zgOE/XZqoYu8BbW/SQHwcjWe9Yk=; b=OUaaLWMVcXZBsphfIUntX13iQ8oMA/cEf38S7JzYjAX5+9xCzQbm3Up/ vaw3X5Gmj8lanBGfp05mK0M5gsJoHx5JQVPFHS/ahL82EA2YWGLGceNV5 z8E06nJNiZhrPB2a1bWHW0APnn9sr4lCOk4/1/4f4Ih46MvcGRmhqQxGB Aaynz9hyNZC4kyw+6qlPTJqmlC3Ev1YeNrxy6SdyDWJ6jLMEecKbFFBmi HoqAmU368ncKgu/roTkbKty/lSt45cwBOd4+X2yoRY2WDNUAP/2BqG+HR O1uPPzDxBuZl+7bWdFM+LUi1D26V4l9QoubpXaOP7U3vgjSQSRGUxX2lx A==; X-CSE-ConnectionGUID: 0rgenTWKQ3OPWcvSoiPrsw== X-CSE-MsgGUID: LrHh3l5uS8ezYxp+Ud+TfQ== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613994" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613994" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:09 -0700 X-CSE-ConnectionGUID: dn1mAM6GR0exhGcx8v+U5A== X-CSE-MsgGUID: bNhNrPqHSdqaemDkCQmK6w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942853" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:08 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 15/16] KVM: x86/tdp_mmu: Make mmu notifier callbacks check kvm_process Date: Tue, 14 May 2024 17:59:51 -0700 Message-Id: <20240515005952.3410568-16-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Isaku Yamahata Teach the MMU notifier callbacks how to check kvm_gfn_range.process to filter which KVM MMU root types to operate on. Private GPAs are backed by guest memfd. Such memory is not subject to MMU notifier callbacks because it can't be mapped into the host user address space. Now that kvm_gfn_range conveys which roots to operate on, enhance the callbacks to filter by root page table type. The KVM MMU notifier comes down to two functions: kvm_tdp_mmu_unmap_gfn_range() and kvm_tdp_mmu_handle_gfn(). For VMs without a private/shared split in the EPT, all operations should target the normal (shared) root. Adjust the target roots based on kvm_gfn_shared_mask(). invalidate_range_start() reaches kvm_tdp_mmu_unmap_gfn_range(); invalidate_range_end() doesn't reach arch code.
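For illustration, a rough stand-alone sketch of the intended mapping from kvm_gfn_range.process to root types (the real conversion is kvm_process_to_root_types() in the diff below); the KVM_PROCESS_* bit values and the has_shared_mask parameter standing in for kvm_gfn_shared_mask() are assumptions:

#include <stdio.h>

#define BIT(n) (1u << (n))

/* Assumed bit values for the process/root flags used in this series. */
enum kvm_process { KVM_PROCESS_SHARED = BIT(0), KVM_PROCESS_PRIVATE = BIT(1) };
enum kvm_tdp_mmu_root_types { KVM_SHARED_ROOTS = BIT(0), KVM_PRIVATE_ROOTS = BIT(1) };

/*
 * Mirrors kvm_process_to_root_types(): without a private/shared split
 * (non-TDX VMs), every request is redirected to the normal (shared) root.
 */
static enum kvm_tdp_mmu_root_types
process_to_root_types(int has_shared_mask, enum kvm_process process)
{
        if (!has_shared_mask) {
                process |= KVM_PROCESS_SHARED;
                process &= ~KVM_PROCESS_PRIVATE;
        }
        return (enum kvm_tdp_mmu_root_types)process;
}

int main(void)
{
        /* A private-only request on a non-TDX VM falls back to the shared root: prints 1. */
        printf("%d\n", process_to_root_types(0, KVM_PROCESS_PRIVATE) == KVM_SHARED_ROOTS);
        /* On a TDX VM the requested root types pass through unchanged: prints 1. */
        printf("%d\n", process_to_root_types(1, KVM_PROCESS_PRIVATE) == KVM_PRIVATE_ROOTS);
        return 0;
}

With that conversion in place, both kvm_tdp_mmu_unmap_gfn_range() and kvm_tdp_mmu_handle_gfn() can iterate only the roots the notifier actually cares about, which for guests without a shared mask remains exactly the shared root they operate on today.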
Signed-off-by: Isaku Yamahata Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - Remove warning (Rick) - Remove confusing mention of mapping flags (Chao) - Re-write coverletter v19: - type: test_gfn() => test_young() v18: - newly added --- arch/x86/kvm/mmu/tdp_mmu.c | 40 +++++++++++++++++++++++++++++++++++--- 1 file changed, 37 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index eb88af48c8f0..af61d131d2dc 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1396,12 +1396,32 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return ret; } +static enum kvm_tdp_mmu_root_types kvm_process_to_root_types(struct kvm *kvm, + enum kvm_process process) +{ + WARN_ON_ONCE(process == BUGGY_KVM_INVALIDATION); + + /* Always process shared for cases where private is not on a separate root */ + if (!kvm_gfn_shared_mask(kvm)) { + process |= KVM_PROCESS_SHARED; + process &= ~KVM_PROCESS_PRIVATE; + } + + return (enum kvm_tdp_mmu_root_types)process; +} + +/* Used by mmu notifier via kvm_unmap_gfn_range() */ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, bool flush) { + enum kvm_tdp_mmu_root_types types = kvm_process_to_root_types(kvm, range->process); struct kvm_mmu_page *root; - __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, KVM_ANY_ROOTS) + /* kvm_process_to_root_types() has WARN_ON_ONCE(). Don't warn it again. */ + if (types == BUGGY_KVM_ROOTS) + return flush; + + __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end, range->may_block, flush); @@ -1415,18 +1435,32 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm, struct kvm_gfn_range *range, tdp_handler_t handler) { + enum kvm_tdp_mmu_root_types types = kvm_process_to_root_types(kvm, range->process); struct kvm_mmu_page *root; struct tdp_iter iter; bool ret = false; + if (types == BUGGY_KVM_ROOTS) + return ret; + /* * Don't support rescheduling, none of the MMU notifiers that funnel * into this helper allow blocking; it'd be dead, wasteful code. */ - for_each_tdp_mmu_root(kvm, root, range->slot->as_id) { + __for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) { + gfn_t start, end; + + /* + * For TDX shared mapping, set GFN shared bit to the range, + * so the handler() doesn't need to set it, to avoid duplicated + * code in multiple handler()s. 
+ */ + start = kvm_gfn_for_root(kvm, root, range->start); + end = kvm_gfn_for_root(kvm, root, range->end); + rcu_read_lock(); - tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) + tdp_root_for_each_leaf_pte(iter, root, start, end) ret |= handler(kvm, &iter, range); rcu_read_unlock(); From patchwork Wed May 15 00:59:52 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rick Edgecombe X-Patchwork-Id: 13664508 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7491939FD3; Wed, 15 May 2024 01:00:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.19 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734816; cv=none; b=kyqbo/R2V7QLWDz9iaY6JDtVuBFg1YoaeEJM4d5xKoQrDr7hXQv80I21mrgixFeaq5oI/JQLDCW0b3ouOe6f3J6VihefeV00lOAfWC2KSEsilKj43/WX5uYlVkqWP5GGMkxcoHR9tXu7NpUcXpQXlCR3cz9SOwnqxxbwXeJB2sI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715734816; c=relaxed/simple; bh=aj7fEmDeepMPOarJYfKiRSUouqqIYmoqtkZdy4tUeqg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=aJpoZLeYkJ+RSxYV6cpe2AyT6nvktSyB7PRzT4ZmLmxlGTjAOHeaAmVr8FS+eRnXqfm7wPVFRzpUdRBaDKhXM6VLRcmC2RB8oumyM1hfnX9V2k+6kQ9zHDp/fJALV8RMSTspGKZkuy/E4C2jREPiCCsZhRqu+iTveUIJ6dasV7I= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=QdpfjLYS; arc=none smtp.client-ip=192.198.163.19 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="QdpfjLYS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1715734815; x=1747270815; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=aj7fEmDeepMPOarJYfKiRSUouqqIYmoqtkZdy4tUeqg=; b=QdpfjLYStE+yiWVUfX3BnQ0+hygbzbbp3hpbUA9trrdgwXR6Ja3MDx5W tpF4W+udcuMJntxVNYdPIzJU83YMZ7+e/cL61I1zPrM4zz8Exjr64llJO SNBY4gm/hIOF/G6zseqL+v/Q0ipupi1jK2juwBzxdCP2KHqu69Ch1NPjE oeaSBEDOY5dPJlbFxe8FX8/rG5MITBnLsCYPeiO8/6Bt2sINolOkucydM E3kW6ULyXORt1ABQ1UmroGazathgm9SRRqdQPudNE1bV32ciw0lAb6sGl soMkkPO93PxFsig1qGMQ/TkUdgEarKMUoFOa27eGL2/QssKjUsBCTB3qp Q==; X-CSE-ConnectionGUID: a8KbWV+hRqWXNTA0oiD8Zg== X-CSE-MsgGUID: gQZq+WRRSbuY1akc02rQQg== X-IronPort-AV: E=McAfee;i="6600,9927,11073"; a="11613999" X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="11613999" Received: from fmviesa006.fm.intel.com ([10.60.135.146]) by fmvoesa113.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:10 -0700 X-CSE-ConnectionGUID: TFQG3a5BRwWlS7yGvC+bXg== X-CSE-MsgGUID: awm+P7luTueABNtx8YYR7g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,160,1712646000"; d="scan'208";a="30942860" Received: from oyildiz-mobl1.amr.corp.intel.com (HELO rpedgeco-desk4.intel.com) ([10.209.51.34]) by fmviesa006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 May 2024 18:00:08 -0700 From: Rick Edgecombe To: pbonzini@redhat.com, 
seanjc@google.com, kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, erdemaktas@google.com, sagis@google.com, yan.y.zhao@intel.com, dmatlack@google.com, rick.p.edgecombe@intel.com Subject: [PATCH 16/16] KVM: x86/tdp_mmu: Invalidate correct roots Date: Tue, 14 May 2024 17:59:52 -0700 Message-Id: <20240515005952.3410568-17-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> References: <20240515005952.3410568-1-rick.p.edgecombe@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Sean Christopherson When invalidating roots, respect the root type passed. kvm_tdp_mmu_invalidate_roots() is called with different root types. For kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing down a TD it needs to invalidate all roots. Check the root type in root iterator. Signed-off-by: Sean Christopherson Co-developed-by: Isaku Yamahata Signed-off-by: Isaku Yamahata [evolved quite a bit from original author's patch] Co-developed-by: Rick Edgecombe Signed-off-by: Rick Edgecombe --- TDX MMU Part 1: - Rename from "Don't zap private pages for unsupported cases", and split many parts out. - Don't support MTRR, apic zapping (Rick) - Detangle private/shared alias logic in kvm_tdp_mmu_unmap_gfn_range() (Rick) - Fix TLB flushing bug debugged by (Chao Gao) https://lore.kernel.org/kvm/Zh8yHEiOKyvZO+QR@chao-email/ - Split out MTRR part - Use enum based root iterators (Sean) - Reorder logic in kvm_mmu_zap_memslot_leafs(). - Replace skip_private with enum kvm_tdp_mmu_root_type. --- arch/x86/kvm/mmu/tdp_mmu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index af61d131d2dc..42ccafc7deff 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1196,6 +1196,9 @@ void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm, * or get/put references to roots. */ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) { + if (!tdp_mmu_root_match(root, types)) + continue; + /* * Note, invalid roots can outlive a memslot update! Invalid * roots must be *zapped* before the memslot update completes,