From patchwork Tue Apr 16 09:19:48 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@linux.intel.com>
X-Patchwork-Id: 10902271
Return-Path: <kvm-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B04ED1823
	for <patchwork-kvm@patchwork.kernel.org>;
 Tue, 16 Apr 2019 09:20:32 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8EDF628989
	for <patchwork-kvm@patchwork.kernel.org>;
 Tue, 16 Apr 2019 09:20:32 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 82E942896D; Tue, 16 Apr 2019 09:20:32 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6B2D4289AF
	for <patchwork-kvm@patchwork.kernel.org>;
 Tue, 16 Apr 2019 09:20:31 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726953AbfDPJUa (ORCPT
        <rfc822;patchwork-kvm@patchwork.kernel.org>);
        Tue, 16 Apr 2019 05:20:30 -0400
Received: from mga01.intel.com ([192.55.52.88]:43544 "EHLO mga01.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1726749AbfDPJU3 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Tue, 16 Apr 2019 05:20:29 -0400
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from orsmga005.jf.intel.com ([10.7.209.41])
  by fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 16 Apr 2019 02:20:28 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.60,357,1549958400";
   d="scan'208";a="316371588"
Received: from khuang2-desk.gar.corp.intel.com ([10.254.185.158])
  by orsmga005.jf.intel.com with ESMTP; 16 Apr 2019 02:20:25 -0700
From: Kai Huang <kai.huang@linux.intel.com>
To: kvm@vger.kernel.org, pbonzini@redhat.com, rkrcmar@redhat.com
Cc: junaids@google.com, sean.j.christopherson@intel.com,
        guangrong.xiao@gmail.com, thomas.lendacky@amd.com,
        brijesh.singh@amd.com, tglx@linutronix.de, bp@alien8.de,
        hpa@zytor.com, Kai Huang <kai.huang@linux.intel.com>
Subject: [PATCH] kvm: x86: Fix several SPTE mask calculation errors caused by
 MKTME
Date: Tue, 16 Apr 2019 21:19:48 +1200
Message-Id: <20190416091948.4420-1-kai.huang@linux.intel.com>
X-Mailer: git-send-email 2.13.6
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

With both Intel MKTME and AMD SME/SEV, physical address bits are reduced
due to several high bits of physical address are repurposed for memory
encryption. To honor such behavior those repurposed bits are reduced from
cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus
boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported
by CPUID anymore.

KVM uses boot_cpu_data.x86_phys_bits to calculate several SPTE masks
based on assumption that: 1) boot_cpu_data.x86_phys_bits equals to
physical address bits reported by CPUID -- this is used to check CPU has
reserved bits when KVM calculates shadow_mmio_{value|mask}; and whether
shadow_nonpresent_or_rsvd_mask should be setup (KVM assumes L1TF is not
present if CPU has 52 physical address bits); 2) if it is smaller than
52, bits [x86_phys_bits, 51] are reserved bits.

With Intel MKTME or AMD SME/SEV above assumption is not valid any more,
especially when calculating reserved bits with Intel MKTME, since Intel
MKTME treats the reduced bits as 'keyID', thus they are not reduced
bits, therefore boot_cpu_data.x86_phys_bits cannot be used to calcualte
reserved bits anymore, although we can still use it for AMD SME/SEV
since SME/SEV treats the reduced bits differently -- they are treated as
reserved bits, the same as other reserved bits in page table entity [1].

Fix by introducing a new 'shadow_first_rsvd_bit' variable in kvm x86 MMU
code to store the actual position of reserved bits -- for Intel MKTME,
it equals to physical address reported by CPUID, and for AMD SME/SEV, it
is boot_cpu_data.x86_phys_bits. And in reserved bits related calculation
it is used instead of boot_cpu_data.x86_phys_bits. Some other code
changes too to make code more reasonable, ie, kvm_set_mmio_spte_mask is
moved to x86/kvm/mmu.c from x86/kvm/x86.c to use shadow_first_rsvd_bit;
shadow_nonpresent_or_rsvd_mask calculation logic is slightly changed,
based on the fact that only Intel CPU is impacted by L1TF, so that KVM
can use shadow_first_rsvd_bit to check whether KVM should set
shadow_nonpresent_or_rsvd_mask or not.

Note that for the physical address bits reported to guest should remain
unchanged -- KVM should report physical address reported by CPUID to
guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME,
there's no harm if guest sets up 'keyID' bits in guest page table (since
MKTME only works at physical address level), and KVM doesn't even expose
MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it
should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore
KVM should still reports physcial address reported by CPUID to guest.

[1] Section 7.10.1 Determining Support for Secure Memory Encryption,
    AMD Architecture Programmer's Manual Volume 2: System Programming).

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++----------
 arch/x86/kvm/mmu.h |   1 +
 arch/x86/kvm/x86.c |  29 ----------
 3 files changed, 125 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e10962dfc203..3add7283980e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5;
  */
 static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
 
+/*
+ * The position of first reserved bit. If it equals to 52, then CPU doesn't
+ * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are
+ * reserved bits.
+ *
+ * boot_cpu_data.x86_phys_bits cannot be used to determine the position of
+ * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce
+ * physical address bits, and the reduced bits are taken away from
+ * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD
+ * SME/SEV treat those reduced bits differently -- Intel MKTME treats them
+ * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved
+ * bits, thus physical address bits reported by CPUID cannot be used to
+ * determine reserved bits position either.
+ */
+static u64 __read_mostly shadow_first_rsvd_bit;
+
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
 static union kvm_mmu_page_role
@@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+void kvm_set_mmio_spte_mask(void)
+{
+	u64 mask;
+
+	/*
+	 * Set the reserved bits and the present bit of an paging-structure
+	 * entry to generate page fault with PFER.RSV = 1.
+	 */
+
+	/*
+	 * Mask the uppermost physical address bit, which would be reserved as
+	 * long as the supported physical address width is less than 52.
+	 */
+	mask = 1ull << 51;
+
+	/* Set the present bit. */
+	mask |= 1ull;
+
+	/*
+	 * If reserved bit is not supported, clear the present bit to disable
+	 * mmio page fault.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52)
+		mask &= ~1ull;
+
+	kvm_mmu_set_mmio_spte_mask(mask, mask);
+}
+
 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 {
 	return sp->role.ad_disabled;
@@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
 	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
 	u64 mask = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
+	u64 high_gpa_offset;
 
 	access &= ACC_WRITE_MASK | ACC_USER_MASK;
 	mask |= shadow_mmio_value | access;
 	mask |= gpa | shadow_nonpresent_or_rsvd_mask;
+	/*
+	 * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to
+	 * shadow_first_rsvd_bit - 1 are actually keyID bits but not
+	 * reserved bits. We need to put high GPA bits to actual reserved
+	 * bits to mitigate L1TF attack.
+	 */
+	high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
+		shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
 	mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
-		<< shadow_nonpresent_or_rsvd_mask_len;
+		<< high_gpa_offset;
 
 	page_header(__pa(sptep))->mmio_cached = true;
 
@@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte)
 static gfn_t get_mmio_spte_gfn(u64 spte)
 {
 	u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;
+	/* See comments in mark_mmio_spte */
+	u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
+		shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
 
-	gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len)
+	gpa |= (spte >> high_gpa_offset)
 	       & shadow_nonpresent_or_rsvd_mask;
 
 	return gpa >> PAGE_SHIFT;
@@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
+static u8 kvm_get_cpuid_phys_bits(void)
+{
+	u32 eax, ebx, ecx, edx;
+
+	if (boot_cpu_data.extended_cpuid_level < 0x80000008)
+		return boot_cpu_data.x86_phys_bits;
+
+	cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+
+	return eax & 0xff;
+}
+
 static void kvm_mmu_reset_all_pte_masks(void)
 {
 	u8 low_phys_bits;
+	bool need_l1tf;
 
 	shadow_user_mask = 0;
 	shadow_accessed_mask = 0;
@@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void)
 	shadow_acc_track_mask = 0;
 
 	/*
-	 * If the CPU has 46 or less physical address bits, then set an
-	 * appropriate mask to guard against L1TF attacks. Otherwise, it is
+	 * Calcualte the first reserved bit position. Although both Intel
+	 * MKTME and AMD SME/SEV reduce physical address bits for memory
+	 * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect
+	 * such fact), they treat those reduced bits differently -- Intel
+	 * MKTME treats those as 'keyID' thus not reserved bits, but AMD
+	 * SME/SEV still treats those bits as reserved bits, so for AMD
+	 * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for
+	 * Intel (and other x86 vendors that don't support memory encryption
+	 * at all), shadow_first_rsvd_bit is physical address bits reported
+	 * by CPUID.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+		shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits;
+	else
+		shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits();
+
+	/*
+	 * Only Intel is impacted by L1TF, therefore for AMD and other x86
+	 * vendors L1TF mitigation is not needed.
+	 *
+	 * For Intel CPU, if it has 46 or less physical address bits, then set
+	 * an appropriate mask to guard against L1TF attacks. Otherwise, it is
 	 * assumed that the CPU is not vulnerable to L1TF.
 	 */
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
+			(shadow_first_rsvd_bit <
+				52 - shadow_nonpresent_or_rsvd_mask_len))
+		need_l1tf = true;
+	else
+		need_l1tf = false;
+
 	low_phys_bits = boot_cpu_data.x86_phys_bits;
-	if (boot_cpu_data.x86_phys_bits <
-	    52 - shadow_nonpresent_or_rsvd_mask_len) {
+	shadow_nonpresent_or_rsvd_mask = 0;
+	if (need_l1tf) {
 		shadow_nonpresent_or_rsvd_mask =
 			rsvd_bits(boot_cpu_data.x86_phys_bits -
 				  shadow_nonpresent_or_rsvd_mask_len,
@@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
 static void
 __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 			struct rsvd_bits_validate *rsvd_check,
-			int maxphyaddr, int level, bool nx, bool gbpages,
+			int first_rsvd_bit, int level, bool nx, bool gbpages,
 			bool pse, bool amd)
 {
 	u64 exb_bit_rsvd = 0;
@@ -4369,14 +4465,14 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 		break;
 	case PT32E_ROOT_LEVEL:
 		rsvd_check->rsvd_bits_mask[0][2] =
-			rsvd_bits(maxphyaddr, 63) |
+			rsvd_bits(first_rsvd_bit, 63) |
 			rsvd_bits(5, 8) | rsvd_bits(1, 2);	/* PDPTE */
 		rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 62);	/* PDE */
+			rsvd_bits(first_rsvd_bit, 62);	/* PDE */
 		rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 62); 	/* PTE */
+			rsvd_bits(first_rsvd_bit, 62); 	/* PTE */
 		rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 62) |
+			rsvd_bits(first_rsvd_bit, 62) |
 			rsvd_bits(13, 20);		/* large page */
 		rsvd_check->rsvd_bits_mask[1][0] =
 			rsvd_check->rsvd_bits_mask[0][0];
@@ -4384,28 +4480,28 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	case PT64_ROOT_5LEVEL:
 		rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd |
 			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
-			rsvd_bits(maxphyaddr, 51);
+			rsvd_bits(first_rsvd_bit, 51);
 		rsvd_check->rsvd_bits_mask[1][4] =
 			rsvd_check->rsvd_bits_mask[0][4];
 		/* fall through */
 	case PT64_ROOT_4LEVEL:
 		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
 			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
-			rsvd_bits(maxphyaddr, 51);
+			rsvd_bits(first_rsvd_bit, 51);
 		rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd |
 			nonleaf_bit8_rsvd | gbpages_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51);
+			rsvd_bits(first_rsvd_bit, 51);
 		rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51);
+			rsvd_bits(first_rsvd_bit, 51);
 		rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51);
+			rsvd_bits(first_rsvd_bit, 51);
 		rsvd_check->rsvd_bits_mask[1][3] =
 			rsvd_check->rsvd_bits_mask[0][3];
 		rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd |
-			gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) |
+			gbpages_bit_rsvd | rsvd_bits(first_rsvd_bit, 51) |
 			rsvd_bits(13, 29);
 		rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51) |
+			rsvd_bits(first_rsvd_bit, 51) |
 			rsvd_bits(13, 20);		/* large page */
 		rsvd_check->rsvd_bits_mask[1][0] =
 			rsvd_check->rsvd_bits_mask[0][0];
@@ -4425,27 +4521,27 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 
 static void
 __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
-			    int maxphyaddr, bool execonly)
+			    int first_rsvd_bit, bool execonly)
 {
 	u64 bad_mt_xwr;
 
 	rsvd_check->rsvd_bits_mask[0][4] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+		rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7);
 	rsvd_check->rsvd_bits_mask[0][3] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+		rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7);
 	rsvd_check->rsvd_bits_mask[0][2] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+		rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6);
 	rsvd_check->rsvd_bits_mask[0][1] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
-	rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+		rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6);
+	rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(first_rsvd_bit, 51);
 
 	/* large page */
 	rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4];
 	rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
 	rsvd_check->rsvd_bits_mask[1][2] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+		rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 29);
 	rsvd_check->rsvd_bits_mask[1][1] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+		rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 20);
 	rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];
 
 	bad_mt_xwr = 0xFFull << (2 * 8);	/* bits 3..5 must not be 2 */
@@ -4523,13 +4619,13 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
 
 	if (boot_cpu_is_amd())
 		__reset_rsvds_bits_mask(vcpu, shadow_zero_check,
-					boot_cpu_data.x86_phys_bits,
+					shadow_first_rsvd_bit,
 					context->shadow_root_level, false,
 					boot_cpu_has(X86_FEATURE_GBPAGES),
 					true, true);
 	else
 		__reset_rsvds_bits_mask_ept(shadow_zero_check,
-					    boot_cpu_data.x86_phys_bits,
+					    shadow_first_rsvd_bit,
 					    false);
 
 	if (!shadow_me_mask)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 54c2a377795b..45e579e112ea 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -51,6 +51,7 @@ static inline u64 rsvd_bits(int s, int e)
 	return ((1ULL << (e - s + 1)) - 1) << s;
 }
 
+void kvm_set_mmio_spte_mask(void);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value);
 
 void
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 38440316a806..3a71aab495b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6848,35 +6848,6 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
 	.get_guest_ip		= kvm_get_guest_ip,
 };
 
-static void kvm_set_mmio_spte_mask(void)
-{
-	u64 mask;
-	int maxphyaddr = boot_cpu_data.x86_phys_bits;
-
-	/*
-	 * Set the reserved bits and the present bit of an paging-structure
-	 * entry to generate page fault with PFER.RSV = 1.
-	 */
-
-	/*
-	 * Mask the uppermost physical address bit, which would be reserved as
-	 * long as the supported physical address width is less than 52.
-	 */
-	mask = 1ull << 51;
-
-	/* Set the present bit. */
-	mask |= 1ull;
-
-	/*
-	 * If reserved bit is not supported, clear the present bit to disable
-	 * mmio page fault.
-	 */
-	if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52)
-		mask &= ~1ull;
-
-	kvm_mmu_set_mmio_spte_mask(mask, mask);
-}
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {