[v11,17/20] x86/kexec: Flush cache of TDX private memory

Message ID	17bcbe3e154415ee7a4c77489809a3db0c5ddf3f.1685887183.git.kai.huang@intel.com (mailing list archive)
State	New, archived
Headers
	Return-Path: <kvm-owner@vger.kernel.org>
	From: Kai Huang <kai.huang@intel.com>
	To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
	Cc: linux-mm@kvack.org, dave.hansen@intel.com,
	        kirill.shutemov@linux.intel.com, tony.luck@intel.com,
	        peterz@infradead.org, tglx@linutronix.de, seanjc@google.com,
	        pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com,
	        rafael.j.wysocki@intel.com, ying.huang@intel.com,
	        reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com,
	        isaku.yamahata@intel.com, chao.gao@intel.com,
	        sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com,
	        sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
	Subject: [PATCH v11 17/20] x86/kexec: Flush cache of TDX private memory
	Date: Mon, 5 Jun 2023 02:27:30 +1200
	Message-Id: <17bcbe3e154415ee7a4c77489809a3db0c5ddf3f.1685887183.git.kai.huang@intel.com>
	In-Reply-To: <cover.1685887183.git.kai.huang@intel.com>
	References: <cover.1685887183.git.kai.huang@intel.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	TDX host kernel support
	[v11,00/20] TDX host kernel support
	[v11,01/20] x86/tdx: Define TDX supported page sizes as macros
	[v11,02/20] x86/virt/tdx: Detect TDX during kernel boot
	[v11,03/20] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
	[v11,04/20] x86/cpu: Detect TDX partial write machine check erratum
	[v11,05/20] x86/virt/tdx: Add SEAMCALL infrastructure
	[v11,06/20] x86/virt/tdx: Handle SEAMCALL running out of entropy error
	[v11,07/20] x86/virt/tdx: Add skeleton to enable TDX on demand
	[v11,08/20] x86/virt/tdx: Get information about TDX module and TDX-capable memory
	[v11,09/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
	[v11,10/20] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
	[v11,11/20] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
	[v11,12/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
	[v11,13/20] x86/virt/tdx: Designate reserved areas for all TDMRs
	[v11,14/20] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
	[v11,15/20] x86/virt/tdx: Configure global KeyID on all packages
	[v11,16/20] x86/virt/tdx: Initialize all TDMRs
	[v11,17/20] x86/kexec: Flush cache of TDX private memory
	[v11,18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
	[v11,19/20] x86/mce: Improve error log of kernel space TDX #MC due to erratum
	[v11,20/20] Documentation/x86: Add documentation for TDX host support

Commit Message

Huang, Kai June 4, 2023, 2:27 p.m. UTC

There are two problems with using kexec() to boot a new kernel when the
old kernel has enabled TDX: 1) Some memory pages are still TDX private
pages; 2) There might be dirty cachelines associated with TDX private
pages.

The first problem doesn't matter on platforms without the "partial write
machine check" erratum.  KeyID 0 doesn't have an integrity check.  If the
new kernel wants to use any non-zero KeyID, it needs to convert the
memory to that KeyID, and such a conversion works from any KeyID.

However, the old kernel needs to guarantee that no dirty cacheline is
left behind before booting the new kernel, to avoid silent corruption
from a later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
   a. Stop making TDX module SEAMCALLs (TDX global KeyID);
   b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush the cache in stop_this_cpu(), following
the SME support.  This way 1) happens for free, as there's no TDX
activity between wbinvd() and native_halt().

Flushing cache in stop_this_cpu() only flushes cache on remote cpus.  On
the cpu which does kexec(), unlike SME which does the cache flush in
relocate_kernel(), do the cache flush right after stopping remote cpus
in machine_shutdown().  This is because on platforms with the above
erratum, the kernel needs to convert all TDX private pages back to
normal before a fast warm reset reboot or booting to the new kernel in
kexec(), and the cache flush must be done before that conversion.
Flushing cache in relocate_kernel() only covers kexec() but not the fast
warm reset reboot.
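
The choice of flush site for the rebooting cpu can be illustrated with a
small truth table.  This is a sketch only: the enum names and
local_flush_happens() are hypothetical, and it assumes (as the text
above implies) that machine_shutdown() runs on both the kexec() path and
the fast warm reset path.

```c
#include <stdbool.h>

/* Hypothetical names for the two ways the old kernel can hand over. */
enum exit_path { EXIT_KEXEC, EXIT_WARM_RESET };

/* Hypothetical names for where the flush on the rebooting cpu could live. */
enum flush_site { FLUSH_IN_RELOCATE_KERNEL, FLUSH_IN_MACHINE_SHUTDOWN };

/* Returns true if the rebooting cpu's cache is flushed on the given path. */
bool local_flush_happens(enum flush_site site, enum exit_path path)
{
	switch (site) {
	case FLUSH_IN_MACHINE_SHUTDOWN:
		/* Assumed to run on both paths, as the changelog implies. */
		return true;
	case FLUSH_IN_RELOCATE_KERNEL:
		/* relocate_kernel() is only reached when doing kexec(). */
		return path == EXIT_KEXEC;
	}
	return false;
}
```

Only FLUSH_IN_MACHINE_SHUTDOWN yields a flush for EXIT_WARM_RESET, which
is the case the erratum handling depends on.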

Theoretically, the cache flush is only needed when the TDX module has
been initialized.  However, initializing the TDX module is done on
demand at runtime, and reading the module status takes a mutex.
Instead, just check whether TDX is enabled by the BIOS to decide whether
to flush the cache.
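
The trade-off can be sketched as two predicates.  All names below are
hypothetical and only illustrate the reasoning: the BIOS-based check is
a superset of the precise one, so it may flush when not strictly needed
but never skips a needed flush.

```c
#include <stdbool.h>

/* Hypothetical module states; the real status is read under a mutex. */
enum module_state { MODULE_UNINITIALIZED, MODULE_INITIALIZED };

/* Precise condition: flush only if the TDX module was actually initialized. */
bool flush_strictly_needed(bool bios_enabled, enum module_state state)
{
	return bios_enabled && state == MODULE_INITIALIZED;
}

/* Condition used instead: depends only on a boot-time fact, so no lock. */
bool flush_performed(bool bios_enabled)
{
	return bios_enabled;
}
```

For every combination of inputs, flush_strictly_needed() being true
implies flush_performed() is true; the only cost is an extra wbinvd() on
systems where TDX is enabled by the BIOS but the module was never
initialized.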

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v10 -> v11:
 - Fixed a bug where the cache of the rebooting cpu wasn't flushed for
   TDX private memory.
 - Updated changelog accordingly.

v9 -> v10:
 - No change.

v8 -> v9:
 - Various changelog enhancements and fixes (Dave).
 - Improved comment (Dave).

v7 -> v8:
 - Changelog:
   - Removed the "leave TDX module open" part because the shutdown
     patch has been removed.

v6 -> v7:
 - Improved changelog to explain why TDX private pages aren't converted
   back to normal.


---
 arch/x86/kernel/process.c |  7 ++++++-
 arch/x86/kernel/reboot.c  | 15 +++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

Comments

Kirill A . Shutemov June 9, 2023, 10:14 a.m. UTC | #1

On Mon, Jun 05, 2023 at 02:27:30AM +1200, Kai Huang wrote:
> There are two problems in terms of using kexec() to boot to a new kernel
> when the old kernel has enabled TDX: 1) Part of the memory pages are
> still TDX private pages; 2) There might be dirty cachelines associated
> with TDX private pages.
> 
> The first problem doesn't matter on the platforms w/o the "partial write
> machine check" erratum.  KeyID 0 doesn't have integrity check.  If the
> new kernel wants to use any non-zero KeyID, it needs to convert the
> memory to that KeyID and such conversion would work from any KeyID.
> 
> However the old kernel needs to guarantee there's no dirty cacheline
> left behind before booting to the new kernel to avoid silent corruption
> from later cacheline writeback (Intel hardware doesn't guarantee cache
> coherency across different KeyIDs).
> 
> There are two things that the old kernel needs to do to achieve that:
> 
> 1) Stop accessing TDX private memory mappings:
>    a. Stop making TDX module SEAMCALLs (TDX global KeyID);
>    b. Stop TDX guests from running (per-guest TDX KeyID).
> 2) Flush any cachelines from previous TDX private KeyID writes.
> 
> For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
> support.  And in this way 1) happens for free as there's no TDX activity
> between wbinvd() and the native_halt().
> 
> Flushing cache in stop_this_cpu() only flushes cache on remote cpus.  On
> the cpu which does kexec(), unlike SME which does the cache flush in
> relocate_kernel(), do the cache flush right after stopping remote cpus
> in machine_shutdown().  This is because on the platforms with above
> erratum, the kernel needs to convert all TDX private pages back to
> normal before a fast warm reset reboot or booting to the new kernel in
> kexec().  Flushing cache in relocate_kernel() only covers the kexec()
> but not the fast warm reset reboot.
> 
> Theoretically, cache flush is only needed when the TDX module has been
> initialized.  However initializing the TDX module is done on demand at
> runtime, and it takes a mutex to read the module status.  Just check
> whether TDX is enabled by the BIOS instead to flush cache.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Patch

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dac41a0072ea..0ce66deb9bc8 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -780,8 +780,13 @@  void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * The TDX module or guests might have left dirty cachelines
+	 * behind.  Flush them to avoid corruption from later writeback.
+	 * Note that this flushes on all systems where TDX is possible,
+	 * but does not actually check that TDX was in use.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 3adbe97015c1..b3d0e015dae2 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -32,6 +32,7 @@ 
 #include <asm/realmode.h>
 #include <asm/x86_init.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
 
 /*
  * Power off function, if any
@@ -695,6 +696,20 @@  void native_machine_shutdown(void)
 	local_irq_disable();
 	stop_other_cpus();
 #endif
+	/*
+	 * stop_other_cpus() has flushed all dirty cachelines of TDX
+	 * private memory on remote cpus.  Unlike SME, which does the
+	 * cache flush on _this_ cpu in the relocate_kernel(), flush
+	 * the cache for _this_ cpu here.  This is because on the
+	 * platforms with "partial write machine check" erratum the
+	 * kernel needs to convert all TDX private pages back to normal
+	 * before a fast warm reset reboot or booting to the new kernel
+	 * in kexec(), and the cache flush must be done before that.
+	 * Flushing cache in relocate_kernel() only covers the kexec()
+	 * but not the fast warm reset reboot.
+	 */
+	if (platform_tdx_enabled())
+		native_wbinvd();
 
 	lapic_shutdown();
 	restore_boot_irq_mode();
