[v13,17/22] x86/kexec: Flush cache of TDX private memory

Message ID	1fa1eb80238dc19b4c732706b40604169316eb34.1692962263.git.kai.huang@intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <kvm-owner@vger.kernel.org>
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
Subject: [PATCH v13 17/22] x86/kexec: Flush cache of TDX private memory
Date: Sat, 26 Aug 2023 00:14:36 +1200
Message-ID: <1fa1eb80238dc19b4c732706b40604169316eb34.1692962263.git.kai.huang@intel.com>
In-Reply-To: <cover.1692962263.git.kai.huang@intel.com>
References: <cover.1692962263.git.kai.huang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Huang, Kai Aug. 25, 2023, 12:14 p.m. UTC

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages; 2) There might be dirty cachelines associated
with TDX private pages.

The first problem doesn't matter on platforms without the "partial write
machine check" erratum.  KeyID 0 doesn't have an integrity check.  If the
new kernel wants to use any non-zero KeyID, it needs to convert the
memory to that KeyID, and such a conversion works from any KeyID.

However, the old kernel needs to guarantee that no dirty cacheline is
left behind before booting to the new kernel, to avoid silent corruption
from a later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
   a. Stop making TDX module SEAMCALLs (TDX global KeyID);
   b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
support.  And in this way 1) happens for free as there's no TDX activity
between wbinvd() and the native_halt().

Flushing cache in stop_this_cpu() only flushes cache on remote cpus.  On
the rebooting cpu which does kexec(), unlike SME which does the cache
flush in relocate_kernel(), flush the cache right after stopping remote
cpus in machine_shutdown().

There are two reasons to do so: 1) For TDX there's no need to defer
cache flush to relocate_kernel() because all TDX activities have been
stopped.  2) On the platforms with the above erratum the kernel must
convert all TDX private pages back to normal before booting to the new
kernel in kexec(), and flushing cache early allows the kernel to convert
memory early rather than having to muck with the relocate_kernel()
assembly.

Theoretically, the cache flush is only needed when the TDX module has
been initialized.  However, initializing the TDX module is done on demand
at runtime, and reading the module status takes a mutex.  Instead, just
check whether TDX is enabled by the BIOS to decide whether to flush the
cache.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/process.c |  8 +++++++-
 arch/x86/kernel/reboot.c  | 15 +++++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)
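
The flush ordering described in the commit message can be modelled in
userspace roughly as below.  This is only a sketch under stated
assumptions, not the patch itself: every model_* name is hypothetical,
model_wbinvd() merely clears a flag where the kernel would execute
wbinvd(), and tdx_enabled_by_bios stands in for the "TDX is enabled by
the BIOS" check.

```c
/* Userspace model only; not kernel code.  All model_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

struct model_cpu {
	bool dirty_private_lines;	/* dirty cachelines tagged with a TDX KeyID */
	bool halted;
};

struct model_machine {
	bool tdx_enabled_by_bios;	/* assumed stand-in for the BIOS-enabled check */
	struct model_cpu cpus[MODEL_NR_CPUS];
};

/* Stand-in for wbinvd(): write back and invalidate this CPU's cache. */
static void model_wbinvd(struct model_cpu *cpu)
{
	cpu->dirty_private_lines = false;
}

/* Remote CPU path, following the description of stop_this_cpu(). */
static void model_stop_this_cpu(struct model_machine *m, size_t cpu)
{
	if (m->tdx_enabled_by_bios)
		model_wbinvd(&m->cpus[cpu]);
	/* Nothing touches TDX private memory between the flush and the halt. */
	m->cpus[cpu].halted = true;
}

/* Rebooting CPU path, following the description of machine_shutdown(). */
static void model_machine_shutdown(struct model_machine *m, size_t self)
{
	assert(self < MODEL_NR_CPUS);

	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (cpu != self)
			model_stop_this_cpu(m, cpu);

	/* Flush the rebooting CPU right after the remote CPUs are stopped. */
	if (m->tdx_enabled_by_bios)
		model_wbinvd(&m->cpus[self]);
}

/* True when no CPU could later write back a stale TDX private cacheline. */
static bool model_safe_to_kexec(const struct model_machine *m, size_t self)
{
	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (m->cpus[cpu].dirty_private_lines)
			return false;
		if (cpu != self && !m->cpus[cpu].halted)
			return false;
	}
	return true;
}
```

In this model, calling model_machine_shutdown() on a machine whose CPUs
all hold dirty private lines leaves model_safe_to_kexec() true, while the
rebooting CPU itself stays running to continue with kexec().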

Edgecombe, Rick P Sept. 15, 2023, 5:43 p.m. UTC | #1

On Sat, 2023-08-26 at 00:14 +1200, Kai Huang wrote:
> There are two problems in terms of using kexec() to boot to a new
> kernel
> when the old kernel has enabled TDX: 1) Part of the memory pages are
> still TDX private pages; 2) There might be dirty cachelines
> associated
> with TDX private pages.

Does TDX support hibernate? I'm wondering about two potential problems:
1. Reading/writing private pages from the direct map on save/restore
2. The seam module needing to be re-inited (the tdx_enable() stuff)

If that's the case you could have something like the below to just
block it when TDX could be in use:
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 2b4a946a6ff5..3b1b7202452d 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -84,7 +84,8 @@ bool hibernation_available(void)
 {
        return nohibernate == 0 &&
                !security_locked_down(LOCKDOWN_HIBERNATION) &&
-               !secretmem_active() && !cxl_mem_active();
+               !secretmem_active() && !cxl_mem_active() &&
+               !platform_tdx_enabled();
 }
 
 /**

Or maybe better, it could check tdx_module_status? But there is no way
to read that variable from hibernate.

Dave Hansen Sept. 15, 2023, 5:50 p.m. UTC | #2

On 9/15/23 10:43, Edgecombe, Rick P wrote:
> On Sat, 2023-08-26 at 00:14 +1200, Kai Huang wrote:
>> There are two problems in terms of using kexec() to boot to a new 
>> kernel when the old kernel has enabled TDX: 1) Part of the memory
>> pages are still TDX private pages; 2) There might be dirty
>> cachelines associated with TDX private pages.
> Does TDX support hibernate?
No.

There's a whole bunch of volatile state that's generated inside the CPU
and never leaves the CPU, like the ephemeral key that protects TDX
module memory.

SGX, for instance, never even supported suspend, IIRC.  Enclaves just
die and have to be rebuilt.

Huang, Kai Sept. 18, 2023, 12:08 p.m. UTC | #3

On Fri, 2023-09-15 at 10:50 -0700, Dave Hansen wrote:
> On 9/15/23 10:43, Edgecombe, Rick P wrote:
> > On Sat, 2023-08-26 at 00:14 +1200, Kai Huang wrote:
> > > There are two problems in terms of using kexec() to boot to a new 
> > > kernel when the old kernel has enabled TDX: 1) Part of the memory
> > > pages are still TDX private pages; 2) There might be dirty
> > > cachelines associated with TDX private pages.
> > Does TDX support hibernate?
> No.
> 
> There's a whole bunch of volatile state that's generated inside the CPU
> and never leaves the CPU, like the ephemeral key that protects TDX
> module memory.
> 
> SGX, for instance, never even supported suspend, IIRC.  Enclaves just
> die and have to be rebuilt.

Right.  AFAICT TDX cannot survive S3 either.  All TDX keys get lost when the
system enters S3.  However I don't think TDX can be rebuilt after resume like
SGX.  Let me confirm with the TDX guys on this.

I think we can register syscore_ops->suspend for TDX, and refuse to suspend when
TDX is enabled.  This covers hibernate case too.

In terms of how to check "TDX is enabled", ideally it's better to check whether
TDX module is actually initialized, but the worst case is we can use
platform_tdx_enabled(). (I need to think more on this)

Hi Dave, Kirill, Rick,

Is this solution overall acceptable?
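
The idea of refusing suspend from a syscore_ops->suspend callback can be
sketched in userspace as below.  This is a model under assumptions, not
kernel code: the struct only mirrors the suspend/resume shape of
syscore_ops, all model_* names are hypothetical, and model_tdx_enabled
stands in for whichever "TDX is enabled" check ends up being used.

```c
/* Userspace model only; not kernel code.  All model_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_syscore_ops {
	int (*suspend)(void);	/* a nonzero return vetoes the suspend */
	void (*resume)(void);
};

#define MODEL_MAX_OPS 8
static struct model_syscore_ops *model_ops[MODEL_MAX_OPS];
static size_t model_nr_ops;

static int model_register_syscore_ops(struct model_syscore_ops *ops)
{
	if (model_nr_ops == MODEL_MAX_OPS)
		return -ENOMEM;
	model_ops[model_nr_ops++] = ops;
	return 0;
}

/* Assumed stand-in for the "TDX is enabled" check discussed above. */
static bool model_tdx_enabled;

/* Refuse to suspend while TDX is enabled. */
static int model_tdx_suspend(void)
{
	return model_tdx_enabled ? -EBUSY : 0;
}

static struct model_syscore_ops model_tdx_syscore_ops = {
	.suspend = model_tdx_suspend,
};

/* Run every suspend callback; on the first failure, undo the earlier ones. */
static int model_syscore_suspend(void)
{
	for (size_t i = 0; i < model_nr_ops; i++) {
		int ret = model_ops[i]->suspend ? model_ops[i]->suspend() : 0;

		if (ret) {
			while (i-- > 0) {
				if (model_ops[i]->resume)
					model_ops[i]->resume();
			}
			return ret;
		}
	}
	return 0;
}
```

With model_tdx_syscore_ops registered, model_syscore_suspend() returns
-EBUSY while model_tdx_enabled is set and 0 otherwise, so the veto costs
nothing on systems where TDX is off.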

Dave Hansen Sept. 18, 2023, 3:44 p.m. UTC | #4

On 9/18/23 05:08, Huang, Kai wrote:
> On Fri, 2023-09-15 at 10:50 -0700, Dave Hansen wrote:
>> On 9/15/23 10:43, Edgecombe, Rick P wrote:
>>> On Sat, 2023-08-26 at 00:14 +1200, Kai Huang wrote:
>>>> There are two problems in terms of using kexec() to boot to a new
>>>> kernel when the old kernel has enabled TDX: 1) Part of the memory
>>>> pages are still TDX private pages; 2) There might be dirty
>>>> cachelines associated with TDX private pages.
>>> Does TDX support hibernate?
>> No.
>>
>> There's a whole bunch of volatile state that's generated inside the CPU
>> and never leaves the CPU, like the ephemeral key that protects TDX
>> module memory.
>>
>> SGX, for instance, never even supported suspend, IIRC.  Enclaves just
>> die and have to be rebuilt.
> 
> Right.  AFAICT TDX cannot survive from S3 either.  All TDX keys get lost when
> system enters S3.  However I don't think TDX can be rebuilt after resume like
> SGX.  Let me confirm with TDX guys on this.

By "rebuilt" I mean all private data is totally destroyed and rebuilt
from scratch.  The SGX architecture provides zero help other than
delivering a fault and saying: "whoops all your data is gone".

> I think we can register syscore_ops->suspend for TDX, and refuse to suspend when
> TDX is enabled.  This covers hibernate case too.
> 
> In terms of how to check "TDX is enabled", ideally it's better to check whether
> TDX module is actually initialized, but the worst case is we can use
> platform_tdx_enabled(). (I need to think more on this)

*Ideally* the firmware would have a choke point where it could just tell
the OS that it can't suspend rather than the OS having to figure it out.

Huang, Kai Sept. 18, 2023, 10:14 p.m. UTC | #5

On Mon, 2023-09-18 at 08:44 -0700, Dave Hansen wrote:
> On 9/18/23 05:08, Huang, Kai wrote:
> > On Fri, 2023-09-15 at 10:50 -0700, Dave Hansen wrote:
> > > On 9/15/23 10:43, Edgecombe, Rick P wrote:
> > > > On Sat, 2023-08-26 at 00:14 +1200, Kai Huang wrote:
> > > > > There are two problems in terms of using kexec() to boot to a new
> > > > > kernel when the old kernel has enabled TDX: 1) Part of the memory
> > > > > pages are still TDX private pages; 2) There might be dirty
> > > > > cachelines associated with TDX private pages.
> > > > Does TDX support hibernate?
> > > No.
> > > 
> > > There's a whole bunch of volatile state that's generated inside the CPU
> > > and never leaves the CPU, like the ephemeral key that protects TDX
> > > module memory.
> > > 
> > > SGX, for instance, never even supported suspend, IIRC.  Enclaves just
> > > die and have to be rebuilt.
> > 
> > Right.  AFAICT TDX cannot survive from S3 either.  All TDX keys get lost when
> > system enters S3.  However I don't think TDX can be rebuilt after resume like
> > SGX.  Let me confirm with TDX guys on this.
> 
> By "rebuilt" I mean all private data is totally destroyed and rebuilt
> from scratch.  The SGX architecture provides zero help other than
> delivering a fault and saying: "whoops all your data is gone".

Right.  For TDX I am worried that a SEAMCALL could poison memory and thus
trigger #MC inside the kernel, or even inside SEAM, instead of delivering a
fault that the SGX app/kernel can handle.  I am confirming with the TDX
team.

> 
> > I think we can register syscore_ops->suspend for TDX, and refuse to suspend when
> > TDX is enabled.  This covers hibernate case too.
> > 
> > In terms of how to check "TDX is enabled", ideally it's better to check whether
> > TDX module is actually initialized, but the worst case is we can use
> > platform_tdx_enabled(). (I need to think more on this)
> 
> *Ideally* the firmware would have a choke point where it could just tell
> the OS that it can't suspend rather than the OS having to figure it out.

Agreed.  Let me ask TDX team about this too.
