diff mbox series

[v8,15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled

Message ID ee5185e1727c3cd8bd51dbf9fcec95d432100d12.1670566861.git.kai.huang@intel.com (mailing list archive)
State New
Headers show
Series TDX host kernel support | expand

Commit Message

Huang, Kai Dec. 9, 2022, 6:52 a.m. UTC
There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) Part of the memory pages are
still TDX private pages (i.e. metadata used by the TDX module, and any
TDX guest memory if kexec() happens when there's any TDX guest alive).
2) There might be dirty cachelines associated with TDX private pages.

Because the hardware doesn't guarantee cache coherency among different
KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
before booting to the new kernel.  Also, reading TDX private page using
any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
Therefore ideally, the kernel should convert all TDX private pages back
to normal before booting to the new kernel.

However, this implementation doesn't convert TDX private pages back to
normal in kexec() because of below considerations:

1) Neither the kernel nor the TDX module has existing infrastructure to
   track which pages are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
   them (cache flush + using MOVDIR64B to clear the page) in kexec() can
   be time consuming.
3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
   0 doesn't support integrity-check, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
   kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
   page with the new MKTME KeyID (just like TDX does) before using it.

Therefore, this implementation just flushes cache to make sure there are
no stale dirty cachelines associated with any TDX private KeyIDs before
booting to the new kernel, otherwise they may silently corrupt the new
kernel.

Following SME support, use wbinvd() to flush cache in stop_this_cpu().
Theoretically, cache flush is only needed when the TDX module has been
initialized.  However initializing the TDX module is done on demand at
runtime, and it takes a mutex to read the module status.  Just check
whether TDX is enabled by BIOS instead to flush cache.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Changelog:
   - Removed "leave TDX module open" part due to shut down patch has been
     removed.

v6 -> v7:
 - Improved changelog to explain why don't convert TDX private pages back
   to normal.

---
 arch/x86/kernel/process.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Dave Hansen Jan. 7, 2023, 12:35 a.m. UTC | #1
On 12/8/22 22:52, Kai Huang wrote:
> There are two problems in terms of using kexec() to boot to a new kernel
> when the old kernel has enabled TDX: 1) Part of the memory pages are
> still TDX private pages (i.e. metadata used by the TDX module, and any
> TDX guest memory if kexec() happens when there's any TDX guest alive).
> 2) There might be dirty cachelines associated with TDX private pages.
> 
> Because the hardware doesn't guarantee cache coherency among different
> KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
> before booting to the new kernel.  Also, reading TDX private page using
> any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
> Therefore ideally, the kernel should convert all TDX private pages back
> to normal before booting to the new kernel.

This is just talking about way too many things that just don't apply.

Let's focus on the *ACTUAL* problem that's being addressed instead of
the 15 problems that aren't actual practical problems.

> However, this implementation doesn't convert TDX private pages back to
> normal in kexec() because of below considerations:
> 
> 1) Neither the kernel nor the TDX module has existing infrastructure to
>    track which pages are TDX private pages.
> 2) The number of TDX private pages can be large, and converting all of
>    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
>    be time consuming.
> 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
>    0 doesn't support integrity-check, so it's OK.
> 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
>    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
>    page with the new MKTME KeyID (just like TDX does) before using it.

Yeah, why are we getting all worked up about MKTME when there is not
support?

The only thing that matters here is dirty cacheline writeback.  There
are two things the kernel needs to do to mitigate that:

 1. Stop accessing TDX private memory mappings
  1a. Stop making TDX module calls (uses global private KeyID)
  1b. Stop TDX guests from running (uses per-guest KeyID)
 2. Flush any cachelines from previous private KeyID writes

There are a couple of ways we can do #2.  We do *NOT* need to convert
*ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
in any way as far as I can tell.

I think you're also saying that since all CPUs go through this path and
there is no TDX activity between the WBINVD and the native_halt() that
1a and 1b basically happen for "free" without needing to do theme
explicitly.

> Therefore, this implementation just flushes cache to make sure there are
> no stale dirty cachelines associated with any TDX private KeyIDs before
> booting to the new kernel, otherwise they may silently corrupt the new
> kernel.

That's fine.  So, this patch kinda happens to land in the right spot
even after thrashing about about a while.

> Following SME support, use wbinvd() to flush cache in stop_this_cpu().
> Theoretically, cache flush is only needed when the TDX module has been
> initialized.  However initializing the TDX module is done on demand at
> runtime, and it takes a mutex to read the module status.  Just check
> whether TDX is enabled by BIOS instead to flush cache.

Yeah, close enough.

> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index c21b7347a26d..0cc84977dc62 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
>  	 *
>  	 * Test the CPUID bit directly because the machine might've cleared
>  	 * X86_FEATURE_SME due to cmdline options.
> +	 *
> +	 * Similar to SME, if the TDX module is ever initialized, the
> +	 * cachelines associated with any TDX private KeyID must be flushed
> +	 * before transiting to the new kernel.  The TDX module is initialized
> +	 * on demand, and it takes the mutex to read its status.  Just check
> +	 * whether TDX is enabled by BIOS instead to flush cache.
>  	 */

There's too much detail here.  Let's up-level it a bit.  We don't need
to be talking about TDX locking here.

	/*
	 * The TDX module or guests might have left dirty cachelines
	 * behind.  Flush them to avoid corruption from later writeback.
	 * Note that this flushes on all systems where TDX is possible,
	 * but does not actually check that TDX was in use.
	 */

> -	if (cpuid_eax(0x8000001f) & BIT(0))
> +	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
>  		native_wbinvd();
>  	for (;;) {
>  		/*
Huang, Kai Jan. 10, 2023, 11:29 a.m. UTC | #2
On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > There are two problems in terms of using kexec() to boot to a new kernel
> > when the old kernel has enabled TDX: 1) Part of the memory pages are
> > still TDX private pages (i.e. metadata used by the TDX module, and any
> > TDX guest memory if kexec() happens when there's any TDX guest alive).
> > 2) There might be dirty cachelines associated with TDX private pages.
> > 
> > Because the hardware doesn't guarantee cache coherency among different
> > KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
> > before booting to the new kernel.  Also, reading TDX private page using
> > any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
> > Therefore ideally, the kernel should convert all TDX private pages back
> > to normal before booting to the new kernel.
> 
> This is just talking about way too many things that just don't apply.
> 
> Let's focus on the *ACTUAL* problem that's being addressed instead of
> the 15 problems that aren't actual practical problems.

Will get rid of those.

> 
> > However, this implementation doesn't convert TDX private pages back to
> > normal in kexec() because of below considerations:
> > 
> > 1) Neither the kernel nor the TDX module has existing infrastructure to
> >    track which pages are TDX private pages.
> > 2) The number of TDX private pages can be large, and converting all of
> >    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
> >    be time consuming.
> > 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
> >    0 doesn't support integrity-check, so it's OK.
> > 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
> >    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
> >    page with the new MKTME KeyID (just like TDX does) before using it.
> 
> Yeah, why are we getting all worked up about MKTME when there is not
> support?

I am not sure whether we need to consider 3rd party kernel case?

> 
> The only thing that matters here is dirty cacheline writeback.  There
> are two things the kernel needs to do to mitigate that:
> 
>  1. Stop accessing TDX private memory mappings
>   1a. Stop making TDX module calls (uses global private KeyID)
>   1b. Stop TDX guests from running (uses per-guest KeyID)
>  2. Flush any cachelines from previous private KeyID writes
> 
> There are a couple of ways we can do #2.  We do *NOT* need to convert
> *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
> in any way as far as I can tell.

May I ask why?  When I was writing this patch I was not sure whether kexec()
should give the new kernel a clean slate.  SGX driver doesn't EREMOVE all EPC
during kexec() but depends on the new kernel to do that too, but I don't know
what's the general guide of supporting kexec().

> 
> I think you're also saying that since all CPUs go through this path and
> there is no TDX activity between the WBINVD and the native_halt() that
> 1a and 1b basically happen for "free" without needing to do theme
> explicitly.

Yes.  Should we mention this part in changelog?

AMD SME kexec() support patch bba4ed011a52d ("x86/mm, kexec: Allow kexec to be
used with SME") seems doesn't mention anything similar (SME and TDX may be
different, though).

> 
> > Therefore, this implementation just flushes cache to make sure there are
> > no stale dirty cachelines associated with any TDX private KeyIDs before
> > booting to the new kernel, otherwise they may silently corrupt the new
> > kernel.
> 
> That's fine.  So, this patch kinda happens to land in the right spot
> even after thrashing about about a while.
> 
> > Following SME support, use wbinvd() to flush cache in stop_this_cpu().
> > Theoretically, cache flush is only needed when the TDX module has been
> > initialized.  However initializing the TDX module is done on demand at
> > runtime, and it takes a mutex to read the module status.  Just check
> > whether TDX is enabled by BIOS instead to flush cache.
> 
> Yeah, close enough.
> 
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index c21b7347a26d..0cc84977dc62 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
> >  	 *
> >  	 * Test the CPUID bit directly because the machine might've cleared
> >  	 * X86_FEATURE_SME due to cmdline options.
> > +	 *
> > +	 * Similar to SME, if the TDX module is ever initialized, the
> > +	 * cachelines associated with any TDX private KeyID must be flushed
> > +	 * before transiting to the new kernel.  The TDX module is initialized
> > +	 * on demand, and it takes the mutex to read its status.  Just check
> > +	 * whether TDX is enabled by BIOS instead to flush cache.
> >  	 */
> 
> There's too much detail here.  Let's up-level it a bit.  We don't need
> to be talking about TDX locking here.

Sure will do.  Thanks!
> 
> 	/*
> 	 * The TDX module or guests might have left dirty cachelines
> 	 * behind.  Flush them to avoid corruption from later writeback.
> 	 * Note that this flushes on all systems where TDX is possible,
> 	 * but does not actually check that TDX was in use.
> 	 */
> 
> > -	if (cpuid_eax(0x8000001f) & BIT(0))
> > +	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
> >  		native_wbinvd();
> >  	for (;;) {
> >  		/*
> 
>
Dave Hansen Jan. 10, 2023, 3:27 p.m. UTC | #3
On 1/10/23 03:29, Huang, Kai wrote:
> On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
>> On 12/8/22 22:52, Kai Huang wrote:
...
>>> However, this implementation doesn't convert TDX private pages back to
>>> normal in kexec() because of below considerations:
>>>
>>> 1) Neither the kernel nor the TDX module has existing infrastructure to
>>>    track which pages are TDX private pages.
>>> 2) The number of TDX private pages can be large, and converting all of
>>>    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
>>>    be time consuming.
>>> 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
>>>    0 doesn't support integrity-check, so it's OK.
>>> 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
>>>    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
>>>    page with the new MKTME KeyID (just like TDX does) before using it.
>>
>> Yeah, why are we getting all worked up about MKTME when there is not
>> support?
> 
> I am not sure whether we need to consider 3rd party kernel case?

No, we don't.

>> The only thing that matters here is dirty cacheline writeback.  There
>> are two things the kernel needs to do to mitigate that:
>>
>>  1. Stop accessing TDX private memory mappings
>>   1a. Stop making TDX module calls (uses global private KeyID)
>>   1b. Stop TDX guests from running (uses per-guest KeyID)
>>  2. Flush any cachelines from previous private KeyID writes
>>
>> There are a couple of ways we can do #2.  We do *NOT* need to convert
>> *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
>> in any way as far as I can tell.
> 
> May I ask why?  When I was writing this patch I was not sure whether kexec()
> should give the new kernel a clean slate.  SGX driver doesn't EREMOVE all EPC
> during kexec() but depends on the new kernel to do that too, but I don't know
> what's the general guide of supporting kexec().

Think about it this way: kexec() is modifying persistent (across kexec)
state to get the system ready for the new kernel.  The caches are
persistent state.  Devices have persistent state.  Memory state persists
across kexec().  The memory integrity metadata persists.

What persistent state does a conversion to KeyID-0 affect?  It resets
the integrity metadata and the memory contents.

Kexec leaves memory contents in place and doesn't zero them, so memory
contents don't matter.  The integrity metadata also doesn't matter
because the memory will be used as KeyID-0 and that KeyID doesn't read
the integrity metadata.

What practical impact does a conversion back to KeyID-0 serve?  What
persistent state does it affect that matters?

>> I think you're also saying that since all CPUs go through this path and
>> there is no TDX activity between the WBINVD and the native_halt() that
>> 1a and 1b basically happen for "free" without needing to do theme
>> explicitly.
> 
> Yes.  Should we mention this part in changelog?

That would be nice.
Huang, Kai Jan. 11, 2023, 12:13 a.m. UTC | #4
On Tue, 2023-01-10 at 07:27 -0800, Dave Hansen wrote:
> On 1/10/23 03:29, Huang, Kai wrote:
> > On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
> > > On 12/8/22 22:52, Kai Huang wrote:
> ...
> > > > However, this implementation doesn't convert TDX private pages back to
> > > > normal in kexec() because of below considerations:
> > > > 
> > > > 1) Neither the kernel nor the TDX module has existing infrastructure to
> > > >    track which pages are TDX private pages.
> > > > 2) The number of TDX private pages can be large, and converting all of
> > > >    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
> > > >    be time consuming.
> > > > 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
> > > >    0 doesn't support integrity-check, so it's OK.
> > > > 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
> > > >    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
> > > >    page with the new MKTME KeyID (just like TDX does) before using it.
> > > 
> > > Yeah, why are we getting all worked up about MKTME when there is not
> > > support?
> > 
> > I am not sure whether we need to consider 3rd party kernel case?
> 
> No, we don't.

Good to know.

> 
> > > The only thing that matters here is dirty cacheline writeback.  There
> > > are two things the kernel needs to do to mitigate that:
> > > 
> > >  1. Stop accessing TDX private memory mappings
> > >   1a. Stop making TDX module calls (uses global private KeyID)
> > >   1b. Stop TDX guests from running (uses per-guest KeyID)
> > >  2. Flush any cachelines from previous private KeyID writes
> > > 
> > > There are a couple of ways we can do #2.  We do *NOT* need to convert
> > > *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
> > > in any way as far as I can tell.
> > 
> > May I ask why?  When I was writing this patch I was not sure whether kexec()
> > should give the new kernel a clean slate.  SGX driver doesn't EREMOVE all EPC
> > during kexec() but depends on the new kernel to do that too, but I don't know
> > what's the general guide of supporting kexec().
> 
> Think about it this way: kexec() is modifying persistent (across kexec)
> state to get the system ready for the new kernel.  The caches are
> persistent state.  Devices have persistent state.  Memory state persists
> across kexec().  The memory integrity metadata persists.
> 
> What persistent state does a conversion to KeyID-0 affect?  It resets
> the integrity metadata and the memory contents.
> 
> Kexec leaves memory contents in place and doesn't zero them, so memory
> contents don't matter.  The integrity metadata also doesn't matter
> because the memory will be used as KeyID-0 and that KeyID doesn't read
> the integrity metadata.

Right.  So I guess we just need to call out the new kernel will use memory as
KeyID-0?

> 
> What practical impact does a conversion back to KeyID-0 serve?  What
> persistent state does it affect that matters?

If we can be sure the new kernel will use KeyID-0, then we don't need to
convert.  In the 3) and 4) in my changelog, I actually was trying to convery
this.
  
> 
> > > I think you're also saying that since all CPUs go through this path and
> > > there is no TDX activity between the WBINVD and the native_halt() that
> > > 1a and 1b basically happen for "free" without needing to do theme
> > > explicitly.
> > 
> > Yes.  Should we mention this part in changelog?
> 
> That would be nice.
> 

Will do.
Dave Hansen Jan. 11, 2023, 12:30 a.m. UTC | #5
On 1/10/23 16:13, Huang, Kai wrote:
> On Tue, 2023-01-10 at 07:27 -0800, Dave Hansen wrote:
...
>> Think about it this way: kexec() is modifying persistent (across kexec)
>> state to get the system ready for the new kernel.  The caches are
>> persistent state.  Devices have persistent state.  Memory state persists
>> across kexec().  The memory integrity metadata persists.
>>
>> What persistent state does a conversion to KeyID-0 affect?  It resets
>> the integrity metadata and the memory contents.
>>
>> Kexec leaves memory contents in place and doesn't zero them, so memory
>> contents don't matter.  The integrity metadata also doesn't matter
>> because the memory will be used as KeyID-0 and that KeyID doesn't read
>> the integrity metadata.
> 
> Right.  So I guess we just need to call out the new kernel will use memory as
> KeyID-0?

Not even that.

Say the new kernel wanted to use the memory as KeyID-3.  What would it
do?  It would *ASSUME* that the memory *WASN'T* KeyID-3.  It would
convert it to KeyID-3.  That conversion would work from *any* KeyID.

So:

	KeyID-0: OK, because it has no integrity enforcement
	KeyID-1: OK, new kernel will convert the page
	KeyID-2: OK, new kernel will convert the page
	...
	KeyID-$MAX: OK, new kernel will convert the page

So, "OK" everywhere.  Nothing to do... anywhere.

Either I'm totally missing how this works, or you're desperately trying
to make this more complicated than it is.
Huang, Kai Jan. 11, 2023, 1:58 a.m. UTC | #6
On Tue, 2023-01-10 at 16:30 -0800, Hansen, Dave wrote:
> On 1/10/23 16:13, Huang, Kai wrote:
> > On Tue, 2023-01-10 at 07:27 -0800, Dave Hansen wrote:
> ...
> > > Think about it this way: kexec() is modifying persistent (across kexec)
> > > state to get the system ready for the new kernel.  The caches are
> > > persistent state.  Devices have persistent state.  Memory state persists
> > > across kexec().  The memory integrity metadata persists.
> > > 
> > > What persistent state does a conversion to KeyID-0 affect?  It resets
> > > the integrity metadata and the memory contents.
> > > 
> > > Kexec leaves memory contents in place and doesn't zero them, so memory
> > > contents don't matter.  The integrity metadata also doesn't matter
> > > because the memory will be used as KeyID-0 and that KeyID doesn't read
> > > the integrity metadata.
> > 
> > Right.  So I guess we just need to call out the new kernel will use memory as
> > KeyID-0?
> 
> Not even that.
> 
> Say the new kernel wanted to use the memory as KeyID-3.  What would it
> do?  It would *ASSUME* that the memory *WASN'T* KeyID-3.  It would
> convert it to KeyID-3.  That conversion would work from *any* KeyID.
> 
> So:
> 
> 	KeyID-0: OK, because it has no integrity enforcement
> 	KeyID-1: OK, new kernel will convert the page
> 	KeyID-2: OK, new kernel will convert the page
> 	...
> 	KeyID-$MAX: OK, new kernel will convert the page
> 
> So, "OK" everywhere.  Nothing to do... anywhere.
> 
> Either I'm totally missing how this works, or you're desperately trying
> to make this more complicated than it is.
> 

You are right.  The page conversion must do MOVDIR64B first even converting the
page from KeyID 0.  I was wrongly thinking when converting from KeyID 0 we don't
need to do MOVDIR64B.  My bad.

Sorry for the noise.  Thanks for your time.  I'll remove all those staff.
diff mbox series

Patch

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c21b7347a26d..0cc84977dc62 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -765,8 +765,14 @@  void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * Similar to SME, if the TDX module is ever initialized, the
+	 * cachelines associated with any TDX private KeyID must be flushed
+	 * before transiting to the new kernel.  The TDX module is initialized
+	 * on demand, and it takes the mutex to read its status.  Just check
+	 * whether TDX is enabled by BIOS instead to flush cache.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*