[v7,06/20] x86/virt/tdx: Shut down TDX module in case of error

Message ID 48505089b645019a734d85c2c29f3c8ae2dbd6bd.1668988357.git.kai.huang@intel.com (mailing list archive)
State New, archived
Series TDX host kernel support

Commit Message

Huang, Kai Nov. 21, 2022, 12:26 a.m. UTC
TDX supports shutting down the TDX module at any time during its
lifetime.  After the module is shut down, no further TDX module SEAMCALL
leaf functions can be made to the module on any logical cpu.

Shut down the TDX module in case of any error during the initialization
process.  It's pointless to leave the TDX module in some middle state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
CPUs and use it to shut down the module.  Later logical-cpu scope module
initialization will use it too.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
 - No change.

v5 -> v6:
 - Moved the seamcall() wrapper to the previous patch (Dave).

v3 -> v5 (no feedback on v4):
 - Added a wrapper of __seamcall() to print error code if SEAMCALL fails.
 - Made the seamcall_on_each_cpu() void.
 - Removed 'seamcall_ret' and 'tdx_module_out' from
   'struct seamcall_ctx', as they must be local variables.
 - Added the comments to tdx_init() and one paragraph to changelog to
   explain the caller should handle VMXON.
 - Called out that after shutdown, no "TDX module" SEAMCALLs can be made.

---
 arch/x86/virt/vmx/tdx/tdx.c | 43 +++++++++++++++++++++++++++++++++----
 arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
 2 files changed, 44 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra Nov. 22, 2022, 9:10 a.m. UTC | #1
On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> +/*
> + * Data structure to make SEAMCALL on multiple CPUs concurrently.
> + * @err is set to -EFAULT when SEAMCALL fails on any cpu.
> + */
> +struct seamcall_ctx {
> +	u64 fn;
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	atomic_t err;
> +};

> @@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	}
>  }
>  
> +static void seamcall_smp_call_function(void *data)
> +{
> +	struct seamcall_ctx *sc = data;
> +	int ret;
> +
> +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
> +	if (ret)
> +		atomic_set(&sc->err, -EFAULT);
> +}

Can someone explain this usage of atomic_t to me, please?
Peter Zijlstra Nov. 22, 2022, 9:13 a.m. UTC | #2
On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> +/*
> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

Suppose the user has NOHZ_FULL configured, and is already running
userspace that will terminate on interrupt (this is desired feature for
NOHZ_FULL), guess how happy they'll be if someone, on another partition,
manages to tickle this TDX gunk?
Peter Zijlstra Nov. 22, 2022, 9:20 a.m. UTC | #3
On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> CPUs and use it to shut down the module.  Later logical-cpu scope module
> initialization will use it too.

Uhh, those requirements ^ are not met by this:

> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

Consider:

	CPU0			CPU1			CPU2

	local_irq_disable()
	...
				seamcall_on_each_cpu()
				send-IPIs to 0 and 2
							<IPI>
				runs local seamcall
				(seamcall done)
				waits for 0 and 2
							<has an NMI delay things>
							runs seamcall
							clears CSD_LOCK
							</IPI>
				... spinning ...

	local_irq_enable()
	<IPI>
	  runs seamcall
	  clears CSD_LOCK
				*FINALLY* observes CSD_LOCK cleared on
				all CPU and continues
	</IPI>

IOW, they all 3 run seamcall at different times.

Either the Changelog is broken or this TDX crud is worse crap than I
thought possible, because the only way to actually meet that requirement
as stated is stop_machine().
Thomas Gleixner Nov. 22, 2022, 3:06 p.m. UTC | #4
On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:

> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>
>> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>> BIOS-enabled CPUs, and the SEMACALL can run concurrently on different
>> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
>> CPUs and use it to shut down the module.  Later logical-cpu scope module
>> initialization will use it too.
>
> Uhh, those requirements ^ are not met by this:

  Can run concurrently != Must run concurrently
 
The documentation clearly says "can run concurrently" as quoted above.

Thanks,

        tglx
Dave Hansen Nov. 22, 2022, 3:14 p.m. UTC | #5
On 11/22/22 01:13, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>> +/*
>> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
>> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
>> + */
>> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
>> +{
>> +	on_each_cpu(seamcall_smp_call_function, sc, true);
>> +}
> 
> Suppose the user has NOHZ_FULL configured, and is already running
> userspace that will terminate on interrupt (this is desired feature for
> NOHZ_FULL), guess how happy they'll be if someone, on another partition,
> manages to tickle this TDX gunk?

Yeah, they'll be none too happy.

But, what do we do?

There are technical solutions like detecting if NOHZ_FULL is in play and
refusing to initialize TDX.  There are also non-technical solutions like
telling folks in the documentation that they better modprobe kvm early
if they want to do TDX, or their NOHZ_FULL apps will pay.

We could also force the TDX module to be loaded early in boot before
NOHZ_FULL is in play, but that would waste memory on TDX metadata even
if TDX is never used.

How do NOHZ_FULL folks deal with late microcode updates, for example?
Those are roughly equally disruptive to all CPUs.
Dave Hansen Nov. 22, 2022, 3:20 p.m. UTC | #6
On 11/22/22 01:20, Peter Zijlstra wrote:
> Either the Changelog is broken or this TDX crud is worse crap than I
> thought possible, because the only way to actually meet that requirement
> as stated is stop_machine().

I think the changelog is broken.  I don't see anything in the TDX module
spec about "the SEMACALL can run concurrently on different CPUs".
Shutdown, as far as I can tell, just requires that the shutdown seamcall
be run once on each CPU.  Concurrency and ordering don't seem to matter
at all.
Thomas Gleixner Nov. 22, 2022, 4:52 p.m. UTC | #7
On Tue, Nov 22 2022 at 07:20, Dave Hansen wrote:
> On 11/22/22 01:20, Peter Zijlstra wrote:
>> Either the Changelog is broken or this TDX crud is worse crap than I
>> thought possible, because the only way to actually meet that requirement
>> as stated is stop_machine().
>
> I think the changelog is broken.  I don't see anything in the TDX module
> spec about "the SEMACALL can run concurrently on different CPUs".
> Shutdown, as far as I can tell, just requires that the shutdown seamcall
> be run once on each CPU.  Concurrency and ordering don't seem to matter
> at all.

You're right. The 'can concurrently run' thing is for LP.INIT:

4.2.2. LP-Scope Initialization: TDH.SYS.LP.INIT

  TDH.SYS.LP.INIT is intended to perform LP-scope, core-scope and
  package-scope initialization of the Intel TDX module. It can be called
  only after TDH.SYS.INIT completes successfully, and it can run
  concurrently on multiple LPs.
Dave Hansen Nov. 22, 2022, 6:57 p.m. UTC | #8
On 11/20/22 16:26, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime.  After the module is shut down, no further TDX module SEAMCALL
> leaf functions can be made to the module on any logical cpu.
> 
> Shut down the TDX module in case of any error during the initialization
> process.  It's pointless to leave the TDX module in some middle state.
> 
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> CPUs and use it to shut down the module.  Later logical-cpu scope module
> initialization will use it too.

To me, this starts to veer way too far into internal implementation details.

	Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
	to shut down the TDX module.

This is also the point where you should talk about the new
infrastructure.  Why do you need a new 'struct seamcall_something'?
What makes it special?

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b06c1a2bc9cb..5db1a05cb4bd 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,8 @@
>  #include <linux/mutex.h>
>  #include <linux/cpu.h>
>  #include <linux/cpumask.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/apic.h>
> @@ -124,15 +126,27 @@ bool platform_tdx_enabled(void)
>  	return !!tdx_keyid_num;
>  }
>  
> +/*
> + * Data structure to make SEAMCALL on multiple CPUs concurrently.
> + * @err is set to -EFAULT when SEAMCALL fails on any cpu.
> + */
> +struct seamcall_ctx {
> +	u64 fn;
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	atomic_t err;
> +};
> +
>  /*
>   * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>   * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
>   * leaf function return code and the additional output respectively if
>   * not NULL.
>   */
> -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> -				    u64 *seamcall_ret,
> -				    struct tdx_module_output *out)
> +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +		    u64 *seamcall_ret, struct tdx_module_output *out)
>  {
>  	u64 sret;
>  
> @@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	}
>  }
>  
> +static void seamcall_smp_call_function(void *data)
> +{
> +	struct seamcall_ctx *sc = data;
> +	int ret;
> +
> +	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
> +	if (ret)
> +		atomic_set(&sc->err, -EFAULT);
> +}

The atomic_t is kinda silly.  I guess it's not *that* wasteful though.

I think it would have actually been a lot more clear if instead of
containing an errno it was a *count* of the number of encountered errors.

An "atomic_set()" where everyone is overwriting each other is a bit
counterintuitive.  It's OK here, of course, but it still looks goofy.

If this were:

	atomic_inc(&sc->nr_errors);

it would be a lot more clear that *anyone* can increment and that it
truly is shared.
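
Something like this, as an untested sketch building on the patch below:

struct seamcall_ctx {
	u64 fn;
	u64 rcx;
	u64 rdx;
	u64 r8;
	u64 r9;
	atomic_t nr_errors;	/* number of CPUs whose SEAMCALL failed */
};

static void seamcall_smp_call_function(void *data)
{
	struct seamcall_ctx *sc = data;

	/* Any CPU may bump the counter; nobody overwrites anybody. */
	if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL))
		atomic_inc(&sc->nr_errors);
}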

> +/*
> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> +}
> +
>  /*
>   * Detect and initialize the TDX module.
>   *
> @@ -181,7 +214,9 @@ static int init_tdx_module(void)
>  
>  static void shutdown_tdx_module(void)
>  {
> -	/* TODO: Shut down the TDX module */
> +	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> +	seamcall_on_each_cpu(&sc);
>  }


The seamcall_on_each_cpu() function is silly as-is.  Either collapse the
functions or note in the changelog why this is not as silly as it looks.
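
Collapsed, it could look something like this (sketch only, not the posted code):

static void shutdown_tdx_module(void)
{
	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };

	/* The 'true' argument waits until every online CPU has run it. */
	on_each_cpu(seamcall_smp_call_function, &sc, true);
}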

>  static int __tdx_enable(void)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 92a8de957dc7..215cc1065d78 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -12,6 +12,11 @@
>  /* MSR to report KeyID partitioning between MKTME and TDX */
>  #define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
>  
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN	44
> +
>  /*
>   * Do not put any hardware-defined TDX structure representations below
>   * this comment!
Peter Zijlstra Nov. 22, 2022, 7:06 p.m. UTC | #9
On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> 
> > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> >
> >> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> >> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> >> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> >> CPUs and use it to shut down the module.  Later logical-cpu scope module
> >> initialization will use it too.
> >
> > Uhh, those requirements ^ are not met by this:
> 
>   Can run concurrently != Must run concurrently
>  
> The documentation clearly says "can run concurrently" as quoted above.

The next sentence says: "Implement a mechanism to run SEAMCALL
concurrently" -- it does not.

Anyway, since we're all in agreement there is no such requirement at
all, a schedule_on_each_cpu() might be more appropriate, there is no
reason to use IPIs and spin-waiting for any of this.

That said; perhaps we should grow:

  schedule_on_cpu(struct cpumask *cpus, work_func_t func);

to only disturb a given mask of CPUs.
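
Roughly along these lines, as a sketch only (schedule_on_each_cpu() takes no
data pointer, so the context goes through a file-scope variable here; the
'seamcall_work_ctx' and 'seamcall_work_fn' names are made up):

static struct seamcall_ctx *seamcall_work_ctx;

static void seamcall_work_fn(struct work_struct *work)
{
	struct seamcall_ctx *sc = seamcall_work_ctx;

	if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL))
		atomic_set(&sc->err, -EFAULT);
}

static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
{
	seamcall_work_ctx = sc;

	/* Queues a work item on every online CPU and flushes them all. */
	return schedule_on_each_cpu(seamcall_work_fn);
}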
Peter Zijlstra Nov. 22, 2022, 7:13 p.m. UTC | #10
On Tue, Nov 22, 2022 at 07:14:14AM -0800, Dave Hansen wrote:
> On 11/22/22 01:13, Peter Zijlstra wrote:
> > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> >> +/*
> >> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> >> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> >> + */
> >> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> >> +{
> >> +	on_each_cpu(seamcall_smp_call_function, sc, true);
> >> +}
> > 
> > Suppose the user has NOHZ_FULL configured, and is already running
> > userspace that will terminate on interrupt (this is desired feature for
> > NOHZ_FULL), guess how happy they'll be if someone, on another partition,
> > manages to tickle this TDX gunk?
> 
> Yeah, they'll be none too happy.
> 
> But, what do we do?

Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
all TDX using tasks.

> There are technical solutions like detecting if NOHZ_FULL is in play and
> refusing to initialize TDX.  There are also non-technical solutions like
> telling folks in the documentation that they better modprobe kvm early
> if they want to do TDX, or their NOHZ_FULL apps will pay.

Surely modprobe kvm isn't the point where TDX gets loaded? Because
that's on boot for everybody due to all the auto-probing nonsense.

I was expecting TDX to not get initialized until the first TDX using KVM
instance is created. Am I wrong?

> We could also force the TDX module to be loaded early in boot before
> NOHZ_FULL is in play, but that would waste memory on TDX metadata even
> if TDX is never used.

I'm thinking it makes sense to have a tdx={off,on-demand,force} toggle
anyway.

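For concreteness, such a toggle could be parsed with something like this
(purely a sketch; the enum and function names are made up):

enum tdx_init_policy {
	TDX_POLICY_OFF,		/* never initialize the TDX module */
	TDX_POLICY_ON_DEMAND,	/* initialize on first use */
	TDX_POLICY_FORCE,	/* initialize early during boot */
};

static enum tdx_init_policy tdx_init_policy __ro_after_init = TDX_POLICY_ON_DEMAND;

static int __init tdx_policy_setup(char *str)
{
	if (!strcmp(str, "off"))
		tdx_init_policy = TDX_POLICY_OFF;
	else if (!strcmp(str, "on-demand"))
		tdx_init_policy = TDX_POLICY_ON_DEMAND;
	else if (!strcmp(str, "force"))
		tdx_init_policy = TDX_POLICY_FORCE;
	else
		return -EINVAL;

	return 0;
}
early_param("tdx", tdx_policy_setup);
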
> How do NOHZ_FULL folks deal with late microcode updates, for example?
> Those are roughly equally disruptive to all CPUs.

I imagine they don't do that -- in fact I would recommend we make the
whole late loading thing mutually exclusive with nohz_full; can't have
both.
Peter Zijlstra Nov. 22, 2022, 7:14 p.m. UTC | #11
On Tue, Nov 22, 2022 at 10:57:52AM -0800, Dave Hansen wrote:

> To me, this starts to veer way too far into internal implementation details.
> 
> 	Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
> 	to shut down the TDX module.

We really need to let go of the whole 'all BIOS-enabled CPUs' thing.
Dave Hansen Nov. 22, 2022, 7:24 p.m. UTC | #12
On 11/22/22 11:13, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 07:14:14AM -0800, Dave Hansen wrote:
>> On 11/22/22 01:13, Peter Zijlstra wrote:
>>> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>>>> +/*
>>>> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
>>>> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
>>>> + */
>>>> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
>>>> +{
>>>> +	on_each_cpu(seamcall_smp_call_function, sc, true);
>>>> +}
>>>
>>> Suppose the user has NOHZ_FULL configured, and is already running
>>> userspace that will terminate on interrupt (this is desired feature for
>>> NOHZ_FULL), guess how happy they'll be if someone, on another partition,
>>> manages to tickle this TDX gunk?
>>
>> Yeah, they'll be none too happy.
>>
>> But, what do we do?
> 
> Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> all TDX using tasks.

I don't think that works.  As I mentioned to Thomas elsewhere, you don't
just need to initialize TDX on the CPUs where it is used.  Before the
module will start working you need to initialize it on *all* the CPUs it
knows about.  The module itself has a little counter where it tracks
this and will refuse to start being useful until it gets called
thoroughly enough.

>> There are technical solutions like detecting if NOHZ_FULL is in play and
>> refusing to initialize TDX.  There are also non-technical solutions like
>> telling folks in the documentation that they better modprobe kvm early
>> if they want to do TDX, or their NOHZ_FULL apps will pay.
> 
> Surely modprobe kvm isn't the point where TDX gets loaded? Because
> that's on boot for everybody due to all the auto-probing nonsense.
> 
> I was expecting TDX to not get initialized until the first TDX using KVM
> instance is created. Am I wrong?

I went looking for it in this series to prove you wrong.  I failed.  :)

tdx_enable() is buried in here somewhere:

> https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

I don't have the patience to dig it out today, so I guess we'll have Kai
tell us.

>> We could also force the TDX module to be loaded early in boot before
>> NOHZ_FULL is in play, but that would waste memory on TDX metadata even
>> if TDX is never used.
> 
> I'm thinking it makes sense to have a tdx={off,on-demand,force} toggle
> anyway.

Yep, that makes total sense.  Kai had one in an earlier version but I
made him throw it out because it wasn't *strictly* required and this set
is fat enough.

>> How do NOHZ_FULL folks deal with late microcode updates, for example?
>> Those are roughly equally disruptive to all CPUs.
> 
> I imagine they don't do that -- in fact I would recommend we make the
> whole late loading thing mutually exclusive with nohz_full; can't have
> both.

So, if we just use schedule_on_cpu() for now and have the TDX code wait,
will a NOHZ_FULL task just block the schedule_on_cpu() indefinitely?

That doesn't seem like _horrible_ behavior to start off with for a
minimal series.
Sean Christopherson Nov. 22, 2022, 7:31 p.m. UTC | #13
On Tue, Nov 22, 2022, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> > On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> > 
> > > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> > >
> > >> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> > >> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> > >> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> > >> CPUs and use it to shut down the module.  Later logical-cpu scope module
> > >> initialization will use it too.
> > >
> > > Uhh, those requirements ^ are not met by this:
> > 
> >   Can run concurrently != Must run concurrently
> >  
> > The documentation clearly says "can run concurrently" as quoted above.
> 
> The next sentence says: "Implement a mechanism to run SEAMCALL
> concurrently" -- it does not.
> 
> Anyway, since we're all in agreement there is no such requirement at
> all, a schedule_on_each_cpu() might be more appropriate, there is no
> reason to use IPIs and spin-waiting for any of this.

Backing up a bit, what's the reason for _any_ of this?  The changelog says

  It's pointless to leave the TDX module in some middle state.

but IMO it's just as pointless to do a shutdown unless the kernel benefits in
some meaningful way.  And IIUC, TDH.SYS.LP.SHUTDOWN does nothing more than change
the SEAM VMCS.HOST_RIP to point to an error trampoline.  E.g. it's not like doing
a shutdown lets the kernel reclaim memory that was gifted to the TDX module.

In other words, this is just a really expensive way of changing a function pointer,
and the only way it would ever benefit the kernel is if there is a kernel bug that
leads to trying to use TDX after a fatal error.  And even then, the only difference
seems to be that subsequent bogus SEAMCALLs would get a more unique error message.
Peter Zijlstra Nov. 22, 2022, 7:33 p.m. UTC | #14
On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:

> > Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> > all TDX using tasks.
> 
> I don't think that works.  As I mentioned to Thomas elsewhere, you don't
> just need to initialize TDX on the CPUs where it is used.  Before the
> module will start working you need to initialize it on *all* the CPUs it
> knows about.  The module itself has a little counter where it tracks
> this and will refuse to start being useful until it gets called
> thoroughly enough.

That's bloody terrible, that is. How are we going to make that work with
the SMT mitigation crud that forces the SMT sibling offline?

Then the counters don't match and TDX won't work.

Can we get this limitation removed and simply let the module throw a
wobbly (error) when someone tries and use TDX without that logical CPU
having been properly initialized?
Huang, Kai Nov. 23, 2022, 12:58 a.m. UTC | #15
On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > I was expecting TDX to not get initialized until the first TDX using KVM
> > instance is created. Am I wrong?
> 
> I went looking for it in this series to prove you wrong.  I failed.  :)
> 
> tdx_enable() is buried in here somewhere:
> 
> > https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/
> 
> I don't have the patience to dig it out today, so I guess we'll have Kai
> tell us.

It will be done when the KVM module is loaded, but not when the first TDX guest is
created.
Dave Hansen Nov. 23, 2022, 1:04 a.m. UTC | #16
On 11/22/22 16:58, Huang, Kai wrote:
> On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
>>> I was expecting TDX to not get initialized until the first TDX using KVM
>>> instance is created. Am I wrong?
>> I went looking for it in this series to prove you wrong.  I failed.  
Huang, Kai Nov. 23, 2022, 1:14 a.m. UTC | #17
On Tue, 2022-11-22 at 20:33 +0100, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
> 
> > > Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> > > all TDX using tasks.
> > 
> > I don't think that works.  As I mentioned to Thomas elsewhere, you don't
> > just need to initialize TDX on the CPUs where it is used.  Before the
> > module will start working you need to initialize it on *all* the CPUs it
> > knows about.  The module itself has a little counter where it tracks
> > this and will refuse to start being useful until it gets called
> > thoroughly enough.
> 
> That's bloody terrible, that is. How are we going to make that work with
> the SMT mitigation crud that forces the SMT sibling offline?
> 
> Then the counters don't match and TDX won't work.
> 
> Can we get this limitation removed and simply let the module throw a
> wobbly (error) when someone tries and use TDX without that logical CPU
> having been properly initialized?

Dave kindly helped to raise this issue and I'll follow up with TDX module guys
to see whether we can remove/ease such limitation.

Thanks!
Huang, Kai Nov. 23, 2022, 1:22 a.m. UTC | #18
On Tue, 2022-11-22 at 17:04 -0800, Dave Hansen wrote:
> On 11/22/22 16:58, Huang, Kai wrote:
> > On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > > > I was expecting TDX to not get initialized until the first TDX using KVM
> > > > instance is created. Am I wrong?
> > > I went looking for it in this series to prove you wrong.  I failed.  
Huang, Kai Nov. 23, 2022, 1:24 a.m. UTC | #19
On Tue, 2022-11-22 at 20:14 +0100, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 10:57:52AM -0800, Dave Hansen wrote:
> 
> > To me, this starts to veer way too far into internal implementation details.
> > 
> > 	Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
> > 	to shut down the TDX module.
> 
> We really need to let go of the whole 'all BIOS-enabled CPUs' thing.

As replied in another email I'll follow up with TDX module guys to see whether
we can remove/ease such limitation.

Thanks!
Huang, Kai Nov. 23, 2022, 9:39 a.m. UTC | #20
On Tue, 2022-11-22 at 19:31 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, Peter Zijlstra wrote:
> > On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> > > On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> > > 
> > > > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> > > > 
> > > > > Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> > > > > BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> > > > > CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> > > > > CPUs and use it to shut down the module.  Later logical-cpu scope module
> > > > > initialization will use it too.
> > > > 
> > > > Uhh, those requirements ^ are not met by this:
> > > 
> > >   Can run concurrently != Must run concurrently
> > >  
> > > The documentation clearly says "can run concurrently" as quoted above.
> > 
> > The next sentence says: "Implement a mechanism to run SEAMCALL
> > concurrently" -- it does not.
> > 
> > Anyway, since we're all in agreement there is no such requirement at
> > all, a schedule_on_each_cpu() might be more appropriate, there is no
> > reason to use IPIs and spin-waiting for any of this.
> 
> Backing up a bit, what's the reason for _any_ of this?  The changelog says
> 
>   It's pointless to leave the TDX module in some middle state.
> 
> but IMO it's just as pointless to do a shutdown unless the kernel benefits in
> some meaningful way.  And IIUC, TDH.SYS.LP.SHUTDOWN does nothing more than change
> the SEAM VMCS.HOST_RIP to point to an error trampoline.  E.g. it's not like doing
> a shutdown lets the kernel reclaim memory that was gifted to the TDX module.

In case of an error, the metadata memory has already been freed back to the
kernel before the module is shut down.

> 
> In other words, this is just a really expensive way of changing a function pointer,
> and the only way it would ever benefit the kernel is if there is a kernel bug that
> leads to trying to use TDX after a fatal error.  And even then, the only difference
> seems to be that subsequent bogus SEAMCALLs would get a more unique error message.

The only issue with leaving the module open is, as you said, that bogus
SEAMCALLs can still be made.  There's a slight risk that those SEAMCALLs may
actually be able to do something that crashes the kernel (e.g. if the module is
shut down due to a TDH.SYS.INIT.TDMR error, further bogus TDH.SYS.INIT.TDMR
calls can still corrupt the metadata).

But this belongs to the "kernel bug" or "kernel is under attack" category.
Neither of them should be a concern of TDX: the host kernel is out of the TCB,
and preventing DoS attacks is not part of TDX anyway.

Also, in case of kexec(), we cannot shut down the module either (in this
implementation, because CPUs may not be in VMX operation when kexec() runs).

So I agree with you, it's fine to not shut down the module.

Hi maintainers, does this look good to you?
Sean Christopherson Nov. 23, 2022, 4:20 p.m. UTC | #21
On Wed, Nov 23, 2022, Huang, Kai wrote:
> On Tue, 2022-11-22 at 17:04 -0800, Dave Hansen wrote:
> > On 11/22/22 16:58, Huang, Kai wrote:
> > > On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > > > > I was expecting TDX to not get initialized until the first TDX using KVM
> > > > > instance is created. Am I wrong?
> > > > I went looking for it in this series to prove you wrong.  I failed.  
Dave Hansen Nov. 23, 2022, 4:41 p.m. UTC | #22
On 11/23/22 08:20, Sean Christopherson wrote:
>>> Why is it done that way?
>>>
>>> Can it be changed to delay TDX initialization until the first TDX guest
>>> needs to run?
>>>
>> Sean suggested.
>>
>> Hi Sean, could you comment?
> Waiting until the first TDX guest is created would result in false advertising,
> as KVM wouldn't know whether or not TDX is actually supported until that first
> VM is created.  If we can guarantee that TDH.SYS.INIT will fail if and only if
> there is a kernel bug, then I would be ok deferring the "enabling" until the
> first VM is created.

There's no way we can guarantee _that_.  For one, the PAMT* allocations
can always fail.  I guess we could ask sysadmins to fire up a guest to
"prime" things, but that seems a little silly.  Maybe that would work as
the initial implementation that we merge, but I suspect our users will
demand more determinism, maybe a boot or module parameter.

 * Physical Address Metadata Table, a large physically contiguous data
   structure, the rough equivalent of 'struct page' for the TDX module
Sean Christopherson Nov. 23, 2022, 5:37 p.m. UTC | #23
On Wed, Nov 23, 2022, Dave Hansen wrote:
> On 11/23/22 08:20, Sean Christopherson wrote:
> >>> Why is it done that way?
> >>>
> >>> Can it be changed to delay TDX initialization until the first TDX guest
> >>> needs to run?
> >>>
> >> Sean suggested.
> >>
> >> Hi Sean, could you comment?
> > Waiting until the first TDX guest is created would result in false advertising,
> > as KVM wouldn't know whether or not TDX is actually supported until that first
> > VM is created.  If we can guarantee that TDH.SYS.INIT will fail if and only if
> > there is a kernel bug, then I would be ok deferring the "enabling" until the
> > first VM is created.
> 
> There's no way we can guarantee _that_.  For one, the PAMT* allocations
> can always fail.  I guess we could ask sysadmins to fire up a guest to
> "prime" things, but that seems a little silly.  Maybe that would work as
> the initial implementation that we merge, but I suspect our users will
> demand more determinism, maybe a boot or module parameter.

Oh, you mean all of TDX initialization?  I thought "initialization" here meant just
doing tdx_enable().

Yeah, that's not going to be a viable option.  Aside from lacking determinism,
it would be all too easy to end up on a system with fragmented memory that can't
allocate the PAMTs post-boot.
Dave Hansen Nov. 23, 2022, 6:18 p.m. UTC | #24
On 11/23/22 09:37, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, Dave Hansen wrote:
>> There's no way we can guarantee _that_.  For one, the PAMT* allocations
>> can always fail.  I guess we could ask sysadmins to fire up a guest to
>> "prime" things, but that seems a little silly.  Maybe that would work as
>> the initial implementation that we merge, but I suspect our users will
>> demand more determinism, maybe a boot or module parameter.
> Oh, you mean all of TDX initialization?  I thought "initialization" here mean just
> doing tdx_enable().

Yes, but the first call to tdx_enable() does TDH_SYS_INIT and all the
subsequent work to get the module going.

> Yeah, that's not going to be a viable option.  Aside from lacking determinisim,
> it would be all too easy to end up on a system with fragmented memory that can't
> allocate the PAMTs post-boot.

For now, the post-boot runtime PAMT allocations are the one and only way
that TDX can be initialized.  I pushed for it to be done this way.
Here's why:

Doing tdx_enable() is relatively slow and it eats up a non-zero amount
of physically contiguous RAM for metadata (~1/256th or ~0.4% of RAM).
Systems that support TDX but will never run TDX guests should not pay
that cost.

That means that we either make folks opt-in at boot-time or we try to
make a best effort at runtime to do the metadata allocations.

From my perspective, the best-effort stuff is absolutely needed.  Users
are going to forget the command-line opt in and there's no harm in
_trying_ the big allocations even if they fail.

Second, in reality, the "real" systems that can run TDX guests are
probably not going to sit around fragmenting memory for a month before
they run their first guest.  They're going to run one shortly after they
boot when memory isn't fragmented and the best-effort allocation will
work really well.

Third, if anyone *REALLY* cared to make it reliable *and* wanted to sit
around fragmenting memory for a month, they could just start a TDX guest
and kill it to get TDX initialized.  This isn't ideal.  But, to me, it
beats defining some new, separate ABI (or boot/module option) to do it.

So, let's have those discussions.  Long-term, what *is* the most
reliable way to get the TDX module loaded with 100% determinism?  What
new ABI or interfaces are needed?  Also, is that 100% determinism
required the moment this series is merged?  Or, can we work up to it?

I think it can wait until this particular series is farther along.
Sean Christopherson Nov. 23, 2022, 7:03 p.m. UTC | #25
On Wed, Nov 23, 2022, Dave Hansen wrote:
> On 11/23/22 09:37, Sean Christopherson wrote:
> > On Wed, Nov 23, 2022, Dave Hansen wrote:
> >> There's no way we can guarantee _that_.  For one, the PAMT* allocations
> >> can always fail.  I guess we could ask sysadmins to fire up a guest to
> >> "prime" things, but that seems a little silly.  Maybe that would work as
> >> the initial implementation that we merge, but I suspect our users will
> >> demand more determinism, maybe a boot or module parameter.
> > Oh, you mean all of TDX initialization?  I thought "initialization" here meant just
> > doing tdx_enable().
> 
> Yes, but the first call to tdx_enable() does TDH_SYS_INIT and all the
> subsequent work to get the module going.

Ah, sorry, I misread the diff.  Actually applied the patches this time...

> > Yeah, that's not going to be a viable option.  Aside from lacking determinism,
> > it would be all too easy to end up on a system with fragmented memory that can't
> > allocate the PAMTs post-boot.
> 
> For now, the post-boot runtime PAMT allocations are the one and only way
> that TDX can be initialized.  I pushed for it to be done this way.
> Here's why:
> 
> Doing tdx_enable() is relatively slow and it eats up a non-zero amount
> of physically contiguous RAM for metadata (~1/256th or ~0.4% of RAM).
> Systems that support TDX but will never run TDX guests should not pay
> that cost.
> 
> That means that we either make folks opt-in at boot-time or we try to
> make a best effort at runtime to do the metadata allocations.
> 
> From my perspective, the best-effort stuff is absolutely needed.  Users
> are going to forget the command-line opt in

Eh, any sufficiently robust deployment should be able to ensure that its kernels
boot with "required" command-line options.

> and there's no harm in _trying_ the big allocations even if they fail.

No, but there is "harm" if a host can't provide the functionality the control
plane thinks it supports.

> Second, in reality, the "real" systems that can run TDX guests are
> probably not going to sit around fragmenting memory for a month before
> they run their first guest.  They're going to run one shortly after they
> boot when memory isn't fragmented and the best-effort allocation will
> work really well.

I don't think this will hold true.  Long term, we (Google) want to have a common
pool for non-TDX and TDX VMs.  Forcing TDX VMs to use a dedicated pool of hosts
makes it much more difficult to react to demand, e.g. if we carve out N hosts for
TDX, but only use 10% of those hosts, then that's a lot of wasted capacity/money.
IIRC, people have discussed dynamically reconfiguring hosts so that systems could
be moved in/out of a dedicated pool, but that's still suboptimal, e.g. it would
require emptying a host to reboot+reconfigure.

If/when we end up with a common pool, then it's very likely that we could have a
TDX-capable host go weeks/months before launching its first TDX VM.

> Third, if anyone *REALLY* cared to make it reliable *and* wanted to sit
> around fragmenting memory for a month, they could just start a TDX guest
> and kill it to get TDX initialized.  This isn't ideal.  But, to me, it
> beats defining some new, separate ABI (or boot/module option) to do it.

That's a hack.  I have no objection to waiting until KVM is _loaded_ to initialize
TDX, but waiting until KVM_CREATE_VM is not acceptable.  Use cases aside, KVM's ABI
would be a mess, e.g. KVM couldn't use KVM_CHECK_EXTENSION or any other /dev/kvm
ioctl() to enumerate TDX support.

> So, let's have those discussions.  Long-term, what *is* the most
> reliable way to get the TDX module loaded with 100% determinism?  What
> new ABI or interfaces are needed?  Also, is that 100% determinism
> required the moment this series is merged?  Or, can we work up to it?

I don't think we (Google again) strictly need 100% determinism with respect to
enabling TDX, what's most important is that if a host says it's TDX-capable, then
it really is TDX-capable.  I'm sure we'll treat "failure to load" as an error,
but it doesn't necessarily need to be a fatal error as the host can still run in
a degraded state (no idea if we'll actually do that though).

> I think it can wait until this particular series is farther along.

For an opt-in kernel param, agreed.  That can be added later, e.g. if it turns
out that the PAMT allocation failure rate is too high.
Dave Hansen Nov. 29, 2022, 9:40 p.m. UTC | #26
On 11/22/22 11:33, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
>>> Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
>>> all TDX using tasks.
>> I don't think that works.  As I mentioned to Thomas elsewhere, you don't
>> just need to initialize TDX on the CPUs where it is used.  Before the
>> module will start working you need to initialize it on *all* the CPUs it
>> knows about.  The module itself has a little counter where it tracks
>> this and will refuse to start being useful until it gets called
>> thoroughly enough.
> That's bloody terrible, that is. How are we going to make that work with
> the SMT mitigation crud that forces the SMT sibling offline?
> 
> Then the counters don't match and TDX won't work.
> 
> Can we get this limitation removed and simply let the module throw a
> wobbly (error) when someone tries and use TDX without that logical CPU
> having been properly initialized?

It sounds like we can at least punt the limitation away from the OS's
purview.

There's actually a multi-step process to get a "real" TDX module loaded.
There's a fancy ACM (Authenticated Code Module) that's invoked via
GETSEC[ENTERACCS] and an intermediate module loader.  That dance used to
be done in the kernel, but we talked the BIOS guys into doing it instead.

I believe these per-logical-CPU checks _can_ also be punted out of the
TDX module itself and delegated to one of these earlier module loading
phases that the BIOS drives.

I'm still a _bit_ skeptical that the checks are needed in the first
place.  But, as long as they're hidden from the OS, I don't see a need
to be too cranky about it.

In the end, we could just plain stop doing the TDH.SYS.LP.INIT code in
the kernel.

Unless someone screams, I'll ask the BIOS and TDX module folks to look
into this.
Thomas Gleixner Nov. 30, 2022, 11:09 a.m. UTC | #27
On Tue, Nov 29 2022 at 13:40, Dave Hansen wrote:
> On 11/22/22 11:33, Peter Zijlstra wrote:
>> Can we get this limitation removed and simply let the module throw a
>> wobbly (error) when someone tries and use TDX without that logical CPU
>> having been properly initialized?
>
> It sounds like we can at least punt the limitation away from the OS's
> purview.
>
> There's actually a multi-step process to get a "real" TDX module loaded.
>  There's a fancy ACM (Authenticated Code Module) that's invoked via
> GETSEC[ENTERACCS] and an intermediate module loader.  That dance used to
> be done in the kernel, but we talked the BIOS guys into doing it instead.
>
> I believe these per-logical-CPU checks _can_ also be punted out of the
> TDX module itself and delegated to one of these earlier module loading
> phases that the BIOS drives.
>
> I'm still a _bit_ skeptical that the checks are needed in the first
> place.  But, as long as they're hidden from the OS, I don't see a need
> to be too cranky about it.

Right.

> In the end, we could just plain stop doing the TDH.SYS.LP.INIT code in
> the kernel.

Which in turn makes all the problems we discussed go away.

> Unless someone screams, I'll ask the BIOS and TDX module folks to look
> into this.

Yes, please.

Thanks,

        tglx

Patch

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b06c1a2bc9cb..5db1a05cb4bd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,8 @@ 
 #include <linux/mutex.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/apic.h>
@@ -124,15 +126,27 @@  bool platform_tdx_enabled(void)
 	return !!tdx_keyid_num;
 }
 
+/*
+ * Data structure to make SEAMCALL on multiple CPUs concurrently.
+ * @err is set to -EFAULT when SEAMCALL fails on any cpu.
+ */
+struct seamcall_ctx {
+	u64 fn;
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	atomic_t err;
+};
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
  * leaf function return code and the additional output respectively if
  * not NULL.
  */
-static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-				    u64 *seamcall_ret,
-				    struct tdx_module_output *out)
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		    u64 *seamcall_ret, struct tdx_module_output *out)
 {
 	u64 sret;
 
@@ -166,6 +180,25 @@  static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	}
 }
 
+static void seamcall_smp_call_function(void *data)
+{
+	struct seamcall_ctx *sc = data;
+	int ret;
+
+	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
+	if (ret)
+		atomic_set(&sc->err, -EFAULT);
+}
+
+/*
+ * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
+ * @sc->err to determine whether any SEAMCALL failed on any cpu.
+ */
+static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+	on_each_cpu(seamcall_smp_call_function, sc, true);
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -181,7 +214,9 @@  static int init_tdx_module(void)
 
 static void shutdown_tdx_module(void)
 {
-	/* TODO: Shut down the TDX module */
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+	seamcall_on_each_cpu(&sc);
 }
 
 static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 92a8de957dc7..215cc1065d78 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -12,6 +12,11 @@ 
 /* MSR to report KeyID partitioning between MKTME and TDX */
 #define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN	44
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!