
[v12,07/22] x86/virt/tdx: Add skeleton to enable TDX on demand

Message ID 104d324cd68b12e14722ee5d85a660cccccd8892.1687784645.git.kai.huang@intel.com (mailing list archive)
State New, archived
Series TDX host kernel support

Commit Message

Huang, Kai June 26, 2023, 2:12 p.m. UTC
To enable TDX, the kernel needs to initialize TDX from two perspectives:
1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
on one logical cpu before the kernel makes any other SEAMCALLs on that
cpu (including those made during module initialization and when running
TDX guests).

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach: defer initializing TDX until there is a real need
(e.g., when requested by KVM).  This approach has the following advantages:

1) It avoids consuming the memory that must be allocated by the kernel
and given to the TDX module as metadata (~1/256th of the TDX-usable
memory), and also saves the CPU cycles of initializing the TDX module
(and the metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running.  The update procedure shares quite a few steps with this "on
demand" initialization mechanism.  The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism.  A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
code mucks with VMX enabling.  If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that.  Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.

Similar to module initialization, make the per-cpu initialization "on
demand" as well, since it also depends on VMX being enabled.

Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
module and enable TDX on the local cpu respectively.  For now tdx_enable()
is a placeholder.  The TODO list will be pared down as functionality is
added.

Export both tdx_cpu_enable() and tdx_enable() for KVM use.

In tdx_enable(), use a state machine protected by a mutex to make sure
the initialization is only done once, as tdx_enable() can be called
multiple times (e.g., the KVM module can be reloaded) and may be called
concurrently by other kernel components in the future.

The per-cpu initialization on each cpu can only be done once during the
module's lifetime.  Use a per-cpu variable to track its status to make
sure it is only done once in tdx_cpu_enable().

Also, a SEAMCALL to do TDX module global initialization must be done
once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
it inside tdx_cpu_enable() too (if it hasn't been done).

tdx_enable() can potentially invoke SEAMCALLs on any online cpu.  The
per-cpu initialization must be done before those SEAMCALLs are invoked
on a given cpu.  To keep things simple, in tdx_cpu_enable(), always do
the per-cpu initialization regardless of whether the TDX module has been
initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable();
instead assume the caller has disabled CPU hotplug, and has done VMXON
and tdx_cpu_enable() on all online cpus, before calling tdx_enable().

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12:
 - Simplified TDX module global init and lp init status tracking (David).
 - Added comment around try_init_module_global() for using
   raw_spin_lock() (Dave).
 - Added one sentence to changelog to explain why to expose tdx_enable()
   and tdx_cpu_enable() (Dave).
 - Simplified comments around tdx_enable() and tdx_cpu_enable() to use
   lockdep_assert_*() instead (Dave).
 - Removed redundant "TDX" in error message (Dave).

v10 -> v11:
 - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
 - Return the actual error code for tdx_enable() instead of -EINVAL.
 - Added Isaku's Reviewed-by.

v9 -> v10:
 - Merged the patch to handle per-cpu initialization to this patch to
   tell the story better.
 - Changed how to handle the per-cpu initialization to only provide a
   tdx_cpu_enable() function to let the user of TDX do it when the
   user wants to run TDX code on a certain cpu.
 - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
   call lockdep_assert_cpus_held() to assert the caller has done that.
 - Improved comments around tdx_enable() and tdx_cpu_enable().
 - Improved changelog to tell the story better accordingly.

v8 -> v9:
 - Removed detailed TODO list in the changelog (Dave).
 - Added back steps to do module global initialization and per-cpu
   initialization in the TODO list comment.
 - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h

v7 -> v8:
 - Refined changelog (Dave).
 - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
 - Add a "TODO list" comment in init_tdx_module() to list all steps of
   initializing the TDX Module to tell the story (Dave).
 - Made tdx_enable() universally return -EINVAL, and removed nonsense
   comments (Dave).
 - Simplified __tdx_enable() to only handle success or failure.
 - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
 - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
 - Improved comments (Dave).
 - Pointed out 'tdx_module_status' is software thing (Dave).

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to set status to TDX_MODULE_NONE if TDX module is not
   loaded (Chao)
 - Added Chao's Reviewed-by.
 - Improved comments around cpus_read_lock().

v3 -> v5 (no feedback on v4):
 - Removed the check that SEAMRR and TDX KeyID have been detected on
   all present cpus.
 - Removed tdx_detect().
 - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
   hotplug lock and return early with error message.
 - Improved dmesg printing for TDX module detection and initialization.


---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  13 +++
 3 files changed, 179 insertions(+)

Comments

Kuppuswamy Sathyanarayanan June 26, 2023, 9:21 p.m. UTC | #1
On 6/26/23 7:12 AM, Kai Huang wrote:
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4dfe2e794411..d8226a50c58c 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> +static inline int tdx_enable(void)  { return -ENODEV; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 141d12376c4d..29ca18f66d61 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/archrandom.h>
> @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init;
>  static u32 tdx_guest_keyid_start __ro_after_init;
>  static u32 tdx_nr_guest_keyids __ro_after_init;
>  
> +static bool tdx_global_initialized;
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);

Why use raw_spin_lock()?

> +static DEFINE_PER_CPU(bool, tdx_lp_initialized);
> +
> +static enum tdx_module_status_t tdx_module_status;
> +static DEFINE_MUTEX(tdx_module_lock);

I think you can add a single line comment about what states above
variables tracks. But it is entirely up to you.

> +
>  /*
>   * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>   * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -74,6 +85,157 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	}
>  }
>  
> +/*
> + * Do the module global initialization if not done yet.
> + * It's always called with interrupts and preemption disabled.
> + */
> +static int try_init_module_global(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> +
> +	if (tdx_global_initialized) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */

I have noticed that you add the above comment whenever you call seamcall() with
0 as parameters. Is this a ask from the maintainer? If not, I think you can skip
it. Just explaining the parameters in seamcall function definition is good
enough.

> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (!ret)
> +		tdx_global_initialized = true;
> +out:
> +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * tdx_cpu_enable - Enable TDX on local cpu
> + *
> + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> + * global initialization SEAMCALL if not done) on local cpu to make this
> + * cpu be ready to run any other SEAMCALLs.
> + *
> + * Call this function with preemption disabled.
> + *
> + * Return 0 on success, otherwise errors.
> + */
> +int tdx_cpu_enable(void)
> +{
> +	int ret;
> +
> +	if (!platform_tdx_enabled())
> +		return -ENODEV;
> +
> +	lockdep_assert_preemption_disabled();
> +
> +	/* Already done */
> +	if (__this_cpu_read(tdx_lp_initialized))
> +		return 0;
> +
> +	/*
> +	 * The TDX module global initialization is the very first step
> +	 * to enable TDX.  Need to do it first (if hasn't been done)
> +	 * before the per-cpu initialization.
> +	 */
> +	ret = try_init_module_global();
> +	if (ret)
> +		return ret;
> +
> +	/* All '0's are just unused parameters */
> +	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (ret)
> +		return ret;
> +
> +	__this_cpu_write(tdx_lp_initialized, true);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> +
> +static int init_tdx_module(void)
> +{
> +	/*
> +	 * TODO:
> +	 *
> +	 *  - Get TDX module information and TDX-capable memory regions.
> +	 *  - Build the list of TDX-usable memory regions.
> +	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
> +	 *    all TDX-usable memory regions.
> +	 *  - Configure the TDMRs and the global KeyID to the TDX module.
> +	 *  - Configure the global KeyID on all packages.
> +	 *  - Initialize all TDMRs.
> +	 *
> +	 *  Return error before all steps are done.
> +	 */
> +	return -EINVAL;
> +}
> +
> +static int __tdx_enable(void)
> +{
> +	int ret;
> +
> +	ret = init_tdx_module();
> +	if (ret) {
> +		pr_err("module initialization failed (%d)\n", ret);
> +		tdx_module_status = TDX_MODULE_ERROR;
> +		return ret;
> +	}
> +
> +	pr_info("module initialized.\n");
> +	tdx_module_status = TDX_MODULE_INITIALIZED;
> +
> +	return 0;
> +}
> +
> +/**
> + * tdx_enable - Enable TDX module to make it ready to run TDX guests
> + *
> + * This function assumes the caller has: 1) held read lock of CPU hotplug
> + * lock to prevent any new cpu from becoming online; 2) done both VMXON
> + * and tdx_cpu_enable() on all online cpus.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return 0 if TDX is enabled successfully, otherwise error.
> + */
> +int tdx_enable(void)
> +{
> +	int ret;
> +
> +	if (!platform_tdx_enabled())
> +		return -ENODEV;
> +
> +	lockdep_assert_cpus_held();
> +
> +	mutex_lock(&tdx_module_lock);
> +
> +	switch (tdx_module_status) {
> +	case TDX_MODULE_UNKNOWN:
> +		ret = __tdx_enable();
> +		break;
> +	case TDX_MODULE_INITIALIZED:
> +		/* Already initialized, great, tell the caller. */
> +		ret = 0;
> +		break;
> +	default:
> +		/* Failed to initialize in the previous attempts */
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_enable);
> +
>  static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
>  					    u32 *nr_tdx_keyids)
>  {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 55dbb1b8c971..9fb46033c852 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -16,11 +16,24 @@
>   */
>  #define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
>  
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_INIT		33
> +#define TDH_SYS_LP_INIT		35
> +
>  /*
>   * Do not put any hardware-defined TDX structure representations below
>   * this comment!
>   */
>  
> +/* Kernel defined TDX module status during module initialization. */
> +enum tdx_module_status_t {
> +	TDX_MODULE_UNKNOWN,
> +	TDX_MODULE_INITIALIZED,
> +	TDX_MODULE_ERROR
> +};
> +
>  struct tdx_module_output;
>  u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	       struct tdx_module_output *out);
Kirill A. Shutemov June 27, 2023, 9:50 a.m. UTC | #2
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);

I don't understand how the comment justifies using raw spin lock.
Huang, Kai June 27, 2023, 10:34 a.m. UTC | #3
On Tue, 2023-06-27 at 12:50 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> 
> I don't understand how the comment justifies using raw spin lock.
> 

This comment is for using a lock in general.  The reason to use the raw_
version is that this function gets called in IRQ context, and on a
PREEMPT_RT kernel the normal spinlock is converted to a sleeping lock.

Dave suggested to comment on the function rather than comment on the
raw_spin_lock directly, e.g.,  no other kernel code does that:

https://lore.kernel.org/linux-mm/d2b3bc5e-1371-0c50-8ecb-64fc70917d42@intel.com/

So I commented the function in this version:

+/*
+ * Do the module global initialization if not done yet.
+ * It's always called with interrupts and preemption disabled.
+ */
Huang, Kai June 27, 2023, 10:37 a.m. UTC | #4
On Mon, 2023-06-26 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
> > +	/* All '0's are just unused parameters. */
> 
> I have noticed that you add the above comment whenever you call seamcall()
> with
> 0 as parameters. Is this a ask from the maintainer? If not, I think you can
> skip
> it. Just explaining the parameters in seamcall function definition is good
> enough.

Yes, I followed the maintainer's suggestion (I didn't bother to find the
exact link this time, though).  I think in this way we don't need to go to
the TDX module spec to check whether 0 has a meaning in each SEAMCALL,
especially during code review.  I kind of agree having them in multiple
places is a little bit noisy, but I don't have a better way.
Kirill A. Shutemov June 27, 2023, 12:18 p.m. UTC | #5
On Tue, Jun 27, 2023 at 10:34:04AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 12:50 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > +	/*
> > > +	 * The TDX module global initialization only needs to be done
> > > +	 * once on any cpu.
> > > +	 */
> > > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > 
> > I don't understand how the comment justifies using raw spin lock.
> > 
> 
> This comment is for using lock in general.  The reason to use raw_ version is
> because this function gets called in IRQ context, and for PREEMPT_RT kernel the
> normal spinlock is converted to sleeping lock.

Sorry, but this still doesn't explain anything.

Why converting to sleeping lock here is wrong? There are plenty
spin_lock_irqsave() users all over the kernel that are fine to be
converted to sleeping lock on RT kernel. Why this use-case is special
enough to justify raw_?

From the documentation:

	raw_spinlock_t is a strict spinning lock implementation in all
	kernels, including PREEMPT_RT kernels. Use raw_spinlock_t only in
	real critical core code, low-level interrupt handling and places
	where disabling preemption or interrupts is required, for example,
	to safely access hardware state. raw_spinlock_t can sometimes also
	be used when the critical section is tiny, thus avoiding RT-mutex
	overhead.

How does it apply here?

> Dave suggested to comment on the function rather than comment on the
> raw_spin_lock directly, e.g.,  no other kernel code does that:
> 
> https://lore.kernel.org/linux-mm/d2b3bc5e-1371-0c50-8ecb-64fc70917d42@intel.com/
> 
> So I commented the function in this version:
> 
> +/*
> + * Do the module global initialization if not done yet.
> + * It's always called with interrupts and preemption disabled.
> + */

If interrupts are always disabled why do you need _irqsave()?
Huang, Kai June 27, 2023, 10:37 p.m. UTC | #6
> > 
> > +/*
> > + * Do the module global initialization if not done yet.
> > + * It's always called with interrupts and preemption disabled.
> > + */
> 
> If interrupts are always disabled why do you need _irqsave()?
> 

I'll remove the _irqsave().

AFAICT Isaku preferred this for additional security, but this is not necessary.
Huang, Kai June 28, 2023, 12:28 a.m. UTC | #7
On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > 
> > > +/*
> > > + * Do the module global initialization if not done yet.
> > > + * It's always called with interrupts and preemption disabled.
> > > + */
> > 
> > If interrupts are always disabled why do you need _irqsave()?
> > 
> 
> I'll remove the _irqsave().
> 
> AFAICT Isaku preferred this for additional security, but this is not
> necessary.
> 
> 

Damn.  I think we can change the comment to say this function is called with
preemption being disabled, but _can_ be called with interrupt disabled.  And we
keep using the _irqsave() version.

	/*
	 * Do the module global initialization if not done yet.  It's always
	 * called with preemption disabled and can be called with interrupts
	 * disabled.
	 */

This allows a use case that the caller simply wants to call some SEAMCALL on
local cpu, e.g., IOMMU code may just use below to get some TDX-IO information:

	preempt_disable();
	vmxon();
	tdx_cpu_enable();
	SEAMCALL;
	vmxoff();
	preempt_enable();

Are you OK with this?
Isaku Yamahata June 28, 2023, 12:31 a.m. UTC | #8
On Tue, Jun 27, 2023 at 10:37:58PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> > > 
> > > +/*
> > > + * Do the module global initialization if not done yet.
> > > + * It's always called with interrupts and preemption disabled.
> > > + */
> > 
> > If interrupts are always disabled why do you need _irqsave()?
> > 
> 
> I'll remove the _irqsave().
> 
> AFAICT Isaku preferred this for additional security, but this is not necessary.

It's because lockdep complains.  Anyway, it's safe to remove _irqsave as
discussed with you.
Kirill A. Shutemov June 28, 2023, 11:55 a.m. UTC | #9
On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > > 
> > > > +/*
> > > > + * Do the module global initialization if not done yet.
> > > > + * It's always called with interrupts and preemption disabled.
> > > > + */
> > > 
> > > If interrupts are always disabled why do you need _irqsave()?
> > > 
> > 
> > I'll remove the _irqsave().
> > 
> > AFAICT Isaku preferred this for additional security, but this is not
> > necessary.
> > 
> > 
> 
> Damn.  I think we can change the comment to say this function is called with
> preemption being disabled, but _can_ be called with interrupt disabled.  And we
> keep using the _irqsave() version.
> 
> 	/*
> 	 * Do the module global initialization if not done yet.  It's always
> 	 * called with preemption disabled and can be called with interrupts
> 	 * disabled.
> 	 */
> 
> This allows a use case that the caller simply wants to call some SEAMCALL on
> local cpu, e.g., IOMMU code may just use below to get some TDX-IO information:
> 
> 	preempt_disable();
> 	vmxon();
> 	tdx_cpu_enable();
> 	SEAMCALL;
> 	vmxoff();
> 	preempt_enable();
> 
> Are you OK with this?

Is it a hypothetical use-case? If so, I would rather keep it simple for now
and adjust in the future if needed.
Peter Zijlstra June 28, 2023, 1:04 p.m. UTC | #10
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:

> +static int try_init_module_global(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> +
> +	if (tdx_global_initialized) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */
> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (!ret)
> +		tdx_global_initialized = true;
> +out:
> +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> +	return ret;
> +}

How long does that TDX_SYS_INIT take and why is a raw_spinlock with IRQs
disabled the right way to serialize this?
Peter Zijlstra June 28, 2023, 1:08 p.m. UTC | #11
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:

> +/*
> + * Do the module global initialization if not done yet.
> + * It's always called with interrupts and preemption disabled.
> + */
> +static int try_init_module_global(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> +
> +	if (tdx_global_initialized) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */
> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (!ret)
> +		tdx_global_initialized = true;
> +out:
> +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * tdx_cpu_enable - Enable TDX on local cpu
> + *
> + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> + * global initialization SEAMCALL if not done) on local cpu to make this
> + * cpu be ready to run any other SEAMCALLs.
> + *
> + * Call this function with preemption disabled.
> + *
> + * Return 0 on success, otherwise errors.
> + */
> +int tdx_cpu_enable(void)
> +{
> +	int ret;
> +
> +	if (!platform_tdx_enabled())
> +		return -ENODEV;
> +
> +	lockdep_assert_preemption_disabled();
> +
> +	/* Already done */
> +	if (__this_cpu_read(tdx_lp_initialized))
> +		return 0;
> +
> +	/*
> +	 * The TDX module global initialization is the very first step
> +	 * to enable TDX.  Need to do it first (if hasn't been done)
> +	 * before the per-cpu initialization.
> +	 */
> +	ret = try_init_module_global();
> +	if (ret)
> +		return ret;
> +
> +	/* All '0's are just unused parameters */
> +	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (ret)
> +		return ret;

And here you do *NOT* have IRQs disabled... so an IRQ can come in here
and do the above again.

I suspect that's a completely insane thing to have happen, but the way
the code is written does not tell me this and might even suggest I
should worry about it, per the above thing actually disabling IRQs.

> +
> +	__this_cpu_write(tdx_lp_initialized, true);
> +
> +	return 0;
> +}
Peter Zijlstra June 28, 2023, 1:17 p.m. UTC | #12
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);

I can't find a single caller of this.. why is this exported?
Peter Zijlstra June 28, 2023, 1:35 p.m. UTC | #13
On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > > 
> > > > +/*
> > > > + * Do the module global initialization if not done yet.
> > > > + * It's always called with interrupts and preemption disabled.
> > > > + */
> > > 
> > > If interrupts are always disabled why do you need _irqsave()?
> > > 
> > 
> > I'll remove the _irqsave().
> > 
> > AFAICT Isaku preferred this for additional security, but this is not
> > necessary.
> > 
> > 
> 
> Damn.  I think we can change the comment to say this function is called with
> preemption being disabled, but _can_ be called with interrupt disabled.  And we
> keep using the _irqsave() version.
> 
> 	/*
> 	 * Do the module global initialization if not done yet.  It's always
> 	 * called with preemption disabled and can be called with interrupts
> 	 * disabled.
> 	 */

That's still not explaining *why*, what you want to say is:

	Can be called locally or through an IPI function call.
Huang, Kai June 29, 2023, midnight UTC | #14
On Wed, 2023-06-28 at 15:04 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> 
> > +static int try_init_module_global(void)
> > +{
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > +
> > +	if (tdx_global_initialized) {
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/* All '0's are just unused parameters. */
> > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +	if (!ret)
> > +		tdx_global_initialized = true;
> > +out:
> > +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> > +
> > +	return ret;
> > +}
> 
> How long does that TDX_SYS_INIT take and why is a raw_spinlock with IRQs
> disabled the right way to serialize this?

The spec says it doesn't have a latency requirement, so theoretically it could
be long.  SEAMCALL is a VMEXIT so it would at least cost thousands of cycles.

If raw_spinlock isn't desired, I think I can introduce another function to do
this and let the caller call it before calling tdx_cpu_enable().  E.g., we
can have below functions:

1) tdx_global_init()	-> TDH_SYS_INIT
2) tdx_cpu_init()	-> TDH_SYS_LP_INIT
3) tdx_enable()		-> actual module initialization

How does this sound?
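The proposed split can be sketched as a userspace C analogue (illustration only: the real functions issue SEAMCALLs; NCPUS and the "cpu" argument just model logical cpus):

```c
/* Userspace analogue of the proposed three-step split, for illustration
 * only.  The comments name the SEAMCALL leaf each step stands in for. */
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

static bool global_initialized;          /* models tdx_global_initialized */
static bool lp_initialized[NCPUS];       /* models per-cpu tdx_lp_initialized */

/* 1) tdx_global_init() -> TDH_SYS_INIT: once for the whole platform */
static int tdx_global_init(void)
{
	if (global_initialized)
		return 0;
	/* seamcall(TDH_SYS_INIT, ...) would go here */
	global_initialized = true;
	return 0;
}

/* 2) tdx_cpu_init() -> TDH_SYS_LP_INIT: once on each logical cpu,
 * only valid after the global init has been done */
static int tdx_cpu_init(int cpu)
{
	if (lp_initialized[cpu])
		return 0;
	if (!global_initialized)
		return -1;	/* caller must call tdx_global_init() first */
	/* seamcall(TDH_SYS_LP_INIT, ...) would go here */
	lp_initialized[cpu] = true;
	return 0;
}

/* 3) tdx_enable(): the actual module initialization, which assumes the
 * caller has done the per-cpu init on all online cpus already */
static int tdx_enable(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (!lp_initialized[cpu])
			return -1;
	return 0;
}
```

With this split no lock is needed inside the global init; the ordering 1) -> 2) -> 3) becomes the caller's responsibility.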
Huang, Kai June 29, 2023, 12:08 a.m. UTC | #15
On Wed, 2023-06-28 at 15:08 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> 
> > +/*
> > + * Do the module global initialization if not done yet.
> > + * It's always called with interrupts and preemption disabled.
> > + */
> > +static int try_init_module_global(void)
> > +{
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > +
> > +	if (tdx_global_initialized) {
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/* All '0's are just unused parameters. */
> > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +	if (!ret)
> > +		tdx_global_initialized = true;
> > +out:
> > +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * tdx_cpu_enable - Enable TDX on local cpu
> > + *
> > + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> > + * global initialization SEAMCALL if not done) on local cpu to make this
> > + * cpu be ready to run any other SEAMCALLs.
> > + *
> > + * Call this function with preemption disabled.
> > + *
> > + * Return 0 on success, otherwise errors.
> > + */
> > +int tdx_cpu_enable(void)
> > +{
> > +	int ret;
> > +
> > +	if (!platform_tdx_enabled())
> > +		return -ENODEV;
> > +
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	/* Already done */
> > +	if (__this_cpu_read(tdx_lp_initialized))
> > +		return 0;
> > +
> > +	/*
> > +	 * The TDX module global initialization is the very first step
> > +	 * to enable TDX.  Need to do it first (if hasn't been done)
> > +	 * before the per-cpu initialization.
> > +	 */
> > +	ret = try_init_module_global();
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* All '0's are just unused parameters */
> > +	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> > +	if (ret)
> > +		return ret;
> 
> And here you do *NOT* have IRQs disabled... so an IRQ can come in here
> and do the above again.
> 
> I suspect that's a completely insane thing to have happen, but the way
> the code is written does not tell me this and might even suggest I
> should worry about it, per the above thing actually disabling IRQs.
> 

I can change lockdep_assert_preemption_disabled() to
lockdep_assert_irqs_disabled(), so that this function can only be called from
an IPI.  As Kirill also suggested, we can do it this way since for now KVM is
the only user of this function and it enables hardware on all cpus via IPI.
Huang, Kai June 29, 2023, 12:10 a.m. UTC | #16
On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> 
> I can't find a single caller of this.. why is this exported?

It's for KVM TDX patch to use, which isn't in this series.

I'll remove the export.  KVM TDX series can export it.
Huang, Kai June 29, 2023, 12:15 a.m. UTC | #17
On Wed, 2023-06-28 at 15:35 +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote:
> > On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > > > 
> > > > > +/*
> > > > > + * Do the module global initialization if not done yet.
> > > > > + * It's always called with interrupts and preemption disabled.
> > > > > + */
> > > > 
> > > > If interrupts are always disabled why do you need _irqsave()?
> > > > 
> > > 
> > > I'll remove the _irqsave().
> > > 
> > > AFAICT Isaku preferred this for additional security, but this is not
> > > necessary.
> > > 
> > > 
> > 
> > Damn.  I think we can change the comment to say this function is called with
> > preemption disabled, but _can_ be called with interrupts disabled.  And we
> > keep using the _irqsave() version.
> > 
> > 	/*
> > 	 * Do the module global initialization if not done yet.  It's always
> > 	 * called with preemption disabled and can be called with interrupts
> > 	 * disabled.
> > 	 */
> 
> That's still not explaining *why*, what you want to say is:
> 
> 	Can be called locally or through an IPI function call.
> 

Thanks.  As in another reply, if using spinlock is OK, then I think we can say
it will be called either locally or through an IPI function call.  Otherwise, we
do via a new separate function tdx_global_init() and no lock is needed in that
function.  The caller should call it properly.
David Hildenbrand June 29, 2023, 11:31 a.m. UTC | #18
On 26.06.23 16:12, Kai Huang wrote:
> To enable TDX the kernel needs to initialize TDX from two perspectives:
> 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> on one logical cpu before the kernel wants to make any other SEAMCALLs
> on that cpu (including those involved during module initialization and
> running TDX guests).
> 
> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> "on demand" approach to initialize TDX until there is a real need (e.g
> when requested by KVM).  This approach has below pros:
> 
> 1) It avoids consuming the memory that must be allocated by kernel and
> given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> and also saves the CPU cycles of initializing the TDX module (and the
> metadata) when TDX is not used at all.
> 
> 2) The TDX module design allows it to be updated while the system is
> running.  The update procedure shares quite a few steps with this "on
> demand" initialization mechanism.  The hope is that much of "on demand"
> mechanism can be shared with a future "update" mechanism.  A boot-time
> TDX module implementation would not be able to share much code with the
> update mechanism.
> 
> 3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
> code mucks with VMX enabling.  If the TDX module were to be initialized
> separately from KVM (like at boot), the boot code would need to be
> taught how to muck with VMX enabling and KVM would need to be taught how
> to cope with that.  Making KVM itself responsible for TDX initialization
> lets the rest of the kernel stay blissfully unaware of VMX.
> 
> Similar to module initialization, also make the per-cpu initialization
> "on demand" as it also depends on VMX being enabled.
> 
> Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> module and enable TDX on local cpu respectively.  For now tdx_enable()
> is a placeholder.  The TODO list will be pared down as functionality is
> added.
> 
> Export both tdx_cpu_enable() and tdx_enable() for KVM use.
> 
> In tdx_enable() use a state machine protected by mutex to make sure the
> initialization will only be done once, as tdx_enable() can be called
> multiple times (i.e. KVM module can be reloaded) and may be called
> concurrently by other kernel components in the future.
> 
> The per-cpu initialization on each cpu can only be done once during the
> module's life time.  Use a per-cpu variable to track its status to make
> sure it is only done once in tdx_cpu_enable().
> 
> Also, a SEAMCALL to do TDX module global initialization must be done
> once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
> it inside tdx_cpu_enable() too (if it hasn't been done).
> 
> tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
> per-cpu initialization must be done before those SEAMCALLs are invoked
> on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
> per-cpu initialization regardless of whether the TDX module has been
> initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
> but assume the caller has disabled CPU hotplug, done VMXON and
> tdx_cpu_enable() on all online cpus before calling tdx_enable().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v11 -> v12:
>   - Simplified TDX module global init and lp init status tracking (David).
>   - Added comment around try_init_module_global() for using
>     raw_spin_lock() (Dave).
>   - Added one sentence to changelog to explain why to expose tdx_enable()
>     and tdx_cpu_enable() (Dave).
>   - Simplified comments around tdx_enable() and tdx_cpu_enable() to use
>     lockdep_assert_*() instead. (Dave)
>   - Removed redundant "TDX" in error message (Dave).
> 
> v10 -> v11:
>   - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
>   - Return the actual error code for tdx_enable() instead of -EINVAL.
>   - Added Isaku's Reviewed-by.
> 
> v9 -> v10:
>   - Merged the patch to handle per-cpu initialization to this patch to
>     tell the story better.
>   - Changed how to handle the per-cpu initialization to only provide a
>     tdx_cpu_enable() function to let the user of TDX to do it when the
>     user wants to run TDX code on a certain cpu.
>   - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
>     call lockdep_assert_cpus_held() to assume the caller has done that.
>   - Improved comments around tdx_enable() and tdx_cpu_enable().
>   - Improved changelog to tell the story better accordingly.
> 
> v8 -> v9:
>   - Removed detailed TODO list in the changelog (Dave).
>   - Added back steps to do module global initialization and per-cpu
>     initialization in the TODO list comment.
>   - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> 
> v7 -> v8:
>   - Refined changelog (Dave).
>   - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
>   - Add a "TODO list" comment in init_tdx_module() to list all steps of
>     initializing the TDX Module to tell the story (Dave).
>   - Made tdx_enable() universally return -EINVAL, and removed nonsense
>     comments (Dave).
>   - Simplified __tdx_enable() to only handle success or failure.
>   - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
>   - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
>   - Improved comments (Dave).
>   - Pointed out 'tdx_module_status' is software thing (Dave).
> 
> v6 -> v7:
>   - No change.
> 
> v5 -> v6:
>   - Added code to set status to TDX_MODULE_NONE if TDX module is not
>     loaded (Chao)
>   - Added Chao's Reviewed-by.
>   - Improved comments around cpus_read_lock().
> 
> - v3->v5 (no feedback on v4):
>   - Removed the check that SEAMRR and TDX KeyID have been detected on
>     all present cpus.
>   - Removed tdx_detect().
>   - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
>     hotplug lock and return early with error message.
>   - Improved dmesg printing for TDX module detection and initialization.
> 
> 
> ---
>   arch/x86/include/asm/tdx.h  |   4 +
>   arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++
>   arch/x86/virt/vmx/tdx/tdx.h |  13 +++
>   3 files changed, 179 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4dfe2e794411..d8226a50c58c 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>   
>   #ifdef CONFIG_INTEL_TDX_HOST
>   bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
>   #else	/* !CONFIG_INTEL_TDX_HOST */
>   static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> +static inline int tdx_enable(void)  { return -ENODEV; }
>   #endif	/* CONFIG_INTEL_TDX_HOST */
>   
>   #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 141d12376c4d..29ca18f66d61 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
>   #include <linux/errno.h>
>   #include <linux/printk.h>
>   #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
>   #include <asm/msr-index.h>
>   #include <asm/msr.h>
>   #include <asm/archrandom.h>
> @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init;
>   static u32 tdx_guest_keyid_start __ro_after_init;
>   static u32 tdx_nr_guest_keyids __ro_after_init;
>   
> +static bool tdx_global_initialized;
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
> +static DEFINE_PER_CPU(bool, tdx_lp_initialized);
> +
> +static enum tdx_module_status_t tdx_module_status;

Why can't you switch to a simple bool here as well?

It's either initialized or uninitialized. If uninitialized and you get 
an error, leave it uninitialized. The next caller will try again and 
fail again.
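The simple-bool scheme David describes can be sketched in userspace C (fake_init_module() is a stand-in for the real init path, not a kernel function; the point is that a failed attempt leaves the state unset so the next caller simply retries):

```c
/* Sketch of the simple-bool state tracking: on failure, leave the bool
 * unset so the next caller retries (and fails again, until it works). */
#include <assert.h>
#include <stdbool.h>

static bool tdx_module_initialized;
static int init_should_fail = 1;	/* make the first attempt fail */

static int fake_init_module(void)
{
	if (init_should_fail) {
		init_should_fail = 0;
		return -1;		/* e.g. a late SEAMCALL failed */
	}
	return 0;
}

static int tdx_enable(void)
{
	int ret;

	if (tdx_module_initialized)
		return 0;		/* already done, nothing to do */
	ret = fake_init_module();
	if (!ret)
		tdx_module_initialized = true;
	return ret;			/* on error, stay uninitialized */
}
```

The downside Kai raises below is exactly visible here: everything fake_init_module() prints before its failure point would be printed again on every retry.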
Huang, Kai June 29, 2023, 10:58 p.m. UTC | #19
> > +static enum tdx_module_status_t tdx_module_status;
> 
> Why can't you switch to a simple bool here as well?
> 
> It's either initialized or uninitialized. If uninitialized and you get 
> an error, leave it uninitialized. The next caller will try again and 
> fail again.
> 

We can, but in this case messages might be printed in each module
initialization call.  Let's say TDH.SYS.INFO is successful but the later
TDH.SYS.CONFIG fails.  In this case, each initialization attempt will print out
TDX module info and CMR info.

I think only allowing initialization to be attempted once would be better in
this case.  Apart from the message printing, it's OK to just use a simple bool.
Peter Zijlstra June 30, 2023, 9:22 a.m. UTC | #20
On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:

> > 	Can be called locally or through an IPI function call.
> > 
> 
> Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> it will be called either locally or through an IPI function call.  Otherwise, we
> do via a new separate function tdx_global_init() and no lock is needed in that
> function.  The caller should call it properly.

IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
actual need for that, but the code as presented didn't -- in comments or
otherwise -- make it clear why it was as it was.

TDX not specifying time constraints on the various TD/SEAM-CALLs is
ofcourse sad, but alas.
Peter Zijlstra June 30, 2023, 9:25 a.m. UTC | #21
On Thu, Jun 29, 2023 at 12:00:44AM +0000, Huang, Kai wrote:

> The spec says it doesn't have a latency requirement, so theoretically it could
> be long.  SEAMCALL is a VMEXIT so it would at least cost thousands of cycles.

:-(

> If raw_spinlock isn't desired, I think I can introduce another function to do
> this and let the caller to call it before calling tdx_cpu_enable().  E.g., we
> can have below functions:
> 
> 1) tdx_global_init()	-> TDH_SYS_INIT
> 2) tdx_cpu_init()	-> TDH_SYS_LP_INIT
> 3) tdx_enable()		-> actual module initialization
> 
> How does this sound?

Ah, wait, I hadn't had enough wake-up juice, it's tdx_global_init() that
did the raw_spinlock_t, but that isn't the IPI thing.

Then perhaps just use a mutex to serialize things?
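A userspace analogue of the mutex-serialized one-time init being suggested (do_sys_init() stands in for the TDH.SYS.INIT SEAMCALL; only the first caller across all threads actually runs it, and a sleeping lock is fine because this path is never entered from IPI context):

```c
/* Userspace sketch: serialize the one-time init with a mutex rather
 * than a raw spinlock with IRQs disabled. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t tdx_global_init_lock = PTHREAD_MUTEX_INITIALIZER;
static bool tdx_global_initialized;
static int sys_init_calls;		/* how often the "SEAMCALL" ran */

static int do_sys_init(void)
{
	sys_init_calls++;		/* stand-in for TDH.SYS.INIT */
	return 0;
}

static int try_init_module_global(void)
{
	int ret = 0;

	pthread_mutex_lock(&tdx_global_init_lock);
	if (!tdx_global_initialized) {
		ret = do_sys_init();
		if (!ret)
			tdx_global_initialized = true;
	}
	pthread_mutex_unlock(&tdx_global_init_lock);
	return ret;
}

static void *worker(void *arg)
{
	(void)arg;
	try_init_module_global();
	return NULL;
}

static int run_workers(int n)
{
	pthread_t t[16];

	for (int i = 0; i < n; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < n; i++)
		pthread_join(&t[i], NULL);
	return sys_init_calls;		/* should be exactly 1 */
}
```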
Peter Zijlstra June 30, 2023, 9:26 a.m. UTC | #22
On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > 
> > I can't find a single caller of this.. why is this exported?
> 
> It's for KVM TDX patch to use, which isn't in this series.
> 
> I'll remove the export.  KVM TDX series can export it.

Fair enough; where will the KVM TDX series call this? Earlier there was
talk about doing it at kvm module load time -- but I objected (and still
do object) to that.

What's the current plan?
Huang, Kai June 30, 2023, 9:48 a.m. UTC | #23
On Fri, 2023-06-30 at 11:25 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 12:00:44AM +0000, Huang, Kai wrote:
> 
> > The spec says it doesn't have a latency requirement, so theoretically it could
> > be long.  SEAMCALL is a VMEXIT so it would at least cost thousands of cycles.
> 
> :-(
> 
> > If raw_spinlock isn't desired, I think I can introduce another function to do
> > this and let the caller to call it before calling tdx_cpu_enable().  E.g., we
> > can have below functions:
> > 
> > 1) tdx_global_init()	-> TDH_SYS_INIT
> > 2) tdx_cpu_init()	-> TDH_SYS_LP_INIT
> > 3) tdx_enable()		-> actual module initialization
> > 
> > How does this sound?
> 
> Ah, wait, I hadn't had enough wake-up juice, it's tdx_global_init() that
> did the raw_spinlock_t, but that isn't the IPI thing.
> 
> Then perhaps just use a mutex to serialize things?
> 

In the current code yes TDH_SYS_INIT is protected by raw_spinlock_t, because it
is done in tdx_cpu_enable().  I thought this makes the caller (KVM)'s life
easier as it doesn't have to call an additional tdx_global_init().

If we put TDH_SYS_INIT to an additional tdx_global_init(), then we are
essentially asking the caller to guarantee it must be called before calling any
tdx_cpu_enable() (or tdx_cpu_init() for better naming).  But in this case we
don't need the raw_spinlock anymore because it's the caller's responsibility now.

They both are not protected by the TDX module initialization mutex, only
tdx_enable() is.  The caller (KVM) is supposed to call tdx_cpu_enable() for all
online cpus via IPI function call before calling tdx_enable().

So if using raw_spinlock_t around TDH_SYS_INIT is a concern, then we can go with
the dedicated tdx_global_init() function option.

Hope I've explained this clearly.
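The calling convention described here can be summarized as kernel-style pseudocode (a sketch only, not buildable standalone; enable_tdx_all() and do_tdx_cpu_enable() are hypothetical caller-side helpers, not functions from this series):

```c
/* Hypothetical caller-side (KVM-side) helpers; not part of this series. */
static void do_tdx_cpu_enable(void *failed)
{
	if (tdx_cpu_enable())
		atomic_inc((atomic_t *)failed);
}

static int enable_tdx_all(void)
{
	atomic_t failed = ATOMIC_INIT(0);
	int ret;

	cpus_read_lock();		/* no CPU hotplug while enabling */
	/* per-cpu init on all online cpus via IPI function call */
	on_each_cpu(do_tdx_cpu_enable, &failed, true);
	ret = atomic_read(&failed) ? -EIO : tdx_enable();
	cpus_read_unlock();
	return ret;
}
```

The real caller must also have done VMXON on each cpu before the IPIs, which this sketch omits.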
Huang, Kai June 30, 2023, 9:55 a.m. UTC | #24
On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > 
> > > I can't find a single caller of this.. why is this exported?
> > 
> > It's for KVM TDX patch to use, which isn't in this series.
> > 
> > I'll remove the export.  KVM TDX series can export it.
> 
> Fair enough; where will the KVM TDX series call this? Earlier there was
> talk about doing it at kvm module load time -- but I objected (and still
> do object) to that.
> 
> What's the current plan?
> 

The direction is still doing it during module load (not my series anyway).  But
this can be a separate discussion with KVM maintainers involved.

I understand you have a concern that you don't want to have the memory & cpu time
wasted on enabling TDX by default.  For that we can have a kernel command line
option to disable TDX once and for all (we can even make that the default).  It's just not in
this initial TDX support series but I'll send one once this initial support is
done, as mentioned in the cover letter of the previous version (sadly I removed
this paragraph for the sake of making the cover letter shorter):

"
Also, the patch to add the new kernel command line tdx="force" isn't included
in this initial version, as Dave suggested it isn't mandatory.  But I
will add one once this initial version gets merged.
"

Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
reduce your concern too.
Huang, Kai June 30, 2023, 10:09 a.m. UTC | #25
On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:
> 
> > > 	Can be called locally or through an IPI function call.
> > > 
> > 
> > Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> > it will be called either locally or through an IPI function call.  Otherwise, we
> > do via a new separate function tdx_global_init() and no lock is needed in that
> > function.  The caller should call it properly.
> 
> IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
> actual need for that, but the code as presented didn't -- in comments or
> otherwise -- make it clear why it was as it was.

There's no hard requirement as I replied in another email.

Presumably you prefer the option to have a dedicated tdx_global_init() so we can
avoid the raw_spinlock_t?

Thanks.

> 
> TDX not specifying time constraints on the various TD/SEAM-CALLs is
> ofcourse sad, but alas.

Agreed.
Peter Zijlstra June 30, 2023, 6:30 p.m. UTC | #26
On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote:
> On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > > 
> > > > I can't find a single caller of this.. why is this exported?
> > > 
> > > It's for KVM TDX patch to use, which isn't in this series.
> > > 
> > > I'll remove the export.  KVM TDX series can export it.
> > 
> > Fair enough; where will the KVM TDX series call this? Earlier there was
> > talk about doing it at kvm module load time -- but I objected (and still
> > do object) to that.
> > 
> > What's the current plan?
> > 
> 
> The direction is still doing it during module load (not my series anyway).  But
> this can be a separate discussion with KVM maintainers involved.

They all on Cc afaict.

> I understand you have concern that you don't want to have the memory & cpu time
> wasted on enabling TDX by default.  For that we can have a kernel command line
> to disable TDX once for all (we can even make it default).

That's insane, I don't want to totally disable it. I want it done at
guest creation. Do the whole TDX setup the moment you actually create a
TDX guest.

Totally killing TDX is stupid, just about as stupid as doing it on
module load (which equates to always doing it).

> Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
> reduce your concern too.

I don't get this obsession with doing at module load time :/
Isaku Yamahata June 30, 2023, 6:42 p.m. UTC | #27
On Fri, Jun 30, 2023 at 10:09:08AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:
> > 
> > > > 	Can be called locally or through an IPI function call.
> > > > 
> > > 
> > > Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> > > it will be called either locally or through an IPI function call.  Otherwise, we
> > > do via a new separate function tdx_global_init() and no lock is needed in that
> > > function.  The caller should call it properly.
> > 
> > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
> > actual need for that, but the code as presented didn't -- in comments or
> > otherwise -- make it clear why it was as it was.
> 
> There's no hard requirement as I replied in another email.
> 
> Presumably you prefer the option to have a dedicated tdx_global_init() so we can
> avoid the raw_spinlock_t?

TDX KVM calls tdx_cpu_enable() in IPI context from the KVM hardware_setup()
callback.  tdx_cpu_enable() calls tdx_global_init().
Isaku Yamahata June 30, 2023, 7:05 p.m. UTC | #28
On Fri, Jun 30, 2023 at 08:30:20PM +0200,
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote:
> > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > > > 
> > > > > I can't find a single caller of this.. why is this exported?
> > > > 
> > > > It's for KVM TDX patch to use, which isn't in this series.
> > > > 
> > > > I'll remove the export.  KVM TDX series can export it.
> > > 
> > > Fair enough; where will the KVM TDX series call this? Earlier there was
> > > talk about doing it at kvm module load time -- but I objected (and still
> > > do object) to that.
> > > 
> > > What's the current plan?
> > > 
> > 
> > The direction is still doing it during module load (not my series anyway).  But
> > this can be a separate discussion with KVM maintainers involved.
> 
> They all on Cc afaict.
> 
> > I understand you have concern that you don't want to have the memory & cpu time
> > wasted on enabling TDX by default.  For that we can have a kernel command line
> > to disable TDX once for all (we can even make it default).
> 
> That's insane, I don't want to totally disable it. I want it done at
> guest creation. Do the whole TDX setup the moment you actually create a
> TDX guest.
> 
> Totally killing TDX is stupid, just about as stupid as doing it on
> module load (which equates to always doing it).
> 
> > Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
> > reduce your concern too.
> 
> I don't get this obsession with doing at module load time :/

The KVM maintainers prefer doing the initialization at kvm_intel.ko load time. [1]
I can change the enable_tdx parameter for kvm_intel.ko to a string instead of a boolean.
Something like

enable_tdx
        ondemand: on-demand initialization when creating the first TDX guest
        onload:   initialize TDX module when loading kvm_intel.ko
        disable:  disable TDX support
        

[1] https://lore.kernel.org/lkml/YkTvw5OXTTFf7j4y@google.com/
Sean Christopherson June 30, 2023, 9:24 p.m. UTC | #29
On Fri, Jun 30, 2023, Isaku Yamahata wrote:
> On Fri, Jun 30, 2023 at 08:30:20PM +0200,
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote:
> > > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> > > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > > > > 
> > > > > > I can't find a single caller of this.. why is this exported?
> > > > > 
> > > > > It's for KVM TDX patch to use, which isn't in this series.
> > > > > 
> > > > > I'll remove the export.  KVM TDX series can export it.
> > > > 
> > > > Fair enough; where will the KVM TDX series call this? Earlier there was
> > > > talk about doing it at kvm module load time -- but I objected (and still
> > > > do object) to that.
> > > > 
> > > > What's the current plan?
> > > > 
> > > 
> > > The direction is still doing it during module load (not my series anyway).  But
> > > this can be a separate discussion with KVM maintainers involved.
> > 
> > They all on Cc afaict.
> > 
> > > I understand you have concern that you don't want to have the memory & cpu time
> > > wasted on enabling TDX by default.  For that we can have a kernel command line
> > > to disable TDX once for all (we can even make it default).
> > 
> > That's insane, I don't want to totally disable it. I want it done at
> > guest creation. Do the whole TDX setup the moment you actually create a
> > TDX guest.
> > 
> > Totally killing TDX is stupid, 

I dunno about that, *totally* killing TDX would make my life a lot simpler ;-)

> > just about as stupid as doing it on module load (which equates to always
> > doing it).
> > 
> > > Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
> > > reduce your concern too.
> > 
> > I don't get this obsession with doing at module load time :/

Waiting until userspace attempts to create the first TDX guest adds complexity
and limits what KVM can do to harden itself.  Currently, all feature support in
KVM is effectively frozen at module load.  E.g. most of the setup code is
contained in __init functions, many module-scoped variables are effectively 
RO after init (though they can't be marked as such until we smush kvm-intel.ko
and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
those patterns would get tossed aside if KVM waits until userspace attempts to
create the first guest.

The userspace experience would also be poor, as KVM can't know whether or not TDX is
actually supported until the TDX module is fully loaded and configured.  KVM waits
until VM creation to enable VMX, but that's pure enabling and more or less
guaranteed to succeed, e.g. will succeed barring hardware failures, software bugs,
or *severe* memory pressure.

There are also latency and noisy neighbor concerns, e.g. we *really* don't want
to end up in a situation where creating a TDX guest for a customer can observe
arbitrary latency *and* potentially be disruptive to VMs already running on the
host.

Userspace can work around the second and third issues by spawning a dummy TDX guest
as early as possible, but that adds complexity to userspace, especially if there's
any desire for it to be race free, e.g. with respect to reporting system capabilities
to the control plane.

On the flip side, limited hardware availability (unless Intel has changed its
tune) and the amount of enabling that's required in BIOS and whatnot makes it
highly unlikely that random Linux users are going to unknowingly boot with TDX
enabled.

That said, if this is a sticking point, let's just make enable_tdx off by default,
i.e. force userspace to opt-in.  Deployments that *know* they may want to schedule
TDX VMs on the host can simply force the module param.  And for everyone else,
since KVM is typically configured as a module by distros, KVM can be unloaded and
reloaded if the user realizes they want TDX well after the system is up and running.
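The off-by-default opt-in being discussed would amount to something like the following kernel module-parameter fragment (a hypothetical sketch for the KVM TDX series, not code from this patch set):

```c
/* Hypothetical sketch for kvm_intel.ko; off by default, opt-in only. */
static bool enable_tdx;
module_param(enable_tdx, bool, 0444);
MODULE_PARM_DESC(enable_tdx, "Enable TDX (Trust Domain Extensions) support");
```

Deployments that want TDX would load the module with `modprobe kvm_intel enable_tdx=1`; everyone else gets plain VMs with no TDX memory or cpu cost.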

> The KVM maintainers prefer doing the initialization at kvm_intel.ko load time. [1]

You can say "Sean", I'm not the bogeyman :-)

> I can change the enable_tdx parameter for kvm_intel.ko to a string instead of a boolean.
> Something like
> 
> enable_tdx
>         ondemand: on-demand initialization when creating the first TDX guest
>         onload:   initialize TDX module when loading kvm_intel.ko

No, that's the most complex path and makes no one happy.

>         disable:  disable TDX support
Dan Williams June 30, 2023, 9:58 p.m. UTC | #30
Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Isaku Yamahata wrote:
> > On Fri, Jun 30, 2023 at 08:30:20PM +0200,
> > Peter Zijlstra <peterz@infradead.org> wrote:
[..]
> On the flip side, limited hardware availability (unless Intel has changed its
> tune) and the amount of enabling that's required in BIOS and whatnot makes it
> highly unlikely that random Linux users are going to unknowingly boot with TDX
> enabled.
> 
> That said, if this is a sticking point, let's just make enable_tdx off by default,
> i.e. force userspace to opt-in.  Deployments that *know* they may want to schedule
> TDX VMs on the host can simply force the module param.  And for everyone else,
> since KVM is typically configured as a module by distros, KVM can be unloaded and
> reloaded if the user realizes they want TDX well after the system is up and running.

Another potential option that also avoids the concern that module
parameters are unwieldy [1] is to have kvm_intel have a soft-dependency
on something like a kvm_intel_tdx module. That affords both a BIOS *and*
userspace policy opt-out where kvm_intel.ko can check that
kvm_intel_tdx.ko is present at init time, or proceed with tdx disabled.

[1]: http://lore.kernel.org/r/Y7z99mf1M5edxV4A@kroah.com
Dave Hansen June 30, 2023, 11:13 p.m. UTC | #31
On 6/30/23 14:24, Sean Christopherson wrote:
> That said, if this is a sticking point, let's just make enable_tdx off by default,
> i.e. force userspace to opt-in.  Deployments that *know* they may want to schedule
> TDX VMs on the host can simply force the module param.  And for everyone else,
> since KVM is typically configured as a module by distros, KVM can be unloaded and
> reload if the user realizes they want TDX well after the system is up and running.

Let's just default it to off for now.

If we default it to on, we risk inflicting TDX on existing KVM users
that don't want it (by surprise).  If it turns out to be _that_ big of an
inconvenience, we'd have to reverse course and change the default from
on=>off.  *That* would break existing TDX users when we do it.  Gnashing
of teeth all around would ensue.

On the other hand, if we force TDX users to turn it on from day one, we
don't surprise _anyone_ that wasn't asking for it.  The only teeth
gnashing is for the TDX folks.

We could change _that_ down the line if the TDX users get too rowdy.
But I'd much rather err on the side of inconveniencing the guys that
know they want the snazzy new hardware than those who just want to run
plain old VMs.

I honestly don't care all that much either way.  There's an escape hatch
at runtime (reload kvm_intel.ko) no matter what we do.
Huang, Kai July 1, 2023, 8:15 a.m. UTC | #32
On Fri, 2023-06-30 at 10:09 +0000, Huang, Kai wrote:
> On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:
> > 
> > > > 	Can be called locally or through an IPI function call.
> > > > 
> > > 
> > > Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> > > it will be called either locally or through an IPI function call.  Otherwise, we
> > > do via a new separate function tdx_global_init() and no lock is needed in that
> > > function.  The caller should call it properly.
> > 
> > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
> > actual need for that, but the code as presented didn't -- in comments or
> > otherwise -- make it clear why it was as it was.
> 
> There's no hard requirement as I replied in another email.
> 
> Presumably you prefer the option to have a dedicated tdx_global_init() so we can
> avoid the raw_spinlock_t?
> 

Hmm... didn't have enough coffee.  Sorry, after more thinking, I think we need to
avoid tdx_global_init() but do TDH.SYS.INIT within tdx_cpu_enable() with a
raw_spinlock_t.  The reason is that although KVM will be the first caller of TDX,
there will be other callers of TDX in a later phase (e.g., IOMMU TDX support), so
we need to consider races between those callers.

With multiple callers, the tdx_global_init() and tdx_cpu_enable() from them need
to be serialized anyway, and having the additional tdx_global_init() will just
make things more complicated.

So I think the simplest way is to use a per-cpu variable to track
TDH.SYS.LP.INIT in tdx_cpu_enable(), only call tdx_cpu_enable() locally with
IRQs disabled or from an IPI function call, and use a raw_spinlock_t for
TDH.SYS.INIT inside tdx_cpu_enable() to make sure it only gets called once.

I'll clarify this in the changelog and/or comments.

Again sorry for the noise and please let me know for any comments.  Thanks!
Peter Zijlstra July 3, 2023, 10:38 a.m. UTC | #33
On Fri, Jun 30, 2023 at 04:13:39PM -0700, Dave Hansen wrote:

> I honestly don't care all that much either way.  There's an escape hatch
> at runtime (reload kvm_intel.ko) no matter what we do.

Please try with my MODULE=n kernel ;-) localyesconfig FTW.
Peter Zijlstra July 3, 2023, 10:49 a.m. UTC | #34
On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:

> I dunno about that, *totally* killing TDX would make my life a lot simpler ;-)

:-)

> > > I don't get this obsession with doing at module load time :/
> 
> Waiting until userspace attempts to create the first TDX guest adds complexity
> and limits what KVM can do to harden itself.  Currently, all feature support in
> KVM is effectively frozen at module load.  E.g. most of the setup code is
> contained in __init functions, many module-scoped variables are effectively 
> RO after init (though they can't be marked as such until we smush kvm-intel.ko
> and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> those patterns would get tossed aside if KVM waits until userspace attempts to
> create the first guest.

Pff, all that is perfectly possible, just a wee bit more work :-) I
mean, we manage to poke text that's RO, surely we can poke a variable
that's supposedly RO.

And I really wish we could put part of the kvm-intel/amd.ko things in
the kernel proper and reduce the EXPORT_SYMBOL surface -- we're
exporting a whole bunch of things that really shouldn't be, just for KVM
:/

> The userspace experience would also be poor, as KVM can't know whether or not TDX is
> actually supported until the TDX module is fully loaded and configured.

Quality that :-(

> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> to end up in a situation where creating a TDX guest for a customer can observe
> arbitrary latency *and* potentially be disruptive to VMs already running on the
> host.

Well, that's a quality of implementation issue with the whole TDX
crapola. Sounds like we want to impose latency constraints on the
various TDX calls. Allowing it to consume arbitrary amounts of CPU time
is unacceptable in any case.

> Userspace can workaround the second and third issues by spawning a dummy TDX guest
> as early as possible, but that adds complexity to userspace, especially if there's
> any desire for it to be race free, e.g. with respect to reporting system capabilities
> to the control plan.

FWIW, I'm 100% behind pushing complexity into userspace if it makes for
a simpler kernel.

> On the flip side, limited hardware availability (unless Intel has changed its
> tune) and the amount of enabling that's required in BIOS and whatnot makes it
> highly unlikely that random Linux users are going to unknowingly boot with TDX
> enabled.
> 
> That said, if this is a sticking point, let's just make enable_tdx off by default,

OK.
Dave Hansen July 3, 2023, 2:40 p.m. UTC | #35
On 7/3/23 03:49, Peter Zijlstra wrote:
>> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
>> to end up in a situation where creating a TDX guest for a customer can observe
>> arbitrary latency *and* potentially be disruptive to VMs already running on the
>> host.
> Well, that's a quality of implementation issue with the whole TDX
> crapola. Sounds like we want to impose latency constraints on the
> various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> is unacceptable in any case.

For what it's worth, everybody knew that calling into the TDX module was
going to be a black hole and that consuming large amounts of CPU at
random times would drive people bat guano crazy.

The TDX Module ABI spec does have "Leaf Function Latency" warnings for
some of the module calls.  But, it's basically a binary thing.  A call
is either normal or "longer than most".

The majority of the "longer than most" cases are for initialization.
The _most_ obscene runtime ones are chunked up and can return partial
progress to limit latency spikes.  But I don't think folks tried as hard
on the initialization calls since they're only called once which
actually seems pretty reasonable to me.

Maybe we need three classes of "Leaf Function Latency":
1. Sane
2. "Longer than most"
3. Better turn the NMI watchdog off before calling this. :)

Would that help?
Peter Zijlstra July 3, 2023, 3:03 p.m. UTC | #36
On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
> On 7/3/23 03:49, Peter Zijlstra wrote:
> >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> >> to end up in a situation where creating a TDX guest for a customer can observe
> >> arbitrary latency *and* potentially be disruptive to VMs already running on the
> >> host.
> > Well, that's a quality of implementation issue with the whole TDX
> > crapola. Sounds like we want to impose latency constraints on the
> > various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> > is unacceptable in any case.
> 
> For what it's worth, everybody knew that calling into the TDX module was
> going to be a black hole and that consuming large amounts of CPU at
> random times would drive people bat guano crazy.
> 
> The TDX Module ABI spec does have "Leaf Function Latency" warnings for
> some of the module calls.  But, it's basically a binary thing.  A call
> is either normal or "longer than most".
> 
> The majority of the "longer than most" cases are for initialization.
> The _most_ obscene runtime ones are chunked up and can return partial
> progress to limit latency spikes.  But I don't think folks tried as hard
> on the initialization calls since they're only called once which
> actually seems pretty reasonable to me.
> 
> Maybe we need three classes of "Leaf Function Latency":
> 1. Sane
> 2. "Longer than most"
> 3. Better turn the NMI watchdog off before calling this. :)
> 
> Would that help?

I'm thinking we want something along the lines of the Xen preemptible
hypercalls, except less crazy. Where the caller does:

	for (;;) {
		ret = tdcall(fn, args);
		if (ret == -EAGAIN) {
			cond_resched();
			continue;
		}
		break;
	}

And then the TDX black box provides a guarantee that any one tdcall (or
seamcall or whatever) never takes more than X ns (possibly even
configurable) and we get to raise a bug report if we can prove it
actually takes longer.

Handing the CPU off to random code for random period of time is just not
a good idea, ever.
Dave Hansen July 3, 2023, 3:26 p.m. UTC | #37
On 7/3/23 08:03, Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
>> On 7/3/23 03:49, Peter Zijlstra wrote:
>>>> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
>>>> to end up in a situation where creating a TDX guest for a customer can observe
>>>> arbitrary latency *and* potentially be disruptive to VMs already running on the
>>>> host.
>>> Well, that's a quality of implementation issue with the whole TDX
>>> crapola. Sounds like we want to impose latency constraints on the
>>> various TDX calls. Allowing it to consume arbitrary amounts of CPU time
>>> is unacceptable in any case.
>>
>> For what it's worth, everybody knew that calling into the TDX module was
>> going to be a black hole and that consuming large amounts of CPU at
>> random times would drive people bat guano crazy.
>>
>> The TDX Module ABI spec does have "Leaf Function Latency" warnings for
>> some of the module calls.  But, it's basically a binary thing.  A call
>> is either normal or "longer than most".
>>
>> The majority of the "longer than most" cases are for initialization.
>> The _most_ obscene runtime ones are chunked up and can return partial
>> progress to limit latency spikes.  But I don't think folks tried as hard
>> on the initialization calls since they're only called once which
>> actually seems pretty reasonable to me.
>>
>> Maybe we need three classes of "Leaf Function Latency":
>> 1. Sane
>> 2. "Longer than most"
>> 3. Better turn the NMI watchdog off before calling this. :)
>>
>> Would that help?
> 
> I'm thikning we want something along the lines of the Xen preemptible
> hypercalls, except less crazy. Where the caller does:
> 
> 	for (;;) {
> 		ret = tdcall(fn, args);
> 		if (ret == -EAGAIN) {
> 			cond_resched();
> 			continue;
> 		}
> 		break;
> 	}
> 
> And then the TDX black box provides a guarantee that any one tdcall (or
> seamcall or whatever) never takes more than X ns (possibly even
> configurable) and we get to raise a bug report if we can prove it
> actually takes longer.

It's _supposed_ to be doing something kinda like that.  For instance, in
the places that need locking, the TDX module essentially does:

	if (!trylock(&lock))
		return -EBUSY;

which is a heck of a lot better than spinning in the TDX module.  Those
module locks are also almost always for things that *also* have some
kind of concurrency control in Linux too.

*But*, there are also the really nasty calls that *do* take forever.  It
would be great to have a list of them or, heck, even *enumeration* of
which ones can take forever so we don't need to maintain a table.
Kirill A . Shutemov July 3, 2023, 5:55 p.m. UTC | #38
On Mon, Jul 03, 2023 at 05:03:30PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
> > On 7/3/23 03:49, Peter Zijlstra wrote:
> > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> > >> to end up in a situation where creating a TDX guest for a customer can observe
> > >> arbitrary latency *and* potentially be disruptive to VMs already running on the
> > >> host.
> > > Well, that's a quality of implementation issue with the whole TDX
> > > crapola. Sounds like we want to impose latency constraints on the
> > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> > > is unacceptable in any case.
> > 
> > For what it's worth, everybody knew that calling into the TDX module was
> > going to be a black hole and that consuming large amounts of CPU at
> > random times would drive people bat guano crazy.
> > 
> > The TDX Module ABI spec does have "Leaf Function Latency" warnings for
> > some of the module calls.  But, it's basically a binary thing.  A call
> > is either normal or "longer than most".
> > 
> > The majority of the "longer than most" cases are for initialization.
> > The _most_ obscene runtime ones are chunked up and can return partial
> > progress to limit latency spikes.  But I don't think folks tried as hard
> > on the initialization calls since they're only called once which
> > actually seems pretty reasonable to me.
> > 
> > Maybe we need three classes of "Leaf Function Latency":
> > 1. Sane
> > 2. "Longer than most"
> > 3. Better turn the NMI watchdog off before calling this. :)
> > 
> > Would that help?
> 
> I'm thikning we want something along the lines of the Xen preemptible
> hypercalls, except less crazy. Where the caller does:
> 
> 	for (;;) {
> 		ret = tdcall(fn, args);
> 		if (ret == -EAGAIN) {
> 			cond_resched();
> 			continue;
> 		}
> 		break;
> 	}
> 
> And then the TDX black box provides a guarantee that any one tdcall (or
> seamcall or whatever) never takes more than X ns (possibly even
> configurable) and we get to raise a bug report if we can prove it
> actually takes longer.

TDG.VP.VMCALL TDCALL can take an arbitrary amount of time as it hands over
control to the host/VMM.

But I don't quite follow how it is different from the host stopping
scheduling a vCPU on a random instruction. It can happen at any point and
TDCALL is not special from this PoV.
Dave Hansen July 3, 2023, 6:26 p.m. UTC | #39
On 7/3/23 10:55, kirill.shutemov@linux.intel.com wrote:
>> I'm thikning we want something along the lines of the Xen preemptible
>> hypercalls, except less crazy. Where the caller does:
>>
>> 	for (;;) {
>> 		ret = tdcall(fn, args);
>> 		if (ret == -EAGAIN) {
>> 			cond_resched();
>> 			continue;
>> 		}
>> 		break;
>> 	}
>>
>> And then the TDX black box provides a guarantee that any one tdcall (or
>> seamcall or whatever) never takes more than X ns (possibly even
>> configurable) and we get to raise a bug report if we can prove it
>> actually takes longer.
> TDG.VP.VMCALL TDCALL can take arbitrary amount of time as it handles over
> control to the host/VMM.
> 
> But I'm not quite follow how it is different from the host stopping
> scheduling vCPU on a random instruction. It can happen at any point and
> TDCALL is not special from this PoV.

Well, for one, if the host stops the vCPU on a random instruction the
host has to restore all the vCPU state.  *ALL* of it.  That means that
after the host hands control back, the guest is perfectly ready to take
all the interrupts that are pending.

These TDCALLs are *VERY* different.  The guest gets control back and has
some amount of its state zapped, RBP being the most annoying current
example of state that is lost.  So the guest resumes control here and
must handle all of its interrupts with some of its state (and thus
ability to cleanly handle the interrupt) gone.

The instructions after state is lost are very much special.  Just look
at the syscall gap.
Peter Zijlstra July 4, 2023, 4:58 p.m. UTC | #40
On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:

> Waiting until userspace attempts to create the first TDX guest adds complexity
> and limits what KVM can do to harden itself.  Currently, all feature support in
> KVM is effectively frozen at module load.  E.g. most of the setup code is
> contained in __init functions, many module-scoped variables are effectively 
> RO after init (though they can't be marked as such until we smush kvm-intel.ko
> and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> those patterns would get tossed aside if KVM waits until userspace attempts to
> create the first guest.

....

People got poked and the following was suggested:

On boot do:

 TDH.SYS.INIT
 TDH.SYS.LP.INIT
 TDH.SYS.CONFIG
 TDH.SYS.KEY.CONFIG

This should get TDX mostly sorted, but doesn't consume much resources.
Then later, when starting the first TDX guest, do the whole

 TDH.TDMR.INIT

dance to set up the PAMT array -- which is what gobbles up memory. From
what I understand the TDH.TDMR.INIT thing is not one of those
excessively long calls.

If we have concerns about allocating the PAMT array, can't we use CMA
for this? Allocate the whole thing at boot as CMA such that when not
used for TDX it can be used for regular things like userspace and
filecache pages?

Those TDH.SYS calls should be enough to ensure TDX is actually working,
no?
Huang, Kai July 4, 2023, 9:50 p.m. UTC | #41
On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:
> 
> > Waiting until userspace attempts to create the first TDX guest adds complexity
> > and limits what KVM can do to harden itself.  Currently, all feature support in
> > KVM is effectively frozen at module load.  E.g. most of the setup code is
> > contained in __init functions, many module-scoped variables are effectively 
> > RO after init (though they can't be marked as such until we smush kvm-intel.ko
> > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> > those patterns would get tossed aside if KVM waits until userspace attempts to
> > create the first guest.
> 
> ....
> 
> People got poked and the following was suggested:
> 
> On boot do:
> 
>  TDH.SYS.INIT
>  TDH.SYS.LP.INIT
>  TDH.SYS.CONFIG
>  TDH.SYS.KEY.CONFIG
> 
> This should get TDX mostly sorted, but doesn't consume much resources.
> Then later, when starting the first TDX guest, do the whole
> 
>  TDH.TDMR.INIT
> 
> dance to set up the PAMT array -- which is what gobbles up memory. From
> what I understand the TDH.TDMR.INIT thing is not one of those
> excessively long calls.

The TDH.TDMR.INIT itself has its own latency requirement implemented in the TDX
module, thus it only initializes a small chunk (1M, I guess) in each call.
Therefore we need a loop doing a bunch of TDH.TDMR.INIT calls in order to
initialize all PAMT entries for all TDX-usable memory, which can be
time-consuming.

Currently, for simplicity, we just do this inside the module initialization, but
it can be optimized later once we have an agreed solution for how to optimize it.

> 
> If we have concerns about allocating the PAMT array, can't we use CMA
> for this? Allocate the whole thing at boot as CMA such that when not
> used for TDX it can be used for regular things like userspace and
> filecache pages?

The PAMT allocation itself isn't a concern I think.  The concern is the
TDH.TDMR.INIT to initialize them.

Also, one practical problem preventing us from pre-allocating the PAMT is that
the PAMT size to be allocated can only be determined after the TDH.SYS.INFO
SEAMCALL, which reports the "PAMT entry size" in the TDSYSINFO_STRUCT.

> 
> Those TDH.SYS calls should be enough to ensure TDX is actually working,
> no?
Peter Zijlstra July 5, 2023, 7:14 a.m. UTC | #42
On Mon, Jul 03, 2023 at 08:55:56PM +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jul 03, 2023 at 05:03:30PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
> > > On 7/3/23 03:49, Peter Zijlstra wrote:
> > > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> > > >> to end up in a situation where creating a TDX guest for a customer can observe
> > > >> arbitrary latency *and* potentially be disruptive to VMs already running on the
> > > >> host.
> > > > Well, that's a quality of implementation issue with the whole TDX
> > > > crapola. Sounds like we want to impose latency constraints on the
> > > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> > > > is unacceptable in any case.
> > > 
> > > For what it's worth, everybody knew that calling into the TDX module was
> > > going to be a black hole and that consuming large amounts of CPU at
> > > random times would drive people bat guano crazy.
> > > 
> > > The TDX Module ABI spec does have "Leaf Function Latency" warnings for
> > > some of the module calls.  But, it's basically a binary thing.  A call
> > > is either normal or "longer than most".
> > > 
> > > The majority of the "longer than most" cases are for initialization.
> > > The _most_ obscene runtime ones are chunked up and can return partial
> > > progress to limit latency spikes.  But I don't think folks tried as hard
> > > on the initialization calls since they're only called once which
> > > actually seems pretty reasonable to me.
> > > 
> > > Maybe we need three classes of "Leaf Function Latency":
> > > 1. Sane
> > > 2. "Longer than most"
> > > 3. Better turn the NMI watchdog off before calling this. :)
> > > 
> > > Would that help?
> > 
> > I'm thikning we want something along the lines of the Xen preemptible
> > hypercalls, except less crazy. Where the caller does:
> > 
> > 	for (;;) {
> > 		ret = tdcall(fn, args);
> > 		if (ret == -EAGAIN) {
> > 			cond_resched();
> > 			continue;
> > 		}
> > 		break;
> > 	}
> > 
> > And then the TDX black box provides a guarantee that any one tdcall (or
> > seamcall or whatever) never takes more than X ns (possibly even
> > configurable) and we get to raise a bug report if we can prove it
> > actually takes longer.
> 
> TDG.VP.VMCALL TDCALL can take arbitrary amount of time as it handles over
> control to the host/VMM.
> 
> But I'm not quite follow how it is different from the host stopping
> scheduling vCPU on a random instruction. It can happen at any point and
> TDCALL is not special from this PoV.

A guest will exit on timer/interrupt and then the host can reschedule;
AFAIU this doesn't actually happen with these TDX calls, if control is
in that SEAM thing, it stays there until it's done.
Peter Zijlstra July 5, 2023, 7:16 a.m. UTC | #43
On Tue, Jul 04, 2023 at 09:50:22PM +0000, Huang, Kai wrote:
> On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote:
> > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:
> > 
> > > Waiting until userspace attempts to create the first TDX guest adds complexity
> > > and limits what KVM can do to harden itself.  Currently, all feature support in
> > > KVM is effectively frozen at module load.  E.g. most of the setup code is
> > > contained in __init functions, many module-scoped variables are effectively 
> > > RO after init (though they can't be marked as such until we smush kvm-intel.ko
> > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> > > those patterns would get tossed aside if KVM waits until userspace attempts to
> > > create the first guest.
> > 
> > ....
> > 
> > People got poked and the following was suggested:
> > 
> > On boot do:
> > 
> >  TDH.SYS.INIT
> >  TDH.SYS.LP.INIT
> >  TDH.SYS.CONFIG
> >  TDH.SYS.KEY.CONFIG
> > 
> > This should get TDX mostly sorted, but doesn't consume much resources.
> > Then later, when starting the first TDX guest, do the whole
> > 
> >  TDH.TDMR.INIT
> > 
> > dance to set up the PAMT array -- which is what gobbles up memory. From
> > what I understand the TDH.TDMR.INIT thing is not one of those
> > excessively long calls.
> 
> The TDH.TDMR.INIT itself has it's own latency requirement implemented in the TDX
> module, thus it only initializes a small chunk  (1M I guess) in each call. 
> Therefore we need a loop to do bunch of TDH.TDMR.INIT in order to initialize all
> PAMT entries for all TDX-usable memory, which can be time-consuming.

Yeah, so you can put a cond_resched() in that loop and all is well, you
do not negatively affect other tasks. Because *that* was the concern
raised.
Huang, Kai July 5, 2023, 7:54 a.m. UTC | #44
On Wed, 2023-07-05 at 09:16 +0200, Peter Zijlstra wrote:
> On Tue, Jul 04, 2023 at 09:50:22PM +0000, Huang, Kai wrote:
> > On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote:
> > > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:
> > > 
> > > > Waiting until userspace attempts to create the first TDX guest adds complexity
> > > > and limits what KVM can do to harden itself.  Currently, all feature support in
> > > > KVM is effectively frozen at module load.  E.g. most of the setup code is
> > > > contained in __init functions, many module-scoped variables are effectively 
> > > > RO after init (though they can't be marked as such until we smush kvm-intel.ko
> > > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> > > > those patterns would get tossed aside if KVM waits until userspace attempts to
> > > > create the first guest.
> > > 
> > > ....
> > > 
> > > People got poked and the following was suggested:
> > > 
> > > On boot do:
> > > 
> > >  TDH.SYS.INIT
> > >  TDH.SYS.LP.INIT
> > >  TDH.SYS.CONFIG
> > >  TDH.SYS.KEY.CONFIG
> > > 
> > > This should get TDX mostly sorted, but doesn't consume much resources.
> > > Then later, when starting the first TDX guest, do the whole
> > > 
> > >  TDH.TDMR.INIT
> > > 
> > > dance to set up the PAMT array -- which is what gobbles up memory. From
> > > what I understand the TDH.TDMR.INIT thing is not one of those
> > > excessively long calls.
> > 
> > The TDH.TDMR.INIT itself has it's own latency requirement implemented in the TDX
> > module, thus it only initializes a small chunk  (1M I guess) in each call. 
> > Therefore we need a loop to do bunch of TDH.TDMR.INIT in order to initialize all
> > PAMT entries for all TDX-usable memory, which can be time-consuming.
> 
> Yeah, so you can put a cond_resched() in that loop and all is well, you
> do not negatively affect other tasks. Because *that* was the concern
> raised.

Yes, cond_resched() has been done.  It's in patch 16 (x86/virt/tdx: Initialize
all TDMRs).
Dave Hansen July 5, 2023, 2:34 p.m. UTC | #45
On 7/4/23 09:58, Peter Zijlstra wrote:
> If we have concerns about allocating the PAMT array, can't we use CMA
> for this? Allocate the whole thing at boot as CMA such that when not
> used for TDX it can be used for regular things like userspace and
> filecache pages?

I never thought of CMA as being super reliable.  Maybe it's improved
over the years.

KVM also has a rather nasty habit of pinning pages, like for device
passthrough.  I suspect that means that we'll have one of two scenarios:

 1. CMA works great, but the TDX/CMA area is unusable for KVM because
    it's pinning all its pages and they just get moved out of the CMA
    area immediately.  The CMA area is effectively wasted.
 2. CMA sucks, and users get sporadic TDX failures when they wait a long
    time to run a TDX guest after boot.  Users just work around the CMA
    support by starting up TDX guests at boot or demanding a module
    parameter be set.  Hacking in CMA support was a waste.

Am I just too much of a pessimist?
Peter Zijlstra July 5, 2023, 2:57 p.m. UTC | #46
On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote:
> On 7/4/23 09:58, Peter Zijlstra wrote:
> > If we have concerns about allocating the PAMT array, can't we use CMA
> > for this? Allocate the whole thing at boot as CMA such that when not
> > used for TDX it can be used for regular things like userspace and
> > filecache pages?
> 
> I never thought of CMA as being super reliable.  Maybe it's improved
> over the years.
> 
> KVM also has a rather nasty habit of pinning pages, like for device
> passthrough.  I suspect that means that we'll have one of two scenarios:
> 
>  1. CMA works great, but the TDX/CMA area is unusable for KVM because
>     it's pinning all its pages and they just get moved out of the CMA
>     area immediately.  The CMA area is effectively wasted.
>  2. CMA sucks, and users get sporadic TDX failures when they wait a long
>     time to run a TDX guest after boot.  Users just work around the CMA
>     support by starting up TDX guests at boot or demanding a module
>     parameter be set.  Hacking in CMA support was a waste.
> 
> Am I just too much of a pessimist?

Well, if CMA still sucks, then that needs fixing. If CMA works, but we
have a circular fail in that KVM needs to long-term pin the PAMT pages
but long-term pin is evicted from CMA (the whole point of long-term pin,
after all), then surely we can break that cycle somehow, since in this
case the purpose of the CMA is being able to grab that memory chunk when
we need it.

That is, either way around is just a matter of a little code, no?
Dave Hansen July 6, 2023, 2:49 p.m. UTC | #47
On 7/5/23 07:57, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote:
>> On 7/4/23 09:58, Peter Zijlstra wrote:
>>> If we have concerns about allocating the PAMT array, can't we use CMA
>>> for this? Allocate the whole thing at boot as CMA such that when not
>>> used for TDX it can be used for regular things like userspace and
>>> filecache pages?
>> I never thought of CMA as being super reliable.  Maybe it's improved
>> over the years.
>>
>> KVM also has a rather nasty habit of pinning pages, like for device
>> passthrough.  I suspect that means that we'll have one of two scenarios:
>>
>>  1. CMA works great, but the TDX/CMA area is unusable for KVM because
>>     it's pinning all its pages and they just get moved out of the CMA
>>     area immediately.  The CMA area is effectively wasted.
>>  2. CMA sucks, and users get sporadic TDX failures when they wait a long
>>     time to run a TDX guest after boot.  Users just work around the CMA
>>     support by starting up TDX guests at boot or demanding a module
>>     parameter be set.  Hacking in CMA support was a waste.
>>
>> Am I just too much of a pessimist?
> Well, if CMA still sucks, then that needs fixing. If CMA works, but we
> have a circular fail in that KVM needs to long-term pin the PAMT pages
> but long-term pin is evicted from CMA (the whole point of long-term pin,
> after all), then surely we can break that cycle somehow, since in this
> case the purpose of the CMA is being able to grab that memory chunk when
> we needs it.
> 
> That is, either way around is just a matter of a little code, no?

It's not a circular dependency, it's conflicting requirements.

CMA makes memory more available, but only in the face of unpinned pages.

KVM can pin lots of pages, even outside of TDX-based VMs.

So we either need to change how CMA works fundamentally or stop KVM from
pinning pages.
Sean Christopherson July 10, 2023, 5:58 p.m. UTC | #48
On Thu, Jul 06, 2023, Dave Hansen wrote:
> On 7/5/23 07:57, Peter Zijlstra wrote:
> > On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote:
> >> On 7/4/23 09:58, Peter Zijlstra wrote:
> >>> If we have concerns about allocating the PAMT array, can't we use CMA
> >>> for this? Allocate the whole thing at boot as CMA such that when not
> >>> used for TDX it can be used for regular things like userspace and
> >>> filecache pages?
> >> I never thought of CMA as being super reliable.  Maybe it's improved
> >> over the years.
> >>
> >> KVM also has a rather nasty habit of pinning pages, like for device
> >> passthrough.  I suspect that means that we'll have one of two scenarios:
> >>
> >>  1. CMA works great, but the TDX/CMA area is unusable for KVM because
> >>     it's pinning all its pages and they just get moved out of the CMA
> >>     area immediately.  The CMA area is effectively wasted.
> >>  2. CMA sucks, and users get sporadic TDX failures when they wait a long
> >>     time to run a TDX guest after boot.  Users just work around the CMA
> >>     support by starting up TDX guests at boot or demanding a module
> >>     parameter be set.  Hacking in CMA support was a waste.
> >>
> >> Am I just too much of a pessimist?
> > Well, if CMA still sucks, then that needs fixing. If CMA works, but we
> > have a circular fail in that KVM needs to long-term pin the PAMT pages
> > but long-term pin is evicted from CMA (the whole point of long-term pin,
> > after all), then surely we can break that cycle somehow, since in this
> > case the purpose of the CMA is being able to grab that memory chunk when
> > we need it.
> > 
> > That is, either way around is just a matter of a little code, no?
> 
> It's not a circular dependency, it's conflicting requirements.
> 
> CMA makes memory more available, but only in the face of unpinned pages.
> 
> KVM can pin lots of pages, even outside of TDX-based VMs.
> 
> So we either need to change how CMA works fundamentally or stop KVM from
> pinning pages.

Nit, I think you're conflating KVM with VFIO and/or IOMMU code.  Device passthrough
does pin large chunks of memory, but KVM itself isn't involved or even aware of
the pins.

HugeTLB is another case where CMA will be effectively used to serve guest memory,
but again KVM isn't the thing doing the pinning.

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4dfe2e794411..d8226a50c58c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -97,8 +97,12 @@  static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
+int tdx_cpu_enable(void);
+int tdx_enable(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_cpu_enable(void) { return -ENODEV; }
+static inline int tdx_enable(void)  { return -ENODEV; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 141d12376c4d..29ca18f66d61 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,10 @@ 
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/smp.h>
+#include <linux/cpu.h>
+#include <linux/spinlock.h>
+#include <linux/percpu-defs.h>
+#include <linux/mutex.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
@@ -23,6 +27,13 @@  static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+static bool tdx_global_initialized;
+static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
+static DEFINE_PER_CPU(bool, tdx_lp_initialized);
+
+static enum tdx_module_status_t tdx_module_status;
+static DEFINE_MUTEX(tdx_module_lock);
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
@@ -74,6 +85,157 @@  static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	}
 }
 
+/*
+ * Do the module global initialization if not done yet.
+ * It's always called with interrupts and preemption disabled.
+ */
+static int try_init_module_global(void)
+{
+	unsigned long flags;
+	int ret;
+
+	/*
+	 * The TDX module global initialization only needs to be done
+	 * once on any cpu.
+	 */
+	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
+
+	if (tdx_global_initialized) {
+		ret = 0;
+		goto out;
+	}
+
+	/* All '0's are just unused parameters. */
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+	if (!ret)
+		tdx_global_initialized = true;
+out:
+	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
+
+	return ret;
+}
+
+/**
+ * tdx_cpu_enable - Enable TDX on local cpu
+ *
+ * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
+ * global initialization SEAMCALL if not done) on local cpu to make this
+ * cpu be ready to run any other SEAMCALLs.
+ *
+ * Call this function with preemption disabled.
+ *
+ * Return 0 on success, otherwise errors.
+ */
+int tdx_cpu_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -ENODEV;
+
+	lockdep_assert_preemption_disabled();
+
+	/* Already done */
+	if (__this_cpu_read(tdx_lp_initialized))
+		return 0;
+
+	/*
+	 * The TDX module global initialization is the very first step
+	 * to enable TDX.  Need to do it first (if hasn't been done)
+	 * before the per-cpu initialization.
+	 */
+	ret = try_init_module_global();
+	if (ret)
+		return ret;
+
+	/* All '0's are just unused parameters */
+	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+	if (ret)
+		return ret;
+
+	__this_cpu_write(tdx_lp_initialized, true);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_enable);
+
+static int init_tdx_module(void)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Get TDX module information and TDX-capable memory regions.
+	 *  - Build the list of TDX-usable memory regions.
+	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
+	 *    all TDX-usable memory regions.
+	 *  - Configure the TDMRs and the global KeyID to the TDX module.
+	 *  - Configure the global KeyID on all packages.
+	 *  - Initialize all TDMRs.
+	 *
+	 *  Return error before all steps are done.
+	 */
+	return -EINVAL;
+}
+
+static int __tdx_enable(void)
+{
+	int ret;
+
+	ret = init_tdx_module();
+	if (ret) {
+		pr_err("module initialization failed (%d)\n", ret);
+		tdx_module_status = TDX_MODULE_ERROR;
+		return ret;
+	}
+
+	pr_info("module initialized.\n");
+	tdx_module_status = TDX_MODULE_INITIALIZED;
+
+	return 0;
+}
+
+/**
+ * tdx_enable - Enable TDX module to make it ready to run TDX guests
+ *
+ * This function assumes the caller has: 1) held read lock of CPU hotplug
+ * lock to prevent any new cpu from becoming online; 2) done both VMXON
+ * and tdx_cpu_enable() on all online cpus.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return 0 if TDX is enabled successfully, otherwise error.
+ */
+int tdx_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -ENODEV;
+
+	lockdep_assert_cpus_held();
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_enable();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		/* Already initialized, great, tell the caller. */
+		ret = 0;
+		break;
+	default:
+		/* Failed to initialize in the previous attempts */
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 55dbb1b8c971..9fb46033c852 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -16,11 +16,24 @@ 
  */
 #define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
  */
 
+/* Kernel defined TDX module status during module initialization. */
+enum tdx_module_status_t {
+	TDX_MODULE_UNKNOWN,
+	TDX_MODULE_INITIALIZED,
+	TDX_MODULE_ERROR
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);