
[v10,05/16] x86/virt/tdx: Add skeleton to enable TDX on demand

Message ID f150316b975b5ca22c6c4016ffd90db79d657bbf.1678111292.git.kai.huang@intel.com (mailing list archive)
State New, archived
Series TDX host kernel support

Commit Message

Huang, Kai March 6, 2023, 2:13 p.m. UTC
To enable TDX the kernel needs to initialize TDX from two perspectives:
1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
on one logical cpu before the kernel wants to make any other SEAMCALLs
on that cpu (including those involved during module initialization and
running TDX guests).

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach: don't initialize TDX until there is a real need
(e.g. when requested by KVM).  This approach has the following pros:

1) It avoids consuming the memory that must be allocated by kernel and
given to the TDX module as metadata (~1/256th of the TDX-usable memory),
and also saves the CPU cycles of initializing the TDX module (and the
metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running.  The update procedure shares quite a few steps with this "on
demand" initialization mechanism.  The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism.  A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Making a SEAMCALL requires VMX to be enabled.  Currently, only the KVM
code mucks with VMX enabling.  If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that.  Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.

Similar to module initialization, also make the per-cpu initialization
"on demand" as it also depends on VMX to be enabled.

Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
module and enable TDX on local cpu respectively.  For now tdx_enable()
is a placeholder.  The TODO list will be pared down as functionality is
added.

In tdx_enable(), use a state machine protected by a mutex to make sure
the initialization is only done once, as tdx_enable() can be called
multiple times (e.g. the KVM module can be reloaded) and may be called
concurrently by other kernel components in the future.

The per-cpu initialization on each cpu can only be done once during the
module's life time.  Use a per-cpu variable to track its status to make
sure it is only done once in tdx_cpu_enable().

Also, a SEAMCALL to do TDX module global initialization must be done
once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
it inside tdx_cpu_enable() too (if it hasn't been done yet).

tdx_enable() can potentially invoke SEAMCALLs on any online cpu.  The
per-cpu initialization must be done before those SEAMCALLs are invoked
on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
per-cpu initialization regardless of whether the TDX module has been
initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
but assume the caller has disabled CPU hotplug and done VMXON and
tdx_cpu_enable() on all online cpus before calling tdx_enable().

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v9 -> v10:
 - Merged the patch to handle per-cpu initialization to this patch to
   tell the story better.
 - Changed how to handle the per-cpu initialization: only provide a
   tdx_cpu_enable() function to let the user of TDX do it when it
   wants to run TDX code on a certain cpu.
 - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
   call lockdep_assert_cpus_held() to assume the caller has done that.
 - Improved comments around tdx_enable() and tdx_cpu_enable().
 - Improved changelog to tell the story better accordingly.

v8 -> v9:
 - Removed detailed TODO list in the changelog (Dave).
 - Added back steps to do module global initialization and per-cpu
   initialization in the TODO list comment.
 - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h

v7 -> v8:
 - Refined changelog (Dave).
 - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
 - Add a "TODO list" comment in init_tdx_module() to list all steps of
   initializing the TDX Module to tell the story (Dave).
 - Made tdx_enable() universally return -EINVAL, and removed nonsense
   comments (Dave).
 - Simplified __tdx_enable() to only handle success or failure.
 - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
 - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
 - Improved comments (Dave).
 - Pointed out 'tdx_module_status' is software thing (Dave).

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to set status to TDX_MODULE_NONE if TDX module is not
   loaded (Chao)
 - Added Chao's Reviewed-by.
 - Improved comments around cpus_read_lock().

v3 -> v5 (no feedback on v4):
 - Removed the check that SEAMRR and TDX KeyID have been detected on
   all present cpus.
 - Removed tdx_detect().
 - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
   hotplug lock and return early with error message.
 - Improved dmesg printing for TDX module detection and initialization.

---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 182 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  25 +++++
 3 files changed, 211 insertions(+)

Comments

Isaku Yamahata March 8, 2023, 10:27 p.m. UTC | #1
On Tue, Mar 07, 2023 at 03:13:50AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> To enable TDX the kernel needs to initialize TDX from two perspectives:
> 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> on one logical cpu before the kernel wants to make any other SEAMCALLs
> on that cpu (including those involved during module initialization and
> running TDX guests).
> 
> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> "on demand" approach: don't initialize TDX until there is a real need
> (e.g. when requested by KVM).  This approach has the following pros:
> 
> 1) It avoids consuming the memory that must be allocated by kernel and
> given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> and also saves the CPU cycles of initializing the TDX module (and the
> metadata) when TDX is not used at all.
> 
> 2) The TDX module design allows it to be updated while the system is
> running.  The update procedure shares quite a few steps with this "on
> demand" initialization mechanism.  The hope is that much of "on demand"
> mechanism can be shared with a future "update" mechanism.  A boot-time
> TDX module implementation would not be able to share much code with the
> update mechanism.
> 
> 3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
> code mucks with VMX enabling.  If the TDX module were to be initialized
> separately from KVM (like at boot), the boot code would need to be
> taught how to muck with VMX enabling and KVM would need to be taught how
> to cope with that.  Making KVM itself responsible for TDX initialization
> lets the rest of the kernel stay blissfully unaware of VMX.
> 
> Similar to module initialization, also make the per-cpu initialization
> "on demand" as it also depends on VMX to be enabled.
> 
> Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> module and enable TDX on local cpu respectively.  For now tdx_enable()
> is a placeholder.  The TODO list will be pared down as functionality is
> added.
> 
> In tdx_enable() use a state machine protected by mutex to make sure the
> initialization will only be done once, as tdx_enable() can be called
> multiple times (i.e. KVM module can be reloaded) and may be called
> concurrently by other kernel components in the future.
> 
> The per-cpu initialization on each cpu can only be done once during the
> module's life time.  Use a per-cpu variable to track its status to make
> sure it is only done once in tdx_cpu_enable().
> 
> Also, a SEAMCALL to do TDX module global initialization must be done
> once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
> it inside tdx_cpu_enable() too (if it hasn't been done yet).
> 
> tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
> per-cpu initialization must be done before those SEAMCALLs are invoked
> on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
> per-cpu initialization regardless of whether the TDX module has been
> initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
> but assume the caller has disabled CPU hotplug and done VMXON and
> tdx_cpu_enable() on all online cpus before calling tdx_enable().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v9 -> v10:
>  - Merged the patch to handle per-cpu initialization to this patch to
>    tell the story better.
>  - Changed how to handle the per-cpu initialization to only provide a
>    tdx_cpu_enable() function to let the user of TDX to do it when the
>    user wants to run TDX code on a certain cpu.
>  - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
>    call lockdep_assert_cpus_held() to assume the caller has done that.
>  - Improved comments around tdx_enable() and tdx_cpu_enable().
>  - Improved changelog to tell the story better accordingly.
> 
> v8 -> v9:
>  - Removed detailed TODO list in the changelog (Dave).
>  - Added back steps to do module global initialization and per-cpu
>    initialization in the TODO list comment.
>  - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> 
> v7 -> v8:
>  - Refined changelog (Dave).
>  - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
>  - Add a "TODO list" comment in init_tdx_module() to list all steps of
>    initializing the TDX Module to tell the story (Dave).
>  - Made tdx_enable() universally return -EINVAL, and removed nonsense
>    comments (Dave).
>  - Simplified __tdx_enable() to only handle success or failure.
>  - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
>  - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
>  - Improved comments (Dave).
>  - Pointed out 'tdx_module_status' is software thing (Dave).
> 
> v6 -> v7:
>  - No change.
> 
> v5 -> v6:
>  - Added code to set status to TDX_MODULE_NONE if TDX module is not
>    loaded (Chao)
>  - Added Chao's Reviewed-by.
>  - Improved comments around cpus_read_lock().
> 
> - v3->v5 (no feedback on v4):
>  - Removed the check that SEAMRR and TDX KeyID have been detected on
>    all present cpus.
>  - Removed tdx_detect().
>  - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
>    hotplug lock and return early with error message.
>  - Improved dmesg printing for TDX module detection and initialization.
> 
> ---
>  arch/x86/include/asm/tdx.h  |   4 +
>  arch/x86/virt/vmx/tdx/tdx.c | 182 ++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h |  25 +++++
>  3 files changed, 211 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b489b5b9de5d..112a5b9bd5cd 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -102,8 +102,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -EINVAL; }
> +static inline int tdx_enable(void)  { return -EINVAL; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b65b838f3b5d..29127cb70f51 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
> @@ -22,6 +26,18 @@ static u32 tdx_global_keyid __ro_after_init;
>  static u32 tdx_guest_keyid_start __ro_after_init;
>  static u32 tdx_nr_guest_keyids __ro_after_init;
>  
> +static unsigned int tdx_global_init_status;
> +static DEFINE_SPINLOCK(tdx_global_init_lock);
> +#define TDX_GLOBAL_INIT_DONE	_BITUL(0)
> +#define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
> +
> +static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
> +#define TDX_LP_INIT_DONE	_BITUL(0)
> +#define TDX_LP_INIT_FAILED	_BITUL(1)
> +
> +static enum tdx_module_status_t tdx_module_status;
> +static DEFINE_MUTEX(tdx_module_lock);
> +
>  /*
>   * Use tdx_global_keyid to indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -159,3 +175,169 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	put_cpu();
>  	return ret;
>  }
> +
> +static int try_init_module_global(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	spin_lock(&tdx_global_init_lock);
> +
> +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> +			-EINVAL : 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */
> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +
> +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> +	if (ret)
> +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;

If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
In that case, we should allow the caller to retry, or make this function retry,
instead of recording the error stickily.

Except that,
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>

Thanks,
Huang, Kai March 12, 2023, 11:08 p.m. UTC | #2
On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > +
> > +static int try_init_module_global(void)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	spin_lock(&tdx_global_init_lock);
> > +
> > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > +			-EINVAL : 0;
> > +		goto out;
> > +	}
> > +
> > +	/* All '0's are just unused parameters. */
> > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +
> > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > +	if (ret)
> > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> 
> If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> In such case, we should allow the caller to retry or make this function retry
> instead of marking error stickily.

The spec says:

TDX_SYS_BUSY	The operation was invoked when another TDX module
		operation was in progress. The operation may be retried.

So I don't see how a lack of entropy is related to this error.  Perhaps you
were mixing it up with KEY.CONFIG?
Isaku Yamahata March 13, 2023, 11:49 p.m. UTC | #3
On Sun, Mar 12, 2023 at 11:08:44PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > +
> > > +static int try_init_module_global(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * The TDX module global initialization only needs to be done
> > > +	 * once on any cpu.
> > > +	 */
> > > +	spin_lock(&tdx_global_init_lock);
> > > +
> > > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > +			-EINVAL : 0;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/* All '0's are just unused parameters. */
> > > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > +
> > > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > +	if (ret)
> > > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > 
> > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > In such case, we should allow the caller to retry or make this function retry
> > instead of marking error stickily.
> 
> The spec says:
> 
> TDX_SYS_BUSY	The operation was invoked when another TDX module
> 		operation was in progress. The operation may be retried.
> 
> So I don't see how entropy is lacking is related to this error.  Perhaps you
> were mixing up with KEY.CONFIG?

TDH.SYS.INIT() initializes the global canary value.  The TDX module is
compiled by clang with the strong stack protector enabled, so the canary
value needs to be initialized.  By default, the canary value is stored at
%fsbase:<STACK_CANARY_OFFSET 0x28>

Although this is normally a job for libc or the language runtime, the TDX
module has to do it itself because it is standalone.

From tdh_sys_init.c
_STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
{
    ia32_rflags_t rflags = {.raw = 0};
    uint64_t canary;
    if (!ia32_rdrand(&rflags, &canary))
    {
        return TDX_SYS_BUSY;
    }
...
    last_page_ptr->stack_canary.canary = canary;
Huang, Kai March 14, 2023, 1:50 a.m. UTC | #4
On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
> > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > +
> > > > +static int try_init_module_global(void)
> > > > +{
> > > > +	int ret;
> > > > +
> > > > +	/*
> > > > +	 * The TDX module global initialization only needs to be done
> > > > +	 * once on any cpu.
> > > > +	 */
> > > > +	spin_lock(&tdx_global_init_lock);
> > > > +
> > > > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > +			-EINVAL : 0;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	/* All '0's are just unused parameters. */
> > > > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > +
> > > > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > +	if (ret)
> > > > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > 
> > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > In such case, we should allow the caller to retry or make this function retry
> > > instead of marking error stickily.
> > 
> > The spec says:
> > 
> > TDX_SYS_BUSY	The operation was invoked when another TDX module
> > 		operation was in progress. The operation may be retried.
> > 
> > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > were mixing up with KEY.CONFIG?
> 
> TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> strong stack protector enabled by clang and canary value needs to be
> initialized.  By default, the canary value is stored at
> %fsbase:<STACK_CANARY_OFFSET 0x28>
> 
> Although this is a job for libc or language runtime, TDX modules has to do it
> itself because it's stand alone.
> 
> From tdh_sys_init.c
> _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> {
>     ia32_rflags_t rflags = {.raw = 0};
>     uint64_t canary;
>     if (!ia32_rdrand(&rflags, &canary))
>     {
>         return TDX_SYS_BUSY;
>     }
> ...
>     last_page_ptr->stack_canary.canary = canary;
> 
> 

Then it is a hidden behaviour of the TDX module that is not reflected in the
spec.  I am not sure whether we should handle it, because:

1) This is an extremely rare case.  The kernel would basically be under attack
if such an error happened.  In the current series we don't handle this case in
KEY.CONFIG either, but just leave a comment (see patch 13).

2) Not sure whether this will be changed in the future.

So I think we should keep it as is.
Isaku Yamahata March 14, 2023, 4:02 a.m. UTC | #5
On Tue, Mar 14, 2023 at 01:50:40AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> > On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > > +
> > > > > +static int try_init_module_global(void)
> > > > > +{
> > > > > +	int ret;
> > > > > +
> > > > > +	/*
> > > > > +	 * The TDX module global initialization only needs to be done
> > > > > +	 * once on any cpu.
> > > > > +	 */
> > > > > +	spin_lock(&tdx_global_init_lock);
> > > > > +
> > > > > +	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > > +		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > > +			-EINVAL : 0;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	/* All '0's are just unused parameters. */
> > > > > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > > +
> > > > > +	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > > +	if (ret)
> > > > > +		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > > 
> > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > > In such case, we should allow the caller to retry or make this function retry
> > > > instead of marking error stickily.
> > > 
> > > The spec says:
> > > 
> > > TDX_SYS_BUSY	The operation was invoked when another TDX module
> > > 		operation was in progress. The operation may be retried.
> > > 
> > > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > > were mixing up with KEY.CONFIG?
> > 
> > TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> > strong stack protector enabled by clang and canary value needs to be
> > initialized.  By default, the canary value is stored at
> > %fsbase:<STACK_CANARY_OFFSET 0x28>
> > 
> > Although this is a job for libc or language runtime, TDX modules has to do it
> > itself because it's stand alone.
> > 
> > From tdh_sys_init.c
> > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> > {
> >     ia32_rflags_t rflags = {.raw = 0};
> >     uint64_t canary;
> >     if (!ia32_rdrand(&rflags, &canary))
> >     {
> >         return TDX_SYS_BUSY;
> >     }
> > ...
> >     last_page_ptr->stack_canary.canary = canary;
> > 
> > 
> 
> Then it is a hidden behaviour of the TDX module that is not reflected in the
> spec.  I am not sure whether we should handle because: 
> 
> 1) This is an extremely rare case.  Kernel would be basically under attack if
> such error happened.  In the current series we don't handle such case in
> KEY.CONFIG either but just leave a comment (see patch 13).
> 
> 2) Not sure whether this will be changed in the future.
> 
> So I think we should keep as is.

The TDX 1.5 spec introduced the TDX_RND_NO_ENTROPY status code.  For TDX 1.0,
let's postpone handling it to the TDX 1.5 activity.
Dave Hansen March 14, 2023, 5:45 a.m. UTC | #6
On 3/13/23 21:02, Isaku Yamahata wrote:
>> Then it is a hidden behaviour of the TDX module that is not reflected in the
>> spec.  I am not sure whether we should handle because: 
>>
>> 1) This is an extremely rare case.  Kernel would be basically under attack if
>> such error happened.  In the current series we don't handle such case in
>> KEY.CONFIG either but just leave a comment (see patch 13).
>>
>> 2) Not sure whether this will be changed in the future.
>>
>> So I think we should keep as is.
> TDX 1.5 spec introduced TDX_RND_NO_ENTROPY status code.  For TDX 1.0, let's
> postpone it to TDX 1.5 activity.

What the heck does this mean?

I don't remember seeing any code here that checks for "TDX 1.0" or "TDX
1.5".  That means that this code needs to work with _any_ TDX version.

Are features being added to new versions that break code written for old
versions?
Dave Hansen March 14, 2023, 3:48 p.m. UTC | #7
On 3/13/23 18:50, Huang, Kai wrote:
> On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
>> On Sun, Mar 12, 2023 at 11:08:44PM +0000,
>> "Huang, Kai" <kai.huang@intel.com> wrote:
>>
>>> On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
>>>>> +
>>>>> +static int try_init_module_global(void)
>>>>> +{
>>>>> +       int ret;
>>>>> +
>>>>> +       /*
>>>>> +        * The TDX module global initialization only needs to be done
>>>>> +        * once on any cpu.
>>>>> +        */
>>>>> +       spin_lock(&tdx_global_init_lock);
>>>>> +
>>>>> +       if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
>>>>> +               ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
>>>>> +                       -EINVAL : 0;
>>>>> +               goto out;
>>>>> +       }
>>>>> +
>>>>> +       /* All '0's are just unused parameters. */
>>>>> +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
>>>>> +
>>>>> +       tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
>>>>> +       if (ret)
>>>>> +               tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
>>>>
>>>> If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
>>>> In such case, we should allow the caller to retry or make this function retry
>>>> instead of marking error stickily.
>>>
>>> The spec says:
>>>
>>> TDX_SYS_BUSY        The operation was invoked when another TDX module
>>>             operation was in progress. The operation may be retried.
>>>
>>> So I don't see how entropy is lacking is related to this error.  Perhaps you
>>> were mixing up with KEY.CONFIG?
>>
>> TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
>> strong stack protector enabled by clang and canary value needs to be
>> initialized.  By default, the canary value is stored at
>> %fsbase:<STACK_CANARY_OFFSET 0x28>
>>
>> Although this is a job for libc or language runtime, TDX modules has to do it
>> itself because it's stand alone.
>>
>> From tdh_sys_init.c
>> _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
>> {
>>     ia32_rflags_t rflags = {.raw = 0};
>>     uint64_t canary;
>>     if (!ia32_rdrand(&rflags, &canary))
>>     {
>>         return TDX_SYS_BUSY;
>>     }
>> ...
>>     last_page_ptr->stack_canary.canary = canary;
>>
>>
> 
> Then it is a hidden behaviour of the TDX module that is not reflected in the
> spec.

This is true.  Could you please go ask the TDX module folks to fix this up?

> I am not sure whether we should handle because:
> 
> 1) This is an extremely rare case.  Kernel would be basically under attack if
> such error happened.  In the current series we don't handle such case in
> KEY.CONFIG either but just leave a comment (see patch 13).

Rare, yes.  Under attack?  I'm not sure where you get that from.  Look
at the SDM:

> Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand
> of random numbers by software processes/threads to exceed the rate at which the random number generator
> hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND
> instruction indicates the occurrence of this rare situation by clearing the CF flag.

That doesn't talk about attacks.

> 2) Not sure whether this will be changed in the future.
> 
> So I think we should keep as is.

TDX_SYS_BUSY really is missing some nuance.  You *REALLY* want to retry
RDRAND failures.  But, if you have VMM locking and don't expect two
users calling into the TDX module then TDX_SYS_BUSY from a busy *module*
is a bad (and probably fatal) signal.

I suspect we should just throw a few retries in the seamcall()
infrastructure to retry in the case of TDX_SYS_BUSY.  It'll take care of
RDRAND failures.  If a retry loop fails to resolve it, then we should
probably dump a warning and return an error.

Just do this once, in common code.
Isaku Yamahata March 14, 2023, 5:16 p.m. UTC | #8
On Mon, Mar 13, 2023 at 10:45:45PM -0700,
Dave Hansen <dave.hansen@intel.com> wrote:

> On 3/13/23 21:02, Isaku Yamahata wrote:
> >> Then it is a hidden behaviour of the TDX module that is not reflected in the
> >> spec.  I am not sure whether we should handle because: 
> >>
> >> 1) This is an extremely rare case.  Kernel would be basically under attack if
> >> such error happened.  In the current series we don't handle such case in
> >> KEY.CONFIG either but just leave a comment (see patch 13).
> >>
> >> 2) Not sure whether this will be changed in the future.
> >>
> >> So I think we should keep as is.
> > TDX 1.5 spec introduced TDX_RND_NO_ENTROPY status code.  For TDX 1.0, let's
> > postpone it to TDX 1.5 activity.
> 
> What the heck does this mean?
> 
> I don't remember seeing any code here that checks for "TDX 1.0" or "TDX
> 1.5".  That means that this code needs to work with _any_ TDX version.
> 
> Are features being added to new versions that break code written for old
> versions?

No new feature, but a new error code: TDX_RND_NO_ENTROPY, for lack of entropy.
For TDX 1.0, some APIs return TDX_SYS_BUSY; it can mean contention (lock
failure) or lack of entropy, and the caller can't distinguish them.
For TDX 1.5, they return TDX_RND_NO_ENTROPY instead of TDX_SYS_BUSY in the
case of rdrand/rdseed failure.

Because both TDX_SYS_BUSY and TDX_RND_NO_ENTROPY are recoverable errors
(bit 63 error=1, bit 62 non_recoverable=0), the caller can check the error
bit and the non_recoverable bit to decide whether to retry.
Dave Hansen March 14, 2023, 5:38 p.m. UTC | #9
On 3/14/23 10:16, Isaku Yamahata wrote:
>>> TDX 1.5 spec introduced TDX_RND_NO_ENTROPY status code.  For TDX 1.0, let's
>>> postpone it to TDX 1.5 activity.
>> What the heck does this mean?
>>
>> I don't remember seeing any code here that checks for "TDX 1.0" or "TDX
>> 1.5".  That means that this code needs to work with _any_ TDX version.
>>
>> Are features being added to new versions that break code written for old
>> versions?
> No new feature, but new error code. TDX_RND_NO_ENTROPY, lack of entropy.
> For TDX 1.0, some APIs return TDX_SYS_BUSY. It can be contention(lock failure)
> or the lack of entropy.  The caller can't distinguish them.
> For TDX 1.5, they return TDX_RND_NO_ENTROPY instead of TDX_SYS_BUSY in the case
> of rdrand/rdseed failure.
> 
> Because both TDX_SYS_BUSY and TDX_RND_NO_ENTROPY are recoverable error
> (bit 63 error=1, bit 62 non_recoverable=0), the caller can check error bit and
> non_recoverable bit for retry.

Oh, that's actually really nice.  It separates out the "RDRAND is empty"
issue from the "the VMM should have had a lock here" issue.

For now, let's consider TDX_SYS_BUSY to basically indicate a non-fatal
kernel bug: the kernel called TDX in a way that it shouldn't have.
We'll treat it in the kernel as non-recoverable.  We'll return an error,
WARN_ON(), and keep on running.

A follow-on patch can add generic TDX_RND_NO_ENTROPY retry support to
the seamcall infrastructure.
Huang, Kai March 15, 2023, 11:10 a.m. UTC | #10
On Tue, 2023-03-14 at 08:48 -0700, Dave Hansen wrote:
> On 3/13/23 18:50, Huang, Kai wrote:
> > On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> > > On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> > > "Huang, Kai" <kai.huang@intel.com> wrote:
> > > 
> > > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > > > +
> > > > > > +static int try_init_module_global(void)
> > > > > > +{
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * The TDX module global initialization only needs to be done
> > > > > > +        * once on any cpu.
> > > > > > +        */
> > > > > > +       spin_lock(&tdx_global_init_lock);
> > > > > > +
> > > > > > +       if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > > > +               ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > > > +                       -EINVAL : 0;
> > > > > > +               goto out;
> > > > > > +       }
> > > > > > +
> > > > > > +       /* All '0's are just unused parameters. */
> > > > > > +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > > > +
> > > > > > +       tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > > > +       if (ret)
> > > > > > +               tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > > > 
> > > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > > > In such case, we should allow the caller to retry or make this function retry
> > > > > instead of marking error stickily.
> > > > 
> > > > The spec says:
> > > > 
> > > > TDX_SYS_BUSY        The operation was invoked when another TDX module
> > > >             operation was in progress. The operation may be retried.
> > > > 
> > > > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > > > were mixing up with KEY.CONFIG?
> > > 
> > > TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> > > strong stack protector enabled by clang and canary value needs to be
> > > initialized.  By default, the canary value is stored at
> > > %fsbase:<STACK_CANARY_OFFSET 0x28>
> > > 
> > > Although this is a job for libc or language runtime, TDX modules has to do it
> > > itself because it's stand alone.
> > > 
> > > From tdh_sys_init.c
> > > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> > > {
> > >     ia32_rflags_t rflags = {.raw = 0};
> > >     uint64_t canary;
> > >     if (!ia32_rdrand(&rflags, &canary))
> > >     {
> > >         return TDX_SYS_BUSY;
> > >     }
> > > ...
> > >     last_page_ptr->stack_canary.canary = canary;
> > > 
> > > 
> > 
> > Then it is a hidden behaviour of the TDX module that is not reflected in the
> > spec.
> 
> This is true.  Could you please go ask the TDX module folks to fix this up?

Sure will do.

Just to confirm: you mean we should ask the TDX module folks to add the new
TDX_RND_NO_ENTROPY error code to TDX module 1.0?

"another TDX module operation was in progress" and "running out of entropy" are
different thing and should not be mixed together IMHO.

> 
> > I am not sure whether we should handle because:
> > 
> > 1) This is an extremely rare case.  Kernel would be basically under attack if
> > such error happened.  In the current series we don't handle such case in
> > KEY.CONFIG either but just leave a comment (see patch 13).
> 
> Rare, yes.  Under attack?  I'm not sure where you get that from.  Look
> at the SDM:
> 
> > Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand
> > of random numbers by software processes/threads to exceed the rate at which the random number generator
> > hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND
> > instruction indicates the occurrence of this rare situation by clearing the CF flag.
> 
> That doesn't talk about attacks.

Thanks for citing the documentation.  I checked the kernel code before, and it
seems there's currently no code that calls RDRAND very frequently.  But yes, we
should not say "under attack".  I have some vague memory that someone said so
(maybe me?).

> 
> > 2) Not sure whether this will be changed in the future.
> > 
> > So I think we should keep as is.
> 
> TDX_SYS_BUSY really is missing some nuance.  You *REALLY* want to retry
> RDRAND failures.  
> 

OK.  Agreed.  Then I think TDH.SYS.KEY.CONFIG should also retry when running
out of entropy.

> But, if you have VMM locking and don't expect two
> users calling into the TDX module then TDX_SYS_BUSY from a busy *module*
> is a bad (and probably fatal) signal.

Yes, we have a lock to protect TDH.SYS.INIT from being called in parallel.
Without this entropy issue, TDX_SYS_BUSY should never happen.

> 
> I suspect we should just throw a few retries in the seamcall()
> infrastructure to retry in the case of TDX_SYS_BUSY.  It'll take care of
> RDRAND failures.  If a retry loop fails to resolve it, then we should
> probably dump a warning and return an error.
> 
> Just do this once, in common code.

I can do that.  Just want to make sure: do you want to retry on TDX_SYS_BUSY,
or on TDX_RND_NO_ENTROPY (if we are going to ask the TDX module folks to change
to return this value)?

Also, even if we retry on either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in the
common seamcall() code, it doesn't handle TDH.SYS.KEY.CONFIG, because sadly
this SEAMCALL returns a different error code:

TDX_KEY_GENERATION_FAILED	Failed to generate a random key. This is 
				typically caused by an entropy error of the
				CPU's random number generator, and may
				be impacted by RDSEED, RDRAND or PCONFIG
				executing on other LPs. The operation should be
				retried.
Isaku Yamahata March 16, 2023, 12:31 a.m. UTC | #11
On Tue, Mar 07, 2023 at 03:13:50AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> +static int try_init_module_global(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	spin_lock(&tdx_global_init_lock);


If I use tdx_cpu_enable() via KVM's hardware_enable_all(), this function is
called in the context of an IPI callback and lockdep complains.  Here is my
patch to address it:

From 0c4022ffe8cd68dfb455c418eb65538e4e100115 Mon Sep 17 00:00:00 2001
Message-Id: <0c4022ffe8cd68dfb455c418eb65538e4e100115.1678926123.git.isaku.yamahata@intel.com>
In-Reply-To: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
References: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <isaku.yamahata@intel.com>
Date: Wed, 15 Mar 2023 14:26:37 -0700
Subject: [PATCH] x86/virt/vmx/tdx: Use raw spin lock instead of spin lock

tdx_cpu_enable() can be called from an IPI handler.  Lockdep complains about
the spin lock as follows.  Use a raw spin lock instead.

=============================
[ BUG: Invalid wait context ]
6.3.0-rc1-tdx-kvm-upstream+ #389 Not tainted
-----------------------------
swapper/154/0 is trying to lock:
ffffffffa7875e58 (tdx_global_init_lock){....}-{3:3}, at: tdx_cpu_enable+0x67/0x180
other info that might help us debug this:
context-{2:2}
no locks held by swapper/154/0.
stack backtrace:
Call Trace:
 <IRQ>
 dump_stack_lvl+0x64/0xb0
 dump_stack+0x10/0x20
 __lock_acquire+0x912/0xc30
 lock_acquire.part.0+0x99/0x220
 lock_acquire+0x60/0x170
 _raw_spin_lock_irqsave+0x43/0x70
 tdx_cpu_enable+0x67/0x180
 vt_hardware_enable+0x3b/0x60
 kvm_arch_hardware_enable+0xe7/0x2e0
 hardware_enable_nolock+0x33/0x80
 __flush_smp_call_function_queue+0xc4/0x590
 generic_smp_call_function_single_interrupt+0x1a/0xb0
 __sysvec_call_function+0x48/0x200
 sysvec_call_function+0xad/0xd0
 </IRQ>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2ee37a5dedcf..e1c8ffad7406 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -41,7 +41,7 @@ static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
 static unsigned int tdx_global_init_status;
-static DEFINE_SPINLOCK(tdx_global_init_lock);
+static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
 #define TDX_GLOBAL_INIT_DONE	_BITUL(0)
 #define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
 
@@ -349,6 +349,7 @@ static void tdx_trace_seamcalls(u64 level)
 
 static int try_init_module_global(void)
 {
+	unsigned long flags;
 	u64 tsx_ctrl;
 	int ret;
 
@@ -356,7 +357,7 @@ static int try_init_module_global(void)
 	 * The TDX module global initialization only needs to be done
 	 * once on any cpu.
 	 */
-	spin_lock(&tdx_global_init_lock);
+	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
 
 	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
 		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
@@ -373,7 +374,7 @@ static int try_init_module_global(void)
 	if (ret)
 		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
 out:
-	spin_unlock(&tdx_global_init_lock);
+	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
 
 	if (ret) {
 		if (trace_boot_seamcalls)
Isaku Yamahata March 16, 2023, 2:45 a.m. UTC | #12
On Wed, Mar 15, 2023 at 05:31:02PM -0700,
Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> On Tue, Mar 07, 2023 at 03:13:50AM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > +static int try_init_module_global(void)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	spin_lock(&tdx_global_init_lock);
> 
> 
> If I use tdx_cpu_enable() via kvm hardware_enable_all(), this function is called
> in the context IPI callback and the lockdep complains.  Here is my patch to
> address it
> 
> From 0c4022ffe8cd68dfb455c418eb65538e4e100115 Mon Sep 17 00:00:00 2001
> Message-Id: <0c4022ffe8cd68dfb455c418eb65538e4e100115.1678926123.git.isaku.yamahata@intel.com>
> In-Reply-To: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
> References: <d2aa2142665b8204b628232ab615c98090371c99.1678926122.git.isaku.yamahata@intel.com>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> Date: Wed, 15 Mar 2023 14:26:37 -0700
> Subject: [PATCH] x86/virt/vmx/tdx: Use raw spin lock instead of spin lock
> 
> tdx_cpu_enable() can be called by IPI handler.  The lockdep complains about
> spin lock as follows.  Use raw spin lock.
> 
> =============================
> [ BUG: Invalid wait context ]
> 6.3.0-rc1-tdx-kvm-upstream+ #389 Not tainted
> -----------------------------
> swapper/154/0 is trying to lock:
> ffffffffa7875e58 (tdx_global_init_lock){....}-{3:3}, at: tdx_cpu_enable+0x67/0x180
> other info that might help us debug this:
> context-{2:2}
> no locks held by swapper/154/0.
> stack backtrace:
> Call Trace:
>  <IRQ>
>  dump_stack_lvl+0x64/0xb0
>  dump_stack+0x10/0x20
>  __lock_acquire+0x912/0xc30
>  lock_acquire.part.0+0x99/0x220
>  lock_acquire+0x60/0x170
>  _raw_spin_lock_irqsave+0x43/0x70
>  tdx_cpu_enable+0x67/0x180
>  vt_hardware_enable+0x3b/0x60
>  kvm_arch_hardware_enable+0xe7/0x2e0
>  hardware_enable_nolock+0x33/0x80
>  __flush_smp_call_function_queue+0xc4/0x590
>  generic_smp_call_function_single_interrupt+0x1a/0xb0
>  __sysvec_call_function+0x48/0x200
>  sysvec_call_function+0xad/0xd0
>  </IRQ>
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2ee37a5dedcf..e1c8ffad7406 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -41,7 +41,7 @@ static u32 tdx_guest_keyid_start __ro_after_init;
>  static u32 tdx_nr_guest_keyids __ro_after_init;
>  
>  static unsigned int tdx_global_init_status;
> -static DEFINE_SPINLOCK(tdx_global_init_lock);
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
>  #define TDX_GLOBAL_INIT_DONE	_BITUL(0)
>  #define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
>  
> @@ -349,6 +349,7 @@ static void tdx_trace_seamcalls(u64 level)
>  
>  static int try_init_module_global(void)
>  {
> +	unsigned long flags;
>  	u64 tsx_ctrl;
>  	int ret;
>  
> @@ -356,7 +357,7 @@ static int try_init_module_global(void)
>  	 * The TDX module global initialization only needs to be done
>  	 * once on any cpu.
>  	 */
> -	spin_lock(&tdx_global_init_lock);
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);

As hardware_enable_all() uses cpus_read_lock(), irqsave isn't needed.
This line should be raw_spin_lock().
Huang, Kai March 16, 2023, 2:52 a.m. UTC | #13
> >  
> > @@ -356,7 +357,7 @@ static int try_init_module_global(void)
> >  	 * The TDX module global initialization only needs to be done
> >  	 * once on any cpu.
> >  	 */
> > -	spin_lock(&tdx_global_init_lock);
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> 
> As hardware_enable_all() uses cpus_read_lock(), irqsave isn't needed.
> this line should be raw_spin_lock().
> 

OK.  I missed that in a PREEMPT_RT kernel the spinlock is converted to a
sleeping lock.  So I'll change it to use raw_spin_lock() as we discussed.
Thanks.
Huang, Kai March 16, 2023, 10:07 p.m. UTC | #14
On Wed, 2023-03-15 at 11:10 +0000, Huang, Kai wrote:
> On Tue, 2023-03-14 at 08:48 -0700, Dave Hansen wrote:
> > On 3/13/23 18:50, Huang, Kai wrote:
> > > On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote:
> > > > On Sun, Mar 12, 2023 at 11:08:44PM +0000,
> > > > "Huang, Kai" <kai.huang@intel.com> wrote:
> > > > 
> > > > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote:
> > > > > > > +
> > > > > > > +static int try_init_module_global(void)
> > > > > > > +{
> > > > > > > +       int ret;
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * The TDX module global initialization only needs to be done
> > > > > > > +        * once on any cpu.
> > > > > > > +        */
> > > > > > > +       spin_lock(&tdx_global_init_lock);
> > > > > > > +
> > > > > > > +       if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
> > > > > > > +               ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
> > > > > > > +                       -EINVAL : 0;
> > > > > > > +               goto out;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       /* All '0's are just unused parameters. */
> > > > > > > +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > > > > > > +
> > > > > > > +       tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
> > > > > > > +       if (ret)
> > > > > > > +               tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
> > > > > > 
> > > > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY.
> > > > > > In such case, we should allow the caller to retry or make this function retry
> > > > > > instead of marking error stickily.
> > > > > 
> > > > > The spec says:
> > > > > 
> > > > > TDX_SYS_BUSY        The operation was invoked when another TDX module
> > > > >             operation was in progress. The operation may be retried.
> > > > > 
> > > > > So I don't see how entropy is lacking is related to this error.  Perhaps you
> > > > > were mixing up with KEY.CONFIG?
> > > > 
> > > > TDH.SYS.INIT() initializes global canary value.  TDX module is compiled with
> > > > strong stack protector enabled by clang and canary value needs to be
> > > > initialized.  By default, the canary value is stored at
> > > > %fsbase:<STACK_CANARY_OFFSET 0x28>
> > > > 
> > > > Although this is a job for libc or language runtime, TDX modules has to do it
> > > > itself because it's stand alone.
> > > > 
> > > > From tdh_sys_init.c
> > > > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void)
> > > > {
> > > >     ia32_rflags_t rflags = {.raw = 0};
> > > >     uint64_t canary;
> > > >     if (!ia32_rdrand(&rflags, &canary))
> > > >     {
> > > >         return TDX_SYS_BUSY;
> > > >     }
> > > > ...
> > > >     last_page_ptr->stack_canary.canary = canary;
> > > > 
> > > > 
> > > 
> > > Then it is a hidden behaviour of the TDX module that is not reflected in the
> > > spec.
> > 
> > This is true.  Could you please go ask the TDX module folks to fix this up?
> 
> Sure will do.
> 
> To make sure, you mean we should ask TDX module guys to add the new
> TDX_RND_NO_ENTROPY error code to TDX module 1.0?
> 
> "another TDX module operation was in progress" and "running out of entropy" are
> different thing and should not be mixed together IMHO.
> 
> > 
> > > I am not sure whether we should handle because:
> > > 
> > > 1) This is an extremely rare case.  Kernel would be basically under attack if
> > > such error happened.  In the current series we don't handle such case in
> > > KEY.CONFIG either but just leave a comment (see patch 13).
> > 
> > Rare, yes.  Under attack?  I'm not sure where you get that from.  Look
> > at the SDM:
> > 
> > > Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand
> > > of random numbers by software processes/threads to exceed the rate at which the random number generator
> > > hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND
> > > instruction indicates the occurrence of this rare situation by clearing the CF flag.
> > 
> > That doesn't talk about attacks.
> 
> Thanks for citing the documentation.  I checked the kernel code before and it
> seems currently there's no code to call RDRAND very frequently.  But yes we
> should not say "under attack".  I have some old memory that someone said so
> (maybe me?).
> 
> > 
> > > 2) Not sure whether this will be changed in the future.
> > > 
> > > So I think we should keep as is.
> > 
> > TDX_SYS_BUSY really is missing some nuance.  You *REALLY* want to retry
> > RDRAND failures.  
> > 
> 
> OK.  Agreed.  Then I think the TDH.SYS.KEY.CONFIG should retry when running out
> of entropy too.
> 
> > But, if you have VMM locking and don't expect two
> > users calling into the TDX module then TDX_SYS_BUSY from a busy *module*
> > is a bad (and probably fatal) signal.
> 
> Yes we have a lock to protect TDH.SYS.INIT from being called in parallel.  W/o
> this entropy thing TDX_SYS_BUSY should never happen.
> 
> > 
> > I suspect we should just throw a few retries in the seamcall()
> > infrastructure to retry in the case of TDX_SYS_BUSY.  It'll take care of
> > RDRAND failures.  If a retry loop fails to resolve it, then we should
> > probably dump a warning and return an error.
> > 
> > Just do this once, in common code.
> 
> I can do.  Just want to make sure do you want to retry TDX_SYS_BUSY, or retry
> TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this
> value)?
> 
> Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common 
> seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this
> SEAMCALL returns a different error code:
> 
> TDX_KEY_GENERATION_FAILED	Failed to generate a random key. This is 
> 				typically caused by an entropy error of the
> 				CPU's random number generator, and may
> 				be impacted by RDSEED, RDRAND or PCONFIG
> 				executing on other LPs. The operation should be
> 				retried.
> 

Hi Dave,

Sorry to ping.  Could you check whether my understanding is aligned with what
you suggested?
Dave Hansen March 23, 2023, 1:49 p.m. UTC | #15
On 3/15/23 04:10, Huang, Kai wrote:
> I can do.  Just want to make sure do you want to retry TDX_SYS_BUSY, or retry
> TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this
> value)?

I'll put it this way:

	Linux is going to treat TDX_SYS_BUSY like a Linux bug and assume
	Linux is doing something wrong.  It'll mostly mean that
	users will see something nasty and may even cause Linux to give
	up on TDX.  In other words, the TDX module shouldn't use
	TDX_SYS_BUSY for things that aren't Linux's fault.

> Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common
> seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this
> SEAMCALL returns a different error code:
> 
> TDX_KEY_GENERATION_FAILED       Failed to generate a random key. This is
>                                 typically caused by an entropy error of the
>                                 CPU's random number generator, and may
>                                 be impacted by RDSEED, RDRAND or PCONFIG
>                                 executing on other LPs. The operation should be
>                                 retried.

Sounds like we should just replace TDX_KEY_GENERATION_FAILED with
TDX_RND_NO_ENTROPY in cases where key generation fails because of a lack
of entropy.
Huang, Kai March 23, 2023, 10:09 p.m. UTC | #16
On Thu, 2023-03-23 at 06:49 -0700, Hansen, Dave wrote:
> On 3/15/23 04:10, Huang, Kai wrote:
> > I can do.  Just want to make sure do you want to retry TDX_SYS_BUSY, or retry
> > TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this
> > value)?
> 
> I'll put it this way:
> 
> 	Linux is going to treat TDX_SYS_BUSY like a Linux bug and assume
> 	Linux is doing something wrong.  It'll mostly mean that
> 	users will see something nasty and may even cause Linux to give
> 	up on TDX.  In other words, the TDX module shouldn't use
> 	TDX_SYS_BUSY for things that aren't Linux's fault.
> 
> > Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common
> > seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this
> > SEAMCALL returns a different error code:
> > 
> > TDX_KEY_GENERATION_FAILED       Failed to generate a random key. This is
> >                                 typically caused by an entropy error of the
> >                                 CPU's random number generator, and may
> >                                 be impacted by RDSEED, RDRAND or PCONFIG
> >                                 executing on other LPs. The operation should be
> >                                 retried.
> 
> Sounds like we should just replace TDX_KEY_GENERATION_FAILED with
> TDX_RND_NO_ENTROPY in cases where key generation fails because of a lack
> of entropy.

Thanks for feedback.

I'll do following, please let me know for any comments in case I have any
misunderstanding.

1) For TDH.SYS.INIT, ask the TDX module team to return TDX_RND_NO_ENTROPY
instead of TDX_SYS_BUSY when running out of entropy.

2) For TDH.SYS.KEY.CONFIG, ask the TDX module team to return TDX_RND_NO_ENTROPY
instead of TDX_KEY_GENERATION_FAILED when running out of entropy.  Whether
TDX_KEY_GENERATION_FAILED should still be kept is up to the TDX module team
(because it looks like running concurrent PCONFIGs is also related).

3) Ask the TDX module team to always return TDX_RND_NO_ENTROPY in _ALL_
SEAMCALLs and to keep this behaviour in future TDX modules too.

4) In the common seamcall(), retry on TDX_RND_NO_ENTROPY.

In terms of how many times to retry, I will use a fixed value for now, similar
to the kernel code below:

#define RDRAND_RETRY_LOOPS      10

/* Unconditional execution of RDRAND and RDSEED */

static inline bool __must_check rdrand_long(unsigned long *v)
{
	bool ok;
	unsigned int retry = RDRAND_RETRY_LOOPS;
	do {
		asm volatile("rdrand %[out]"
			     CC_SET(c)
			     : CC_OUT(c) (ok), [out] "=r" (*v));
		if (ok)
			return true;
	} while (--retry);
	return false;
}
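The bounded-retry plan above can be exercised in plain userspace C.  In this sketch the status value, the retry count, and fake_seamcall() are all illustrative stand-ins (the real TDX_RND_NO_ENTROPY constant and __seamcall() live in the kernel):

```c
#include <stdint.h>

/* Illustrative stand-ins; not the architectural constant or the real
 * SEAMCALL entry point. */
#define FAKE_RND_NO_ENTROPY	0x8000000000000001ULL
#define SEAMCALL_RETRIES	10

/* Pretend the entropy pool recovers after a couple of attempts. */
static int fake_attempts_left = 3;
static uint64_t fake_seamcall(void)
{
	return --fake_attempts_left > 0 ? FAKE_RND_NO_ENTROPY : 0;
}

/*
 * Sketch of the common wrapper: retry a bounded number of times when
 * the module reports it ran out of entropy, then give up and return
 * whatever status the last attempt produced.
 */
static uint64_t seamcall_retry(void)
{
	int retries = SEAMCALL_RETRIES;
	uint64_t ret;

	do {
		ret = fake_seamcall();
	} while (ret == FAKE_RND_NO_ENTROPY && --retries);

	return ret;
}
```

With the fake call succeeding on its third attempt, seamcall_retry() returns 0; if all SEAMCALL_RETRIES attempts failed, the entropy error would propagate to the caller.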
Dave Hansen March 23, 2023, 10:12 p.m. UTC | #17
On 3/23/23 15:09, Huang, Kai wrote:
> 1) In TDH.SYS.INIT, ask TDX module team to return TDX_RND_NO_ENTROPY instead of
> TDX_SYS_BUSY when running out of entropy.
> 
> 2) In TDH.SYS.KEY.CONFIG, ask TDX module to return TDX_RND_NO_ENTROPY instead of
> TDX_KEY_GENERATION_FAILED when running out of entropy.  Whether
> TDX_KEY_GENERATION_FAILED should be still kept is  up to TDX module team
> (because it looks running concurrent PCONFIGs is also related).
> 
> 3) Ask TDX module to always return TDX_RND_NO_ENTROPY in _ALL_ SEAMCALLs and
> keep this behaviour for future TDX modules too.

Yes, that's all fine.

> 4) In the common seamcall(), retry on TDX_RND_NO_ENTROPY.
> 
> In terms of how many times to retry, I will use a fixed value for now, similar
> to the kernel code below:
> 
> #define RDRAND_RETRY_LOOPS      10

Heck, you could even just use RDRAND_RETRY_LOOPS directly.  It's
hard(er) to bikeshed your choice of a random number that you didn't even
pick.
Huang, Kai March 23, 2023, 10:42 p.m. UTC | #18
On Thu, 2023-03-23 at 15:12 -0700, Hansen, Dave wrote:
> On 3/23/23 15:09, Huang, Kai wrote:
> > 1) In TDH.SYS.INIT, ask TDX module team to return TDX_RND_NO_ENTROPY instead of
> > TDX_SYS_BUSY when running out of entropy.
> > 
> > 2) In TDH.SYS.KEY.CONFIG, ask TDX module to return TDX_RND_NO_ENTROPY instead of
> > TDX_KEY_GENERATION_FAILED when running out of entropy.  Whether
> > TDX_KEY_GENERATION_FAILED should be still kept is  up to TDX module team
> > (because it looks running concurrent PCONFIGs is also related).
> > 
> > 3) Ask TDX module to always return TDX_RND_NO_ENTROPY in _ALL_ SEAMCALLs and
> > keep this behaviour for future TDX modules too.
> 
> Yes, that's all fine.
> 
> > 4) In the common seamcall(), retry on TDX_RND_NO_ENTROPY.
> > 
> > In terms of how many times to retry, I will use a fixed value for now, similar
> > to the kernel code below:
> > 
> > #define RDRAND_RETRY_LOOPS      10
> 
> Heck, you could even just use RDRAND_RETRY_LOOPS directly.  It's
> hard(er) to bikeshed your choice of a random number that you didn't even
> pick.

Yes I'll just include the header and use it.  Thanks.
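As an aside, the once-only, sticky-failure pattern that try_init_module_global() in the patch below implements can be modeled in single-threaded userspace C.  The names here are hypothetical, and locking is omitted; the kernel serializes callers with tdx_global_init_lock:

```c
/*
 * Model of try_init_module_global()'s sticky status: the first caller
 * runs the init exactly once, and its result (success or failure) is
 * recorded and returned to every later caller.
 */
#define INIT_DONE	(1U << 0)
#define INIT_FAILED	(1U << 1)

static unsigned int init_status;
static int init_calls;

/* Hypothetical init that always fails, standing in for TDH.SYS.INIT. */
static int fail_init(void)
{
	init_calls++;
	return -1;
}

static int do_init_once(int (*init_fn)(void))
{
	int ret;

	if (init_status & INIT_DONE)
		return (init_status & INIT_FAILED) ? -1 : 0;

	ret = init_fn();

	init_status = INIT_DONE;
	if (ret)
		init_status |= INIT_FAILED;
	return ret;
}
```

A failed init is never re-attempted: later callers see INIT_FAILED and get the error back without calling init_fn() again, which is exactly why a transient entropy failure must not be allowed to reach this sticky state.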
diff mbox series

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b489b5b9de5d..112a5b9bd5cd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,8 +102,12 @@  static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
+int tdx_cpu_enable(void);
+int tdx_enable(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_cpu_enable(void) { return -EINVAL; }
+static inline int tdx_enable(void)  { return -EINVAL; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b65b838f3b5d..29127cb70f51 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,10 @@ 
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/smp.h>
+#include <linux/cpu.h>
+#include <linux/spinlock.h>
+#include <linux/percpu-defs.h>
+#include <linux/mutex.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -22,6 +26,18 @@  static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+static unsigned int tdx_global_init_status;
+static DEFINE_SPINLOCK(tdx_global_init_lock);
+#define TDX_GLOBAL_INIT_DONE	_BITUL(0)
+#define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
+
+static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
+#define TDX_LP_INIT_DONE	_BITUL(0)
+#define TDX_LP_INIT_FAILED	_BITUL(1)
+
+static enum tdx_module_status_t tdx_module_status;
+static DEFINE_MUTEX(tdx_module_lock);
+
 /*
  * Use tdx_global_keyid to indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -159,3 +175,169 @@  static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	put_cpu();
 	return ret;
 }
+
+static int try_init_module_global(void)
+{
+	int ret;
+
+	/*
+	 * The TDX module global initialization only needs to be done
+	 * once on any cpu.
+	 */
+	spin_lock(&tdx_global_init_lock);
+
+	if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) {
+		ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ?
+			-EINVAL : 0;
+		goto out;
+	}
+
+	/* All '0's are just unused parameters. */
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+
+	tdx_global_init_status = TDX_GLOBAL_INIT_DONE;
+	if (ret)
+		tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED;
+out:
+	spin_unlock(&tdx_global_init_lock);
+
+	return ret;
+}
+
+/**
+ * tdx_cpu_enable - Enable TDX on local cpu
+ *
+ * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
+ * global initialization SEAMCALL if not done) on local cpu to make this
+ * cpu be ready to run any other SEAMCALLs.
+ *
+ * Note this function must be called when preemption is not possible
+ * (i.e. via SMP call or in per-cpu thread).  It is not IRQ safe either
+ * (i.e. cannot be called in per-cpu thread and via SMP call from remote
+ * cpu simultaneously).
+ *
+ * Return 0 on success, otherwise errors.
+ */
+int tdx_cpu_enable(void)
+{
+	unsigned int lp_status;
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -EINVAL;
+
+	lp_status = __this_cpu_read(tdx_lp_init_status);
+
+	/* Already done */
+	if (lp_status & TDX_LP_INIT_DONE)
+		return lp_status & TDX_LP_INIT_FAILED ? -EINVAL : 0;
+
+	/*
+	 * The TDX module global initialization is the very first step
+	 * to enable TDX.  Need to do it first (if hasn't been done)
+	 * before doing the per-cpu initialization.
+	 */
+	ret = try_init_module_global();
+
+	/*
+	 * If the module global initialization failed, there's no point
+	 * to do the per-cpu initialization.  Just mark it as done but
+	 * failed.
+	 */
+	if (ret)
+		goto update_status;
+
+	/* All '0's are just unused parameters */
+	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+
+update_status:
+	lp_status = TDX_LP_INIT_DONE;
+	if (ret)
+		lp_status |= TDX_LP_INIT_FAILED;
+
+	this_cpu_write(tdx_lp_init_status, lp_status);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_enable);
+
+static int init_tdx_module(void)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Get TDX module information and TDX-capable memory regions.
+	 *  - Build the list of TDX-usable memory regions.
+	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
+	 *    all TDX-usable memory regions.
+	 *  - Configure the TDMRs and the global KeyID to the TDX module.
+	 *  - Configure the global KeyID on all packages.
+	 *  - Initialize all TDMRs.
+	 *
+	 *  Return error before all steps are done.
+	 */
+	return -EINVAL;
+}
+
+static int __tdx_enable(void)
+{
+	int ret;
+
+	ret = init_tdx_module();
+	if (ret) {
+		pr_err("TDX module initialization failed (%d)\n", ret);
+		tdx_module_status = TDX_MODULE_ERROR;
+		/*
+		 * Just return one universal error code.
+		 * For now the caller cannot recover anyway.
+		 */
+		return -EINVAL;
+	}
+
+	pr_info("TDX module initialized.\n");
+	tdx_module_status = TDX_MODULE_INITIALIZED;
+
+	return 0;
+}
+
+/**
+ * tdx_enable - Enable TDX module to make it ready to run TDX guests
+ *
+ * This function assumes the caller has: 1) held read lock of CPU hotplug
+ * lock to prevent any new cpu from becoming online; 2) done both VMXON
+ * and tdx_cpu_enable() on all online cpus.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return 0 if TDX is enabled successfully, otherwise error.
+ */
+int tdx_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -EINVAL;
+
+	lockdep_assert_cpus_held();
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_enable();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		/* Already initialized, great, tell the caller. */
+		ret = 0;
+		break;
+	default:
+		/* Failed to initialize in the previous attempts */
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 48ad1a1ba737..4d6220e86ccf 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -4,6 +4,31 @@ 
 
 #include <linux/types.h>
 
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability.  The
+ * architectural definitions come first.
+ */
+
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
+
+/*
+ * Do not put any hardware-defined TDX structure representations below
+ * this comment!
+ */
+
+/* Kernel defined TDX module status during module initialization. */
+enum tdx_module_status_t {
+	TDX_MODULE_UNKNOWN,
+	TDX_MODULE_INITIALIZED,
+	TDX_MODULE_ERROR
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);