Message ID | 104d324cd68b12e14722ee5d85a660cccccd8892.1687784645.git.kai.huang@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | TDX host kernel support | expand |
On 6/26/23 7:12 AM, Kai Huang wrote: > To enable TDX the kernel needs to initialize TDX from two perspectives: > 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready > to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL > on one logical cpu before the kernel wants to make any other SEAMCALLs > on that cpu (including those involved during module initialization and > running TDX guests). > > The TDX module can be initialized only once in its lifetime. Instead > of always initializing it at boot time, this implementation chooses an > "on demand" approach to initialize TDX until there is a real need (e.g > when requested by KVM). This approach has below pros: > > 1) It avoids consuming the memory that must be allocated by kernel and > given to the TDX module as metadata (~1/256th of the TDX-usable memory), > and also saves the CPU cycles of initializing the TDX module (and the > metadata) when TDX is not used at all. > > 2) The TDX module design allows it to be updated while the system is > running. The update procedure shares quite a few steps with this "on > demand" initialization mechanism. The hope is that much of "on demand" > mechanism can be shared with a future "update" mechanism. A boot-time > TDX module implementation would not be able to share much code with the > update mechanism. > > 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM > code mucks with VMX enabling. If the TDX module were to be initialized > separately from KVM (like at boot), the boot code would need to be > taught how to muck with VMX enabling and KVM would need to be taught how > to cope with that. Making KVM itself responsible for TDX initialization > lets the rest of the kernel stay blissfully unaware of VMX. > > Similar to module initialization, also make the per-cpu initialization > "on demand" as it also depends on VMX being enabled. > > Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX > module and enable TDX on local cpu respectively. For now tdx_enable() > is a placeholder. The TODO list will be pared down as functionality is > added. > > Export both tdx_cpu_enable() and tdx_enable() for KVM use. > > In tdx_enable() use a state machine protected by mutex to make sure the > initialization will only be done once, as tdx_enable() can be called > multiple times (i.e. KVM module can be reloaded) and may be called > concurrently by other kernel components in the future. > > The per-cpu initialization on each cpu can only be done once during the > module's life time. Use a per-cpu variable to track its status to make > sure it is only done once in tdx_cpu_enable(). > > Also, a SEAMCALL to do TDX module global initialization must be done > once on any logical cpu before any per-cpu initialization SEAMCALL. Do > it inside tdx_cpu_enable() too (if hasn't been done). > > tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The > per-cpu initialization must be done before those SEAMCALLs are invoked > on some cpu. To keep things simple, in tdx_cpu_enable(), always do the > per-cpu initialization regardless of whether the TDX module has been > initialized or not. And in tdx_enable(), don't call tdx_cpu_enable() > but assume the caller has disabled CPU hotplug, done VMXON and > tdx_cpu_enable() on all online cpus before calling tdx_enable(). > > Signed-off-by: Kai Huang <kai.huang@intel.com> > --- > > v11 -> v12: > - Simplified TDX module global init and lp init status tracking (David). > - Added comment around try_init_module_global() for using > raw_spin_lock() (Dave). > - Added one sentence to changelog to explain why to expose tdx_enable() > and tdx_cpu_enable() (Dave). > - Simplifed comments around tdx_enable() and tdx_cpu_enable() to use > lockdep_assert_*() instead. (Dave) > - Removed redundent "TDX" in error message (Dave). > > v10 -> v11: > - Return -NODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off. > - Return the actual error code for tdx_enable() instead of -EINVAL. > - Added Isaku's Reviewed-by. > > v9 -> v10: > - Merged the patch to handle per-cpu initialization to this patch to > tell the story better. > - Changed how to handle the per-cpu initialization to only provide a > tdx_cpu_enable() function to let the user of TDX to do it when the > user wants to run TDX code on a certain cpu. > - Changed tdx_enable() to not call cpus_read_lock() explicitly, but > call lockdep_assert_cpus_held() to assume the caller has done that. > - Improved comments around tdx_enable() and tdx_cpu_enable(). > - Improved changelog to tell the story better accordingly. > > v8 -> v9: > - Removed detailed TODO list in the changelog (Dave). > - Added back steps to do module global initialization and per-cpu > initialization in the TODO list comment. > - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h > > v7 -> v8: > - Refined changelog (Dave). > - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). > - Add a "TODO list" comment in init_tdx_module() to list all steps of > initializing the TDX Module to tell the story (Dave). > - Made tdx_enable() unverisally return -EINVAL, and removed nonsense > comments (Dave). > - Simplified __tdx_enable() to only handle success or failure. > - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR > - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. > - Improved comments (Dave). > - Pointed out 'tdx_module_status' is software thing (Dave). > > v6 -> v7: > - No change. > > v5 -> v6: > - Added code to set status to TDX_MODULE_NONE if TDX module is not > loaded (Chao) > - Added Chao's Reviewed-by. > - Improved comments around cpus_read_lock(). > > - v3->v5 (no feedback on v4): > - Removed the check that SEAMRR and TDX KeyID have been detected on > all present cpus. > - Removed tdx_detect(). > - Added num_online_cpus() to MADT-enabled CPUs check within the CPU > hotplug lock and return early with error message. > - Improved dmesg printing for TDX module detection and initialization. > > > --- > arch/x86/include/asm/tdx.h | 4 + > arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++ > arch/x86/virt/vmx/tdx/tdx.h | 13 +++ > 3 files changed, 179 insertions(+) > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h > index 4dfe2e794411..d8226a50c58c 100644 > --- a/arch/x86/include/asm/tdx.h > +++ b/arch/x86/include/asm/tdx.h > @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, > > #ifdef CONFIG_INTEL_TDX_HOST > bool platform_tdx_enabled(void); > +int tdx_cpu_enable(void); > +int tdx_enable(void); > #else /* !CONFIG_INTEL_TDX_HOST */ > static inline bool platform_tdx_enabled(void) { return false; } > +static inline int tdx_cpu_enable(void) { return -ENODEV; } > +static inline int tdx_enable(void) { return -ENODEV; } > #endif /* CONFIG_INTEL_TDX_HOST */ > > #endif /* !__ASSEMBLY__ */ > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > index 141d12376c4d..29ca18f66d61 100644 > --- a/arch/x86/virt/vmx/tdx/tdx.c > +++ b/arch/x86/virt/vmx/tdx/tdx.c > @@ -13,6 +13,10 @@ > #include <linux/errno.h> > #include <linux/printk.h> > #include <linux/smp.h> > +#include <linux/cpu.h> > +#include <linux/spinlock.h> > +#include <linux/percpu-defs.h> > +#include <linux/mutex.h> > #include <asm/msr-index.h> > #include <asm/msr.h> > #include <asm/archrandom.h> > @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init; > static u32 tdx_guest_keyid_start __ro_after_init; > static u32 tdx_nr_guest_keyids __ro_after_init; > > +static bool tdx_global_initialized; > +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock); Why use raw_spin_lock()? > +static DEFINE_PER_CPU(bool, tdx_lp_initialized); > + > +static enum tdx_module_status_t tdx_module_status; > +static DEFINE_MUTEX(tdx_module_lock); I think you can add a single line comment about what states above variables tracks. But it is entirely up to you. > + > /* > * Wrapper of __seamcall() to convert SEAMCALL leaf function error code > * to kernel error code. @seamcall_ret and @out contain the SEAMCALL > @@ -74,6 +85,157 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, > } > } > > +/* > + * Do the module global initialization if not done yet. > + * It's always called with interrupts and preemption disabled. > + */ > +static int try_init_module_global(void) > +{ > + unsigned long flags; > + int ret; > + > + /* > + * The TDX module global initialization only needs to be done > + * once on any cpu. > + */ > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > + > + if (tdx_global_initialized) { > + ret = 0; > + goto out; > + } > + > + /* All '0's are just unused parameters. */ I have noticed that you add the above comment whenever you call seamcall() with 0 as parameters. Is this a ask from the maintainer? If not, I think you can skip it. Just explaining the parameters in seamcall function definition is good enough. > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); > + if (!ret) > + tdx_global_initialized = true; > +out: > + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); > + > + return ret; > +} > + > +/** > + * tdx_cpu_enable - Enable TDX on local cpu > + * > + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module > + * global initialization SEAMCALL if not done) on local cpu to make this > + * cpu be ready to run any other SEAMCALLs. > + * > + * Call this function with preemption disabled. > + * > + * Return 0 on success, otherwise errors. > + */ > +int tdx_cpu_enable(void) > +{ > + int ret; > + > + if (!platform_tdx_enabled()) > + return -ENODEV; > + > + lockdep_assert_preemption_disabled(); > + > + /* Already done */ > + if (__this_cpu_read(tdx_lp_initialized)) > + return 0; > + > + /* > + * The TDX module global initialization is the very first step > + * to enable TDX. Need to do it first (if hasn't been done) > + * before the per-cpu initialization. > + */ > + ret = try_init_module_global(); > + if (ret) > + return ret; > + > + /* All '0's are just unused parameters */ > + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL); > + if (ret) > + return ret; > + > + __this_cpu_write(tdx_lp_initialized, true); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > + > +static int init_tdx_module(void) > +{ > + /* > + * TODO: > + * > + * - Get TDX module information and TDX-capable memory regions. > + * - Build the list of TDX-usable memory regions. > + * - Construct a list of "TD Memory Regions" (TDMRs) to cover > + * all TDX-usable memory regions. > + * - Configure the TDMRs and the global KeyID to the TDX module. > + * - Configure the global KeyID on all packages. > + * - Initialize all TDMRs. > + * > + * Return error before all steps are done. > + */ > + return -EINVAL; > +} > + > +static int __tdx_enable(void) > +{ > + int ret; > + > + ret = init_tdx_module(); > + if (ret) { > + pr_err("module initialization failed (%d)\n", ret); > + tdx_module_status = TDX_MODULE_ERROR; > + return ret; > + } > + > + pr_info("module initialized.\n"); > + tdx_module_status = TDX_MODULE_INITIALIZED; > + > + return 0; > +} > + > +/** > + * tdx_enable - Enable TDX module to make it ready to run TDX guests > + * > + * This function assumes the caller has: 1) held read lock of CPU hotplug > + * lock to prevent any new cpu from becoming online; 2) done both VMXON > + * and tdx_cpu_enable() on all online cpus. > + * > + * This function can be called in parallel by multiple callers. > + * > + * Return 0 if TDX is enabled successfully, otherwise error. > + */ > +int tdx_enable(void) > +{ > + int ret; > + > + if (!platform_tdx_enabled()) > + return -ENODEV; > + > + lockdep_assert_cpus_held(); > + > + mutex_lock(&tdx_module_lock); > + > + switch (tdx_module_status) { > + case TDX_MODULE_UNKNOWN: > + ret = __tdx_enable(); > + break; > + case TDX_MODULE_INITIALIZED: > + /* Already initialized, great, tell the caller. */ > + ret = 0; > + break; > + default: > + /* Failed to initialize in the previous attempts */ > + ret = -EINVAL; > + break; > + } > + > + mutex_unlock(&tdx_module_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(tdx_enable); > + > static int __init record_keyid_partitioning(u32 *tdx_keyid_start, > u32 *nr_tdx_keyids) > { > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h > index 55dbb1b8c971..9fb46033c852 100644 > --- a/arch/x86/virt/vmx/tdx/tdx.h > +++ b/arch/x86/virt/vmx/tdx/tdx.h > @@ -16,11 +16,24 @@ > */ > #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL > > +/* > + * TDX module SEAMCALL leaf functions > + */ > +#define TDH_SYS_INIT 33 > +#define TDH_SYS_LP_INIT 35 > + > /* > * Do not put any hardware-defined TDX structure representations below > * this comment! > */ > > +/* Kernel defined TDX module status during module initialization. */ > +enum tdx_module_status_t { > + TDX_MODULE_UNKNOWN, > + TDX_MODULE_INITIALIZED, > + TDX_MODULE_ERROR > +}; > + > struct tdx_module_output; > u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, > struct tdx_module_output *out);
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > + /* > + * The TDX module global initialization only needs to be done > + * once on any cpu. > + */ > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); I don't understand how the comment justifies using raw spin lock.
On Tue, 2023-06-27 at 12:50 +0300, kirill.shutemov@linux.intel.com wrote: > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > + /* > > + * The TDX module global initialization only needs to be done > > + * once on any cpu. > > + */ > > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > > I don't understand how the comment justifies using raw spin lock. > This comment is for using lock in general. The reason to use raw_ version is because this function gets called in IRQ context, and for PREEMPT_RT kernel the normal spinlock is converted to sleeping lock. Dave suggested to comment on the function rather than comment on the raw_spin_lock directly, e.g., no other kernel code does that: https://lore.kernel.org/linux-mm/d2b3bc5e-1371-0c50-8ecb-64fc70917d42@intel.com/ So I commented the function in this version: +/* + * Do the module global initialization if not done yet. + * It's always called with interrupts and preemption disabled. + */
On Mon, 2023-06-26 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote: > > + /* All '0's are just unused parameters. */ > > I have noticed that you add the above comment whenever you call seamcall() > with > 0 as parameters. Is this a ask from the maintainer? If not, I think you can > skip > it. Just explaining the parameters in seamcall function definition is good > enough. Yes I followed maintainer (I didn't bother to find the exact link this time, though). I think in this way we don't need to go to TDX module spec to check whether 0 has meaning in each SEAMCALL, especially in code review. I kinda agree having them in multiple places is a little bit noisy, but I don't have a better way.
On Tue, Jun 27, 2023 at 10:34:04AM +0000, Huang, Kai wrote: > On Tue, 2023-06-27 at 12:50 +0300, kirill.shutemov@linux.intel.com wrote: > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > + /* > > > + * The TDX module global initialization only needs to be done > > > + * once on any cpu. > > > + */ > > > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > > > > I don't understand how the comment justifies using raw spin lock. > > > > This comment is for using lock in general. The reason to use raw_ version is > because this function gets called in IRQ context, and for PREEMPT_RT kernel the > normal spinlock is converted to sleeping lock. Sorry, but this still doesn't explain anything. Why converting to sleeping lock here is wrong? There are plenty spin_lock_irqsave() users all over the kernel that are fine to be converted to sleeping lock on RT kernel. Why this use-case is special enough to justify raw_? From the documentation: raw_spinlock_t is a strict spinning lock implementation in all kernels, including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical core code, low-level interrupt handling and places where disabling preemption or interrupts is required, for example, to safely access hardware state. raw_spinlock_t can sometimes also be used when the critical section is tiny, thus avoiding RT-mutex overhead. How does it apply here? > Dave suggested to comment on the function rather than comment on the > raw_spin_lock directly, e.g., no other kernel code does that: > > https://lore.kernel.org/linux-mm/d2b3bc5e-1371-0c50-8ecb-64fc70917d42@intel.com/ > > So I commented the function in this version: > > +/* > + * Do the module global initialization if not done yet. > + * It's always called with interrupts and preemption disabled. > + */ If interrupts are always disabled why do you need _irqsave()?
> > > > +/* > > + * Do the module global initialization if not done yet. > > + * It's always called with interrupts and preemption disabled. > > + */ > > If interrupts are always disabled why do you need _irqsave()? > I'll remove the _irqsave(). AFAICT Isaku preferred this for additional security, but this is not necessary.
On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote: > > > > > > +/* > > > + * Do the module global initialization if not done yet. > > > + * It's always called with interrupts and preemption disabled. > > > + */ > > > > If interrupts are always disabled why do you need _irqsave()? > > > > I'll remove the _irqsave(). > > AFAICT Isaku preferred this for additional security, but this is not > necessary. > > Damn. I think we can change the comment to say this function is called with preemption being disabled, but _can_ be called with interrupt disabled. And we keep using the _irqsave() version. /* * Do the module global initialization if not done yet. It's always * called with preemption disabled and can be called with interrupts * disabled. */ This allows a use case that the caller simply wants to call some SEAMCALL on local cpu, e.g., IOMMU code may just use below to get some TDX-IO information: preempt_disable(); vmxon(); tdx_cpu_enable(); SEAMCALL; vmxoff(); preempt_enable(); Are you OK with this?
On Tue, Jun 27, 2023 at 10:37:58PM +0000, "Huang, Kai" <kai.huang@intel.com> wrote: > > > > > > +/* > > > + * Do the module global initialization if not done yet. > > > + * It's always called with interrupts and preemption disabled. > > > + */ > > > > If interrupts are always disabled why do you need _irqsave()? > > > > I'll remove the _irqsave(). > > AFAICT Isaku preferred this for additional security, but this is not necessary. It's because the lockdep complains. Anyway, it's save to remove _irqsave as discussed with you.
On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote: > On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote: > > > > > > > > +/* > > > > + * Do the module global initialization if not done yet. > > > > + * It's always called with interrupts and preemption disabled. > > > > + */ > > > > > > If interrupts are always disabled why do you need _irqsave()? > > > > > > > I'll remove the _irqsave(). > > > > AFAICT Isaku preferred this for additional security, but this is not > > necessary. > > > > > > Damn. I think we can change the comment to say this function is called with > preemption being disabled, but _can_ be called with interrupt disabled. And we > keep using the _irqsave() version. > > /* > * Do the module global initialization if not done yet. It's always > * called with preemption disabled and can be called with interrupts > * disabled. > */ > > This allows a use case that the caller simply wants to call some SEAMCALL on > local cpu, e.g., IOMMU code may just use below to get some TDX-IO information: > > preempt_disable(); > vmxon(); > tdx_cpu_enable(); > SEAMCALL; > vmxoff(); > preempt_enable(); > > Are you OK with this? Is it hypothetical use-case? If so, I would rather keep it simple for now and adjust in the future if needed.
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > +static int try_init_module_global(void) > +{ > + unsigned long flags; > + int ret; > + > + /* > + * The TDX module global initialization only needs to be done > + * once on any cpu. > + */ > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > + > + if (tdx_global_initialized) { > + ret = 0; > + goto out; > + } > + > + /* All '0's are just unused parameters. */ > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); > + if (!ret) > + tdx_global_initialized = true; > +out: > + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); > + > + return ret; > +} How long does that TDX_SYS_INIT take and why is a raw_spinlock with IRQs disabled the right way to serialize this?
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > +/* > + * Do the module global initialization if not done yet. > + * It's always called with interrupts and preemption disabled. > + */ > +static int try_init_module_global(void) > +{ > + unsigned long flags; > + int ret; > + > + /* > + * The TDX module global initialization only needs to be done > + * once on any cpu. > + */ > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > + > + if (tdx_global_initialized) { > + ret = 0; > + goto out; > + } > + > + /* All '0's are just unused parameters. */ > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); > + if (!ret) > + tdx_global_initialized = true; > +out: > + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); > + > + return ret; > +} > + > +/** > + * tdx_cpu_enable - Enable TDX on local cpu > + * > + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module > + * global initialization SEAMCALL if not done) on local cpu to make this > + * cpu be ready to run any other SEAMCALLs. > + * > + * Call this function with preemption disabled. > + * > + * Return 0 on success, otherwise errors. > + */ > +int tdx_cpu_enable(void) > +{ > + int ret; > + > + if (!platform_tdx_enabled()) > + return -ENODEV; > + > + lockdep_assert_preemption_disabled(); > + > + /* Already done */ > + if (__this_cpu_read(tdx_lp_initialized)) > + return 0; > + > + /* > + * The TDX module global initialization is the very first step > + * to enable TDX. Need to do it first (if hasn't been done) > + * before the per-cpu initialization. > + */ > + ret = try_init_module_global(); > + if (ret) > + return ret; > + > + /* All '0's are just unused parameters */ > + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL); > + if (ret) > + return ret; And here you do *NOT* have IRQs disabled... so an IRQ can come in here and do the above again. I suspect that's a completely insane thing to have happen, but the way the code is written does not tell me this and might even suggest I should worry about it, per the above thing actually disabling IRQs. > + > + __this_cpu_write(tdx_lp_initialized, true); > + > + return 0; > +}
On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
I can't find a single caller of this.. why is this exported?
On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote: > On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote: > > > > > > > > +/* > > > > + * Do the module global initialization if not done yet. > > > > + * It's always called with interrupts and preemption disabled. > > > > + */ > > > > > > If interrupts are always disabled why do you need _irqsave()? > > > > > > > I'll remove the _irqsave(). > > > > AFAICT Isaku preferred this for additional security, but this is not > > necessary. > > > > > > Damn. I think we can change the comment to say this function is called with > preemption being disabled, but _can_ be called with interrupt disabled. And we > keep using the _irqsave() version. > > /* > * Do the module global initialization if not done yet. It's always > * called with preemption disabled and can be called with interrupts > * disabled. > */ That's still not explaining *why*, what you want to say is: Can be called locally or through an IPI function call.
On Wed, 2023-06-28 at 15:04 +0200, Peter Zijlstra wrote: > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > +static int try_init_module_global(void) > > +{ > > + unsigned long flags; > > + int ret; > > + > > + /* > > + * The TDX module global initialization only needs to be done > > + * once on any cpu. > > + */ > > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > > + > > + if (tdx_global_initialized) { > > + ret = 0; > > + goto out; > > + } > > + > > + /* All '0's are just unused parameters. */ > > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); > > + if (!ret) > > + tdx_global_initialized = true; > > +out: > > + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); > > + > > + return ret; > > +} > > How long does that TDX_SYS_INIT take and why is a raw_spinlock with IRQs > disabled the right way to serialize this? The spec says it doesn't have a latency requirement, so theoretically it could be long. SEAMCALL is a VMEXIT so it would at least cost thousands of cycles. If raw_spinlock isn't desired, I think I can introduce another function to do this and let the caller to call it before calling tdx_cpu_enable(). E.g., we can have below functions: 1) tdx_global_init() -> TDH_SYS_INIT 2) tdx_cpu_init() -> TDH_SYS_LP_INIT 3) tdx_enable() -> actual module initialization How does this sound?
On Wed, 2023-06-28 at 15:08 +0200, Peter Zijlstra wrote: > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > +/* > > + * Do the module global initialization if not done yet. > > + * It's always called with interrupts and preemption disabled. > > + */ > > +static int try_init_module_global(void) > > +{ > > + unsigned long flags; > > + int ret; > > + > > + /* > > + * The TDX module global initialization only needs to be done > > + * once on any cpu. > > + */ > > + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); > > + > > + if (tdx_global_initialized) { > > + ret = 0; > > + goto out; > > + } > > + > > + /* All '0's are just unused parameters. */ > > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); > > + if (!ret) > > + tdx_global_initialized = true; > > +out: > > + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); > > + > > + return ret; > > +} > > + > > +/** > > + * tdx_cpu_enable - Enable TDX on local cpu > > + * > > + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module > > + * global initialization SEAMCALL if not done) on local cpu to make this > > + * cpu be ready to run any other SEAMCALLs. > > + * > > + * Call this function with preemption disabled. > > + * > > + * Return 0 on success, otherwise errors. > > + */ > > +int tdx_cpu_enable(void) > > +{ > > + int ret; > > + > > + if (!platform_tdx_enabled()) > > + return -ENODEV; > > + > > + lockdep_assert_preemption_disabled(); > > + > > + /* Already done */ > > + if (__this_cpu_read(tdx_lp_initialized)) > > + return 0; > > + > > + /* > > + * The TDX module global initialization is the very first step > > + * to enable TDX. Need to do it first (if hasn't been done) > > + * before the per-cpu initialization. > > + */ > > + ret = try_init_module_global(); > > + if (ret) > > + return ret; > > + > > + /* All '0's are just unused parameters */ > > + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL); > > + if (ret) > > + return ret; > > And here you do *NOT* have IRQs disabled... so an IRQ can come in here > and do the above again. > > I suspect that's a completely insane thing to have happen, but the way > the code is written does not tell me this and might even suggest I > should worry about it, per the above thing actually disabling IRQs. > I can change lockdep_assert_preemption_disabled() to lockdep_assert_irqs_disabled(), making this function only being called from IPI. As Kirill also suggested we can do this way as for now KVM is the only user of this function and it enables hardware for all cpus via IPI. >
On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote: > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > > I can't find a single caller of this.. why is this exported? It's for KVM TDX patch to use, which isn't in this series. I'll remove the export. KVM TDX series can export it.
On Wed, 2023-06-28 at 15:35 +0200, Peter Zijlstra wrote: > On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote: > > On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote: > > > > > > > > > > +/* > > > > > + * Do the module global initialization if not done yet. > > > > > + * It's always called with interrupts and preemption disabled. > > > > > + */ > > > > > > > > If interrupts are always disabled why do you need _irqsave()? > > > > > > > > > > I'll remove the _irqsave(). > > > > > > AFAICT Isaku preferred this for additional security, but this is not > > > necessary. > > > > > > > > > > Damn. I think we can change the comment to say this function is called with > > preemption being disabled, but _can_ be called with interrupt disabled. And we > > keep using the _irqsave() version. > > > > /* > > * Do the module global initialization if not done yet. It's always > > * called with preemption disabled and can be called with interrupts > > * disabled. > > */ > > That's still not explaining *why*, what you want to say is: > > Can be called locally or through an IPI function call. > Thanks. As in another reply, if using spinlock is OK, then I think we can say it will be called either locally or through an IPI function call. Otherwise, we do via a new separate function tdx_global_init() and no lock is needed in that function. The caller should call it properly.
On 26.06.23 16:12, Kai Huang wrote: > To enable TDX the kernel needs to initialize TDX from two perspectives: > 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready > to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL > on one logical cpu before the kernel wants to make any other SEAMCALLs > on that cpu (including those involved during module initialization and > running TDX guests). > > The TDX module can be initialized only once in its lifetime. Instead > of always initializing it at boot time, this implementation chooses an > "on demand" approach to initialize TDX until there is a real need (e.g > when requested by KVM). This approach has below pros: > > 1) It avoids consuming the memory that must be allocated by kernel and > given to the TDX module as metadata (~1/256th of the TDX-usable memory), > and also saves the CPU cycles of initializing the TDX module (and the > metadata) when TDX is not used at all. > > 2) The TDX module design allows it to be updated while the system is > running. The update procedure shares quite a few steps with this "on > demand" initialization mechanism. The hope is that much of "on demand" > mechanism can be shared with a future "update" mechanism. A boot-time > TDX module implementation would not be able to share much code with the > update mechanism. > > 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM > code mucks with VMX enabling. If the TDX module were to be initialized > separately from KVM (like at boot), the boot code would need to be > taught how to muck with VMX enabling and KVM would need to be taught how > to cope with that. Making KVM itself responsible for TDX initialization > lets the rest of the kernel stay blissfully unaware of VMX. > > Similar to module initialization, also make the per-cpu initialization > "on demand" as it also depends on VMX being enabled. > > Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX > module and enable TDX on local cpu respectively. For now tdx_enable() > is a placeholder. The TODO list will be pared down as functionality is > added. > > Export both tdx_cpu_enable() and tdx_enable() for KVM use. > > In tdx_enable() use a state machine protected by mutex to make sure the > initialization will only be done once, as tdx_enable() can be called > multiple times (i.e. KVM module can be reloaded) and may be called > concurrently by other kernel components in the future. > > The per-cpu initialization on each cpu can only be done once during the > module's life time. Use a per-cpu variable to track its status to make > sure it is only done once in tdx_cpu_enable(). > > Also, a SEAMCALL to do TDX module global initialization must be done > once on any logical cpu before any per-cpu initialization SEAMCALL. Do > it inside tdx_cpu_enable() too (if hasn't been done). > > tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The > per-cpu initialization must be done before those SEAMCALLs are invoked > on some cpu. To keep things simple, in tdx_cpu_enable(), always do the > per-cpu initialization regardless of whether the TDX module has been > initialized or not. And in tdx_enable(), don't call tdx_cpu_enable() > but assume the caller has disabled CPU hotplug, done VMXON and > tdx_cpu_enable() on all online cpus before calling tdx_enable(). > > Signed-off-by: Kai Huang <kai.huang@intel.com> > --- > > v11 -> v12: > - Simplified TDX module global init and lp init status tracking (David). > - Added comment around try_init_module_global() for using > raw_spin_lock() (Dave). > - Added one sentence to changelog to explain why to expose tdx_enable() > and tdx_cpu_enable() (Dave). > - Simplifed comments around tdx_enable() and tdx_cpu_enable() to use > lockdep_assert_*() instead. (Dave) > - Removed redundent "TDX" in error message (Dave). > > v10 -> v11: > - Return -NODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off. > - Return the actual error code for tdx_enable() instead of -EINVAL. > - Added Isaku's Reviewed-by. > > v9 -> v10: > - Merged the patch to handle per-cpu initialization to this patch to > tell the story better. > - Changed how to handle the per-cpu initialization to only provide a > tdx_cpu_enable() function to let the user of TDX to do it when the > user wants to run TDX code on a certain cpu. > - Changed tdx_enable() to not call cpus_read_lock() explicitly, but > call lockdep_assert_cpus_held() to assume the caller has done that. > - Improved comments around tdx_enable() and tdx_cpu_enable(). > - Improved changelog to tell the story better accordingly. > > v8 -> v9: > - Removed detailed TODO list in the changelog (Dave). > - Added back steps to do module global initialization and per-cpu > initialization in the TODO list comment. > - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h > > v7 -> v8: > - Refined changelog (Dave). > - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). > - Add a "TODO list" comment in init_tdx_module() to list all steps of > initializing the TDX Module to tell the story (Dave). > - Made tdx_enable() unverisally return -EINVAL, and removed nonsense > comments (Dave). > - Simplified __tdx_enable() to only handle success or failure. > - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR > - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. > - Improved comments (Dave). > - Pointed out 'tdx_module_status' is software thing (Dave). > > v6 -> v7: > - No change. > > v5 -> v6: > - Added code to set status to TDX_MODULE_NONE if TDX module is not > loaded (Chao) > - Added Chao's Reviewed-by. > - Improved comments around cpus_read_lock(). > > - v3->v5 (no feedback on v4): > - Removed the check that SEAMRR and TDX KeyID have been detected on > all present cpus. > - Removed tdx_detect(). > - Added num_online_cpus() to MADT-enabled CPUs check within the CPU > hotplug lock and return early with error message. > - Improved dmesg printing for TDX module detection and initialization. > > > --- > arch/x86/include/asm/tdx.h | 4 + > arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++ > arch/x86/virt/vmx/tdx/tdx.h | 13 +++ > 3 files changed, 179 insertions(+) > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h > index 4dfe2e794411..d8226a50c58c 100644 > --- a/arch/x86/include/asm/tdx.h > +++ b/arch/x86/include/asm/tdx.h > @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, > > #ifdef CONFIG_INTEL_TDX_HOST > bool platform_tdx_enabled(void); > +int tdx_cpu_enable(void); > +int tdx_enable(void); > #else /* !CONFIG_INTEL_TDX_HOST */ > static inline bool platform_tdx_enabled(void) { return false; } > +static inline int tdx_cpu_enable(void) { return -ENODEV; } > +static inline int tdx_enable(void) { return -ENODEV; } > #endif /* CONFIG_INTEL_TDX_HOST */ > > #endif /* !__ASSEMBLY__ */ > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > index 141d12376c4d..29ca18f66d61 100644 > --- a/arch/x86/virt/vmx/tdx/tdx.c > +++ b/arch/x86/virt/vmx/tdx/tdx.c > @@ -13,6 +13,10 @@ > #include <linux/errno.h> > #include <linux/printk.h> > #include <linux/smp.h> > +#include <linux/cpu.h> > +#include <linux/spinlock.h> > +#include <linux/percpu-defs.h> > +#include <linux/mutex.h> > #include <asm/msr-index.h> > #include <asm/msr.h> > #include <asm/archrandom.h> > @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init; > static u32 tdx_guest_keyid_start __ro_after_init; > static u32 tdx_nr_guest_keyids __ro_after_init; > > +static bool tdx_global_initialized; > +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock); > +static DEFINE_PER_CPU(bool, tdx_lp_initialized); > + > +static enum tdx_module_status_t tdx_module_status; Why can't you switch to a simple bool here as well? It's either initialized or uninitialized. If uninitialized and you get an error, leave it uninitialized. The next caller will try again and fail again.
> > +static enum tdx_module_status_t tdx_module_status; > > Why can't you switch to a simple bool here as well? > > It's either initialized or uninitialized. If uninitialized and you get > an error, leave it uninitialized. The next caller will try again and > fail again. > We can, but in this case there might be message printed in each module initialization call. Let's say TDH.SYS.INFO is successful but the later TDH.SYS.CONFIG fails. In this case, each initialization call will print out TDX module info and CMR info. I think only allow initialization to be done once would be better in this case. Apart from the message printing, it's OK to just use a simple bool.
On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote: > > Can be called locally or through an IPI function call. > > > > Thanks. As in another reply, if using spinlock is OK, then I think we can say > it will be called either locally or through an IPI function call. Otherwise, we > do via a new separate function tdx_global_init() and no lock is needed in that > function. The caller should call it properly. IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's actual need for that, but the code as presented didn't -- in comments or otherwise -- make it clear why it was as it was. TDX not specifying time constraints on the various TD/SEAM-CALLs is ofcourse sad, but alas.
On Thu, Jun 29, 2023 at 12:00:44AM +0000, Huang, Kai wrote: > The spec says it doesn't have a latency requirement, so theoretically it could > be long. SEAMCALL is a VMEXIT so it would at least cost thousands of cycles. :-( > If raw_spinlock isn't desired, I think I can introduce another function to do > this and let the caller to call it before calling tdx_cpu_enable(). E.g., we > can have below functions: > > 1) tdx_global_init() -> TDH_SYS_INIT > 2) tdx_cpu_init() -> TDH_SYS_LP_INIT > 3) tdx_enable() -> actual module initialization > > How does this sound? Ah, wait, I hadn't had enough wake-up juice, it's tdx_global_init() that did the raw_spinlock_t, but that isn't the IPI thing. Then perhaps just use a mutex to serialize things?
On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote: > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote: > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > > > > I can't find a single caller of this.. why is this exported? > > It's for KVM TDX patch to use, which isn't in this series. > > I'll remove the export. KVM TDX series can export it. Fair enough; where will the KVM TDX series call this? Earlier there was talk about doing it at kvm module load time -- but I objected (and still do object) to that. What's the current plan?
On Fri, 2023-06-30 at 11:25 +0200, Peter Zijlstra wrote: > On Thu, Jun 29, 2023 at 12:00:44AM +0000, Huang, Kai wrote: > > > The spec says it doesn't have a latency requirement, so theoretically it could > > be long. SEAMCALL is a VMEXIT so it would at least cost thousands of cycles. > > :-( > > > If raw_spinlock isn't desired, I think I can introduce another function to do > > this and let the caller to call it before calling tdx_cpu_enable(). E.g., we > > can have below functions: > > > > 1) tdx_global_init() -> TDH_SYS_INIT > > 2) tdx_cpu_init() -> TDH_SYS_LP_INIT > > 3) tdx_enable() -> actual module initialization > > > > How does this sound? > > Ah, wait, I hadn't had enough wake-up juice, it's tdx_global_init() that > did the raw_spinlock_t, but that isn't the IPI thing. > > Then perhaps just use a mutex to serialize things? > In the current code yes TDH_SYS_INIT is protected by raw_spinlock_t, because it is done in tdx_cpu_enable(). I thought this makes the caller (KVM)'s life easier as it doesn't have to call an additional tdx_global_init(). If we put TDH_SYS_INIT to an additional tdx_global_init(), then we are essentially asking the caller to guarantee it must be called before calling any tdx_cpu_enable() (or tdx_cpu_init() for better naming). But in this case we don't need the raw_spinlock anymore because it's caller's responsibility now. They both are not protected by the TDX module initialization mutex, only tdx_enable() is. The caller (KVM) is supposed to call tdx_cpu_enable() for all online cpus via IPI function call before calling tdx_enable(). So if using raw_spinlock_t around TDH_SYS_INIT is a concern, then we can go with the dedicated tdx_global_init() function option. Hope I've explained this clearly.
On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote: > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote: > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote: > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > > > > > > I can't find a single caller of this.. why is this exported? > > > > It's for KVM TDX patch to use, which isn't in this series. > > > > I'll remove the export. KVM TDX series can export it. > > Fair enough; where will the KVM TDX series call this? Earlier there was > talk about doing it at kvm module load time -- but I objected (and still > do object) to that. > > What's the current plan? > The direction is still doing it during module load (not my series anyway). But this can be a separate discussion with KVM maintainers involved. I understand you have concern that you don't want to have the memory & cpu time wasted on enabling TDX by default. For that we can have a kernel command line to disable TDX once for all (we can even make it default). It's just not in this initial TDX support series but I'll send one once this initial support is done, as mentioned in the cover letter of the previous version (sadly I removed this paragraph for the sake of making the cover letter shorter): " Also, the patch to add the new kernel comline tdx="force" isn't included in this initial version, as Dave suggested it isn't mandatory. But I will add one once this initial version gets merged. " Also, KVM will have a module parameter 'enable_tdx'. I am hoping this could reduce your concern too.
On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote: > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote: > > > > Can be called locally or through an IPI function call. > > > > > > > Thanks. As in another reply, if using spinlock is OK, then I think we can say > > it will be called either locally or through an IPI function call. Otherwise, we > > do via a new separate function tdx_global_init() and no lock is needed in that > > function. The caller should call it properly. > > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's > actual need for that, but the code as presented didn't -- in comments or > otherwise -- make it clear why it was as it was. There's no hard requirement as I replied in another email. Presumably you prefer the option to have a dedicated tdx_global_init() so we can avoid the raw_spinlock_t? Thanks. > > TDX not specifying time constraints on the various TD/SEAM-CALLs is > ofcourse sad, but alas. Agreed.
On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote: > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote: > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote: > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote: > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > > > > > > > > I can't find a single caller of this.. why is this exported? > > > > > > It's for KVM TDX patch to use, which isn't in this series. > > > > > > I'll remove the export. KVM TDX series can export it. > > > > Fair enough; where will the KVM TDX series call this? Earlier there was > > talk about doing it at kvm module load time -- but I objected (and still > > do object) to that. > > > > What's the current plan? > > > > The direction is still doing it during module load (not my series anyway). But > this can be a separate discussion with KVM maintainers involved. They all on Cc afaict. > I understand you have concern that you don't want to have the memory & cpu time > wasted on enabling TDX by default. For that we can have a kernel command line > to disable TDX once for all (we can even make it default). That's insane, I don't want to totally disable it. I want it done at guard creation. Do the whole TDX setup the moment you actually create a TDX guast. Totally killing TDX is stupid, just about as stupid as doing it on module load (which equates to always doing it). > Also, KVM will have a module parameter 'enable_tdx'. I am hoping this could > reduce your concern too. I don't get this obsession with doing at module load time :/
On Fri, Jun 30, 2023 at 10:09:08AM +0000, "Huang, Kai" <kai.huang@intel.com> wrote: > On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote: > > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote: > > > > > > Can be called locally or through an IPI function call. > > > > > > > > > > Thanks. As in another reply, if using spinlock is OK, then I think we can say > > > it will be called either locally or through an IPI function call. Otherwise, we > > > do via a new separate function tdx_global_init() and no lock is needed in that > > > function. The caller should call it properly. > > > > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's > > actual need for that, but the code as presented didn't -- in comments or > > otherwise -- make it clear why it was as it was. > > There's no hard requirement as I replied in another email. > > Presumably you prefer the option to have a dedicated tdx_global_init() so we can > avoid the raw_spinlock_t? TDX KVM calls tdx_cpu_enable() in IPI context as KVM hardware_setup() callback. tdx_cpu_enable() calls tdx_global_init().
On Fri, Jun 30, 2023 at 08:30:20PM +0200, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote: > > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote: > > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote: > > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote: > > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > > > > > > > > > > I can't find a single caller of this.. why is this exported? > > > > > > > > It's for KVM TDX patch to use, which isn't in this series. > > > > > > > > I'll remove the export. KVM TDX series can export it. > > > > > > Fair enough; where will the KVM TDX series call this? Earlier there was > > > talk about doing it at kvm module load time -- but I objected (and still > > > do object) to that. > > > > > > What's the current plan? > > > > > > > The direction is still doing it during module load (not my series anyway). But > > this can be a separate discussion with KVM maintainers involved. > > They all on Cc afaict. > > > I understand you have concern that you don't want to have the memory & cpu time > > wasted on enabling TDX by default. For that we can have a kernel command line > > to disable TDX once for all (we can even make it default). > > That's insane, I don't want to totally disable it. I want it done at > guard creation. Do the whole TDX setup the moment you actually create a > TDX guast. > > Totally killing TDX is stupid, just about as stupid as doing it on > module load (which equates to always doing it). > > > Also, KVM will have a module parameter 'enable_tdx'. I am hoping this could > > reduce your concern too. > > I don't get this obsession with doing at module load time :/ The KVM maintainers prefer the initialization on kvm_intel.ko loading time. [1] I can change enable_tdx parameter for kvm_intel.ko instead of boolean. Something like enable_tdx ondemand: on-demand initialization when creating the first TDX guest onload: initialize TDX module when loading kvm_intel.ko disable: disable TDX support [1] https://lore.kernel.org/lkml/YkTvw5OXTTFf7j4y@google.com/
On Fri, Jun 30, 2023, Isaku Yamahata wrote: > On Fri, Jun 30, 2023 at 08:30:20PM +0200, > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote: > > > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote: > > > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote: > > > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote: > > > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote: > > > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable); > > > > > > > > > > > > I can't find a single caller of this.. why is this exported? > > > > > > > > > > It's for KVM TDX patch to use, which isn't in this series. > > > > > > > > > > I'll remove the export. KVM TDX series can export it. > > > > > > > > Fair enough; where will the KVM TDX series call this? Earlier there was > > > > talk about doing it at kvm module load time -- but I objected (and still > > > > do object) to that. > > > > > > > > What's the current plan? > > > > > > > > > > The direction is still doing it during module load (not my series anyway). But > > > this can be a separate discussion with KVM maintainers involved. > > > > They all on Cc afaict. > > > > > I understand you have concern that you don't want to have the memory & cpu time > > > wasted on enabling TDX by default. For that we can have a kernel command line > > > to disable TDX once for all (we can even make it default). > > > > That's insane, I don't want to totally disable it. I want it done at > > guard creation. Do the whole TDX setup the moment you actually create a > > TDX guast. > > > > Totally killing TDX is stupid, I dunno about that, *totally* killing TDX would make my life a lot simpler ;-) > > just about as stupid as doing it on module load (which equates to always > > doing it). > > > > > Also, KVM will have a module parameter 'enable_tdx'. I am hoping this could > > > reduce your concern too. > > > > I don't get this obsession with doing at module load time :/ Waiting until userspace attempts to create the first TDX guest adds complexity and limits what KVM can do to harden itself. Currently, all feature support in KVM is effectively frozen at module load. E.g. most of the setup code is contained in __init functions, many module-scoped variables are effectively RO after init (though they can't be marked as such until we smush kvm-intel.ko and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan). All of those patterns would get tossed aside if KVM waits until userspace attempts to create the first guest. The userspace experience would also be poor, as KVM can't know whether or TDX is actually supported until the TDX module is fully loaded and configured. KVM waits until VM creation to enable VMX, but that's pure enabling and more or less guaranteed to succeed, e.g. will succeed barring hardware failures, software bugs, or *severe* memory pressure. There are also latency and noisy neighbor concerns, e.g. we *really* don't want to end up in a situation where creating a TDX guest for a customer can observe arbitrary latency *and* potentially be disruptive to VMs already running on the host. Userspace can workaround the second and third issues by spawning a dummy TDX guest as early as possible, but that adds complexity to userspace, especially if there's any desire for it to be race free, e.g. with respect to reporting system capabilities to the control plan. On the flip side, limited hardware availability (unless Intel has changed its tune) and the amount of enabling that's required in BIOS and whatnot makes it highly unlikely that random Linux users are going to unknowingly boot with TDX enabled. That said, if this is a sticking point, let's just make enable_tdx off by default, i.e. force userspace to opt-in. Deployments that *know* they may want to schedule TDX VMs on the host can simply force the module param. And for everyone else, since KVM is typically configured as a module by distros, KVM can be unloaded and reload if the user realizes they want TDX well after the system is up and running. > The KVM maintainers prefer the initialization on kvm_intel.ko loading time. [1] You can say "Sean", I'm not the bogeyman :-) > I can change enable_tdx parameter for kvm_intel.ko instead of boolean. > Something like > > enable_tdx > ondemand: on-demand initialization when creating the first TDX guest > onload: initialize TDX module when loading kvm_intel.ko No, that's the most complex path and makes no one happy. > disable: disable TDX support
Sean Christopherson wrote: > On Fri, Jun 30, 2023, Isaku Yamahata wrote: > > On Fri, Jun 30, 2023 at 08:30:20PM +0200, > > Peter Zijlstra <peterz@infradead.org> wrote: [..] > On the flip side, limited hardware availability (unless Intel has changed its > tune) and the amount of enabling that's required in BIOS and whatnot makes it > highly unlikely that random Linux users are going to unknowingly boot with TDX > enabled. > > That said, if this is a sticking point, let's just make enable_tdx off by default, > i.e. force userspace to opt-in. Deployments that *know* they may want to schedule > TDX VMs on the host can simply force the module param. And for everyone else, > since KVM is typically configured as a module by distros, KVM can be unloaded and > reload if the user realizes they want TDX well after the system is up and running. Another potential option that also avoids the concern that module parameters are unwieldy [1] is to have kvm_intel have a soft-dependency on something like a kvm_intel_tdx module. That affords both a BIOS *and* userspace policy opt-out where kvm_intel.ko can check that kvm_intel_tdx.ko is present at init time, or proceed with tdx disabled. [1]: http://lore.kernel.org/r/Y7z99mf1M5edxV4A@kroah.com
On 6/30/23 14:24, Sean Christopherson wrote: > That said, if this is a sticking point, let's just make enable_tdx off by default, > i.e. force userspace to opt-in. Deployments that *know* they may want to schedule > TDX VMs on the host can simply force the module param. And for everyone else, > since KVM is typically configured as a module by distros, KVM can be unloaded and > reload if the user realizes they want TDX well after the system is up and running. Let's just default it to off for now. If we default it to on, we risk inflicting TDX on existing KVM users that don't want it (by surprise). If it turns out to _that_ big of an inconvenience, we'd have to reverse course and change the default from on=>off. *That* would break existing TDX users when we do it. Gnashing of teeth all around would ensue. On the other hand, if we force TDX users to turn it on from day one, we don't surprise _anyone_ that wasn't asking for it. The only teeth gnashing is for the TDX folks. We could change _that_ down the line if the TDX users get too rowdy. But I'd much rather err on the side of inconveniencing the guys that know they want the snazzy new hardware than those who just want to run plain old VMs. I honestly don't care all that much either way. There's an escape hatch at runtime (reload kvm_intel.ko) no matter what we do.
On Fri, 2023-06-30 at 10:09 +0000, Huang, Kai wrote: > On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote: > > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote: > > > > > > Can be called locally or through an IPI function call. > > > > > > > > > > Thanks. As in another reply, if using spinlock is OK, then I think we can say > > > it will be called either locally or through an IPI function call. Otherwise, we > > > do via a new separate function tdx_global_init() and no lock is needed in that > > > function. The caller should call it properly. > > > > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's > > actual need for that, but the code as presented didn't -- in comments or > > otherwise -- make it clear why it was as it was. > > There's no hard requirement as I replied in another email. > > Presumably you prefer the option to have a dedicated tdx_global_init() so we can > avoid the raw_spinlock_t? > Hmm... didn't have enough coffee. Sorry after more thinking, I think we need to avoid tdx_global_init() but do TDH.SYS.INIT within tdx_cpu_enable() with raw_spinlock_t. The reason is although KVM will be the first caller of TDX, there will be other caller of TDX in later phase (e.g., IOMMU TDX support) so we need to consider race between those callers. With multiple callers, the tdx_global_init() and tdx_cpu_enable() from them need to be serialized anyway, and having the additional tdx_global_init() will just make things more complicated to do. So I think the simplest way is to use a per-cpu variable to track TDH.SYS.LP.INIT in tdx_cpu_enable() and only call tdx_cpu_enable() from local with IRQ disabled or from IPI function call, and use raw_spinlock_t for TDH.SYS.INIT inside tdx_cpu_enable() to make sure it only gets called once. I'll clarify this in the changelog and/or comments. Again sorry for the noise and please let me know for any comments. Thanks!
On Fri, Jun 30, 2023 at 04:13:39PM -0700, Dave Hansen wrote: > I honestly don't care all that much either way. There's an escape hatch > at runtime (reload kvm_intel.ko) no matter what we do. Please try with my MODULE=n kernel ;-) localyesconfig FTW.
On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote: > I dunno about that, *totally* killing TDX would make my life a lot simpler ;-) :-) > > > I don't get this obsession with doing at module load time :/ > > Waiting until userspace attempts to create the first TDX guest adds complexity > and limits what KVM can do to harden itself. Currently, all feature support in > KVM is effectively frozen at module load. E.g. most of the setup code is > contained in __init functions, many module-scoped variables are effectively > RO after init (though they can't be marked as such until we smush kvm-intel.ko > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan). All of > those patterns would get tossed aside if KVM waits until userspace attempts to > create the first guest. Pff, all that is perfectly possible, just a wee bit more work :-) I mean, we manage to poke text that's RO, surely we can poke a variable that supposedly RO. And I really wish we could put part of the kvm-intel/amd.ko things in the kernel proper and reduce the EXPORT_SYMBOL surface -- we're exporting a whole bunch of things that really shouldn't be, just for KVM :/ > The userspace experience would also be poor, as KVM can't know whether or TDX is > actually supported until the TDX module is fully loaded and configured. Quality that :-( > There are also latency and noisy neighbor concerns, e.g. we *really* don't want > to end up in a situation where creating a TDX guest for a customer can observe > arbitrary latency *and* potentially be disruptive to VMs already running on the > host. Well, that's a quality of implementation issue with the whole TDX crapola. Sounds like we want to impose latency constraints on the various TDX calls. Allowing it to consume arbitrary amounts of CPU time is unacceptable in any case. > Userspace can workaround the second and third issues by spawning a dummy TDX guest > as early as possible, but that adds complexity to userspace, especially if there's > any desire for it to be race free, e.g. with respect to reporting system capabilities > to the control plan. FWIW, I'm 100% behind pushing complexity into userspace if it makes for a simpler kernel. > On the flip side, limited hardware availability (unless Intel has changed its > tune) and the amount of enabling that's required in BIOS and whatnot makes it > highly unlikely that random Linux users are going to unknowingly boot with TDX > enabled. > > That said, if this is a sticking point, let's just make enable_tdx off by default, OK.
On 7/3/23 03:49, Peter Zijlstra wrote: >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want >> to end up in a situation where creating a TDX guest for a customer can observe >> arbitrary latency *and* potentially be disruptive to VMs already running on the >> host. > Well, that's a quality of implementation issue with the whole TDX > crapola. Sounds like we want to impose latency constraints on the > various TDX calls. Allowing it to consume arbitrary amounts of CPU time > is unacceptable in any case. For what it's worth, everybody knew that calling into the TDX module was going to be a black hole and that consuming large amounts of CPU at random times would drive people bat guano crazy. The TDX Module ABI spec does have "Leaf Function Latency" warnings for some of the module calls. But, it's basically a binary thing. A call is either normal or "longer than most". The majority of the "longer than most" cases are for initialization. The _most_ obscene runtime ones are chunked up and can return partial progress to limit latency spikes. But I don't think folks tried as hard on the initialization calls since they're only called once which actually seems pretty reasonable to me. Maybe we need three classes of "Leaf Function Latency": 1. Sane 2. "Longer than most" 3. Better turn the NMI watchdog off before calling this. :) Would that help?
On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote: > On 7/3/23 03:49, Peter Zijlstra wrote: > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want > >> to end up in a situation where creating a TDX guest for a customer can observe > >> arbitrary latency *and* potentially be disruptive to VMs already running on the > >> host. > > Well, that's a quality of implementation issue with the whole TDX > > crapola. Sounds like we want to impose latency constraints on the > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time > > is unacceptable in any case. > > For what it's worth, everybody knew that calling into the TDX module was > going to be a black hole and that consuming large amounts of CPU at > random times would drive people bat guano crazy. > > The TDX Module ABI spec does have "Leaf Function Latency" warnings for > some of the module calls. But, it's basically a binary thing. A call > is either normal or "longer than most". > > The majority of the "longer than most" cases are for initialization. > The _most_ obscene runtime ones are chunked up and can return partial > progress to limit latency spikes. But I don't think folks tried as hard > on the initialization calls since they're only called once which > actually seems pretty reasonable to me. > > Maybe we need three classes of "Leaf Function Latency": > 1. Sane > 2. "Longer than most" > 3. Better turn the NMI watchdog off before calling this. :) > > Would that help? I'm thikning we want something along the lines of the Xen preemptible hypercalls, except less crazy. Where the caller does: for (;;) { ret = tdcall(fn, args); if (ret == -EAGAIN) { cond_resched(); continue; } break; } And then the TDX black box provides a guarantee that any one tdcall (or seamcall or whatever) never takes more than X ns (possibly even configurable) and we get to raise a bug report if we can prove it actually takes longer. Handing the CPU off to random code for random period of time is just not a good idea, ever.
On 7/3/23 08:03, Peter Zijlstra wrote: > On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote: >> On 7/3/23 03:49, Peter Zijlstra wrote: >>>> There are also latency and noisy neighbor concerns, e.g. we *really* don't want >>>> to end up in a situation where creating a TDX guest for a customer can observe >>>> arbitrary latency *and* potentially be disruptive to VMs already running on the >>>> host. >>> Well, that's a quality of implementation issue with the whole TDX >>> crapola. Sounds like we want to impose latency constraints on the >>> various TDX calls. Allowing it to consume arbitrary amounts of CPU time >>> is unacceptable in any case. >> >> For what it's worth, everybody knew that calling into the TDX module was >> going to be a black hole and that consuming large amounts of CPU at >> random times would drive people bat guano crazy. >> >> The TDX Module ABI spec does have "Leaf Function Latency" warnings for >> some of the module calls. But, it's basically a binary thing. A call >> is either normal or "longer than most". >> >> The majority of the "longer than most" cases are for initialization. >> The _most_ obscene runtime ones are chunked up and can return partial >> progress to limit latency spikes. But I don't think folks tried as hard >> on the initialization calls since they're only called once which >> actually seems pretty reasonable to me. >> >> Maybe we need three classes of "Leaf Function Latency": >> 1. Sane >> 2. "Longer than most" >> 3. Better turn the NMI watchdog off before calling this. :) >> >> Would that help? > > I'm thikning we want something along the lines of the Xen preemptible > hypercalls, except less crazy. Where the caller does: > > for (;;) { > ret = tdcall(fn, args); > if (ret == -EAGAIN) { > cond_resched(); > continue; > } > break; > } > > And then the TDX black box provides a guarantee that any one tdcall (or > seamcall or whatever) never takes more than X ns (possibly even > configurable) and we get to raise a bug report if we can prove it > actually takes longer. It's _supposed_ to be doing something kinda like that. For instance, in the places that need locking, the TDX module essentially does: if (!trylock(&lock)) return -EBUSY; which is a heck of a lot better than spinning in the TDX module. Those module locks are also almost always for things that *also* have some kind of concurrency control in Linux too. *But*, there are also the really nasty calls that *do* take forever. It would be great to have a list of them or, heck, even *enumeration* of which ones can take forever so we don't need to maintain a table.
On Mon, Jul 03, 2023 at 05:03:30PM +0200, Peter Zijlstra wrote: > On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote: > > On 7/3/23 03:49, Peter Zijlstra wrote: > > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want > > >> to end up in a situation where creating a TDX guest for a customer can observe > > >> arbitrary latency *and* potentially be disruptive to VMs already running on the > > >> host. > > > Well, that's a quality of implementation issue with the whole TDX > > > crapola. Sounds like we want to impose latency constraints on the > > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time > > > is unacceptable in any case. > > > > For what it's worth, everybody knew that calling into the TDX module was > > going to be a black hole and that consuming large amounts of CPU at > > random times would drive people bat guano crazy. > > > > The TDX Module ABI spec does have "Leaf Function Latency" warnings for > > some of the module calls. But, it's basically a binary thing. A call > > is either normal or "longer than most". > > > > The majority of the "longer than most" cases are for initialization. > > The _most_ obscene runtime ones are chunked up and can return partial > > progress to limit latency spikes. But I don't think folks tried as hard > > on the initialization calls since they're only called once which > > actually seems pretty reasonable to me. > > > > Maybe we need three classes of "Leaf Function Latency": > > 1. Sane > > 2. "Longer than most" > > 3. Better turn the NMI watchdog off before calling this. :) > > > > Would that help? > > I'm thikning we want something along the lines of the Xen preemptible > hypercalls, except less crazy. Where the caller does: > > for (;;) { > ret = tdcall(fn, args); > if (ret == -EAGAIN) { > cond_resched(); > continue; > } > break; > } > > And then the TDX black box provides a guarantee that any one tdcall (or > seamcall or whatever) never takes more than X ns (possibly even > configurable) and we get to raise a bug report if we can prove it > actually takes longer. TDG.VP.VMCALL TDCALL can take arbitrary amount of time as it handles over control to the host/VMM. But I'm not quite follow how it is different from the host stopping scheduling vCPU on a random instruction. It can happen at any point and TDCALL is not special from this PoV.
On 7/3/23 10:55, kirill.shutemov@linux.intel.com wrote: >> I'm thikning we want something along the lines of the Xen preemptible >> hypercalls, except less crazy. Where the caller does: >> >> for (;;) { >> ret = tdcall(fn, args); >> if (ret == -EAGAIN) { >> cond_resched(); >> continue; >> } >> break; >> } >> >> And then the TDX black box provides a guarantee that any one tdcall (or >> seamcall or whatever) never takes more than X ns (possibly even >> configurable) and we get to raise a bug report if we can prove it >> actually takes longer. > TDG.VP.VMCALL TDCALL can take arbitrary amount of time as it handles over > control to the host/VMM. > > But I'm not quite follow how it is different from the host stopping > scheduling vCPU on a random instruction. It can happen at any point and > TDCALL is not special from this PoV. Well, for one, if the host stops the vCPU on a random instruction the host has to restore all the vCPU state. *ALL* of it. That means that after the host hands control back, the guest is perfectly ready to take all the interrupts that are pending. These TDCALLs are *VERY* different. The guest gets control back and has some amount of its state zapped, RBP being the most annoying current example of state that is lost. So the guest resumes control here and must handle all of its interrupts with some of its state (and thus ability to cleanly handle the interrupt) gone. The instructions after state is lost are very much special. Just look at the syscall gap.
On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote: > Waiting until userspace attempts to create the first TDX guest adds complexity > and limits what KVM can do to harden itself. Currently, all feature support in > KVM is effectively frozen at module load. E.g. most of the setup code is > contained in __init functions, many module-scoped variables are effectively > RO after init (though they can't be marked as such until we smush kvm-intel.ko > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan). All of > those patterns would get tossed aside if KVM waits until userspace attempts to > create the first guest. .... People got poked and the following was suggested: On boot do: TDH.SYS.INIT TDH.SYS.LP.INIT TDH.SYS.CONFIG TDH.SYS.KEY.CONFIG This should get TDX mostly sorted, but doesn't consume much resources. Then later, when starting the first TDX guest, do the whole TDH.TDMR.INIT dance to set up the PAMT array -- which is what gobbles up memory. From what I understand the TDH.TDMR.INIT thing is not one of those excessively long calls. If we have concerns about allocating the PAMT array, can't we use CMA for this? Allocate the whole thing at boot as CMA such that when not used for TDX it can be used for regular things like userspace and filecache pages? Those TDH.SYS calls should be enough to ensure TDX is actually working, no?
On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote: > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote: > > > Waiting until userspace attempts to create the first TDX guest adds complexity > > and limits what KVM can do to harden itself. Currently, all feature support in > > KVM is effectively frozen at module load. E.g. most of the setup code is > > contained in __init functions, many module-scoped variables are effectively > > RO after init (though they can't be marked as such until we smush kvm-intel.ko > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan). All of > > those patterns would get tossed aside if KVM waits until userspace attempts to > > create the first guest. > > .... > > People got poked and the following was suggested: > > On boot do: > > TDH.SYS.INIT > TDH.SYS.LP.INIT > TDH.SYS.CONFIG > TDH.SYS.KEY.CONFIG > > This should get TDX mostly sorted, but doesn't consume much resources. > Then later, when starting the first TDX guest, do the whole > > TDH.TDMR.INIT > > dance to set up the PAMT array -- which is what gobbles up memory. From > what I understand the TDH.TDMR.INIT thing is not one of those > excessively long calls. The TDH.TDMR.INIT itself has it's own latency requirement implemented in the TDX module, thus it only initializes a small chunk (1M I guess) in each call. Therefore we need a loop to do bunch of TDH.TDMR.INIT in order to initialize all PAMT entries for all TDX-usable memory, which can be time-consuming. Currently for simplicity we just do this inside the module initialization, but can be optimized later when we have an agreed solution of how to optimize. > > If we have concerns about allocating the PAMT array, can't we use CMA > for this? Allocate the whole thing at boot as CMA such that when not > used for TDX it can be used for regular things like userspace and > filecache pages? The PAMT allocation itself isn't a concern I think. The concern is the TDH.TDMR.INIT to initialize them. Also, one practical problem to prevent us from pre-allocating PAMT is the PAMT size to be allocated can only be determined after the TDH.SYS.INFO SEAMCALL, which reports the "PAMT entry size" in the TDSYSINFO_STRUCT. > > Those TDH.SYS calls should be enough to ensure TDX is actually working, > no?
On Mon, Jul 03, 2023 at 08:55:56PM +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jul 03, 2023 at 05:03:30PM +0200, Peter Zijlstra wrote: > > On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote: > > > On 7/3/23 03:49, Peter Zijlstra wrote: > > > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want > > > >> to end up in a situation where creating a TDX guest for a customer can observe > > > >> arbitrary latency *and* potentially be disruptive to VMs already running on the > > > >> host. > > > > Well, that's a quality of implementation issue with the whole TDX > > > > crapola. Sounds like we want to impose latency constraints on the > > > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time > > > > is unacceptable in any case. > > > > > > For what it's worth, everybody knew that calling into the TDX module was > > > going to be a black hole and that consuming large amounts of CPU at > > > random times would drive people bat guano crazy. > > > > > > The TDX Module ABI spec does have "Leaf Function Latency" warnings for > > > some of the module calls. But, it's basically a binary thing. A call > > > is either normal or "longer than most". > > > > > > The majority of the "longer than most" cases are for initialization. > > > The _most_ obscene runtime ones are chunked up and can return partial > > > progress to limit latency spikes. But I don't think folks tried as hard > > > on the initialization calls since they're only called once which > > > actually seems pretty reasonable to me. > > > > > > Maybe we need three classes of "Leaf Function Latency": > > > 1. Sane > > > 2. "Longer than most" > > > 3. Better turn the NMI watchdog off before calling this. :) > > > > > > Would that help? > > > > I'm thikning we want something along the lines of the Xen preemptible > > hypercalls, except less crazy. Where the caller does: > > > > for (;;) { > > ret = tdcall(fn, args); > > if (ret == -EAGAIN) { > > cond_resched(); > > continue; > > } > > break; > > } > > > > And then the TDX black box provides a guarantee that any one tdcall (or > > seamcall or whatever) never takes more than X ns (possibly even > > configurable) and we get to raise a bug report if we can prove it > > actually takes longer. > > TDG.VP.VMCALL TDCALL can take arbitrary amount of time as it handles over > control to the host/VMM. > > But I'm not quite follow how it is different from the host stopping > scheduling vCPU on a random instruction. It can happen at any point and > TDCALL is not special from this PoV. A guest will exit on timer/interrupt and then the host can reschedule; AFAIU this doesn't actually happen with these TDX calls, if control is in that SEAM thing, it stays there until it's done.
On Tue, Jul 04, 2023 at 09:50:22PM +0000, Huang, Kai wrote: > On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote: > > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote: > > > > > Waiting until userspace attempts to create the first TDX guest adds complexity > > > and limits what KVM can do to harden itself. Currently, all feature support in > > > KVM is effectively frozen at module load. E.g. most of the setup code is > > > contained in __init functions, many module-scoped variables are effectively > > > RO after init (though they can't be marked as such until we smush kvm-intel.ko > > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan). All of > > > those patterns would get tossed aside if KVM waits until userspace attempts to > > > create the first guest. > > > > .... > > > > People got poked and the following was suggested: > > > > On boot do: > > > > TDH.SYS.INIT > > TDH.SYS.LP.INIT > > TDH.SYS.CONFIG > > TDH.SYS.KEY.CONFIG > > > > This should get TDX mostly sorted, but doesn't consume much resources. > > Then later, when starting the first TDX guest, do the whole > > > > TDH.TDMR.INIT > > > > dance to set up the PAMT array -- which is what gobbles up memory. From > > what I understand the TDH.TDMR.INIT thing is not one of those > > excessively long calls. > > The TDH.TDMR.INIT itself has it's own latency requirement implemented in the TDX > module, thus it only initializes a small chunk (1M I guess) in each call. > Therefore we need a loop to do bunch of TDH.TDMR.INIT in order to initialize all > PAMT entries for all TDX-usable memory, which can be time-consuming. Yeah, so you can put a cond_resched() in that loop and all is well, you do not negatively affect other tasks. Because *that* was the concern raised.
On Wed, 2023-07-05 at 09:16 +0200, Peter Zijlstra wrote: > On Tue, Jul 04, 2023 at 09:50:22PM +0000, Huang, Kai wrote: > > On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote: > > > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote: > > > > > > > Waiting until userspace attempts to create the first TDX guest adds complexity > > > > and limits what KVM can do to harden itself. Currently, all feature support in > > > > KVM is effectively frozen at module load. E.g. most of the setup code is > > > > contained in __init functions, many module-scoped variables are effectively > > > > RO after init (though they can't be marked as such until we smush kvm-intel.ko > > > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan). All of > > > > those patterns would get tossed aside if KVM waits until userspace attempts to > > > > create the first guest. > > > > > > .... > > > > > > People got poked and the following was suggested: > > > > > > On boot do: > > > > > > TDH.SYS.INIT > > > TDH.SYS.LP.INIT > > > TDH.SYS.CONFIG > > > TDH.SYS.KEY.CONFIG > > > > > > This should get TDX mostly sorted, but doesn't consume much resources. > > > Then later, when starting the first TDX guest, do the whole > > > > > > TDH.TDMR.INIT > > > > > > dance to set up the PAMT array -- which is what gobbles up memory. From > > > what I understand the TDH.TDMR.INIT thing is not one of those > > > excessively long calls. > > > > The TDH.TDMR.INIT itself has it's own latency requirement implemented in the TDX > > module, thus it only initializes a small chunk (1M I guess) in each call. > > Therefore we need a loop to do bunch of TDH.TDMR.INIT in order to initialize all > > PAMT entries for all TDX-usable memory, which can be time-consuming. > > Yeah, so you can put a cond_resched() in that loop and all is well, you > do not negatively affect other tasks. Because *that* was the concern > raised. Yes cond_resched() has been done. It's in patch 16 (x86/virt/tdx: Initialize all TDMRs).
On 7/4/23 09:58, Peter Zijlstra wrote: > If we have concerns about allocating the PAMT array, can't we use CMA > for this? Allocate the whole thing at boot as CMA such that when not > used for TDX it can be used for regular things like userspace and > filecache pages? I never thought of CMA as being super reliable. Maybe it's improved over the years. KVM also has a rather nasty habit of pinning pages, like for device passthrough. I suspect that means that we'll have one of two scenarios: 1. CMA works great, but the TDX/CMA area is unusable for KVM because it's pinning all its pages and they just get moved out of the CMA area immediately. The CMA area is effectively wasted. 2. CMA sucks, and users get sporadic TDX failures when they wait a long time to run a TDX guest after boot. Users just work around the CMA support by starting up TDX guests at boot or demanding a module parameter be set. Hacking in CMA support was a waste. Am I just too much of a pessimist?
On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote: > On 7/4/23 09:58, Peter Zijlstra wrote: > > If we have concerns about allocating the PAMT array, can't we use CMA > > for this? Allocate the whole thing at boot as CMA such that when not > > used for TDX it can be used for regular things like userspace and > > filecache pages? > > I never thought of CMA as being super reliable. Maybe it's improved > over the years. > > KVM also has a rather nasty habit of pinning pages, like for device > passthrough. I suspect that means that we'll have one of two scenarios: > > 1. CMA works great, but the TDX/CMA area is unusable for KVM because > it's pinning all its pages and they just get moved out of the CMA > area immediately. The CMA area is effectively wasted. > 2. CMA sucks, and users get sporadic TDX failures when they wait a long > time to run a TDX guest after boot. Users just work around the CMA > support by starting up TDX guests at boot or demanding a module > parameter be set. Hacking in CMA support was a waste. > > Am I just too much of a pessimist? Well, if CMA still sucks, then that needs fixing. If CMA works, but we have a circular fail in that KVM needs to long-term pin the PAMT pages but long-term pin is evicted from CMA (the whole point of long-term pin, after all), then surely we can break that cycle somehow, since in this case the purpose of the CMA is being able to grab that memory chunk when we needs it. That is, either way around is just a matter of a little code, no?
On 7/5/23 07:57, Peter Zijlstra wrote: > On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote: >> On 7/4/23 09:58, Peter Zijlstra wrote: >>> If we have concerns about allocating the PAMT array, can't we use CMA >>> for this? Allocate the whole thing at boot as CMA such that when not >>> used for TDX it can be used for regular things like userspace and >>> filecache pages? >> I never thought of CMA as being super reliable. Maybe it's improved >> over the years. >> >> KVM also has a rather nasty habit of pinning pages, like for device >> passthrough. I suspect that means that we'll have one of two scenarios: >> >> 1. CMA works great, but the TDX/CMA area is unusable for KVM because >> it's pinning all its pages and they just get moved out of the CMA >> area immediately. The CMA area is effectively wasted. >> 2. CMA sucks, and users get sporadic TDX failures when they wait a long >> time to run a TDX guest after boot. Users just work around the CMA >> support by starting up TDX guests at boot or demanding a module >> parameter be set. Hacking in CMA support was a waste. >> >> Am I just too much of a pessimist? > Well, if CMA still sucks, then that needs fixing. If CMA works, but we > have a circular fail in that KVM needs to long-term pin the PAMT pages > but long-term pin is evicted from CMA (the whole point of long-term pin, > after all), then surely we can break that cycle somehow, since in this > case the purpose of the CMA is being able to grab that memory chunk when > we needs it. > > That is, either way around is just a matter of a little code, no? It's not a circular dependency, it's conflicting requirements. CMA makes memory more available, but only in the face of unpinned pages. KVM can pin lots of pages, even outside of TDX-based VMs. So we either need to change how CMA works fundamentally or stop KVM from pinning pages.
On Thu, Jul 06, 2023, Dave Hansen wrote: > On 7/5/23 07:57, Peter Zijlstra wrote: > > On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote: > >> On 7/4/23 09:58, Peter Zijlstra wrote: > >>> If we have concerns about allocating the PAMT array, can't we use CMA > >>> for this? Allocate the whole thing at boot as CMA such that when not > >>> used for TDX it can be used for regular things like userspace and > >>> filecache pages? > >> I never thought of CMA as being super reliable. Maybe it's improved > >> over the years. > >> > >> KVM also has a rather nasty habit of pinning pages, like for device > >> passthrough. I suspect that means that we'll have one of two scenarios: > >> > >> 1. CMA works great, but the TDX/CMA area is unusable for KVM because > >> it's pinning all its pages and they just get moved out of the CMA > >> area immediately. The CMA area is effectively wasted. > >> 2. CMA sucks, and users get sporadic TDX failures when they wait a long > >> time to run a TDX guest after boot. Users just work around the CMA > >> support by starting up TDX guests at boot or demanding a module > >> parameter be set. Hacking in CMA support was a waste. > >> > >> Am I just too much of a pessimist? > > Well, if CMA still sucks, then that needs fixing. If CMA works, but we > > have a circular fail in that KVM needs to long-term pin the PAMT pages > > but long-term pin is evicted from CMA (the whole point of long-term pin, > > after all), then surely we can break that cycle somehow, since in this > > case the purpose of the CMA is being able to grab that memory chunk when > > we needs it. > > > > That is, either way around is just a matter of a little code, no? > > It's not a circular dependency, it's conflicting requirements. > > CMA makes memory more available, but only in the face of unpinned pages. > > KVM can pin lots of pages, even outside of TDX-based VMs. > > So we either need to change how CMA works fundamentally or stop KVM from > pinning pages. Nit, I think you're conflating KVM with VFIO and/or IOMMU code. Device passhthrough does pin large chunks of memory, but KVM itself isn't involved or even aware of the pins. HugeTLB is another case where CMA will be effectively used to serve guest memory, but again KVM isn't the thing doing the pinning.
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 4dfe2e794411..d8226a50c58c 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, #ifdef CONFIG_INTEL_TDX_HOST bool platform_tdx_enabled(void); +int tdx_cpu_enable(void); +int tdx_enable(void); #else /* !CONFIG_INTEL_TDX_HOST */ static inline bool platform_tdx_enabled(void) { return false; } +static inline int tdx_cpu_enable(void) { return -ENODEV; } +static inline int tdx_enable(void) { return -ENODEV; } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 141d12376c4d..29ca18f66d61 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -13,6 +13,10 @@ #include <linux/errno.h> #include <linux/printk.h> #include <linux/smp.h> +#include <linux/cpu.h> +#include <linux/spinlock.h> +#include <linux/percpu-defs.h> +#include <linux/mutex.h> #include <asm/msr-index.h> #include <asm/msr.h> #include <asm/archrandom.h> @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init; static u32 tdx_guest_keyid_start __ro_after_init; static u32 tdx_nr_guest_keyids __ro_after_init; +static bool tdx_global_initialized; +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock); +static DEFINE_PER_CPU(bool, tdx_lp_initialized); + +static enum tdx_module_status_t tdx_module_status; +static DEFINE_MUTEX(tdx_module_lock); + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -74,6 +85,157 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, } } +/* + * Do the module global initialization if not done yet. + * It's always called with interrupts and preemption disabled. + */ +static int try_init_module_global(void) +{ + unsigned long flags; + int ret; + + /* + * The TDX module global initialization only needs to be done + * once on any cpu. + */ + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); + + if (tdx_global_initialized) { + ret = 0; + goto out; + } + + /* All '0's are just unused parameters. */ + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); + if (!ret) + tdx_global_initialized = true; +out: + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); + + return ret; +} + +/** + * tdx_cpu_enable - Enable TDX on local cpu + * + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module + * global initialization SEAMCALL if not done) on local cpu to make this + * cpu be ready to run any other SEAMCALLs. + * + * Call this function with preemption disabled. + * + * Return 0 on success, otherwise errors. + */ +int tdx_cpu_enable(void) +{ + int ret; + + if (!platform_tdx_enabled()) + return -ENODEV; + + lockdep_assert_preemption_disabled(); + + /* Already done */ + if (__this_cpu_read(tdx_lp_initialized)) + return 0; + + /* + * The TDX module global initialization is the very first step + * to enable TDX. Need to do it first (if hasn't been done) + * before the per-cpu initialization. + */ + ret = try_init_module_global(); + if (ret) + return ret; + + /* All '0's are just unused parameters */ + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL); + if (ret) + return ret; + + __this_cpu_write(tdx_lp_initialized, true); + + return 0; +} +EXPORT_SYMBOL_GPL(tdx_cpu_enable); + +static int init_tdx_module(void) +{ + /* + * TODO: + * + * - Get TDX module information and TDX-capable memory regions. + * - Build the list of TDX-usable memory regions. + * - Construct a list of "TD Memory Regions" (TDMRs) to cover + * all TDX-usable memory regions. + * - Configure the TDMRs and the global KeyID to the TDX module. + * - Configure the global KeyID on all packages. + * - Initialize all TDMRs. + * + * Return error before all steps are done. + */ + return -EINVAL; +} + +static int __tdx_enable(void) +{ + int ret; + + ret = init_tdx_module(); + if (ret) { + pr_err("module initialization failed (%d)\n", ret); + tdx_module_status = TDX_MODULE_ERROR; + return ret; + } + + pr_info("module initialized.\n"); + tdx_module_status = TDX_MODULE_INITIALIZED; + + return 0; +} + +/** + * tdx_enable - Enable TDX module to make it ready to run TDX guests + * + * This function assumes the caller has: 1) held read lock of CPU hotplug + * lock to prevent any new cpu from becoming online; 2) done both VMXON + * and tdx_cpu_enable() on all online cpus. + * + * This function can be called in parallel by multiple callers. + * + * Return 0 if TDX is enabled successfully, otherwise error. + */ +int tdx_enable(void) +{ + int ret; + + if (!platform_tdx_enabled()) + return -ENODEV; + + lockdep_assert_cpus_held(); + + mutex_lock(&tdx_module_lock); + + switch (tdx_module_status) { + case TDX_MODULE_UNKNOWN: + ret = __tdx_enable(); + break; + case TDX_MODULE_INITIALIZED: + /* Already initialized, great, tell the caller. */ + ret = 0; + break; + default: + /* Failed to initialize in the previous attempts */ + ret = -EINVAL; + break; + } + + mutex_unlock(&tdx_module_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(tdx_enable); + static int __init record_keyid_partitioning(u32 *tdx_keyid_start, u32 *nr_tdx_keyids) { diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 55dbb1b8c971..9fb46033c852 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -16,11 +16,24 @@ */ #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL +/* + * TDX module SEAMCALL leaf functions + */ +#define TDH_SYS_INIT 33 +#define TDH_SYS_LP_INIT 35 + /* * Do not put any hardware-defined TDX structure representations below * this comment! */ +/* Kernel defined TDX module status during module initialization. */ +enum tdx_module_status_t { + TDX_MODULE_UNKNOWN, + TDX_MODULE_INITIALIZED, + TDX_MODULE_ERROR +}; + struct tdx_module_output; u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, struct tdx_module_output *out);
To enable TDX the kernel needs to initialize TDX from two perspectives: 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL on one logical cpu before the kernel wants to make any other SEAMCALLs on that cpu (including those involved during module initialization and running TDX guests). The TDX module can be initialized only once in its lifetime. Instead of always initializing it at boot time, this implementation chooses an "on demand" approach to initialize TDX until there is a real need (e.g when requested by KVM). This approach has below pros: 1) It avoids consuming the memory that must be allocated by kernel and given to the TDX module as metadata (~1/256th of the TDX-usable memory), and also saves the CPU cycles of initializing the TDX module (and the metadata) when TDX is not used at all. 2) The TDX module design allows it to be updated while the system is running. The update procedure shares quite a few steps with this "on demand" initialization mechanism. The hope is that much of "on demand" mechanism can be shared with a future "update" mechanism. A boot-time TDX module implementation would not be able to share much code with the update mechanism. 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM code mucks with VMX enabling. If the TDX module were to be initialized separately from KVM (like at boot), the boot code would need to be taught how to muck with VMX enabling and KVM would need to be taught how to cope with that. Making KVM itself responsible for TDX initialization lets the rest of the kernel stay blissfully unaware of VMX. Similar to module initialization, also make the per-cpu initialization "on demand" as it also depends on VMX being enabled. Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX module and enable TDX on local cpu respectively. For now tdx_enable() is a placeholder. The TODO list will be pared down as functionality is added. Export both tdx_cpu_enable() and tdx_enable() for KVM use. In tdx_enable() use a state machine protected by mutex to make sure the initialization will only be done once, as tdx_enable() can be called multiple times (i.e. KVM module can be reloaded) and may be called concurrently by other kernel components in the future. The per-cpu initialization on each cpu can only be done once during the module's life time. Use a per-cpu variable to track its status to make sure it is only done once in tdx_cpu_enable(). Also, a SEAMCALL to do TDX module global initialization must be done once on any logical cpu before any per-cpu initialization SEAMCALL. Do it inside tdx_cpu_enable() too (if hasn't been done). tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The per-cpu initialization must be done before those SEAMCALLs are invoked on some cpu. To keep things simple, in tdx_cpu_enable(), always do the per-cpu initialization regardless of whether the TDX module has been initialized or not. And in tdx_enable(), don't call tdx_cpu_enable() but assume the caller has disabled CPU hotplug, done VMXON and tdx_cpu_enable() on all online cpus before calling tdx_enable(). Signed-off-by: Kai Huang <kai.huang@intel.com> --- v11 -> v12: - Simplified TDX module global init and lp init status tracking (David). - Added comment around try_init_module_global() for using raw_spin_lock() (Dave). - Added one sentence to changelog to explain why to expose tdx_enable() and tdx_cpu_enable() (Dave). - Simplifed comments around tdx_enable() and tdx_cpu_enable() to use lockdep_assert_*() instead. (Dave) - Removed redundent "TDX" in error message (Dave). v10 -> v11: - Return -NODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off. - Return the actual error code for tdx_enable() instead of -EINVAL. - Added Isaku's Reviewed-by. v9 -> v10: - Merged the patch to handle per-cpu initialization to this patch to tell the story better. - Changed how to handle the per-cpu initialization to only provide a tdx_cpu_enable() function to let the user of TDX to do it when the user wants to run TDX code on a certain cpu. - Changed tdx_enable() to not call cpus_read_lock() explicitly, but call lockdep_assert_cpus_held() to assume the caller has done that. - Improved comments around tdx_enable() and tdx_cpu_enable(). - Improved changelog to tell the story better accordingly. v8 -> v9: - Removed detailed TODO list in the changelog (Dave). - Added back steps to do module global initialization and per-cpu initialization in the TODO list comment. - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h v7 -> v8: - Refined changelog (Dave). - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). - Add a "TODO list" comment in init_tdx_module() to list all steps of initializing the TDX Module to tell the story (Dave). - Made tdx_enable() unverisally return -EINVAL, and removed nonsense comments (Dave). - Simplified __tdx_enable() to only handle success or failure. - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. - Improved comments (Dave). - Pointed out 'tdx_module_status' is software thing (Dave). v6 -> v7: - No change. v5 -> v6: - Added code to set status to TDX_MODULE_NONE if TDX module is not loaded (Chao) - Added Chao's Reviewed-by. - Improved comments around cpus_read_lock(). - v3->v5 (no feedback on v4): - Removed the check that SEAMRR and TDX KeyID have been detected on all present cpus. - Removed tdx_detect(). - Added num_online_cpus() to MADT-enabled CPUs check within the CPU hotplug lock and return early with error message. - Improved dmesg printing for TDX module detection and initialization. --- arch/x86/include/asm/tdx.h | 4 + arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.h | 13 +++ 3 files changed, 179 insertions(+)