Message ID: 48505089b645019a734d85c2c29f3c8ae2dbd6bd.1668988357.git.kai.huang@intel.com
State:      New, archived
Series:     TDX host kernel support
On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> +/*
> + * Data structure to make SEAMCALL on multiple CPUs concurrently.
> + * @err is set to -EFAULT when SEAMCALL fails on any cpu.
> + */
> +struct seamcall_ctx {
> +        u64 fn;
> +        u64 rcx;
> +        u64 rdx;
> +        u64 r8;
> +        u64 r9;
> +        atomic_t err;
> +};

> @@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>          }
>  }
>
> +static void seamcall_smp_call_function(void *data)
> +{
> +        struct seamcall_ctx *sc = data;
> +        int ret;
> +
> +        ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
> +        if (ret)
> +                atomic_set(&sc->err, -EFAULT);
> +}

Can someone explain this usage of atomic_t to me, please?
On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> +/*
> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +        on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

Suppose the user has NOHZ_FULL configured, and is already running
userspace that will terminate on interrupt (this is a desired feature for
NOHZ_FULL), guess how happy they'll be if someone, on another partition,
manages to tickle this TDX gunk?
On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:

> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> CPUs and use it to shut down the module.  Later logical-cpu scope module
> initialization will use it too.

Uhh, those requirements ^ are not met by this:

> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +        on_each_cpu(seamcall_smp_call_function, sc, true);
> +}

Consider:

    CPU0                     CPU1                         CPU2

    local_irq_disable()
    ...
                             seamcall_on_each_cpu()
                               send-IPIs to 0 and 2
                                                          <IPI>
                               runs local seamcall
                               (seamcall done)
                               waits for 0 and 2
                                                          <has an NMI delay things>
                                                            runs seamcall
                                                            clears CSD_LOCK
                                                          </IPI>
                               ... spinning ...

    local_irq_enable()
    <IPI>
      runs seamcall
      clears CSD_LOCK
                               *FINALLY* observes CSD_LOCK
                               cleared on all CPUs and
                               continues
    </IPI>

IOW, they all 3 run seamcall at different times.

Either the Changelog is broken or this TDX crud is worse crap than I
thought possible, because the only way to actually meet that requirement
as stated is stop_machine().
On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>
>> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
>> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
>> CPUs and use it to shut down the module.  Later logical-cpu scope module
>> initialization will use it too.
>
> Uhh, those requirements ^ are not met by this:

Can run concurrently != Must run concurrently

The documentation clearly says "can run concurrently" as quoted above.

Thanks,

        tglx
On 11/22/22 01:13, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>> +/*
>> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
>> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
>> + */
>> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
>> +{
>> +        on_each_cpu(seamcall_smp_call_function, sc, true);
>> +}
>
> Suppose the user has NOHZ_FULL configured, and is already running
> userspace that will terminate on interrupt (this is a desired feature for
> NOHZ_FULL), guess how happy they'll be if someone, on another partition,
> manages to tickle this TDX gunk?

Yeah, they'll be none too happy.

But, what do we do?

There are technical solutions like detecting if NOHZ_FULL is in play and
refusing to initialize TDX.  There are also non-technical solutions like
telling folks in the documentation that they better modprobe kvm early
if they want to do TDX, or their NOHZ_FULL apps will pay.

We could also force the TDX module to be loaded early in boot before
NOHZ_FULL is in play, but that would waste memory on TDX metadata even
if TDX is never used.

How do NOHZ_FULL folks deal with late microcode updates, for example?
Those are roughly equally disruptive to all CPUs.
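[ For illustration only: a rough, untested sketch of the "detect NOHZ_FULL and
  refuse to initialize" option mentioned above.  The function name is made up
  and nothing like this exists in the posted series. ]

#include <linux/tick.h>
#include <linux/printk.h>

/*
 * Bail out of TDX module initialization when nohz_full is in use, so
 * isolated CPUs never get IPI'ed for SEAMCALLs.
 */
static int tdx_check_nohz_full(void)
{
        if (tick_nohz_full_enabled()) {
                pr_err("tdx: refusing to initialize the TDX module with nohz_full active\n");
                return -EBUSY;
        }

        return 0;
}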
On 11/22/22 01:20, Peter Zijlstra wrote:
> Either the Changelog is broken or this TDX crud is worse crap than I
> thought possible, because the only way to actually meet that requirement
> as stated is stop_machine().

I think the changelog is broken.  I don't see anything in the TDX module
spec about "the SEAMCALL can run concurrently on different CPUs".
Shutdown, as far as I can tell, just requires that the shutdown seamcall
be run once on each CPU.  Concurrency and ordering don't seem to matter
at all.
On Tue, Nov 22 2022 at 07:20, Dave Hansen wrote:
> On 11/22/22 01:20, Peter Zijlstra wrote:
>> Either the Changelog is broken or this TDX crud is worse crap than I
>> thought possible, because the only way to actually meet that requirement
>> as stated is stop_machine().
>
> I think the changelog is broken.  I don't see anything in the TDX module
> spec about "the SEAMCALL can run concurrently on different CPUs".
> Shutdown, as far as I can tell, just requires that the shutdown seamcall
> be run once on each CPU.  Concurrency and ordering don't seem to matter
> at all.

You're right.  The 'can concurrently run' thing is for LP.INIT:

  4.2.2. LP-Scope Initialization: TDH.SYS.LP.INIT

  TDH.SYS.LP.INIT is intended to perform LP-scope, core-scope and
  package-scope initialization of the Intel TDX module.  It can be called
  only after TDH.SYS.INIT completes successfully, and it can run
  concurrently on multiple LPs.
On 11/20/22 16:26, Kai Huang wrote:
> TDX supports shutting down the TDX module at any time during its
> lifetime.  After the module is shut down, no further TDX module SEAMCALL
> leaf functions can be made to the module on any logical cpu.
>
> Shut down the TDX module in case of any error during the initialization
> process.  It's pointless to leave the TDX module in some middle state.
>
> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> CPUs and use it to shut down the module.  Later logical-cpu scope module
> initialization will use it too.

To me, this starts to veer way too far into internal implementation details.

        Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
        to shut down the TDX module.

This is also the point where you should talk about the new infrastructure.
Why do you need a new 'struct seamcall_something'?  What makes it special?

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b06c1a2bc9cb..5db1a05cb4bd 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,8 @@
>  #include <linux/mutex.h>
>  #include <linux/cpu.h>
>  #include <linux/cpumask.h>
> +#include <linux/smp.h>
> +#include <linux/atomic.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/apic.h>
> @@ -124,15 +126,27 @@ bool platform_tdx_enabled(void)
>          return !!tdx_keyid_num;
>  }
>
> +/*
> + * Data structure to make SEAMCALL on multiple CPUs concurrently.
> + * @err is set to -EFAULT when SEAMCALL fails on any cpu.
> + */
> +struct seamcall_ctx {
> +        u64 fn;
> +        u64 rcx;
> +        u64 rdx;
> +        u64 r8;
> +        u64 r9;
> +        atomic_t err;
> +};
> +
>  /*
>   * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>   * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
>   * leaf function return code and the additional output respectively if
>   * not NULL.
>   */
> -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> -                                    u64 *seamcall_ret,
> -                                    struct tdx_module_output *out)
> +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +                    u64 *seamcall_ret, struct tdx_module_output *out)
>  {
>          u64 sret;
>
> @@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>          }
>  }
>
> +static void seamcall_smp_call_function(void *data)
> +{
> +        struct seamcall_ctx *sc = data;
> +        int ret;
> +
> +        ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
> +        if (ret)
> +                atomic_set(&sc->err, -EFAULT);
> +}

The atomic_t is kinda silly.  I guess it's not *that* wasteful though.

I think it would have actually been a lot more clear if instead of
containing an errno it was a *count* of the number of encountered errors.
An "atomic_set()" where everyone is overwriting each other is a bit
counterintuitive.  It's OK here, of course, but it still looks goofy.

If this were:

        atomic_inc(&sc->nr_errors);

it would be a lot more clear that *anyone* can increment and that it
truly is shared.

> +/*
> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> + */
> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> +{
> +        on_each_cpu(seamcall_smp_call_function, sc, true);
> +}
> +
>  /*
>   * Detect and initialize the TDX module.
>   *
> @@ -181,7 +214,9 @@ static int init_tdx_module(void)
>
>  static void shutdown_tdx_module(void)
>  {
> -        /* TODO: Shut down the TDX module */
> +        struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
> +
> +        seamcall_on_each_cpu(&sc);
>  }

The seamcall_on_each_cpu() function is silly as-is.  Either collapse the
functions or note in the changelog why this is not as silly as it looks.

>  static int __tdx_enable(void)
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 92a8de957dc7..215cc1065d78 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -12,6 +12,11 @@
>  /* MSR to report KeyID partitioning between MKTME and TDX */
>  #define MSR_IA32_MKTME_KEYID_PARTITIONING        0x00000087
>
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_LP_SHUTDOWN        44
> +
>  /*
>   * Do not put any hardware-defined TDX structure representations below
>   * this comment!
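[ A minimal sketch of the counter-based variant suggested above; the
  "nr_errors" name is illustrative and not from the posted series, and the
  snippet is untested. ]

struct seamcall_ctx {
        u64 fn;
        u64 rcx;
        u64 rdx;
        u64 r8;
        u64 r9;
        atomic_t nr_errors;     /* number of CPUs whose SEAMCALL failed */
};

static void seamcall_smp_call_function(void *data)
{
        struct seamcall_ctx *sc = data;

        /* Any CPU may increment; the caller only checks for non-zero. */
        if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL))
                atomic_inc(&sc->nr_errors);
}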
On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
>
>> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>>
>>> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
>>> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
>>> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
>>> CPUs and use it to shut down the module.  Later logical-cpu scope module
>>> initialization will use it too.
>>
>> Uhh, those requirements ^ are not met by this:
>
> Can run concurrently != Must run concurrently
>
> The documentation clearly says "can run concurrently" as quoted above.

The next sentence says: "Implement a mechanism to run SEAMCALL
concurrently" -- it does not.

Anyway, since we're all in agreement there is no such requirement at
all, a schedule_on_each_cpu() might be more appropriate; there is no
reason to use IPIs and spin-waiting for any of this.

That said, perhaps we should grow:

        schedule_on_cpu(struct cpumask *cpus, work_func_t func);

to only disturb a given mask of CPUs.
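[ A rough, untested sketch of the workqueue-based alternative suggested
  above, using the existing schedule_on_each_cpu() API.  That API takes a
  bare work_func_t with no data pointer, so the context is passed through a
  hypothetical file-scope pointer here; callers would need their own
  serialization around it.  The helper name is made up. ]

static struct seamcall_ctx *seamcall_wq_ctx;

static void seamcall_work_fn(struct work_struct *work)
{
        struct seamcall_ctx *sc = seamcall_wq_ctx;

        if (seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL))
                atomic_set(&sc->err, -EFAULT);
}

/*
 * Run the SEAMCALL on each online CPU from workqueue context instead of
 * from an IPI.  schedule_on_each_cpu() queues the work on every online
 * CPU and waits for all of it to finish.
 */
static int seamcall_on_each_cpu_wq(struct seamcall_ctx *sc)
{
        int ret;

        seamcall_wq_ctx = sc;
        ret = schedule_on_each_cpu(seamcall_work_fn);
        seamcall_wq_ctx = NULL;

        return ret;
}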
On Tue, Nov 22, 2022 at 07:14:14AM -0800, Dave Hansen wrote:
> On 11/22/22 01:13, Peter Zijlstra wrote:
> > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> >> +/*
> >> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
> >> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
> >> + */
> >> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
> >> +{
> >> +        on_each_cpu(seamcall_smp_call_function, sc, true);
> >> +}
> >
> > Suppose the user has NOHZ_FULL configured, and is already running
> > userspace that will terminate on interrupt (this is a desired feature for
> > NOHZ_FULL), guess how happy they'll be if someone, on another partition,
> > manages to tickle this TDX gunk?
>
> Yeah, they'll be none too happy.
>
> But, what do we do?

Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
all TDX using tasks.

> There are technical solutions like detecting if NOHZ_FULL is in play and
> refusing to initialize TDX.  There are also non-technical solutions like
> telling folks in the documentation that they better modprobe kvm early
> if they want to do TDX, or their NOHZ_FULL apps will pay.

Surely modprobe kvm isn't the point where TDX gets loaded?  Because
that's on boot for everybody due to all the auto-probing nonsense.

I was expecting TDX to not get initialized until the first TDX using KVM
instance is created.  Am I wrong?

> We could also force the TDX module to be loaded early in boot before
> NOHZ_FULL is in play, but that would waste memory on TDX metadata even
> if TDX is never used.

I'm thinking it makes sense to have a tdx={off,on-demand,force} toggle
anyway.

> How do NOHZ_FULL folks deal with late microcode updates, for example?
> Those are roughly equally disruptive to all CPUs.

I imagine they don't do that -- in fact I would recommend we make the
whole late loading thing mutually exclusive with nohz_full; can't have
both.
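[ A hedged sketch of how such a toggle could be wired up.  The parameter
  name, modes and enum mirror the suggestion above; nothing like this is in
  the posted series, and the snippet is untested. ]

#include <linux/init.h>
#include <linux/string.h>

static enum { TDX_OFF, TDX_ON_DEMAND, TDX_FORCE } tdx_mode = TDX_ON_DEMAND;

/* Parse "tdx=off", "tdx=on-demand" or "tdx=force" from the kernel command line. */
static int __init tdx_mode_setup(char *str)
{
        if (!str)
                return -EINVAL;

        if (!strcmp(str, "off"))
                tdx_mode = TDX_OFF;
        else if (!strcmp(str, "on-demand"))
                tdx_mode = TDX_ON_DEMAND;
        else if (!strcmp(str, "force"))
                tdx_mode = TDX_FORCE;
        else
                return -EINVAL;

        return 0;
}
early_param("tdx", tdx_mode_setup);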
On Tue, Nov 22, 2022 at 10:57:52AM -0800, Dave Hansen wrote:
> To me, this starts to veer way too far into internal implementation details.
>
>         Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
>         to shut down the TDX module.

We really need to let go of the whole 'all BIOS-enabled CPUs' thing.
On 11/22/22 11:13, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 07:14:14AM -0800, Dave Hansen wrote:
>> On 11/22/22 01:13, Peter Zijlstra wrote:
>>> On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
>>>> +/*
>>>> + * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
>>>> + * @sc->err to determine whether any SEAMCALL failed on any cpu.
>>>> + */
>>>> +static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
>>>> +{
>>>> +        on_each_cpu(seamcall_smp_call_function, sc, true);
>>>> +}
>>>
>>> Suppose the user has NOHZ_FULL configured, and is already running
>>> userspace that will terminate on interrupt (this is a desired feature for
>>> NOHZ_FULL), guess how happy they'll be if someone, on another partition,
>>> manages to tickle this TDX gunk?
>>
>> Yeah, they'll be none too happy.
>>
>> But, what do we do?
>
> Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> all TDX using tasks.

I don't think that works.  As I mentioned to Thomas elsewhere, you don't
just need to initialize TDX on the CPUs where it is used.  Before the
module will start working you need to initialize it on *all* the CPUs it
knows about.  The module itself has a little counter where it tracks
this and will refuse to start being useful until it gets called
thoroughly enough.

>> There are technical solutions like detecting if NOHZ_FULL is in play and
>> refusing to initialize TDX.  There are also non-technical solutions like
>> telling folks in the documentation that they better modprobe kvm early
>> if they want to do TDX, or their NOHZ_FULL apps will pay.
>
> Surely modprobe kvm isn't the point where TDX gets loaded?  Because
> that's on boot for everybody due to all the auto-probing nonsense.
>
> I was expecting TDX to not get initialized until the first TDX using KVM
> instance is created.  Am I wrong?

I went looking for it in this series to prove you wrong.  I failed. :)

tdx_enable() is buried in here somewhere:

> https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/

I don't have the patience to dig it out today, so I guess we'll have Kai
tell us.

>> We could also force the TDX module to be loaded early in boot before
>> NOHZ_FULL is in play, but that would waste memory on TDX metadata even
>> if TDX is never used.
>
> I'm thinking it makes sense to have a tdx={off,on-demand,force} toggle
> anyway.

Yep, that makes total sense.  Kai had one in an earlier version but I
made him throw it out because it wasn't *strictly* required and this set
is fat enough.

>> How do NOHZ_FULL folks deal with late microcode updates, for example?
>> Those are roughly equally disruptive to all CPUs.
>
> I imagine they don't do that -- in fact I would recommend we make the
> whole late loading thing mutually exclusive with nohz_full; can't have
> both.

So, if we just use schedule_on_cpu() for now and have the TDX code wait,
will a NOHZ_FULL task just block the schedule_on_cpu() indefinitely?
That doesn't seem like _horrible_ behavior to start off with for a
minimal series.
On Tue, Nov 22, 2022, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> > On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> > > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> > >
> > >> Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> > >> BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> > >> CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> > >> CPUs and use it to shut down the module.  Later logical-cpu scope module
> > >> initialization will use it too.
> > >
> > > Uhh, those requirements ^ are not met by this:
> >
> > Can run concurrently != Must run concurrently
> >
> > The documentation clearly says "can run concurrently" as quoted above.
>
> The next sentence says: "Implement a mechanism to run SEAMCALL
> concurrently" -- it does not.
>
> Anyway, since we're all in agreement there is no such requirement at
> all, a schedule_on_each_cpu() might be more appropriate; there is no
> reason to use IPIs and spin-waiting for any of this.

Backing up a bit, what's the reason for _any_ of this?  The changelog says:

  It's pointless to leave the TDX module in some middle state.

but IMO it's just as pointless to do a shutdown unless the kernel benefits
in some meaningful way.  And IIUC, TDH.SYS.LP.SHUTDOWN does nothing more
than change the SEAM VMCS.HOST_RIP to point to an error trampoline.  E.g.
it's not like doing a shutdown lets the kernel reclaim memory that was
gifted to the TDX module.

In other words, this is just a really expensive way of changing a function
pointer, and the only way it would ever benefit the kernel is if there is
a kernel bug that leads to trying to use TDX after a fatal error.  And
even then, the only difference seems to be that subsequent bogus SEAMCALLs
would get a more unique error message.
On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
> > Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> > all TDX using tasks.
>
> I don't think that works.  As I mentioned to Thomas elsewhere, you don't
> just need to initialize TDX on the CPUs where it is used.  Before the
> module will start working you need to initialize it on *all* the CPUs it
> knows about.  The module itself has a little counter where it tracks
> this and will refuse to start being useful until it gets called
> thoroughly enough.

That's bloody terrible, that is.  How are we going to make that work with
the SMT mitigation crud that forces the SMT sibling offline?

Then the counters don't match and TDX won't work.

Can we get this limitation removed and simply let the module throw a
wobbly (error) when someone tries to use TDX without that logical CPU
having been properly initialized?
On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > I was expecting TDX to not get initialized until the first TDX using KVM
> > instance is created.  Am I wrong?
>
> I went looking for it in this series to prove you wrong.  I failed. :)
>
> tdx_enable() is buried in here somewhere:
>
> > https://lore.kernel.org/lkml/CAAhR5DFrwP+5K8MOxz5YK7jYShhaK4A+2h1Pi31U_9+Z+cz-0A@mail.gmail.com/T/
>
> I don't have the patience to dig it out today, so I guess we'll have Kai
> tell us.

It will be done when the KVM module is loaded, but not when the first TDX
guest is created.
On 11/22/22 16:58, Huang, Kai wrote:
> On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
>>> I was expecting TDX to not get initialized until the first TDX using KVM
>>> instance is created.  Am I wrong?
>> I went looking for it in this series to prove you wrong.  I failed.
> It will be done when the KVM module is loaded, but not when the first TDX
> guest is created.

Why is it done that way?

Can it be changed to delay TDX initialization until the first TDX guest
needs to run?
On Tue, 2022-11-22 at 20:33 +0100, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
> >
> > > Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
> > > all TDX using tasks.
> >
> > I don't think that works.  As I mentioned to Thomas elsewhere, you don't
> > just need to initialize TDX on the CPUs where it is used.  Before the
> > module will start working you need to initialize it on *all* the CPUs it
> > knows about.  The module itself has a little counter where it tracks
> > this and will refuse to start being useful until it gets called
> > thoroughly enough.
>
> That's bloody terrible, that is.  How are we going to make that work with
> the SMT mitigation crud that forces the SMT sibling offline?
>
> Then the counters don't match and TDX won't work.
>
> Can we get this limitation removed and simply let the module throw a
> wobbly (error) when someone tries to use TDX without that logical CPU
> having been properly initialized?

Dave kindly helped to raise this issue, and I'll follow up with the TDX
module guys to see whether we can remove/ease such a limitation.  Thanks!
On Tue, 2022-11-22 at 17:04 -0800, Dave Hansen wrote:
> On 11/22/22 16:58, Huang, Kai wrote:
> > On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > > > I was expecting TDX to not get initialized until the first TDX using KVM
> > > > instance is created.  Am I wrong?
> > > I went looking for it in this series to prove you wrong.  I failed.
> > It will be done when the KVM module is loaded, but not when the first TDX
> > guest is created.
>
> Why is it done that way?
>
> Can it be changed to delay TDX initialization until the first TDX guest
> needs to run?

Sean suggested.

Hi Sean, could you comment?
On Tue, 2022-11-22 at 20:14 +0100, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 10:57:52AM -0800, Dave Hansen wrote:
>
> > To me, this starts to veer way too far into internal implementation details.
> >
> >         Issue the TDH.SYS.LP.SHUTDOWN SEAMCALL on all BIOS-enabled CPUs
> >         to shut down the TDX module.
>
> We really need to let go of the whole 'all BIOS-enabled CPUs' thing.

As replied in another email, I'll follow up with the TDX module guys to
see whether we can remove/ease such a limitation.  Thanks!
On Tue, 2022-11-22 at 19:31 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, Peter Zijlstra wrote:
> > On Tue, Nov 22, 2022 at 04:06:25PM +0100, Thomas Gleixner wrote:
> > > On Tue, Nov 22 2022 at 10:20, Peter Zijlstra wrote:
> > > > On Mon, Nov 21, 2022 at 01:26:28PM +1300, Kai Huang wrote:
> > > >
> > > > > Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
> > > > > BIOS-enabled CPUs, and the SEAMCALL can run concurrently on different
> > > > > CPUs.  Implement a mechanism to run SEAMCALL concurrently on all online
> > > > > CPUs and use it to shut down the module.  Later logical-cpu scope module
> > > > > initialization will use it too.
> > > >
> > > > Uhh, those requirements ^ are not met by this:
> > >
> > > Can run concurrently != Must run concurrently
> > >
> > > The documentation clearly says "can run concurrently" as quoted above.
> >
> > The next sentence says: "Implement a mechanism to run SEAMCALL
> > concurrently" -- it does not.
> >
> > Anyway, since we're all in agreement there is no such requirement at
> > all, a schedule_on_each_cpu() might be more appropriate; there is no
> > reason to use IPIs and spin-waiting for any of this.
>
> Backing up a bit, what's the reason for _any_ of this?  The changelog says:
>
>   It's pointless to leave the TDX module in some middle state.
>
> but IMO it's just as pointless to do a shutdown unless the kernel benefits
> in some meaningful way.  And IIUC, TDH.SYS.LP.SHUTDOWN does nothing more
> than change the SEAM VMCS.HOST_RIP to point to an error trampoline.  E.g.
> it's not like doing a shutdown lets the kernel reclaim memory that was
> gifted to the TDX module.

The metadata memory has already been freed back to the kernel in case of
error before shutting down the module.

> In other words, this is just a really expensive way of changing a function
> pointer, and the only way it would ever benefit the kernel is if there is
> a kernel bug that leads to trying to use TDX after a fatal error.  And
> even then, the only difference seems to be that subsequent bogus SEAMCALLs
> would get a more unique error message.

The only issue with leaving the module open is, like you said, that bogus
SEAMCALLs can still be made.  There's a slight risk that those SEAMCALLs
may actually be able to do something that may crash the kernel (i.e. if
the module is shut down due to a TDH.SYS.INIT.TDMR error, then further
bogus TDH.SYS.INIT.TDMR calls can still corrupt the metadata).

But this belongs to the "kernel bug" or "kernel is under attack" category.
Neither of them should be a concern of TDX: the host kernel is out of the
TCB, and preventing DoS attacks is not part of TDX anyway.

Also, in case of kexec(), we cannot shut down the module either (in this
implementation, because the CPUs may not be in VMX operation when kexec()
runs).

So I agree with you: it's fine to not shut down the module.

Hi maintainers, does this look good to you?
On Wed, Nov 23, 2022, Huang, Kai wrote:
> On Tue, 2022-11-22 at 17:04 -0800, Dave Hansen wrote:
> > On 11/22/22 16:58, Huang, Kai wrote:
> > > On Tue, 2022-11-22 at 11:24 -0800, Dave Hansen wrote:
> > > > > I was expecting TDX to not get initialized until the first TDX using KVM
> > > > > instance is created.  Am I wrong?
> > > > I went looking for it in this series to prove you wrong.  I failed.
> > > It will be done when the KVM module is loaded, but not when the first TDX
> > > guest is created.
> >
> > Why is it done that way?
> >
> > Can it be changed to delay TDX initialization until the first TDX guest
> > needs to run?
>
> Sean suggested.
>
> Hi Sean, could you comment?

Waiting until the first TDX guest is created would result in false
advertising, as KVM wouldn't know whether or not TDX is actually supported
until that first VM is created.  If we can guarantee that TDH.SYS.INIT
will fail if and only if there is a kernel bug, then I would be ok
deferring the "enabling" until the first VM is created.
On 11/23/22 08:20, Sean Christopherson wrote:
>>> Why is it done that way?
>>>
>>> Can it be changed to delay TDX initialization until the first TDX guest
>>> needs to run?
>>>
>> Sean suggested.
>>
>> Hi Sean, could you comment?
>
> Waiting until the first TDX guest is created would result in false
> advertising, as KVM wouldn't know whether or not TDX is actually supported
> until that first VM is created.  If we can guarantee that TDH.SYS.INIT
> will fail if and only if there is a kernel bug, then I would be ok
> deferring the "enabling" until the first VM is created.

There's no way we can guarantee _that_.  For one, the PAMT* allocations
can always fail.  I guess we could ask sysadmins to fire up a guest to
"prime" things, but that seems a little silly.  Maybe that would work as
the initial implementation that we merge, but I suspect our users will
demand more determinism, maybe a boot or module parameter.

 * Physical Address Metadata Table, a large physically contiguous data
   structure, the rough equivalent of 'struct page' for the TDX module
On Wed, Nov 23, 2022, Dave Hansen wrote:
> On 11/23/22 08:20, Sean Christopherson wrote:
> >>> Why is it done that way?
> >>>
> >>> Can it be changed to delay TDX initialization until the first TDX guest
> >>> needs to run?
> >>>
> >> Sean suggested.
> >>
> >> Hi Sean, could you comment?
> > Waiting until the first TDX guest is created would result in false
> > advertising, as KVM wouldn't know whether or not TDX is actually supported
> > until that first VM is created.  If we can guarantee that TDH.SYS.INIT
> > will fail if and only if there is a kernel bug, then I would be ok
> > deferring the "enabling" until the first VM is created.
>
> There's no way we can guarantee _that_.  For one, the PAMT* allocations
> can always fail.  I guess we could ask sysadmins to fire up a guest to
> "prime" things, but that seems a little silly.  Maybe that would work as
> the initial implementation that we merge, but I suspect our users will
> demand more determinism, maybe a boot or module parameter.

Oh, you mean all of TDX initialization?  I thought "initialization" here
meant just doing tdx_enable().

Yeah, that's not going to be a viable option.  Aside from lacking
determinism, it would be all too easy to end up on a system with
fragmented memory that can't allocate the PAMTs post-boot.
On 11/23/22 09:37, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, Dave Hansen wrote:
>> There's no way we can guarantee _that_.  For one, the PAMT* allocations
>> can always fail.  I guess we could ask sysadmins to fire up a guest to
>> "prime" things, but that seems a little silly.  Maybe that would work as
>> the initial implementation that we merge, but I suspect our users will
>> demand more determinism, maybe a boot or module parameter.
> Oh, you mean all of TDX initialization?  I thought "initialization" here
> meant just doing tdx_enable().

Yes, but the first call to tdx_enable() does TDH_SYS_INIT and all the
subsequent work to get the module going.

> Yeah, that's not going to be a viable option.  Aside from lacking
> determinism, it would be all too easy to end up on a system with
> fragmented memory that can't allocate the PAMTs post-boot.

For now, the post-boot runtime PAMT allocations are the one and only way
that TDX can be initialized.  I pushed for it to be done this way.
Here's why:

Doing tdx_enable() is relatively slow and it eats up a non-zero amount
of physically contiguous RAM for metadata (~1/256th or ~0.4% of RAM).
Systems that support TDX but will never run TDX guests should not pay
that cost.

That means that we either make folks opt in at boot-time or we try to
make a best effort at runtime to do the metadata allocations.

From my perspective, the best-effort stuff is absolutely needed.  Users
are going to forget the command-line opt-in and there's no harm in
_trying_ the big allocations even if they fail.

Second, in reality, the "real" systems that can run TDX guests are
probably not going to sit around fragmenting memory for a month before
they run their first guest.  They're going to run one shortly after they
boot when memory isn't fragmented and the best-effort allocation will
work really well.

Third, if anyone *REALLY* cared to make it reliable *and* wanted to sit
around fragmenting memory for a month, they could just start a TDX guest
and kill it to get TDX initialized.  This isn't ideal.  But, to me, it
beats defining some new, separate ABI (or boot/module option) to do it.

So, let's have those discussions.  Long-term, what *is* the most reliable
way to get the TDX module loaded with 100% determinism?  What new ABI or
interfaces are needed?  Also, is that 100% determinism required the moment
this series is merged?  Or, can we work up to it?

I think it can wait until this particular series is farther along.
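[ To put the ~1/256th figure above in concrete terms: on a host with 1 TiB
  of RAM, the PAMT metadata works out to roughly 4 GiB in total, allocated
  as large physically contiguous chunks, which is the cost a TDX-capable
  but TDX-unused system would pay if the allocation were done
  unconditionally at boot. ]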
On Wed, Nov 23, 2022, Dave Hansen wrote:
> On 11/23/22 09:37, Sean Christopherson wrote:
> > On Wed, Nov 23, 2022, Dave Hansen wrote:
> >> There's no way we can guarantee _that_.  For one, the PAMT* allocations
> >> can always fail.  I guess we could ask sysadmins to fire up a guest to
> >> "prime" things, but that seems a little silly.  Maybe that would work as
> >> the initial implementation that we merge, but I suspect our users will
> >> demand more determinism, maybe a boot or module parameter.
> > Oh, you mean all of TDX initialization?  I thought "initialization" here
> > meant just doing tdx_enable().
>
> Yes, but the first call to tdx_enable() does TDH_SYS_INIT and all the
> subsequent work to get the module going.

Ah, sorry, I misread the diff.  Actually applied the patches this time...

> > Yeah, that's not going to be a viable option.  Aside from lacking
> > determinism, it would be all too easy to end up on a system with
> > fragmented memory that can't allocate the PAMTs post-boot.
>
> For now, the post-boot runtime PAMT allocations are the one and only way
> that TDX can be initialized.  I pushed for it to be done this way.
> Here's why:
>
> Doing tdx_enable() is relatively slow and it eats up a non-zero amount
> of physically contiguous RAM for metadata (~1/256th or ~0.4% of RAM).
> Systems that support TDX but will never run TDX guests should not pay
> that cost.
>
> That means that we either make folks opt in at boot-time or we try to
> make a best effort at runtime to do the metadata allocations.
>
> From my perspective, the best-effort stuff is absolutely needed.  Users
> are going to forget the command-line opt-in

Eh, any sufficiently robust deployment should be able to ensure that its
kernels boot with "required" command-line options.

> and there's no harm in _trying_ the big allocations even if they fail.

No, but there is "harm" if a host can't provide the functionality the
control plane thinks it supports.

> Second, in reality, the "real" systems that can run TDX guests are
> probably not going to sit around fragmenting memory for a month before
> they run their first guest.  They're going to run one shortly after they
> boot when memory isn't fragmented and the best-effort allocation will
> work really well.

I don't think this will hold true.  Long term, we (Google) want to have a
common pool for non-TDX and TDX VMs.  Forcing TDX VMs to use a dedicated
pool of hosts makes it much more difficult to react to demand, e.g. if we
carve out N hosts for TDX, but only use 10% of those hosts, then that's a
lot of wasted capacity/money.  IIRC, people have discussed dynamically
reconfiguring hosts so that systems could be moved in/out of a dedicated
pool, but that's still suboptimal, e.g. it would require emptying a host
to reboot+reconfigure.

If/when we end up with a common pool, then it's very likely that we could
have a TDX-capable host go weeks/months before launching its first TDX VM.

> Third, if anyone *REALLY* cared to make it reliable *and* wanted to sit
> around fragmenting memory for a month, they could just start a TDX guest
> and kill it to get TDX initialized.  This isn't ideal.  But, to me, it
> beats defining some new, separate ABI (or boot/module option) to do it.

That's a hack.  I have no objection to waiting until KVM is _loaded_ to
initialize TDX, but waiting until KVM_CREATE_VM is not acceptable.  Use
cases aside, KVM's ABI would be a mess, e.g. KVM couldn't use
KVM_CHECK_EXTENSION or any other /dev/kvm ioctl() to enumerate TDX support.

> So, let's have those discussions.  Long-term, what *is* the most reliable
> way to get the TDX module loaded with 100% determinism?  What new ABI or
> interfaces are needed?  Also, is that 100% determinism required the moment
> this series is merged?  Or, can we work up to it?

I don't think we (Google again) strictly need 100% determinism with respect
to enabling TDX; what's most important is that if a host says it's
TDX-capable, then it really is TDX-capable.  I'm sure we'll treat "failure
to load" as an error, but it doesn't necessarily need to be a fatal error
as the host can still run in a degraded state (no idea if we'll actually
do that though).

> I think it can wait until this particular series is farther along.

For an opt-in kernel param, agreed.  That can be added later, e.g. if it
turns out that the PAMT allocation failure rate is too high.
On 11/22/22 11:33, Peter Zijlstra wrote:
> On Tue, Nov 22, 2022 at 11:24:48AM -0800, Dave Hansen wrote:
>>> Not initialize TDX on busy NOHZ_FULL cpus and hard-limit the cpumask of
>>> all TDX using tasks.
>> I don't think that works.  As I mentioned to Thomas elsewhere, you don't
>> just need to initialize TDX on the CPUs where it is used.  Before the
>> module will start working you need to initialize it on *all* the CPUs it
>> knows about.  The module itself has a little counter where it tracks
>> this and will refuse to start being useful until it gets called
>> thoroughly enough.
> That's bloody terrible, that is.  How are we going to make that work with
> the SMT mitigation crud that forces the SMT sibling offline?
>
> Then the counters don't match and TDX won't work.
>
> Can we get this limitation removed and simply let the module throw a
> wobbly (error) when someone tries to use TDX without that logical CPU
> having been properly initialized?

It sounds like we can at least punt the limitation away from the OS's
purview.

There's actually a multi-step process to get a "real" TDX module loaded.
There's a fancy ACM (Authenticated Code Module) that's invoked via
GETSEC[ENTERACCS] and an intermediate module loader.  That dance used to
be done in the kernel, but we talked the BIOS guys into doing it instead.

I believe these per-logical-CPU checks _can_ also be punted out of the
TDX module itself and delegated to one of these earlier module loading
phases that the BIOS drives.

I'm still a _bit_ skeptical that the checks are needed in the first
place.  But, as long as they're hidden from the OS, I don't see a need
to be too cranky about it.

In the end, we could just plain stop doing the TDH.SYS.LP.INIT code in
the kernel.

Unless someone screams, I'll ask the BIOS and TDX module folks to look
into this.
On Tue, Nov 29 2022 at 13:40, Dave Hansen wrote:
> On 11/22/22 11:33, Peter Zijlstra wrote:
>> Can we get this limitation removed and simply let the module throw a
>> wobbly (error) when someone tries to use TDX without that logical CPU
>> having been properly initialized?
>
> It sounds like we can at least punt the limitation away from the OS's
> purview.
>
> There's actually a multi-step process to get a "real" TDX module loaded.
> There's a fancy ACM (Authenticated Code Module) that's invoked via
> GETSEC[ENTERACCS] and an intermediate module loader.  That dance used to
> be done in the kernel, but we talked the BIOS guys into doing it instead.
>
> I believe these per-logical-CPU checks _can_ also be punted out of the
> TDX module itself and delegated to one of these earlier module loading
> phases that the BIOS drives.
>
> I'm still a _bit_ skeptical that the checks are needed in the first
> place.  But, as long as they're hidden from the OS, I don't see a need
> to be too cranky about it.

Right.

> In the end, we could just plain stop doing the TDH.SYS.LP.INIT code in
> the kernel.

Which in turn makes all the problems we discussed go away.

> Unless someone screams, I'll ask the BIOS and TDX module folks to look
> into this.

Yes, please.

Thanks,

        tglx
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b06c1a2bc9cb..5db1a05cb4bd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,8 @@
 #include <linux/mutex.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/apic.h>
@@ -124,15 +126,27 @@ bool platform_tdx_enabled(void)
         return !!tdx_keyid_num;
 }
 
+/*
+ * Data structure to make SEAMCALL on multiple CPUs concurrently.
+ * @err is set to -EFAULT when SEAMCALL fails on any cpu.
+ */
+struct seamcall_ctx {
+        u64 fn;
+        u64 rcx;
+        u64 rdx;
+        u64 r8;
+        u64 r9;
+        atomic_t err;
+};
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
  * leaf function return code and the additional output respectively if
  * not NULL.
  */
-static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-                                    u64 *seamcall_ret,
-                                    struct tdx_module_output *out)
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                    u64 *seamcall_ret, struct tdx_module_output *out)
 {
         u64 sret;
 
@@ -166,6 +180,25 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
         }
 }
 
+static void seamcall_smp_call_function(void *data)
+{
+        struct seamcall_ctx *sc = data;
+        int ret;
+
+        ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9, NULL, NULL);
+        if (ret)
+                atomic_set(&sc->err, -EFAULT);
+}
+
+/*
+ * Call the SEAMCALL on all online CPUs concurrently.  Caller to check
+ * @sc->err to determine whether any SEAMCALL failed on any cpu.
+ */
+static void seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+        on_each_cpu(seamcall_smp_call_function, sc, true);
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -181,7 +214,9 @@ static int init_tdx_module(void)
 
 static void shutdown_tdx_module(void)
 {
-        /* TODO: Shut down the TDX module */
+        struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+        seamcall_on_each_cpu(&sc);
 }
 
 static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 92a8de957dc7..215cc1065d78 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -12,6 +12,11 @@
 /* MSR to report KeyID partitioning between MKTME and TDX */
 #define MSR_IA32_MKTME_KEYID_PARTITIONING        0x00000087
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN        44
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!