Message ID | 557c526a1190903d11d67c4e2c76e01f67f6eb15.1676286526.git.kai.huang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | TDX host kernel support |
On 2/13/23 03:59, Kai Huang wrote: > To avoid duplicated code, add a > helper to call SEAMCALL on all online cpus one by one but with a skip > function to check whether to skip certain cpus, and use that helper to > do the per-cpu initialization. ... > +/* > + * Call @func on all online cpus one by one but skip those cpus > + * when @skip_func is valid and returns true for them. > + */ > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, > + bool (*skip_func)(int cpu, void *), > + void *skip_data) I only see one caller of this. Where is the duplicated code?
On 2/13/23 03:59, Kai Huang wrote: > @@ -247,8 +395,17 @@ int tdx_enable(void) > ret = __tdx_enable(); > break; > case TDX_MODULE_INITIALIZED: > - /* Already initialized, great, tell the caller. */ > - ret = 0; > + /* > + * The previous call of __tdx_enable() may only have > + * initialized part of present cpus during module > + * initialization, and new cpus may have become online > + * since then. > + * > + * To make sure all online cpus are TDX-runnable, always > + * do per-cpu initialization for all online cpus here > + * even the module has been initialized. > + */ > + ret = __tdx_enable_online_cpus(); I'm missing something here. CPUs get initialized through either: 1. __tdx_enable(), for the CPUs around at the time 2. tdx_cpu_online(), for hotplugged CPUs after __tdx_enable() But, this is a third class. CPUs that came online after #1, but which got missed by #2. How can that happen?
> On 2/13/23 03:59, Kai Huang wrote: > > @@ -247,8 +395,17 @@ int tdx_enable(void) > > ret = __tdx_enable(); > > break; > > case TDX_MODULE_INITIALIZED: > > - /* Already initialized, great, tell the caller. */ > > - ret = 0; > > + /* > > + * The previous call of __tdx_enable() may only have > > + * initialized part of present cpus during module > > + * initialization, and new cpus may have become online > > + * since then. > > + * > > + * To make sure all online cpus are TDX-runnable, always > > + * do per-cpu initialization for all online cpus here > > + * even the module has been initialized. > > + */ > > + ret = __tdx_enable_online_cpus(); > > I'm missing something here. CPUs get initialized through either: > > 1. __tdx_enable(), for the CPUs around at the time 2. tdx_cpu_online(), for > hotplugged CPUs after __tdx_enable() > > But, this is a third class. CPUs that came online after #1, but which got missed > by #2. How can that happen? (Replying via Microsoft Outlook because my Evolution suddenly stopped working after updating Fedora). Currently we depend on KVM's CPU hotplug callback to call tdx_cpu_online(). The problem is that KVM's callback can go away when the KVM module gets unloaded. For example: 1) KVM module loaded when CPU 0, 1, 2 are online, CPU 3, 4, 5 are offline. 2) __tdx_enable() gets called. LP.INIT is done on CPU 0, 1, 2. 3) KVM gets unloaded. Its CPU hotplug callbacks are removed too. 4) CPU 3 becomes online. In this case, tdx_cpu_online() is not called for it as KVM's CPU hotplug callback is gone. So later if KVM gets loaded again, we need to go through __tdx_enable_online_cpus() to do LP.INIT for CPU 3 as it's already online. Perhaps I didn't explain clearly in the comment. Below is the updated one: /* * The previous call of __tdx_enable() may only have * initialized part of present cpus during module * initialization, and new cpus may have become online * since then w/o doing per-cpu initialization. * * For example, a new CPU can become online when KVM is * unloaded, in which case tdx_cpu_enable() is not called since * KVM's CPU online callback has been removed. * * To make sure all online cpus are TDX-runnable, always * do per-cpu initialization for all online cpus here * even the module has been initialized. */
> On 2/13/23 03:59, Kai Huang wrote: > > To avoid duplicated code, add a > > helper to call SEAMCALL on all online cpus one by one but with a skip > > function to check whether to skip certain cpus, and use that helper to > > do the per-cpu initialization. > ... > > +/* > > + * Call @func on all online cpus one by one but skip those cpus > > + * when @skip_func is valid and returns true for them. > > + */ > > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, > > + bool (*skip_func)(int cpu, void *), > > + void *skip_data) > > I only see one caller of this. Where is the duplicated code? The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages). I kinda mentioned this in the changelog: " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..." If we don't have this helper, then we can end up with having below loop in two functions: for_each_online(cpu) { if (should_skip(cpu)) continue; // call @func on @cpu. }
On 2/13/23 13:13, Huang, Kai wrote: > Perhaps I didn't explain clearly in the comment. Below is the updated one: > > /* > * The previous call of __tdx_enable() may only have > * initialized part of present cpus during module > * initialization, and new cpus may have become online > * since then w/o doing per-cpu initialization. > * > * For example, a new CPU can become online when KVM is > * unloaded, in which case tdx_cpu_enable() is not called since > * KVM's CPU online callback has been removed. > * > * To make sure all online cpus are TDX-runnable, always > * do per-cpu initialization for all online cpus here > * even the module has been initialized. > */ This is voodoo. I want a TDX-specific hotplug CPU handler. Period. Please make that happen. Put that code in this patch. That handler should: 1. Run after the KVM handler (if present) 2. See if VMX is on 3. If VMX is on: 3a. Run smp_func_module_lp_init(), else 3b. Mark the CPU as needing smp_func_module_lp_init() Then, in the 'case TDX_MODULE_INITIALIZED:', you call a function to iterate over the cpumask that was generated in 3b. That makes the handoff *EXPLICIT*. You know exactly which CPUs need what done to them. A CPU hotplug either explicitly involves doing the work to make TDX work on the CPU, or explicitly defers the work to a specific later time in a specific later piece of code.
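[For illustration only: a minimal sketch of the dedicated handler described above. It assumes a dynamic AP hotplug state (so the callback runs after the statically ordered online callbacks, including KVM's, which is what does VMXON) and a hypothetical tdx_lp_init_pending mask for step 3b. smp_func_module_lp_init() is taken from the patch; every other name here is invented for the sketch.]

#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
#include <asm/processor-flags.h>
#include <asm/special_insns.h>

/* Hypothetical mask for step 3b: CPUs that still need LP.INIT. */
static cpumask_t tdx_lp_init_pending;

static int tdx_online_cpu(unsigned int cpu)
{
        /* Step 2: is this CPU already in VMX operation? */
        if (__read_cr4() & X86_CR4_VMXE)
                /* Step 3a: do the per-cpu SEAMCALL right away. */
                return smp_func_module_lp_init(NULL);

        /* Step 3b: defer; tdx_enable() later walks this mask. */
        cpumask_set_cpu(cpu, &tdx_lp_init_pending);
        return 0;
}

static int __init tdx_register_cpuhp_handler(void)
{
        int ret;

        /*
         * Step 1: a dynamic AP state is invoked after the statically
         * ordered online callbacks, which include KVM's (the one that
         * performs VMXON).
         */
        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/tdx:online",
                                tdx_online_cpu, NULL);
        return ret < 0 ? ret : 0;
}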
On 2/13/23 13:19, Huang, Kai wrote: >> On 2/13/23 03:59, Kai Huang wrote: >>> To avoid duplicated code, add a >>> helper to call SEAMCALL on all online cpus one by one but with a skip >>> function to check whether to skip certain cpus, and use that helper to >>> do the per-cpu initialization. >> ... >>> +/* >>> + * Call @func on all online cpus one by one but skip those cpus >>> + * when @skip_func is valid and returns true for them. >>> + */ >>> +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, >>> + bool (*skip_func)(int cpu, void *), >>> + void *skip_data) >> I only see one caller of this. Where is the duplicated code? > The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages). > > I kinda mentioned this in the changelog: > > " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..." > > If we don't have this helper, then we can end up with having below loop in two functions: > > for_each_online(cpu) { > if (should_skip(cpu)) > continue; > > // call @func on @cpu. > } I don't think saving two lines of actual code is worth the opacity that results from this abstraction.
On Mon, 2023-02-13 at 14:28 -0800, Dave Hansen wrote: > On 2/13/23 13:13, Huang, Kai wrote: > > Perhaps I didn't explain clearly in the comment. Below is the updated one: > > > > /* > > * The previous call of __tdx_enable() may only have > > * initialized part of present cpus during module > > * initialization, and new cpus may have become online > > * since then w/o doing per-cpu initialization. > > * > > * For example, a new CPU can become online when KVM is > > * unloaded, in which case tdx_cpu_enable() is not called since > > * KVM's CPU online callback has been removed. > > * > > * To make sure all online cpus are TDX-runnable, always > > * do per-cpu initialization for all online cpus here > > * even the module has been initialized. > > */ > > This is voodoo. > > I want a TDX-specific hotplug CPU handler. Period. Please make that > happen. > Yes 100% agreed. > Put that code in this patch. That handler should: > > 1. Run after the KVM handler (if present) > 2. See if VMX is on > 3. If VMX is on: > 3a. Run smp_func_module_lp_init(), else > 3b. Mark the CPU as needing smp_func_module_lp_init() > > Then, in the 'case TDX_MODULE_INITIALIZED:', you call a function to > iterate over the cpumask that was generated in 3b. > > That makes the handoff *EXPLICIT*. You know exactly which CPUs need > what done to them. A CPU hotplug either explicitly involves doing the > work to make TDX work on the CPU, or explicitly defers the work to a > specific later time in a specific later piece of code. In 3b. we don't need to "explicitly mark the CPU as needing smp_func_module_lp_init()". We already have __cpu_tdx_mask to track whether LP.INIT has been done on one cpu and we can use that to determine: Any online cpu which isn't set in __cpu_tdx_mask needs to do LP.INIT in tdx_enable(). And the function module_lp_init_online_cpus() already handles that, and it can be called directly in tdx_enable() path (as shown in this patch). I'll do above as you suggested, but just use __cpu_tdx_mask as explained above. ( My main concern is "Run after the KVM handler" seems a little bit hacky to me. Logically, it's more reasonable to have the TDX callback _before_ KVM's but not _after_. If any user (KVM) has done tdx_enable() successfully, the TDX code should give the user a "TDX-runnable" cpu before user (KVM)'s own callback is involved. Anyway as mentioned above, I'll do above as you suggested.)
On 2/13/23 15:43, Huang, Kai wrote: > ( My main concern is "Run after the KVM handler" seems a little bit hacky to me. > Logically, it's more reasonable to have the TDX callback _before_ KVM's but not > _after_. If any user (KVM) has done tdx_enable() successfully, the TDX code > should give the user a "TDX-runnable" cpu before user (KVM)'s own callback is > involved. Anyway as mentioned above, I'll do above as you suggested.) I was assuming that the KVM callback is what does VMXON for a given logical CPU. If that were the case, you'd need to do the TDX stuff *AFTER* VMXON. Am I wrong?
On Mon, 2023-02-13 at 14:43 -0800, Dave Hansen wrote: > On 2/13/23 13:19, Huang, Kai wrote: > > > On 2/13/23 03:59, Kai Huang wrote: > > > > To avoid duplicated code, add a > > > > helper to call SEAMCALL on all online cpus one by one but with a skip > > > > function to check whether to skip certain cpus, and use that helper to > > > > do the per-cpu initialization. > > > ... > > > > +/* > > > > + * Call @func on all online cpus one by one but skip those cpus > > > > + * when @skip_func is valid and returns true for them. > > > > + */ > > > > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, > > > > + bool (*skip_func)(int cpu, void *), > > > > + void *skip_data) > > > I only see one caller of this. Where is the duplicated code? > > The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages). > > > > I kinda mentioned this in the changelog: > > > > " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..." > > > > If we don't have this helper, then we can end up with having below loop in two functions: > > > > for_each_online(cpu) { > > if (should_skip(cpu)) > > continue; > > > > // call @func on @cpu. > > } > > I don't think saving two lines of actual code is worth the opacity that > results from this abstraction. Alright thanks for the suggestion. I'll remove this tdx_on_each_cpu_cond() and do directly. But just checking: LP.INIT can actually be called in parallel on different cpus (doesn't have to, of course), so we can actually just use on_each_cpu_cond() for LP.INIT: on_each_cpu_cond(should_skip_cpu, smp_func_module_lp_init, NULL, true); But IIUC Peter doesn't like using IPI and prefers using via work: https://lore.kernel.org/lkml/Y30dujuXC8wlLwoQ@hirez.programming.kicks-ass.net/ So I used smp_call_on_cpu() here, which only calls @func on one cpu, but not a cpumask. For LP.INIT ideally we can have something like: schedule_on_cpu(struct cpumask *cpus, work_func_t func); to call @func on a cpu set, but that doesn't exist now, and I don't think it's worth to introduce it? So, should I use on_each_cpu_cond(), or use smp_call_on_cpu() here? (btw for the TDH.SYS.KEY.CONFIG we must do one by one as it cannot run in parallel on multi cpus, so I'll use smp_call_on_cpu().)
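[For reference, the two call shapes being weighed here differ in fan-out and execution context. A sketch, reusing cpu_tdx_mask, seamcall_lp_init() and smp_func_module_lp_init() from the patch purely for illustration:]

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

static bool lp_init_needed(int cpu, void *info)
{
        return !cpumask_test_cpu(cpu, cpu_tdx_mask);
}

/*
 * on_each_cpu_cond(): concurrent IPIs.  The function runs in interrupt
 * context and cannot return an error, so failure has to be recorded
 * out of band -- this is the IPI-based shape Peter objected to.
 */
static void lp_init_ipi(void *info)
{
        if (!seamcall_lp_init())
                cpumask_set_cpu(smp_processor_id(), cpu_tdx_mask);
}

static void lp_init_all_cpus_via_ipi(void)
{
        on_each_cpu_cond(lp_init_needed, lp_init_ipi, NULL, true);
}

/*
 * smp_call_on_cpu(): one CPU at a time via a work item, so the caller
 * sleeps rather than busy-waits and gets the SEAMCALL result back.
 */
static int lp_init_all_cpus_via_work(void)
{
        int cpu, ret;

        for_each_online_cpu(cpu) {
                if (!lp_init_needed(cpu, NULL))
                        continue;
                ret = smp_call_on_cpu(cpu, smp_func_module_lp_init, NULL, true);
                if (ret)
                        return ret;
        }
        return 0;
}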
On Mon, 2023-02-13 at 15:52 -0800, Dave Hansen wrote: > On 2/13/23 15:43, Huang, Kai wrote: > > ( My main concern is "Run after the KVM handler" seems a little bit hacky to me. > > Logically, it's more reasonable to have the TDX callback _before_ KVM's but not > > _after_. If any user (KVM) has done tdx_enable() successfully, the TDX code > > should give the user a "TDX-runnable" cpu before user (KVM)'s own callback is > > involved. Anyway as mentioned above, I'll do above as you suggested.) > > I was assuming that the KVM callback is what does VMXON for a given > logical CPU. If that were the case, you'd need to do the TDX stuff > *AFTER* VMXON. > > Am I wrong? > > You are right. What I meant was: because we chose not to support VMXON in the (non-KVM) kernel, we have to put TDX's callback after KVM's. Otherwise a better way might be to put TDX's callback before KVM's. But that's arguable, so let's just do TDX's callback after KVM's as you suggested.
On Tue, Feb 14, 2023 at 12:59:14AM +1300, Kai Huang wrote: > +/* > + * Call @func on all online cpus one by one but skip those cpus > + * when @skip_func is valid and returns true for them. > + */ > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, > + bool (*skip_func)(int cpu, void *), > + void *skip_data) > +{ > + int cpu; > + > + for_each_online_cpu(cpu) { > + int ret; > + > + if (skip_func && skip_func(cpu, skip_data)) > + continue; > + > + /* > + * SEAMCALL can be time consuming. Call the @func on > + * remote cpu via smp_call_on_cpu() instead of > + * smp_call_function_single() to avoid busy waiting. > + */ > + ret = smp_call_on_cpu(cpu, func, func_data, true); > + if (ret) > + return ret; > + } > + > + return 0; > +} schedule_on_each_cpu() ?
On Tue, Feb 14, 2023 at 12:02:22AM +0000, Huang, Kai wrote: > On Mon, 2023-02-13 at 14:43 -0800, Dave Hansen wrote: > > On 2/13/23 13:19, Huang, Kai wrote: > > > > On 2/13/23 03:59, Kai Huang wrote: > > > > > To avoid duplicated code, add a > > > > > helper to call SEAMCALL on all online cpus one by one but with a skip > > > > > function to check whether to skip certain cpus, and use that helper to > > > > > do the per-cpu initialization. > > > > ... > > > > > +/* > > > > > + * Call @func on all online cpus one by one but skip those cpus > > > > > + * when @skip_func is valid and returns true for them. > > > > > + */ > > > > > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, > > > > > + bool (*skip_func)(int cpu, void *), > > > > > + void *skip_data) > > > > I only see one caller of this. Where is the duplicated code? > > > The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages). > > > > > > I kinda mentioned this in the changelog: > > > > > > " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..." > > > > > > If we don't have this helper, then we can end up with having below loop in two functions: > > > > > > for_each_online(cpu) { > > > if (should_skip(cpu)) > > > continue; > > > > > > // call @func on @cpu. > > > } > > > > I don't think saving two lines of actual code is worth the opacity that > > results from this abstraction. > > Alright thanks for the suggestion. I'll remove this tdx_on_each_cpu_cond() and > do directly. > > But just checking: > > LP.INIT can actually be called in parallel on different cpus (doesn't have to, > of course), so we can actually just use on_each_cpu_cond() for LP.INIT: > > on_each_cpu_cond(should_skip_cpu, smp_func_module_lp_init, NULL, true); > > But IIUC Peter doesn't like using IPI and prefers using via work: > > https://lore.kernel.org/lkml/Y30dujuXC8wlLwoQ@hirez.programming.kicks-ass.net/ > > So I used smp_call_on_cpu() here, which only calls @func on one cpu, but not a > cpumask. For LP.INIT ideally we can have something like: > > schedule_on_cpu(struct cpumask *cpus, work_func_t func); > > to call @func on a cpu set, but that doesn't exist now, and I don't think it's > worth to introduce it? schedule_on_each_cpu() exists and can easily be extended to take a cond function if you so please.
On Mon, Feb 13, 2023 at 10:07:30AM -0800, Dave Hansen wrote: > On 2/13/23 03:59, Kai Huang wrote: > > @@ -247,8 +395,17 @@ int tdx_enable(void) > > ret = __tdx_enable(); > > break; > > case TDX_MODULE_INITIALIZED: > > - /* Already initialized, great, tell the caller. */ > > - ret = 0; > > + /* > > + * The previous call of __tdx_enable() may only have > > + * initialized part of present cpus during module > > + * initialization, and new cpus may have become online > > + * since then. > > + * > > + * To make sure all online cpus are TDX-runnable, always > > + * do per-cpu initialization for all online cpus here > > + * even the module has been initialized. > > + */ > > + ret = __tdx_enable_online_cpus(); > > I'm missing something here. CPUs get initialized through either: > > 1. __tdx_enable(), for the CPUs around at the time > 2. tdx_cpu_online(), for hotplugged CPUs after __tdx_enable() > > But, this is a third class. CPUs that came online after #1, but which > got missed by #2. How can that happen? offline CPUs, start TDX crap, online CPUs.
> > > > But just checking: > > > > LP.INIT can actually be called in parallel on different cpus (doesn't have to, > > of course), so we can actually just use on_each_cpu_cond() for LP.INIT: > > > > on_each_cpu_cond(should_skip_cpu, smp_func_module_lp_init, NULL, true); > > > > But IIUC Peter doesn't like using IPI and prefers using via work: > > > > https://lore.kernel.org/lkml/Y30dujuXC8wlLwoQ@hirez.programming.kicks-ass.net/ > > > > So I used smp_call_on_cpu() here, which only calls @func on one cpu, but not a > > cpumask. For LP.INIT ideally we can have something like: > > > > schedule_on_cpu(struct cpumask *cpus, work_func_t func); > > > > to call @func on a cpu set, but that doesn't exist now, and I don't think it's > > worth to introduce it? > > schedule_on_each_cpu() exists and can easily be extended to take a cond > function if you so please. > Sure. I just tried to do. There are two minor things: 1) should I just use smp_cond_func_t directly as the cond function? 2) schedule_on_each_cpu() takes cpus_read_lock() internally. However in my case, tdx_enable() already takes that so I need a _locked_ version. How does below look like? (Not tested) +/** + * schedule_on_each_cpu_cond_locked - execute a function synchronously + * on each online CPU for which the + * condition function returns positive + * @func: the function to call + * @cond_func: the condition function to call + * @cond_data: the data passed to the condition function + * + * schedule_on_each_cpu_cond_locked() executes @func on each online CPU + * when @cond_func returns positive for that cpu, using the system + * workqueue and blocks until all CPUs have completed. + * + * schedule_on_each_cpu_cond_locked() doesn't hold read lock of CPU + * hotplug lock but depend on the caller to do. + * + * schedule_on_each_cpu_cond_locked() is very slow. + * + * Return: + * 0 on success, -errno on failure. + */ +int schedule_on_each_cpu_cond_locked(work_func_t func, + smp_cond_func_t cond_func, + void *cond_data) +{ + int cpu; + struct work_struct __percpu *works; + + works = alloc_percpu(struct work_struct); + if (!works) + return -ENOMEM; + + for_each_online_cpu(cpu) { + struct work_struct *work = per_cpu_ptr(works, cpu); + + if (cond_func && !cond_func(cpu, cond_data)) + continue; + + INIT_WORK(work, func); + schedule_work_on(cpu, work); + } + + for_each_online_cpu(cpu) + flush_work(per_cpu_ptr(works, cpu)); + + free_percpu(works); + return 0; +} + +/** + * schedule_on_each_cpu_cond - execute a function synchronously on each + * online CPU for which the condition + * function returns positive + * @func: the function to call + * @cond_func: the condition function to call + * @cond_data: the data passed to the condition function + * + * schedule_on_each_cpu_cond() executes @func on each online CPU + * when @cond_func returns positive for that cpu, using the system + * workqueue and blocks until all CPUs have completed. + * + * schedule_on_each_cpu_cond() is very slow. + * + * Return: + * 0 on success, -errno on failure. + */ +int schedule_on_each_cpu_cond(work_func_t func, + smp_cond_func_t cond_func, + void *cond_data) +{ + int ret; + + cpus_read_lock(); + + ret = schedule_on_each_cpu_cond_locked(func, cond_func, cond_data); + + cpus_read_unlock(); + + return ret; +}
On Tue, Feb 14, 2023 at 10:53:26PM +0000, Huang, Kai wrote: > Sure. I just tried to do. There are two minor things: > > 1) should I just use smp_cond_func_t directly as the cond function? Yeah, might as well I suppose... > 2) schedule_on_each_cpu() takes cpus_read_lock() internally. However in my > case, tdx_enable() already takes that so I need a _locked_ version. > > How does below look like? (Not tested) > > +/** > + * schedule_on_each_cpu_cond_locked - execute a function synchronously > + * on each online CPU for which the > + * condition function returns positive > + * @func: the function to call > + * @cond_func: the condition function to call > + * @cond_data: the data passed to the condition function > + * > + * schedule_on_each_cpu_cond_locked() executes @func on each online CPU > + * when @cond_func returns positive for that cpu, using the system > + * workqueue and blocks until all CPUs have completed. > + * > + * schedule_on_each_cpu_cond_locked() doesn't hold read lock of CPU > + * hotplug lock but depend on the caller to do. > + * > + * schedule_on_each_cpu_cond_locked() is very slow. > + * > + * Return: > + * 0 on success, -errno on failure. > + */ > +int schedule_on_each_cpu_cond_locked(work_func_t func, > + smp_cond_func_t cond_func, > + void *cond_data) > +{ > + int cpu; > + struct work_struct __percpu *works; > + > + works = alloc_percpu(struct work_struct); > + if (!works) > + return -ENOMEM; > + > + for_each_online_cpu(cpu) { > + struct work_struct *work = per_cpu_ptr(works, cpu); > + > + if (cond_func && !cond_func(cpu, cond_data)) > + continue; > + > + INIT_WORK(work, func); > + schedule_work_on(cpu, work); > + } > + > + for_each_online_cpu(cpu) I think you need to skip some flushes too. Given we skip setting work->func, this will go WARN, see __flush_work(). > + flush_work(per_cpu_ptr(works, cpu)); > + > + free_percpu(works); > + return 0; > +} > + > +/** > + * schedule_on_each_cpu_cond - execute a function synchronously on each > + * online CPU for which the condition > + * function returns positive > + * @func: the function to call > + * @cond_func: the condition function to call > + * @cond_data: the data passed to the condition function > + * > + * schedule_on_each_cpu_cond() executes @func on each online CPU > + * when @cond_func returns positive for that cpu, using the system > + * workqueue and blocks until all CPUs have completed. > + * > + * schedule_on_each_cpu_cond() is very slow. > + * > + * Return: > + * 0 on success, -errno on failure. > + */ > +int schedule_on_each_cpu_cond(work_func_t func, > + smp_cond_func_t cond_func, > + void *cond_data) > +{ > + int ret; > + > + cpus_read_lock(); > + > + ret = schedule_on_each_cpu_cond_locked(func, cond_func, cond_data); > + > + cpus_read_unlock(); > + > + return ret; > +} Also, re-implement schedule_on_each_cpu() using the above to save a bunch of duplication: int schedule_on_each_cpu(work_func_t func) { return schedule_on_each_cpu_cond(func, NULL, NULL); } That said, I find it jarring that the schedule_on*() family doesn't have a void* argument to the function, like the smp_*() family has. So how about something like the below (equally untested). It preserves the current semantics, but allows a work function to cast to schedule_work and access ->info if it so desires. 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index a0143dd24430..5e97111322b2 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -103,6 +103,11 @@ struct work_struct { #endif }; +struct schedule_work { + struct work_struct work; + void *info; +}; + #define WORK_DATA_INIT() ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL) #define WORK_DATA_STATIC_INIT() \ ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC)) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 07895deca271..c73bb8860bbc 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -51,6 +51,7 @@ #include <linux/sched/isolation.h> #include <linux/nmi.h> #include <linux/kvm_para.h> +#include <linux/smp.h> #include "workqueue_internal.h" @@ -3302,43 +3303,64 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork) } EXPORT_SYMBOL(cancel_delayed_work_sync); -/** - * schedule_on_each_cpu - execute a function synchronously on each online CPU - * @func: the function to call - * - * schedule_on_each_cpu() executes @func on each online CPU using the - * system workqueue and blocks until all CPUs have completed. - * schedule_on_each_cpu() is very slow. - * - * Return: - * 0 on success, -errno on failure. - */ -int schedule_on_each_cpu(work_func_t func) +int schedule_on_each_cpu_cond_locked(work_func_t func, smp_cond_func_t cond_func, void *info) { + struct schedule_work __percpu *works; int cpu; - struct work_struct __percpu *works; - works = alloc_percpu(struct work_struct); + works = alloc_percpu(struct schedule_work); if (!works) return -ENOMEM; - cpus_read_lock(); - for_each_online_cpu(cpu) { - struct work_struct *work = per_cpu_ptr(works, cpu); + struct schedule_work *work = per_cpu_ptr(works, cpu); - INIT_WORK(work, func); - schedule_work_on(cpu, work); + if (cond_func && !cond_func(cpu, info)) + continue; + + INIT_WORK(&work->work, func); + work->info = info; + schedule_work_on(cpu, &work->work); } - for_each_online_cpu(cpu) - flush_work(per_cpu_ptr(works, cpu)); + for_each_online_cpu(cpu) { + struct schedule_work *work = per_cpu_ptr(works, cpu); + + if (work->work.func) + flush_work(&work->work); + } - cpus_read_unlock(); free_percpu(works); return 0; } +int schedule_on_each_cpu_cond(work_func_t func, smp_cond_func_t cond_func, void *info) +{ + int ret; + + cpus_read_lock(); + ret = schedule_on_each_cpu_cond_locked(func, cond, info); + cpus_read_unlock(); + + return ret; +} + +/** + * schedule_on_each_cpu - execute a function synchronously on each online CPU + * @func: the function to call + * + * schedule_on_each_cpu() executes @func on each online CPU using the + * system workqueue and blocks until all CPUs have completed. + * schedule_on_each_cpu() is very slow. + * + * Return: + * 0 on success, -errno on failure. + */ +int schedule_on_each_cpu(work_func_t func) +{ + return schedule_on_each_cpu_cond(func, NULL, NULL); +} + /** * execute_in_process_context - reliably execute the routine with user context * @fn: the function to execute
> > + > > + for_each_online_cpu(cpu) > > I think you need to skip some flushes too. Given we skip setting > work->func, this will go WARN, see __flush_work(). I missed. Thanks for pointing out. > > > + flush_work(per_cpu_ptr(works, cpu)); > > + > [...] > That said, I find it jarring that the schedule_on*() family doesn't have > a void* argument to the function, like the smp_*() family has. So how > about something like the below (equally untested). It preserves the > current semantics, but allows a work function to cast to schedule_work > and access ->info if it so desires. Yes agreed. Your code below looks indeed better. Thanks! Would you mind send me a patch so I can include to this series, or would you mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can rebase? Appreciate your help! > > > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h > index a0143dd24430..5e97111322b2 100644 > --- a/include/linux/workqueue.h > +++ b/include/linux/workqueue.h > @@ -103,6 +103,11 @@ struct work_struct { > #endif > }; > > +struct schedule_work { > + struct work_struct work; > + void *info; > +}; > + > #define WORK_DATA_INIT() ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL) > #define WORK_DATA_STATIC_INIT() \ > ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC)) > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 07895deca271..c73bb8860bbc 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -51,6 +51,7 @@ > #include <linux/sched/isolation.h> > #include <linux/nmi.h> > #include <linux/kvm_para.h> > +#include <linux/smp.h> > > #include "workqueue_internal.h" > > @@ -3302,43 +3303,64 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork) > } > EXPORT_SYMBOL(cancel_delayed_work_sync); > > -/** > - * schedule_on_each_cpu - execute a function synchronously on each online CPU > - * @func: the function to call > - * > - * schedule_on_each_cpu() executes @func on each online CPU using the > - * system workqueue and blocks until all CPUs have completed. > - * schedule_on_each_cpu() is very slow. > - * > - * Return: > - * 0 on success, -errno on failure. 
> - */ > -int schedule_on_each_cpu(work_func_t func) > +int schedule_on_each_cpu_cond_locked(work_func_t func, smp_cond_func_t cond_func, void *info) > { > + struct schedule_work __percpu *works; > int cpu; > - struct work_struct __percpu *works; > > - works = alloc_percpu(struct work_struct); > + works = alloc_percpu(struct schedule_work); > if (!works) > return -ENOMEM; > > - cpus_read_lock(); > - > for_each_online_cpu(cpu) { > - struct work_struct *work = per_cpu_ptr(works, cpu); > + struct schedule_work *work = per_cpu_ptr(works, cpu); > > - INIT_WORK(work, func); > - schedule_work_on(cpu, work); > + if (cond_func && !cond_func(cpu, info)) > + continue; > + > + INIT_WORK(&work->work, func); > + work->info = info; > + schedule_work_on(cpu, &work->work); > } > > - for_each_online_cpu(cpu) > - flush_work(per_cpu_ptr(works, cpu)); > + for_each_online_cpu(cpu) { > + struct schedule_work *work = per_cpu_ptr(works, cpu); > + > + if (work->work.func) > + flush_work(&work->work); > + } > > - cpus_read_unlock(); > free_percpu(works); > return 0; > } > > +int schedule_on_each_cpu_cond(work_func_t func, smp_cond_func_t cond_func, void *info) > +{ > + int ret; > + > + cpus_read_lock(); > + ret = schedule_on_each_cpu_cond_locked(func, cond, info); > + cpus_read_unlock(); > + > + return ret; > +} > + > +/** > + * schedule_on_each_cpu - execute a function synchronously on each online CPU > + * @func: the function to call > + * > + * schedule_on_each_cpu() executes @func on each online CPU using the > + * system workqueue and blocks until all CPUs have completed. > + * schedule_on_each_cpu() is very slow. > + * > + * Return: > + * 0 on success, -errno on failure. > + */ > +int schedule_on_each_cpu(work_func_t func) > +{ > + return schedule_on_each_cpu_cond(func, NULL, NULL); > +} > + > /** > * execute_in_process_context - reliably execute the routine with user context > * @fn: the function to execute
On Wed, Feb 15, 2023 at 09:46:10AM +0000, Huang, Kai wrote: > Yes agreed. Your code below looks indeed better. Thanks! > > Would you mind send me a patch so I can include to this series, or would you > mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can > rebase? Just take the patch, add your comments and test it.. enjoy! :-)
On Wed, 2023-02-15 at 14:25 +0100, Peter Zijlstra wrote: > On Wed, Feb 15, 2023 at 09:46:10AM +0000, Huang, Kai wrote: > > Yes agreed. Your code below looks indeed better. Thanks! > > > > Would you mind send me a patch so I can include to this series, or would you > > mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can > > rebase? > > Just take the patch, add your comments and test it.. enjoy! :-) Thank you! I'll at least add your Suggested-by :)
On Wed, 2023-02-15 at 21:37 +0000, Huang, Kai wrote: > On Wed, 2023-02-15 at 14:25 +0100, Peter Zijlstra wrote: > > On Wed, Feb 15, 2023 at 09:46:10AM +0000, Huang, Kai wrote: > > > Yes agreed. Your code below looks indeed better. Thanks! > > > > > > Would you mind send me a patch so I can include to this series, or would you > > > mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can > > > rebase? > > > > Just take the patch, add your comments and test it.. enjoy! :-) > > Thank you! I'll at least add your Suggested-by :) Hi Peter, After discussing with Kirill, I changed how the per-cpu initialization is handled, and in the new version (v10, just sent) I don't need the schedule_on_each_cpu_cond_locked() anymore, because I essentially moved such handling out of the TDX host series to KVM. I'll use your patch if review finds we still need to handle it there. Thanks!
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 5c5ecfddb15b..2b2efaa4bc0e 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -107,9 +107,11 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, #ifdef CONFIG_INTEL_TDX_HOST bool platform_tdx_enabled(void); int tdx_enable(void); +int tdx_cpu_online(unsigned int cpu); #else /* !CONFIG_INTEL_TDX_HOST */ static inline bool platform_tdx_enabled(void) { return false; } static inline int tdx_enable(void) { return -EINVAL; } +static inline int tdx_cpu_online(unsigned int cpu) { return 0; } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 79cee28c51b5..23b2db28726f 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -13,6 +13,8 @@ #include <linux/errno.h> #include <linux/printk.h> #include <linux/mutex.h> +#include <linux/cpumask.h> +#include <linux/cpu.h> #include <asm/msr-index.h> #include <asm/msr.h> #include <asm/tdx.h> @@ -26,6 +28,10 @@ static enum tdx_module_status_t tdx_module_status; /* Prevent concurrent attempts on TDX module initialization */ static DEFINE_MUTEX(tdx_module_lock); +/* TDX-runnable cpus. Protected by cpu_hotplug_lock. */ +static cpumask_t __cpu_tdx_mask; +static cpumask_t *cpu_tdx_mask = &__cpu_tdx_mask; + /* * Use tdx_global_keyid to indicate that TDX is uninitialized. * This is used in TDX initialization error paths to take it from @@ -170,6 +176,63 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, return ret; } +/* + * Call @func on all online cpus one by one but skip those cpus + * when @skip_func is valid and returns true for them. + */ +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data, + bool (*skip_func)(int cpu, void *), + void *skip_data) +{ + int cpu; + + for_each_online_cpu(cpu) { + int ret; + + if (skip_func && skip_func(cpu, skip_data)) + continue; + + /* + * SEAMCALL can be time consuming. Call the @func on + * remote cpu via smp_call_on_cpu() instead of + * smp_call_function_single() to avoid busy waiting. + */ + ret = smp_call_on_cpu(cpu, func, func_data, true); + if (ret) + return ret; + } + + return 0; +} + +static int seamcall_lp_init(void) +{ + /* All '0's are just unused parameters */ + return seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL); +} + +static int smp_func_module_lp_init(void *data) +{ + int ret, cpu = smp_processor_id(); + + ret = seamcall_lp_init(); + if (!ret) + cpumask_set_cpu(cpu, cpu_tdx_mask); + + return ret; +} + +static bool skip_func_module_lp_init_done(int cpu, void *data) +{ + return cpumask_test_cpu(cpu, cpu_tdx_mask); +} + +static int module_lp_init_online_cpus(void) +{ + return tdx_on_each_cpu_cond(smp_func_module_lp_init, NULL, + skip_func_module_lp_init_done, NULL); +} + static int init_tdx_module(void) { int ret; @@ -182,10 +245,26 @@ static int init_tdx_module(void) if (ret) return ret; + /* + * TDX module per-cpu initialization SEAMCALL must be done on + * one cpu before any other SEAMCALLs can be made on that cpu, + * including those involved in further steps to initialize the + * TDX module. + * + * To make sure further SEAMCALLs can be done successfully w/o + * having to consider preemption, disable CPU hotplug during + * rest of module initialization and do per-cpu initialization + * for all online cpus. 
+ */ + cpus_read_lock(); + + ret = module_lp_init_online_cpus(); + if (ret) + goto out; + /* * TODO: * - * - TDX module per-cpu initialization. * - Get TDX module information and TDX-capable memory regions. * - Build the list of TDX-usable memory regions. * - Construct a list of "TD Memory Regions" (TDMRs) to cover @@ -196,7 +275,17 @@ static int init_tdx_module(void) * * Return error before all steps are done. */ - return -EINVAL; + ret = -EINVAL; +out: + /* + * Clear @cpu_tdx_mask if module initialization fails before + * CPU hotplug is re-enabled. tdx_cpu_online() uses it to check + * whether the initialization has been successful or not. + */ + if (ret) + cpumask_clear(cpu_tdx_mask); + cpus_read_unlock(); + return ret; } static int __tdx_enable(void) @@ -220,13 +309,72 @@ static int __tdx_enable(void) return 0; } +/* + * Disable TDX module after it has been initialized successfully. + */ +static void disable_tdx_module(void) +{ + /* + * TODO: module clean up in reverse to steps in + * init_tdx_module(). Remove this comment after + * all steps are done. + */ + cpumask_clear(cpu_tdx_mask); +} + +static int tdx_module_init_online_cpus(void) +{ + int ret; + + /* + * Make sure no cpu can become online to prevent + * race against tdx_cpu_online(). + */ + cpus_read_lock(); + + /* + * Do per-cpu initialization for any new online cpus. + * If any fails, disable TDX. + */ + ret = module_lp_init_online_cpus(); + if (ret) + disable_tdx_module(); + + cpus_read_unlock(); + + return ret; + +} +static int __tdx_enable_online_cpus(void) +{ + if (tdx_module_init_online_cpus()) { + /* + * SEAMCALL failure has already printed + * meaningful error message. + */ + tdx_module_status = TDX_MODULE_ERROR; + + /* + * Just return one universal error code. + * For now the caller cannot recover anyway. + */ + return -EINVAL; + } + + return 0; +} + /** * tdx_enable - Enable TDX to be ready to run TDX guests * * Initialize the TDX module to enable TDX. After this function, the TDX - * module is ready to create and run TDX guests. + * module is ready to create and run TDX guests on all online cpus. + * + * This function internally calls cpus_read_lock()/unlock() to prevent + * any cpu from going online and offline. * * This function assumes all online cpus are already in VMX operation. + * * This function can be called in parallel by multiple callers. * * Return 0 if TDX is enabled successfully, otherwise error. @@ -247,8 +395,17 @@ int tdx_enable(void) ret = __tdx_enable(); break; case TDX_MODULE_INITIALIZED: - /* Already initialized, great, tell the caller. */ - ret = 0; + /* + * The previous call of __tdx_enable() may only have + * initialized part of present cpus during module + * initialization, and new cpus may have become online + * since then. + * + * To make sure all online cpus are TDX-runnable, always + * do per-cpu initialization for all online cpus here + * even the module has been initialized. + */ + ret = __tdx_enable_online_cpus(); break; default: /* Failed to initialize in the previous attempts */ @@ -261,3 +418,46 @@ int tdx_enable(void) return ret; } EXPORT_SYMBOL_GPL(tdx_enable); + +/** + * tdx_cpu_online - Enable TDX on a hotplugged local cpu + * + * @cpu: the cpu to be brought up. + * + * Do TDX module per-cpu initialization for a hotplugged cpu to make + * it TDX-runnable. All online cpus are initialized during module + * initialization. + * + * This function must be called from CPU hotplug callback which holds + * write lock of cpu_hotplug_lock. 
+ * + * This function assumes local cpu is already in VMX operation. + */ +int tdx_cpu_online(unsigned int cpu) +{ + int ret; + + /* + * @cpu_tdx_mask is updated in tdx_enable() and is protected + * by cpus_read_lock()/unlock(). If it is empty, TDX module + * either hasn't been initialized, or TDX didn't get enabled + * successfully. + * + * In either case, do nothing but return success. + */ + if (cpumask_empty(cpu_tdx_mask)) + return 0; + + WARN_ON_ONCE(cpu != smp_processor_id()); + + /* Already done */ + if (cpumask_test_cpu(cpu, cpu_tdx_mask)) + return 0; + + ret = seamcall_lp_init(); + if (!ret) + cpumask_set_cpu(cpu, cpu_tdx_mask); + + return ret; +} +EXPORT_SYMBOL_GPL(tdx_cpu_online); diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 55472e085bc8..30413d7fbee8 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -15,6 +15,7 @@ * TDX module SEAMCALL leaf functions */ #define TDH_SYS_INIT 33 +#define TDH_SYS_LP_INIT 35 /* Kernel defined TDX module status during module initialization. */ enum tdx_module_status_t {
After the SEAMCALL to do TDX module global initialization, a SEAMCALL to do per-cpu initialization (TDH.SYS.LP.INIT) must be done on one logical cpu before any other SEAMCALLs can be made on that cpu, including those involved in the future steps of the module initialization.

To keep things simple, this implementation just chooses to guarantee all online cpus are "TDX-runnable" (TDH.SYS.LP.INIT has been successfully done on them). If the kernel were to allow one cpu to be online while TDH.SYS.LP.INIT failed on it, the kernel would need to track a cpumask of "TDX-runnable" cpus, know which task is a "TDX workload" and guarantee such a task can only be scheduled to "TDX-runnable" cpus. For example, the kernel would need to reject the request in sched_setaffinity() if userspace tries to bind a TDX task to any "non-TDX-runnable" cpu.

To guarantee all online cpus are "TDX-runnable", disable CPU hotplug during module initialization and do TDH.SYS.LP.INIT for all online cpus before any further steps of module initialization. In CPU hotplug, do TDH.SYS.LP.INIT when TDX has been enabled in the CPU online callback and refuse to online the cpu if the SEAMCALL fails.

Currently only KVM handles VMXON. Similar to tdx_enable(), only provide a new helper tdx_cpu_online() but make KVM itself responsible for doing VMXON and calling tdx_cpu_online() in its own CPU online callback.

Note tdx_enable() can be called multiple times by KVM because the KVM module can be unloaded and reloaded. New cpus may become online while KVM is unloaded, and in this case TDH.SYS.LP.INIT won't be called for those new online cpus because KVM's CPU online callback is removed when KVM is unloaded. To make sure all online cpus are "TDX-runnable", always do the per-cpu initialization for all online cpus in tdx_enable() even if the module has been initialized.

Similar to the per-cpu module initialization, a later step to config the key for the global KeyID needs to call some SEAMCALL on one cpu for each CPU package. The difference is that the key configuration SEAMCALL cannot run in parallel on different cpus, while TDH.SYS.LP.INIT can. To avoid duplicated code, add a helper to call SEAMCALL on all online cpus one by one but with a skip function to check whether to skip certain cpus, and use that helper to do the per-cpu initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
v8 -> v9:

 - Added this patch back.
 - Handled the relaxed new behaviour of TDH.SYS.LP.INIT

---
 arch/x86/include/asm/tdx.h  |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 210 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   1 +
 3 files changed, 208 insertions(+), 5 deletions(-)
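[Since the changelog leaves VMXON and the CPU online callback to KVM, the expected caller side would look roughly like the sketch below. The kvm/vmx helper names are placeholders, not real KVM functions; only tdx_enable() and tdx_cpu_online() come from this patch.]

/*
 * Placeholder sketch of how a user (KVM) is expected to consume the
 * two helpers; none of the kvm/vmx names below are real functions.
 */
static int kvm_tdx_online_cpu(unsigned int cpu)
{
        int ret;

        /* VMXON first: every SEAMCALL requires the CPU to be in VMX operation. */
        ret = vmx_hardware_enable_local();
        if (ret)
                return ret;

        /* Then make the freshly onlined CPU TDX-runnable. */
        return tdx_cpu_online(cpu);
}

static int kvm_tdx_setup(void)
{
        /*
         * VMXON has already been done on all online CPUs by the time
         * KVM enables TDX, so tdx_enable() can run LP.INIT on each of
         * them -- including CPUs that came online while KVM was
         * unloaded and therefore missed the online callback above.
         */
        return tdx_enable();
}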