Message ID | 20220919123233.8538-3-petr.pavlu@suse.com (mailing list archive)
State      | New, archived
Series     | module: Merge same-name module load requests
On Mon, Sep 19, 2022 at 02:32:33PM +0200, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call.

Please take a look at the kmod selftest lib/test_kmod.c and the respective shell selftest tools/testing/selftests/kmod/kmod.sh. Can you modify them to add support for reproducing this issue?

> For instance, udev can make a request to insert a frequency module for each individual CPU

That seems stupid indeed; it would seem we should surely be able to prevent such cases, and it can't just be happening for frequency modules.

> when another frequency module is already loaded which causes the init function of the new module to return an error.

Holy smokes.

> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.

Sure..

> This patch improves the behavior as follows:
> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.

OK nice.

> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.

Groovy.

> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
I'm having a hard time with this, because it is not clear if you are suggesting this is a regression introduced by 6e6de3dee51a or not. I'd like you to evaluate the impact of *not* merging a fix to older kernels. In practice I think we'd end up with delays on boot, but is that all? Would boot ever fail? The commit log does not make that clear.

The commit log should make it clear whether this is a regression or not and the impact of not having these fixes merged. Please note that bots will try to scrape for fixes and I suspect bots will pour their heart out on this commit log and identify and assume this is a fix already as-is.

If this *is* a regression, we should try to see how perhaps we can split this up into a part which is mergeable to stable and then a secondary part which does some new fancy optimizations.

  Luis
On Mon, Sep 19, 2022 at 2:33 PM Petr Pavlu <petr.pavlu@suse.com> wrote:
>
> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>
> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>
> This patch improves the behavior as follows:
> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>
> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>
> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> ---

Hi Petr,

as you might have seen I sent a patch/fix yesterday (not being aware of this patch and that this is also a performance issue, which is interesting), that similarly makes sure that modules are unique early.
https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

It doesn't perform the -EBUSY changes or use something like shared_load_info/refcounts; it simply uses a second list while the module cannot be placed onto the module list yet. Not sure if that part is really required (e.g., for performance reasons).

Like Luis, I feel like some of these parts could be split into separate patches, if the other parts are really required.

I just tested your patch in the environment where I can reproduce the vmap allocation issue, and (unsurprisingly) this patch similarly seems to fix the issue. So if your patch ends up upstream, it would be good to add some details from my patch description (the vmap allocation issue) to this patch's description.

Cheers,
David
On Mon 2022-09-19 14:32:33, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>
> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>
> This patch improves the behavior as follows:
> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>
> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.

> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -61,14 +61,29 @@
>
>  /*
>   * Mutex protects:
> - * 1) List of modules (also safely readable with preempt_disable),
> + * 1) list of modules (also safely readable with preempt_disable, delete and add
> + *    uses RCU list operations).
>   * 2) module_use links,
> - * 3) mod_tree.addr_min/mod_tree.addr_max.
> - *    (delete and add uses RCU list operations).
> + * 3) mod_tree.addr_min/mod_tree.addr_max,
> + * 4) list of unloaded_tainted_modules.

This is unrelated and should be done in a separate patch. It minimizes the size of the patch and simplifies review. More importantly, these unrelated changes will not get lost when the patch gets reverted for some reason.

> + * 5) list of running_loads.
> + */
>  DEFINE_MUTEX(module_mutex);
>  LIST_HEAD(modules);
>
> +/* Shared information to track duplicate module loads. */
> +struct shared_load_info {
> +	char name[MODULE_NAME_LEN];
> +	refcount_t refcnt;
> +	struct list_head list;
> +	struct completion done;
> +	int err;
> +};
> +static LIST_HEAD(running_loads);
> +
> +/* Waiting for a module to finish loading? */
> +static DECLARE_WAIT_QUEUE_HEAD(module_wq);

It is not obvious why this is actually needed. One would expect that using the completion in struct shared_load_info would be enough.

It is needed because the user, resolve_symbol_wait(), does not know the exact name of the module that it is waiting for.

It would deserve a comment and maybe even renaming, for example:

/*
 * Waiting for a module load when the exact module name is not known,
 * for example, when resolving symbols from other modules.
 */
static DECLARE_WAIT_QUEUE_HEAD(any_module_load_wq);

> +
>  /* Work queue for freeing init sections in success case */
>  static void do_free_init(struct work_struct *w);
>  static DECLARE_WORK(init_free_wq, do_free_init);
> @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>  	strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
>
>  	free_module(mod);
> -	/* someone could wait for the module in add_unformed_module() */
> -	wake_up_interruptible(&module_wq);

The new code no longer wakes module_wq when the module is removed. I wondered whether it is correct. It is a bit tricky.

module_wq used to have two purposes. It was used to wake up:

1. parallel loads of the same module,
2. resolving symbols in a module using symbols from another module.

Ad 1. The new code handles the parallel load using struct shared_load_info. Also the new code does not wait when the same module is being removed. So the wake up is not needed here.

Ad 2. The module_wq is typically used when the other module is loaded automatically because of module dependencies. In this case, only the wake up in load_module() is important. But what about module removal? IMHO, there is a small race window.

The problem is that the module operations are asynchronous. And A takes the reference on B only after it resolves a symbol, see ref_module() called in resolve_symbol().

Let's have two modules A and B where module A uses a symbol from module B:

CPU 0:                          CPU 1:                 CPU 3:

modprobe A
  // sees dependency on B
  // and calls modprobe B
  // in a separate thread
                                modprobe B
                                  return -EEXIST
                                                       rmmod B
                                                       // finished
  resolve_symbol_wait(sym_from_B)

It has to wait until the 30s timeout because module B is gone and it is not being loaded.

IMHO, the new code makes the race window slightly bigger because modprobe B returns immediately even when rmmod B is already in progress.

IMHO, this is acceptable because the race was there anyway. And it is not a very realistic scenario.
>  	return 0;
>  out:
>  	mutex_unlock(&module_mutex);
> @@ -2552,43 +2540,133 @@ static int may_init_module(void)
>  	return 0;
>  }
>
> +static struct shared_load_info *
> +shared_load_info_alloc(const struct load_info *info)
> +{
> +	struct shared_load_info *shared_info =
> +		kzalloc(sizeof(*shared_info), GFP_KERNEL);
> +	if (shared_info == NULL)
> +		return ERR_PTR(-ENOMEM);
> +
> +	strscpy(shared_info->name, info->name, sizeof(shared_info->name));
> +	refcount_set(&shared_info->refcnt, 1);
> +	INIT_LIST_HEAD(&shared_info->list);
> +	init_completion(&shared_info->done);
> +	return shared_info;
> +}
> +
> +static void shared_load_info_get(struct shared_load_info *shared_info)
> +{
> +	refcount_inc(&shared_info->refcnt);
> +}
> +
> +static void shared_load_info_put(struct shared_load_info *shared_info)
> +{
> +	if (refcount_dec_and_test(&shared_info->refcnt))
> +		kfree(shared_info);
> +}
> +
>  /*
> - * We try to place it in the list now to make sure it's unique before
> - * we dedicate too many resources. In particular, temporary percpu
> + * Check that a module load is unique and make it visible to others. The code
> + * looks for parallel running inserts and already loaded modules. Two inserts
> + * are considered equivalent if their module name matches. In case this load
> + * duplicates another running insert, the code waits for its completion and
> + * then returns -EEXIST or -EBUSY depending on whether it succeeded.
> + *
> + * Detecting early that a load is unique avoids dedicating too many cycles and
> + * resources to bring up the module. In particular, it prevents temporary percpu
>  * memory exhaustion.
> + *
> + * Merging same load requests then primarily helps during the boot process. It
> + * can happen that the kernel receives a burst of requests to load the same
> + * module (for example, a same module for each individual CPU) and loading it
> + * eventually fails during its init call. Merging the requests allows that only
> + * one full attempt to load the module is made.
> + *
> + * On a non-error return, it is guaranteed that this load is unique.
>  */
> -static int add_unformed_module(struct module *mod)
> +static struct shared_load_info *add_running_load(const struct load_info *info)
>  {
> -	int err;
>  	struct module *old;
> +	struct shared_load_info *shared_info;
>
> -	mod->state = MODULE_STATE_UNFORMED;
> -
> -again:
>  	mutex_lock(&module_mutex);
> -	old = find_module_all(mod->name, strlen(mod->name), true);
> -	if (old != NULL) {
> -		if (old->state != MODULE_STATE_LIVE) {
> -			/* Wait in case it fails to load. */
> +
> +	/* Search if there is a running load of a module with the same name. */
> +	list_for_each_entry(shared_info, &running_loads, list)
> +		if (strcmp(shared_info->name, info->name) == 0) {
> +			int err;
> +
> +			shared_load_info_get(shared_info);
>  			mutex_unlock(&module_mutex);
> -			err = wait_event_interruptible(module_wq,
> -					finished_loading(mod->name));
> -			if (err)
> -				goto out_unlocked;
> -			goto again;
> +
> +			err = wait_for_completion_interruptible(
> +				&shared_info->done);

This would deserve a comment, for example:

			/*
			 * Return -EBUSY when the parallel load failed
			 * for any reason. This load might end up
			 * another way but we are not going to try.
			 */

> +			if (!err)
> +				err = shared_info->err ? -EBUSY : -EEXIST;
> +			shared_load_info_put(shared_info);
> +			shared_info = ERR_PTR(err);
> +			goto out_unlocked;
>  		}
> -		err = -EEXIST;
> +
> +	/* Search if there is a live module with the given name already. */
> +	old = find_module_all(info->name, strlen(info->name), true);
> +	if (old != NULL) {
> +		if (old->state == MODULE_STATE_LIVE) {
> +			shared_info = ERR_PTR(-EEXIST);
> +			goto out;
> +		}
> +
> +		/*
> +		 * Any active load always has its record in running_loads and so
> +		 * would be found above. This applies independent whether such
> +		 * a module is currently in MODULE_STATE_UNFORMED,
> +		 * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its
> +		 * initialization failed. It therefore means this must be an
> +		 * older going module and the caller should try later once it is
> +		 * gone.
> +		 */
> +		WARN_ON(old->state != MODULE_STATE_GOING);
> +		shared_info = ERR_PTR(-EBUSY);
>  		goto out;
>  	}
> -	mod_update_bounds(mod);
> -	list_add_rcu(&mod->list, &modules);
> -	mod_tree_insert(mod);
> -	err = 0;
> +
> +	/* The load is unique, make it visible to others. */
> +	shared_info = shared_load_info_alloc(info);
> +	if (IS_ERR(shared_info))
> +		goto out;
> +	list_add(&shared_info->list, &running_loads);
>
>  out:
>  	mutex_unlock(&module_mutex);
>  out_unlocked:
> -	return err;
> +	return shared_info;
> +}
> +
> +static void finalize_running_load(struct shared_load_info *shared_info, int err)
> +{
> +	/* Inform other duplicate inserts that the load finished. */
> +	mutex_lock(&module_mutex);
> +	list_del(&shared_info->list);
> +	shared_info->err = err;
> +	mutex_unlock(&module_mutex);
> +
> +	complete_all(&shared_info->done);
> +	shared_load_info_put(shared_info);
> +
> +	/* Tell other modules waiting on this one that it completed loading. */
> +	wake_up_interruptible(&module_wq);
> +}
> +
> +static void add_unformed_module(struct module *mod)
> +{
> +	mod->state = MODULE_STATE_UNFORMED;
> +
> +	mutex_lock(&module_mutex);
> +	mod_update_bounds(mod);
> +	list_add_rcu(&mod->list, &modules);
> +	mod_tree_insert(mod);
> +	mutex_unlock(&module_mutex);
>  }
>
>  static int complete_formation(struct module *mod, struct load_info *info)

The comments are more or less about cosmetic problems. I do not see any real functional problem. I am sorry that I did not mention them when reviewing v1.

Feel free to use:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr
On 9/30/22 22:30, Luis Chamberlain wrote:
> On Mon, Sep 19, 2022 at 02:32:33PM +0200, Petr Pavlu wrote:
>> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call.
>
> Please take a look at kmod selftest lib/test_kmod.c and the respective shell selftest tools/testing/selftests/kmod/kmod.sh. Can you modify it to add support to reproduce this issue?

It was possible for me to write some kselftests for this. I will post them as a separate patch in v3.

>> For instance, udev can make a request to insert a frequency module for each individual CPU
>
> That seems stupid indeed, it would seem we should be able for sure to prevent such cases, it can't just be happening for frequency modules.

The issue was also observed with EDAC drivers which are similarly exclusive.

>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>
> I'm having a hard time with this, because it is not clear if you are suggesting this is a regression introduced by 6e6de3dee51a or not. I'd like you to evaluate the impact of *not* merging a fix to older kernels. In practice I think we'd end up with delays on boot, but is that all? Would boot ever fail? The commit log does not make that clear.
>
> The commit log should make it clear whether this is a regression or not and the impact of not having these fixes merged. Please note that bots will try to scrape for fixes and I suspect bots will pour their heart out on this commit log and identify and assume this is a fix already as-is.
I touched on this somewhat in my response to review comments on v1 from Petr Mladek [1] but it looks like I failed to appropriately update the commit message in the new version. I will try to improve it in v3.

The patch does address a regression observed after commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"). I guess it can have a Fixes tag added to the patch.

To add more information, the following is a test from a machine with 288 CPUs which I performed when preparing this patch. The system had Tumbleweed 20220829 installed on it. The boot process tried to load 288x pcc_cpufreq and 576x acpi_cpufreq modules which all failed because intel_pstate was already active.

The test used three custom builds. The base was 6.0-rc3, 'revert' is base + revert of 6e6de3dee51a, 'my' is base + the proposed fix. Compiled modules were uncompressed and unsigned. Each configuration had its boot tested 5 times. Time was measured from the first load attempt of a given module to the last one, by simply looking at messages such as "Inserted module 'acpi_cpufreq'" in the udev log and their timestamps. All times are in seconds.

|      |                    Configuration                    |
| Boot |      base       |     revert      |       my        |
|  v   |  pcc   |  acpi  |  pcc   |  acpi  |  pcc   |  acpi  |
+------+--------+--------+--------+--------+--------+--------+
| #1   | 45.374 | 45.462 |  1.992 |  8.509 |  2.190 |  6.931 |
| #2   | 44.727 | 44.712 |  2.249 | 11.436 |  1.821 |  8.413 |
| #3   | 45.450 | 45.771 |  1.685 |  8.784 |  1.964 |  6.341 |
| #4   | 44.306 | 44.840 |  2.469 |  9.611 |  2.362 |  6.856 |
| #5   | 45.132 | 45.216 |  2.063 |  8.782 |  1.717 |  6.405 |
+------+--------+--------+--------+--------+--------+--------+
| Avg  | 44.998 | 45.200 |  2.092 |  9.424 |  2.011 |  6.989 |

This shows the observed regression and results with the proposed fix.

> If this *is* a regression, we should try to see how perhaps we can split this up into a part which is mergeable to stable and then a secondary part which does some new fancy optimizations.

I think it is hard to split this patch into parts because the implemented "optimization" is the fix.

[1] https://lore.kernel.org/linux-modules/0ccb384f-bbd5-f0fd-3832-c2255df505b2@suse.com/

Thanks,
Petr
On 10/14/22 09:54, David Hildenbrand wrote:
> On Mon, Sep 19, 2022 at 2:33 PM Petr Pavlu <petr.pavlu@suse.com> wrote:
>>
>> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>>
>> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>>
>> This patch improves the behavior as follows:
>> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
>> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>>
>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>>
>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>> ---
>
> Hi Petr,
>
> as you might have seen I sent a patch/fix yesterday (not being aware of this patch and that this is also a performance issue, which is interesting), that similarly makes sure that modules are unique early.
>
> https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
>
> It doesn't perform the -EBUSY changes or use something like shared_load_info/refcounts; it simply uses a second list while the module cannot be placed onto the module list yet.
>
> Not sure if that part is really required (e.g., for performance reasons). Like Luis, I feel like some of these parts could be split into separate patches, if the other parts are really required.

The shared_load_info/refcounts/-EBUSY logic is actually an important part which addresses the regression mentioned in the commit message and which I'm primarily trying to fix.

> I just tested your patch in the environment where I can reproduce the vmap allocation issue, and (unsurprisingly) this patch similarly seems to fix the issue.
>
> So if your patch ends up upstream, it would be good to add some details of my patch description (vmap allocation issue) to this patch description.

Thanks for testing this patch. I will add a note about the vmap allocation issue to the patch description.

Petr
On 10/14/22 15:52, Petr Mladek wrote:
> On Mon 2022-09-19 14:32:33, Petr Pavlu wrote:
>> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>>
>> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>>
>> This patch improves the behavior as follows:
>> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
>> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>>
>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -61,14 +61,29 @@
>>
>>  /*
>>   * Mutex protects:
>> - * 1) List of modules (also safely readable with preempt_disable),
>> + * 1) list of modules (also safely readable with preempt_disable, delete and add
>> + *    uses RCU list operations).
>>   * 2) module_use links,
>> - * 3) mod_tree.addr_min/mod_tree.addr_max.
>> - *    (delete and add uses RCU list operations).
>> + * 3) mod_tree.addr_min/mod_tree.addr_max,
>> + * 4) list of unloaded_tainted_modules.
>
> This is unrelated and should be done in a separate patch. It minimizes the size of the patch and simplifies review. More importantly, these unrelated changes will not get lost when the patch gets reverted for some reason.

Ok.

>> + * 5) list of running_loads.
>> + */
>>  DEFINE_MUTEX(module_mutex);
>>  LIST_HEAD(modules);
>>
>> +/* Shared information to track duplicate module loads. */
>> +struct shared_load_info {
>> +	char name[MODULE_NAME_LEN];
>> +	refcount_t refcnt;
>> +	struct list_head list;
>> +	struct completion done;
>> +	int err;
>> +};
>> +static LIST_HEAD(running_loads);
>> +
>> +/* Waiting for a module to finish loading? */
>> +static DECLARE_WAIT_QUEUE_HEAD(module_wq);
>
> It is not obvious why this is actually needed. One would expect that using the completion in struct shared_load_info would be enough.
>
> It is needed because the user, resolve_symbol_wait(), does not know the exact name of the module that it is waiting for.
>
> It would deserve a comment and maybe even renaming, for example:
>
> /*
>  * Waiting for a module load when the exact module name is not known,
>  * for example, when resolving symbols from other modules.
>  */
> static DECLARE_WAIT_QUEUE_HEAD(any_module_load_wq);

I will adjust the comment.

>> +
>>  /* Work queue for freeing init sections in success case */
>>  static void do_free_init(struct work_struct *w);
>>  static DECLARE_WORK(init_free_wq, do_free_init);
>> @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>>  	strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
>>
>>  	free_module(mod);
>> -	/* someone could wait for the module in add_unformed_module() */
>> -	wake_up_interruptible(&module_wq);
>
> The new code no longer wakes module_wq when the module is removed.
> I wondered whether it is correct. It is a bit tricky.
>
> module_wq used to have two purposes. It was used to wake up:
>
> 1. parallel loads of the same module,
> 2. resolving symbols in a module using symbols from another module.
>
> Ad 1. The new code handles the parallel load using struct shared_load_info. Also the new code does not wait when the same module is being removed. So the wake up is not needed here.
>
> Ad 2. The module_wq is typically used when the other module is loaded automatically because of module dependencies. In this case, only the wake up in load_module() is important. But what about module removal? IMHO, there is a small race window.
>
> The problem is that the module operations are asynchronous. And A takes the reference on B only after it resolves a symbol, see ref_module() called in resolve_symbol().
>
> Let's have two modules A and B where module A uses a symbol from module B:
>
> CPU 0:                          CPU 1:                 CPU 3:
>
> modprobe A
>   // sees dependency on B
>   // and calls modprobe B
>   // in a separate thread
>                                 modprobe B
>                                   return -EEXIST
>                                                        rmmod B
>                                                        // finished
>   resolve_symbol_wait(sym_from_B)
>
> It has to wait until the 30s timeout because module B is gone and it is not being loaded.
>
> IMHO, the new code makes the race window slightly bigger because modprobe B returns immediately even when rmmod B is already in progress.
>
> IMHO, this is acceptable because the race was there anyway. And it is not a very realistic scenario.

I'm not sure if I understand this scenario. My reading is that modprobe of A would wait in resolve_symbol_wait() only if module B is in the coming state, and then would get later woken up from finalize_running_load(). In all other cases, resolve_symbol_wait() should not sleep:

* If B is not loaded at all or is in the unformed state then resolve_symbol() -> find_symbol() detects that a needed symbol is missing. NULL is then returned from resolve_symbol() and subsequently from resolve_symbol_wait().
* If B is live then resolve_symbol() should successfully resolve the symbol and a refcount is increased on this module.
* If B is going then resolve_symbol() -> ref_module() -> strong_try_module_get() -> try_module_get() should fail. This then gets propagated as -ENODEV from resolve_symbol() and resolve_symbol_wait().

>>  	return 0;
>>  out:
>>  	mutex_unlock(&module_mutex);
>> @@ -2552,43 +2540,133 @@ static int may_init_module(void)
>>  	return 0;
>>  }
>>
>> +static struct shared_load_info *
>> +shared_load_info_alloc(const struct load_info *info)
>> +{
>> +	struct shared_load_info *shared_info =
>> +		kzalloc(sizeof(*shared_info), GFP_KERNEL);
>> +	if (shared_info == NULL)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	strscpy(shared_info->name, info->name, sizeof(shared_info->name));
>> +	refcount_set(&shared_info->refcnt, 1);
>> +	INIT_LIST_HEAD(&shared_info->list);
>> +	init_completion(&shared_info->done);
>> +	return shared_info;
>> +}
>> +
>> +static void shared_load_info_get(struct shared_load_info *shared_info)
>> +{
>> +	refcount_inc(&shared_info->refcnt);
>> +}
>> +
>> +static void shared_load_info_put(struct shared_load_info *shared_info)
>> +{
>> +	if (refcount_dec_and_test(&shared_info->refcnt))
>> +		kfree(shared_info);
>> +}
>> +
>>  /*
>> - * We try to place it in the list now to make sure it's unique before
>> - * we dedicate too many resources. In particular, temporary percpu
>> + * Check that a module load is unique and make it visible to others. The code
>> + * looks for parallel running inserts and already loaded modules. Two inserts
>> + * are considered equivalent if their module name matches. In case this load
>> + * duplicates another running insert, the code waits for its completion and
>> + * then returns -EEXIST or -EBUSY depending on whether it succeeded.
>> + *
>> + * Detecting early that a load is unique avoids dedicating too many cycles and
>> + * resources to bring up the module. In particular, it prevents temporary percpu
>>  * memory exhaustion.
>> + * >> + * Merging same load requests then primarily helps during the boot process. It >> + * can happen that the kernel receives a burst of requests to load the same >> + * module (for example, a same module for each individual CPU) and loading it >> + * eventually fails during its init call. Merging the requests allows that only >> + * one full attempt to load the module is made. >> + * >> + * On a non-error return, it is guaranteed that this load is unique. >> */ >> -static int add_unformed_module(struct module *mod) >> +static struct shared_load_info *add_running_load(const struct load_info *info) >> { >> - int err; >> struct module *old; >> + struct shared_load_info *shared_info; >> >> - mod->state = MODULE_STATE_UNFORMED; >> - >> -again: >> mutex_lock(&module_mutex); >> - old = find_module_all(mod->name, strlen(mod->name), true); >> - if (old != NULL) { >> - if (old->state != MODULE_STATE_LIVE) { >> - /* Wait in case it fails to load. */ >> + >> + /* Search if there is a running load of a module with the same name. */ >> + list_for_each_entry(shared_info, &running_loads, list) >> + if (strcmp(shared_info->name, info->name) == 0) { >> + int err; >> + >> + shared_load_info_get(shared_info); >> mutex_unlock(&module_mutex); >> - err = wait_event_interruptible(module_wq, >> - finished_loading(mod->name)); >> - if (err) >> - goto out_unlocked; >> - goto again; >> + >> + err = wait_for_completion_interruptible( >> + &shared_info->done); > > This would deserve a comment, for examle: > > /* > * Return -EBUSY when the parallel load failed > * from any reason. This load might end up > * another way but we are not going to try. > */ Ok. >> + if (!err) >> + err = shared_info->err ? -EBUSY : -EEXIST; >> + shared_load_info_put(shared_info); >> + shared_info = ERR_PTR(err); >> + goto out_unlocked; >> } >> - err = -EEXIST; >> + >> + /* Search if there is a live module with the given name already. 
*/ >> + old = find_module_all(info->name, strlen(info->name), true); >> + if (old != NULL) { >> + if (old->state == MODULE_STATE_LIVE) { >> + shared_info = ERR_PTR(-EEXIST); >> + goto out; >> + } >> + >> + /* >> + * Any active load always has its record in running_loads and so >> + * would be found above. This applies independent whether such >> + * a module is currently in MODULE_STATE_UNFORMED, >> + * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its >> + * initialization failed. It therefore means this must be an >> + * older going module and the caller should try later once it is >> + * gone. >> + */ >> + WARN_ON(old->state != MODULE_STATE_GOING); >> + shared_info = ERR_PTR(-EBUSY); >> goto out; >> } >> - mod_update_bounds(mod); >> - list_add_rcu(&mod->list, &modules); >> - mod_tree_insert(mod); >> - err = 0; >> + >> + /* The load is unique, make it visible to others. */ >> + shared_info = shared_load_info_alloc(info); >> + if (IS_ERR(shared_info)) >> + goto out; >> + list_add(&shared_info->list, &running_loads); >> >> out: >> mutex_unlock(&module_mutex); >> out_unlocked: >> - return err; >> + return shared_info; >> +} >> + >> +static void finalize_running_load(struct shared_load_info *shared_info, int err) >> +{ >> + /* Inform other duplicate inserts that the load finished. */ >> + mutex_lock(&module_mutex); >> + list_del(&shared_info->list); >> + shared_info->err = err; >> + mutex_unlock(&module_mutex); >> + >> + complete_all(&shared_info->done); >> + shared_load_info_put(shared_info); >> + >> + /* Tell other modules waiting on this one that it completed loading. 
*/ >> + wake_up_interruptible(&module_wq); >> +} >> + >> +static void add_unformed_module(struct module *mod) >> +{ >> + mod->state = MODULE_STATE_UNFORMED; >> + >> + mutex_lock(&module_mutex); >> + mod_update_bounds(mod); >> + list_add_rcu(&mod->list, &modules); >> + mod_tree_insert(mod); >> + mutex_unlock(&module_mutex); >> } >> >> static int complete_formation(struct module *mod, struct load_info *info) > > The comments are more or less about cosmetic problems. I do not see > any real functional problem. I am sorry that I did not mention > them when reviewing v1. > > Feel free to use: > > Reviewed-by: Petr Mladek <pmladek@suse.com> Thanks, Petr
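The pattern in the patch above (the first requester owns the load, duplicates take a reference on a shared record and wait for a completion, the owner publishes the outcome for everyone) can be modeled in userspace. The following is a rough pthreads sketch of that idea, not kernel code: the names mirror the patch, but `pthread_mutex_t`/`pthread_cond_t` stand in for `module_mutex` and `struct completion`, and error handling is elided.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct shared_load_info {
	char name[56];
	int refcnt;			/* protected by module_mutex */
	int err;			/* result published by the owner */
	int done;
	pthread_cond_t cond;
	struct shared_load_info *next;
};

static pthread_mutex_t module_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct shared_load_info *running_loads;

/* Drop one reference; the caller holds module_mutex. */
static void shared_load_info_put(struct shared_load_info *si)
{
	if (--si->refcnt == 0) {
		pthread_cond_destroy(&si->cond);
		free(si);
	}
}

/*
 * First caller for @name becomes the owner (*owner = 1) and must perform
 * the real load; duplicates get *owner = 0 and later wait on the record.
 */
static struct shared_load_info *add_running_load(const char *name, int *owner)
{
	struct shared_load_info *si;

	pthread_mutex_lock(&module_mutex);
	for (si = running_loads; si; si = si->next) {
		if (strcmp(si->name, name) == 0) {
			si->refcnt++;	/* duplicate load request */
			*owner = 0;
			pthread_mutex_unlock(&module_mutex);
			return si;
		}
	}
	si = calloc(1, sizeof(*si));	/* allocation failure elided */
	strncpy(si->name, name, sizeof(si->name) - 1);
	si->refcnt = 1;
	pthread_cond_init(&si->cond, NULL);
	si->next = running_loads;
	running_loads = si;
	*owner = 1;
	pthread_mutex_unlock(&module_mutex);
	return si;
}

/* The owner unlinks the record and publishes the outcome to duplicates. */
static void finalize_running_load(struct shared_load_info *si, int err)
{
	struct shared_load_info **p;

	pthread_mutex_lock(&module_mutex);
	for (p = &running_loads; *p; p = &(*p)->next) {
		if (*p == si) {
			*p = si->next;
			break;
		}
	}
	si->err = err;
	si->done = 1;
	pthread_cond_broadcast(&si->cond);
	shared_load_info_put(si);
	pthread_mutex_unlock(&module_mutex);
}

/* Duplicates wait, then map the owner's result to -EEXIST or -EBUSY. */
static int wait_for_running_load(struct shared_load_info *si)
{
	int err;

	pthread_mutex_lock(&module_mutex);
	while (!si->done)
		pthread_cond_wait(&si->cond, &module_mutex);
	err = si->err ? -EBUSY : -EEXIST;
	shared_load_info_put(si);
	pthread_mutex_unlock(&module_mutex);
	return err;
}
```

As in the patch, a duplicate requester never attempts the full load itself; it only learns whether the winning attempt succeeded (-EEXIST) or failed (-EBUSY).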
On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote:
> The patch does address a regression observed after commit 6e6de3dee51a
> ("kernel/module.c: Only return -EEXIST for modules that have finished
> loading"). I guess it can have a Fixes tag added to the patch.
>
> I think it is hard to split this patch into parts because the implemented
> "optimization" is the fix.

git describe --contains 6e6de3dee51a
v5.3-rc1~38^2~6

I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the
right thing to do, but without it, it still leaves the issue reported
by Prarit Bhargava. We need a way to resolve the issue on stable and
then your optimizations can be applied on top.

Prarit Bhargava, please review Petr's work and see if you can come up
with a sensible way to address this for stable.

  Luis
On 10/18/22 14:33, Luis Chamberlain wrote: > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >> The patch does address a regression observed after commit 6e6de3dee51a >> ("kernel/module.c: Only return -EEXIST for modules that have finished >> loading"). I guess it can have a Fixes tag added to the patch. >> >> I think it is hard to split this patch into parts because the implemented >> "optimization" is the fix. > > git describe --contains 6e6de3dee51a > v5.3-rc1~38^2~6 > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > right thing to do, but without it, it still leaves the issue reported > by Prarit Bhargava. We need a way to resolve the issue on stable and > then your optimizations can be applied on top. > > Prarit Bhargava, please review Petry's work and see if you can come up > with a sensible way to address this for stable. Thanks for the heads up Luis. I'll take a closer look. [A long time ago] I could swear we made a very targeted decision to *NOT* allow modules with the same name to be loaded into the kernel. What's changed that we think this is okay to do today? Thanks, P. > > Luis >
On 10/18/22 14:33, Luis Chamberlain wrote: > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >> The patch does address a regression observed after commit 6e6de3dee51a >> ("kernel/module.c: Only return -EEXIST for modules that have finished >> loading"). I guess it can have a Fixes tag added to the patch. >> >> I think it is hard to split this patch into parts because the implemented >> "optimization" is the fix. > > git describe --contains 6e6de3dee51a > v5.3-rc1~38^2~6 > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > right thing to do, but without it, it still leaves the issue reported > by Prarit Bhargava. We need a way to resolve the issue on stable and > then your optimizations can be applied on top. > > Prarit Bhargava, please review Petry's work and see if you can come up > with a sensible way to address this for stable. I found the original thread surrounding these changes and I do not think this solution should have been committed to the kernel. It is not a good solution to the problem being solved and adds complexity where none is needed IMO. Quoting from the original thread, > > Motivation for this patch is to fix an issue observed on larger machines with > many CPUs where it can take a significant amount of time during boot to run > systemd-udev-trigger.service. An x86-64 system can have already intel_pstate > active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will > attempt to load these modules too. The operation will eventually fail in the > init function of a respective module where it gets recognized that another > cpufreq driver is already loaded and -EEXIST is returned. However, one uevent > is triggered for each CPU and so multiple loads of these modules will be > present. The current code then processes all such loads individually and > serializes them with the barrier in add_unformed_module(). 
>

The way to solve this is not in the module loading code, but in the udev
code by adding a new event or in the userspace which handles the loading
events.

Option 1)

Write/modify a udev rule to use a flock userspace file lock to prevent
repeated loading. The problem with this is that it is still racy and still
consumes CPU time repeatedly loading the ELF header and, depending on the
system (ie a large number of cpus), would still cause a boot delay. This
would be better than what we have and is worth looking at as a simple
solution. I'd like to see boot times with this change, and I'll try to
come up with a measurement on a large CPU system.

Option 2)

Create a new udev action, "add_once", to indicate to userspace that the
module only needs to be loaded one time, and to ignore further load
requests. This is a bit tricky as both kernel space and userspace would
have to be modified. The udev rule would end up looking very similar to
what we have now.

The benefit of option 2 is that driver writers themselves can choose which
drivers should issue "add_once" instead of "add". Drivers that are known
to run on all devices at once would call "add_once" to only issue a single
load.

Thoughts?

P.
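The core primitive behind Option 1 is an advisory flock(2) lock keyed on the module name, so that only one concurrent loader proceeds and the rest skip. A minimal sketch of that primitive follows; the function name and lock path are hypothetical, and wiring this into an actual udev rule (the hard part discussed later in the thread) is not shown.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical Option 1 helper: returns 1 if this caller won the lock and
 * should run modprobe, 0 if a parallel loader already holds it, -1 on
 * error.  The fd must stay open (the lock held) until the modprobe
 * attempt finishes; closing it releases the lock.
 */
static int try_become_loader(const char *lockpath, int *fd_out)
{
	int fd = open(lockpath, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
		close(fd);	/* another loader is active: skip quietly */
		return 0;
	}
	*fd_out = fd;
	return 1;
}
```

Note that flock() locks belong to the open file description, so even two callers in the same process (two separate open() calls) exclude each other, which is what makes the udev-worker dedup idea work.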
On 10/18/22 20:33, Luis Chamberlain wrote:
> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote:
>> The patch does address a regression observed after commit 6e6de3dee51a
>> ("kernel/module.c: Only return -EEXIST for modules that have finished
>> loading"). I guess it can have a Fixes tag added to the patch.
>>
>> I think it is hard to split this patch into parts because the implemented
>> "optimization" is the fix.
>
> git describe --contains 6e6de3dee51a
> v5.3-rc1~38^2~6
>
> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the
> right thing to do, but without it, it still leaves the issue reported
> by Prarit Bhargava. We need a way to resolve the issue on stable and
> then your optimizations can be applied on top.

Simpler could be to do the following:

diff --git a/kernel/module/main.c b/kernel/module/main.c
index d02d39c7174e..0302ac387e93 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
 	sched_annotate_sleep();
 	mutex_lock(&module_mutex);
 	mod = find_module_all(name, strlen(name), true);
-	ret = !mod || mod->state == MODULE_STATE_LIVE;
+	ret = !mod || mod->state == MODULE_STATE_LIVE
+		|| mod->state == MODULE_STATE_GOING;
 	mutex_unlock(&module_mutex);
 
 	return ret;
@@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod)
 	mutex_lock(&module_mutex);
 	old = find_module_all(mod->name, strlen(mod->name), true);
 	if (old != NULL) {
-		if (old->state != MODULE_STATE_LIVE) {
+		if (old->state == MODULE_STATE_COMING
+		    || old->state == MODULE_STATE_UNFORMED) {
 			/* Wait in case it fails to load. */
 			mutex_unlock(&module_mutex);
 			err = wait_event_interruptible(module_wq,
@@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod)
 				goto out_unlocked;
 			goto again;
 		}
-		err = -EEXIST;
+		err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST;
 		goto out;
 	}
 	mod_update_bounds(mod);

This is an alternative approach to fix the issue that 6e6de3dee51a addressed
and it preserves the previous handling of same-module parallel loads.

It works well in practice but a problem is that this previous handling is
somewhat fragile because it requires specific timings. A second load of the
same module returns -EBUSY only if it observes the first load in the going
state.

The following can then happen:
* A first load of module A is requested. It passes add_unformed_module() and
  proceeds with full initialization.
* A second load of module A arrives. It proceeds up to add_unformed_module()
  where it waits on the first module to complete its initialization.
* The first load fails because its init function happens to produce an error.
  The cleanup code in do_init_module() unlinks the module from the modules
  list, frees the module and finally calls wake_up_all(&module_wq).
* The second load gets woken up. It sees that there is no module with the
  same name in the modules list and continues with its full initialization,
  which likely again fails in the init function.

This scenario can be reproduced when one prepares a sample module with
"msleep(1000); return -ENODEV;" in its init function and tries to load it
several times in parallel.

My posted patch essentially brings this handling of parallel loads back but
gains some extra bits because I wanted to prevent the described instability.

However, as mentioned previously, if we can avoid these parallel same-module
load attempts in the first place then that would certainly be the best
option.

Thanks,
Petr
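The reproducer Petr describes (an init function that sleeps to widen the race window and then always fails) could look roughly like the sample module below. This is a sketch only: it has to be built against kernel headers as an out-of-tree module and cannot run standalone, and the module name is made up.

```c
#include <linux/module.h>
#include <linux/delay.h>

/*
 * Sketch of the described reproducer: sleep so that several parallel
 * insmod/modprobe calls overlap in add_unformed_module(), then always
 * fail in init, as in the "msleep(1000); return -ENODEV;" scenario.
 */
static int __init race_demo_init(void)
{
	msleep(1000);
	return -ENODEV;
}

module_init(race_demo_init);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Parallel same-name module load reproducer (sketch)");
```

Loading this several times in parallel (e.g. `for i in $(seq 8); do insmod race_demo.ko & done`) should show each request making its own full, failing initialization attempt on kernels with the fragile behavior.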
On Wed 2022-10-19 14:00:55, Petr Pavlu wrote: > On 10/18/22 20:33, Luis Chamberlain wrote: > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > >> The patch does address a regression observed after commit 6e6de3dee51a > >> ("kernel/module.c: Only return -EEXIST for modules that have finished > >> loading"). I guess it can have a Fixes tag added to the patch. > >> > >> I think it is hard to split this patch into parts because the implemented > >> "optimization" is the fix. > > > > git describe --contains 6e6de3dee51a > > v5.3-rc1~38^2~6 > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > right thing to do, but without it, it still leaves the issue reported > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > then your optimizations can be applied on top. > > Simpler could be to do the following: > > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d02d39c7174e..0302ac387e93 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > sched_annotate_sleep(); > mutex_lock(&module_mutex); > mod = find_module_all(name, strlen(name), true); > - ret = !mod || mod->state == MODULE_STATE_LIVE; > + ret = !mod || mod->state == MODULE_STATE_LIVE > + || mod->state == MODULE_STATE_GOING; > mutex_unlock(&module_mutex); > > return ret; > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > mutex_lock(&module_mutex); > old = find_module_all(mod->name, strlen(mod->name), true); > if (old != NULL) { > - if (old->state != MODULE_STATE_LIVE) { > + if (old->state == MODULE_STATE_COMING > + || old->state == MODULE_STATE_UNFORMED) { > /* Wait in case it fails to load. 
*/ > mutex_unlock(&module_mutex); > err = wait_event_interruptible(module_wq, > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > goto out_unlocked; > goto again; > } > - err = -EEXIST; > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > > goto out; > } > mod_update_bounds(mod); > > This is an alternative approach to fix the issue that 6e6de3dee51a addressed > and it preserves the previous handling of same-module parallel loads. > > It works well in practice but a problem is that this previous handling is > somewhat fragile because it requires specific timings. A second load of a same > module returns EBUSY only if it observes the first load in the going state. > > The following can then happen: > * A first load of module A is requested. It passes add_unformed_module() and > proceeds with full initialization. > * A second load of module A arrives. It proceeds up to add_unformed_module() > where it waits on the first module to complete its initialization. > * The first load fails because its init function happens to produce an error. > The cleanup code in do_init_module() unlinks the module from the modules > list, frees the module and finally calls wake_up_all(&module_wq). > * The second load gets woken up. It sees that there is no module with the same > name in the modules list and continues with its full initialization, which > likely again fails in the init function. Another solution would be to add one more reference counter directly into struct module. The existing counter is about dependencies on the module. It forces the module to stay in MODULE_STATE_LIVE when there is some dependency. The new reference counter would be just about life time of struct module. It should be easier than to add new structure for passing err code. Also it would allow to remove the racy finished_loading(). wait_event_interruptible() could just check mod->state. Best Regards, Petr
On Tue 2022-10-18 15:53:03, Prarit Bhargava wrote: > On 10/18/22 14:33, Luis Chamberlain wrote: > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > > The patch does address a regression observed after commit 6e6de3dee51a > > > ("kernel/module.c: Only return -EEXIST for modules that have finished > > > loading"). I guess it can have a Fixes tag added to the patch. > > > > > > I think it is hard to split this patch into parts because the implemented > > > "optimization" is the fix. > > > > git describe --contains 6e6de3dee51a > > v5.3-rc1~38^2~6 > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > right thing to do, but without it, it still leaves the issue reported > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > then your optimizations can be applied on top. > > > > Prarit Bhargava, please review Petry's work and see if you can come up > > with a sensible way to address this for stable. > > I found the original thread surrounding these changes and I do not think > this solution should have been committed to the kernel. It is not a good > solution to the problem being solved and adds complexity where none is > needed IMO. > > Quoting from the original thread, > > > > > Motivation for this patch is to fix an issue observed on larger machines with > > many CPUs where it can take a significant amount of time during boot to run > > systemd-udev-trigger.service. An x86-64 system can have already intel_pstate > > active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will > > attempt to load these modules too. The operation will eventually fail in the > > init function of a respective module where it gets recognized that another > > cpufreq driver is already loaded and -EEXIST is returned. However, one uevent > > is triggered for each CPU and so multiple loads of these modules will be > > present. 
The current code then processes all such loads individually and > > serializes them with the barrier in add_unformed_module(). > > > > The way to solve this is not in the module loading code, but in the udev > code by adding a new event or in the userspace which handles the loading > events. > > Option 1) > > Write/modify a udev rule to to use a flock userspace file lock to prevent > repeated loading. The problem with this is that it is still racy and still > consumes CPU time repeated load the ELF header and, depending on the system > (ie a large number of cpus) would still cause a boot delay. This would be > better than what we have and is worth looking at as a simple solution. I'd > like to see boot times with this change, and I'll try to come up with a > measurement on a large CPU system. This sounds like a ping-pong between projects where to put the complexity. Honestly, I think that it would be more reliable if we serialized/squashed parallel loads on the kernel side. > Option 2) > > Create a new udev action, "add_once" to indicate to userspace that the > module only needs to be loaded one time, and to ignore further load > requests. This is a bit tricky as both kernel space and userspace would > have be modified. The udev rule would end up looking very similar to what > we now. > > The benefit of option 2 is that driver writers themselves can choose which > drivers should issue "add_once" instead of add. Drivers that are known to > run on all devices at once would call "add_once" to only issue a single > load. My concern is how to distinguish first attempts and later (fixed) attempts. I mean that the module load might fail at boot. But the user might fix the root of the problem and try to load the module once again without reboot. The other load need not be direct. It might be part of another more complex operation. Reload of another module. Restart of a service. It might be problematic to do this an user-friendly way. 
And it might be much more complicated in the end. Best Regards, Petr
On 10/18/22 21:53, Prarit Bhargava wrote: > Quoting from the original thread, > >> >> Motivation for this patch is to fix an issue observed on larger machines with >> many CPUs where it can take a significant amount of time during boot to run >> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >> attempt to load these modules too. The operation will eventually fail in the >> init function of a respective module where it gets recognized that another >> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >> is triggered for each CPU and so multiple loads of these modules will be >> present. The current code then processes all such loads individually and >> serializes them with the barrier in add_unformed_module(). >> > > The way to solve this is not in the module loading code, but in the udev > code by adding a new event or in the userspace which handles the loading > events. > > Option 1) > > Write/modify a udev rule to to use a flock userspace file lock to > prevent repeated loading. The problem with this is that it is still > racy and still consumes CPU time repeated load the ELF header and, > depending on the system (ie a large number of cpus) would still cause a > boot delay. This would be better than what we have and is worth looking > at as a simple solution. I'd like to see boot times with this change, > and I'll try to come up with a measurement on a large CPU system. It is not immediately clear to me how this can be done as a udev rule. You mention that you'll try to test this on a large CPU system. Does it mean that you have a prototype implemented already? If yes, could you please share it? My reading is that one would need to update the "MODALIAS" rule in 80-drivers.rules [1] to do this locking. However, that just collects 'kmod load' (builtin) for udev to execute after all rules are processed. 
It would then be required to synchronize udev workers to prevent repeated loading? > Option 2) > > Create a new udev action, "add_once" to indicate to userspace that the > module only needs to be loaded one time, and to ignore further load > requests. This is a bit tricky as both kernel space and userspace would > have be modified. The udev rule would end up looking very similar to > what we now. > > The benefit of option 2 is that driver writers themselves can choose > which drivers should issue "add_once" instead of add. Drivers that are > known to run on all devices at once would call "add_once" to only issue > a single load. On the device event side, I more wonder if it would be possible to avoid tying up cpufreq and edac modules to individual CPU devices. Maybe their loading could be attached to some platform device, even if it means introducing an auxiliary device for this purpose? I need to look a bit more into this idea. [1] https://github.com/systemd/systemd/blob/4856f63846fc794711e1b8ec970e4c56494cd320/rules.d/80-drivers.rules Thanks, Petr
On 10/20/22 03:19, Petr Mladek wrote: > On Tue 2022-10-18 15:53:03, Prarit Bhargava wrote: >> On 10/18/22 14:33, Luis Chamberlain wrote: >>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>> The patch does address a regression observed after commit 6e6de3dee51a >>>> ("kernel/module.c: Only return -EEXIST for modules that have finished >>>> loading"). I guess it can have a Fixes tag added to the patch. >>>> >>>> I think it is hard to split this patch into parts because the implemented >>>> "optimization" is the fix. >>> >>> git describe --contains 6e6de3dee51a >>> v5.3-rc1~38^2~6 >>> >>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the >>> right thing to do, but without it, it still leaves the issue reported >>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>> then your optimizations can be applied on top. >>> >>> Prarit Bhargava, please review Petry's work and see if you can come up >>> with a sensible way to address this for stable. >> >> I found the original thread surrounding these changes and I do not think >> this solution should have been committed to the kernel. It is not a good >> solution to the problem being solved and adds complexity where none is >> needed IMO. >> >> Quoting from the original thread, >> >>> >>> Motivation for this patch is to fix an issue observed on larger machines with >>> many CPUs where it can take a significant amount of time during boot to run >>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >>> attempt to load these modules too. The operation will eventually fail in the >>> init function of a respective module where it gets recognized that another >>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >>> is triggered for each CPU and so multiple loads of these modules will be >>> present. 
The current code then processes all such loads individually and >>> serializes them with the barrier in add_unformed_module(). >>> >> >> The way to solve this is not in the module loading code, but in the udev >> code by adding a new event or in the userspace which handles the loading >> events. >> >> Option 1) >> >> Write/modify a udev rule to to use a flock userspace file lock to prevent >> repeated loading. The problem with this is that it is still racy and still >> consumes CPU time repeated load the ELF header and, depending on the system >> (ie a large number of cpus) would still cause a boot delay. This would be >> better than what we have and is worth looking at as a simple solution. I'd >> like to see boot times with this change, and I'll try to come up with a >> measurement on a large CPU system. > > This sounds like a ping-pong between projects where to put the > complexity. Honestly, I think that it would be more reliable if > we serialized/squashed parallel loads on the kernel side. > Well, that's the world we live in. Module loading ping pongs between udev and the kernel. > >> Option 2) >> >> Create a new udev action, "add_once" to indicate to userspace that the >> module only needs to be loaded one time, and to ignore further load >> requests. This is a bit tricky as both kernel space and userspace would >> have be modified. The udev rule would end up looking very similar to what >> we now. >> >> The benefit of option 2 is that driver writers themselves can choose which >> drivers should issue "add_once" instead of add. Drivers that are known to >> run on all devices at once would call "add_once" to only issue a single >> load. > > My concern is how to distinguish first attempts and later (fixed) > attempts. > > I mean that the module load might fail at boot. But the user might > fix the root of the problem and try to load the module once again > without reboot. The other load need not be direct. It might be part > of another more complex operation. 
Reload of another module. > Restart of a service. > This is an interesting complication, and something to think about. P. > It might be problematic to do this an user-friendly way. > And it might be much more complicated in the end. > > Best Regards, > Petr >
On 10/24/22 08:37, Petr Pavlu wrote: > On 10/18/22 21:53, Prarit Bhargava wrote: >> Quoting from the original thread, >> >>> >>> Motivation for this patch is to fix an issue observed on larger machines with >>> many CPUs where it can take a significant amount of time during boot to run >>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >>> attempt to load these modules too. The operation will eventually fail in the >>> init function of a respective module where it gets recognized that another >>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >>> is triggered for each CPU and so multiple loads of these modules will be >>> present. The current code then processes all such loads individually and >>> serializes them with the barrier in add_unformed_module(). >>> >> >> The way to solve this is not in the module loading code, but in the udev >> code by adding a new event or in the userspace which handles the loading >> events. >> >> Option 1) >> >> Write/modify a udev rule to to use a flock userspace file lock to >> prevent repeated loading. The problem with this is that it is still >> racy and still consumes CPU time repeated load the ELF header and, >> depending on the system (ie a large number of cpus) would still cause a >> boot delay. This would be better than what we have and is worth looking >> at as a simple solution. I'd like to see boot times with this change, >> and I'll try to come up with a measurement on a large CPU system. > > It is not immediately clear to me how this can be done as a udev rule. You > mention that you'll try to test this on a large CPU system. Does it mean that > you have a prototype implemented already? If yes, could you please share it? 
>

Hi Petr,

Sorry, I haven't had a chance to actually test this out but I see this
problem with the acpi_cpufreq and other multiple-cpu drivers which load
once per logical cpu.

I was thinking of adding a udev rule like:

ACTION!="add", GOTO="acpi_cpufreq_end"

# I may have to add CPU modaliases here to get this to work correctly
ENV{MODALIAS}=="acpi:ACPI0007:", GOTO="acpi_cpufreq_start"
GOTO="acpi_cpufreq_end"

LABEL="acpi_cpufreq_start"
ENV{DELAY_MODALIAS}="$env{MODALIAS}"
ENV{MODALIAS}=""
PROGRAM="/bin/sh -c 'flock -n /tmp/delay_acpi_cpufreq sleep 2'", RESULT=="", RUN{builtin}+="kmod load $env{DELAY_MODALIAS}"
LABEL="acpi_cpufreq_end"

> My reading is that one would need to update the "MODALIAS" rule in
> 80-drivers.rules [1] to do this locking. However, that just collects
> 'kmod load' (builtin) for udev to execute after all rules are processed.
> It would then be required to synchronize udev workers to prevent repeated
> loading?

Yes, that would be the case.

>> Option 2)
>>
>> Create a new udev action, "add_once" to indicate to userspace that the
>> module only needs to be loaded one time, and to ignore further load
>> requests. This is a bit tricky as both kernel space and userspace would
>> have be modified. The udev rule would end up looking very similar to
>> what we now.
>>
>> The benefit of option 2 is that driver writers themselves can choose
>> which drivers should issue "add_once" instead of add. Drivers that are
>> known to run on all devices at once would call "add_once" to only issue
>> a single load.
>
> On the device event side, I more wonder if it would be possible to avoid
> tying up cpufreq and edac modules to individual CPU devices. Maybe their
> loading could be attached to some platform device, even if it means
> introducing an auxiliary device for this purpose? I need to look a bit
> more into this idea.

That's an interesting idea and something I had not considered.
Creating a virtual device and loading against that device would be a much
better (easier?) solution.

P.

>
> [1] https://github.com/systemd/systemd/blob/4856f63846fc794711e1b8ec970e4c56494cd320/rules.d/80-drivers.rules
>
> Thanks,
> Petr
>
On Mon, Oct 24, 2022 at 09:22:35AM -0400, Prarit Bhargava wrote: > On 10/20/22 03:19, Petr Mladek wrote: > > On Tue 2022-10-18 15:53:03, Prarit Bhargava wrote: > > > On 10/18/22 14:33, Luis Chamberlain wrote: > > > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > > > > The patch does address a regression observed after commit 6e6de3dee51a > > > > > ("kernel/module.c: Only return -EEXIST for modules that have finished > > > > > loading"). I guess it can have a Fixes tag added to the patch. > > > > > > > > > > I think it is hard to split this patch into parts because the implemented > > > > > "optimization" is the fix. > > > > > > > > git describe --contains 6e6de3dee51a > > > > v5.3-rc1~38^2~6 > > > > > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > > > right thing to do, but without it, it still leaves the issue reported > > > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > > > then your optimizations can be applied on top. > > > > > > > > Prarit Bhargava, please review Petry's work and see if you can come up > > > > with a sensible way to address this for stable. > > > > > > I found the original thread surrounding these changes and I do not think > > > this solution should have been committed to the kernel. It is not a good > > > solution to the problem being solved and adds complexity where none is > > > needed IMO. > > > > > > Quoting from the original thread, > > > > > > > > > > > Motivation for this patch is to fix an issue observed on larger machines with > > > > many CPUs where it can take a significant amount of time during boot to run > > > > systemd-udev-trigger.service. An x86-64 system can have already intel_pstate > > > > active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will > > > > attempt to load these modules too. 
> > > > The operation will eventually fail in the
> > > > init function of a respective module where it gets recognized that another
> > > > cpufreq driver is already loaded and -EEXIST is returned. However, one uevent
> > > > is triggered for each CPU and so multiple loads of these modules will be
> > > > present. The current code then processes all such loads individually and
> > > > serializes them with the barrier in add_unformed_module().
> > > >
> > >
> > > The way to solve this is not in the module loading code, but in the udev
> > > code by adding a new event or in the userspace which handles the loading
> > > events.
> > >
> > > Option 1)
> > >
> > > Write/modify a udev rule to to use a flock userspace file lock to prevent
> > > repeated loading. The problem with this is that it is still racy and still
> > > consumes CPU time repeated load the ELF header and, depending on the system
> > > (ie a large number of cpus) would still cause a boot delay. This would be
> > > better than what we have and is worth looking at as a simple solution. I'd
> > > like to see boot times with this change, and I'll try to come up with a
> > > measurement on a large CPU system.
> >
> > This sounds like a ping-pong between projects where to put the
> > complexity. Honestly, I think that it would be more reliable if
> > we serialized/squashed parallel loads on the kernel side.
> >
>
> Well, that's the world we live in. Module loading ping pongs between udev
> and the kernel.

You are missing the point. Think of stable first. Upgrading udev is not an
option.

You can think of optimizations later that udev can do, and should perhaps
do, but that's beyond the scope of the fix needed here.

kmod (the library which modprobe now uses) can / probably already has a
lookup for modules prior to issuing a new request. But even then, we cannot
assume all users use kmod (think android). Anything can request a new
module and we should do what is sensible in-kernel.
I'd like to see us think about stable first here. Luis
On Thu, Oct 20, 2022 at 09:03:39AM +0200, Petr Mladek wrote: > On Wed 2022-10-19 14:00:55, Petr Pavlu wrote: > > On 10/18/22 20:33, Luis Chamberlain wrote: > > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > >> The patch does address a regression observed after commit 6e6de3dee51a > > >> ("kernel/module.c: Only return -EEXIST for modules that have finished > > >> loading"). I guess it can have a Fixes tag added to the patch. > > >> > > >> I think it is hard to split this patch into parts because the implemented > > >> "optimization" is the fix. > > > > > > git describe --contains 6e6de3dee51a > > > v5.3-rc1~38^2~6 > > > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > > right thing to do, but without it, it still leaves the issue reported > > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > > then your optimizations can be applied on top. > > > > Simpler could be to do the following: > > > > diff --git a/kernel/module/main.c b/kernel/module/main.c > > index d02d39c7174e..0302ac387e93 100644 > > --- a/kernel/module/main.c > > +++ b/kernel/module/main.c > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > > sched_annotate_sleep(); > > mutex_lock(&module_mutex); > > mod = find_module_all(name, strlen(name), true); > > - ret = !mod || mod->state == MODULE_STATE_LIVE; > > + ret = !mod || mod->state == MODULE_STATE_LIVE > > + || mod->state == MODULE_STATE_GOING; > > mutex_unlock(&module_mutex); > > > > return ret; > > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > > mutex_lock(&module_mutex); > > old = find_module_all(mod->name, strlen(mod->name), true); > > if (old != NULL) { > > - if (old->state != MODULE_STATE_LIVE) { > > + if (old->state == MODULE_STATE_COMING > > + || old->state == MODULE_STATE_UNFORMED) { > > /* Wait in case it fails to load. 
*/ > > mutex_unlock(&module_mutex); > > err = wait_event_interruptible(module_wq, > > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > > goto out_unlocked; > > goto again; > > } > > - err = -EEXIST; > > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > > > > goto out; > > } > > mod_update_bounds(mod); > > > > This is an alternative approach to fix the issue that 6e6de3dee51a addressed > > and it preserves the previous handling of same-module parallel loads. > > > > It works well in practice but a problem is that this previous handling is > > somewhat fragile because it requires specific timings. A second load of a same > > module returns EBUSY only if it observes the first load in the going state. > > > > The following can then happen: > > * A first load of module A is requested. It passes add_unformed_module() and > > proceeds with full initialization. > > * A second load of module A arrives. It proceeds up to add_unformed_module() > > where it waits on the first module to complete its initialization. > > * The first load fails because its init function happens to produce an error. > > The cleanup code in do_init_module() unlinks the module from the modules > > list, frees the module and finally calls wake_up_all(&module_wq). > > * The second load gets woken up. It sees that there is no module with the same > > name in the modules list and continues with its full initialization, which > > likely again fails in the init function. > > Another solution would be to add one more reference counter directly > into struct module. The existing counter is about dependencies on the > module. It forces the module to stay in MODULE_STATE_LIVE when there > is some dependency. The new reference counter would be just about > life time of struct module. > > It should be easier than to add new structure for passing err code. > > Also it would allow to remove the racy finished_loading(). > wait_event_interruptible() could just check mod->state. 
Sounds good, but let us just keep in mind we *first* want a fix for stable, which also fixes 6e6de3dee51a and addresses the fix it intended to have. So I welcome patches, let us first get a small fix in for 6e6de3dee51a and we can optimize away after. Luis
On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: > On 10/18/22 20:33, Luis Chamberlain wrote: > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > >> The patch does address a regression observed after commit 6e6de3dee51a > >> ("kernel/module.c: Only return -EEXIST for modules that have finished > >> loading"). I guess it can have a Fixes tag added to the patch. > >> > >> I think it is hard to split this patch into parts because the implemented > >> "optimization" is the fix. > > > > git describe --contains 6e6de3dee51a > > v5.3-rc1~38^2~6 > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > right thing to do, but without it, it still leaves the issue reported > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > then your optimizations can be applied on top. > > Simpler could be to do the following: > > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d02d39c7174e..0302ac387e93 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > sched_annotate_sleep(); > mutex_lock(&module_mutex); > mod = find_module_all(name, strlen(name), true); > - ret = !mod || mod->state == MODULE_STATE_LIVE; > + ret = !mod || mod->state == MODULE_STATE_LIVE > + || mod->state == MODULE_STATE_GOING; > mutex_unlock(&module_mutex); > > return ret; > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > mutex_lock(&module_mutex); > old = find_module_all(mod->name, strlen(mod->name), true); > if (old != NULL) { > - if (old->state != MODULE_STATE_LIVE) { > + if (old->state == MODULE_STATE_COMING > + || old->state == MODULE_STATE_UNFORMED) { > /* Wait in case it fails to load. 
*/ > mutex_unlock(&module_mutex); > err = wait_event_interruptible(module_wq, > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > goto out_unlocked; > goto again; > } > - err = -EEXIST; > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > goto out; > } > mod_update_bounds(mod); Prarit, can you verify this still does not break the issue you reported? David, does this also fix your issue? Petr, does this solve *any* of your issues? Can you also send a proper patch with a commit log once we get confirmation of the tests. I've been nose diving onto all of these 3 issues now and I have some ideas of how to split this crap better and save even more memory on bootup due to these stupid multiple requests. I want to first solve this regression. Luis
On 10/24/22 16:00, Prarit Bhargava wrote: > On 10/24/22 08:37, Petr Pavlu wrote: >> On 10/18/22 21:53, Prarit Bhargava wrote: >>> Quoting from the original thread, >>> >>>> >>>> Motivation for this patch is to fix an issue observed on larger machines with >>>> many CPUs where it can take a significant amount of time during boot to run >>>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >>>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >>>> attempt to load these modules too. The operation will eventually fail in the >>>> init function of a respective module where it gets recognized that another >>>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >>>> is triggered for each CPU and so multiple loads of these modules will be >>>> present. The current code then processes all such loads individually and >>>> serializes them with the barrier in add_unformed_module(). >>>> >>> >>> The way to solve this is not in the module loading code, but in the udev >>> code by adding a new event or in the userspace which handles the loading >>> events. >>> >>> Option 1) >>> >>> Write/modify a udev rule to to use a flock userspace file lock to >>> prevent repeated loading. The problem with this is that it is still >>> racy and still consumes CPU time repeated load the ELF header and, >>> depending on the system (ie a large number of cpus) would still cause a >>> boot delay. This would be better than what we have and is worth looking >>> at as a simple solution. I'd like to see boot times with this change, >>> and I'll try to come up with a measurement on a large CPU system. >> >> It is not immediately clear to me how this can be done as a udev rule. You >> mention that you'll try to test this on a large CPU system. Does it mean that >> you have a prototype implemented already? If yes, could you please share it? 
>> > > Hi Petr, > > Sorry, I haven't had a chance to actually test this out but I see this > problem with the acpi_cpufreq and other multiple-cpu drivers which load > once per logical cpu. I was thinking of adding a udev rule like: > > ACTION!="add", GOTO="acpi_cpufreq_end" > > # I may have to add CPU modaliases here to get this to work correctly > ENV{MODALIAS}=="acpi:ACPI0007:", GOTO="acpi_cpufreq_start" > > GOTO="acpi_cpufreq_start" > GOTO="acpi_cpufreq_end" > > LABEL="acpi_cpufreq_start" > > ENV{DELAY_MODALIAS}="$env{MODALIAS}" > ENV{MODALIAS}="" > PROGRAM="/bin/sh -c 'flock -n /tmp/delay_acpi_cpufreq sleep 2'", > RESULT=="", RUN{builtin}+="kmod load $env{DELAY_MODALIAS}" > > LABEL="acpi_cpufreq_end" Thanks, that is an interesting idea. I think though the artificial delay that it introduces would not be good (if I'm reading the snippet correctly). >>> Option 2) >>> >>> Create a new udev action, "add_once" to indicate to userspace that the >>> module only needs to be loaded one time, and to ignore further load >>> requests. This is a bit tricky as both kernel space and userspace would >>> have to be modified. The udev rule would end up looking very similar to >>> what we have now. >>> >>> The benefit of option 2 is that driver writers themselves can choose >>> which drivers should issue "add_once" instead of add. Drivers that are >>> known to run on all devices at once would call "add_once" to only issue >>> a single load. >> >> On the device event side, I more wonder if it would be possible to avoid tying >> up cpufreq and edac modules to individual CPU devices. Maybe their loading >> could be attached to some platform device, even if it means introducing an >> auxiliary device for this purpose? I need to look a bit more into this idea. > > That's an interesting idea and something I had not considered. Creating > a virtual device and loading against that device would be much better > (easier?) of a solution. Made some progress on this.. 
Both acpi-cpufreq and pcc-cpufreq drivers have their platform firmware interface defined by ACPI. Allowed performance states and parameters must be same for each CPU. Instead of matching these drivers on acpi:ACPI0007: or acpi:LNXCPU: (per-CPU devices), it is possible to check the ACPI namespace early and create a virtual platform device for each of these interfaces if it is available. My test patch is pasted at the end of the email. This looks to work well and has a benefit that no attempt is made to load pcc-cpufreq on machines where PCC is not available, which should be most systems. I think this change is useful independently of whether there will be eventually any improvement on the udev or module loader side. My plan is to send it for review to the cpufreq maintainers. There still remains a problem though that a CPU is typically aliased by other drivers too: # modprobe --show-depends 'cpu:type:x86,ven0000fam0006mod0055:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006A,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0083,0084,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3,00E4,00E6,00E7,00EA,00F0,00F1,00F2,00F3,00F5,00F9,00FA,00FB,00FE,00FF,0100,0101,0102,0103,0104,0111,0120,0121,0123,0125,0126,0127,0128,0129,012A,012C,012D,012E,012F,0130,0131,0132,0133,0134,0137,0138,0139,013C,013E,013F,0140,0141,0142,0143,0160,0161,0162,0163,0164,0165,0171,01C0,01C1,01C2,01C4,01C5,01C6,0203,0204,020B,024A,025A,025B,025C,025D,025F' insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/cryptd.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/crypto_simd.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/aesni-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko insmod 
/lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/cryptd.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/ghash-clmulni-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/gf128mul.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/polyval-generic.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/polyval-clmulni.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crc32c-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crc32-pclmul.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crct10dif-pclmul.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/virt/lib/irqbypass.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/kvm/kvm.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/kvm/kvm-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/hwmon/coretemp.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/thermal/intel/intel_powerclamp.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/thermal/intel/x86_pkg_temp_thermal.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/nvdimm/libnvdimm.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/acpi/nfit/nfit.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/edac/skx_edac.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency-common.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/powercap/intel_rapl_common.ko If any of these modules repeatedly fails to load then this will again delay processing of 'udevadm trigger' during boot when a given system has many CPUs. 
Cheers, Petr diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 0002eecbf870..b6fd14b829be 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -57,6 +57,7 @@ acpi-y += evged.o acpi-y += sysfs.o acpi-y += property.o acpi-$(CONFIG_X86) += acpi_cmos_rtc.o +acpi-$(CONFIG_X86) += acpi_cpufreq.o acpi-$(CONFIG_X86) += x86/apple.o acpi-$(CONFIG_X86) += x86/utils.o acpi-$(CONFIG_X86) += x86/s2idle.o diff --git a/drivers/acpi/acpi_cpufreq.c b/drivers/acpi/acpi_cpufreq.c new file mode 100644 index 000000000000..3eebe58fbe9b --- /dev/null +++ b/drivers/acpi/acpi_cpufreq.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * ACPI support for Processor Performance Control and Processor Clocking + * Control. + */ + +#include <linux/acpi.h> +#include <linux/platform_device.h> + +#include "internal.h" + +static void cpufreq_add_device(struct acpi_device *adev, const char *name) +{ + struct platform_device *pdev; + + pdev = platform_device_register_resndata( + &adev->dev, name, PLATFORM_DEVID_NONE, NULL, 0, NULL, 0); + if (IS_ERR(pdev)) + dev_err(&adev->dev, "%s platform device creation failed: %ld\n", + name, PTR_ERR(pdev)); +} + +static acpi_status +acpi_pct_match(acpi_handle handle, u32 level, void *context, + void **return_value) +{ + bool *pct = context; + + /* Check if the first CPU has _PCT. The data must be same for all. 
*/ + *pct = acpi_has_method(handle, "_PCT"); + return AE_CTRL_TERMINATE; +} + +void __init acpi_cpufreq_init(void) +{ + acpi_status status; + acpi_handle handle; + struct acpi_device *device; + bool pct = false; + + status = acpi_get_handle(NULL, "\\_SB", &handle); + if (ACPI_FAILURE(status)) + return; + + device = acpi_fetch_acpi_dev(handle); + if (device == NULL) + return; + + acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT, + ACPI_UINT32_MAX, acpi_pct_match, NULL, &pct, NULL); + if (pct) + cpufreq_add_device(device, "acpi-cpufreq"); + + if (acpi_has_method(handle, "PCCH")) + cpufreq_add_device(device, "pcc-cpufreq"); +} diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index 219c02df9a08..ab02efeaa192 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -55,8 +55,10 @@ static inline void acpi_dock_add(struct acpi_device *adev) {} #endif #ifdef CONFIG_X86 void acpi_cmos_rtc_init(void); +void acpi_cpufreq_init(void); #else static inline void acpi_cmos_rtc_init(void) {} +static inline void acpi_cpufreq_init(void) {} #endif int acpi_rev_override_setup(char *str); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index b47e93a24a9a..e51cf28fc629 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -2614,6 +2614,7 @@ void __init acpi_scan_init(void) if (!acpi_gbl_reduced_hardware) acpi_bus_scan_fixed(); + acpi_cpufreq_init(); acpi_turn_off_unused_power_resources(); acpi_scan_initialized = true; diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c index 1bb2b90ebb21..1273d42e9ca1 100644 --- a/drivers/cpufreq/acpi-cpufreq.c +++ b/drivers/cpufreq/acpi-cpufreq.c @@ -1056,18 +1056,5 @@ MODULE_PARM_DESC(acpi_pstate_strict, late_initcall(acpi_cpufreq_init); module_exit(acpi_cpufreq_exit); -static const struct x86_cpu_id __maybe_unused acpi_cpufreq_ids[] = { - X86_MATCH_FEATURE(X86_FEATURE_ACPI, NULL), - X86_MATCH_FEATURE(X86_FEATURE_HW_PSTATE, NULL), - {} -}; -MODULE_DEVICE_TABLE(x86cpu, 
acpi_cpufreq_ids); - -static const struct acpi_device_id __maybe_unused processor_device_ids[] = { - {ACPI_PROCESSOR_OBJECT_HID, }, - {ACPI_PROCESSOR_DEVICE_HID, }, - {}, -}; -MODULE_DEVICE_TABLE(acpi, processor_device_ids); - MODULE_ALIAS("acpi"); +MODULE_ALIAS("platform:acpi-cpufreq"); diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c index 9f3fc7a073d0..cc898bc3e156 100644 --- a/drivers/cpufreq/pcc-cpufreq.c +++ b/drivers/cpufreq/pcc-cpufreq.c @@ -616,12 +616,7 @@ static void __exit pcc_cpufreq_exit(void) free_percpu(pcc_cpu_info); } -static const struct acpi_device_id __maybe_unused processor_device_ids[] = { - {ACPI_PROCESSOR_OBJECT_HID, }, - {ACPI_PROCESSOR_DEVICE_HID, }, - {}, -}; -MODULE_DEVICE_TABLE(acpi, processor_device_ids); +MODULE_ALIAS("platform:pcc-cpufreq"); MODULE_AUTHOR("Matthew Garrett, Naga Chumbalkar"); MODULE_VERSION(PCC_VERSION);
On 12.11.22 02:47, Luis Chamberlain wrote: > On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >> On 10/18/22 20:33, Luis Chamberlain wrote: >>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>> The patch does address a regression observed after commit 6e6de3dee51a >>>> ("kernel/module.c: Only return -EEXIST for modules that have finished >>>> loading"). I guess it can have a Fixes tag added to the patch. >>>> >>>> I think it is hard to split this patch into parts because the implemented >>>> "optimization" is the fix. >>> >>> git describe --contains 6e6de3dee51a >>> v5.3-rc1~38^2~6 >>> >>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the >>> right thing to do, but without it, it still leaves the issue reported >>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>> then your optimizations can be applied on top. >> >> Simpler could be to do the following: >> >> diff --git a/kernel/module/main.c b/kernel/module/main.c >> index d02d39c7174e..0302ac387e93 100644 >> --- a/kernel/module/main.c >> +++ b/kernel/module/main.c >> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >> sched_annotate_sleep(); >> mutex_lock(&module_mutex); >> mod = find_module_all(name, strlen(name), true); >> - ret = !mod || mod->state == MODULE_STATE_LIVE; >> + ret = !mod || mod->state == MODULE_STATE_LIVE >> + || mod->state == MODULE_STATE_GOING; >> mutex_unlock(&module_mutex); >> >> return ret; >> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) >> mutex_lock(&module_mutex); >> old = find_module_all(mod->name, strlen(mod->name), true); >> if (old != NULL) { >> - if (old->state != MODULE_STATE_LIVE) { >> + if (old->state == MODULE_STATE_COMING >> + || old->state == MODULE_STATE_UNFORMED) { >> /* Wait in case it fails to load. 
*/ >> mutex_unlock(&module_mutex); >> err = wait_event_interruptible(module_wq, >> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) >> goto out_unlocked; >> goto again; >> } >> - err = -EEXIST; >> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >> goto out; >> } >> mod_update_bounds(mod); > > > Prarit, can you verify this still does not break the issue you reported? > David, does this also fix your issue? I didn't try, but from a quick glimpse I assume no. Allocating module space happens before handling eventual duplicates right now, before a module even is "alive" and in the MODULE_STATE_UNFORMED state. But maybe I am missing something important.
On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: > On 12.11.22 02:47, Luis Chamberlain wrote: > > On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: > > > On 10/18/22 20:33, Luis Chamberlain wrote: > > > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > > > > The patch does address a regression observed after commit 6e6de3dee51a > > > > > ("kernel/module.c: Only return -EEXIST for modules that have finished > > > > > loading"). I guess it can have a Fixes tag added to the patch. > > > > > > > > > > I think it is hard to split this patch into parts because the implemented > > > > > "optimization" is the fix. > > > > > > > > git describe --contains 6e6de3dee51a > > > > v5.3-rc1~38^2~6 > > > > > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > > > right thing to do, but without it, it still leaves the issue reported > > > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > > > then your optimizations can be applied on top. 
> > > > > > Simpler could be to do the following: > > > > > > diff --git a/kernel/module/main.c b/kernel/module/main.c > > > index d02d39c7174e..0302ac387e93 100644 > > > --- a/kernel/module/main.c > > > +++ b/kernel/module/main.c > > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > > > sched_annotate_sleep(); > > > mutex_lock(&module_mutex); > > > mod = find_module_all(name, strlen(name), true); > > > - ret = !mod || mod->state == MODULE_STATE_LIVE; > > > + ret = !mod || mod->state == MODULE_STATE_LIVE > > > + || mod->state == MODULE_STATE_GOING; > > > mutex_unlock(&module_mutex); > > > return ret; > > > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > > > mutex_lock(&module_mutex); > > > old = find_module_all(mod->name, strlen(mod->name), true); > > > if (old != NULL) { > > > - if (old->state != MODULE_STATE_LIVE) { > > > + if (old->state == MODULE_STATE_COMING > > > + || old->state == MODULE_STATE_UNFORMED) { > > > /* Wait in case it fails to load. */ > > > mutex_unlock(&module_mutex); > > > err = wait_event_interruptible(module_wq, > > > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > > > goto out_unlocked; > > > goto again; > > > } > > > - err = -EEXIST; > > > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > > > goto out; > > > } > > > mod_update_bounds(mod); > > > > > > Prarit, can you verify this still does not break the issue you reported? > > David, does this also fix your issue? > > I didn't try, but from a quick glimpse I assume no. Allocating module space > happens before handling eventual duplicates right now, before a module even > is "alive" and in the MODULE_STATE_UNFORMED state. The first two hunks are a revert of commit 6e6de3dee51a and I'm under the impression that caused your issues as *more* module states are allowed through. The last hunk tries to fix what 6e6de3dee51a wanted to do. > But maybe I am missing something important. 
Please do test if you can. Luis
On 14.11.22 16:38, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >> On 12.11.22 02:47, Luis Chamberlain wrote: >>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>> The patch does address a regression observed after commit 6e6de3dee51a >>>>>> ("kernel/module.c: Only return -EEXIST for modules that have finished >>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>> >>>>>> I think it is hard to split this patch into parts because the implemented >>>>>> "optimization" is the fix. >>>>> >>>>> git describe --contains 6e6de3dee51a >>>>> v5.3-rc1~38^2~6 >>>>> >>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the >>>>> right thing to do, but without it, it still leaves the issue reported >>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>> then your optimizations can be applied on top. 
>>>> >>>> Simpler could be to do the following: >>>> >>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>> index d02d39c7174e..0302ac387e93 100644 >>>> --- a/kernel/module/main.c >>>> +++ b/kernel/module/main.c >>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>> sched_annotate_sleep(); >>>> mutex_lock(&module_mutex); >>>> mod = find_module_all(name, strlen(name), true); >>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>> + || mod->state == MODULE_STATE_GOING; >>>> mutex_unlock(&module_mutex); >>>> return ret; >>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) >>>> mutex_lock(&module_mutex); >>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>> if (old != NULL) { >>>> - if (old->state != MODULE_STATE_LIVE) { >>>> + if (old->state == MODULE_STATE_COMING >>>> + || old->state == MODULE_STATE_UNFORMED) { >>>> /* Wait in case it fails to load. */ >>>> mutex_unlock(&module_mutex); >>>> err = wait_event_interruptible(module_wq, >>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) >>>> goto out_unlocked; >>>> goto again; >>>> } >>>> - err = -EEXIST; >>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>> goto out; >>>> } >>>> mod_update_bounds(mod); >>> >>> >>> Prarit, can you verify this still does not break the issue you reported? >>> David, does this also fix your issue? >> >> I didn't try, but from a quick glimpse I assume no. Allocating module space >> happens before handling eventual duplicates right now, before a module even >> is "alive" and in the MODULE_STATE_UNFORMED state. > > The first two hunks are a revert of commit 6e6de3dee51a and I'm under > the impression that cauased your issues as *more* modules states are > allowed through. > > The last hunk tries to fix what 6e6de3dee51a wanted to do. > Note that I don't think the issue I raised is due to 6e6de3dee51a. 
>> But maybe I am missing something important. > > Please do test if you can. I don't have the machine at hand right now. But, again, I doubt this will fix it. The flow is in load_module(): mod = layout_and_allocate(info, flags); if (IS_ERR(mod)) { ... } audit_log_kern_module(mod->name); /* Reserve our place in the list. */ err = add_unformed_module(mod); if (err) goto free_module; You can have 400 threads in layout_and_allocate() loading the same module at the same time and running out of module space. Any changes to add_unformed_module() and finished_loading() won't change that, because they are not involved before the module space allocations happened.
On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: > Note that I don't think the issue I raised is due to 6e6de3dee51a. > I don't have the machine at hand right now. But, again, I doubt this will > fix it. There are *more* modules processed after that commit. That's all. So testing would be appreciated. Luis
On 11/15/22 14:29, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> I don't have the machine at hand right now. But, again, I doubt this will >> fix it. > > There are *more* modules processed after that commit. That's all. So > testing would be appreciated. > Can anyone tell us if https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ resolves the module loading delay problem? /me is testing now P. > Luis >
On 15.11.22 20:29, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> I don't have the machine at hand right now. But, again, I doubt this will >> fix it. > > There are *more* modules processed after that commit. That's all. So > testing would be appreciated. I'll try to get a grab on a suitable system.
On 15.11.22 20:29, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> I don't have the machine at hand right now. But, again, I doubt this will >> fix it. > > There are *more* modules processed after that commit. That's all. So > testing would be appreciated. I just tested that change on top of 6.1.0-rc5+ on that large system and CONFIG_KASAN_INLINE=y. No change. [ 207.955184] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.955891] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.956253] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.956461] systemd-udevd: vmalloc error: size 2486272, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=1-7 [ 207.956573] CPU: 88 PID: 4925 Comm: systemd-udevd Not tainted 6.1.0-rc5+ #4 [ 207.956580] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020 [ 207.956584] Call Trace: [ 207.956588] <TASK> [ 207.956593] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.956593] dump_stack_lvl+0x5b/0x77 [ 207.956613] warn_alloc.cold+0x86/0x195 [ 207.956632] ? zone_watermark_ok_safe+0x2b0/0x2b0 [ 207.956641] ? slab_free_freelist_hook+0x11e/0x1d0 [ 207.956672] ? __get_vm_area_node+0x2a4/0x340 [ 207.956694] __vmalloc_node_range+0xad6/0x11b0 [ 207.956699] ? trace_contention_end+0xda/0x140 [ 207.956715] ? __mutex_lock+0x254/0x1360 [ 207.956740] ? __mutex_unlock_slowpath+0x154/0x600 [ 207.956752] ? bit_wait_io_timeout+0x170/0x170 [ 207.956761] ? vfree_atomic+0xa0/0xa0 [ 207.956775] ? load_module+0x1d8f/0x7ff0 [ 207.956786] module_alloc+0xe7/0x170 [ 207.956802] ? load_module+0x1d8f/0x7ff0 [ 207.956822] load_module+0x1d8f/0x7ff0 [ 207.956876] ? module_frob_arch_sections+0x20/0x20 [ 207.956888] ? 
ima_post_read_file+0x15a/0x180 [ 207.956904] ? ima_read_file+0x140/0x140 [ 207.956918] ? kernel_read+0x5c/0x140 [ 207.956931] ? security_kernel_post_read_file+0x6d/0xb0 [ 207.956950] ? kernel_read_file+0x21d/0x7d0 [ 207.956971] ? __x64_sys_fspick+0x270/0x270 [ 207.956999] ? __do_sys_finit_module+0xfc/0x180 [ 207.957005] __do_sys_finit_module+0xfc/0x180 [ 207.957012] ? __ia32_sys_init_module+0xa0/0xa0 [ 207.957023] ? __seccomp_filter+0x15e/0xc20 [ 207.957066] ? syscall_trace_enter.constprop.0+0x98/0x230 [ 207.957078] do_syscall_64+0x58/0x80 [ 207.957085] ? asm_exc_page_fault+0x22/0x30 [ 207.957095] ? lockdep_hardirqs_on+0x7d/0x100 [ 207.957103] entry_SYSCALL_64_after_hwframe+0x63/0xcd I have access to the system for a couple more days, if there is anything else I should test.
On 11/16/22 17:03, Prarit Bhargava wrote: > On 11/15/22 14:29, Luis Chamberlain wrote: >> On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >>> Note that I don't think the issue I raised is due to 6e6de3dee51a. >>> I don't have the machine at hand right now. But, again, I doubt this will >>> fix it. >> >> There are *more* modules processed after that commit. That's all. So >> testing would be appreciated. >> > > Can anyone tell us if > > https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ > > resolves the module loading delay problem? This patch unfortunately makes no difference on my test system. In my case, the kernel has already intel_pstate loaded when udev starts inserting a burst of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() to immediately return once the check cpufreq_get_current_driver() fails. The code modified by the patch is not reached at all. Petr
On Mon, Nov 21, 2022 at 05:00:49PM +0100, Petr Pavlu wrote: > On 11/16/22 17:03, Prarit Bhargava wrote: > > On 11/15/22 14:29, Luis Chamberlain wrote: > >> On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: > >>> Note that I don't think the issue I raised is due to 6e6de3dee51a. > >>> I don't have the machine at hand right now. But, again, I doubt this will > >>> fix it. > >> > >> There are *more* modules processed after that commit. That's all. So > >> testing would be appreciated. > >> > > > > Can anyone tell us if > > > > https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ > > > > resolves the module loading delay problem? > > This patch unfortunately makes no difference on my test system. In my case, > the kernel has already intel_pstate loaded when udev starts inserting a burst > of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() > to immediately return once the check cpufreq_get_current_driver() fails. The > code modified by the patch is not reached at all. To be clear I don't care about the patch mentioned in the above URL, I care about this: https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com David, was this the one you tested too? Prarit, so you're still left to test; the hope would be that at the very least it still fixes your issue. Petr, you had mentioned in the commit log for your second patch in this thread that your change fixes a regression. What I asked for was a patch that fixes that regression alone first so we can send it to stable. So what issue is that alternative patch fixing? Luis
On 21.11.22 20:03, Luis Chamberlain wrote: > On Mon, Nov 21, 2022 at 05:00:49PM +0100, Petr Pavlu wrote: >> On 11/16/22 17:03, Prarit Bhargava wrote: >>> On 11/15/22 14:29, Luis Chamberlain wrote: >>>> On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >>>>> Note that I don't think the issue I raised is due to 6e6de3dee51a. >>>>> I don't have the machine at hand right now. But, again, I doubt this will >>>>> fix it. >>>> >>>> There are *more* modules processed after that commit. That's all. So >>>> testing would be appreciated. >>>> >>> >>> Can anyone tell us if >>> >>> https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ >>> >>> resolves the module loading delay problem? >> >> This patch unfortunately makes no difference on my test system. In my case, >> the kernel has already intel_pstate loaded when udev starts inserting a burst >> of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() >> to immediately return once the check cpufreq_get_current_driver() fails. The >> code modified by the patch is not reached at all. > > To be clear I don't care about the patch mentioned in the above URL, I care > about this: > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > > David was this the on you tested too? Yes, that's the one I tried without luck.
On Mon, Nov 21, 2022 at 08:50:11PM +0100, David Hildenbrand wrote: > On 21.11.22 20:03, Luis Chamberlain wrote: > > On Mon, Nov 21, 2022 at 05:00:49PM +0100, Petr Pavlu wrote: > > > On 11/16/22 17:03, Prarit Bhargava wrote: > > > > On 11/15/22 14:29, Luis Chamberlain wrote: > > > > > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: > > > > > > Note that I don't think the issue I raised is due to 6e6de3dee51a. > > > > > > I don't have the machine at hand right now. But, again, I doubt this will > > > > > > fix it. > > > > > > > > > > There are *more* modules processed after that commit. That's all. So > > > > > testing would be appreciated. > > > > > > > > > > > > > Can anyone tell us if > > > > > > > > https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ > > > > > > > > resolves the module loading delay problem? > > > > > > This patch unfortunately makes no difference on my test system. In my case, > > > the kernel has already intel_pstate loaded when udev starts inserting a burst > > > of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() > > > to immediately return once the check cpufreq_get_current_driver() fails. The > > > code modified by the patch is not reached at all. > > > > To be clear I don't care about the patch mentioned in the above URL, I care > > about this: > > > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > > > > David was this the on you tested too? > > Yes, that's the one I tried without luck. OK thanks. I'm just trying to sort out the regression first before we go out and fix another issue. We will chase that issue down though and I have some other ideas to help with this further too. Luis
On 11/21/22 20:03, Luis Chamberlain wrote: > To be clear I don't care about the patch mentioned in the above URL, I care > about this: > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > [...] > > Petr, you had mentioned in the commit log for your second patch in this > thread that your change fixes a regression. What I asked for was for a > patch that fixes that regression alone first so we can send to stable. > So what issue is that alternative patch fixing? This alternative patch fixes the discussed regression with failing inserts of acpi_cpufreq and pcc_cpufreq delaying boot. As mentioned, the situation can in the worst case prevent udev from loading drivers for other devices and might cause timeouts of services waiting on them and subsequently a failed boot. The patch attempts a different solution for the problem that 6e6de3dee51a was trying to solve. Rather than waiting for the unloading to complete, it returns a different error code (-EBUSY) for modules in the GOING state. This should avoid the error situation that was described in 6e6de3dee51a (user space attempting to load a dependent module because the -EEXIST error code would suggest to user space that the first module had been loaded successfully), while avoiding the delay issue too. Petr
On Tue, Nov 22, 2022 at 02:59:18PM +0100, Petr Pavlu wrote: > On 11/21/22 20:03, Luis Chamberlain wrote: > > To be clear I don't care about the patch mentioned in the above URL, I care > > about this: > > > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > > [...] > > > > Petr, you had mentioned in the commit log for your second patch in this > > thread that your change fixes a regression. What I asked for was for a > > patch that fixes that regression alone first so we can send to stable. > > So what issue is that alternative patch fixing? > > This alternative patch fixes the discussed regression with failing inserts of > acpi_cpufreq and pcc_cpufreq delaying boot. As mentioned, the situation can in > the worst case prevent udev from loading drivers for other devices and might > cause timeouts of services waiting on them and subsequently a failed boot. > > The patch attempts a different solution for the problem that 6e6de3dee51a was > trying to solve. Rather than waiting for the unloading to complete, it returns > a different error code (-EBUSY) for modules in the GOING state. This should > avoid the error situation that was described in 6e6de3dee51a (user space > attempting to load a dependent module because the -EEXIST error code would > suggest to user space that the first module had been loaded successfully), > while avoiding the delay issue too. Great, can you send a proper patch now with a proper commit log and Cc stable tag? Luis
On 11/14/22 10:45, David Hildenbrand wrote: > On 14.11.22 16:38, Luis Chamberlain wrote: >> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >>> On 12.11.22 02:47, Luis Chamberlain wrote: >>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>>> The patch does address a regression observed after commit >>>>>>> 6e6de3dee51a >>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have >>>>>>> finished >>>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>>> >>>>>>> I think it is hard to split this patch into parts because the >>>>>>> implemented >>>>>>> "optimization" is the fix. >>>>>> >>>>>> git describe --contains 6e6de3dee51a >>>>>> v5.3-rc1~38^2~6 >>>>>> >>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would >>>>>> be the >>>>>> right thing to do, but without it, it still leaves the issue reported >>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>>> then your optimizations can be applied on top. 
>>>>> >>>>> Simpler could be to do the following: >>>>> >>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>>> index d02d39c7174e..0302ac387e93 100644 >>>>> --- a/kernel/module/main.c >>>>> +++ b/kernel/module/main.c >>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>>> sched_annotate_sleep(); >>>>> mutex_lock(&module_mutex); >>>>> mod = find_module_all(name, strlen(name), true); >>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>>> + || mod->state == MODULE_STATE_GOING; >>>>> mutex_unlock(&module_mutex); >>>>> return ret; >>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module >>>>> *mod) >>>>> mutex_lock(&module_mutex); >>>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>>> if (old != NULL) { >>>>> - if (old->state != MODULE_STATE_LIVE) { >>>>> + if (old->state == MODULE_STATE_COMING >>>>> + || old->state == MODULE_STATE_UNFORMED) { >>>>> /* Wait in case it fails to load. */ >>>>> mutex_unlock(&module_mutex); >>>>> err = wait_event_interruptible(module_wq, >>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module >>>>> *mod) >>>>> goto out_unlocked; >>>>> goto again; >>>>> } >>>>> - err = -EEXIST; >>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>>> goto out; >>>>> } >>>>> mod_update_bounds(mod); >>>> >>>> >>>> Prarit, can you verify this still does not break the issue you >>>> reported? >>>> David, does this also fix your issue? >>> >>> I didn't try, but from a quick glimpse I assume no. Allocating module >>> space >>> happens before handling eventual duplicates right now, before a >>> module even >>> is "alive" and in the MODULE_STATE_UNFORMED state. >> >> The first two hunks are a revert of commit 6e6de3dee51a and I'm under >> the impression that cauased your issues as *more* modules states are >> allowed through. >> >> The last hunk tries to fix what 6e6de3dee51a wanted to do. 
>> > > Note that I don't think the issue I raised is due to 6e6de3dee51a. > >>> But maybe I am missing something important. >> >> Please do test if you can. > > I don't have the machine at hand right now. But, again, I doubt this > will fix it. > > > The flow is in load_module(): > > mod = layout_and_allocate(info, flags); > if (IS_ERR(mod)) { > ... > } > > audit_log_kern_module(mod->name); > > /* Reserve our place in the list. */ > err = add_unformed_module(mod); > if (err) > goto free_module; > > > You can have 400 threads in layout_and_allocate() loading the same > module at the same time and running out of module space. Any changes to > add_unformed_module() and finished_loading() won't change that, because > they are not involved before the module space allocations happened. > I'd like to see a refreshed patch but I tested the latest version and see that the boot time is LONGER with the change Before: [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s multi-user.target reached after 15.606s in userspace. After: Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s multi-user.target reached after 23.093s in userspace. Subsequent reboots also indicate that userspace boot time is longer after the change. P.
On 11/28/22 17:29, Prarit Bhargava wrote: > On 11/14/22 10:45, David Hildenbrand wrote: >> On 14.11.22 16:38, Luis Chamberlain wrote: >>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >>>> On 12.11.22 02:47, Luis Chamberlain wrote: >>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>>>> The patch does address a regression observed after commit >>>>>>>> 6e6de3dee51a >>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have >>>>>>>> finished >>>>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>>>> >>>>>>>> I think it is hard to split this patch into parts because the >>>>>>>> implemented >>>>>>>> "optimization" is the fix. >>>>>>> >>>>>>> git describe --contains 6e6de3dee51a >>>>>>> v5.3-rc1~38^2~6 >>>>>>> >>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would >>>>>>> be the >>>>>>> right thing to do, but without it, it still leaves the issue reported >>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>>>> then your optimizations can be applied on top. 
>>>>>> >>>>>> Simpler could be to do the following: >>>>>> >>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>>>> index d02d39c7174e..0302ac387e93 100644 >>>>>> --- a/kernel/module/main.c >>>>>> +++ b/kernel/module/main.c >>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>>>> sched_annotate_sleep(); >>>>>> mutex_lock(&module_mutex); >>>>>> mod = find_module_all(name, strlen(name), true); >>>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>>>> + || mod->state == MODULE_STATE_GOING; >>>>>> mutex_unlock(&module_mutex); >>>>>> return ret; >>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module >>>>>> *mod) >>>>>> mutex_lock(&module_mutex); >>>>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>>>> if (old != NULL) { >>>>>> - if (old->state != MODULE_STATE_LIVE) { >>>>>> + if (old->state == MODULE_STATE_COMING >>>>>> + || old->state == MODULE_STATE_UNFORMED) { >>>>>> /* Wait in case it fails to load. */ >>>>>> mutex_unlock(&module_mutex); >>>>>> err = wait_event_interruptible(module_wq, >>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module >>>>>> *mod) >>>>>> goto out_unlocked; >>>>>> goto again; >>>>>> } >>>>>> - err = -EEXIST; >>>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>>>> goto out; >>>>>> } >>>>>> mod_update_bounds(mod); >>>>> >>>>> >>>>> Prarit, can you verify this still does not break the issue you >>>>> reported? >>>>> David, does this also fix your issue? >>>> >>>> I didn't try, but from a quick glimpse I assume no. Allocating module >>>> space >>>> happens before handling eventual duplicates right now, before a >>>> module even >>>> is "alive" and in the MODULE_STATE_UNFORMED state. >>> >>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under >>> the impression that cauased your issues as *more* modules states are >>> allowed through. 
>>> >>> The last hunk tries to fix what 6e6de3dee51a wanted to do. >>> >> >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> >>>> But maybe I am missing something important. >>> >>> Please do test if you can. >> >> I don't have the machine at hand right now. But, again, I doubt this >> will fix it. >> >> >> The flow is in load_module(): >> >> mod = layout_and_allocate(info, flags); >> if (IS_ERR(mod)) { >> ... >> } >> >> audit_log_kern_module(mod->name); >> >> /* Reserve our place in the list. */ >> err = add_unformed_module(mod); >> if (err) >> goto free_module; >> >> >> You can have 400 threads in layout_and_allocate() loading the same >> module at the same time and running out of module space. Any changes to >> add_unformed_module() and finished_loading() won't change that, because >> they are not involved before the module space allocations happened. >> > > I'd like to see a refreshed patch but I tested the latest version and > see that the boot time is LONGER with the change > > Before: > > [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze > Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s > (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s > multi-user.target reached after 15.606s in userspace. > > After: > > Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s > (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s > multi-user.target reached after 23.093s in userspace. > > Subsequent reboots also indicate that userspace boot time is longer > after the change. Thanks for testing this patch, that is an interesting result. I see the following dependency chain on my system (openSUSE Tumbleweed): multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service. My understanding is that the udev trigger service only performs the trigger operation but does not actually wait on all devices to be processed by udevd. 
In other words, handling of the forced udev events can still be in progress after multi-user.target is reached. The current serialization of same-name module loads can result in many udev workers sleeping in add_unformed_module(), which at that point reduces the pressure udevd puts on CPU time. I wonder if this then allows other work needed to reach multi-user.target to proceed faster. Could you please boot the machine with 'udev.log_level=debug' and provide me logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel and with the discussed patch? Thanks, Petr
On Tue 2022-11-29 14:13:37, Petr Pavlu wrote: > On 11/28/22 17:29, Prarit Bhargava wrote: > > On 11/14/22 10:45, David Hildenbrand wrote: > >> On 14.11.22 16:38, Luis Chamberlain wrote: > >>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: > >>>> On 12.11.22 02:47, Luis Chamberlain wrote: > >>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: > >>>>>> On 10/18/22 20:33, Luis Chamberlain wrote: > >>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > >>>>>>>> The patch does address a regression observed after commit > >>>>>>>> 6e6de3dee51a > >>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have > >>>>>>>> finished > >>>>>>>> loading"). I guess it can have a Fixes tag added to the patch. > >>>>>>>> > >>>>>>>> I think it is hard to split this patch into parts because the > >>>>>>>> implemented > >>>>>>>> "optimization" is the fix. > >>>>>>> > >>>>>>> git describe --contains 6e6de3dee51a > >>>>>>> v5.3-rc1~38^2~6 > >>>>>>> > >>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would > >>>>>>> be the > >>>>>>> right thing to do, but without it, it still leaves the issue reported > >>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and > >>>>>>> then your optimizations can be applied on top. 
> >>>>>> > >>>>>> Simpler could be to do the following: > >>>>>> > >>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c > >>>>>> index d02d39c7174e..0302ac387e93 100644 > >>>>>> --- a/kernel/module/main.c > >>>>>> +++ b/kernel/module/main.c > >>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > >>>>>> sched_annotate_sleep(); > >>>>>> mutex_lock(&module_mutex); > >>>>>> mod = find_module_all(name, strlen(name), true); > >>>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; > >>>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE > >>>>>> + || mod->state == MODULE_STATE_GOING; > >>>>>> mutex_unlock(&module_mutex); > >>>>>> return ret; > >>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module > >>>>>> *mod) > >>>>>> mutex_lock(&module_mutex); > >>>>>> old = find_module_all(mod->name, strlen(mod->name), true); > >>>>>> if (old != NULL) { > >>>>>> - if (old->state != MODULE_STATE_LIVE) { > >>>>>> + if (old->state == MODULE_STATE_COMING > >>>>>> + || old->state == MODULE_STATE_UNFORMED) { > >>>>>> /* Wait in case it fails to load. */ > >>>>>> mutex_unlock(&module_mutex); > >>>>>> err = wait_event_interruptible(module_wq, > >>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module > >>>>>> *mod) > >>>>>> goto out_unlocked; > >>>>>> goto again; > >>>>>> } > >>>>>> - err = -EEXIST; > >>>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > >>>>>> goto out; > >>>>>> } > >>>>>> mod_update_bounds(mod); > >>>>> > >>>>> > >>>>> Prarit, can you verify this still does not break the issue you > >>>>> reported? > >>>>> David, does this also fix your issue? > >>>> > >>>> I didn't try, but from a quick glimpse I assume no. Allocating module > >>>> space > >>>> happens before handling eventual duplicates right now, before a > >>>> module even > >>>> is "alive" and in the MODULE_STATE_UNFORMED state. 
> >>> > >>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under > >>> the impression that cauased your issues as *more* modules states are > >>> allowed through. > >>> > >>> The last hunk tries to fix what 6e6de3dee51a wanted to do. > >>> > >> > >> Note that I don't think the issue I raised is due to 6e6de3dee51a. > >> > >>>> But maybe I am missing something important. > >>> > >>> Please do test if you can. > >> > >> I don't have the machine at hand right now. But, again, I doubt this > >> will fix it. > >> > >> > >> The flow is in load_module(): > >> > >> mod = layout_and_allocate(info, flags); > >> if (IS_ERR(mod)) { > >> ... > >> } > >> > >> audit_log_kern_module(mod->name); > >> > >> /* Reserve our place in the list. */ > >> err = add_unformed_module(mod); > >> if (err) > >> goto free_module; > >> > >> > >> You can have 400 threads in layout_and_allocate() loading the same > >> module at the same time and running out of module space. Any changes to > >> add_unformed_module() and finished_loading() won't change that, because > >> they are not involved before the module space allocations happened. > >> > > > > I'd like to see a refreshed patch but I tested the latest version and > > see that the boot time is LONGER with the change > > > > Before: > > > > [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze > > Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s > > (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s > > multi-user.target reached after 15.606s in userspace. > > > > After: > > > > Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s > > (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s > > multi-user.target reached after 23.093s in userspace. > > > > Subsequent reboots also indicate that userspace boot time is longer > > after the change. > > Thanks for testing this patch, that is an interesting result. 
> > I see the following dependency chain on my system (openSUSE Tumbleweed): > multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service. > > My understanding is that the udev trigger service only performs the trigger > operation but does not actually wait on all devices to be processed by udevd. > In other words, handling of the forced udev events can still be in progress > after multi-user.target is reached. > > The current serialization of same-name module loads can result in many udev > workers sleeping in add_unformed_module() and hence creating at that point > less pressure on the CPU time from udevd. I wonder if this then maybe allows > other work needed to reach multi-user.target to proceed faster. Interesting theory. Another idea: the module loader now returns -EBUSY. I could imagine that userspace might handle this return value in a special way and try to load the module once again. It might make sense to test it with the updated version of the patch. This one is racy; it returns -EBUSY only when the waiting module loader sees the failed module loader in the GOING state. > Could you please boot the machine with 'udev.log_level=debug' and provide me > logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel > and with the discussed patch? I am curious what happens here. Best Regards, Petr
On 11/29/22 08:13, Petr Pavlu wrote: > On 11/28/22 17:29, Prarit Bhargava wrote: >> On 11/14/22 10:45, David Hildenbrand wrote: >>> On 14.11.22 16:38, Luis Chamberlain wrote: >>>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >>>>> On 12.11.22 02:47, Luis Chamberlain wrote: >>>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>>>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>>>>> The patch does address a regression observed after commit >>>>>>>>> 6e6de3dee51a >>>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have >>>>>>>>> finished >>>>>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>>>>> >>>>>>>>> I think it is hard to split this patch into parts because the >>>>>>>>> implemented >>>>>>>>> "optimization" is the fix. >>>>>>>> >>>>>>>> git describe --contains 6e6de3dee51a >>>>>>>> v5.3-rc1~38^2~6 >>>>>>>> >>>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would >>>>>>>> be the >>>>>>>> right thing to do, but without it, it still leaves the issue reported >>>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>>>>> then your optimizations can be applied on top. 
>>>>>>> >>>>>>> Simpler could be to do the following: >>>>>>> >>>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>>>>> index d02d39c7174e..0302ac387e93 100644 >>>>>>> --- a/kernel/module/main.c >>>>>>> +++ b/kernel/module/main.c >>>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>>>>> sched_annotate_sleep(); >>>>>>> mutex_lock(&module_mutex); >>>>>>> mod = find_module_all(name, strlen(name), true); >>>>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>>>>> + || mod->state == MODULE_STATE_GOING; >>>>>>> mutex_unlock(&module_mutex); >>>>>>> return ret; >>>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module >>>>>>> *mod) >>>>>>> mutex_lock(&module_mutex); >>>>>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>>>>> if (old != NULL) { >>>>>>> - if (old->state != MODULE_STATE_LIVE) { >>>>>>> + if (old->state == MODULE_STATE_COMING >>>>>>> + || old->state == MODULE_STATE_UNFORMED) { >>>>>>> /* Wait in case it fails to load. */ >>>>>>> mutex_unlock(&module_mutex); >>>>>>> err = wait_event_interruptible(module_wq, >>>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module >>>>>>> *mod) >>>>>>> goto out_unlocked; >>>>>>> goto again; >>>>>>> } >>>>>>> - err = -EEXIST; >>>>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>>>>> goto out; >>>>>>> } >>>>>>> mod_update_bounds(mod); >>>>>> >>>>>> >>>>>> Prarit, can you verify this still does not break the issue you >>>>>> reported? >>>>>> David, does this also fix your issue? >>>>> >>>>> I didn't try, but from a quick glimpse I assume no. Allocating module >>>>> space >>>>> happens before handling eventual duplicates right now, before a >>>>> module even >>>>> is "alive" and in the MODULE_STATE_UNFORMED state. 
>>>> >>>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under >>>> the impression that cauased your issues as *more* modules states are >>>> allowed through. >>>> >>>> The last hunk tries to fix what 6e6de3dee51a wanted to do. >>>> >>> >>> Note that I don't think the issue I raised is due to 6e6de3dee51a. >>> >>>>> But maybe I am missing something important. >>>> >>>> Please do test if you can. >>> >>> I don't have the machine at hand right now. But, again, I doubt this >>> will fix it. >>> >>> >>> The flow is in load_module(): >>> >>> mod = layout_and_allocate(info, flags); >>> if (IS_ERR(mod)) { >>> ... >>> } >>> >>> audit_log_kern_module(mod->name); >>> >>> /* Reserve our place in the list. */ >>> err = add_unformed_module(mod); >>> if (err) >>> goto free_module; >>> >>> >>> You can have 400 threads in layout_and_allocate() loading the same >>> module at the same time and running out of module space. Any changes to >>> add_unformed_module() and finished_loading() won't change that, because >>> they are not involved before the module space allocations happened. >>> >> >> I'd like to see a refreshed patch but I tested the latest version and >> see that the boot time is LONGER with the change >> >> Before: >> >> [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze >> Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s >> (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s >> multi-user.target reached after 15.606s in userspace. >> >> After: >> >> Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s >> (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s >> multi-user.target reached after 23.093s in userspace. >> >> Subsequent reboots also indicate that userspace boot time is longer >> after the change. > > Thanks for testing this patch, that is an interesting result. 
> > I see the following dependency chain on my system (openSUSE Tumbleweed): > multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service. > > My understanding is that the udev trigger service only performs the trigger > operation but does not actually wait on all devices to be processed by udevd. > In other words, handling of the forced udev events can still be in progress > after multi-user.target is reached. > > The current serialization of same-name module loads can result in many udev > workers sleeping in add_unformed_module() and hence creating at that point > less pressure on the CPU time from udevd. I wonder if this then maybe allows > other work needed to reach multi-user.target to proceed faster. > > Could you please boot the machine with 'udev.log_level=debug' and provide me > logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel > and with the discussed patch? Petr, I haven't missed your request. I'm waiting for the system to become free (I'm running a week long test on it). Hopefully I can get this data to you tomorrow AM. My apologies for the wait, P. > > Thanks, > Petr >
> >> Could you please boot the machine with 'udev.log_level=debug' and provide me >> logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel >> and with the discussed patch? > Petr, I tried to in-line the logs; however, the email bounced due to its size. I know this isn't a preferred method of passing information on LKML and other lists, but here are links to the logs: https://people.redhat.com/prarit/4petr/ Both outputs were done with, as requested, 'journalctl -b -o short-monotonic'. vanilla.log is the kernel booted with 'udev.log_level=debug' with-changeset.log is kernel + patch booted with 'udev.log_level=debug' P.
On 12/6/22 13:31, Prarit Bhargava wrote: > >> >>> Could you please boot the machine with 'udev.log_level=debug' and provide me >>> logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel >>> and with the discussed patch? >> > > Petr, I tried to in-line the logs however the email bounced due to its size. > > I know this isn't a preferred method of passing information on LKML and > other lists, but here are links to the logs: > > https://people.redhat.com/prarit/4petr/ > > Both outputs were done with, as requested, 'journalctl -b -o > short-monotonic'. > > vanilla.log is kernel booted with 'udev.log_level=debug' > with-changeset.log is kernel + patch booted with 'udev.log_level=debug' Thanks Prarit for re-testing the patch. Both logs in this case actually show similar startup times. Vanilla: [ 68.108176] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd[1]: Startup finished in 55.874s (firmware) + 25.646s (loader) + 35.793s (kernel) + 7.845s (initrd) + 24.469s (userspace) = 2min 29.629s. With the patch: [ 68.064826] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd[1]: Startup finished in 54.153s (firmware) + 19.947s (loader) + 35.965s (kernel) + 9.449s (initrd) + 22.650s (userspace) = 2min 22.165s. The system has 192 CPUs. The vanilla case shows 144x inserts of acpi_cpufreq and 43x of pcc_cpufreq: acpi_cpufreq (the first and last recorded insert): [ 47.485621] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1871]: Inserted module 'acpi_cpufreq' [...] [ 53.052401] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1914]: Inserted module 'acpi_cpufreq' pcc_cpufreq: [ 47.515221] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1871]: Inserted module 'pcc_cpufreq' [...] [ 53.067917] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[2040]: Inserted module 'pcc_cpufreq' Processing inserts of all CPU frequency modules took at least ~5.5 seconds. 
It was likely more, because not all inserts are recorded in the log; udevd messages appear to be missing from ~53.1s onward. With the patch, an insert of each module is attempted 192x: acpi_cpufreq: [ 50.107403] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1817]: Failed to insert module 'acpi_cpufreq': Device or resource busy [...] [ 50.438755] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[2016]: Inserted module 'acpi_cpufreq' pcc_cpufreq: [ 50.110731] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1849]: Failed to insert module 'pcc_cpufreq': Device or resource busy [...] [ 50.579249] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[2016]: Inserted module 'pcc_cpufreq' This shows that the patch reduced the sequence to ~0.5 seconds and that its logic appears to be working as intended. Thanks, Petr
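For anyone wanting to repeat this comparison, here is a rough helper to measure the insert window for one module from a 'journalctl -b -o short-monotonic' dump. The log format ("[ <seconds>] host systemd-udevd[pid]: Inserted module ...") is assumed from the excerpts above; adjust the patterns if your journalctl output differs.

```shell
# Print the time span (seconds) between the first and last insert
# attempt of a given module in a short-monotonic journalctl log.
insert_window() {
    # $1 = log file, $2 = module name
    awk -v mod="$2" '
        /Inserted module|Failed to insert module/ && index($0, mod) {
            t = $0
            sub(/^\[ */, "", t)   # strip the leading "[   "
            sub(/\].*/, "", t)    # keep only the monotonic timestamp
            if (first == "") first = t
            last = t
        }
        END { if (first != "") printf "%.3f\n", last - first }
    ' "$1"
}
```

Running e.g. `insert_window vanilla.log acpi_cpufreq` and `insert_window with-changeset.log acpi_cpufreq` should reproduce the ~5.5s vs ~0.5s figures quoted above (modulo the missing udevd messages).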
diff --git a/kernel/module/main.c b/kernel/module/main.c index 06cb41c3d2a1..fb97d7a0cae9 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -61,14 +61,29 @@ /* * Mutex protects: - * 1) List of modules (also safely readable with preempt_disable), + * 1) list of modules (also safely readable with preempt_disable, delete and add + * uses RCU list operations). * 2) module_use links, - * 3) mod_tree.addr_min/mod_tree.addr_max. - * (delete and add uses RCU list operations). + * 3) mod_tree.addr_min/mod_tree.addr_max, + * 4) list of unloaded_tainted_modules. + * 5) list of running_loads. */ DEFINE_MUTEX(module_mutex); LIST_HEAD(modules); +/* Shared information to track duplicate module loads. */ +struct shared_load_info { + char name[MODULE_NAME_LEN]; + refcount_t refcnt; + struct list_head list; + struct completion done; + int err; +}; +static LIST_HEAD(running_loads); + +/* Waiting for a module to finish loading? */ +static DECLARE_WAIT_QUEUE_HEAD(module_wq); + /* Work queue for freeing init sections in success case */ static void do_free_init(struct work_struct *w); static DECLARE_WORK(init_free_wq, do_free_init); @@ -122,9 +137,6 @@ static void mod_update_bounds(struct module *mod) int modules_disabled; core_param(nomodule, modules_disabled, bint, 0); -/* Waiting for a module to finish initializing? 
*/ -static DECLARE_WAIT_QUEUE_HEAD(module_wq); - static BLOCKING_NOTIFIER_HEAD(module_notify_list); int register_module_notifier(struct notifier_block *nb) @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user, strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints)); free_module(mod); - /* someone could wait for the module in add_unformed_module() */ - wake_up_interruptible(&module_wq); return 0; out: mutex_unlock(&module_mutex); @@ -2374,26 +2384,6 @@ static int post_relocation(struct module *mod, const struct load_info *info) return module_finalize(info->hdr, info->sechdrs, mod); } -/* Is this module of this name done loading? No locks held. */ -static bool finished_loading(const char *name) -{ - struct module *mod; - bool ret; - - /* - * The module_mutex should not be a heavily contended lock; - * if we get the occasional sleep here, we'll go an extra iteration - * in the wait_event_interruptible(), which is harmless. - */ - sched_annotate_sleep(); - mutex_lock(&module_mutex); - mod = find_module_all(name, strlen(name), true); - ret = !mod || mod->state == MODULE_STATE_LIVE; - mutex_unlock(&module_mutex); - - return ret; -} - /* Call module constructors. 
*/ static void do_mod_ctors(struct module *mod) { @@ -2524,7 +2514,6 @@ static noinline int do_init_module(struct module *mod) schedule_work(&init_free_wq); mutex_unlock(&module_mutex); - wake_up_interruptible(&module_wq); return 0; @@ -2540,7 +2529,6 @@ static noinline int do_init_module(struct module *mod) klp_module_going(mod); ftrace_release_mod(mod); free_module(mod); - wake_up_interruptible(&module_wq); return ret; } @@ -2552,43 +2540,133 @@ static int may_init_module(void) return 0; } +static struct shared_load_info * +shared_load_info_alloc(const struct load_info *info) +{ + struct shared_load_info *shared_info = + kzalloc(sizeof(*shared_info), GFP_KERNEL); + if (shared_info == NULL) + return ERR_PTR(-ENOMEM); + + strscpy(shared_info->name, info->name, sizeof(shared_info->name)); + refcount_set(&shared_info->refcnt, 1); + INIT_LIST_HEAD(&shared_info->list); + init_completion(&shared_info->done); + return shared_info; +} + +static void shared_load_info_get(struct shared_load_info *shared_info) +{ + refcount_inc(&shared_info->refcnt); +} + +static void shared_load_info_put(struct shared_load_info *shared_info) +{ + if (refcount_dec_and_test(&shared_info->refcnt)) + kfree(shared_info); +} + /* - * We try to place it in the list now to make sure it's unique before - * we dedicate too many resources. In particular, temporary percpu + * Check that a module load is unique and make it visible to others. The code + * looks for parallel running inserts and already loaded modules. Two inserts + * are considered equivalent if their module name matches. In case this load + * duplicates another running insert, the code waits for its completion and + * then returns -EEXIST or -EBUSY depending on whether it succeeded. + * + * Detecting early that a load is unique avoids dedicating too many cycles and + * resources to bring up the module. In particular, it prevents temporary percpu * memory exhaustion. 
+ * + * Merging same load requests then primarily helps during the boot process. It + * can happen that the kernel receives a burst of requests to load the same + * module (for example, a same module for each individual CPU) and loading it + * eventually fails during its init call. Merging the requests allows that only + * one full attempt to load the module is made. + * + * On a non-error return, it is guaranteed that this load is unique. */ -static int add_unformed_module(struct module *mod) +static struct shared_load_info *add_running_load(const struct load_info *info) { - int err; struct module *old; + struct shared_load_info *shared_info; - mod->state = MODULE_STATE_UNFORMED; - -again: mutex_lock(&module_mutex); - old = find_module_all(mod->name, strlen(mod->name), true); - if (old != NULL) { - if (old->state != MODULE_STATE_LIVE) { - /* Wait in case it fails to load. */ + + /* Search if there is a running load of a module with the same name. */ + list_for_each_entry(shared_info, &running_loads, list) + if (strcmp(shared_info->name, info->name) == 0) { + int err; + + shared_load_info_get(shared_info); mutex_unlock(&module_mutex); - err = wait_event_interruptible(module_wq, - finished_loading(mod->name)); - if (err) - goto out_unlocked; - goto again; + + err = wait_for_completion_interruptible( + &shared_info->done); + if (!err) + err = shared_info->err ? -EBUSY : -EEXIST; + shared_load_info_put(shared_info); + shared_info = ERR_PTR(err); + goto out_unlocked; } - err = -EEXIST; + + /* Search if there is a live module with the given name already. */ + old = find_module_all(info->name, strlen(info->name), true); + if (old != NULL) { + if (old->state == MODULE_STATE_LIVE) { + shared_info = ERR_PTR(-EEXIST); + goto out; + } + + /* + * Any active load always has its record in running_loads and so + * would be found above. 
This applies independent whether such + * a module is currently in MODULE_STATE_UNFORMED, + * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its + * initialization failed. It therefore means this must be an + * older going module and the caller should try later once it is + * gone. + */ + WARN_ON(old->state != MODULE_STATE_GOING); + shared_info = ERR_PTR(-EBUSY); goto out; } - mod_update_bounds(mod); - list_add_rcu(&mod->list, &modules); - mod_tree_insert(mod); - err = 0; + + /* The load is unique, make it visible to others. */ + shared_info = shared_load_info_alloc(info); + if (IS_ERR(shared_info)) + goto out; + list_add(&shared_info->list, &running_loads); out: mutex_unlock(&module_mutex); out_unlocked: - return err; + return shared_info; +} + +static void finalize_running_load(struct shared_load_info *shared_info, int err) +{ + /* Inform other duplicate inserts that the load finished. */ + mutex_lock(&module_mutex); + list_del(&shared_info->list); + shared_info->err = err; + mutex_unlock(&module_mutex); + + complete_all(&shared_info->done); + shared_load_info_put(shared_info); + + /* Tell other modules waiting on this one that it completed loading. */ + wake_up_interruptible(&module_wq); +} + +static void add_unformed_module(struct module *mod) +{ + mod->state = MODULE_STATE_UNFORMED; + + mutex_lock(&module_mutex); + mod_update_bounds(mod); + list_add_rcu(&mod->list, &modules); + mod_tree_insert(mod); + mutex_unlock(&module_mutex); } static int complete_formation(struct module *mod, struct load_info *info) @@ -2674,6 +2752,7 @@ static void cfi_init(struct module *mod); static int load_module(struct load_info *info, const char __user *uargs, int flags) { + struct shared_load_info *shared_info; struct module *mod; long err = 0; char *after_dashes; @@ -2711,38 +2790,43 @@ static int load_module(struct load_info *info, const char __user *uargs, goto free_copy; /* - * Now that we know we have the correct module name, check - * if it's blacklisted. 
+ * Now that we know we have the correct module name, check if there is + * another load of the same name in progress. */ + shared_info = add_running_load(info); + if (IS_ERR(shared_info)) { + err = PTR_ERR(shared_info); + goto free_copy; + } + + /* Check if the module is blacklisted. */ if (blacklisted(info->name)) { err = -EPERM; pr_err("Module %s is blacklisted\n", info->name); - goto free_copy; + goto free_shared; } err = rewrite_section_headers(info, flags); if (err) - goto free_copy; + goto free_shared; /* Check module struct version now, before we try to use module. */ if (!check_modstruct_version(info, info->mod)) { err = -ENOEXEC; - goto free_copy; + goto free_shared; } /* Figure out module layout, and allocate all the memory. */ mod = layout_and_allocate(info, flags); if (IS_ERR(mod)) { err = PTR_ERR(mod); - goto free_copy; + goto free_shared; } audit_log_kern_module(mod->name); /* Reserve our place in the list. */ - err = add_unformed_module(mod); - if (err) - goto free_module; + add_unformed_module(mod); #ifdef CONFIG_MODULE_SIG mod->sig_ok = info->sig_ok; @@ -2852,7 +2936,9 @@ static int load_module(struct load_info *info, const char __user *uargs, /* Done! */ trace_module_load(mod); - return do_init_module(mod); + err = do_init_module(mod); + finalize_running_load(shared_info, err); + return err; sysfs_cleanup: mod_sysfs_teardown(mod); @@ -2886,15 +2972,15 @@ static int load_module(struct load_info *info, const char __user *uargs, /* Unlink carefully: kallsyms could be walking list. */ list_del_rcu(&mod->list); mod_tree_remove(mod); - wake_up_interruptible(&module_wq); /* Wait for RCU-sched synchronizing before releasing mod->list. 
*/ synchronize_rcu(); mutex_unlock(&module_mutex); - free_module: /* Free lock-classes; relies on the preceding sync_rcu() */ lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size); module_deallocate(mod, info); + free_shared: + finalize_running_load(shared_info, err); free_copy: free_copy(info, flags); return err;
During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded, which causes the init function of the new module to return an error. The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot. This patch improves the behavior as follows: * A check whether a module load matches an already loaded module is moved right after the module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going. * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded. Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some same-name load requests, but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> --- kernel/module/main.c | 214 ++++++++++++++++++++++++++++++------------- 1 file changed, 150 insertions(+), 64 deletions(-)
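The merging pattern the patch implements can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel code: pthreads stand in for module_mutex, struct completion and refcount_t, and the names (pending_load, start_load, finish_load) are illustrative. The first caller for a name registers a record and proceeds; duplicates block until the winner finishes and then get -EEXIST or -EBUSY, mirroring add_running_load()/finalize_running_load() above.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One record per in-flight load, kept on a global list (the patch calls
 * this shared_load_info and keeps it on running_loads). */
struct pending_load {
    char name[64];
    int refcnt;              /* both counters protected by 'lock' */
    int done;                /* completion flag */
    int err;                 /* result of the winning load */
    pthread_cond_t cond;     /* stands in for struct completion */
    struct pending_load *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct pending_load *pending;

/* Returns a record the caller must later pass to finish_load(), or NULL
 * with *errp set if this request merged with an already-running load. */
static struct pending_load *start_load(const char *name, int *errp)
{
    struct pending_load *p;

    pthread_mutex_lock(&lock);
    for (p = pending; p; p = p->next) {
        if (strcmp(p->name, name) != 0)
            continue;
        /* Duplicate: take a reference and wait for the winner. */
        p->refcnt++;
        while (!p->done)
            pthread_cond_wait(&p->cond, &lock);
        *errp = p->err ? -EBUSY : -EEXIST;
        if (--p->refcnt == 0) {
            pthread_cond_destroy(&p->cond);
            free(p);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }
    /* Unique: register the load so later callers can merge with it. */
    p = calloc(1, sizeof(*p));
    if (!p) {
        *errp = -ENOMEM;
        pthread_mutex_unlock(&lock);
        return NULL;
    }
    snprintf(p->name, sizeof(p->name), "%s", name);
    p->refcnt = 1;
    pthread_cond_init(&p->cond, NULL);
    p->next = pending;
    pending = p;
    pthread_mutex_unlock(&lock);
    return p;
}

/* Unlink the record first so new callers start a fresh load, then wake
 * all merged duplicates with the result. */
static void finish_load(struct pending_load *p, int err)
{
    struct pending_load **pp;

    pthread_mutex_lock(&lock);
    for (pp = &pending; *pp; pp = &(*pp)->next) {
        if (*pp == p) {
            *pp = p->next;
            break;
        }
    }
    p->err = err;
    p->done = 1;
    pthread_cond_broadcast(&p->cond);
    if (--p->refcnt == 0) {
        pthread_cond_destroy(&p->cond);
        free(p);
    }
    pthread_mutex_unlock(&lock);
}
```

Note the ordering in finish_load(): the record is removed from the list before the completion fires, so a caller arriving after the failure retries the load from scratch instead of inheriting a stale result — the same reason the patch takes module_mutex in finalize_running_load() before complete_all().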