Message ID | 20221205103557.18363-1-petr.pavlu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] module: Don't wait for GOING modules | expand |
On Mon 2022-12-05 11:35:57, Petr Pavlu wrote: > During a system boot, it can happen that the kernel receives a burst of > requests to insert the same module but loading it eventually fails > during its init call. For instance, udev can make a request to insert > a frequency module for each individual CPU when another frequency module > is already loaded which causes the init function of the new module to > return an error. > > Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for > modules that have finished loading"), the kernel waits for modules in > MODULE_STATE_GOING state to finish unloading before making another > attempt to load the same module. > > This creates unnecessary work in the described scenario and delays the > boot. In the worst case, it can prevent udev from loading drivers for > other devices and might cause timeouts of services waiting on them and > subsequently a failed boot. > > This patch attempts a different solution for the problem 6e6de3dee51a > was trying to solve. Rather than waiting for the unloading to complete, > it returns a different error code (-EBUSY) for modules in the GOING > state. This should avoid the error situation that was described in > 6e6de3dee51a (user space attempting to load a dependent module because > the -EEXIST error code would suggest to user space that the first module > had been loaded successfully), while avoiding the delay situation too. > > Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") > Co-developed-by: Martin Wilck <mwilck@suse.com> > Signed-off-by: Martin Wilck <mwilck@suse.com> > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > Cc: stable@vger.kernel.org > --- > > Changes since v1 [1]: > - Don't attempt a new module initialization when a same-name module > completely disappeared while waiting on it, which means it went > through the GOING state implicitly already. > > [1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/ > > kernel/module/main.c | 26 +++++++++++++++++++++----- > 1 file changed, 21 insertions(+), 5 deletions(-) > > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d02d39c7174e..7a627345d4fd 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > sched_annotate_sleep(); > mutex_lock(&module_mutex); > mod = find_module_all(name, strlen(name), true); > - ret = !mod || mod->state == MODULE_STATE_LIVE; > + ret = !mod || mod->state == MODULE_STATE_LIVE > + || mod->state == MODULE_STATE_GOING; There is a actually one more race. This function is supposed to wait until load of a particular module finishes. But we might find some another module of the same name here. Maybe, it is not that bad. If many modules of the same name are loaded in parallel then hopefully most of them would wait for the first one in add_unformed_module(). And they will never appear in the @modules list. Anyway, to be on the safe side. We might want to pass the pointer to the @old module found in add_unformed_module() and make sure that we find the same module here. Something like: /* * @pending_mod: pointer to module that we are waiting for * @name: name of the module; the string must stay even when * the pending module goes away completely */ static bool finished_loading(const struct module *pending_mod, const char *name) { struct module *mod; bool ret = true; /* * The module_mutex should not be a heavily contended lock; * if we get the occasional sleep here, we'll go an extra iteration * in the wait_event_interruptible(), which is harmless. */ sched_annotate_sleep(); mutex_lock(&module_mutex); mod = find_module_all(name, strlen(name), true); /* Check if the pending module is still being loaded */ if (mod == pending_mod && (mod->state == MODULE_STATE_UNFORMED || mod->state == MODULE_STATE_COMMING)) ret = false; mutex_unlock(&module_mutex); return ret; } Another advantage is that this is using the very same logic (if condition) as add_formed_module() and not the inverted one ;-) Otherwise, the patch looks good. I wonder how it works for Prarit. Best Regards, Petr
On 12/5/22 20:54, Petr Mladek wrote: > On Mon 2022-12-05 11:35:57, Petr Pavlu wrote: >> During a system boot, it can happen that the kernel receives a burst of >> requests to insert the same module but loading it eventually fails >> during its init call. For instance, udev can make a request to insert >> a frequency module for each individual CPU when another frequency module >> is already loaded which causes the init function of the new module to >> return an error. >> >> Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for >> modules that have finished loading"), the kernel waits for modules in >> MODULE_STATE_GOING state to finish unloading before making another >> attempt to load the same module. >> >> This creates unnecessary work in the described scenario and delays the >> boot. In the worst case, it can prevent udev from loading drivers for >> other devices and might cause timeouts of services waiting on them and >> subsequently a failed boot. >> >> This patch attempts a different solution for the problem 6e6de3dee51a >> was trying to solve. Rather than waiting for the unloading to complete, >> it returns a different error code (-EBUSY) for modules in the GOING >> state. This should avoid the error situation that was described in >> 6e6de3dee51a (user space attempting to load a dependent module because >> the -EEXIST error code would suggest to user space that the first module >> had been loaded successfully), while avoiding the delay situation too. >> >> Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") >> Co-developed-by: Martin Wilck <mwilck@suse.com> >> Signed-off-by: Martin Wilck <mwilck@suse.com> >> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >> Cc: stable@vger.kernel.org >> --- >> >> Changes since v1 [1]: >> - Don't attempt a new module initialization when a same-name module >> completely disappeared while waiting on it, which means it went >> through the GOING state implicitly already. >> >> [1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/ >> >> kernel/module/main.c | 26 +++++++++++++++++++++----- >> 1 file changed, 21 insertions(+), 5 deletions(-) >> >> diff --git a/kernel/module/main.c b/kernel/module/main.c >> index d02d39c7174e..7a627345d4fd 100644 >> --- a/kernel/module/main.c >> +++ b/kernel/module/main.c >> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >> sched_annotate_sleep(); >> mutex_lock(&module_mutex); >> mod = find_module_all(name, strlen(name), true); >> - ret = !mod || mod->state == MODULE_STATE_LIVE; >> + ret = !mod || mod->state == MODULE_STATE_LIVE >> + || mod->state == MODULE_STATE_GOING; > > There is a actually one more race. > > This function is supposed to wait until load of a particular module > finishes. But we might find some another module of the same name here. > > Maybe, it is not that bad. If many modules of the same name are loaded > in parallel then hopefully most of them would wait for the first one > in add_unformed_module(). And they will never appear in the @modules > list. Good point, a load waiting in add_unformed_module() could miss that its older parallel load already finished if another insert of the same module appears in the modules list in the meantime. This requires that the new load happens to arrive just after the old one finishes and before the waiting load makes its check. This is somewhat similar to the current state where new same-name insert requests can skip and starve the ones already waiting in add_unformed_module(). I think in practice the situation should occur very rarely and be cleared soon, as long one doesn't continuously try to insert the same module. > Anyway, to be on the safe side. We might want to pass the pointer > to the @old module found in add_unformed_module() and make sure > that we find the same module here. Something like: > > /* > * @pending_mod: pointer to module that we are waiting for > * @name: name of the module; the string must stay even when > * the pending module goes away completely > */ > static bool finished_loading(const struct module *pending_mod, > const char *name) > { > struct module *mod; > bool ret = true; > > /* > * The module_mutex should not be a heavily contended lock; > * if we get the occasional sleep here, we'll go an extra iteration > * in the wait_event_interruptible(), which is harmless. > */ > sched_annotate_sleep(); > mutex_lock(&module_mutex); > > mod = find_module_all(name, strlen(name), true); > /* Check if the pending module is still being loaded */ > if (mod == pending_mod && > (mod->state == MODULE_STATE_UNFORMED || > mod->state == MODULE_STATE_COMMING)) > ret = false; > mutex_unlock(&module_mutex); > > return ret; > } The new pending_mod pointer has no ownership of the target module which can be then destroyed at any time. While no dereference of pending_mod is made in finished_loading(), this looks still problematic to me. I think it is generally good to treat a pointer as invalid and its value as indeterminate when the object it points to reaches the end of its lifetime. One specific problem here is that nothing guarantees that a new module doesn't get allocated at the exactly same address as the old one that got released and the code is actually waiting on. Improving this case then likely results in a more complex solution, similar to the one that was discussed originally. Thanks, Petr
On Tue 2022-12-06 17:57:30, Petr Pavlu wrote: > On 12/5/22 20:54, Petr Mladek wrote: > > On Mon 2022-12-05 11:35:57, Petr Pavlu wrote: > >> During a system boot, it can happen that the kernel receives a burst of > >> requests to insert the same module but loading it eventually fails > >> during its init call. For instance, udev can make a request to insert > >> a frequency module for each individual CPU when another frequency module > >> is already loaded which causes the init function of the new module to > >> return an error. > >> > >> Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for > >> modules that have finished loading"), the kernel waits for modules in > >> MODULE_STATE_GOING state to finish unloading before making another > >> attempt to load the same module. > >> > >> This creates unnecessary work in the described scenario and delays the > >> boot. In the worst case, it can prevent udev from loading drivers for > >> other devices and might cause timeouts of services waiting on them and > >> subsequently a failed boot. > >> > >> This patch attempts a different solution for the problem 6e6de3dee51a > >> was trying to solve. Rather than waiting for the unloading to complete, > >> it returns a different error code (-EBUSY) for modules in the GOING > >> state. This should avoid the error situation that was described in > >> 6e6de3dee51a (user space attempting to load a dependent module because > >> the -EEXIST error code would suggest to user space that the first module > >> had been loaded successfully), while avoiding the delay situation too. > >> > >> --- a/kernel/module/main.c > >> +++ b/kernel/module/main.c > >> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > >> sched_annotate_sleep(); > >> mutex_lock(&module_mutex); > >> mod = find_module_all(name, strlen(name), true); > >> - ret = !mod || mod->state == MODULE_STATE_LIVE; > >> + ret = !mod || mod->state == MODULE_STATE_LIVE > >> + || mod->state == MODULE_STATE_GOING; > > > > There is a actually one more race. > > > > This function is supposed to wait until load of a particular module > > finishes. But we might find some another module of the same name here. > > > > Maybe, it is not that bad. If many modules of the same name are loaded > > in parallel then hopefully most of them would wait for the first one > > in add_unformed_module(). And they will never appear in the @modules > > list. > > Good point, a load waiting in add_unformed_module() could miss that its older > parallel load already finished if another insert of the same module appears in > the modules list in the meantime. This requires that the new load happens to > arrive just after the old one finishes and before the waiting load makes its > check. > > This is somewhat similar to the current state where new same-name insert > requests can skip and starve the ones already waiting in > add_unformed_module(). > > I think in practice the situation should occur very rarely and be cleared > soon, as long one doesn't continuously try to insert the same module. I am not completely sure how rare it is. Anyway, the conditions are rather complex so it feels rare ;-) Anyway, the important part of this race is that the new module found in finished_loading() did not see the previously loaded module and did not wait in add_unformed_module(). It means that the load was not parallel enough. It means that the optimization added by this patch was not used. By other words, if the multiple loads are not parallel enough then the optimization added by this patch will not help. And the other way. If this patch helps in practice then this race is not important. > > Anyway, to be on the safe side. We might want to pass the pointer > > to the @old module found in add_unformed_module() and make sure > > that we find the same module here. Something like: > > > > /* > > * @pending_mod: pointer to module that we are waiting for > > * @name: name of the module; the string must stay even when > > * the pending module goes away completely > > */ > > static bool finished_loading(const struct module *pending_mod, > > const char *name) > > { > > struct module *mod; > > bool ret = true; > > > > /* > > * The module_mutex should not be a heavily contended lock; > > * if we get the occasional sleep here, we'll go an extra iteration > > * in the wait_event_interruptible(), which is harmless. > > */ > > sched_annotate_sleep(); > > mutex_lock(&module_mutex); > > > > mod = find_module_all(name, strlen(name), true); > > /* Check if the pending module is still being loaded */ > > if (mod == pending_mod && > > (mod->state == MODULE_STATE_UNFORMED || > > mod->state == MODULE_STATE_COMMING)) > > ret = false; > > mutex_unlock(&module_mutex); > > > > return ret; > > } > > The new pending_mod pointer has no ownership of the target module which can be > then destroyed at any time. While no dereference of pending_mod is made in > finished_loading(), this looks still problematic to me. > > I think it is generally good to treat a pointer as invalid and its value as > indeterminate when the object it points to reaches the end of its lifetime. > One specific problem here is that nothing guarantees that a new module doesn't > get allocated at the exactly same address as the old one that got released and > the code is actually waiting on. > > Improving this case then likely results in a more complex solution, similar > to the one that was discussed originally. Fair enough. Let's ignore the race in finished_loading(). Otherwise, the patch looks good to me. It is easy enough. IMHO, it makes sense to use it if it helps in practice (multiple loads are parallel enough[*]). Feel free to use: Reviewed-by: Petr Mladek <pmladek@suse.com> Of course, the ideal solution would be to avoid the multiple loads in the first place. AFAIK, everything starts in the kernel that sends the same udev events for each CPU... Best Regards, Petr
On Wed, Dec 07, 2022 at 04:15:21PM +0100, Petr Mladek wrote: > Reviewed-by: Petr Mladek <pmladek@suse.com> Queued onto modules-next. > Of course, the ideal solution would be to avoid the multiple > loads in the first place. AFAIK, everything starts in the kernel > that sends the same udev events for each CPU... Fixes go first, *then we can address enhancements. I have some old fixes I can send after htis is merged. I believe folks have others. Luis
On 12/8/22 03:44, Luis Chamberlain wrote: > On Wed, Dec 07, 2022 at 04:15:21PM +0100, Petr Mladek wrote: >> Reviewed-by: Petr Mladek <pmladek@suse.com> > > Queued onto modules-next. Thanks both. >> Of course, the ideal solution would be to avoid the multiple >> loads in the first place. AFAIK, everything starts in the kernel >> that sends the same udev events for each CPU... > > Fixes go first, *then we can address enhancements. I have some old > fixes I can send after htis is merged. I believe folks have others. This simpler fix should be sufficient and so the original more complex approach can be scraped. My plan is then to only progress the additional developed patches: a test case for the issue, minor clean-up patches, and an ACPI change to load acpi_cpufreq/pcc_cpufreq only once. Note that none of this work now addresses the problem that David Hildenbrand encountered with vmap allocation issues [1]. That needs to be fixed separately. [1] https://lore.kernel.org/linux-modules/20221013180518.217405-1-david@redhat.com/ Cheers, Petr
On Mon, Dec 12, 2022 at 12:29:02PM +0100, Petr Pavlu wrote: > On 12/8/22 03:44, Luis Chamberlain wrote: > > On Wed, Dec 07, 2022 at 04:15:21PM +0100, Petr Mladek wrote: > >> Reviewed-by: Petr Mladek <pmladek@suse.com> > > > > Queued onto modules-next. > > Thanks both. I'm **terribly sorry** for dropping this to Linus' pull request late but I kept trying to justify a few things and I couldn't do it, and so I'll reply to this patch with my last minute observations soon. I figured it was better to be safe than sorry and the potential for yet *more* regressions was *real* and I really wanted to avoid us having to crunch / race for a fix over the holidays. I will leave the fix as-is now in modules-next though, so it can get more testing, but indeed, I don't feel quite ready for the masses specially before the holidays. Luis
On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote: > During a system boot, it can happen that the kernel receives a burst of > requests to insert the same module but loading it eventually fails > during its init call. For instance, udev can make a request to insert > a frequency module for each individual CPU when another frequency module > is already loaded which causes the init function of the new module to > return an error. > > Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for > modules that have finished loading"), the kernel waits for modules in > MODULE_STATE_GOING state to finish unloading before making another > attempt to load the same module. > > This creates unnecessary work in the described scenario and delays the > boot. In the worst case, it can prevent udev from loading drivers for > other devices and might cause timeouts of services waiting on them and > subsequently a failed boot. > > This patch attempts a different solution for the problem 6e6de3dee51a > was trying to solve. Rather than waiting for the unloading to complete, > it returns a different error code (-EBUSY) for modules in the GOING > state. This should avoid the error situation that was described in > 6e6de3dee51a (user space attempting to load a dependent module because > the -EEXIST error code would suggest to user space that the first module > had been loaded successfully), while avoiding the delay situation too. > > Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") > Co-developed-by: Martin Wilck <mwilck@suse.com> > Signed-off-by: Martin Wilck <mwilck@suse.com> > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > Cc: stable@vger.kernel.org > --- > > Changes since v1 [1]: > - Don't attempt a new module initialization when a same-name module > completely disappeared while waiting on it, which means it went > through the GOING state implicitly already. > > [1] https://lore.kernel.org/linux-modules/20221123131226.24359-1-petr.pavlu@suse.com/ > > kernel/module/main.c | 26 +++++++++++++++++++++----- > 1 file changed, 21 insertions(+), 5 deletions(-) So sorry for the late review but these ideas only came to me as I drafted my pull request to Linus. Let's recap the issue from the start as this is why I ended having to pause for a second and couldn't believe myself as I wrote that we had merged this. The old commit which broke things is old, that has been in the kerne since since v5.3-rc1, May 2019, over 2 years ago! The original issue was seeing "Unknown symbols" on the kernel logs for Linux guests on Microsoft HyperV as it disables the X86_FEATURE_SMCA bit on AMD systems. That's nothing to scream about as the error message in this case was bogus and could be ignored. The *fix* to this not-big-deal-issue, however, it was applied on v5.3-rc1 (May 2019), and in the end though created a modern real tux-pain-screaming-worthy regression which in some cases creates failed boots on systems with many CPUs. The issue is that the fix triggered multiple same module requests to eventually return a failure even though the same module may be creeping up, and boot is also delayed by hammering on this check over and over again, instead of informing userspace that the kernel is busy. So this new fix limits these checks, it bounds it, and we pretty much return -EBUSY if we see the same module for which we sent a request for twice in a row. The issue is fatal for large systems given that for each CPU udev stupidly requests the same stupid frequency module, and eventually it runs into the unformed case the fix boobie trapped us into and fails. The series of failures will eventually trigger a full boot failure. > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d02d39c7174e..7a627345d4fd 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > sched_annotate_sleep(); > mutex_lock(&module_mutex); > mod = find_module_all(name, strlen(name), true); > - ret = !mod || mod->state == MODULE_STATE_LIVE; > + ret = !mod || mod->state == MODULE_STATE_LIVE > + || mod->state == MODULE_STATE_GOING; > mutex_unlock(&module_mutex); > > return ret; > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod) > > mod->state = MODULE_STATE_UNFORMED; > > -again: So this is part of my biggest concern for regression, the removal of this tag and its use. Before this we always looped back to trying again and again. > mutex_lock(&module_mutex); > old = find_module_all(mod->name, strlen(mod->name), true); > if (old != NULL) { > - if (old->state != MODULE_STATE_LIVE) { > + if (old->state == MODULE_STATE_COMING > + || old->state == MODULE_STATE_UNFORMED) { > /* Wait in case it fails to load. */ > mutex_unlock(&module_mutex); > err = wait_event_interruptible(module_wq, > finished_loading(mod->name)); > if (err) > goto out_unlocked; > - goto again; We essentially bound this now, and before we didn't. Yes we we wait for finished_loading() of the module -- but if udev is hammering tons of same requests, well, we *will* surely hit this, as many requests managed to get in before userspace saw the module present. While this typically can be useful, it means *quite a bit* of conditions which definitely *did* happen before will now *bail out* fast, to the extent that I'm not even sure why we just re-try once now. If we're going to just re-check *once* why not do something graceful like *at least* cond_resched() to let the system breathe for a *tiny bit*. Because otherwise we it would seem we can end up in similar situations as before except we now deal with that condition differently. And let's see what that looks like now. > + > + /* The module might have gone in the meantime. */ > + mutex_lock(&module_mutex); > + old = find_module_all(mod->name, strlen(mod->name), > + true); > } > - err = -EEXIST; > + > + /* > + * We are here only when the same module was being loaded. Do > + * not try to load it again right now. It prevents long delays > + * caused by serialized module load failures. It might happen > + * when more devices of the same type trigger load of > + * a particular module. > + */ > + if (old && old->state == MODULE_STATE_LIVE) > + err = -EEXIST; > + else > + err = -EBUSY; And for all those use cases we end up here now, with -EBUSY. So udev before was not bounded, and kept busy-looping on the retry in-kernel, and we now immediately bound its condition to just 2 tries to see if the old module existed and now *return* a new value to userspace. My main concerns are: 0) Why not use cond_resched() if we're just going to check twice? 1) How are we sure we are not regressing userspace by removing the boundless loop there? (even if the endless loop was stupid) 2) How is it we expect that we won't resgress userspace now by bounding that check and pretty much returning -EBUSY right away? This last part seems dangerous, in that if userspace did not expect -EBUSY and if an error before caused a module to fail and fail boot, why wouldn't we fail boot now by bailing out faster?? 3) *Fixing* a kernel regression by adding new expected API for testing against -EBUSY seems not ideal. Luis
On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote: > On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote: > > During a system boot, it can happen that the kernel receives a burst of > > requests to insert the same module but loading it eventually fails > > during its init call. For instance, udev can make a request to insert > > a frequency module for each individual CPU when another frequency module > > is already loaded which causes the init function of the new module to > > return an error. > > > > This patch attempts a different solution for the problem 6e6de3dee51a > > was trying to solve. Rather than waiting for the unloading to complete, > > it returns a different error code (-EBUSY) for modules in the GOING > > state. This should avoid the error situation that was described in > > 6e6de3dee51a (user space attempting to load a dependent module because > > the -EEXIST error code would suggest to user space that the first module > > had been loaded successfully), while avoiding the delay situation too. > > > > So sorry for the late review but these ideas only came to me as I > drafted my pull request to Linus. > > Let's recap the issue from the start as this is why I ended having > to pause for a second and couldn't believe myself as I wrote that > we had merged this. IMHO, it is perfectly fine to delay this. The problem is not trivial. It always made my head spin. There is a risk of regression. The change spent in linux next only one week. And this particular merge window is not good for risking because of the holiday season. > > diff --git a/kernel/module/main.c b/kernel/module/main.c > > index d02d39c7174e..7a627345d4fd 100644 > > --- a/kernel/module/main.c > > +++ b/kernel/module/main.c > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > > sched_annotate_sleep(); > > mutex_lock(&module_mutex); > > mod = find_module_all(name, strlen(name), true); > > - ret = !mod || mod->state == MODULE_STATE_LIVE; > > + ret = !mod || mod->state == MODULE_STATE_LIVE > > + || mod->state == MODULE_STATE_GOING; > > mutex_unlock(&module_mutex); > > > > return ret; > > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod) > > > > mod->state = MODULE_STATE_UNFORMED; > > > > -again: > > So this is part of my biggest concern for regression, the removal of > this tag and its use. > > Before this we always looped back to trying again and again. Just to be sure that we are on the same page. The loop was _not_ infinite. It serialized all attempts to load the same module. In our case, it serialized all failures and prolonged the pain. > > mutex_lock(&module_mutex); > > old = find_module_all(mod->name, strlen(mod->name), true); > > if (old != NULL) { > > - if (old->state != MODULE_STATE_LIVE) { > > + if (old->state == MODULE_STATE_COMING > > + || old->state == MODULE_STATE_UNFORMED) { > > /* Wait in case it fails to load. */ > > mutex_unlock(&module_mutex); > > err = wait_event_interruptible(module_wq, > > finished_loading(mod->name)); > > if (err) > > goto out_unlocked; > > - goto again; > > We essentially bound this now, and before we didn't. > > Yes we we wait for finished_loading() of the module -- but if udev is > hammering tons of same requests, well, we *will* surely hit this, as > many requests managed to get in before userspace saw the module present. > > While this typically can be useful, it means *quite a bit* of conditions which > definitely *did* happen before will now *bail out* fast, to the extent > that I'm not even sure why we just re-try once now. I do not understand this. We do _not_ re-try the load in the new version. We just wait for the result of the parallel attempt to load the module. Maybe, you are confused that we repeat find_module_all(). But it is the way how to find the result of the parallel load. > If we're going to > just re-check *once* why not do something graceful like *at least* > cond_resched() to let the system breathe for a *tiny bit*. We must check the result under module_mutex. We have to take this sleeping lock. There is actually a rescheduling. I do not think that cond_resched() would do any difference. > > + > > + /* The module might have gone in the meantime. */ > > + mutex_lock(&module_mutex); > > + old = find_module_all(mod->name, strlen(mod->name), > > + true); > > } > > - err = -EEXIST; > > + > > + /* > > + * We are here only when the same module was being loaded. Do > > + * not try to load it again right now. It prevents long delays > > + * caused by serialized module load failures. It might happen > > + * when more devices of the same type trigger load of > > + * a particular module. > > + */ > > + if (old && old->state == MODULE_STATE_LIVE) > > + err = -EEXIST; > > + else > > + err = -EBUSY; > > And for all those use cases we end up here now, with -EBUSY. So udev > before was not bounded, and kept busy-looping on the retry in-kernel, > and we now immediately bound its condition to just 2 tries to see if the > old module existed and now *return* a new value to userspace. > > My main concerns are: > > 0) Why not use cond_resched() if we're just going to check twice? We take module_mutex. It should cause even bigger delay than cond_resched(). > 1) How are we sure we are not regressing userspace by removing the boundless > loop there? (even if the endless loop was stupid) We could not be sure. On the other hand, if more attempts help to load the module then it is racy and not reliable. The new approach would make it better reproducible and fix the race. > 2) How is it we expect that we won't resgress userspace now by bounding > that check and pretty much returning -EBUSY right away? This last part > seems dangerous, in that if userspace did not expect -EBUSY and if an > error before caused a module to fail and fail boot, why wouldn't we fail > boot now by bailing out faster?? Same answer as for 1) > 3) *Fixing* a kernel regression by adding new expected API for testing > against -EBUSY seems not ideal. IMHO, the right solution is to fix the subsystems so that they send only one uevent. The question is how the module loader would deal with "broken" subsystems. Petr Pavlu, please, fixme. I think that there are more subsystems doing this ugly thing. I personally thing that returning -EBUSY is better than serializing all the loads. It makes eventual problem easier to reproduce and fix. Best Regards, Petr
On 12/13/22 11:17, Petr Mladek wrote: > On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote: >> 3) *Fixing* a kernel regression by adding new expected API for testing >> against -EBUSY seems not ideal. > > IMHO, the right solution is to fix the subsystems so that they send > only one uevent. > > The question is how the module loader would deal with "broken" > subsystems. Petr Pavlu, please, fixme. I think that there are > more subsystems doing this ugly thing. The issue has been seen with cpufreq and edac modules. It is a combination of them being loaded per CPU and use of a cooperative pattern to allow only one module of each such type on the system. Fixing the module loader addresses the immediate regression, but should be useful in general to cope better if some module which is tried to be loaded multiple times per N devices is failing its initialization. I'm not sure if these subsystems can be called "broken". However, I agree it makes sense to have a look at some of the mentioned drivers separately if they can be improved to try to load them only once on each system, because they can be viewed more as whole-platform drivers than per-CPU ones. Thanks, Petr
On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote: > On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote: > > On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote: > > > diff --git a/kernel/module/main.c b/kernel/module/main.c > > > index d02d39c7174e..7a627345d4fd 100644 > > > --- a/kernel/module/main.c > > > +++ b/kernel/module/main.c > > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > > > sched_annotate_sleep(); > > > mutex_lock(&module_mutex); > > > mod = find_module_all(name, strlen(name), true); > > > - ret = !mod || mod->state == MODULE_STATE_LIVE; > > > + ret = !mod || mod->state == MODULE_STATE_LIVE > > > + || mod->state == MODULE_STATE_GOING; > > > mutex_unlock(&module_mutex); > > > > > > return ret; > > > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod) > > > > > > mod->state = MODULE_STATE_UNFORMED; > > > > > > -again: > > > > So this is part of my biggest concern for regression, the removal of > > this tag and its use. > > > > Before this we always looped back to trying again and again. > > Just to be sure that we are on the same page. > > The loop was _not_ infinite. It serialized all attempts to load > the same module. In our case, it serialized all failures and > prolonged the pain. That's fair yes. The loop happens so long as an already existing module is present with the same name. > > > mutex_lock(&module_mutex); > > > old = find_module_all(mod->name, strlen(mod->name), true); > > > if (old != NULL) { > > > - if (old->state != MODULE_STATE_LIVE) { > > > + if (old->state == MODULE_STATE_COMING > > > + || old->state == MODULE_STATE_UNFORMED) { > > > /* Wait in case it fails to load. */ > > > mutex_unlock(&module_mutex); > > > err = wait_event_interruptible(module_wq, > > > finished_loading(mod->name)); > > > if (err) > > > goto out_unlocked; > > > - goto again; > > > > We essentially bound this now, and before we didn't. > > > > Yes we we wait for finished_loading() of the module -- but if udev is > > hammering tons of same requests, well, we *will* surely hit this, as > > many requests managed to get in before userspace saw the module present. > > > > While this typically can be useful, it means *quite a bit* of conditions which > > definitely *did* happen before will now *bail out* fast, to the extent > > that I'm not even sure why we just re-try once now. > > I do not understand this. We do _not_ re-try the load in the new > version. We just wait for the result of the parallel attempt to > load the module. > > Maybe, you are confused that we repeat find_module_all(). But it is > the way how to find the result of the parallel load. My point is that prior to the buggy commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") and even after that commit it we 'goto again' if an old request is found. We now simply bound this right away. Yes, the loop was not infinite, but in theory at least a few iterations were possible before whereas now immediately return -EBUSY and I don't think all use cases may be ready yet. > > If we're going to > > just re-check *once* why not do something graceful like *at least* > > cond_resched() to let the system breathe for a *tiny bit*. > > We must check the result under module_mutex. We have to take this > sleeping lock. There is actually a rescheduling. I do not think that > cond_resched() would do any difference. Makes sense. > > > + > > > + /* The module might have gone in the meantime. */ > > > + mutex_lock(&module_mutex); > > > + old = find_module_all(mod->name, strlen(mod->name), > > > + true); > > > } > > > - err = -EEXIST; > > > + > > > + /* > > > + * We are here only when the same module was being loaded. Do > > > + * not try to load it again right now. It prevents long delays > > > + * caused by serialized module load failures. It might happen > > > + * when more devices of the same type trigger load of > > > + * a particular module. > > > + */ > > > + if (old && old->state == MODULE_STATE_LIVE) > > > + err = -EEXIST; > > > + else > > > + err = -EBUSY; > > > > And for all those use cases we end up here now, with -EBUSY. So udev > > before was not bounded, and kept busy-looping on the retry in-kernel, > > and we now immediately bound its condition to just 2 tries to see if the > > old module existed and now *return* a new value to userspace. > > > > My main concerns are: > > > > 0) Why not use cond_resched() if we're just going to check twice? > > We take module_mutex. It should cause even bigger delay than cond_resched(). ACK. > > 1) How are we sure we are not regressing userspace by removing the boundless > > loop there? (even if the endless loop was stupid) > > We could not be sure. On the other hand, if more attempts help to load > the module then it is racy and not reliable. The new approach would > make it better reproducible and fix the race. Yes, but the short cut it is a userspace visible change. > > 2) How is it we expect that we won't resgress userspace now by bounding > > that check and pretty much returning -EBUSY right away? This last part > > seems dangerous, in that if userspace did not expect -EBUSY and if an > > error before caused a module to fail and fail boot, why wouldn't we fail > > boot now by bailing out faster?? > > Same answer as for 1) > > > > 3) *Fixing* a kernel regression by adding new expected API for testing > > against -EBUSY seems not ideal. > > IMHO, the right solution is to fix the subsystems so that they send > only one uevent. Makes sense, but that can take time and some folks are stuck on old kernels and perhaps porting fixes for this on subsystems may take time to land to some enterprisy kernels. And then there is also systemd that issues the requests too, at least that was reflected in commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") that commit claims it was systemd issueing the requests which I mean to interpret finit_module(), not calling modprobe. The rationale for making a regression fix with a new userspace return value is fair given the old fix made things even much worse the point some kernel boots would fail. So the rationale to suggest we *must* short-cut parallel loads as effectively as possible seems sensible *iff* that could not make things worse too but sadly I've found an isssue proactively with this fix, or at least that this issue is also not fixed: ./tools/testing/selftests/kmod/kmod.sh -t 0006 Tue Jan 17 23:18:13 UTC 2023 Running test: kmod_test_0006 - run #0 kmod_test_0006: OK! - loading kmod test kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22) ---------------------------------------------------- Custom trigger configuration for: test_kmod0 Number of threads: 50 Test_case: TEST_KMOD_FS_TYPE (2) driver: test_module fs: xfs ---------------------------------------------------- Test completed When can multiple get_fs_type() calls be issued on a system? When mounting a large number of filesystems. Sadly though this issue seems to have gone unnoticed for a while now. Even reverting commit 6e6de3dee51a doesn't fix it, and I've run into issues with trying to bisect, first due to missing Kees' patch which fixes a compiler failure on older kernel [0] and now I'm seeing this while trying to build v5.1: ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions ld: warning: creating DT_TEXTREL in a PIE make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1 make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2 make: *** [arch/x86/Makefile:283: bzImage] Error 2 [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org But we should try to bisect to see what cauased the above kmod test 0006 to start failing. > The question is how the module loader would deal with "broken" > subsystems. Petr Pavlu, please, fixme. I think that there are > more subsystems doing this ugly thing. > > I personally thing that returning -EBUSY is better than serializing > all the loads. It makes eventual problem easier to reproduce and fix. I agree with this assessment, however given the multiple get_fs_type() calls as an example, I am not sure if there are other areas which rely on the old busy-wait mechanism. *If* we knew this issue was not so common I'd go so far as to say we should pr_warn_once() on failure, but at this point in time I think it'd be pretty chatty. I don't yet have confidence that the new fast track to -EXIST or -EBUSY may not create regressions, so the below I think would be too chatty. If it wasn't true, I'd say we should keep record of these uses so we fix the callers. diff --git a/kernel/module/main.c b/kernel/module/main.c index d3be89de706d..d1ad0b510cb8 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2589,13 +2589,6 @@ static int add_unformed_module(struct module *mod) true); } - /* - * We are here only when the same module was being loaded. Do - * not try to load it again right now. It prevents long delays - * caused by serialized module load failures. It might happen - * when more devices of the same type trigger load of - * a particular module. - */ if (old && old->state == MODULE_STATE_LIVE) err = -EEXIST; else @@ -2610,6 +2603,15 @@ static int add_unformed_module(struct module *mod) out: mutex_unlock(&module_mutex); out_unlocked: + /* + * We get an error here only when there is an attempt to load the + * same module. Subsystems should strive to only issue one request + * for a needed module. Multiple requests might happen when more devices + * of the same type trigger load of a particular module. + */ + if (err) + pr_warn_once("%: dropping duplicate module request, err: %d\n", + mod->name, err); return err; } Luis
On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote: > ./tools/testing/selftests/kmod/kmod.sh -t 0006 > Tue Jan 17 23:18:13 UTC 2023 > Running test: kmod_test_0006 - run #0 > kmod_test_0006: OK! - loading kmod test > kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22) > ---------------------------------------------------- > Custom trigger configuration for: test_kmod0 > Number of threads: 50 > Test_case: TEST_KMOD_FS_TYPE (2) > driver: test_module > fs: xfs > ---------------------------------------------------- > Test completed > > When can multiple get_fs_type() calls be issued on a system? When > mounting a large number of filesystems. Sadly though this issue seems > to have gone unnoticed for a while now. Even reverting commit > 6e6de3dee51a doesn't fix it, and I've run into issues with trying > to bisect, first due to missing Kees' patch which fixes a compiler > failure on older kernel [0] and now I'm seeing this while trying to > build v5.1: > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here > ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack > ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker > ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' > ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions > ld: warning: creating DT_TEXTREL in a PIE > make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1 > make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2 > make: *** [arch/x86/Makefile:283: bzImage] Error 2 > > [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org > > But we should try to bisect to see what cauased the above kmod test 0006 > to start failing. BTW if someone beats me to bisecting the above it would be appreciated. I have a feeling having old gcc / linker, etc would be easier. Luis
On 1/18/23 01:04, Luis Chamberlain wrote: > On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote: >> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote: >>> 3) *Fixing* a kernel regression by adding new expected API for testing >>> against -EBUSY seems not ideal. >> >> IMHO, the right solution is to fix the subsystems so that they send >> only one uevent. > > Makes sense, but that can take time and some folks are stuck on old kernels > and perhaps porting fixes for this on subsystems may take time to land > to some enterprisy kernels. And then there is also systemd that issues > the requests too, at least that was reflected in commit 6e6de3dee51a > ("kernel/module.c: Only return -EEXIST for modules that have finished loading") > that commit claims it was systemd issueing the requests which I mean to > interpret finit_module(), not calling modprobe. > > The rationale for making a regression fix with a new userspace return value > is fair given the old fix made things even much worse the point some kernel > boots would fail. So the rationale to suggest we *must* short-cut > parallel loads as effectively as possible seems sensible *iff* that > could not make things worse too but sadly I've found an isssue > proactively with this fix, or at least that this issue is also not fixed: > > ./tools/testing/selftests/kmod/kmod.sh -t 0006 > Tue Jan 17 23:18:13 UTC 2023 > Running test: kmod_test_0006 - run #0 > kmod_test_0006: OK! - loading kmod test > kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22) > ---------------------------------------------------- > Custom trigger configuration for: test_kmod0 > Number of threads: 50 > Test_case: TEST_KMOD_FS_TYPE (2) > driver: test_module > fs: xfs > ---------------------------------------------------- > Test completed > > When can multiple get_fs_type() calls be issued on a system? When > mounting a large number of filesystems. Sadly though this issue seems > to have gone unnoticed for a while now. Even reverting commit > 6e6de3dee51a doesn't fix it, and I've run into issues with trying > to bisect, first due to missing Kees' patch which fixes a compiler > failure on older kernel [0] and now I'm seeing this while trying to > build v5.1: > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here > ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack > ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker > ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' > ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions > ld: warning: creating DT_TEXTREL in a PIE > make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1 > make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2 > make: *** [arch/x86/Makefile:283: bzImage] Error 2 > > [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org > > But we should try to bisect to see what cauased the above kmod test 0006 > to start failing. It is not clear to me from your description if the observed failure of kmod_test_0006 is related to the fix in this thread. The problem was not possible for me to reproduce on my system. My test was on an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config + tools/testing/selftests/kmod/config". Could you perhaps trace the test to determine where the EINVAL value comes from? >> The question is how the module loader would deal with "broken" >> subsystems. Petr Pavlu, please, fixme. I think that there are >> more subsystems doing this ugly thing. >> >> I personally thing that returning -EBUSY is better than serializing >> all the loads. It makes eventual problem easier to reproduce and fix. > > I agree with this assessment, however given the multiple get_fs_type() > calls as an example, I am not sure if there are other areas which rely on the > old busy-wait mechanism. > > *If* we knew this issue was not so common I'd go so far as to say we > should pr_warn_once() on failure, but at this point in time I think it'd > be pretty chatty. > > I don't yet have confidence that the new fast track to -EXIST or -EBUSY may > not create regressions, so the below I think would be too chatty. If it > wasn't true, I'd say we should keep record of these uses so we fix the > callers. A similar fast track logic was present prior to 6e6de3dee51a. The fix in this thread doesn't bring a completely new behavior but rather restores the previous one. The fix should have only two differences: a window when parallel loads are detected is extended, the return code is -EBUSY instead of -EEXIST. > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d3be89de706d..d1ad0b510cb8 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2589,13 +2589,6 @@ static int add_unformed_module(struct module *mod) > true); > } > > - /* > - * We are here only when the same module was being loaded. Do > - * not try to load it again right now. It prevents long delays > - * caused by serialized module load failures. It might happen > - * when more devices of the same type trigger load of > - * a particular module. > - */ > if (old && old->state == MODULE_STATE_LIVE) > err = -EEXIST; > else > @@ -2610,6 +2603,15 @@ static int add_unformed_module(struct module *mod) > out: > mutex_unlock(&module_mutex); > out_unlocked: > + /* > + * We get an error here only when there is an attempt to load the > + * same module. Subsystems should strive to only issue one request > + * for a needed module. Multiple requests might happen when more devices > + * of the same type trigger load of a particular module. > + */ > + if (err) > + pr_warn_once("%: dropping duplicate module request, err: %d\n", > + mod->name, err); > return err; > } I'm not sure if this would be the right thing to do. What I think would be good to fix is the pattern utilized by some cpufreq (and edac) modules. It is a combination of them being loaded per-CPU and using a cooperative pattern to allow only one module of each such type on the system. They can be viewed more as a whole-platform drivers that should be attempted to be loaded only once. It is still on my todo list to post a patch to cpufreq maintainers to start a discussion on this. Another consideration on the kernel side could be to try grouping other currently per-CPU loaded modules. However, in general, a system can have many hardware pieces of the same type and it looks correct to me that each is individually exposed in sysfs and via uevents with their set of needed modules. I would say that if this turns out to be a further issue in practice, udevd could optimize its process and avoid same-module loads when it is bringing up to life multiple devices. Thanks, Petr
On Wed, Jan 18, 2023 at 04:12:05PM +0100, Petr Pavlu wrote: > On 1/18/23 01:04, Luis Chamberlain wrote: > > The rationale for making a regression fix with a new userspace return value > > is fair given the old fix made things even much worse the point some kernel > > boots would fail. So the rationale to suggest we *must* short-cut > > parallel loads as effectively as possible seems sensible *iff* that > > could not make things worse too but sadly I've found an isssue > > proactively with this fix, or at least that this issue is also not fixed: > > > > ./tools/testing/selftests/kmod/kmod.sh -t 0006 > > Tue Jan 17 23:18:13 UTC 2023 > > Running test: kmod_test_0006 - run #0 > > kmod_test_0006: OK! - loading kmod test > > kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22) > > ---------------------------------------------------- > > Custom trigger configuration for: test_kmod0 > > Number of threads: 50 > > Test_case: TEST_KMOD_FS_TYPE (2) > > driver: test_module > > fs: xfs > > ---------------------------------------------------- > > Test completed > > > > When can multiple get_fs_type() calls be issued on a system? When > > mounting a large number of filesystems. Sadly though this issue seems > > to have gone unnoticed for a while now. Even reverting commit > > 6e6de3dee51a doesn't fix it, and I've run into issues with trying > > to bisect, first due to missing Kees' patch which fixes a compiler > > failure on older kernel [0] and now I'm seeing this while trying to > > build v5.1: > > > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here > > ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack > > ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker > > ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' > > ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions > > ld: warning: creating DT_TEXTREL in a PIE > > make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1 > > make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2 > > make: *** [arch/x86/Makefile:283: bzImage] Error 2 > > > > [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org > > > > But we should try to bisect to see what cauased the above kmod test 0006 > > to start failing. > > It is not clear to me from your description if the observed failure of > kmod_test_0006 is related to the fix in this thread. The issue happens with and without the patch in this thread, I'd just hate to exacerbate the issue further. > The problem was not possible for me to reproduce on my system. My test was on > an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config + > tools/testing/selftests/kmod/config". With the patch? > Could you perhaps trace the test to determine where the EINVAL value comes > from? Sure, it'll take a bit. Luis
On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote: > and now I'm seeing this while trying to build v5.1: > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here You need to backport aa5cacdc29d7 ("x86/asm: Replace __force_order with a memory clobber") for that. Happens when building older kernels with newer toolchain.
On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote: >On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote: >> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote: >> > On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote: >> > > diff --git a/kernel/module/main.c b/kernel/module/main.c >> > > index d02d39c7174e..7a627345d4fd 100644 >> > > --- a/kernel/module/main.c >> > > +++ b/kernel/module/main.c >> > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >> > > sched_annotate_sleep(); >> > > mutex_lock(&module_mutex); >> > > mod = find_module_all(name, strlen(name), true); >> > > - ret = !mod || mod->state == MODULE_STATE_LIVE; >> > > + ret = !mod || mod->state == MODULE_STATE_LIVE >> > > + || mod->state == MODULE_STATE_GOING; >> > > mutex_unlock(&module_mutex); >> > > >> > > return ret; >> > > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod) >> > > >> > > mod->state = MODULE_STATE_UNFORMED; >> > > >> > > -again: >> > >> > So this is part of my biggest concern for regression, the removal of >> > this tag and its use. >> > >> > Before this we always looped back to trying again and again. >> >> Just to be sure that we are on the same page. >> >> The loop was _not_ infinite. It serialized all attempts to load >> the same module. In our case, it serialized all failures and >> prolonged the pain. > >That's fair yes. The loop happens so long as an already existing module is >present with the same name. > >> > > mutex_lock(&module_mutex); >> > > old = find_module_all(mod->name, strlen(mod->name), true); >> > > if (old != NULL) { >> > > - if (old->state != MODULE_STATE_LIVE) { >> > > + if (old->state == MODULE_STATE_COMING >> > > + || old->state == MODULE_STATE_UNFORMED) { >> > > /* Wait in case it fails to load. */ >> > > mutex_unlock(&module_mutex); >> > > err = wait_event_interruptible(module_wq, >> > > finished_loading(mod->name)); >> > > if (err) >> > > goto out_unlocked; >> > > - goto again; >> > >> > We essentially bound this now, and before we didn't. >> > >> > Yes we we wait for finished_loading() of the module -- but if udev is >> > hammering tons of same requests, well, we *will* surely hit this, as >> > many requests managed to get in before userspace saw the module present. >> > >> > While this typically can be useful, it means *quite a bit* of conditions which >> > definitely *did* happen before will now *bail out* fast, to the extent >> > that I'm not even sure why we just re-try once now. >> >> I do not understand this. We do _not_ re-try the load in the new >> version. We just wait for the result of the parallel attempt to >> load the module. >> >> Maybe, you are confused that we repeat find_module_all(). But it is >> the way how to find the result of the parallel load. > >My point is that prior to the buggy commit 6e6de3dee51a ("kernel/module.c: Only >return -EEXIST for modules that have finished loading") and even after that >commit it we 'goto again' if an old request is found. We now simply bound this >right away. Yes, the loop was not infinite, but in theory at least a few >iterations were possible before whereas now immediately return -EBUSY >and I don't think all use cases may be ready yet. > >> > If we're going to >> > just re-check *once* why not do something graceful like *at least* >> > cond_resched() to let the system breathe for a *tiny bit*. >> >> We must check the result under module_mutex. We have to take this >> sleeping lock. There is actually a rescheduling. I do not think that >> cond_resched() would do any difference. > >Makes sense. > >> > > + >> > > + /* The module might have gone in the meantime. */ >> > > + mutex_lock(&module_mutex); >> > > + old = find_module_all(mod->name, strlen(mod->name), >> > > + true); >> > > } >> > > - err = -EEXIST; >> > > + >> > > + /* >> > > + * We are here only when the same module was being loaded. Do >> > > + * not try to load it again right now. It prevents long delays >> > > + * caused by serialized module load failures. It might happen >> > > + * when more devices of the same type trigger load of >> > > + * a particular module. >> > > + */ >> > > + if (old && old->state == MODULE_STATE_LIVE) >> > > + err = -EEXIST; >> > > + else >> > > + err = -EBUSY; >> > >> > And for all those use cases we end up here now, with -EBUSY. So udev >> > before was not bounded, and kept busy-looping on the retry in-kernel, >> > and we now immediately bound its condition to just 2 tries to see if the >> > old module existed and now *return* a new value to userspace. >> > >> > My main concerns are: >> > >> > 0) Why not use cond_resched() if we're just going to check twice? >> >> We take module_mutex. It should cause even bigger delay than cond_resched(). > >ACK. > >> > 1) How are we sure we are not regressing userspace by removing the boundless >> > loop there? (even if the endless loop was stupid) >> >> We could not be sure. On the other hand, if more attempts help to load >> the module then it is racy and not reliable. The new approach would >> make it better reproducible and fix the race. > >Yes, but the short cut it is a userspace visible change. > >> > 2) How is it we expect that we won't resgress userspace now by bounding >> > that check and pretty much returning -EBUSY right away? This last part >> > seems dangerous, in that if userspace did not expect -EBUSY and if an >> > error before caused a module to fail and fail boot, why wouldn't we fail >> > boot now by bailing out faster?? >> >> Same answer as for 1) >> >> >> > 3) *Fixing* a kernel regression by adding new expected API for testing >> > against -EBUSY seems not ideal. >> >> IMHO, the right solution is to fix the subsystems so that they send >> only one uevent. > >Makes sense, but that can take time and some folks are stuck on old kernels >and perhaps porting fixes for this on subsystems may take time to land >to some enterprisy kernels. And then there is also systemd that issues >the requests too, at least that was reflected in commit 6e6de3dee51a >("kernel/module.c: Only return -EEXIST for modules that have finished loading") >that commit claims it was systemd issueing the requests which I mean to >interpret finit_module(), not calling modprobe. just a comment on this, if it helps anything: commit 6e6de3dee51a says systemd, but it would be more accurate to say udev or systemd-udev to disambiguate with pid 1. udev uses libkmod to load modules so it's pretty much the same logic as modprobe. Lucas De Marchi
On Wed, Jan 18, 2023 at 09:02:52PM +0100, Borislav Petkov wrote: > On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote: > > and now I'm seeing this while trying to build v5.1: > > > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here > > You need to backport > > aa5cacdc29d7 ("x86/asm: Replace __force_order with a memory clobber") > > for that. > > Happens when building older kernels with newer toolchain. Thanks that certainly helps. FWIW if someone needs it, I had to remove the double colons on write cr0 and cr4 to compile, but this crashed :( Any ideas? diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/kaslr_64.c index 748456c365f4..9557c5a15b91 100644 --- a/arch/x86/boot/compressed/kaslr_64.c +++ b/arch/x86/boot/compressed/kaslr_64.c @@ -29,9 +29,6 @@ #define __PAGE_OFFSET __PAGE_OFFSET_BASE #include "../../mm/ident_map.c" -/* Used by pgtable.h asm code to force instruction serialization. */ -unsigned long __force_order; - /* Used to track our page table allocation area. */ struct alloc_pgt_data { unsigned char *pgt_buf; diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c index f8debf7aeb4c..7471b48524cb 100644 --- a/arch/x86/boot/compressed/pgtable_64.c +++ b/arch/x86/boot/compressed/pgtable_64.c @@ -5,15 +5,6 @@ #include "pgtable.h" #include "../string.h" -/* - * __force_order is used by special_insns.h asm code to force instruction - * serialization. - * - * It is not referenced from the code, but GCC < 5 with -fPIE would fail - * due to an undefined symbol. Define it to make these ancient GCCs work. - */ -unsigned long __force_order; - #define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */ #define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 0a3c4cab39db..5586e4cf62d3 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -7,49 +7,50 @@ #include <asm/nops.h> +#define __FORCE_ORDER "m"(*(unsigned int *)0x1000UL) + /* - * Volatile isn't enough to prevent the compiler from reordering the - * read/write functions for the control registers and messing everything up. - * A memory clobber would solve the problem, but would prevent reordering of - * all loads stores around it, which can hurt performance. Solution is to - * use a variable and mimic reads and writes to it to enforce serialization + * The compiler should not reorder volatile asm statements with respect to each + * other: they should execute in program order. However GCC 4.9.x and 5.x have + * a bug (which was fixed in 8.1, 7.3 and 6.5) where they might reorder + * volatile asm. The write functions are not affected since they have memory + * clobbers preventing reordering. To prevent reads from being reordered with + * respect to writes, use a dummy memory operand */ -extern unsigned long __force_order; - static inline unsigned long native_read_cr0(void) { unsigned long val; - asm volatile("mov %%cr0,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr0,%0\n\t" : "=r" (val) : __FORCE_ORDER); return val; } static inline void native_write_cr0(unsigned long val) { - asm volatile("mov %0,%%cr0": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr0": : "r" (val) : "memory"); } static inline unsigned long native_read_cr2(void) { unsigned long val; - asm volatile("mov %%cr2,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr2,%0\n\t" : "=r" (val) : __FORCE_ORDER); return val; } static inline void native_write_cr2(unsigned long val) { - asm volatile("mov %0,%%cr2": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr2": : "r" (val) : "memory"); } static inline unsigned long __native_read_cr3(void) { unsigned long val; - asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER); return val; } static inline void native_write_cr3(unsigned long val) { - asm volatile("mov %0,%%cr3": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr3": : "r" (val) : "memory"); } static inline unsigned long native_read_cr4(void) @@ -64,17 +65,17 @@ static inline unsigned long native_read_cr4(void) asm volatile("1: mov %%cr4, %0\n" "2:\n" _ASM_EXTABLE(1b, 2b) - : "=r" (val), "=m" (__force_order) : "0" (0)); + : "=r" (val) : "0" (0), __FORCE_ORDER); #else /* CR4 always exists on x86_64. */ - asm volatile("mov %%cr4,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr4,%0\n\t" : "=r" (val) : __FORCE_ORDER); #endif return val; } static inline void native_write_cr4(unsigned long val) { - asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr4": : "r" (val) : "memory"); } #ifdef CONFIG_X86_64
On 1/18/23 19:42, Luis Chamberlain wrote: > On Wed, Jan 18, 2023 at 04:12:05PM +0100, Petr Pavlu wrote: >> On 1/18/23 01:04, Luis Chamberlain wrote: >>> The rationale for making a regression fix with a new userspace return value >>> is fair given the old fix made things even much worse the point some kernel >>> boots would fail. So the rationale to suggest we *must* short-cut >>> parallel loads as effectively as possible seems sensible *iff* that >>> could not make things worse too but sadly I've found an isssue >>> proactively with this fix, or at least that this issue is also not fixed: >>> >>> ./tools/testing/selftests/kmod/kmod.sh -t 0006 >>> Tue Jan 17 23:18:13 UTC 2023 >>> Running test: kmod_test_0006 - run #0 >>> kmod_test_0006: OK! - loading kmod test >>> kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22) >>> ---------------------------------------------------- >>> Custom trigger configuration for: test_kmod0 >>> Number of threads: 50 >>> Test_case: TEST_KMOD_FS_TYPE (2) >>> driver: test_module >>> fs: xfs >>> ---------------------------------------------------- >>> Test completed >>> >>> When can multiple get_fs_type() calls be issued on a system? When >>> mounting a large number of filesystems. Sadly though this issue seems >>> to have gone unnoticed for a while now. Even reverting commit >>> 6e6de3dee51a doesn't fix it, and I've run into issues with trying >>> to bisect, first due to missing Kees' patch which fixes a compiler >>> failure on older kernel [0] and now I'm seeing this while trying to >>> build v5.1: >>> >>> ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; >>> arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here >>> ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack >>> ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker >>> ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' >>> ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions >>> ld: warning: creating DT_TEXTREL in a PIE >>> make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1 >>> make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2 >>> make: *** [arch/x86/Makefile:283: bzImage] Error 2 >>> >>> [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org >>> >>> But we should try to bisect to see what cauased the above kmod test 0006 >>> to start failing. >> >> It is not clear to me from your description if the observed failure of >> kmod_test_0006 is related to the fix in this thread. > > The issue happens with and without the patch in this thread, I'd just hate to > exacerbate the issue further. > >> The problem was not possible for me to reproduce on my system. My test was on >> an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config + >> tools/testing/selftests/kmod/config". > > With the patch? It is same for me without and with the patch. The test doesn't produce any failure on my test system. The complete kmod.sh selftest passes too. Cheers, Petr
On Wed 2023-01-18 16:12:05, Petr Pavlu wrote: > On 1/18/23 01:04, Luis Chamberlain wrote: > > On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote: > >> On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote: > >>> 3) *Fixing* a kernel regression by adding new expected API for testing > >>> against -EBUSY seems not ideal. > >> > >> IMHO, the right solution is to fix the subsystems so that they send > >> only one uevent. > > > > Makes sense, but that can take time and some folks are stuck on old kernels > > and perhaps porting fixes for this on subsystems may take time to land > > to some enterprisy kernels. And then there is also systemd that issues > > the requests too, at least that was reflected in commit 6e6de3dee51a > > ("kernel/module.c: Only return -EEXIST for modules that have finished loading") > > that commit claims it was systemd issueing the requests which I mean to > > interpret finit_module(), not calling modprobe. > > > > The rationale for making a regression fix with a new userspace return value > > is fair given the old fix made things even much worse the point some kernel > > boots would fail. So the rationale to suggest we *must* short-cut > > parallel loads as effectively as possible seems sensible *iff* that > > could not make things worse too but sadly I've found an isssue > > proactively with this fix, or at least that this issue is also not fixed: > > > > ./tools/testing/selftests/kmod/kmod.sh -t 0006 > > Tue Jan 17 23:18:13 UTC 2023 > > Running test: kmod_test_0006 - run #0 > > kmod_test_0006: OK! - loading kmod test > > kmod_test_0006: FAIL, test expects SUCCESS (0) - got -EINVAL (-22) > > ---------------------------------------------------- > > Custom trigger configuration for: test_kmod0 > > Number of threads: 50 > > Test_case: TEST_KMOD_FS_TYPE (2) > > driver: test_module > > fs: xfs > > ---------------------------------------------------- > > Test completed > > > > When can multiple get_fs_type() calls be issued on a system? When > > mounting a large number of filesystems. This patch should not change the behavior that much. The parallel load still waits until the pending one finishes. The difference is that it does not try to load it once again. But when the first load fails then something is broken anyway. I could imagine that this could cause regression from the user POV when the first failure is not important from some reasons. But it means that the things work only by chance and the problem might hit the user later anyway. Another difference is that the parallel load finishes immediately also when it sees a going module. But it should not affect that much the multiple get_fs_type() calls. They are all trying to load the module. > Sadly though this issue seems > > to have gone unnoticed for a while now. Even reverting commit > > 6e6de3dee51a doesn't fix it, and I've run into issues with trying > > to bisect, first due to missing Kees' patch which fixes a compiler > > failure on older kernel [0] and now I'm seeing this while trying to > > build v5.1: > > > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here > > ld: warning: arch/x86/boot/compressed/efi_thunk_64.o: missing .note.GNU-stack section implies executable stack > > ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker > > ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text' > > ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions > > ld: warning: creating DT_TEXTREL in a PIE > > make[2]: *** [arch/x86/boot/compressed/Makefile:118: arch/x86/boot/compressed/vmlinux] Error 1 > > make[1]: *** [arch/x86/boot/Makefile:112: arch/x86/boot/compressed/vmlinux] Error 2 > > make: *** [arch/x86/Makefile:283: bzImage] Error 2 > > > > [0] http://lore.kernel.org/lkml/20220213182443.4037039-1-keescook@chromium.org > > > > But we should try to bisect to see what cauased the above kmod test 0006 > > to start failing. > > It is not clear to me from your description if the observed failure of > kmod_test_0006 is related to the fix in this thread. > > The problem was not possible for me to reproduce on my system. My test was on > an 8-CPU x86_64 machine using v6.2-rc4 with "defconfig + kvm_guest.config + > tools/testing/selftests/kmod/config". I can't reproduce it either. I guess that it needs some "good" timing, probably more CPUs or so. I wonder if it races with module -r that removes the module before it tries to load it multiple times in parallel. Does the test pass when you add sleep after the module -r, like this: diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh index 7189715d7960..8a020f90a3f6 100755 --- a/tools/testing/selftests/kmod/kmod.sh +++ b/tools/testing/selftests/kmod/kmod.sh @@ -322,6 +322,7 @@ kmod_defaults_fs() { config_reset modprobe -r $DEFAULT_KMOD_FS + sleep 1 config_set_fs $DEFAULT_KMOD_FS config_set_test_case_fs } > Could you perhaps trace the test to determine where the EINVAL value comes > from? Yes, the -EINVAL error is strange. It is returned also in kernel/module/main.c on few locations. But neither of them looks like a good candidate. My assumption is that it is more likely returned by the module_init() callback from the loaded module. But it is just a guess and I might be wrong. I wonder if it is cause by a delayed release of some resources, when the module is removed, e.g. sysfs or so. It might theoretically cause conflict when they still exist and the reloaded module tries to create them again. Best Regards, Petr
On Wed, Jan 18, 2023 at 05:23:15PM -0800, Luis Chamberlain wrote: > On Wed, Jan 18, 2023 at 09:02:52PM +0100, Borislav Petkov wrote: > > On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote: > > > and now I'm seeing this while trying to build v5.1: > > > > > > ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; > > > arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here > > > > You need to backport > > > > aa5cacdc29d7 ("x86/asm: Replace __force_order with a memory clobber") > > > > for that. > > > > Happens when building older kernels with newer toolchain. > > Thanks that certainly helps. FWIW if someone needs it, I had to remove > the double colons on write cr0 and cr4 to compile, but this crashed :( > Any ideas? The issue was my write for cr0 and cr4 didn't have r+, so: diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 5586e4cf62d3..e4da7248edcf 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -26,7 +26,7 @@ static inline unsigned long native_read_cr0(void) static inline void native_write_cr0(unsigned long val) { - asm volatile("mov %0,%%cr0": : "r" (val) : "memory"); + asm volatile("mov %0,%%cr0": "+r" (val) : : "memory"); } static inline unsigned long native_read_cr2(void) @@ -75,7 +75,7 @@ static inline unsigned long native_read_cr4(void) static inline void native_write_cr4(unsigned long val) { - asm volatile("mov %0,%%cr4": : "r" (val) : "memory"); + asm volatile("mov %0,%%cr4": "+r" (val) : : "memory"); } #ifdef CONFIG_X86_64 The complete patch below now boots on v5.2 however this still ends up in a panic, it dies on cpu_startup_entry(). [ 0.595694] x86: Booting SMP configuration: [ 0.596807] .... node #0, CPUs: #1 [ 0.264759] kvm-clock: cpu 1, msr 15edff041, secondary cpu clock [ 0.264759] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1c6/0x1d0 [ 0.264759] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0+ #5 [ 0.264759] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 [ 0.264759] Call Trace: [ 0.264759] dump_stack+0x5c/0x80 [ 0.264759] panic+0x102/0x2f5 [ 0.264759] ? start_secondary+0x1c6/0x1d0 [ 0.264759] __stack_chk_fail+0x15/0x20 [ 0.264759] start_secondary+0x1c6/0x1d0 [ 0.264759] secondary_startup_64+0xa4/0xb0 [ 0.264759] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1c6/0x1d0 ]--- (gdb) l *(start_secondary+0x1c6) 0xffffffff810506e6 is at arch/x86/kernel/smpboot.c:265. 260 boot_init_stack_canary(); 261 262 x86_cpuinit.setup_percpu_clockev(); 263 264 wmb(); 265 cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); 266 } 267 268 /** 269 * topology_is_primary_thread - Check whether CPU is the primary SMT thread The full attempt to backport aa5cacdc29d7 to v5.2: diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/kaslr_64.c index 748456c365f4..9557c5a15b91 100644 --- a/arch/x86/boot/compressed/kaslr_64.c +++ b/arch/x86/boot/compressed/kaslr_64.c @@ -29,9 +29,6 @@ #define __PAGE_OFFSET __PAGE_OFFSET_BASE #include "../../mm/ident_map.c" -/* Used by pgtable.h asm code to force instruction serialization. */ -unsigned long __force_order; - /* Used to track our page table allocation area. */ struct alloc_pgt_data { unsigned char *pgt_buf; diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c index f8debf7aeb4c..7471b48524cb 100644 --- a/arch/x86/boot/compressed/pgtable_64.c +++ b/arch/x86/boot/compressed/pgtable_64.c @@ -5,15 +5,6 @@ #include "pgtable.h" #include "../string.h" -/* - * __force_order is used by special_insns.h asm code to force instruction - * serialization. - * - * It is not referenced from the code, but GCC < 5 with -fPIE would fail - * due to an undefined symbol. Define it to make these ancient GCCs work. - */ -unsigned long __force_order; - #define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */ #define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 0a3c4cab39db..e4da7248edcf 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -7,49 +7,50 @@ #include <asm/nops.h> +#define __FORCE_ORDER "m"(*(unsigned int *)0x1000UL) + /* - * Volatile isn't enough to prevent the compiler from reordering the - * read/write functions for the control registers and messing everything up. - * A memory clobber would solve the problem, but would prevent reordering of - * all loads stores around it, which can hurt performance. Solution is to - * use a variable and mimic reads and writes to it to enforce serialization + * The compiler should not reorder volatile asm statements with respect to each + * other: they should execute in program order. However GCC 4.9.x and 5.x have + * a bug (which was fixed in 8.1, 7.3 and 6.5) where they might reorder + * volatile asm. The write functions are not affected since they have memory + * clobbers preventing reordering. To prevent reads from being reordered with + * respect to writes, use a dummy memory operand */ -extern unsigned long __force_order; - static inline unsigned long native_read_cr0(void) { unsigned long val; - asm volatile("mov %%cr0,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr0,%0\n\t" : "=r" (val) : __FORCE_ORDER); return val; } static inline void native_write_cr0(unsigned long val) { - asm volatile("mov %0,%%cr0": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr0": "+r" (val) : : "memory"); } static inline unsigned long native_read_cr2(void) { unsigned long val; - asm volatile("mov %%cr2,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr2,%0\n\t" : "=r" (val) : __FORCE_ORDER); return val; } static inline void native_write_cr2(unsigned long val) { - asm volatile("mov %0,%%cr2": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr2": : "r" (val) : "memory"); } static inline unsigned long __native_read_cr3(void) { unsigned long val; - asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER); return val; } static inline void native_write_cr3(unsigned long val) { - asm volatile("mov %0,%%cr3": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr3": : "r" (val) : "memory"); } static inline unsigned long native_read_cr4(void) @@ -64,17 +65,17 @@ static inline unsigned long native_read_cr4(void) asm volatile("1: mov %%cr4, %0\n" "2:\n" _ASM_EXTABLE(1b, 2b) - : "=r" (val), "=m" (__force_order) : "0" (0)); + : "=r" (val) : "0" (0), __FORCE_ORDER); #else /* CR4 always exists on x86_64. */ - asm volatile("mov %%cr4,%0\n\t" : "=r" (val), "=m" (__force_order)); + asm volatile("mov %%cr4,%0\n\t" : "=r" (val) : __FORCE_ORDER); #endif return val; } static inline void native_write_cr4(unsigned long val) { - asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order)); + asm volatile("mov %0,%%cr4": "+r" (val) : : "memory"); } #ifdef CONFIG_X86_64
On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > Yes, the -EINVAL error is strange. It is returned also in > kernel/module/main.c on few locations. But neither of them > looks like a good candidate. OK I updated to next-20230119 and I don't see the issue now. Odd. It could have been an issue with next-20221207 which I was on before. I'll run some more test and if nothing fails I'll send the fix to Linux for rc5. Luis
On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > > Yes, the -EINVAL error is strange. It is returned also in > > kernel/module/main.c on few locations. But neither of them > > looks like a good candidate. > > OK I updated to next-20230119 and I don't see the issue now. > Odd. It could have been an issue with next-20221207 which I was > on before. > > I'll run some more test and if nothing fails I'll send the fix > to Linux for rc5. Jeesh it just occured to me the difference, which I'll have to test next, for next-20221207 I had enabled module compression on kdevops with zstd. You can see the issues on kdevops git log with that... and I finally disabled it and the kmod test issue is gone. So it could be that but I just am ending my day so will check tomorrow if that was it. But if someone else beats me then great. With kdevops it should be a matter of just enabling zstd as I just bumped support for next-20230119 and that has module decompression disabled. Luis
On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > > > Yes, the -EINVAL error is strange. It is returned also in > > > kernel/module/main.c on few locations. But neither of them > > > looks like a good candidate. > > > > OK I updated to next-20230119 and I don't see the issue now. > > Odd. It could have been an issue with next-20221207 which I was > > on before. > > > > I'll run some more test and if nothing fails I'll send the fix > > to Linux for rc5. > > Jeesh it just occured to me the difference, which I'll have to > test next, for next-20221207 I had enabled module compression > on kdevops with zstd. > > You can see the issues on kdevops git log with that... and I finally > disabled it and the kmod test issue is gone. So it could be that > but I just am ending my day so will check tomorrow if that was it. > But if someone else beats me then great. > > With kdevops it should be a matter of just enabling zstd as I > just bumped support for next-20230119 and that has module decompression > disabled. So indeed, my suspcions were correct. There is one bug with compression on debian: - gzip compressed modules don't end up in the initramfs There is a generic upstream kmod bug: - modprobe --show-depends won't grok compressed modules so initramfs tools that use this as Debian likely are not getting module dependencies installed in their initramfs But using xz compression reveals 4 GiB memory is not enough for kmod.sh test 0004, the -EINVAL is due to an OOM hit on modprobe so the request fails. That's a test bug. But increasing memory (8 GiB seems to do it) still reveals kmod.sh test 0009 does fail, not all the times, and it is why the test runs 150 times if you run the test once. The failure is not deterministic but surely fails for me every time at least once out of the 150 runs. Test 0009 tries to trigger running kmod_concurrent over max_modprobes for get_fs_type(). I'm trying to test to see if I this failure can trigger without module compression but I don't see the failure yet. Reverting the patch on this thread on linux-next does not fix that issue and so this has perhaps been broken for a much longer time. And so this patch still remains a candidate fix. Luis
On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > I wonder if it races with module -r that removes the module before > it tries to load it multiple times in parallel. > > Does the test pass when you add sleep after the module -r, like this: > > diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh > index 7189715d7960..8a020f90a3f6 100755 > --- a/tools/testing/selftests/kmod/kmod.sh > +++ b/tools/testing/selftests/kmod/kmod.sh > @@ -322,6 +322,7 @@ kmod_defaults_fs() > { > config_reset > modprobe -r $DEFAULT_KMOD_FS > + sleep 1 > config_set_fs $DEFAULT_KMOD_FS > config_set_test_case_fs > } FWIW I was curious if the kmod test 0009.sh now could pass with this too, but alas it can sometimes fail too. [ 138.590663] misc test_kmod0: reset [ 139.729273] misc test_kmod0: Test case: TEST_KMOD_FS_TYPE (2) [ 139.732874] misc test_kmod0: Test filesystem to load: xfs [ 139.736230] misc test_kmod0: Number of threads to run: 62 [ 139.739575] misc test_kmod0: Thread IDs will range from 0 - 61 [ 140.402079] __request_module: 2 callbacks suppressed [ 140.402082] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.418075] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.430124] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.450119] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.478037] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.498080] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.518066] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.530207] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.549949] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.562182] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module fs-xfs, throttling... [ 140.582342] misc test_kmod0: No errors were found while initializing threads [ 145.385414] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled [ 145.461962] request_module: modprobe fs-xfs cannot be processed, kmod busy with 50 threads for more than 5 seconds now [ 145.652541] misc test_kmod0: Done: 62 threads have all run now [ 145.655020] misc test_kmod0: Last thread to run: 39 [ 145.655831] misc test_kmod0: Results: [ 145.656450] misc test_kmod0: Sync thread 0 fs: xfs [ 145.657164] misc test_kmod0: Sync thread 1 fs: xfs [ 145.657875] misc test_kmod0: Sync thread 2 fs: xfs [ 145.658615] misc test_kmod0: Sync thread 3 fs: xfs [ 145.659324] misc test_kmod0: Sync thread 4 fs: xfs [ 145.660026] misc test_kmod0: Sync thread 5 fs: xfs [ 145.660948] misc test_kmod0: Sync thread 6 fs: xfs [ 145.661870] misc test_kmod0: Sync thread 7 fs: xfs [ 145.662625] misc test_kmod0: Sync thread 8 fs: xfs [ 145.663363] misc test_kmod0: Sync thread 9 fs: xfs [ 145.664073] misc test_kmod0: Sync thread 10 fs: xfs [ 145.664791] misc test_kmod0: Sync thread 11 fs: xfs [ 145.665509] misc test_kmod0: Sync thread 12 fs: xfs [ 145.666252] misc test_kmod0: Sync thread 13 fs: xfs [ 145.666971] misc test_kmod0: Sync thread 14 fs: xfs [ 145.667693] misc test_kmod0: Sync thread 15 fs: xfs [ 145.668405] misc test_kmod0: Sync thread 16 fs: xfs [ 145.669113] misc test_kmod0: Sync thread 17 fs: xfs [ 145.669823] misc test_kmod0: Sync thread 18 fs: xfs [ 145.670634] misc test_kmod0: Sync thread 19 fs: xfs [ 145.671390] misc test_kmod0: Sync thread 20 fs: xfs [ 145.672126] misc test_kmod0: Sync thread 21 fs: xfs [ 145.672842] misc test_kmod0: Sync thread 22 fs: xfs [ 145.673561] misc test_kmod0: Sync thread 23 fs: xfs [ 145.674327] misc test_kmod0: Sync thread 24 fs: xfs [ 145.675051] misc test_kmod0: Sync thread 25 fs: xfs [ 145.675772] misc test_kmod0: Sync thread 26 fs: xfs [ 145.676491] misc test_kmod0: Sync thread 27 fs: xfs [ 145.677207] misc test_kmod0: Sync thread 28 fs: xfs [ 145.677920] misc test_kmod0: Sync thread 29 fs: xfs [ 145.678658] misc test_kmod0: Sync thread 30 fs: xfs [ 145.679369] misc test_kmod0: Sync thread 31 fs: xfs [ 145.680075] misc test_kmod0: Sync thread 32 fs: xfs [ 145.680780] misc test_kmod0: Sync thread 33 fs: xfs [ 145.681481] misc test_kmod0: Sync thread 34 fs: xfs [ 145.682211] misc test_kmod0: Sync thread 35 fs: xfs [ 145.682925] misc test_kmod0: Sync thread 36 fs: xfs [ 145.683633] misc test_kmod0: Sync thread 37 fs: xfs [ 145.684363] misc test_kmod0: Sync thread 38 fs: xfs [ 145.685196] misc test_kmod0: Sync thread 39 fs: xfs [ 145.685896] misc test_kmod0: Sync thread 40 fs: xfs [ 145.687009] misc test_kmod0: Sync thread 41 fs: xfs [ 145.687800] misc test_kmod0: Sync thread 42 fs: xfs [ 145.688540] misc test_kmod0: Sync thread 43 fs: xfs [ 145.689227] misc test_kmod0: Sync thread 44 fs: xfs [ 145.689901] misc test_kmod0: Sync thread 45 fs: xfs [ 145.690924] misc test_kmod0: Sync thread 46 fs: xfs [ 145.691721] misc test_kmod0: Sync thread 47 fs: xfs [ 145.692533] misc test_kmod0: Sync thread 48 fs: xfs [ 145.693329] misc test_kmod0: Sync thread 49 fs: xfs [ 145.694140] misc test_kmod0: Sync thread 50 fs: xfs [ 145.694910] misc test_kmod0: Sync thread 51 fs: xfs [ 145.695695] misc test_kmod0: Sync thread 52 fs: NULL [ 145.696461] misc test_kmod0: Sync thread 53 fs: xfs [ 145.697229] misc test_kmod0: Sync thread 54 fs: xfs [ 145.698027] misc test_kmod0: Sync thread 55 fs: xfs [ 145.698898] misc test_kmod0: Sync thread 56 fs: xfs [ 145.699953] misc test_kmod0: Sync thread 57 fs: xfs [ 145.700667] misc test_kmod0: Sync thread 58 fs: xfs [ 145.701362] misc test_kmod0: Sync thread 59 fs: xfs [ 145.702078] misc test_kmod0: Sync thread 60 fs: xfs [ 145.702765] misc test_kmod0: Sync thread 61 fs: xfs [ 145.703456] misc test_kmod0: General test result: -22 The key here: [ 145.461962] request_module: modprobe fs-xfs cannot be processed, kmod busy with 50 threads for more than 5 seconds now That is not printed when the test iterates and does not fail. That comes from kernel/kmod.c: if (!ret) { pr_warn_ratelimited("request_module: %s cannot be processed, kmod busy with %d threads for more than %d seconds now", module_name, MAX_KMOD_CONCURRENT, MAX_KMOD_ALL_BUSY_TIMEOUT); return -ETIME; } ... ETIME however is 62 as per include/uapi/asm-generic/errno.h. The loss of the value comes from the fact get_fs_type() ignores error types and so lib/test_kmod.c just sets err_ret = -EINVAL in tally_work_test() as it cannot get more heuristics out of the kernel as to why get_fs_type() failed. This should mean that the failure observed with test 0009 on kmod is very likely not due to module compression but just a timing issue, and that compression just increases the probability of having 50 threads busy concurrently on modprobe in 5 seconds with that test. I've confirmed this by running a test with a modified kmod as follows *after* booting into a kernel with no compression: diff --git a/tools/modprobe.c b/tools/modprobe.c index 3b7897c..0b7574d 100644 --- a/tools/modprobe.c +++ b/tools/modprobe.c @@ -1012,6 +1012,8 @@ static int do_modprobe(int argc, char **orig_argv) log_setup_kmod_log(ctx, verbose); + usleep(5000000); + kmod_load_resources(ctx); if (do_show_config) Modules don't tend to be large in size but module compression is an extrapolation of what could happen without compression if we had huge modules often and userspace doing something wild. If you end up with 50 concurrent threads running modprobe for more than 5 seconds the kernel pr_warn_ratelimited() would print though and it surely is a sign userspace is doing something stupid. The sad part though is that a filesystem mount *can* be triggered in these cases and so can fail to boot. The above test were run with next-20230119 without the patch on this thread, and so as noted before this doesn't create a regression, this is a known issue now. And so -- further confirmation I'll move forward with this patch for the next rc. Note kernel/kmod.c in the kernel states: /* * This is a restriction on having *all* MAX_KMOD_CONCURRENT threads * running at the same time without returning. When this happens we * believe you've somehow ended up with a recursive module dependency * creating a loop. * * We have no option but to fail. * * Userspace should proactively try to detect and prevent these. */ #define MAX_KMOD_ALL_BUSY_TIMEOUT 5 I can update the docs to reflect that this can be triggered by the kernel trying to auto-loading DoS or each CPU count triggering tons of unecessary duplicate auto-loading module requests. This is a self kernel inflicted situation. As for the DoS Vegard Nossum did report that user namespaces *could* trigger / abuse kernel module autoloading and that they shouldn't be allowed to do that as they can't load modules directly anyway (finit_module() won't work for them). I'm waiting for a proper patch follow up from him, but *that* in theory could then be another way to trigger this issue other than kmod test 0009, abuse user namespaces so to trigger a module failure by going over board on auto-module loading. Boot likely can't be compromised unless creation of user namespaces is allowed to be exploited early on boot. But post boot kernel module auto-loading could be DoS'd with user namespaces. As for the other case -- each additional CPU causing more module auto-loading than before -- this is a real issue to monitor for and likely can cause odd boot failures. This thread already dealt with cpu-frequency modules as an example of abuse in the kernel for this. Folks are working to fix this though but older kernels will have these issues. Since the kernel is the one that *is* dealing with throttling of this auto-load situation, fixing these cases in-kernel is the right solution. I don't think userspace can do much here as the limit hit is inherent to auto-loading. Perhaps the we should upgrade the pr_warn_ratelimited() to WARN_ONCE()? [0] https://lore.kernel.org/all/Y8HkC1re3Fo46Ne3@bombadil.infradead.org/T/#u Luis
On Sat, Jan 21, 2023 at 2:40 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > There is one bug with compression on debian: > > - gzip compressed modules don't end up in the initramfs > > There is a generic upstream kmod bug: > > - modprobe --show-depends won't grok compressed modules so initramfs > tools that use this as Debian likely are not getting module dependencies > installed in their initramfs Vincenzo *might* be up to tackle the above, let's see! Luis
On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote: >On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: >> On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: >> > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: >> > > Yes, the -EINVAL error is strange. It is returned also in >> > > kernel/module/main.c on few locations. But neither of them >> > > looks like a good candidate. >> > >> > OK I updated to next-20230119 and I don't see the issue now. >> > Odd. It could have been an issue with next-20221207 which I was >> > on before. >> > >> > I'll run some more test and if nothing fails I'll send the fix >> > to Linux for rc5. >> >> Jeesh it just occured to me the difference, which I'll have to >> test next, for next-20221207 I had enabled module compression >> on kdevops with zstd. >> >> You can see the issues on kdevops git log with that... and I finally >> disabled it and the kmod test issue is gone. So it could be that >> but I just am ending my day so will check tomorrow if that was it. >> But if someone else beats me then great. >> >> With kdevops it should be a matter of just enabling zstd as I >> just bumped support for next-20230119 and that has module decompression >> disabled. > >So indeed, my suspcions were correct. There is one bug with >compression on debian: > > - gzip compressed modules don't end up in the initramfs > >There is a generic upstream kmod bug: > > - modprobe --show-depends won't grok compressed modules so initramfs > tools that use this as Debian likely are not getting module dependencies > installed in their initramfs are you sure you have the relevant compression setting enabled in kmod? $ kmod --version kmod version 30 +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL $ modprobe --show-depends ext4 insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/jbd2/jbd2.ko.zst insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/mbcache.ko.zst insmod /lib/modules/6.1.12-1-MANJARO/kernel/lib/crc16.ko.zst insmod /lib/modules/6.1.12-1-MANJARO/kernel/arch/x86/crypto/crc32c-intel.ko.zst insmod /lib/modules/6.1.12-1-MANJARO/kernel/crypto/crc32c_generic.ko.zst insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/ext4/ext4.ko.zst Lucas De Marchi
On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote: > On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote: > > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: > > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: > > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > > > > > Yes, the -EINVAL error is strange. It is returned also in > > > > > kernel/module/main.c on few locations. But neither of them > > > > > looks like a good candidate. > > > > > > > > OK I updated to next-20230119 and I don't see the issue now. > > > > Odd. It could have been an issue with next-20221207 which I was > > > > on before. > > > > > > > > I'll run some more test and if nothing fails I'll send the fix > > > > to Linux for rc5. > > > > > > Jeesh it just occured to me the difference, which I'll have to > > > test next, for next-20221207 I had enabled module compression > > > on kdevops with zstd. > > > > > > You can see the issues on kdevops git log with that... and I finally > > > disabled it and the kmod test issue is gone. So it could be that > > > but I just am ending my day so will check tomorrow if that was it. > > > But if someone else beats me then great. > > > > > > With kdevops it should be a matter of just enabling zstd as I > > > just bumped support for next-20230119 and that has module decompression > > > disabled. > > > > So indeed, my suspcions were correct. There is one bug with > > compression on debian: > > > > - gzip compressed modules don't end up in the initramfs > > > > There is a generic upstream kmod bug: > > > > - modprobe --show-depends won't grok compressed modules so initramfs > > tools that use this as Debian likely are not getting module dependencies > > installed in their initramfs > > are you sure you have the relevant compression setting enabled > in kmod? > > $ kmod --version > kmod version 30 > +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL Debian has: kmod version 30 +ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL > $ modprobe --show-depends ext4 > insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/jbd2/jbd2.ko.zst insmod > /lib/modules/6.1.12-1-MANJARO/kernel/fs/mbcache.ko.zst insmod > /lib/modules/6.1.12-1-MANJARO/kernel/lib/crc16.ko.zst insmod > /lib/modules/6.1.12-1-MANJARO/kernel/arch/x86/crypto/crc32c-intel.ko.zst > insmod /lib/modules/6.1.12-1-MANJARO/kernel/crypto/crc32c_generic.ko.zst > insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/ext4/ext4.ko.zst Perhaps this was related to the above gzip issue in debian then. I'm hoping will have a bit more time than me to verify. Luis
On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote: >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote: >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote: >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: >> > > > > Yes, the -EINVAL error is strange. It is returned also in >> > > > > kernel/module/main.c on few locations. But neither of them >> > > > > looks like a good candidate. >> > > > >> > > > OK I updated to next-20230119 and I don't see the issue now. >> > > > Odd. It could have been an issue with next-20221207 which I was >> > > > on before. >> > > > >> > > > I'll run some more test and if nothing fails I'll send the fix >> > > > to Linux for rc5. >> > > >> > > Jeesh it just occured to me the difference, which I'll have to >> > > test next, for next-20221207 I had enabled module compression >> > > on kdevops with zstd. >> > > >> > > You can see the issues on kdevops git log with that... and I finally >> > > disabled it and the kmod test issue is gone. So it could be that >> > > but I just am ending my day so will check tomorrow if that was it. >> > > But if someone else beats me then great. >> > > >> > > With kdevops it should be a matter of just enabling zstd as I >> > > just bumped support for next-20230119 and that has module decompression >> > > disabled. >> > >> > So indeed, my suspcions were correct. There is one bug with >> > compression on debian: >> > >> > - gzip compressed modules don't end up in the initramfs >> > >> > There is a generic upstream kmod bug: >> > >> > - modprobe --show-depends won't grok compressed modules so initramfs >> > tools that use this as Debian likely are not getting module dependencies >> > installed in their initramfs >> >> are you sure you have the relevant compression setting enabled >> in kmod? >> >> $ kmod --version >> kmod version 30 >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL > >Debian has: > >kmod version 30 >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL ^ so... mind the minus :). It doesn't support zlib. Change your kernel config to either compress the modules as xz or zstd. Lucas De Marchi > >> $ modprobe --show-depends ext4 >> insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/jbd2/jbd2.ko.zst insmod >> /lib/modules/6.1.12-1-MANJARO/kernel/fs/mbcache.ko.zst insmod >> /lib/modules/6.1.12-1-MANJARO/kernel/lib/crc16.ko.zst insmod >> /lib/modules/6.1.12-1-MANJARO/kernel/arch/x86/crypto/crc32c-intel.ko.zst >> insmod /lib/modules/6.1.12-1-MANJARO/kernel/crypto/crc32c_generic.ko.zst >> insmod /lib/modules/6.1.12-1-MANJARO/kernel/fs/ext4/ext4.ko.zst > >Perhaps this was related to the above gzip issue in debian then. > >I'm hoping will have a bit more time than me to verify. > > Luis
On Thu, Mar 23, 2023 at 8:02 AM Lucas De Marchi <lucas.demarchi@intel.com> wrote: > > On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote: > >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote: > >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote: > >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: > >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: > >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > >> > > > > Yes, the -EINVAL error is strange. It is returned also in > >> > > > > kernel/module/main.c on few locations. But neither of them > >> > > > > looks like a good candidate. > >> > > > > >> > > > OK I updated to next-20230119 and I don't see the issue now. > >> > > > Odd. It could have been an issue with next-20221207 which I was > >> > > > on before. > >> > > > > >> > > > I'll run some more test and if nothing fails I'll send the fix > >> > > > to Linux for rc5. > >> > > > >> > > Jeesh it just occured to me the difference, which I'll have to > >> > > test next, for next-20221207 I had enabled module compression > >> > > on kdevops with zstd. > >> > > > >> > > You can see the issues on kdevops git log with that... and I finally > >> > > disabled it and the kmod test issue is gone. So it could be that > >> > > but I just am ending my day so will check tomorrow if that was it. > >> > > But if someone else beats me then great. > >> > > > >> > > With kdevops it should be a matter of just enabling zstd as I > >> > > just bumped support for next-20230119 and that has module decompression > >> > > disabled. > >> > > >> > So indeed, my suspcions were correct. There is one bug with > >> > compression on debian: > >> > > >> > - gzip compressed modules don't end up in the initramfs > >> > > >> > There is a generic upstream kmod bug: > >> > > >> > - modprobe --show-depends won't grok compressed modules so initramfs > >> > tools that use this as Debian likely are not getting module dependencies > >> > installed in their initramfs > >> > >> are you sure you have the relevant compression setting enabled > >> in kmod? > >> > >> $ kmod --version > >> kmod version 30 > >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL > > > >Debian has: > > > >kmod version 30 > >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL > > ^ so... mind the minus :). It doesn't support zlib. > > Change your kernel config to either compress the modules as xz or zstd. Oh so then we should complain about these things if an initramfs is detected with modules compressed using a compression algorithm which modprobe installed does not support. What tool would do that? Luis
On Thu, Mar 23, 2023 at 08:08:49AM -0700, Luis Chamberlain wrote: >On Thu, Mar 23, 2023 at 8:02 AM Lucas De Marchi ><lucas.demarchi@intel.com> wrote: >> >> On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote: >> >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote: >> >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote: >> >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: >> >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: >> >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: >> >> > > > > Yes, the -EINVAL error is strange. It is returned also in >> >> > > > > kernel/module/main.c on few locations. But neither of them >> >> > > > > looks like a good candidate. >> >> > > > >> >> > > > OK I updated to next-20230119 and I don't see the issue now. >> >> > > > Odd. It could have been an issue with next-20221207 which I was >> >> > > > on before. >> >> > > > >> >> > > > I'll run some more test and if nothing fails I'll send the fix >> >> > > > to Linux for rc5. >> >> > > >> >> > > Jeesh it just occured to me the difference, which I'll have to >> >> > > test next, for next-20221207 I had enabled module compression >> >> > > on kdevops with zstd. >> >> > > >> >> > > You can see the issues on kdevops git log with that... and I finally >> >> > > disabled it and the kmod test issue is gone. So it could be that >> >> > > but I just am ending my day so will check tomorrow if that was it. >> >> > > But if someone else beats me then great. >> >> > > >> >> > > With kdevops it should be a matter of just enabling zstd as I >> >> > > just bumped support for next-20230119 and that has module decompression >> >> > > disabled. >> >> > >> >> > So indeed, my suspcions were correct. There is one bug with >> >> > compression on debian: >> >> > >> >> > - gzip compressed modules don't end up in the initramfs >> >> > >> >> > There is a generic upstream kmod bug: >> >> > >> >> > - modprobe --show-depends won't grok compressed modules so initramfs >> >> > tools that use this as Debian likely are not getting module dependencies >> >> > installed in their initramfs >> >> >> >> are you sure you have the relevant compression setting enabled >> >> in kmod? >> >> >> >> $ kmod --version >> >> kmod version 30 >> >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL >> > >> >Debian has: >> > >> >kmod version 30 >> >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL >> >> ^ so... mind the minus :). It doesn't support zlib. >> >> Change your kernel config to either compress the modules as xz or zstd. > >Oh so then we should complain about these things if an initramfs is >detected with modules compressed using a compression algorithm which >modprobe installed does not support. What tool would do that? I guess we could add that in depmod side as a dummy handler for when that config is off. Thoughts? Lucas De Marchi > > Luis
On Thu, Mar 23, 2023 at 11:03:21PM -0700, Lucas De Marchi wrote: > On Thu, Mar 23, 2023 at 08:08:49AM -0700, Luis Chamberlain wrote: > > On Thu, Mar 23, 2023 at 8:02 AM Lucas De Marchi > > <lucas.demarchi@intel.com> wrote: > > > > > > On Wed, Mar 22, 2023 at 03:31:59PM -0700, Luis Chamberlain wrote: > > > >On Sat, Mar 11, 2023 at 10:25:05PM -0800, Lucas De Marchi wrote: > > > >> On Sat, Jan 21, 2023 at 02:40:20PM -0800, Luis Chamberlain wrote: > > > >> > On Thu, Jan 19, 2023 at 04:58:53PM -0800, Luis Chamberlain wrote: > > > >> > > On Thu, Jan 19, 2023 at 04:51:27PM -0800, Luis Chamberlain wrote: > > > >> > > > On Thu, Jan 19, 2023 at 04:47:05PM +0100, Petr Mladek wrote: > > > >> > > > > Yes, the -EINVAL error is strange. It is returned also in > > > >> > > > > kernel/module/main.c on few locations. But neither of them > > > >> > > > > looks like a good candidate. > > > >> > > > > > > >> > > > OK I updated to next-20230119 and I don't see the issue now. > > > >> > > > Odd. It could have been an issue with next-20221207 which I was > > > >> > > > on before. > > > >> > > > > > > >> > > > I'll run some more test and if nothing fails I'll send the fix > > > >> > > > to Linux for rc5. > > > >> > > > > > >> > > Jeesh it just occured to me the difference, which I'll have to > > > >> > > test next, for next-20221207 I had enabled module compression > > > >> > > on kdevops with zstd. > > > >> > > > > > >> > > You can see the issues on kdevops git log with that... and I finally > > > >> > > disabled it and the kmod test issue is gone. So it could be that > > > >> > > but I just am ending my day so will check tomorrow if that was it. > > > >> > > But if someone else beats me then great. > > > >> > > > > > >> > > With kdevops it should be a matter of just enabling zstd as I > > > >> > > just bumped support for next-20230119 and that has module decompression > > > >> > > disabled. > > > >> > > > > >> > So indeed, my suspcions were correct. There is one bug with > > > >> > compression on debian: > > > >> > > > > >> > - gzip compressed modules don't end up in the initramfs > > > >> > > > > >> > There is a generic upstream kmod bug: > > > >> > > > > >> > - modprobe --show-depends won't grok compressed modules so initramfs > > > >> > tools that use this as Debian likely are not getting module dependencies > > > >> > installed in their initramfs > > > >> > > > >> are you sure you have the relevant compression setting enabled > > > >> in kmod? > > > >> > > > >> $ kmod --version > > > >> kmod version 30 > > > >> +ZSTD +XZ +ZLIB +LIBCRYPTO -EXPERIMENTAL > > > > > > > >Debian has: > > > > > > > >kmod version 30 > > > >+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL > > > > > > ^ so... mind the minus :). It doesn't support zlib. > > > > > > Change your kernel config to either compress the modules as xz or zstd. > > > > Oh so then we should complain about these things if an initramfs is > > detected with modules compressed using a compression algorithm which > > modprobe installed does not support. What tool would do that? > > I guess we could add that in depmod side as a dummy handler for when > that config is off. Thoughts? That sounds like a good solution, better than and complain before allowing someone to boot and *not* be able to. Luis
diff --git a/kernel/module/main.c b/kernel/module/main.c index d02d39c7174e..7a627345d4fd 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) sched_annotate_sleep(); mutex_lock(&module_mutex); mod = find_module_all(name, strlen(name), true); - ret = !mod || mod->state == MODULE_STATE_LIVE; + ret = !mod || mod->state == MODULE_STATE_LIVE + || mod->state == MODULE_STATE_GOING; mutex_unlock(&module_mutex); return ret; @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod) mod->state = MODULE_STATE_UNFORMED; -again: mutex_lock(&module_mutex); old = find_module_all(mod->name, strlen(mod->name), true); if (old != NULL) { - if (old->state != MODULE_STATE_LIVE) { + if (old->state == MODULE_STATE_COMING + || old->state == MODULE_STATE_UNFORMED) { /* Wait in case it fails to load. */ mutex_unlock(&module_mutex); err = wait_event_interruptible(module_wq, finished_loading(mod->name)); if (err) goto out_unlocked; - goto again; + + /* The module might have gone in the meantime. */ + mutex_lock(&module_mutex); + old = find_module_all(mod->name, strlen(mod->name), + true); } - err = -EEXIST; + + /* + * We are here only when the same module was being loaded. Do + * not try to load it again right now. It prevents long delays + * caused by serialized module load failures. It might happen + * when more devices of the same type trigger load of + * a particular module. + */ + if (old && old->state == MODULE_STATE_LIVE) + err = -EEXIST; + else + err = -EBUSY; goto out; } mod_update_bounds(mod);