diff mbox series

module: add debugging auto-load duplicate module support

Message ID 20230418204636.791699-1-mcgrof@kernel.org (mailing list archive)
State New
Headers show
Series module: add debugging auto-load duplicate module support | expand

Commit Message

Luis Chamberlain April 18, 2023, 8:46 p.m. UTC
This adds debugging support to the kernel module auto-loader to
easily detect and deal with duplicate module requests. To aid
with possible bootup failure issues it will supress the waste
in virtual memory when races happen before userspace loads a
module and the kernel is still issuing requests for the same
module.

Folks debugging virtual memory abuse on bootup can and should
enable this to see what WARN()s come on, to see if module
auto-loading is to blame for their woes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---

Changes on this patch since the last RFC:

  o dropped the kernel_read*() patch from this series moving to
    punt the issues as a udev issue now that we have proof auto-loading
    is not the issue
  o some spell checks

 kernel/module/Kconfig    |  40 +++++++
 kernel/module/Makefile   |   1 +
 kernel/module/dups.c     | 234 +++++++++++++++++++++++++++++++++++++++
 kernel/module/internal.h |  15 +++
 kernel/module/kmod.c     |  23 +++-
 5 files changed, 309 insertions(+), 4 deletions(-)
 create mode 100644 kernel/module/dups.c

Comments

Greg KH April 19, 2023, 7:15 a.m. UTC | #1
On Tue, Apr 18, 2023 at 01:46:36PM -0700, Luis Chamberlain wrote:
> This adds debugging support to the kernel module auto-loader to
> easily detect and deal with duplicate module requests. To aid
> with possible bootup failure issues it will supress the waste
> in virtual memory when races happen before userspace loads a
> module and the kernel is still issuing requests for the same
> module.
> 
> Folks debugging virtual memory abuse on bootup can and should
> enable this to see what WARN()s come on, to see if module
> auto-loading is to blame for their woes.

You get 72 columns for changelog text, so you can use it :)

> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> 
> Changes on this patch since the last RFC:
> 
>   o dropped the kernel_read*() patch from this series moving to
>     punt the issues as a udev issue now that we have proof auto-loading
>     is not the issue
>   o some spell checks
> 
>  kernel/module/Kconfig    |  40 +++++++
>  kernel/module/Makefile   |   1 +
>  kernel/module/dups.c     | 234 +++++++++++++++++++++++++++++++++++++++
>  kernel/module/internal.h |  15 +++
>  kernel/module/kmod.c     |  23 +++-
>  5 files changed, 309 insertions(+), 4 deletions(-)
>  create mode 100644 kernel/module/dups.c
> 
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index e6df183e2c80..cc146ef4a6ac 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -59,6 +59,46 @@ config MODULE_STATS
>  
>  	  If unsure, say N.
>  
> +config MODULE_AUTOLOAD_SUPRESS_DUPS

MODULE_DEBUG_DUPLICATE perhaps?  It has nothing to do with autoloading
(other than that is what userspace is doing) and you aren't suppressing
anything except throwing up warnings, right?

> +	bool "Debug duplicate modules with auto-loading"
> +	help
> +	  Module autoloading allows in-kernel code to request modules through
> +	  the *request_module*() API calls. This in turn just calls userspace
> +	  modprobe. Although modprobe checks to see if a module is already
> +	  loaded before trying to load a module there is a small time window in
> +	  which multiple duplicate requests can end up in userspace and multiple
> +	  modprobe calls race calling finit_module() around the same time for
> +	  duplicate modules. The finit_module() system call can consume in the
> +	  worst case more than twice the respective module size in virtual
> +	  memory for each duplicate module requests. Although duplicate module
> +	  requests are non-fatal virtual memory is a limited resource and each
> +	  duplicate module request ends up just wasting virtual memory.

It's not "wasted", as it is returned when the module is determined to be
a duplicate.  Otherwise everyone will want this enabled as they think it
will actually save memory.

> +	  This debugging facility will create WARN() splats for duplicate module
> +	  requests to help identify if module auto-loading is the culprit to your
> +	  woes. Since virtual memory abuse caused by duplicate module requests
> +	  could render a system unusable this functionality will also suppresses
> +	  the waste in virtual memory caused by duplicate requests by sharing
> +	  races in requests for the same module to a single unified request.
> +	  Once a non-wait request_module() call completes a module should be
> +	  loaded and modprobe should simply not allow new finit_module() calls.
> +
> +	  Enable this functionality to try to debug virtual memory abuse during
> +	  boot on systems and identify if the abuse was due to module
> +	  auto-loading.
> +
> +	  If the first module request used request_module_nowait() we cannot
> +	  use that as the anchor to wait for duplicate module requests, since
> +	  users of request_module() do want a proper return value. If a call
> +	  for the same module happened earlier with request_module() though,
> +	  then a duplicate request_module_nowait() would be detected.
> +
> +	  You want to enable this if you want to debug and see if duplicate
> +	  module auto-loading might be causing virtual memory abuse during
> +	  bootup. A kernel trace will be provided for each duplicate request.
> +
> +	  Disable this if you are on production.

"on production"?  That does not translate well, how about:
	Only enable this for debugging system functionality, never have
	it enabled on real systems.
or something like that?


> +
>  endif # MODULE_DEBUG
>  
>  config MODULE_FORCE_LOAD
> diff --git a/kernel/module/Makefile b/kernel/module/Makefile
> index 52340bce497e..e8b121ac39cf 100644
> --- a/kernel/module/Makefile
> +++ b/kernel/module/Makefile
> @@ -10,6 +10,7 @@ KCOV_INSTRUMENT_module.o := n
>  obj-y += main.o
>  obj-y += strict_rwx.o
>  obj-y += kmod.o
> +obj-$(CONFIG_MODULE_AUTOLOAD_SUPRESS_DUPS) += dups.o
>  obj-$(CONFIG_MODULE_DECOMPRESS) += decompress.o
>  obj-$(CONFIG_MODULE_SIG) += signing.o
>  obj-$(CONFIG_LIVEPATCH) += livepatch.o
> diff --git a/kernel/module/dups.c b/kernel/module/dups.c
> new file mode 100644
> index 000000000000..903ab7c7e8f4
> --- /dev/null
> +++ b/kernel/module/dups.c
> @@ -0,0 +1,234 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * kmod dups - the kernel module autoloader duplicate suppressor
> + *
> + * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task.h>
> +#include <linux/binfmts.h>
> +#include <linux/syscalls.h>
> +#include <linux/unistd.h>
> +#include <linux/kmod.h>
> +#include <linux/slab.h>
> +#include <linux/completion.h>
> +#include <linux/cred.h>
> +#include <linux/file.h>
> +#include <linux/fdtable.h>
> +#include <linux/workqueue.h>
> +#include <linux/security.h>
> +#include <linux/mount.h>
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/resource.h>
> +#include <linux/notifier.h>
> +#include <linux/suspend.h>
> +#include <linux/rwsem.h>
> +#include <linux/ptrace.h>
> +#include <linux/async.h>
> +#include <linux/uaccess.h>
> +
> +DEFINE_MUTEX(kmod_dup_mutex);

static?

> +static LIST_HEAD(dup_kmod_reqs);
> +
> +struct kmod_dup_req {
> +	struct list_head list;
> +	char name[MODULE_NAME_LEN];
> +	struct completion first_req_done;
> +	struct work_struct complete_work;
> +	struct delayed_work delete_work;
> +	int dup_ret;
> +};
> +
> +static struct kmod_dup_req *kmod_dup_request_lookup(char *module_name)

Lock requirements?  You should document that here.

> +{
> +	struct kmod_dup_req *kmod_req;
> +
> +	list_for_each_entry_rcu(kmod_req, &dup_kmod_reqs, list,
> +				lockdep_is_held(&kmod_dup_mutex)) {
> +		if (strlen(kmod_req->name) == strlen(module_name) &&
> +		    !memcmp(kmod_req->name, module_name, strlen(module_name))) {
> +			return kmod_req;
> +                }
> +        }
> +
> +	return NULL;
> +}
> +
> +static void kmod_dup_request_delete(struct work_struct *work)
> +{
> +	struct kmod_dup_req *kmod_req;
> +	kmod_req = container_of(to_delayed_work(work), struct kmod_dup_req, delete_work);
> +
> +	/*
> +	 * The typical situation is a module successully loaded. In that
> +	 * situation the module will be present already in userspace. If
> +	 * new requests come in after that, userspace will already know the
> +	 * module is loaded so will just return 0 right away. There is still
> +	 * a small chance right after we delete this entry new request_module()
> +	 * calls may happen after that, they can happen. These heuristics
> +	 * are to protect finit_module() abuse for auto-loading, if modules
> +	 * are still tryign to auto-load even if a module is already loaded,
> +	 * that's on them, and those inneficiencies should not be fixed by
> +	 * kmod. The inneficies there are a call to modprobe and modprobe
> +	 * just returning 0.
> +	 */
> +	mutex_lock(&kmod_dup_mutex);
> +	list_del_rcu(&kmod_req->list);
> +	synchronize_rcu();
> +	mutex_unlock(&kmod_dup_mutex);
> +	kfree(kmod_req);
> +}
> +
> +static void kmod_dup_request_complete(struct work_struct *work)
> +{
> +	struct kmod_dup_req *kmod_req;
> +
> +	kmod_req = container_of(work, struct kmod_dup_req, complete_work);
> +
> +	/*
> +	 * This will ensure that the kernel will let all the waiters get
> +	 * informed its time to check the return value. It's time to
> +	 * go home.
> +	 */
> +	complete_all(&kmod_req->first_req_done);
> +
> +	/*
> +	 * Now that we have allowed prior request_module() calls to go on
> +	 * with life, let's schedule deleting this entry. We don't have
> +	 * to do it right away, but we *eventually* want to do it so to not
> +	 * let this linger forever as this is just a boot optimization for
> +	 * possible abuses of vmalloc() incurred by finit_module() thrashing.
> +	 */
> +	queue_delayed_work(system_wq, &kmod_req->delete_work, 60 * HZ);
> +}
> +
> +bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret)
> +{
> +	struct kmod_dup_req *kmod_req, *new_kmod_req;
> +	int ret;
> +
> +	/*
> +	 * Pre-allocate the entry in case we have to use it later
> +	 * to avoid contention with the mutex.
> +	 */
> +	new_kmod_req = kzalloc(sizeof(*new_kmod_req), GFP_KERNEL);
> +	if (!new_kmod_req)
> +		return false;
> +
> +	memcpy(new_kmod_req->name, module_name, strlen(module_name));
> +	INIT_WORK(&new_kmod_req->complete_work, kmod_dup_request_complete);
> +	INIT_DELAYED_WORK(&new_kmod_req->delete_work, kmod_dup_request_delete);
> +	init_completion(&new_kmod_req->first_req_done);
> +
> +	mutex_lock(&kmod_dup_mutex);
> +
> +	kmod_req = kmod_dup_request_lookup(module_name);
> +	if (!kmod_req) {
> +		/*
> +		 * If the first request that came through for a module
> +		 * was with request_module_nowait() we cannot wait for it
> +		 * and share its return value with other users which may
> +		 * have used request_module() and need a proper return value
> +		 * so just skip using them as an anchor.
> +		 *
> +		 * If a prior request to this one came through with
> +		 * request_module() though, then a request_module_nowait()
> +		 * would benefit from duplicate detection.
> +		 */
> +		if (!wait) {
> +			kfree(new_kmod_req);
> +			pr_warn("New request_module_nowait() for %s -- cannot track duplicates for this request\n", module_name);

pr_dbg() as this is debugging?

And did you set your pr_fmt value to make it obvious where these
messages are coming from?

> +			mutex_unlock(&kmod_dup_mutex);
> +			return false;
> +		}
> +
> +		/*
> +		 * There was no duplicate, just add the request so we can
> +		 * keep tab on duplicates later.
> +		 */
> +		pr_info("New request_module() for %s\n", module_name);

Why not pr_dbg()?

> +		list_add_rcu(&new_kmod_req->list, &dup_kmod_reqs);
> +		mutex_unlock(&kmod_dup_mutex);
> +		return false;
> +	}
> +	mutex_unlock(&kmod_dup_mutex);
> +
> +	/* We are dealing with a duplicate request now */
> +

No blank line needed.

> +	kfree(new_kmod_req);
> +
> +	/*
> +	 * To fix these try to use try_then_request_module() instead as that
> +	 * will check if the component you are looking for is present or not.
> +	 * You could also just queue a single request to load the module once,
> +	 * instead of having each and everything you need try to request for
> +	 * the module.
> +	 *
> +	 * Duplicate request_module() calls  can cause quite a bit of wasted
> +	 * vmalloc() space when racing with userspace.
> +	 */
> +	WARN(1, "module-autoload: duplicate request for module %s\n", module_name);

Remember, this will trigger syzbot and any system that has
panic-on-warn, so be prepared for that mess.  Especially the syzbot
reports, that's going to cause lots of issues for you if you don't tell
them to always disable this option.

thanks,

greg k-h
Luis Chamberlain April 19, 2023, 11:32 p.m. UTC | #2
On Wed, Apr 19, 2023 at 09:15:11AM +0200, Greg KH wrote:
> On Tue, Apr 18, 2023 at 01:46:36PM -0700, Luis Chamberlain wrote:
> You get 72 columns for changelog text, so you can use it :)

Sure! I forget what the limit is, but now I won't forget, new vimrc
settings:

set textwidth=100
autocmd FileType gitcommit set textwidth=72 
set colorcolumn=+1

> > +config MODULE_AUTOLOAD_SUPRESS_DUPS
> 
> MODULE_DEBUG_DUPLICATE perhaps?  It has nothing to do with autoloading
> (other than that is what userspace is doing)

I refer to module auto-loading as the kernel's use of the *request_module()
APIs. This code is used by the module auto-loading request_module() API
callers, prior to us even dealing with userspace.

> and you aren't suppressing anything except throwing up warnings, right?

Actually the code does converge duplicate auto-loading requests into one, but'll
just rename to MODULE_DEBUG_AUTOLOAD_DUPS.

> > +	bool "Debug duplicate modules with auto-loading"
> > +	help
> > +	  Module autoloading allows in-kernel code to request modules through
> > +	  the *request_module*() API calls. This in turn just calls userspace
> > +	  modprobe. Although modprobe checks to see if a module is already
> > +	  loaded before trying to load a module there is a small time window in
> > +	  which multiple duplicate requests can end up in userspace and multiple
> > +	  modprobe calls race calling finit_module() around the same time for
> > +	  duplicate modules. The finit_module() system call can consume in the
> > +	  worst case more than twice the respective module size in virtual
> > +	  memory for each duplicate module requests. Although duplicate module
> > +	  requests are non-fatal virtual memory is a limited resource and each
> > +	  duplicate module request ends up just wasting virtual memory.
> 
> It's not "wasted", as it is returned when the module is determined to be
> a duplicate.  Otherwise everyone will want this enabled as they think it
> will actually save memory.

I'll change the language to be clear the issue is memory pressure early
on boot. I'll also add a bit of language to help at least guide people
to realize that the real value-add for this, ie, I'll have to mention we
suspect issue is udev and not module auto-loading and that this however
may still help find a few cases we can optimize for.

  Luis
Greg KH April 20, 2023, 5:32 a.m. UTC | #3
On Wed, Apr 19, 2023 at 04:32:30PM -0700, Luis Chamberlain wrote:
> > It's not "wasted", as it is returned when the module is determined to be
> > a duplicate.  Otherwise everyone will want this enabled as they think it
> > will actually save memory.
> 
> I'll change the language to be clear the issue is memory pressure early
> on boot. I'll also add a bit of language to help at least guide people
> to realize that the real value-add for this, ie, I'll have to mention we
> suspect issue is udev and not module auto-loading and that this however
> may still help find a few cases we can optimize for.

This isn't udev's "problem", all it is doing is what the kernel asked it
to do.  The kernel said "Here's a new device I found, load a module for
it please!"

And it's the kmod code here, not udev itself doing all of this.  Why not
just rate-limit it in userspace if your system can't handle 10's of
thousands of kmod calls all at once?  I think many s390 systems did this
decades ago when they were controlling 10's of thousands of scsi devices
and were hit with "device detection storms" at boot like this.

If you really think it's the kernel's fault, just slow down the kernel's
device detection code for the specific bus that is having problems.
We've worked hard to make the kernel boot really fast and device
detection happen in parallel, maybe that wasn't a good idea for some
systems and so they need to boot slower.  If so, then just turn off the
parallel probing for the offending bus type.

What specific devices and bus types are the problem here for these systems?

thanks,

greg k-h
Luis Chamberlain April 20, 2023, 9:03 p.m. UTC | #4
On Thu, Apr 20, 2023 at 07:32:10AM +0200, Greg KH wrote:
> On Wed, Apr 19, 2023 at 04:32:30PM -0700, Luis Chamberlain wrote:
> > > It's not "wasted", as it is returned when the module is determined to be
> > > a duplicate.  Otherwise everyone will want this enabled as they think it
> > > will actually save memory.
> > 
> > I'll change the language to be clear the issue is memory pressure early
> > on boot. I'll also add a bit of language to help at least guide people
> > to realize that the real value-add for this, ie, I'll have to mention we
> > suspect issue is udev and not module auto-loading and that this however
> > may still help find a few cases we can optimize for.
> 
> This isn't udev's "problem", all it is doing is what the kernel asked it
> to do.  The kernel said "Here's a new device I found, load a module for
> it please!"

If you believe that then the issue is still a kernel issue, and the
second part to that sentence "load a module for it" was done without
consideration of the implications, or without optimizations in mind.
Given the implications were perhaps not well understood it is unfair
for us to be hard on ourselves on that. But now we know, ideally if we
could we *should* only issue a request for a module *once* during boot.

Where does the kernel actually say "load a module"?

Isn't that just an implied gesture?

> And it's the kmod code here, not udev itself doing all of this.

Yes, IMHO kmod could and *should* be enhanced to share a loading context
during boot so to avoid duplicates too and then udev would have to
embrace such functionality. That's going to take time to propagate, as
you can imagine.

> Why not
> just rate-limit it in userspace if your system can't handle 10's of
> thousands of kmod calls all at once? I think many s390 systems did this
> decades ago when they were controlling 10's of thousands of scsi devices
> and were hit with "device detection storms" at boot like this.

Boot is a special context and in this particular case I agree userspace
kmod could/should be extended to avoid duplicate module requests in that
context. But likewise the kernel should only have to try to issue a
request for a single module once, if it could easily do that.

This does beg the question, why force userspace to rate limit if we
can do better in the kernel? Specially if *we're the ones*, as you say,
that are hinting to userspace to shoot back loading modules for us and we
know we're just going to drop duplicates?

We discussed the possibility to rate limit finit_module() in-kernel [0], but
that was early in this process as I investigated the root cause of the
duplicates. I didn't feel comfortable with such an approach then but now
I have a clearer rationale as to why I felt so uncomfortable with it.
Rate limiting can be a a bandaid, a ball park guess and if one can
understand the issue a bit better, you may not need to to rate limit.

My kludge-of-concept RFC patch which takes the concepts in this patch but
applies them to kread [1] is an attempt for us to see what that could look
like -- and it goes so far as to survive easily the otherwise pathological
stress test of say (assuming you have all deps loaded and don't have XFS
loaded for an existing mount already):

stress-ng --module 1 --module-name xfs

Given Dave's original reported issue should be fixed now by the patch
"module: avoid allocation if module is already present and ready", there
really is no rush in keeping a kludges around but this does be the
question to analyze this case a bit more carefully with the long term
in mind.

[0] https://lkml.kernel.org/r/CAHk-=wjsBkyD+y_zfif1MXWYEkvPt+TPLdVjP41573oKEOu4qA@mail.gmail.com
[1] https://lore.kernel.org/all/ZDmAvwi+KNvie+OI@bombadil.infradead.org/T/#md172510af8fdf7e0f76f6caafee9c99f7a8b6de7

> If you really think it's the kernel's fault, just slow down the kernel's
> device detection code for the specific bus that is having problems.
> We've worked hard to make the kernel boot really fast and device
> detection happen in parallel, maybe that wasn't a good idea for some
> systems and so they need to boot slower.  If so, then just turn off the
> parallel probing for the offending bus type.

I think it is counter productive to do that, it's not like we have not
identified the scaling issue or can't solve this.

> What specific devices and bus types are the problem here for these systems?

My best assessment of the situation is that each CPU in udev ends up triggering
a load of duplicate set of modules, not just one, but *a lot*. Not sure
what heuristics udev uses to load a set of modules per CPU.

Given the slopes of the graphs I have shown on how this scales per CPU [2]
we can project at what point in time this will again likely be an issue if
we don't avoid the duplicates in userspace through kmod or address dealing
with them as in my kludge-of-concept on kreads. Given the secondary vmalloc
is dealt with the patch "module: avoid allocation if module is already present
and ready" that leaves us only to deal with virtual memory pressure from the
kread. In-kernel we may be able to also see if we can use seq reads for the
initial kread and decompression. That remains to be seen, but it would
also be a way for this not cause possible boot failures.

[2] https://lore.kernel.org/all/ZDmAvwi+KNvie+OI@bombadil.infradead.org/T/#mf7f8412c3486f59a3e4cf02b65bf0c4f9ba60ef0

  Luis
Greg KH April 21, 2023, 3:12 p.m. UTC | #5
On Thu, Apr 20, 2023 at 02:03:32PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 20, 2023 at 07:32:10AM +0200, Greg KH wrote:
> > On Wed, Apr 19, 2023 at 04:32:30PM -0700, Luis Chamberlain wrote:
> > > > It's not "wasted", as it is returned when the module is determined to be
> > > > a duplicate.  Otherwise everyone will want this enabled as they think it
> > > > will actually save memory.
> > > 
> > > I'll change the language to be clear the issue is memory pressure early
> > > on boot. I'll also add a bit of language to help at least guide people
> > > to realize that the real value-add for this, ie, I'll have to mention we
> > > suspect issue is udev and not module auto-loading and that this however
> > > may still help find a few cases we can optimize for.
> > 
> > This isn't udev's "problem", all it is doing is what the kernel asked it
> > to do.  The kernel said "Here's a new device I found, load a module for
> > it please!"
> 
> If you believe that then the issue is still a kernel issue, and the
> second part to that sentence "load a module for it" was done without
> consideration of the implications, or without optimizations in mind.
> Given the implications were perhaps not well understood it is unfair
> for us to be hard on ourselves on that. But now we know, ideally if we
> could we *should* only issue a request for a module *once* during boot.

But there is no mapping between devices and modules other than what is
exported in the module info and that is up to userspace to handle.

> Where does the kernel actually say "load a module"?

The driver core says "hey a new device is now present!"

Userspace takes that message and calls kmod with the device information
which then determines what module to load by looking at the device
aliases.

> Isn't that just an implied gesture?

Yes.

> > And it's the kmod code here, not udev itself doing all of this.
> 
> Yes, IMHO kmod could and *should* be enhanced to share a loading context
> during boot so to avoid duplicates too and then udev would have to
> embrace such functionality. That's going to take time to propagate, as
> you can imagine.

udev is just the transport to kmod here, it's not in the job of
filtering duplicate messages.

> > Why not
> > just rate-limit it in userspace if your system can't handle 10's of
> > thousands of kmod calls all at once? I think many s390 systems did this
> > decades ago when they were controlling 10's of thousands of scsi devices
> > and were hit with "device detection storms" at boot like this.
> 
> Boot is a special context and in this particular case I agree userspace
> kmod could/should be extended to avoid duplicate module requests in that
> context. But likewise the kernel should only have to try to issue a
> request for a single module once, if it could easily do that.

Are you sure that this is happening at boot in a way that userspace
didn't just trigger it on its own after init started up?  That happens
as a "coldboot" walk of the device tree and all uevent are regenerated.
That is userspace asking for this, so there's nothing that the kernel
can do.

> This does beg the question, why force userspace to rate limit if we
> can do better in the kernel? Specially if *we're the ones*, as you say,
> that are hinting to userspace to shoot back loading modules for us and we
> know we're just going to drop duplicates?

Maybe error out of duplicate module loading earlier?  I don't know,
sorry.

> > What specific devices and bus types are the problem here for these systems?
> 
> My best assessment of the situation is that each CPU in udev ends up triggering
> a load of duplicate set of modules, not just one, but *a lot*. Not sure
> what heuristics udev uses to load a set of modules per CPU.

Again, finding which device and bus is causing the problem is going to
be key here to try to solve the issue.  Are you logging duplicate module
loads by name as well?

thanks,

greg k-h
Lucas De Marchi April 21, 2023, 4:42 p.m. UTC | #6
On Fri, Apr 21, 2023 at 05:12:51PM +0200, Greg KH wrote:
>On Thu, Apr 20, 2023 at 02:03:32PM -0700, Luis Chamberlain wrote:
>> On Thu, Apr 20, 2023 at 07:32:10AM +0200, Greg KH wrote:
>> > On Wed, Apr 19, 2023 at 04:32:30PM -0700, Luis Chamberlain wrote:
>> > > > It's not "wasted", as it is returned when the module is determined to be
>> > > > a duplicate.  Otherwise everyone will want this enabled as they think it
>> > > > will actually save memory.
>> > >
>> > > I'll change the language to be clear the issue is memory pressure early
>> > > on boot. I'll also add a bit of language to help at least guide people
>> > > to realize that the real value-add for this, ie, I'll have to mention we
>> > > suspect issue is udev and not module auto-loading and that this however
>> > > may still help find a few cases we can optimize for.
>> >
>> > This isn't udev's "problem", all it is doing is what the kernel asked it
>> > to do.  The kernel said "Here's a new device I found, load a module for
>> > it please!"
>>
>> If you believe that then the issue is still a kernel issue, and the
>> second part to that sentence "load a module for it" was done without
>> consideration of the implications, or without optimizations in mind.
>> Given the implications were perhaps not well understood it is unfair
>> for us to be hard on ourselves on that. But now we know, ideally if we
>> could we *should* only issue a request for a module *once* during boot.
>
>But there is no mapping between devices and modules other than what is
>exported in the module info and that is up to userspace to handle.
>
>> Where does the kernel actually say "load a module"?
>
>The driver core says "hey a new device is now present!"
>
>Userspace takes that message and calls kmod with the device information
>which then determines what module to load by looking at the device
>aliases.
>
>> Isn't that just an implied gesture?
>
>Yes.
>
>> > And it's the kmod code here, not udev itself doing all of this.
>>
>> Yes, IMHO kmod could and *should* be enhanced to share a loading context
>> during boot so to avoid duplicates too and then udev would have to
>> embrace such functionality. That's going to take time to propagate, as
>> you can imagine.
>
>udev is just the transport to kmod here, it's not in the job of
>filtering duplicate messages.

udev nowadays use *lib*kmod. It's udev who has the
context it can operate on.

Also, those module loads will not use the path this patch is changing
call_modprobe is not the path that triggers udev to load modules.
/me confused

What can be done from userspace in the udev path

1) udev to do the ratelimit'ing. Define a time window,
filter out uevents in systemd/src/udev/udev-builtin-kmod.c

2) libkmod to do the ratelimit'ing with a similar approach, but udev
needs to tell libkmod what is the window it wants to use

3) libkmod to act on the context it has from the *kernel*. It used
to be cheap with the call simply blocking early on the syscall in
a mutex... or we didn't have that many calls. So libkmod
simply calls [f]init_module() again regardless of the module's
state being in a "coming" state.  Is this the case here? I haven't
seen this data. This is done to avoid a) the toctou implied and b) to
provide the correct return for that call - libkmod can't know if the
previous call will succeed or fail.

libkmod only skips the call if the module is already in
the live state. It seems systemd-udev also duplicates the check
in src/shared/module-util.c:module_load_and_warn()

Note that libkmod already spares loading the module multiple times from
disk as it uses a memory pool for the modules. It reuses one iff it
comes from the same context (i.e. it's only udev involved and not a
bunch of parallel calls to modprobe).

4) If all the calls are coming from the same context and it is udev...
I'm not sure this is actually the problem - the udev's kmod builtin
handler is single-threaded and will handle one request at a time.
I don't see any data to confirm it's coming from a single source or
multiple sources. Could you get a trace containing [f]init_module and
the trace_module_request(), together with a verbose udev log?

If this is all coming from a synthetic use case with thousands of
modprobe execs, I'm not sure there is much to do on the userspace side.

>
>> > Why not
>> > just rate-limit it in userspace if your system can't handle 10's of
>> > thousands of kmod calls all at once? I think many s390 systems did this
>> > decades ago when they were controlling 10's of thousands of scsi devices
>> > and were hit with "device detection storms" at boot like this.
>>
>> Boot is a special context and in this particular case I agree userspace
>> kmod could/should be extended to avoid duplicate module requests in that

see above

>> context. But likewise the kernel should only have to try to issue a
>> request for a single module once, if it could easily do that.
>
>Are you sure that this is happening at boot in a way that userspace
>didn't just trigger it on its own after init started up?  That happens
>as a "coldboot" walk of the device tree and all uevent are regenerated.
>That is userspace asking for this, so there's nothing that the kernel
>can do.
>
>> This does beg the question, why force userspace to rate limit if we
>> can do better in the kernel? Specially if *we're the ones*, as you say,
>> that are hinting to userspace to shoot back loading modules for us and we
>> know we're just going to drop duplicates?
>
>Maybe error out of duplicate module loading earlier?  I don't know,
>sorry.

I still don't see what's the source of the problem from the data
available. Is the kernel issuing multiple request_module()? Or is the
kernel sending multiple udev event for userspace to map the alias to the
module and load it? The mapping alias -> module currently belongs in
userspace so if you are de-duplicating, it can't be only on the module
name.

>
>> > What specific devices and bus types are the problem here for these systems?
>>
>> My best assessment of the situation is that each CPU in udev ends up triggering
>> a load of duplicate set of modules, not just one, but *a lot*. Not sure
>> what heuristics udev uses to load a set of modules per CPU.
>
>Again, finding which device and bus is causing the problem is going to
>be key here to try to solve the issue.  Are you logging duplicate module

agreed.

If the info I requested above is available on other threads, could you
point me to those?

thanks
Lucas De Marchi

>loads by name as well?
>
>thanks,
>
>greg k-h
Luis Chamberlain April 21, 2023, 5:38 p.m. UTC | #7
On Fri, Apr 21, 2023 at 09:42:39AM -0700, Lucas De Marchi wrote:
> On Fri, Apr 21, 2023 at 05:12:51PM +0200, Greg KH wrote:
> > On Thu, Apr 20, 2023 at 02:03:32PM -0700, Luis Chamberlain wrote:
> > udev is just the transport to kmod here, it's not in the job of
> > filtering duplicate messages.
> 
> udev nowadays use *lib*kmod. It's udev who has the
> context it can operate on.
> 
> Also, those module loads will not use the path this patch is changing
> call_modprobe is not the path that triggers udev to load modules.
> /me confused

This patch prooves that module auto-loading (request_modue() calls) and
so the /sbin/modprobe calls are *not* the issue. That is why udev was
the next candidate consideration.

> What can be done from userspace in the udev path
> 
> 1) udev to do the ratelimit'ing. Define a time window,
> filter out uevents in systemd/src/udev/udev-builtin-kmod.c
> 
> 2) libkmod to do the ratelimit'ing with a similar approach, but udev
> needs to tell libkmod what is the window it wants to use
> 
> 3) libkmod to act on the context it has from the *kernel*. It used
> to be cheap with the call simply blocking early on the syscall in
> a mutex... or we didn't have that many calls. So libkmod
> simply calls [f]init_module() again regardless of the module's
> state being in a "coming" state.  Is this the case here?

I only got so far as to also confirm libkmod is used, so if libkmod
does that check then this is already done, but the issue I think is
that I think that the races are so much that you still get duplicates.
So even if the check is done there are so many parallel calls that
the check doesn't help as the module won't be loaded for a while.

> I haven't seen this data.

Just build a modules-next [0] kernel with the new CONFIG_MODULE_STATS
and after boot cat /sys/kernel/debug/modules/stats. Then increase
the number of CPUs on the system by 2 and try again. Then enable
the new MODULE_DEBUG_AUTOLOAD_DUPS which I just pushed to modules-next
and see how many duplicates you see. If you don't see many then that
means the other source for duplicates should be udev.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=modules-next

> This is done to avoid a) the toctou implied and b) to
> provide the correct return for that call - libkmod can't know if the
> previous call will succeed or fail.

Just as with the kludge-of-concept I posted for kread [0], userspace
also should have similar issues in mapping module name to arbitrary
file names given:

  o a module can be in different paths and libkmod could for
     example at one point load a module in one path, then userspace
     removes it, and the next path is used.
  o module names may differ from the filename slightly (in the kernel
    we replace dash with "_", refer to KBUILD_MODNAME

So the only thing it could do is use the full path of the module used to
deter duplicates. Then, it could actually converge duplicate requests and
share the results just as my kludge-of-concept did.

[1] https://lore.kernel.org/all/ZDmAvwi+KNvie+OI@bombadil.infradead.org/T/#md172510af8fdf7e0f76f6caafee9c99f7a8b6de7

> libkmod only skips the call if the module is already in
> the live state.

It can do better, it can converge requests to avoid a kernel_read*()
from using vmalloc space. Note that this was not well known before,
but now it is clear.

I realize though that this could mean sharing a context between all
loads thoughs in udev, and such a change could take significant time
and review to complete.

If we *wanted* to do this in kernel instead, I have already shown it's
not hard.

> It seems systemd-udev also duplicates the check
> in src/shared/module-util.c:module_load_and_warn()

Evidence is showing that does not suffice for the races which are
currently possible.

> Note that libkmod already spares loading the module multiple times from
> disk as it uses a memory pool for the modules. It reuses one iff it
> comes from the same context (i.e. it's only udev involved and not a
> bunch of parallel calls to modprobe).

If a different context is used its not shared.

> 4) If all the calls are coming from the same context and it is udev...
> I'm not sure this is actually the problem - the udev's kmod builtin
> handler is single-threaded and will handle one request at a time.
> I don't see any data to confirm it's coming from a single source or
> multiple sources. Could you get a trace containing [f]init_module and
> the trace_module_request(), together with a verbose udev log?
> 
> If this is all coming from a synthetic use case with thousands of
> modprobe execs, I'm not sure there is much to do on the userspace side.

It's not synthetic, I rested simply increasing the number of CPUs on a
system, you can use kdevops for that if you want to try.

> > > > Why not
> > > > just rate-limit it in userspace if your system can't handle 10's of
> > > > thousands of kmod calls all at once? I think many s390 systems did this
> > > > decades ago when they were controlling 10's of thousands of scsi devices
> > > > and were hit with "device detection storms" at boot like this.
> > > 
> > > Boot is a special context and in this particular case I agree userspace
> > > kmod could/should be extended to avoid duplicate module requests in that
> 
> see above
> 
> > > context. But likewise the kernel should only have to try to issue a
> > > request for a single module once, if it could easily do that.
> > 
> > Are you sure that this is happening at boot in a way that userspace
> > didn't just trigger it on its own after init started up?  That happens
> > as a "coldboot" walk of the device tree and all uevent are regenerated.
> > That is userspace asking for this, so there's nothing that the kernel
> > can do.
> > 
> > > This does beg the question, why force userspace to rate limit if we
> > > can do better in the kernel? Specially if *we're the ones*, as you say,
> > > that are hinting to userspace to shoot back loading modules for us and we
> > > know we're just going to drop duplicates?
> > 
> > Maybe error out of duplicate module loading earlier?  I don't know,
> > sorry.
> 
> I still don't see what's the source of the problem from the data
> available. Is the kernel issuing multiple request_module()?

For the cases I saw it only accounted for *one* of the many duplicates.
So that's not it.

> Or is the
> kernel sending multiple udev event for userspace to map the alias to the
> module and load it?

That's what I suspect. Each CPU triggers tons of module loads.

> The mapping alias -> module currently belongs in
> userspace so if you are de-duplicating, it can't be only on the module
> name.

That's one way, but it can also do it on the path used too.

> > > > What specific devices and bus types are the problem here for these systems?
> > > 
> > > My best assessment of the situation is that each CPU in udev ends up triggering
> > > a load of duplicate set of modules, not just one, but *a lot*. Not sure
> > > what heuristics udev uses to load a set of modules per CPU.
> > 
> > Again, finding which device and bus is causing the problem is going to
> > be key here to try to solve the issue.  Are you logging duplicate module
> 
> agreed.
> 
> If the info I requested above is available on other threads, could you
> point me to those?
> 
> thanks
> Lucas De Marchi
> 
> > loads by name as well?

The above instructions on using modules-next will let you both see
what's going on.

  Luis
Lucas De Marchi April 21, 2023, 6:31 p.m. UTC | #8
On Fri, Apr 21, 2023 at 10:38:49AM -0700, Luis Chamberlain wrote:
>On Fri, Apr 21, 2023 at 09:42:39AM -0700, Lucas De Marchi wrote:
>> On Fri, Apr 21, 2023 at 05:12:51PM +0200, Greg KH wrote:
>> > On Thu, Apr 20, 2023 at 02:03:32PM -0700, Luis Chamberlain wrote:
>> > udev is just the transport to kmod here, it's not in the job of
>> > filtering duplicate messages.
>>
>> udev nowadays use *lib*kmod. It's udev who has the
>> context it can operate on.
>>
>> Also, those module loads will not use the path this patch is changing
>> call_modprobe is not the path that triggers udev to load modules.
>> /me confused
>
>This patch prooves that module auto-loading (request_modue() calls) and
>so the /sbin/modprobe calls are *not* the issue. That is why udev was
>the next candidate consideration.

that makes more sense.

>
>> What can be done from userspace in the udev path
>>
>> 1) udev to do the ratelimit'ing. Define a time window,
>> filter out uevents in systemd/src/udev/udev-builtin-kmod.c
>>
>> 2) libkmod to do the ratelimit'ing with a similar approach, but udev
>> needs to tell libkmod what is the window it wants to use
>>
>> 3) libkmod to act on the context it has from the *kernel*. It used
>> to be cheap with the call simply blocking early on the syscall in
>> a mutex... or we didn't have that many calls. So libkmod
>> simply calls [f]init_module() again regardless of the module's
>> state being in a "coming" state.  Is this the case here?
>
>I only got so far as to also confirm libkmod is used, so if libkmod
>does that check then this is already done, but the issue I think is
>that I think that the races are so much that you still get duplicates.
>So even if the check is done there are so many parallel calls that
>the check doesn't help as the module won't be loaded for a while.
>
>> I haven't seen this data.
>
>Just build a modules-next [0] kernel with the new CONFIG_MODULE_STATS
>and after boot cat /sys/kernel/debug/modules/stats. Then increase
>the number of CPUs on the system by 2 and try again. Then enable
>the new MODULE_DEBUG_AUTOLOAD_DUPS which I just pushed to modules-next
>and see how many duplicates you see. If you don't see many then that
>means the other source for duplicates should be udev.
>
>[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=modules-next
>
>> This is done to avoid a) the toctou implied and b) to
>> provide the correct return for that call - libkmod can't know if the
>> previous call will succeed or fail.
>
>Just as with the kludge-of-concept I posted for kread [0], userspace
>also should have similar issues in mapping module name to arbitrary
>file names given:
>
>  o a module can be in different paths and libkmod could for
>     example at one point load a module in one path, then userspace
>     removes it, and the next path is used.

no, it can't. Unless you are doing out of tree modules and loading them
manually by path. There can only be one module with the same name in kmod's
database. If you have duplicate modules, depmod will use the dir
priority configured by the distro (see depmod.d(5)).

Since we are talking about *udev* it's not a real possibility as
1) the udev requests are serialized
2) there is only 1 kmod ctx, so they use the same configuration, no
funky kmod_new("/another-rootfs", ...) type of thing.

>  o module names may differ from the filename slightly (in the kernel
>    we replace dash with "_", refer to KBUILD_MODNAME

this is taken care by depmod/libkmod too. All the aliases are mapped to
module names and then normalized. See modname_normalize() in kmod.

>
>So the only thing it could do is use the full path of the module used to
>deter duplicates. Then, it could actually converge duplicate requests and
>share the results just as my kludge-of-concept did.

this is assuming you are loading modules by path from random places.
It's not what udev does.

>
>[1] https://lore.kernel.org/all/ZDmAvwi+KNvie+OI@bombadil.infradead.org/T/#md172510af8fdf7e0f76f6caafee9c99f7a8b6de7
>
>> libkmod only skips the call if the module is already in
>> the live state.
>
>It can do better, it can converge requests to avoid a kernel_read*()
>from using vmalloc space. Note that this was not well known before,
>but now it is clear.

in userspace, if using the same context and using init_module() rather
than finit_module(), I **guess** we would have a similar thing due to
the memory pool for modules: we don't read the module again. That is not
true for finit_module() though as we just open and pass the fd.

>
>I realize though that this could mean sharing a context between all
>loads thoughs in udev, and such a change could take significant time
>and review to complete.

But there is only one context. There aren't multiple paralell requests
from multiple sources. Probably need to Cc someone still changing
udev's builtin...  but from a quick look, from what I remember about
that the last time I touched it and without data to prove me wrong,
it seems we are not looking at the right problem space to come up with a
solution.

>
>If we *wanted* to do this in kernel instead, I have already shown it's
>not hard.
>
>> It seems systemd-udev also duplicates the check
>> in src/shared/module-util.c:module_load_and_warn()
>
>Evidence is showing that does not suffice for the races which are
>currently possible.

can you raise the udev verbosity and share? All the kmod-builtin
calls will already be logged there. See
src/udev/udev-event.c:udev_event_execute_run() leading to

	log_device_debug(event->dev, "Running built-in command \"%s\"", command);
	r = udev_builtin_run(event->dev, &event->rtnl, builtin_cmd, command, false);

if you are rather seeing "Running command", ohh... then your udev was
built without libkmod and it will just fork/exec. Not what we want.

>
>> Note that libkmod already spares loading the module multiple times from
>> disk as it uses a memory pool for the modules. It reuses one iff it
>> comes from the same context (i.e. it's only udev involved and not a
>> bunch of parallel calls to modprobe).
>
>If a different context is used its not shared.

see above.

>
>> 4) If all the calls are coming from the same context and it is udev...
>> I'm not sure this is actually the problem - the udev's kmod builtin
>> handler is single-threaded and will handle one request at a time.
>> I don't see any data to confirm it's coming from a single source or
>> multiple sources. Could you get a trace containing [f]init_module and
>> the trace_module_request(), together with a verbose udev log?
>>
>> If this is all coming from a synthetic use case with thousands of
>> modprobe execs, I'm not sure there is much to do on the userspace side.
>
>It's not synthetic, I rested simply increasing the number of CPUs on a
>system, you can use kdevops for that if you want to try.
>
>> > > > Why not
>> > > > just rate-limit it in userspace if your system can't handle 10's of
>> > > > thousands of kmod calls all at once? I think many s390 systems did this
>> > > > decades ago when they were controlling 10's of thousands of scsi devices
>> > > > and were hit with "device detection storms" at boot like this.
>> > >
>> > > Boot is a special context and in this particular case I agree userspace
>> > > kmod could/should be extended to avoid duplicate module requests in that
>>
>> see above
>>
>> > > context. But likewise the kernel should only have to try to issue a
>> > > request for a single module once, if it could easily do that.
>> >
>> > Are you sure that this is happening at boot in a way that userspace
>> > didn't just trigger it on its own after init started up?  That happens
>> > as a "coldboot" walk of the device tree and all uevent are regenerated.
>> > That is userspace asking for this, so there's nothing that the kernel
>> > can do.
>> >
>> > > This does beg the question, why force userspace to rate limit if we
>> > > can do better in the kernel? Specially if *we're the ones*, as you say,
>> > > that are hinting to userspace to shoot back loading modules for us and we
>> > > know we're just going to drop duplicates?
>> >
>> > Maybe error out of duplicate module loading earlier?  I don't know,
>> > sorry.
>>
>> I still don't see what's the source of the problem from the data
>> available. Is the kernel issuing multiple request_module()?
>
>For the cases I saw it only accounted for *one* of the many duplicates.
>So that's not it.
>
>> Or is the
>> kernel sending multiple udev event for userspace to map the alias to the
>> module and load it?
>
>That's what I suspect. Each CPU triggers tons of module loads.

so it seems the easiest thing to do is collect the udev log.

>
>> The mapping alias -> module currently belongs in
>> userspace so if you are de-duplicating, it can't be only on the module
>> name.
>
>That's one way, but it can also do it on the path used too.

path should be irrelevant for the problem we are looking at.

>
>> > > > What specific devices and bus types are the problem here for these systems?
>> > >
>> > > My best assessment of the situation is that each CPU in udev ends up triggering
>> > > a load of duplicate set of modules, not just one, but *a lot*. Not sure
>> > > what heuristics udev uses to load a set of modules per CPU.
>> >
>> > Again, finding which device and bus is causing the problem is going to
>> > be key here to try to solve the issue.  Are you logging duplicate module
>>
>> agreed.
>>
>> If the info I requested above is available on other threads, could you
>> point me to those?
>>
>> thanks
>> Lucas De Marchi
>>
>> > loads by name as well?
>
>The above instructions on using modules-next will let you both see
>what's going on.

hopefully you don't have CONFIG_UEVENT_HELPER_PATH set or anything
mucking /sys/kernel/uevent_helper. Right?

Lucas De Marchi

>
>  Luis
Luis Chamberlain April 21, 2023, 6:45 p.m. UTC | #9
On Fri, Apr 21, 2023 at 11:31:03AM -0700, Lucas De Marchi wrote:
> On Fri, Apr 21, 2023 at 10:38:49AM -0700, Luis Chamberlain wrote:
> > Just as with the kludge-of-concept I posted for kread [0], userspace
> > also should have similar issues in mapping module name to arbitrary
> > file names given:
> > 
> >  o a module can be in different paths and libkmod could for
> >     example at one point load a module in one path, then userspace
> >     removes it, and the next path is used.
> 
> no, it can't. Unless you are doing out of tree modules and loading them
> manually by path. There can only be one module with the same name in kmod's
> database. If you have duplicate modules, depmod will use the dir
> priority configured by the distro (see depmod.d(5)).
> 
> Since we are talking about *udev* it's not a real possibility as
> 1) the udev requests are serialized
> 2) there is only 1 kmod ctx, so they use the same configuration, no
> funky kmod_new("/another-rootfs", ...) type of thing.
> 
> >  o module names may differ from the filename slightly (in the kernel
> >    we replace dash with "_", refer to KBUILD_MODNAME
> 
> this is taken care by depmod/libkmod too. All the aliases are mapped to
> module names and then normalized. See modname_normalize() in kmod.

Great! So this should be much simpler in userspace.

> > [1] https://lore.kernel.org/all/ZDmAvwi+KNvie+OI@bombadil.infradead.org/T/#md172510af8fdf7e0f76f6caafee9c99f7a8b6de7
> > 
> > > libkmod only skips the call if the module is already in
> > > the live state.
> > 
> > It can do better, it can converge requests to avoid a kernel_read*()
> > from using vmalloc space. Note that this was not well known before,
> > but now it is clear.
> 
> in userspace, if using the same context and using init_module() rather
> than finit_module(), I **guess** we would have a similar thing due to
> the memory pool for modules: we don't read the module again. That is not
> true for finit_module() though as we just open and pass the fd.

I think we could not not care about init_module() races for now.

> > I realize though that this could mean sharing a context between all
> > loads thoughs in udev, and such a change could take significant time
> > and review to complete.
> 
> But there is only one context. There aren't multiple paralell requests
> from multiple sources. Probably need to Cc someone still changing
> udev's builtin...  but from a quick look, from what I remember about
> that the last time I touched it and without data to prove me wrong,
> it seems we are not looking at the right problem space to come up with a
> solution.

Data seems to indicate that somehow this might not be true.

> > If we *wanted* to do this in kernel instead, I have already shown it's
> > not hard.
> > 
> > > It seems systemd-udev also duplicates the check
> > > in src/shared/module-util.c:module_load_and_warn()
> > 
> > Evidence is showing that does not suffice for the races which are
> > currently possible.
> 
> can you raise the udev verbosity and share?

How do I do that?

> All the kmod-builtin
> calls will already be logged there. See
> src/udev/udev-event.c:udev_event_execute_run() leading to
> 
> 	log_device_debug(event->dev, "Running built-in command \"%s\"", command);
> 	r = udev_builtin_run(event->dev, &event->rtnl, builtin_cmd, command, false);
> 
> if you are rather seeing "Running command", ohh... then your udev was
> built without libkmod and it will just fork/exec. Not what we want.

I'm using debian testing everything vanilla packages except the kernel,
using modules-next.

> so it seems the easiest thing to do is collect the udev log.
> 
> hopefully you don't have CONFIG_UEVENT_HELPER_PATH set or anything
> mucking /sys/kernel/uevent_helper. Right?

No.

  Luis
Petr Pavlu April 26, 2023, 10:13 a.m. UTC | #10
On 4/21/23 20:31, Lucas De Marchi wrote:
> On Fri, Apr 21, 2023 at 10:38:49AM -0700, Luis Chamberlain wrote:
[...]
>>> libkmod only skips the call if the module is already in
>>> the live state.
>>
>> It can do better, it can converge requests to avoid a kernel_read*()
>>from using vmalloc space. Note that this was not well known before,
>> but now it is clear.
> 
> in userspace, if using the same context and using init_module() rather
> than finit_module(), I **guess** we would have a similar thing due to
> the memory pool for modules: we don't read the module again. That is not
> true for finit_module() though as we just open and pass the fd.
> 
>>
>> I realize though that this could mean sharing a context between all
>> loads thoughs in udev, and such a change could take significant time
>> and review to complete.
> 
> But there is only one context. There aren't multiple paralell requests
> from multiple sources. Probably need to Cc someone still changing
> udev's builtin...  but from a quick look, from what I remember about
> that the last time I touched it and without data to prove me wrong,
> it seems we are not looking at the right problem space to come up with a
> solution.

My understanding is that udev workers are forked. An initial kmod context is
created by the main udevd process but no sharing happens after the fork. It
means that the mentioned memory pool logic doesn't really kick in.

Multiple parallel load requests come from multiple udev workers, for instance,
each handling an udev event for one CPU device and making the exactly same
requests as all others are doing at the same time.

The optimization idea would be to recognize these duplicate requests at the
udevd/kmod level and converge them.

Cheers,
Petr
diff mbox series

Patch

diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index e6df183e2c80..cc146ef4a6ac 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -59,6 +59,46 @@  config MODULE_STATS
 
 	  If unsure, say N.
 
+config MODULE_AUTOLOAD_SUPRESS_DUPS
+	bool "Debug duplicate modules with auto-loading"
+	help
+	  Module autoloading allows in-kernel code to request modules through
+	  the *request_module*() API calls. This in turn just calls userspace
+	  modprobe. Although modprobe checks to see if a module is already
+	  loaded before trying to load a module there is a small time window in
+	  which multiple duplicate requests can end up in userspace and multiple
+	  modprobe calls race calling finit_module() around the same time for
+	  duplicate modules. The finit_module() system call can consume in the
+	  worst case more than twice the respective module size in virtual
+	  memory for each duplicate module requests. Although duplicate module
+	  requests are non-fatal virtual memory is a limited resource and each
+	  duplicate module request ends up just wasting virtual memory.
+
+	  This debugging facility will create WARN() splats for duplicate module
+	  requests to help identify if module auto-loading is the culprit to your
+	  woes. Since virtual memory abuse caused by duplicate module requests
+	  could render a system unusable this functionality will also suppresses
+	  the waste in virtual memory caused by duplicate requests by sharing
+	  races in requests for the same module to a single unified request.
+	  Once a non-wait request_module() call completes a module should be
+	  loaded and modprobe should simply not allow new finit_module() calls.
+
+	  Enable this functionality to try to debug virtual memory abuse during
+	  boot on systems and identify if the abuse was due to module
+	  auto-loading.
+
+	  If the first module request used request_module_nowait() we cannot
+	  use that as the anchor to wait for duplicate module requests, since
+	  users of request_module() do want a proper return value. If a call
+	  for the same module happened earlier with request_module() though,
+	  then a duplicate request_module_nowait() would be detected.
+
+	  You want to enable this if you want to debug and see if duplicate
+	  module auto-loading might be causing virtual memory abuse during
+	  bootup. A kernel trace will be provided for each duplicate request.
+
+	  Disable this if you are on production.
+
 endif # MODULE_DEBUG
 
 config MODULE_FORCE_LOAD
diff --git a/kernel/module/Makefile b/kernel/module/Makefile
index 52340bce497e..e8b121ac39cf 100644
--- a/kernel/module/Makefile
+++ b/kernel/module/Makefile
@@ -10,6 +10,7 @@  KCOV_INSTRUMENT_module.o := n
 obj-y += main.o
 obj-y += strict_rwx.o
 obj-y += kmod.o
+obj-$(CONFIG_MODULE_AUTOLOAD_SUPRESS_DUPS) += dups.o
 obj-$(CONFIG_MODULE_DECOMPRESS) += decompress.o
 obj-$(CONFIG_MODULE_SIG) += signing.o
 obj-$(CONFIG_LIVEPATCH) += livepatch.o
diff --git a/kernel/module/dups.c b/kernel/module/dups.c
new file mode 100644
index 000000000000..903ab7c7e8f4
--- /dev/null
+++ b/kernel/module/dups.c
@@ -0,0 +1,234 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * kmod dups - the kernel module autoloader duplicate suppressor
+ *
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/binfmts.h>
+#include <linux/syscalls.h>
+#include <linux/unistd.h>
+#include <linux/kmod.h>
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/cred.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/workqueue.h>
+#include <linux/security.h>
+#include <linux/mount.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/resource.h>
+#include <linux/notifier.h>
+#include <linux/suspend.h>
+#include <linux/rwsem.h>
+#include <linux/ptrace.h>
+#include <linux/async.h>
+#include <linux/uaccess.h>
+
+DEFINE_MUTEX(kmod_dup_mutex);
+static LIST_HEAD(dup_kmod_reqs);
+
+struct kmod_dup_req {
+	struct list_head list;
+	char name[MODULE_NAME_LEN];
+	struct completion first_req_done;
+	struct work_struct complete_work;
+	struct delayed_work delete_work;
+	int dup_ret;
+};
+
+static struct kmod_dup_req *kmod_dup_request_lookup(char *module_name)
+{
+	struct kmod_dup_req *kmod_req;
+
+	list_for_each_entry_rcu(kmod_req, &dup_kmod_reqs, list,
+				lockdep_is_held(&kmod_dup_mutex)) {
+		if (strlen(kmod_req->name) == strlen(module_name) &&
+		    !memcmp(kmod_req->name, module_name, strlen(module_name))) {
+			return kmod_req;
+                }
+        }
+
+	return NULL;
+}
+
+static void kmod_dup_request_delete(struct work_struct *work)
+{
+	struct kmod_dup_req *kmod_req;
+	kmod_req = container_of(to_delayed_work(work), struct kmod_dup_req, delete_work);
+
+	/*
+	 * The typical situation is a module successully loaded. In that
+	 * situation the module will be present already in userspace. If
+	 * new requests come in after that, userspace will already know the
+	 * module is loaded so will just return 0 right away. There is still
+	 * a small chance right after we delete this entry new request_module()
+	 * calls may happen after that, they can happen. These heuristics
+	 * are to protect finit_module() abuse for auto-loading, if modules
+	 * are still tryign to auto-load even if a module is already loaded,
+	 * that's on them, and those inneficiencies should not be fixed by
+	 * kmod. The inneficies there are a call to modprobe and modprobe
+	 * just returning 0.
+	 */
+	mutex_lock(&kmod_dup_mutex);
+	list_del_rcu(&kmod_req->list);
+	synchronize_rcu();
+	mutex_unlock(&kmod_dup_mutex);
+	kfree(kmod_req);
+}
+
+static void kmod_dup_request_complete(struct work_struct *work)
+{
+	struct kmod_dup_req *kmod_req;
+
+	kmod_req = container_of(work, struct kmod_dup_req, complete_work);
+
+	/*
+	 * This will ensure that the kernel will let all the waiters get
+	 * informed its time to check the return value. It's time to
+	 * go home.
+	 */
+	complete_all(&kmod_req->first_req_done);
+
+	/*
+	 * Now that we have allowed prior request_module() calls to go on
+	 * with life, let's schedule deleting this entry. We don't have
+	 * to do it right away, but we *eventually* want to do it so to not
+	 * let this linger forever as this is just a boot optimization for
+	 * possible abuses of vmalloc() incurred by finit_module() thrashing.
+	 */
+	queue_delayed_work(system_wq, &kmod_req->delete_work, 60 * HZ);
+}
+
+bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret)
+{
+	struct kmod_dup_req *kmod_req, *new_kmod_req;
+	int ret;
+
+	/*
+	 * Pre-allocate the entry in case we have to use it later
+	 * to avoid contention with the mutex.
+	 */
+	new_kmod_req = kzalloc(sizeof(*new_kmod_req), GFP_KERNEL);
+	if (!new_kmod_req)
+		return false;
+
+	memcpy(new_kmod_req->name, module_name, strlen(module_name));
+	INIT_WORK(&new_kmod_req->complete_work, kmod_dup_request_complete);
+	INIT_DELAYED_WORK(&new_kmod_req->delete_work, kmod_dup_request_delete);
+	init_completion(&new_kmod_req->first_req_done);
+
+	mutex_lock(&kmod_dup_mutex);
+
+	kmod_req = kmod_dup_request_lookup(module_name);
+	if (!kmod_req) {
+		/*
+		 * If the first request that came through for a module
+		 * was with request_module_nowait() we cannot wait for it
+		 * and share its return value with other users which may
+		 * have used request_module() and need a proper return value
+		 * so just skip using them as an anchor.
+		 *
+		 * If a prior request to this one came through with
+		 * request_module() though, then a request_module_nowait()
+		 * would benefit from duplicate detection.
+		 */
+		if (!wait) {
+			kfree(new_kmod_req);
+			pr_warn("New request_module_nowait() for %s -- cannot track duplicates for this request\n", module_name);
+			mutex_unlock(&kmod_dup_mutex);
+			return false;
+		}
+
+		/*
+		 * There was no duplicate, just add the request so we can
+		 * keep tab on duplicates later.
+		 */
+		pr_info("New request_module() for %s\n", module_name);
+		list_add_rcu(&new_kmod_req->list, &dup_kmod_reqs);
+		mutex_unlock(&kmod_dup_mutex);
+		return false;
+	}
+	mutex_unlock(&kmod_dup_mutex);
+
+	/* We are dealing with a duplicate request now */
+
+	kfree(new_kmod_req);
+
+	/*
+	 * To fix these try to use try_then_request_module() instead as that
+	 * will check if the component you are looking for is present or not.
+	 * You could also just queue a single request to load the module once,
+	 * instead of having each and everything you need try to request for
+	 * the module.
+	 *
+	 * Duplicate request_module() calls  can cause quite a bit of wasted
+	 * vmalloc() space when racing with userspace.
+	 */
+	WARN(1, "module-autoload: duplicate request for module %s\n", module_name);
+
+	if (!wait) {
+		/*
+		 * If request_module_nowait() was used then the user just
+		 * wanted to issue the request and if another module request
+		 * was already its way with the same name we don't care for
+		 * the return value either. Let duplicate request_module_nowait()
+		 * calls bail out right away.
+		 */
+		*dup_ret = 0;
+		return true;
+	}
+
+	/*
+	 * If a duplicate request_module() was used they *may* care for
+	 * the return value, so we have no other option but to wait for
+	 * the first caller to complete. If the first caller used
+	 * the request_module_nowait() call, subsquent callers will
+	 * deal with the comprmise of getting a successful call with this
+	 * optimization enabled ...
+	 */
+	ret = wait_for_completion_state(&kmod_req->first_req_done,
+					TASK_UNINTERRUPTIBLE | TASK_KILLABLE);
+	if (ret) {
+		*dup_ret = ret;
+		return true;
+	}
+
+	/* Now the duplicate request has the same exact return value as the first request */
+	*dup_ret = kmod_req->dup_ret;
+
+	return true;
+}
+
+void kmod_dup_request_announce(char *module_name, int ret)
+{
+	struct kmod_dup_req *kmod_req;
+
+	mutex_lock(&kmod_dup_mutex);
+
+	kmod_req = kmod_dup_request_lookup(module_name);
+	if (!kmod_req)
+		goto out;
+
+	kmod_req->dup_ret = ret;
+
+	/*
+	 * If we complete() here we may allow duplicate threads
+	 * to continue before the first one that submitted the
+	 * request. We're in no rush also, given that each and
+	 * every bounce back to userspace is slow we avoid that
+	 * with a slight delay here. So queueue up the completion
+	 * and let duplicates suffer, just wait a tad bit longer.
+	 * There is no rush. But we also don't want to hold the
+	 * caller up forever or introduce any boot delays.
+	 */
+	queue_work(system_wq, &kmod_req->complete_work);
+
+out:
+	mutex_unlock(&kmod_dup_mutex);
+}
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 1fd75dd346dc..962f146336e9 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -3,6 +3,7 @@ 
  *
  * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
  */
 
 #include <linux/elf.h>
@@ -221,6 +222,20 @@  static inline void mod_stat_bump_becoming(struct load_info *info, int flags)
 
 #endif /* CONFIG_MODULE_STATS */
 
+#ifdef CONFIG_MODULE_AUTOLOAD_SUPRESS_DUPS
+bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret);
+void kmod_dup_request_announce(char *module_name, int ret);
+#else
+static inline bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret)
+{
+	return false;
+}
+
+static inline void kmod_dup_request_announce(char *module_name, int ret)
+{
+}
+#endif
+
 #ifdef CONFIG_MODULE_UNLOAD_TAINT_TRACKING
 struct mod_unload_taint {
 	struct list_head list;
diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c
index 5899083436a3..0800d9891692 100644
--- a/kernel/module/kmod.c
+++ b/kernel/module/kmod.c
@@ -1,6 +1,9 @@ 
 /*
  * kmod - the kernel module loader
+ *
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
  */
+
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/sched/task.h>
@@ -27,6 +30,7 @@ 
 #include <linux/uaccess.h>
 
 #include <trace/events/module.h>
+#include "internal.h"
 
 /*
  * Assuming:
@@ -65,7 +69,7 @@  static void free_modprobe_argv(struct subprocess_info *info)
 	kfree(info->argv);
 }
 
-static int call_modprobe(char *module_name, int wait)
+static int call_modprobe(char *orig_module_name, int wait)
 {
 	struct subprocess_info *info;
 	static char *envp[] = {
@@ -74,12 +78,14 @@  static int call_modprobe(char *module_name, int wait)
 		"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
 		NULL
 	};
+	char *module_name;
+	int ret;
 
 	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
 	if (!argv)
 		goto out;
 
-	module_name = kstrdup(module_name, GFP_KERNEL);
+	module_name = kstrdup(orig_module_name, GFP_KERNEL);
 	if (!module_name)
 		goto free_argv;
 
@@ -94,13 +100,16 @@  static int call_modprobe(char *module_name, int wait)
 	if (!info)
 		goto free_module_name;
 
-	return call_usermodehelper_exec(info, wait | UMH_KILLABLE);
+	ret = call_usermodehelper_exec(info, wait | UMH_KILLABLE);
+	kmod_dup_request_announce(orig_module_name, ret);
+	return ret;
 
 free_module_name:
 	kfree(module_name);
 free_argv:
 	kfree(argv);
 out:
+	kmod_dup_request_announce(orig_module_name, -ENOMEM);
 	return -ENOMEM;
 }
 
@@ -124,7 +133,7 @@  int __request_module(bool wait, const char *fmt, ...)
 {
 	va_list args;
 	char module_name[MODULE_NAME_LEN];
-	int ret;
+	int ret, dup_ret;
 
 	/*
 	 * We don't allow synchronous module loading from async.  Module
@@ -156,8 +165,14 @@  int __request_module(bool wait, const char *fmt, ...)
 
 	trace_module_request(module_name, wait, _RET_IP_);
 
+	if (kmod_dup_request_exists_wait(module_name, wait, &dup_ret)) {
+		ret = dup_ret;
+		goto out;
+	}
+
 	ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
 
+out:
 	up(&kmod_concurrent_max);
 
 	return ret;