Message ID | 20220919123233.8538-3-petr.pavlu@suse.com (mailing list archive)
State      | New, archived
Series     | module: Merge same-name module load requests
On Mon, Sep 19, 2022 at 02:32:33PM +0200, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call.

Please take a look at the kmod selftest lib/test_kmod.c and the respective shell selftest tools/testing/selftests/kmod/kmod.sh. Can you modify them to add support for reproducing this issue?

> For instance, udev can make a request to insert a frequency module for each individual CPU

That seems stupid indeed; it would seem we should surely be able to prevent such cases, and it can't just be happening for frequency modules.

> when another frequency module is already loaded which causes the init function of the new module to return an error.

Holy smokes.

> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.

Sure..

> This patch improves the behavior as follows:
> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.

OK nice.

> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.

Groovy.

> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
I'm having a hard time with this, because it is not clear if you are suggesting this is a regression introduced by 6e6de3dee51a or not. I'd like you to evaluate the impact of *not* merging a fix to older kernels. In practice I think we'd end up with delays on boot, but is that all? Would boot ever fail? The commit log does not make that clear.

The commit log should make it clear whether this is a regression or not and the impact of not having these fixes merged. Please note that bots will try to scrape for fixes and I suspect bots will pour their heart out on this commit log and identify and assume this is a fix already as-is.

If this *is* a regression, we should try to see how perhaps we can split this up into a part which is mergeable to stable and then a secondary part which does some new fancy optimizations.

  Luis
On Mon, Sep 19, 2022 at 2:33 PM Petr Pavlu <petr.pavlu@suse.com> wrote:
>
> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>
> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>
> This patch improves the behavior as follows:
> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>
> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>
> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
> ---

Hi Petr,

as you might have seen I sent a patch/fix yesterday (not being aware of this patch and that this is also a performance issue, which is interesting), that similarly makes sure that modules are unique early.
https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

It doesn't perform the -EBUSY changes or use something like shared_load_info/refcounts; it simply uses a second list while the module cannot be placed onto the module list yet. Not sure if that part is really required (e.g., for performance reasons).

Like Luis, I feel like some of these parts could be split into separate patches, if the other parts are really required.

I just tested your patch in the environment where I can reproduce the vmap allocation issue, and (unsurprisingly) this patch similarly seems to fix the issue. So if your patch ends up upstream, it would be good to add some details from my patch description (the vmap allocation issue) to this patch's description.

Cheers,
David
On Mon 2022-09-19 14:32:33, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>
> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>
> This patch improves the behavior as follows:
> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>
> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.

> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -61,14 +61,29 @@
>
>  /*
>   * Mutex protects:
> - * 1) List of modules (also safely readable with preempt_disable),
> + * 1) list of modules (also safely readable with preempt_disable, delete and add
> + *    uses RCU list operations).
>   * 2) module_use links,
> - * 3) mod_tree.addr_min/mod_tree.addr_max.
> - *    (delete and add uses RCU list operations).
> + * 3) mod_tree.addr_min/mod_tree.addr_max,
> + * 4) list of unloaded_tainted_modules.

This is unrelated and should be done in a separate patch. It minimizes the size of the patch and simplifies review. More importantly, these unrelated changes will not get lost when the patch gets reverted for some reason.

> + * 5) list of running_loads.
> + */
>  DEFINE_MUTEX(module_mutex);
>  LIST_HEAD(modules);
>
> +/* Shared information to track duplicate module loads. */
> +struct shared_load_info {
> +	char name[MODULE_NAME_LEN];
> +	refcount_t refcnt;
> +	struct list_head list;
> +	struct completion done;
> +	int err;
> +};
> +static LIST_HEAD(running_loads);
> +
> +/* Waiting for a module to finish loading? */
> +static DECLARE_WAIT_QUEUE_HEAD(module_wq);

It is not obvious why this is actually needed. One would expect that using the completion in struct shared_load_info would be enough.

It is needed because the user, resolve_symbol_wait(), does not know the exact name of the module that it is waiting for.

It would deserve a comment and maybe even renaming, for example:

/*
 * Waiting for a module load when the exact module name is not known,
 * for example, when resolving symbols from other modules.
 */
static DECLARE_WAIT_QUEUE_HEAD(any_module_load_wq);

> +
>  /* Work queue for freeing init sections in success case */
>  static void do_free_init(struct work_struct *w);
>  static DECLARE_WORK(init_free_wq, do_free_init);
> @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>  	strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
>
>  	free_module(mod);
> -	/* someone could wait for the module in add_unformed_module() */
> -	wake_up_interruptible(&module_wq);

The new code no longer wakes module_wq when the module is removed. I wondered whether it is correct. It is a bit tricky.

module_wq used to have two purposes. It was used to wake up:

1. parallel loads of the same module,
2. resolving symbols in a module using symbols from another module.

Ad 1. The new code handles the parallel load using struct shared_load_info. Also the new code does not wait when the same module is being removed. So the wake up is not needed here.

Ad 2. The module_wq is typically used when the other module is loaded automatically because of module dependencies. In this case, only the wake up in load_module() is important. But what about module removal? IMHO, there is a small race window.

The problem is that the module operations are asynchronous. And A takes the reference on B only after it resolves a symbol, see ref_module() called in resolve_symbol().

Let's have two modules A and B where module A uses a symbol from module B:

CPU 0:                          CPU 1:                 CPU 3:

modprobe A
  // sees dependency on B
  // and calls modprobe B
  // in a separate thread
                                modprobe B
                                  return -EEXIST
                                                       rmmod B
                                                       // finished
  resolve_symbol_wait(sym_from_B)

It has to wait until the 30s timeout because module B is gone and it is not being loaded.

IMHO, the new code makes the race window slightly bigger because modprobe B returns immediately even when rmmod B is already in progress.

IMHO, this is acceptable because the race was there anyway. And it is not a very realistic scenario.
>  	return 0;
>  out:
>  	mutex_unlock(&module_mutex);
> @@ -2552,43 +2540,133 @@ static int may_init_module(void)
>  	return 0;
>  }
>
> +static struct shared_load_info *
> +shared_load_info_alloc(const struct load_info *info)
> +{
> +	struct shared_load_info *shared_info =
> +		kzalloc(sizeof(*shared_info), GFP_KERNEL);
> +	if (shared_info == NULL)
> +		return ERR_PTR(-ENOMEM);
> +
> +	strscpy(shared_info->name, info->name, sizeof(shared_info->name));
> +	refcount_set(&shared_info->refcnt, 1);
> +	INIT_LIST_HEAD(&shared_info->list);
> +	init_completion(&shared_info->done);
> +	return shared_info;
> +}
> +
> +static void shared_load_info_get(struct shared_load_info *shared_info)
> +{
> +	refcount_inc(&shared_info->refcnt);
> +}
> +
> +static void shared_load_info_put(struct shared_load_info *shared_info)
> +{
> +	if (refcount_dec_and_test(&shared_info->refcnt))
> +		kfree(shared_info);
> +}
> +
>  /*
> - * We try to place it in the list now to make sure it's unique before
> - * we dedicate too many resources. In particular, temporary percpu
> + * Check that a module load is unique and make it visible to others. The code
> + * looks for parallel running inserts and already loaded modules. Two inserts
> + * are considered equivalent if their module name matches. In case this load
> + * duplicates another running insert, the code waits for its completion and
> + * then returns -EEXIST or -EBUSY depending on whether it succeeded.
> + *
> + * Detecting early that a load is unique avoids dedicating too many cycles and
> + * resources to bring up the module. In particular, it prevents temporary percpu
>  * memory exhaustion.
> + *
> + * Merging same load requests then primarily helps during the boot process. It
> + * can happen that the kernel receives a burst of requests to load the same
> + * module (for example, a same module for each individual CPU) and loading it
> + * eventually fails during its init call. Merging the requests allows that only
> + * one full attempt to load the module is made.
> + *
> + * On a non-error return, it is guaranteed that this load is unique.
>  */
> -static int add_unformed_module(struct module *mod)
> +static struct shared_load_info *add_running_load(const struct load_info *info)
>  {
> -	int err;
>  	struct module *old;
> +	struct shared_load_info *shared_info;
>
> -	mod->state = MODULE_STATE_UNFORMED;
> -
> -again:
>  	mutex_lock(&module_mutex);
> -	old = find_module_all(mod->name, strlen(mod->name), true);
> -	if (old != NULL) {
> -		if (old->state != MODULE_STATE_LIVE) {
> -			/* Wait in case it fails to load. */
> +
> +	/* Search if there is a running load of a module with the same name. */
> +	list_for_each_entry(shared_info, &running_loads, list)
> +		if (strcmp(shared_info->name, info->name) == 0) {
> +			int err;
> +
> +			shared_load_info_get(shared_info);
>  			mutex_unlock(&module_mutex);
> -			err = wait_event_interruptible(module_wq,
> -					finished_loading(mod->name));
> -			if (err)
> -				goto out_unlocked;
> -			goto again;
> +
> +			err = wait_for_completion_interruptible(
> +				&shared_info->done);

This would deserve a comment, for example:

			/*
			 * Return -EBUSY when the parallel load failed
			 * for any reason. This load might end up
			 * another way but we are not going to try.
			 */

> +			if (!err)
> +				err = shared_info->err ? -EBUSY : -EEXIST;
> +			shared_load_info_put(shared_info);
> +			shared_info = ERR_PTR(err);
> +			goto out_unlocked;
>  		}
> -		err = -EEXIST;
> +
> +	/* Search if there is a live module with the given name already. */
> +	old = find_module_all(info->name, strlen(info->name), true);
> +	if (old != NULL) {
> +		if (old->state == MODULE_STATE_LIVE) {
> +			shared_info = ERR_PTR(-EEXIST);
> +			goto out;
> +		}
> +
> +		/*
> +		 * Any active load always has its record in running_loads and so
> +		 * would be found above. This applies independent whether such
> +		 * a module is currently in MODULE_STATE_UNFORMED,
> +		 * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its
> +		 * initialization failed. It therefore means this must be an
> +		 * older going module and the caller should try later once it is
> +		 * gone.
> +		 */
> +		WARN_ON(old->state != MODULE_STATE_GOING);
> +		shared_info = ERR_PTR(-EBUSY);
>  		goto out;
>  	}
> -	mod_update_bounds(mod);
> -	list_add_rcu(&mod->list, &modules);
> -	mod_tree_insert(mod);
> -	err = 0;
> +
> +	/* The load is unique, make it visible to others. */
> +	shared_info = shared_load_info_alloc(info);
> +	if (IS_ERR(shared_info))
> +		goto out;
> +	list_add(&shared_info->list, &running_loads);
>
>  out:
>  	mutex_unlock(&module_mutex);
>  out_unlocked:
> -	return err;
> +	return shared_info;
> +}
> +
> +static void finalize_running_load(struct shared_load_info *shared_info, int err)
> +{
> +	/* Inform other duplicate inserts that the load finished. */
> +	mutex_lock(&module_mutex);
> +	list_del(&shared_info->list);
> +	shared_info->err = err;
> +	mutex_unlock(&module_mutex);
> +
> +	complete_all(&shared_info->done);
> +	shared_load_info_put(shared_info);
> +
> +	/* Tell other modules waiting on this one that it completed loading. */
> +	wake_up_interruptible(&module_wq);
> +}
> +
> +static void add_unformed_module(struct module *mod)
> +{
> +	mod->state = MODULE_STATE_UNFORMED;
> +
> +	mutex_lock(&module_mutex);
> +	mod_update_bounds(mod);
> +	list_add_rcu(&mod->list, &modules);
> +	mod_tree_insert(mod);
> +	mutex_unlock(&module_mutex);
>  }
>
>  static int complete_formation(struct module *mod, struct load_info *info)

The comments are more or less about cosmetic problems. I do not see any real functional problem. I am sorry that I did not mention them when reviewing v1.

Feel free to use:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr
On 9/30/22 22:30, Luis Chamberlain wrote:
> On Mon, Sep 19, 2022 at 02:32:33PM +0200, Petr Pavlu wrote:
>> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call.
>
> Please take a look at kmod selftest lib/test_kmod.c and the respective shell selftest tools/testing/selftests/kmod/kmod.sh. Can you modify it to add support to reproduce this issue?

It was possible for me to write some kselftests for this. I will post them as a separate patch in v3.

>> For instance, udev can make a request to insert a frequency module for each individual CPU
>
> That seems stupid indeed, it would seem we should be able for sure to prevent such cases, it can't just be happening for frequency modules.

The issue was also observed with EDAC drivers which are similarly exclusive.

>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>
> I'm having a hard time with this, because it is not clear if you are suggesting this is a regression introduced by 6e6de3dee51a or not. I'd like you to evaluate the impact of *not* merging a fix to older kernels. In practice I think we'd end up with delays on boot, but is that all? Would boot ever fail? The commit log does not make that clear.
>
> The commit log should make it clear whether this is a regression or not and the impact of not having these fixes merged. Please note that bots will try to scrape for fixes and I suspect bots will pour their heart out on this commit log and identify and assume this is a fix already as-is.
I touched on this somewhat in my response to review comments on v1 from Petr Mladek [1] but it looks like I failed to appropriately update the commit message in the new version. I will try to improve it in v3.

The patch does address a regression observed after commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"). I guess it can have a Fixes tag added to the patch.

To add more information, the following is a test from a machine with 288 CPUs which I performed when preparing this patch. The system had Tumbleweed 20220829 installed on it. The boot process tried to load 288x pcc_cpufreq and 576x acpi_cpufreq modules which all failed because intel_pstate was already active.

The test used three custom builds. The base was 6.0-rc3, 'revert' is base + revert of 6e6de3dee51a, 'my' is base + the proposed fix. Compiled modules were uncompressed and unsigned. Each configuration had its boot tested 5 times. Time was measured from the first load attempt of a given module to the last one, by simply looking at messages such as "Inserted module 'acpi_cpufreq'" in the udev log and their timestamps. All times are in seconds.

|      |                    Configuration                    |
| Boot |      base       |     revert      |       my        |
|  v   |  pcc   |  acpi  |  pcc   |  acpi  |  pcc   |  acpi  |
+------+--------+--------+--------+--------+--------+--------+
| #1   | 45.374 | 45.462 |  1.992 |  8.509 |  2.190 |  6.931 |
| #2   | 44.727 | 44.712 |  2.249 | 11.436 |  1.821 |  8.413 |
| #3   | 45.450 | 45.771 |  1.685 |  8.784 |  1.964 |  6.341 |
| #4   | 44.306 | 44.840 |  2.469 |  9.611 |  2.362 |  6.856 |
| #5   | 45.132 | 45.216 |  2.063 |  8.782 |  1.717 |  6.405 |
+------+--------+--------+--------+--------+--------+--------+
| Avg  | 44.998 | 45.200 |  2.092 |  9.424 |  2.011 |  6.989 |

This shows the observed regression and results with the proposed fix.

> If this *is* a regression, we should try to see how perhaps we can split this up into a part which is mergeable to stable and then a secondary part which does some new fancy optimizations.

I think it is hard to split this patch into parts because the implemented "optimization" is the fix.

[1] https://lore.kernel.org/linux-modules/0ccb384f-bbd5-f0fd-3832-c2255df505b2@suse.com/

Thanks,
Petr
On 10/14/22 09:54, David Hildenbrand wrote:
> On Mon, Sep 19, 2022 at 2:33 PM Petr Pavlu <petr.pavlu@suse.com> wrote:
>>
>> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>>
>> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>>
>> This patch improves the behavior as follows:
>> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
>> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>>
>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>>
>> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
>> ---
>
> Hi Petr,
>
> as you might have seen I sent a patch/fix yesterday (not being aware of this patch and that this is also a performance issue, which is interesting), that similarly makes sure that modules are unique early.
>
> https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
>
> It doesn't perform the -EBUSY changes or use something like shared_load_info/refcounts; it simply uses a second list while the module cannot be placed onto the module list yet.
>
> Not sure if that part is really required (e.g., for performance reasons). Like Luis, I feel like some of these parts could be split into separate patches, if the other parts are really required.

The shared_load_info/refcounts/-EBUSY logic is actually an important part which addresses the regression mentioned in the commit message and which I'm primarily trying to fix.

> I just tested your patch in the environment where I can reproduce the vmap allocation issue, and (unsurprisingly) this patch similarly seems to fix the issue.
>
> So if your patch ends up upstream, it would be good to add some details of my patch description (vmap allocation issue) to this patch description.

Thanks for testing this patch. I will add a note about the vmap allocation issue to the patch description.

Petr
On 10/14/22 15:52, Petr Mladek wrote:
> On Mon 2022-09-19 14:32:33, Petr Pavlu wrote:
>> During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error.
>>
>> The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot.
>>
>> This patch improves the behavior as follows:
>> * A check whether a module load matches an already loaded module is moved right after a module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going.
>> * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded.
>>
>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some of same load requests but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form.
>
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -61,14 +61,29 @@
>>
>>  /*
>>   * Mutex protects:
>> - * 1) List of modules (also safely readable with preempt_disable),
>> + * 1) list of modules (also safely readable with preempt_disable, delete and add
>> + *    uses RCU list operations).
>>   * 2) module_use links,
>> - * 3) mod_tree.addr_min/mod_tree.addr_max.
>> - *    (delete and add uses RCU list operations).
>> + * 3) mod_tree.addr_min/mod_tree.addr_max,
>> + * 4) list of unloaded_tainted_modules.
>
> This is unrelated and should be done in a separate patch. It minimizes the size of the patch and simplifies review. More importantly, these unrelated changes will not get lost when the patch gets reverted for some reason.

Ok.

>> + * 5) list of running_loads.
>> + */
>>  DEFINE_MUTEX(module_mutex);
>>  LIST_HEAD(modules);
>>
>> +/* Shared information to track duplicate module loads. */
>> +struct shared_load_info {
>> +	char name[MODULE_NAME_LEN];
>> +	refcount_t refcnt;
>> +	struct list_head list;
>> +	struct completion done;
>> +	int err;
>> +};
>> +static LIST_HEAD(running_loads);
>> +
>> +/* Waiting for a module to finish loading? */
>> +static DECLARE_WAIT_QUEUE_HEAD(module_wq);
>
> It is not obvious why this is actually needed. One would expect that using the completion in struct shared_load_info would be enough.
>
> It is needed because the user, resolve_symbol_wait(), does not know the exact name of the module that it is waiting for.
>
> It would deserve a comment and maybe even renaming, for example:
>
> /*
>  * Waiting for a module load when the exact module name is not known,
>  * for example, when resolving symbols from other modules.
>  */
> static DECLARE_WAIT_QUEUE_HEAD(any_module_load_wq);

I will adjust the comment.

>> +
>>  /* Work queue for freeing init sections in success case */
>>  static void do_free_init(struct work_struct *w);
>>  static DECLARE_WORK(init_free_wq, do_free_init);
>> @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>>  	strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
>>
>>  	free_module(mod);
>> -	/* someone could wait for the module in add_unformed_module() */
>> -	wake_up_interruptible(&module_wq);
>
> The new code no longer wakes module_wq when the module is removed.
> I wondered whether it is correct. It is a bit tricky.
>
> module_wq used to have two purposes. It was used to wake up:
>
> 1. parallel loads of the same module,
> 2. resolving symbols in a module using symbols from another module.
>
> Ad 1. The new code handles the parallel load using struct shared_load_info. Also the new code does not wait when the same module is being removed. So the wake up is not needed here.
>
> Ad 2. The module_wq is typically used when the other module is loaded automatically because of module dependencies. In this case, only the wake up in load_module() is important. But what about module removal? IMHO, there is a small race window.
>
> The problem is that the module operations are asynchronous. And A takes the reference on B only after it resolves a symbol, see ref_module() called in resolve_symbol().
>
> Let's have two modules A and B where module A uses a symbol from module B:
>
> CPU 0:                          CPU 1:                 CPU 3:
>
> modprobe A
>   // sees dependency on B
>   // and calls modprobe B
>   // in a separate thread
>                                 modprobe B
>                                   return -EEXIST
>                                                        rmmod B
>                                                        // finished
>   resolve_symbol_wait(sym_from_B)
>
> It has to wait until the 30s timeout because module B is gone and it is not being loaded.
>
> IMHO, the new code makes the race window slightly bigger because modprobe B returns immediately even when rmmod B is already in progress.
>
> IMHO, this is acceptable because the race was there anyway. And it is not a very realistic scenario.

I'm not sure if I understand this scenario. My reading is that modprobe of A would wait in resolve_symbol_wait() only if module B is in the coming state, and then would get later woken up from finalize_running_load(). In all other cases, resolve_symbol_wait() should not sleep:

* If B is not loaded at all or is in the unformed state then resolve_symbol() -> find_symbol() detects that a needed symbol is missing. NULL is then returned from resolve_symbol() and subsequently from resolve_symbol_wait().
* If B is live then resolve_symbol() should successfully resolve the symbol and a refcount is increased on this module.
* If B is going then resolve_symbol() -> ref_module() -> strong_try_module_get() -> try_module_get() should fail. This then gets propagated as -ENODEV from resolve_symbol() and resolve_symbol_wait().

>>  	return 0;
>>  out:
>>  	mutex_unlock(&module_mutex);
>> @@ -2552,43 +2540,133 @@ static int may_init_module(void)
>>  	return 0;
>>  }
>>
>> +static struct shared_load_info *
>> +shared_load_info_alloc(const struct load_info *info)
>> +{
>> +	struct shared_load_info *shared_info =
>> +		kzalloc(sizeof(*shared_info), GFP_KERNEL);
>> +	if (shared_info == NULL)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	strscpy(shared_info->name, info->name, sizeof(shared_info->name));
>> +	refcount_set(&shared_info->refcnt, 1);
>> +	INIT_LIST_HEAD(&shared_info->list);
>> +	init_completion(&shared_info->done);
>> +	return shared_info;
>> +}
>> +
>> +static void shared_load_info_get(struct shared_load_info *shared_info)
>> +{
>> +	refcount_inc(&shared_info->refcnt);
>> +}
>> +
>> +static void shared_load_info_put(struct shared_load_info *shared_info)
>> +{
>> +	if (refcount_dec_and_test(&shared_info->refcnt))
>> +		kfree(shared_info);
>> +}
>> +
>>  /*
>> - * We try to place it in the list now to make sure it's unique before
>> - * we dedicate too many resources. In particular, temporary percpu
>> + * Check that a module load is unique and make it visible to others. The code
>> + * looks for parallel running inserts and already loaded modules. Two inserts
>> + * are considered equivalent if their module name matches. In case this load
>> + * duplicates another running insert, the code waits for its completion and
>> + * then returns -EEXIST or -EBUSY depending on whether it succeeded.
>> + *
>> + * Detecting early that a load is unique avoids dedicating too many cycles and
>> + * resources to bring up the module. In particular, it prevents temporary percpu
>>  * memory exhaustion.
>> + * >> + * Merging same load requests then primarily helps during the boot process. It >> + * can happen that the kernel receives a burst of requests to load the same >> + * module (for example, a same module for each individual CPU) and loading it >> + * eventually fails during its init call. Merging the requests allows that only >> + * one full attempt to load the module is made. >> + * >> + * On a non-error return, it is guaranteed that this load is unique. >> */ >> -static int add_unformed_module(struct module *mod) >> +static struct shared_load_info *add_running_load(const struct load_info *info) >> { >> - int err; >> struct module *old; >> + struct shared_load_info *shared_info; >> >> - mod->state = MODULE_STATE_UNFORMED; >> - >> -again: >> mutex_lock(&module_mutex); >> - old = find_module_all(mod->name, strlen(mod->name), true); >> - if (old != NULL) { >> - if (old->state != MODULE_STATE_LIVE) { >> - /* Wait in case it fails to load. */ >> + >> + /* Search if there is a running load of a module with the same name. */ >> + list_for_each_entry(shared_info, &running_loads, list) >> + if (strcmp(shared_info->name, info->name) == 0) { >> + int err; >> + >> + shared_load_info_get(shared_info); >> mutex_unlock(&module_mutex); >> - err = wait_event_interruptible(module_wq, >> - finished_loading(mod->name)); >> - if (err) >> - goto out_unlocked; >> - goto again; >> + >> + err = wait_for_completion_interruptible( >> + &shared_info->done); > > This would deserve a comment, for examle: > > /* > * Return -EBUSY when the parallel load failed > * from any reason. This load might end up > * another way but we are not going to try. > */ Ok. >> + if (!err) >> + err = shared_info->err ? -EBUSY : -EEXIST; >> + shared_load_info_put(shared_info); >> + shared_info = ERR_PTR(err); >> + goto out_unlocked; >> } >> - err = -EEXIST; >> + >> + /* Search if there is a live module with the given name already. 
*/ >> + old = find_module_all(info->name, strlen(info->name), true); >> + if (old != NULL) { >> + if (old->state == MODULE_STATE_LIVE) { >> + shared_info = ERR_PTR(-EEXIST); >> + goto out; >> + } >> + >> + /* >> + * Any active load always has its record in running_loads and so >> + * would be found above. This applies independent whether such >> + * a module is currently in MODULE_STATE_UNFORMED, >> + * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its >> + * initialization failed. It therefore means this must be an >> + * older going module and the caller should try later once it is >> + * gone. >> + */ >> + WARN_ON(old->state != MODULE_STATE_GOING); >> + shared_info = ERR_PTR(-EBUSY); >> goto out; >> } >> - mod_update_bounds(mod); >> - list_add_rcu(&mod->list, &modules); >> - mod_tree_insert(mod); >> - err = 0; >> + >> + /* The load is unique, make it visible to others. */ >> + shared_info = shared_load_info_alloc(info); >> + if (IS_ERR(shared_info)) >> + goto out; >> + list_add(&shared_info->list, &running_loads); >> >> out: >> mutex_unlock(&module_mutex); >> out_unlocked: >> - return err; >> + return shared_info; >> +} >> + >> +static void finalize_running_load(struct shared_load_info *shared_info, int err) >> +{ >> + /* Inform other duplicate inserts that the load finished. */ >> + mutex_lock(&module_mutex); >> + list_del(&shared_info->list); >> + shared_info->err = err; >> + mutex_unlock(&module_mutex); >> + >> + complete_all(&shared_info->done); >> + shared_load_info_put(shared_info); >> + >> + /* Tell other modules waiting on this one that it completed loading. 
*/ >> + wake_up_interruptible(&module_wq); >> +} >> + >> +static void add_unformed_module(struct module *mod) >> +{ >> + mod->state = MODULE_STATE_UNFORMED; >> + >> + mutex_lock(&module_mutex); >> + mod_update_bounds(mod); >> + list_add_rcu(&mod->list, &modules); >> + mod_tree_insert(mod); >> + mutex_unlock(&module_mutex); >> } >> >> static int complete_formation(struct module *mod, struct load_info *info) > > The comments are more or less about cosmetic problems. I do not see > any real functional problem. I am sorry that I did not mention > them when reviewing v1. > > Feel free to use: > > Reviewed-by: Petr Mladek <pmladek@suse.com> Thanks, Petr
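The pattern in the patch above (the first requester owns the load, duplicates take a reference on a shared record and wait for a completion, the owner publishes the outcome for everyone) can be modeled in userspace. The following is a rough pthreads sketch of that idea, not kernel code: the names mirror the patch, but `pthread_mutex_t`/`pthread_cond_t` stand in for `module_mutex` and `struct completion`, and error handling is elided.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct shared_load_info {
	char name[56];
	int refcnt;			/* protected by module_mutex */
	int err;			/* result published by the owner */
	int done;
	pthread_cond_t cond;
	struct shared_load_info *next;
};

static pthread_mutex_t module_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct shared_load_info *running_loads;

/* Drop one reference; the caller holds module_mutex. */
static void shared_load_info_put(struct shared_load_info *si)
{
	if (--si->refcnt == 0) {
		pthread_cond_destroy(&si->cond);
		free(si);
	}
}

/*
 * First caller for @name becomes the owner (*owner = 1) and must perform
 * the real load; duplicates get *owner = 0 and later wait on the record.
 */
static struct shared_load_info *add_running_load(const char *name, int *owner)
{
	struct shared_load_info *si;

	pthread_mutex_lock(&module_mutex);
	for (si = running_loads; si; si = si->next) {
		if (strcmp(si->name, name) == 0) {
			si->refcnt++;	/* duplicate load request */
			*owner = 0;
			pthread_mutex_unlock(&module_mutex);
			return si;
		}
	}
	si = calloc(1, sizeof(*si));	/* allocation failure elided */
	strncpy(si->name, name, sizeof(si->name) - 1);
	si->refcnt = 1;
	pthread_cond_init(&si->cond, NULL);
	si->next = running_loads;
	running_loads = si;
	*owner = 1;
	pthread_mutex_unlock(&module_mutex);
	return si;
}

/* The owner unlinks the record and publishes the outcome to duplicates. */
static void finalize_running_load(struct shared_load_info *si, int err)
{
	struct shared_load_info **p;

	pthread_mutex_lock(&module_mutex);
	for (p = &running_loads; *p; p = &(*p)->next) {
		if (*p == si) {
			*p = si->next;
			break;
		}
	}
	si->err = err;
	si->done = 1;
	pthread_cond_broadcast(&si->cond);
	shared_load_info_put(si);
	pthread_mutex_unlock(&module_mutex);
}

/* Duplicates wait, then map the owner's result to -EEXIST or -EBUSY. */
static int wait_for_running_load(struct shared_load_info *si)
{
	int err;

	pthread_mutex_lock(&module_mutex);
	while (!si->done)
		pthread_cond_wait(&si->cond, &module_mutex);
	err = si->err ? -EBUSY : -EEXIST;
	shared_load_info_put(si);
	pthread_mutex_unlock(&module_mutex);
	return err;
}
```

As in the patch, a duplicate requester never attempts the full load itself; it only learns whether the winning attempt succeeded (-EEXIST) or failed (-EBUSY).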
On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote:
> The patch does address a regression observed after commit 6e6de3dee51a
> ("kernel/module.c: Only return -EEXIST for modules that have finished
> loading"). I guess it can have a Fixes tag added to the patch.
>
> I think it is hard to split this patch into parts because the implemented
> "optimization" is the fix.

git describe --contains 6e6de3dee51a
v5.3-rc1~38^2~6

I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the
right thing to do, but without it, it still leaves the issue reported
by Prarit Bhargava. We need a way to resolve the issue on stable and
then your optimizations can be applied on top.

Prarit Bhargava, please review Petr's work and see if you can come up
with a sensible way to address this for stable.

  Luis
On 10/18/22 14:33, Luis Chamberlain wrote: > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >> The patch does address a regression observed after commit 6e6de3dee51a >> ("kernel/module.c: Only return -EEXIST for modules that have finished >> loading"). I guess it can have a Fixes tag added to the patch. >> >> I think it is hard to split this patch into parts because the implemented >> "optimization" is the fix. > > git describe --contains 6e6de3dee51a > v5.3-rc1~38^2~6 > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > right thing to do, but without it, it still leaves the issue reported > by Prarit Bhargava. We need a way to resolve the issue on stable and > then your optimizations can be applied on top. > > Prarit Bhargava, please review Petry's work and see if you can come up > with a sensible way to address this for stable. Thanks for the heads up Luis. I'll take a closer look. [A long time ago] I could swear we made a very targeted decision to *NOT* allow modules with the same name to be loaded into the kernel. What's changed that we think this is okay to do today? Thanks, P. > > Luis >
On 10/18/22 14:33, Luis Chamberlain wrote: > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >> The patch does address a regression observed after commit 6e6de3dee51a >> ("kernel/module.c: Only return -EEXIST for modules that have finished >> loading"). I guess it can have a Fixes tag added to the patch. >> >> I think it is hard to split this patch into parts because the implemented >> "optimization" is the fix. > > git describe --contains 6e6de3dee51a > v5.3-rc1~38^2~6 > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > right thing to do, but without it, it still leaves the issue reported > by Prarit Bhargava. We need a way to resolve the issue on stable and > then your optimizations can be applied on top. > > Prarit Bhargava, please review Petry's work and see if you can come up > with a sensible way to address this for stable. I found the original thread surrounding these changes and I do not think this solution should have been committed to the kernel. It is not a good solution to the problem being solved and adds complexity where none is needed IMO. Quoting from the original thread, > > Motivation for this patch is to fix an issue observed on larger machines with > many CPUs where it can take a significant amount of time during boot to run > systemd-udev-trigger.service. An x86-64 system can have already intel_pstate > active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will > attempt to load these modules too. The operation will eventually fail in the > init function of a respective module where it gets recognized that another > cpufreq driver is already loaded and -EEXIST is returned. However, one uevent > is triggered for each CPU and so multiple loads of these modules will be > present. The current code then processes all such loads individually and > serializes them with the barrier in add_unformed_module(). 
>

The way to solve this is not in the module loading code, but in the udev
code by adding a new event or in the userspace which handles the loading
events.

Option 1)

Write/modify a udev rule to use a flock userspace file lock to prevent
repeated loading. The problem with this is that it is still racy and still
consumes CPU time repeatedly loading the ELF header and, depending on the
system (ie a large number of cpus), would still cause a boot delay. This
would be better than what we have and is worth looking at as a simple
solution. I'd like to see boot times with this change, and I'll try to
come up with a measurement on a large CPU system.

Option 2)

Create a new udev action, "add_once", to indicate to userspace that the
module only needs to be loaded one time, and to ignore further load
requests. This is a bit tricky as both kernel space and userspace would
have to be modified. The udev rule would end up looking very similar to
what we have now.

The benefit of option 2 is that driver writers themselves can choose which
drivers should issue "add_once" instead of "add". Drivers that are known
to run on all devices at once would call "add_once" to only issue a single
load.

Thoughts?

P.
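The core primitive behind Option 1 is an advisory flock(2) lock keyed on the module name, so that only one concurrent loader proceeds and the rest skip. A minimal sketch of that primitive follows; the function name and lock path are hypothetical, and wiring this into an actual udev rule (the hard part discussed later in the thread) is not shown.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical Option 1 helper: returns 1 if this caller won the lock and
 * should run modprobe, 0 if a parallel loader already holds it, -1 on
 * error.  The fd must stay open (the lock held) until the modprobe
 * attempt finishes; closing it releases the lock.
 */
static int try_become_loader(const char *lockpath, int *fd_out)
{
	int fd = open(lockpath, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
		close(fd);	/* another loader is active: skip quietly */
		return 0;
	}
	*fd_out = fd;
	return 1;
}
```

Note that flock() locks belong to the open file description, so even two callers in the same process (two separate open() calls) exclude each other, which is what makes the udev-worker dedup idea work.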
On 10/18/22 20:33, Luis Chamberlain wrote:
> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote:
>> The patch does address a regression observed after commit 6e6de3dee51a
>> ("kernel/module.c: Only return -EEXIST for modules that have finished
>> loading"). I guess it can have a Fixes tag added to the patch.
>>
>> I think it is hard to split this patch into parts because the implemented
>> "optimization" is the fix.
>
> git describe --contains 6e6de3dee51a
> v5.3-rc1~38^2~6
>
> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the
> right thing to do, but without it, it still leaves the issue reported
> by Prarit Bhargava. We need a way to resolve the issue on stable and
> then your optimizations can be applied on top.

Simpler could be to do the following:

diff --git a/kernel/module/main.c b/kernel/module/main.c
index d02d39c7174e..0302ac387e93 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
 	sched_annotate_sleep();
 	mutex_lock(&module_mutex);
 	mod = find_module_all(name, strlen(name), true);
-	ret = !mod || mod->state == MODULE_STATE_LIVE;
+	ret = !mod || mod->state == MODULE_STATE_LIVE
+		|| mod->state == MODULE_STATE_GOING;
 	mutex_unlock(&module_mutex);
 
 	return ret;
@@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod)
 	mutex_lock(&module_mutex);
 	old = find_module_all(mod->name, strlen(mod->name), true);
 	if (old != NULL) {
-		if (old->state != MODULE_STATE_LIVE) {
+		if (old->state == MODULE_STATE_COMING
+		    || old->state == MODULE_STATE_UNFORMED) {
 			/* Wait in case it fails to load. */
 			mutex_unlock(&module_mutex);
 			err = wait_event_interruptible(module_wq,
@@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod)
 				goto out_unlocked;
 			goto again;
 		}
-		err = -EEXIST;
+		err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST;
 		goto out;
 	}
 	mod_update_bounds(mod);

This is an alternative approach to fix the issue that 6e6de3dee51a addressed
and it preserves the previous handling of same-module parallel loads.

It works well in practice but a problem is that this previous handling is
somewhat fragile because it requires specific timings. A second load of the
same module returns -EBUSY only if it observes the first load in the going
state.

The following can then happen:
* A first load of module A is requested. It passes add_unformed_module() and
  proceeds with full initialization.
* A second load of module A arrives. It proceeds up to add_unformed_module()
  where it waits on the first module to complete its initialization.
* The first load fails because its init function happens to produce an error.
  The cleanup code in do_init_module() unlinks the module from the modules
  list, frees the module and finally calls wake_up_all(&module_wq).
* The second load gets woken up. It sees that there is no module with the
  same name in the modules list and continues with its full initialization,
  which likely again fails in the init function.

This scenario can be reproduced when one prepares a sample module with
"msleep(1000); return -ENODEV;" in its init function and tries to load it
several times in parallel.

My posted patch essentially brings this handling of parallel loads back but
gains some extra bits because I wanted to prevent the described instability.

However, as mentioned previously, if we can avoid these parallel same-module
load attempts in the first place then that would certainly be the best
option.

Thanks,
Petr
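The reproducer Petr describes (an init function that sleeps to widen the race window and then always fails) could look roughly like the sample module below. This is a sketch only: it has to be built against kernel headers as an out-of-tree module and cannot run standalone, and the module name is made up.

```c
#include <linux/module.h>
#include <linux/delay.h>

/*
 * Sketch of the described reproducer: sleep so that several parallel
 * insmod/modprobe calls overlap in add_unformed_module(), then always
 * fail in init, as in the "msleep(1000); return -ENODEV;" scenario.
 */
static int __init race_demo_init(void)
{
	msleep(1000);
	return -ENODEV;
}

module_init(race_demo_init);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Parallel same-name module load reproducer (sketch)");
```

Loading this several times in parallel (e.g. `for i in $(seq 8); do insmod race_demo.ko & done`) should show each request making its own full, failing initialization attempt on kernels with the fragile behavior.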
On Wed 2022-10-19 14:00:55, Petr Pavlu wrote: > On 10/18/22 20:33, Luis Chamberlain wrote: > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > >> The patch does address a regression observed after commit 6e6de3dee51a > >> ("kernel/module.c: Only return -EEXIST for modules that have finished > >> loading"). I guess it can have a Fixes tag added to the patch. > >> > >> I think it is hard to split this patch into parts because the implemented > >> "optimization" is the fix. > > > > git describe --contains 6e6de3dee51a > > v5.3-rc1~38^2~6 > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > right thing to do, but without it, it still leaves the issue reported > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > then your optimizations can be applied on top. > > Simpler could be to do the following: > > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d02d39c7174e..0302ac387e93 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > sched_annotate_sleep(); > mutex_lock(&module_mutex); > mod = find_module_all(name, strlen(name), true); > - ret = !mod || mod->state == MODULE_STATE_LIVE; > + ret = !mod || mod->state == MODULE_STATE_LIVE > + || mod->state == MODULE_STATE_GOING; > mutex_unlock(&module_mutex); > > return ret; > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > mutex_lock(&module_mutex); > old = find_module_all(mod->name, strlen(mod->name), true); > if (old != NULL) { > - if (old->state != MODULE_STATE_LIVE) { > + if (old->state == MODULE_STATE_COMING > + || old->state == MODULE_STATE_UNFORMED) { > /* Wait in case it fails to load. 
*/ > mutex_unlock(&module_mutex); > err = wait_event_interruptible(module_wq, > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > goto out_unlocked; > goto again; > } > - err = -EEXIST; > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > > goto out; > } > mod_update_bounds(mod); > > This is an alternative approach to fix the issue that 6e6de3dee51a addressed > and it preserves the previous handling of same-module parallel loads. > > It works well in practice but a problem is that this previous handling is > somewhat fragile because it requires specific timings. A second load of a same > module returns EBUSY only if it observes the first load in the going state. > > The following can then happen: > * A first load of module A is requested. It passes add_unformed_module() and > proceeds with full initialization. > * A second load of module A arrives. It proceeds up to add_unformed_module() > where it waits on the first module to complete its initialization. > * The first load fails because its init function happens to produce an error. > The cleanup code in do_init_module() unlinks the module from the modules > list, frees the module and finally calls wake_up_all(&module_wq). > * The second load gets woken up. It sees that there is no module with the same > name in the modules list and continues with its full initialization, which > likely again fails in the init function. Another solution would be to add one more reference counter directly into struct module. The existing counter is about dependencies on the module. It forces the module to stay in MODULE_STATE_LIVE when there is some dependency. The new reference counter would be just about life time of struct module. It should be easier than to add new structure for passing err code. Also it would allow to remove the racy finished_loading(). wait_event_interruptible() could just check mod->state. Best Regards, Petr
On Tue 2022-10-18 15:53:03, Prarit Bhargava wrote: > On 10/18/22 14:33, Luis Chamberlain wrote: > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > > The patch does address a regression observed after commit 6e6de3dee51a > > > ("kernel/module.c: Only return -EEXIST for modules that have finished > > > loading"). I guess it can have a Fixes tag added to the patch. > > > > > > I think it is hard to split this patch into parts because the implemented > > > "optimization" is the fix. > > > > git describe --contains 6e6de3dee51a > > v5.3-rc1~38^2~6 > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > right thing to do, but without it, it still leaves the issue reported > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > then your optimizations can be applied on top. > > > > Prarit Bhargava, please review Petry's work and see if you can come up > > with a sensible way to address this for stable. > > I found the original thread surrounding these changes and I do not think > this solution should have been committed to the kernel. It is not a good > solution to the problem being solved and adds complexity where none is > needed IMO. > > Quoting from the original thread, > > > > > Motivation for this patch is to fix an issue observed on larger machines with > > many CPUs where it can take a significant amount of time during boot to run > > systemd-udev-trigger.service. An x86-64 system can have already intel_pstate > > active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will > > attempt to load these modules too. The operation will eventually fail in the > > init function of a respective module where it gets recognized that another > > cpufreq driver is already loaded and -EEXIST is returned. However, one uevent > > is triggered for each CPU and so multiple loads of these modules will be > > present. 
The current code then processes all such loads individually and > > serializes them with the barrier in add_unformed_module(). > > > > The way to solve this is not in the module loading code, but in the udev > code by adding a new event or in the userspace which handles the loading > events. > > Option 1) > > Write/modify a udev rule to to use a flock userspace file lock to prevent > repeated loading. The problem with this is that it is still racy and still > consumes CPU time repeated load the ELF header and, depending on the system > (ie a large number of cpus) would still cause a boot delay. This would be > better than what we have and is worth looking at as a simple solution. I'd > like to see boot times with this change, and I'll try to come up with a > measurement on a large CPU system. This sounds like a ping-pong between projects where to put the complexity. Honestly, I think that it would be more reliable if we serialized/squashed parallel loads on the kernel side. > Option 2) > > Create a new udev action, "add_once" to indicate to userspace that the > module only needs to be loaded one time, and to ignore further load > requests. This is a bit tricky as both kernel space and userspace would > have be modified. The udev rule would end up looking very similar to what > we now. > > The benefit of option 2 is that driver writers themselves can choose which > drivers should issue "add_once" instead of add. Drivers that are known to > run on all devices at once would call "add_once" to only issue a single > load. My concern is how to distinguish first attempts and later (fixed) attempts. I mean that the module load might fail at boot. But the user might fix the root of the problem and try to load the module once again without reboot. The other load need not be direct. It might be part of another more complex operation. Reload of another module. Restart of a service. It might be problematic to do this an user-friendly way. 
And it might be much more complicated in the end. Best Regards, Petr
On 10/18/22 21:53, Prarit Bhargava wrote: > Quoting from the original thread, > >> >> Motivation for this patch is to fix an issue observed on larger machines with >> many CPUs where it can take a significant amount of time during boot to run >> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >> attempt to load these modules too. The operation will eventually fail in the >> init function of a respective module where it gets recognized that another >> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >> is triggered for each CPU and so multiple loads of these modules will be >> present. The current code then processes all such loads individually and >> serializes them with the barrier in add_unformed_module(). >> > > The way to solve this is not in the module loading code, but in the udev > code by adding a new event or in the userspace which handles the loading > events. > > Option 1) > > Write/modify a udev rule to to use a flock userspace file lock to > prevent repeated loading. The problem with this is that it is still > racy and still consumes CPU time repeated load the ELF header and, > depending on the system (ie a large number of cpus) would still cause a > boot delay. This would be better than what we have and is worth looking > at as a simple solution. I'd like to see boot times with this change, > and I'll try to come up with a measurement on a large CPU system. It is not immediately clear to me how this can be done as a udev rule. You mention that you'll try to test this on a large CPU system. Does it mean that you have a prototype implemented already? If yes, could you please share it? My reading is that one would need to update the "MODALIAS" rule in 80-drivers.rules [1] to do this locking. However, that just collects 'kmod load' (builtin) for udev to execute after all rules are processed. 
It would then be required to synchronize udev workers to prevent repeated loading? > Option 2) > > Create a new udev action, "add_once" to indicate to userspace that the > module only needs to be loaded one time, and to ignore further load > requests. This is a bit tricky as both kernel space and userspace would > have be modified. The udev rule would end up looking very similar to > what we now. > > The benefit of option 2 is that driver writers themselves can choose > which drivers should issue "add_once" instead of add. Drivers that are > known to run on all devices at once would call "add_once" to only issue > a single load. On the device event side, I more wonder if it would be possible to avoid tying up cpufreq and edac modules to individual CPU devices. Maybe their loading could be attached to some platform device, even if it means introducing an auxiliary device for this purpose? I need to look a bit more into this idea. [1] https://github.com/systemd/systemd/blob/4856f63846fc794711e1b8ec970e4c56494cd320/rules.d/80-drivers.rules Thanks, Petr
On 10/20/22 03:19, Petr Mladek wrote: > On Tue 2022-10-18 15:53:03, Prarit Bhargava wrote: >> On 10/18/22 14:33, Luis Chamberlain wrote: >>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>> The patch does address a regression observed after commit 6e6de3dee51a >>>> ("kernel/module.c: Only return -EEXIST for modules that have finished >>>> loading"). I guess it can have a Fixes tag added to the patch. >>>> >>>> I think it is hard to split this patch into parts because the implemented >>>> "optimization" is the fix. >>> >>> git describe --contains 6e6de3dee51a >>> v5.3-rc1~38^2~6 >>> >>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the >>> right thing to do, but without it, it still leaves the issue reported >>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>> then your optimizations can be applied on top. >>> >>> Prarit Bhargava, please review Petry's work and see if you can come up >>> with a sensible way to address this for stable. >> >> I found the original thread surrounding these changes and I do not think >> this solution should have been committed to the kernel. It is not a good >> solution to the problem being solved and adds complexity where none is >> needed IMO. >> >> Quoting from the original thread, >> >>> >>> Motivation for this patch is to fix an issue observed on larger machines with >>> many CPUs where it can take a significant amount of time during boot to run >>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >>> attempt to load these modules too. The operation will eventually fail in the >>> init function of a respective module where it gets recognized that another >>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >>> is triggered for each CPU and so multiple loads of these modules will be >>> present. 
The current code then processes all such loads individually and >>> serializes them with the barrier in add_unformed_module(). >>> >> >> The way to solve this is not in the module loading code, but in the udev >> code by adding a new event or in the userspace which handles the loading >> events. >> >> Option 1) >> >> Write/modify a udev rule to to use a flock userspace file lock to prevent >> repeated loading. The problem with this is that it is still racy and still >> consumes CPU time repeated load the ELF header and, depending on the system >> (ie a large number of cpus) would still cause a boot delay. This would be >> better than what we have and is worth looking at as a simple solution. I'd >> like to see boot times with this change, and I'll try to come up with a >> measurement on a large CPU system. > > This sounds like a ping-pong between projects where to put the > complexity. Honestly, I think that it would be more reliable if > we serialized/squashed parallel loads on the kernel side. > Well, that's the world we live in. Module loading ping pongs between udev and the kernel. > >> Option 2) >> >> Create a new udev action, "add_once" to indicate to userspace that the >> module only needs to be loaded one time, and to ignore further load >> requests. This is a bit tricky as both kernel space and userspace would >> have be modified. The udev rule would end up looking very similar to what >> we now. >> >> The benefit of option 2 is that driver writers themselves can choose which >> drivers should issue "add_once" instead of add. Drivers that are known to >> run on all devices at once would call "add_once" to only issue a single >> load. > > My concern is how to distinguish first attempts and later (fixed) > attempts. > > I mean that the module load might fail at boot. But the user might > fix the root of the problem and try to load the module once again > without reboot. The other load need not be direct. It might be part > of another more complex operation. 
Reload of another module. > Restart of a service. > This is an interesting complication, and something to think about. P. > It might be problematic to do this an user-friendly way. > And it might be much more complicated in the end. > > Best Regards, > Petr >
On 10/24/22 08:37, Petr Pavlu wrote: > On 10/18/22 21:53, Prarit Bhargava wrote: >> Quoting from the original thread, >> >>> >>> Motivation for this patch is to fix an issue observed on larger machines with >>> many CPUs where it can take a significant amount of time during boot to run >>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >>> attempt to load these modules too. The operation will eventually fail in the >>> init function of a respective module where it gets recognized that another >>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >>> is triggered for each CPU and so multiple loads of these modules will be >>> present. The current code then processes all such loads individually and >>> serializes them with the barrier in add_unformed_module(). >>> >> >> The way to solve this is not in the module loading code, but in the udev >> code by adding a new event or in the userspace which handles the loading >> events. >> >> Option 1) >> >> Write/modify a udev rule to to use a flock userspace file lock to >> prevent repeated loading. The problem with this is that it is still >> racy and still consumes CPU time repeated load the ELF header and, >> depending on the system (ie a large number of cpus) would still cause a >> boot delay. This would be better than what we have and is worth looking >> at as a simple solution. I'd like to see boot times with this change, >> and I'll try to come up with a measurement on a large CPU system. > > It is not immediately clear to me how this can be done as a udev rule. You > mention that you'll try to test this on a large CPU system. Does it mean that > you have a prototype implemented already? If yes, could you please share it? 
>

Hi Petr,

Sorry, I haven't had a chance to actually test this out but I see this
problem with the acpi_cpufreq and other multiple-cpu drivers which load
once per logical cpu.

I was thinking of adding a udev rule like:

ACTION!="add", GOTO="acpi_cpufreq_end"

# I may have to add CPU modaliases here to get this to work correctly
ENV{MODALIAS}=="acpi:ACPI0007:", GOTO="acpi_cpufreq_start"
GOTO="acpi_cpufreq_end"

LABEL="acpi_cpufreq_start"
ENV{DELAY_MODALIAS}="$env{MODALIAS}"
ENV{MODALIAS}=""
PROGRAM="/bin/sh -c 'flock -n /tmp/delay_acpi_cpufreq sleep 2'", RESULT=="", RUN{builtin}+="kmod load $env{DELAY_MODALIAS}"
LABEL="acpi_cpufreq_end"

> My reading is that one would need to update the "MODALIAS" rule in
> 80-drivers.rules [1] to do this locking. However, that just collects
> 'kmod load' (builtin) for udev to execute after all rules are processed.
> It would then be required to synchronize udev workers to prevent repeated
> loading?

Yes, that would be the case.

>> Option 2)
>>
>> Create a new udev action, "add_once" to indicate to userspace that the
>> module only needs to be loaded one time, and to ignore further load
>> requests. This is a bit tricky as both kernel space and userspace would
>> have be modified. The udev rule would end up looking very similar to
>> what we now.
>>
>> The benefit of option 2 is that driver writers themselves can choose
>> which drivers should issue "add_once" instead of add. Drivers that are
>> known to run on all devices at once would call "add_once" to only issue
>> a single load.
>
> On the device event side, I more wonder if it would be possible to avoid
> tying up cpufreq and edac modules to individual CPU devices. Maybe their
> loading could be attached to some platform device, even if it means
> introducing an auxiliary device for this purpose? I need to look a bit
> more into this idea.

That's an interesting idea and something I had not considered.
Creating a virtual device and loading against that device would be a much
better (easier?) solution.

P.

>
> [1] https://github.com/systemd/systemd/blob/4856f63846fc794711e1b8ec970e4c56494cd320/rules.d/80-drivers.rules
>
> Thanks,
> Petr
>
On Mon, Oct 24, 2022 at 09:22:35AM -0400, Prarit Bhargava wrote: > On 10/20/22 03:19, Petr Mladek wrote: > > On Tue 2022-10-18 15:53:03, Prarit Bhargava wrote: > > > On 10/18/22 14:33, Luis Chamberlain wrote: > > > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > > > > The patch does address a regression observed after commit 6e6de3dee51a > > > > > ("kernel/module.c: Only return -EEXIST for modules that have finished > > > > > loading"). I guess it can have a Fixes tag added to the patch. > > > > > > > > > > I think it is hard to split this patch into parts because the implemented > > > > > "optimization" is the fix. > > > > > > > > git describe --contains 6e6de3dee51a > > > > v5.3-rc1~38^2~6 > > > > > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > > > right thing to do, but without it, it still leaves the issue reported > > > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > > > then your optimizations can be applied on top. > > > > > > > > Prarit Bhargava, please review Petry's work and see if you can come up > > > > with a sensible way to address this for stable. > > > > > > I found the original thread surrounding these changes and I do not think > > > this solution should have been committed to the kernel. It is not a good > > > solution to the problem being solved and adds complexity where none is > > > needed IMO. > > > > > > Quoting from the original thread, > > > > > > > > > > > Motivation for this patch is to fix an issue observed on larger machines with > > > > many CPUs where it can take a significant amount of time during boot to run > > > > systemd-udev-trigger.service. An x86-64 system can have already intel_pstate > > > > active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will > > > > attempt to load these modules too. 
> > > > The operation will eventually fail in the
> > > > init function of a respective module where it gets recognized that another
> > > > cpufreq driver is already loaded and -EEXIST is returned. However, one uevent
> > > > is triggered for each CPU and so multiple loads of these modules will be
> > > > present. The current code then processes all such loads individually and
> > > > serializes them with the barrier in add_unformed_module().
> > > >
> > >
> > > The way to solve this is not in the module loading code, but in the udev
> > > code by adding a new event or in the userspace which handles the loading
> > > events.
> > >
> > > Option 1)
> > >
> > > Write/modify a udev rule to to use a flock userspace file lock to prevent
> > > repeated loading. The problem with this is that it is still racy and still
> > > consumes CPU time repeated load the ELF header and, depending on the system
> > > (ie a large number of cpus) would still cause a boot delay. This would be
> > > better than what we have and is worth looking at as a simple solution. I'd
> > > like to see boot times with this change, and I'll try to come up with a
> > > measurement on a large CPU system.
> >
> > This sounds like a ping-pong between projects where to put the
> > complexity. Honestly, I think that it would be more reliable if
> > we serialized/squashed parallel loads on the kernel side.
> >
>
> Well, that's the world we live in. Module loading ping pongs between udev
> and the kernel.

You are missing the point. Think of stable first. Upgrading udev is not an
option.

You can think of optimizations later that udev can do, and should perhaps
do, but that's beyond the scope of the fix needed here.

kmod (the library which modprobe now uses) can / probably already has a
lookup for modules prior to issuing a new request. But even then, we cannot
assume all users use kmod (think android). Anything can request a new
module and we should do what is sensible in-kernel.
I'd like to see us think about stable first here. Luis
On Thu, Oct 20, 2022 at 09:03:39AM +0200, Petr Mladek wrote: > On Wed 2022-10-19 14:00:55, Petr Pavlu wrote: > > On 10/18/22 20:33, Luis Chamberlain wrote: > > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > >> The patch does address a regression observed after commit 6e6de3dee51a > > >> ("kernel/module.c: Only return -EEXIST for modules that have finished > > >> loading"). I guess it can have a Fixes tag added to the patch. > > >> > > >> I think it is hard to split this patch into parts because the implemented > > >> "optimization" is the fix. > > > > > > git describe --contains 6e6de3dee51a > > > v5.3-rc1~38^2~6 > > > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > > right thing to do, but without it, it still leaves the issue reported > > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > > then your optimizations can be applied on top. > > > > Simpler could be to do the following: > > > > diff --git a/kernel/module/main.c b/kernel/module/main.c > > index d02d39c7174e..0302ac387e93 100644 > > --- a/kernel/module/main.c > > +++ b/kernel/module/main.c > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > > sched_annotate_sleep(); > > mutex_lock(&module_mutex); > > mod = find_module_all(name, strlen(name), true); > > - ret = !mod || mod->state == MODULE_STATE_LIVE; > > + ret = !mod || mod->state == MODULE_STATE_LIVE > > + || mod->state == MODULE_STATE_GOING; > > mutex_unlock(&module_mutex); > > > > return ret; > > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > > mutex_lock(&module_mutex); > > old = find_module_all(mod->name, strlen(mod->name), true); > > if (old != NULL) { > > - if (old->state != MODULE_STATE_LIVE) { > > + if (old->state == MODULE_STATE_COMING > > + || old->state == MODULE_STATE_UNFORMED) { > > /* Wait in case it fails to load. 
*/ > > mutex_unlock(&module_mutex); > > err = wait_event_interruptible(module_wq, > > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > > goto out_unlocked; > > goto again; > > } > > - err = -EEXIST; > > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > > > > goto out; > > } > > mod_update_bounds(mod); > > > > This is an alternative approach to fix the issue that 6e6de3dee51a addressed > > and it preserves the previous handling of same-module parallel loads. > > > > It works well in practice but a problem is that this previous handling is > > somewhat fragile because it requires specific timings. A second load of a same > > module returns EBUSY only if it observes the first load in the going state. > > > > The following can then happen: > > * A first load of module A is requested. It passes add_unformed_module() and > > proceeds with full initialization. > > * A second load of module A arrives. It proceeds up to add_unformed_module() > > where it waits on the first module to complete its initialization. > > * The first load fails because its init function happens to produce an error. > > The cleanup code in do_init_module() unlinks the module from the modules > > list, frees the module and finally calls wake_up_all(&module_wq). > > * The second load gets woken up. It sees that there is no module with the same > > name in the modules list and continues with its full initialization, which > > likely again fails in the init function. > > Another solution would be to add one more reference counter directly > into struct module. The existing counter is about dependencies on the > module. It forces the module to stay in MODULE_STATE_LIVE when there > is some dependency. The new reference counter would be just about > life time of struct module. > > It should be easier than to add new structure for passing err code. > > Also it would allow to remove the racy finished_loading(). > wait_event_interruptible() could just check mod->state. 
Sounds good, but let us just keep in mind we *first* want a fix for stable, which also fixes 6e6de3dee51a and addresses the fix it intended to have. So I welcome patches, let us first get a small fix in for 6e6de3dee51a and we can optimize away after. Luis
On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: > On 10/18/22 20:33, Luis Chamberlain wrote: > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > >> The patch does address a regression observed after commit 6e6de3dee51a > >> ("kernel/module.c: Only return -EEXIST for modules that have finished > >> loading"). I guess it can have a Fixes tag added to the patch. > >> > >> I think it is hard to split this patch into parts because the implemented > >> "optimization" is the fix. > > > > git describe --contains 6e6de3dee51a > > v5.3-rc1~38^2~6 > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > right thing to do, but without it, it still leaves the issue reported > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > then your optimizations can be applied on top. > > Simpler could be to do the following: > > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d02d39c7174e..0302ac387e93 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > sched_annotate_sleep(); > mutex_lock(&module_mutex); > mod = find_module_all(name, strlen(name), true); > - ret = !mod || mod->state == MODULE_STATE_LIVE; > + ret = !mod || mod->state == MODULE_STATE_LIVE > + || mod->state == MODULE_STATE_GOING; > mutex_unlock(&module_mutex); > > return ret; > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > mutex_lock(&module_mutex); > old = find_module_all(mod->name, strlen(mod->name), true); > if (old != NULL) { > - if (old->state != MODULE_STATE_LIVE) { > + if (old->state == MODULE_STATE_COMING > + || old->state == MODULE_STATE_UNFORMED) { > /* Wait in case it fails to load. 
*/ > mutex_unlock(&module_mutex); > err = wait_event_interruptible(module_wq, > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > goto out_unlocked; > goto again; > } > - err = -EEXIST; > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > goto out; > } > mod_update_bounds(mod); Prarit, can you verify this still does not break the issue you reported? David, does this also fix your issue? Petr, does this solve *any* of your issues? Can you also send a proper patch with a commit log once we get confirmation of the tests. I've been nose diving onto all of these 3 issues now and I have some ideas of how to split this crap better and save even more memory on bootup due to these stupid multiple requests. I want to first solve this regression. Luis
On 10/24/22 16:00, Prarit Bhargava wrote: > On 10/24/22 08:37, Petr Pavlu wrote: >> On 10/18/22 21:53, Prarit Bhargava wrote: >>> Quoting from the original thread, >>> >>>> >>>> Motivation for this patch is to fix an issue observed on larger machines with >>>> many CPUs where it can take a significant amount of time during boot to run >>>> systemd-udev-trigger.service. An x86-64 system can have already intel_pstate >>>> active but as its CPUs can match also acpi_cpufreq and pcc_cpufreq, udev will >>>> attempt to load these modules too. The operation will eventually fail in the >>>> init function of a respective module where it gets recognized that another >>>> cpufreq driver is already loaded and -EEXIST is returned. However, one uevent >>>> is triggered for each CPU and so multiple loads of these modules will be >>>> present. The current code then processes all such loads individually and >>>> serializes them with the barrier in add_unformed_module(). >>>> >>> >>> The way to solve this is not in the module loading code, but in the udev >>> code by adding a new event or in the userspace which handles the loading >>> events. >>> >>> Option 1) >>> >>> Write/modify a udev rule to to use a flock userspace file lock to >>> prevent repeated loading. The problem with this is that it is still >>> racy and still consumes CPU time repeated load the ELF header and, >>> depending on the system (ie a large number of cpus) would still cause a >>> boot delay. This would be better than what we have and is worth looking >>> at as a simple solution. I'd like to see boot times with this change, >>> and I'll try to come up with a measurement on a large CPU system. >> >> It is not immediately clear to me how this can be done as a udev rule. You >> mention that you'll try to test this on a large CPU system. Does it mean that >> you have a prototype implemented already? If yes, could you please share it? 
>> > > Hi Petr, > > Sorry, I haven't had a chance to actually test this out but I see this > problem with the acpi_cpufreq and other multiple-cpu drivers which load > once per logical cpu. I was thinking of adding a udev rule like: > > ACTION!="add", GOTO="acpi_cpufreq_end" > > # I may have to add CPU modaliases here to get this to work correctly > ENV{MODALIAS}=="acpi:ACPI0007:", GOTO="acpi_cpufreq_start" > > GOTO="acpi_cpufreq_start" > GOTO="acpi_cpufreq_end" > > LABEL="acpi_cpufreq_start" > > ENV{DELAY_MODALIAS}="$env{MODALIAS}" > ENV{MODALIAS}="" > PROGRAM="/bin/sh -c 'flock -n /tmp/delay_acpi_cpufreq sleep 2'", > RESULT=="", RUN{builtin}+="kmod load $env{DELAY_MODALIAS}" > > LABEL="acpi_cpufreq_end" Thanks, that is an interesting idea. I think though the artificial delay that it introduces would not be good (if I'm reading the snippet correctly). >>> Option 2) >>> >>> Create a new udev action, "add_once" to indicate to userspace that the >>> module only needs to be loaded one time, and to ignore further load >>> requests. This is a bit tricky as both kernel space and userspace would >>> have to be modified. The udev rule would end up looking very similar to >>> what we have now. >>> >>> The benefit of option 2 is that driver writers themselves can choose >>> which drivers should issue "add_once" instead of add. Drivers that are >>> known to run on all devices at once would call "add_once" to only issue >>> a single load. >> >> On the device event side, I more wonder if it would be possible to avoid tying >> up cpufreq and edac modules to individual CPU devices. Maybe their loading >> could be attached to some platform device, even if it means introducing an >> auxiliary device for this purpose? I need to look a bit more into this idea. > > That's an interesting idea and something I had not considered. Creating > a virtual device and loading against that device would be much better > (easier?) of a solution. Made some progress on this.. 
Both acpi-cpufreq and pcc-cpufreq drivers have their platform firmware interface defined by ACPI. Allowed performance states and parameters must be same for each CPU. Instead of matching these drivers on acpi:ACPI0007: or acpi:LNXCPU: (per-CPU devices), it is possible to check the ACPI namespace early and create a virtual platform device for each of these interfaces if it is available. My test patch is pasted at the end of the email. This looks to work well and has a benefit that no attempt is made to load pcc-cpufreq on machines where PCC is not available, which should be most systems. I think this change is useful independently of whether there will be eventually any improvement on the udev or module loader side. My plan is to send it for review to the cpufreq maintainers. There still remains a problem though that a CPU is typically aliased by other drivers too: # modprobe --show-depends 'cpu:type:x86,ven0000fam0006mod0055:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006A,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0083,0084,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3,00E4,00E6,00E7,00EA,00F0,00F1,00F2,00F3,00F5,00F9,00FA,00FB,00FE,00FF,0100,0101,0102,0103,0104,0111,0120,0121,0123,0125,0126,0127,0128,0129,012A,012C,012D,012E,012F,0130,0131,0132,0133,0134,0137,0138,0139,013C,013E,013F,0140,0141,0142,0143,0160,0161,0162,0163,0164,0165,0171,01C0,01C1,01C2,01C4,01C5,01C6,0203,0204,020B,024A,025A,025B,025C,025D,025F' insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/cryptd.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/crypto_simd.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/aesni-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko insmod 
/lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/sha512-ssse3.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/cryptd.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/ghash-clmulni-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/gf128mul.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/crypto/polyval-generic.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/polyval-clmulni.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crc32c-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crc32-pclmul.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/crypto/crct10dif-pclmul.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/virt/lib/irqbypass.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/kvm/kvm.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/arch/x86/kvm/kvm-intel.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/hwmon/coretemp.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/thermal/intel/intel_powerclamp.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/thermal/intel/x86_pkg_temp_thermal.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/nvdimm/libnvdimm.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/acpi/nfit/nfit.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/edac/skx_edac.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency-common.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/platform/x86/intel/uncore-frequency/intel-uncore-frequency.ko insmod /lib/modules/6.1.0-rc3-default+/kernel/drivers/powercap/intel_rapl_common.ko If any of these modules repeatedly fails to load then this will again delay processing of 'udevadm trigger' during boot when a given system has many CPUs. 
Cheers, Petr diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 0002eecbf870..b6fd14b829be 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -57,6 +57,7 @@ acpi-y += evged.o acpi-y += sysfs.o acpi-y += property.o acpi-$(CONFIG_X86) += acpi_cmos_rtc.o +acpi-$(CONFIG_X86) += acpi_cpufreq.o acpi-$(CONFIG_X86) += x86/apple.o acpi-$(CONFIG_X86) += x86/utils.o acpi-$(CONFIG_X86) += x86/s2idle.o diff --git a/drivers/acpi/acpi_cpufreq.c b/drivers/acpi/acpi_cpufreq.c new file mode 100644 index 000000000000..3eebe58fbe9b --- /dev/null +++ b/drivers/acpi/acpi_cpufreq.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * ACPI support for Processor Performance Control and Processor Clocking + * Control. + */ + +#include <linux/acpi.h> +#include <linux/platform_device.h> + +#include "internal.h" + +static void cpufreq_add_device(struct acpi_device *adev, const char *name) +{ + struct platform_device *pdev; + + pdev = platform_device_register_resndata( + &adev->dev, name, PLATFORM_DEVID_NONE, NULL, 0, NULL, 0); + if (IS_ERR(pdev)) + dev_err(&adev->dev, "%s platform device creation failed: %ld\n", + name, PTR_ERR(pdev)); +} + +static acpi_status +acpi_pct_match(acpi_handle handle, u32 level, void *context, + void **return_value) +{ + bool *pct = context; + + /* Check if the first CPU has _PCT. The data must be same for all. 
*/ + *pct = acpi_has_method(handle, "_PCT"); + return AE_CTRL_TERMINATE; +} + +void __init acpi_cpufreq_init(void) +{ + acpi_status status; + acpi_handle handle; + struct acpi_device *device; + bool pct = false; + + status = acpi_get_handle(NULL, "\\_SB", &handle); + if (ACPI_FAILURE(status)) + return; + + device = acpi_fetch_acpi_dev(handle); + if (device == NULL) + return; + + acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT, + ACPI_UINT32_MAX, acpi_pct_match, NULL, &pct, NULL); + if (pct) + cpufreq_add_device(device, "acpi-cpufreq"); + + if (acpi_has_method(handle, "PCCH")) + cpufreq_add_device(device, "pcc-cpufreq"); +} diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index 219c02df9a08..ab02efeaa192 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -55,8 +55,10 @@ static inline void acpi_dock_add(struct acpi_device *adev) {} #endif #ifdef CONFIG_X86 void acpi_cmos_rtc_init(void); +void acpi_cpufreq_init(void); #else static inline void acpi_cmos_rtc_init(void) {} +static inline void acpi_cpufreq_init(void) {} #endif int acpi_rev_override_setup(char *str); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index b47e93a24a9a..e51cf28fc629 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -2614,6 +2614,7 @@ void __init acpi_scan_init(void) if (!acpi_gbl_reduced_hardware) acpi_bus_scan_fixed(); + acpi_cpufreq_init(); acpi_turn_off_unused_power_resources(); acpi_scan_initialized = true; diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c index 1bb2b90ebb21..1273d42e9ca1 100644 --- a/drivers/cpufreq/acpi-cpufreq.c +++ b/drivers/cpufreq/acpi-cpufreq.c @@ -1056,18 +1056,5 @@ MODULE_PARM_DESC(acpi_pstate_strict, late_initcall(acpi_cpufreq_init); module_exit(acpi_cpufreq_exit); -static const struct x86_cpu_id __maybe_unused acpi_cpufreq_ids[] = { - X86_MATCH_FEATURE(X86_FEATURE_ACPI, NULL), - X86_MATCH_FEATURE(X86_FEATURE_HW_PSTATE, NULL), - {} -}; -MODULE_DEVICE_TABLE(x86cpu, 
acpi_cpufreq_ids); - -static const struct acpi_device_id __maybe_unused processor_device_ids[] = { - {ACPI_PROCESSOR_OBJECT_HID, }, - {ACPI_PROCESSOR_DEVICE_HID, }, - {}, -}; -MODULE_DEVICE_TABLE(acpi, processor_device_ids); - MODULE_ALIAS("acpi"); +MODULE_ALIAS("platform:acpi-cpufreq"); diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c index 9f3fc7a073d0..cc898bc3e156 100644 --- a/drivers/cpufreq/pcc-cpufreq.c +++ b/drivers/cpufreq/pcc-cpufreq.c @@ -616,12 +616,7 @@ static void __exit pcc_cpufreq_exit(void) free_percpu(pcc_cpu_info); } -static const struct acpi_device_id __maybe_unused processor_device_ids[] = { - {ACPI_PROCESSOR_OBJECT_HID, }, - {ACPI_PROCESSOR_DEVICE_HID, }, - {}, -}; -MODULE_DEVICE_TABLE(acpi, processor_device_ids); +MODULE_ALIAS("platform:pcc-cpufreq"); MODULE_AUTHOR("Matthew Garrett, Naga Chumbalkar"); MODULE_VERSION(PCC_VERSION);
On 12.11.22 02:47, Luis Chamberlain wrote: > On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >> On 10/18/22 20:33, Luis Chamberlain wrote: >>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>> The patch does address a regression observed after commit 6e6de3dee51a >>>> ("kernel/module.c: Only return -EEXIST for modules that have finished >>>> loading"). I guess it can have a Fixes tag added to the patch. >>>> >>>> I think it is hard to split this patch into parts because the implemented >>>> "optimization" is the fix. >>> >>> git describe --contains 6e6de3dee51a >>> v5.3-rc1~38^2~6 >>> >>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the >>> right thing to do, but without it, it still leaves the issue reported >>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>> then your optimizations can be applied on top. >> >> Simpler could be to do the following: >> >> diff --git a/kernel/module/main.c b/kernel/module/main.c >> index d02d39c7174e..0302ac387e93 100644 >> --- a/kernel/module/main.c >> +++ b/kernel/module/main.c >> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >> sched_annotate_sleep(); >> mutex_lock(&module_mutex); >> mod = find_module_all(name, strlen(name), true); >> - ret = !mod || mod->state == MODULE_STATE_LIVE; >> + ret = !mod || mod->state == MODULE_STATE_LIVE >> + || mod->state == MODULE_STATE_GOING; >> mutex_unlock(&module_mutex); >> >> return ret; >> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) >> mutex_lock(&module_mutex); >> old = find_module_all(mod->name, strlen(mod->name), true); >> if (old != NULL) { >> - if (old->state != MODULE_STATE_LIVE) { >> + if (old->state == MODULE_STATE_COMING >> + || old->state == MODULE_STATE_UNFORMED) { >> /* Wait in case it fails to load. 
*/ >> mutex_unlock(&module_mutex); >> err = wait_event_interruptible(module_wq, >> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) >> goto out_unlocked; >> goto again; >> } >> - err = -EEXIST; >> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >> goto out; >> } >> mod_update_bounds(mod); > > > Prarit, can you verify this still does not break the issue you reported? > David, does this also fix your issue? I didn't try, but from a quick glimpse I assume no. Allocating module space happens before handling eventual duplicates right now, before a module even is "alive" and in the MODULE_STATE_UNFORMED state. But maybe I am missing something important.
On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: > On 12.11.22 02:47, Luis Chamberlain wrote: > > On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: > > > On 10/18/22 20:33, Luis Chamberlain wrote: > > > > On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > > > > > The patch does address a regression observed after commit 6e6de3dee51a > > > > > ("kernel/module.c: Only return -EEXIST for modules that have finished > > > > > loading"). I guess it can have a Fixes tag added to the patch. > > > > > > > > > > I think it is hard to split this patch into parts because the implemented > > > > > "optimization" is the fix. > > > > > > > > git describe --contains 6e6de3dee51a > > > > v5.3-rc1~38^2~6 > > > > > > > > I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the > > > > right thing to do, but without it, it still leaves the issue reported > > > > by Prarit Bhargava. We need a way to resolve the issue on stable and > > > > then your optimizations can be applied on top. 
> > > > > > Simpler could be to do the following: > > > > > > diff --git a/kernel/module/main.c b/kernel/module/main.c > > > index d02d39c7174e..0302ac387e93 100644 > > > --- a/kernel/module/main.c > > > +++ b/kernel/module/main.c > > > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > > > sched_annotate_sleep(); > > > mutex_lock(&module_mutex); > > > mod = find_module_all(name, strlen(name), true); > > > - ret = !mod || mod->state == MODULE_STATE_LIVE; > > > + ret = !mod || mod->state == MODULE_STATE_LIVE > > > + || mod->state == MODULE_STATE_GOING; > > > mutex_unlock(&module_mutex); > > > return ret; > > > @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) > > > mutex_lock(&module_mutex); > > > old = find_module_all(mod->name, strlen(mod->name), true); > > > if (old != NULL) { > > > - if (old->state != MODULE_STATE_LIVE) { > > > + if (old->state == MODULE_STATE_COMING > > > + || old->state == MODULE_STATE_UNFORMED) { > > > /* Wait in case it fails to load. */ > > > mutex_unlock(&module_mutex); > > > err = wait_event_interruptible(module_wq, > > > @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) > > > goto out_unlocked; > > > goto again; > > > } > > > - err = -EEXIST; > > > + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > > > goto out; > > > } > > > mod_update_bounds(mod); > > > > > > Prarit, can you verify this still does not break the issue you reported? > > David, does this also fix your issue? > > I didn't try, but from a quick glimpse I assume no. Allocating module space > happens before handling eventual duplicates right now, before a module even > is "alive" and in the MODULE_STATE_UNFORMED state. The first two hunks are a revert of commit 6e6de3dee51a and I'm under the impression that caused your issues as *more* module states are allowed through. The last hunk tries to fix what 6e6de3dee51a wanted to do. > But maybe I am missing something important. 
Please do test if you can. Luis
On 14.11.22 16:38, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >> On 12.11.22 02:47, Luis Chamberlain wrote: >>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>> The patch does address a regression observed after commit 6e6de3dee51a >>>>>> ("kernel/module.c: Only return -EEXIST for modules that have finished >>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>> >>>>>> I think it is hard to split this patch into parts because the implemented >>>>>> "optimization" is the fix. >>>>> >>>>> git describe --contains 6e6de3dee51a >>>>> v5.3-rc1~38^2~6 >>>>> >>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would be the >>>>> right thing to do, but without it, it still leaves the issue reported >>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>> then your optimizations can be applied on top. 
>>>> >>>> Simpler could be to do the following: >>>> >>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>> index d02d39c7174e..0302ac387e93 100644 >>>> --- a/kernel/module/main.c >>>> +++ b/kernel/module/main.c >>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>> sched_annotate_sleep(); >>>> mutex_lock(&module_mutex); >>>> mod = find_module_all(name, strlen(name), true); >>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>> + || mod->state == MODULE_STATE_GOING; >>>> mutex_unlock(&module_mutex); >>>> return ret; >>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module *mod) >>>> mutex_lock(&module_mutex); >>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>> if (old != NULL) { >>>> - if (old->state != MODULE_STATE_LIVE) { >>>> + if (old->state == MODULE_STATE_COMING >>>> + || old->state == MODULE_STATE_UNFORMED) { >>>> /* Wait in case it fails to load. */ >>>> mutex_unlock(&module_mutex); >>>> err = wait_event_interruptible(module_wq, >>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module *mod) >>>> goto out_unlocked; >>>> goto again; >>>> } >>>> - err = -EEXIST; >>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>> goto out; >>>> } >>>> mod_update_bounds(mod); >>> >>> >>> Prarit, can you verify this still does not break the issue you reported? >>> David, does this also fix your issue? >> >> I didn't try, but from a quick glimpse I assume no. Allocating module space >> happens before handling eventual duplicates right now, before a module even >> is "alive" and in the MODULE_STATE_UNFORMED state. > > The first two hunks are a revert of commit 6e6de3dee51a and I'm under > the impression that cauased your issues as *more* modules states are > allowed through. > > The last hunk tries to fix what 6e6de3dee51a wanted to do. > Note that I don't think the issue I raised is due to 6e6de3dee51a. 
>> But maybe I am missing something important. > > Please do test if you can. I don't have the machine at hand right now. But, again, I doubt this will fix it. The flow is in load_module(): mod = layout_and_allocate(info, flags); if (IS_ERR(mod)) { ... } audit_log_kern_module(mod->name); /* Reserve our place in the list. */ err = add_unformed_module(mod); if (err) goto free_module; You can have 400 threads in layout_and_allocate() loading the same module at the same time and running out of module space. Any changes to add_unformed_module() and finished_loading() won't change that, because they are not involved before the module space allocations happened.
On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: > Note that I don't think the issue I raised is due to 6e6de3dee51a. > I don't have the machine at hand right now. But, again, I doubt this will > fix it. There are *more* modules processed after that commit. That's all. So testing would be appreciated. Luis
On 11/15/22 14:29, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> I don't have the machine at hand right now. But, again, I doubt this will >> fix it. > > There are *more* modules processed after that commit. That's all. So > testing would be appreciated. > Can anyone tell us if https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ resolves the module loading delay problem? /me is testing now P. > Luis >
On 15.11.22 20:29, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> I don't have the machine at hand right now. But, again, I doubt this will >> fix it. > > There are *more* modules processed after that commit. That's all. So > testing would be appreciated. I'll try to get a grab on a suitable system.
On 15.11.22 20:29, Luis Chamberlain wrote: > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> I don't have the machine at hand right now. But, again, I doubt this will >> fix it. > > There are *more* modules processed after that commit. That's all. So > testing would be appreciated. I just tested that change on top of 6.1.0-rc5+ on that large system and CONFIG_KASAN_INLINE=y. No change. [ 207.955184] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.955891] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.956253] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.956461] systemd-udevd: vmalloc error: size 2486272, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=1-7 [ 207.956573] CPU: 88 PID: 4925 Comm: systemd-udevd Not tainted 6.1.0-rc5+ #4 [ 207.956580] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020 [ 207.956584] Call Trace: [ 207.956588] <TASK> [ 207.956593] vmap allocation for size 2490368 failed: use vmalloc=<size> to increase size [ 207.956593] dump_stack_lvl+0x5b/0x77 [ 207.956613] warn_alloc.cold+0x86/0x195 [ 207.956632] ? zone_watermark_ok_safe+0x2b0/0x2b0 [ 207.956641] ? slab_free_freelist_hook+0x11e/0x1d0 [ 207.956672] ? __get_vm_area_node+0x2a4/0x340 [ 207.956694] __vmalloc_node_range+0xad6/0x11b0 [ 207.956699] ? trace_contention_end+0xda/0x140 [ 207.956715] ? __mutex_lock+0x254/0x1360 [ 207.956740] ? __mutex_unlock_slowpath+0x154/0x600 [ 207.956752] ? bit_wait_io_timeout+0x170/0x170 [ 207.956761] ? vfree_atomic+0xa0/0xa0 [ 207.956775] ? load_module+0x1d8f/0x7ff0 [ 207.956786] module_alloc+0xe7/0x170 [ 207.956802] ? load_module+0x1d8f/0x7ff0 [ 207.956822] load_module+0x1d8f/0x7ff0 [ 207.956876] ? module_frob_arch_sections+0x20/0x20 [ 207.956888] ? 
ima_post_read_file+0x15a/0x180 [ 207.956904] ? ima_read_file+0x140/0x140 [ 207.956918] ? kernel_read+0x5c/0x140 [ 207.956931] ? security_kernel_post_read_file+0x6d/0xb0 [ 207.956950] ? kernel_read_file+0x21d/0x7d0 [ 207.956971] ? __x64_sys_fspick+0x270/0x270 [ 207.956999] ? __do_sys_finit_module+0xfc/0x180 [ 207.957005] __do_sys_finit_module+0xfc/0x180 [ 207.957012] ? __ia32_sys_init_module+0xa0/0xa0 [ 207.957023] ? __seccomp_filter+0x15e/0xc20 [ 207.957066] ? syscall_trace_enter.constprop.0+0x98/0x230 [ 207.957078] do_syscall_64+0x58/0x80 [ 207.957085] ? asm_exc_page_fault+0x22/0x30 [ 207.957095] ? lockdep_hardirqs_on+0x7d/0x100 [ 207.957103] entry_SYSCALL_64_after_hwframe+0x63/0xcd I have access to the system for a couple more days, if there is anything else I should test.
On 11/16/22 17:03, Prarit Bhargava wrote: > On 11/15/22 14:29, Luis Chamberlain wrote: >> On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >>> Note that I don't think the issue I raised is due to 6e6de3dee51a. >>> I don't have the machine at hand right now. But, again, I doubt this will >>> fix it. >> >> There are *more* modules processed after that commit. That's all. So >> testing would be appreciated. >> > > Can anyone tell us if > > https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ > > resolves the module loading delay problem? This patch unfortunately makes no difference on my test system. In my case, the kernel has already intel_pstate loaded when udev starts inserting a burst of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() to immediately return once the check cpufreq_get_current_driver() fails. The code modified by the patch is not reached at all. Petr
On Mon, Nov 21, 2022 at 05:00:49PM +0100, Petr Pavlu wrote: > On 11/16/22 17:03, Prarit Bhargava wrote: > > On 11/15/22 14:29, Luis Chamberlain wrote: > >> On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: > >>> Note that I don't think the issue I raised is due to 6e6de3dee51a. > >>> I don't have the machine at hand right now. But, again, I doubt this will > >>> fix it. > >> > >> There are *more* modules processed after that commit. That's all. So > >> testing would be appreciated. > >> > > > > Can anyone tell us if > > > > https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ > > > > resolves the module loading delay problem? > > This patch unfortunately makes no difference on my test system. In my case, > the kernel has already intel_pstate loaded when udev starts inserting a burst > of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() > to immediately return once the check cpufreq_get_current_driver() fails. The > code modified by the patch is not reached at all. To be clear I don't care about the patch mentioned in the above URL, I care about this: https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com David, was this the one you tested too? Prarit, so you're still left to test; the hope would be that at the very least it still fixes your issue. Petr, you had mentioned in the commit log for your second patch in this thread that your change fixes a regression. What I asked for was a patch that fixes that regression alone first so we can send it to stable. So what issue is that alternative patch fixing? Luis
On 21.11.22 20:03, Luis Chamberlain wrote: > On Mon, Nov 21, 2022 at 05:00:49PM +0100, Petr Pavlu wrote: >> On 11/16/22 17:03, Prarit Bhargava wrote: >>> On 11/15/22 14:29, Luis Chamberlain wrote: >>>> On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: >>>>> Note that I don't think the issue I raised is due to 6e6de3dee51a. >>>>> I don't have the machine at hand right now. But, again, I doubt this will >>>>> fix it. >>>> >>>> There are *more* modules processed after that commit. That's all. So >>>> testing would be appreciated. >>>> >>> >>> Can anyone tell us if >>> >>> https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ >>> >>> resolves the module loading delay problem? >> >> This patch unfortunately makes no difference on my test system. In my case, >> the kernel has already intel_pstate loaded when udev starts inserting a burst >> of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() >> to immediately return once the check cpufreq_get_current_driver() fails. The >> code modified by the patch is not reached at all. > > To be clear I don't care about the patch mentioned in the above URL, I care > about this: > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > > David was this the on you tested too? Yes, that's the one I tried without luck.
On Mon, Nov 21, 2022 at 08:50:11PM +0100, David Hildenbrand wrote: > On 21.11.22 20:03, Luis Chamberlain wrote: > > On Mon, Nov 21, 2022 at 05:00:49PM +0100, Petr Pavlu wrote: > > > On 11/16/22 17:03, Prarit Bhargava wrote: > > > > On 11/15/22 14:29, Luis Chamberlain wrote: > > > > > On Mon, Nov 14, 2022 at 04:45:05PM +0100, David Hildenbrand wrote: > > > > > > Note that I don't think the issue I raised is due to 6e6de3dee51a. > > > > > > I don't have the machine at hand right now. But, again, I doubt this will > > > > > > fix it. > > > > > > > > > > There are *more* modules processed after that commit. That's all. So > > > > > testing would be appreciated. > > > > > > > > > > > > > Can anyone tell us if > > > > > > > > https://lore.kernel.org/linux-pm/20221102195957.82871-1-stuart.w.hayes@gmail.com/ > > > > > > > > resolves the module loading delay problem? > > > > > > This patch unfortunately makes no difference on my test system. In my case, > > > the kernel has already intel_pstate loaded when udev starts inserting a burst > > > of acpi_cpufreq modules. It then causes the init function acpi_cpufreq_init() > > > to immediately return once the check cpufreq_get_current_driver() fails. The > > > code modified by the patch is not reached at all. > > > > To be clear I don't care about the patch mentioned in the above URL, I care > > about this: > > > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > > > > David was this the on you tested too? > > Yes, that's the one I tried without luck. OK thanks. I'm just trying to sort out the regression first before we go out and fix another issue. We will chase that issue down though and I have some other ideas to help with this further too. Luis
On 11/21/22 20:03, Luis Chamberlain wrote: > To be clear I don't care about the patch mentioned in the above URL, I care > about this: > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > [...] > > Petr, you had mentioned in the commit log for your second patch in this > thread that your change fixes a regression. What I asked for was for a > patch that fixes that regression alone first so we can send to stable. > So what issue is that alternative patch fixing? This alternative patch fixes the discussed regression with failing inserts of acpi_cpufreq and pcc_cpufreq delaying boot. As mentioned, the situation can in the worst case prevent udev from loading drivers for other devices and might cause timeouts of services waiting on them and subsequently a failed boot. The patch attempts a different solution for the problem that 6e6de3dee51a was trying to solve. Rather than waiting for the unloading to complete, it returns a different error code (-EBUSY) for modules in the GOING state. This should avoid the error situation that was described in 6e6de3dee51a (user space attempting to load a dependent module because the -EEXIST error code would suggest to user space that the first module had been loaded successfully), while avoiding the delay issue too. Petr
On Tue, Nov 22, 2022 at 02:59:18PM +0100, Petr Pavlu wrote: > On 11/21/22 20:03, Luis Chamberlain wrote: > > To be clear I don't care about the patch mentioned in the above URL, I care > > about this: > > > > https://lkml.kernel.org/r/d0bc50e3-0e42-311b-20ed-7538bb918c5b@suse.com > > [...] > > > > Petr, you had mentioned in the commit log for your second patch in this > > thread that your change fixes a regression. What I asked for was for a > > patch that fixes that regression alone first so we can send to stable. > > So what issue is that alternative patch fixing? > > This alternative patch fixes the discussed regression with failing inserts of > acpi_cpufreq and pcc_cpufreq delaying boot. As mentioned, the situation can in > the worst case prevent udev from loading drivers for other devices and might > cause timeouts of services waiting on them and subsequently a failed boot. > > The patch attempts a different solution for the problem that 6e6de3dee51a was > trying to solve. Rather than waiting for the unloading to complete, it returns > a different error code (-EBUSY) for modules in the GOING state. This should > avoid the error situation that was described in 6e6de3dee51a (user space > attempting to load a dependent module because the -EEXIST error code would > suggest to user space that the first module had been loaded successfully), > while avoiding the delay issue too. Great, can you send a proper patch now with a proper commit log and Cc stable tag? Luis
On 11/14/22 10:45, David Hildenbrand wrote: > On 14.11.22 16:38, Luis Chamberlain wrote: >> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >>> On 12.11.22 02:47, Luis Chamberlain wrote: >>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>>> The patch does address a regression observed after commit >>>>>>> 6e6de3dee51a >>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have >>>>>>> finished >>>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>>> >>>>>>> I think it is hard to split this patch into parts because the >>>>>>> implemented >>>>>>> "optimization" is the fix. >>>>>> >>>>>> git describe --contains 6e6de3dee51a >>>>>> v5.3-rc1~38^2~6 >>>>>> >>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would >>>>>> be the >>>>>> right thing to do, but without it, it still leaves the issue reported >>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>>> then your optimizations can be applied on top. 
>>>>> >>>>> Simpler could be to do the following: >>>>> >>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>>> index d02d39c7174e..0302ac387e93 100644 >>>>> --- a/kernel/module/main.c >>>>> +++ b/kernel/module/main.c >>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>>> sched_annotate_sleep(); >>>>> mutex_lock(&module_mutex); >>>>> mod = find_module_all(name, strlen(name), true); >>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>>> + || mod->state == MODULE_STATE_GOING; >>>>> mutex_unlock(&module_mutex); >>>>> return ret; >>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module >>>>> *mod) >>>>> mutex_lock(&module_mutex); >>>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>>> if (old != NULL) { >>>>> - if (old->state != MODULE_STATE_LIVE) { >>>>> + if (old->state == MODULE_STATE_COMING >>>>> + || old->state == MODULE_STATE_UNFORMED) { >>>>> /* Wait in case it fails to load. */ >>>>> mutex_unlock(&module_mutex); >>>>> err = wait_event_interruptible(module_wq, >>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module >>>>> *mod) >>>>> goto out_unlocked; >>>>> goto again; >>>>> } >>>>> - err = -EEXIST; >>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>>> goto out; >>>>> } >>>>> mod_update_bounds(mod); >>>> >>>> >>>> Prarit, can you verify this still does not break the issue you >>>> reported? >>>> David, does this also fix your issue? >>> >>> I didn't try, but from a quick glimpse I assume no. Allocating module >>> space >>> happens before handling eventual duplicates right now, before a >>> module even >>> is "alive" and in the MODULE_STATE_UNFORMED state. >> >> The first two hunks are a revert of commit 6e6de3dee51a and I'm under >> the impression that cauased your issues as *more* modules states are >> allowed through. >> >> The last hunk tries to fix what 6e6de3dee51a wanted to do. 
>> > > Note that I don't think the issue I raised is due to 6e6de3dee51a. > >>> But maybe I am missing something important. >> >> Please do test if you can. > > I don't have the machine at hand right now. But, again, I doubt this > will fix it. > > > The flow is in load_module(): > > mod = layout_and_allocate(info, flags); > if (IS_ERR(mod)) { > ... > } > > audit_log_kern_module(mod->name); > > /* Reserve our place in the list. */ > err = add_unformed_module(mod); > if (err) > goto free_module; > > > You can have 400 threads in layout_and_allocate() loading the same > module at the same time and running out of module space. Any changes to > add_unformed_module() and finished_loading() won't change that, because > they are not involved before the module space allocations happened. > I'd like to see a refreshed patch but I tested the latest version and see that the boot time is LONGER with the change Before: [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s multi-user.target reached after 15.606s in userspace. After: Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s multi-user.target reached after 23.093s in userspace. Subsequent reboots also indicate that userspace boot time is longer after the change. P.
On 11/28/22 17:29, Prarit Bhargava wrote: > On 11/14/22 10:45, David Hildenbrand wrote: >> On 14.11.22 16:38, Luis Chamberlain wrote: >>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >>>> On 12.11.22 02:47, Luis Chamberlain wrote: >>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>>>> The patch does address a regression observed after commit >>>>>>>> 6e6de3dee51a >>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have >>>>>>>> finished >>>>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>>>> >>>>>>>> I think it is hard to split this patch into parts because the >>>>>>>> implemented >>>>>>>> "optimization" is the fix. >>>>>>> >>>>>>> git describe --contains 6e6de3dee51a >>>>>>> v5.3-rc1~38^2~6 >>>>>>> >>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would >>>>>>> be the >>>>>>> right thing to do, but without it, it still leaves the issue reported >>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>>>> then your optimizations can be applied on top. 
>>>>>> >>>>>> Simpler could be to do the following: >>>>>> >>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>>>> index d02d39c7174e..0302ac387e93 100644 >>>>>> --- a/kernel/module/main.c >>>>>> +++ b/kernel/module/main.c >>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>>>> sched_annotate_sleep(); >>>>>> mutex_lock(&module_mutex); >>>>>> mod = find_module_all(name, strlen(name), true); >>>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>>>> + || mod->state == MODULE_STATE_GOING; >>>>>> mutex_unlock(&module_mutex); >>>>>> return ret; >>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module >>>>>> *mod) >>>>>> mutex_lock(&module_mutex); >>>>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>>>> if (old != NULL) { >>>>>> - if (old->state != MODULE_STATE_LIVE) { >>>>>> + if (old->state == MODULE_STATE_COMING >>>>>> + || old->state == MODULE_STATE_UNFORMED) { >>>>>> /* Wait in case it fails to load. */ >>>>>> mutex_unlock(&module_mutex); >>>>>> err = wait_event_interruptible(module_wq, >>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module >>>>>> *mod) >>>>>> goto out_unlocked; >>>>>> goto again; >>>>>> } >>>>>> - err = -EEXIST; >>>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>>>> goto out; >>>>>> } >>>>>> mod_update_bounds(mod); >>>>> >>>>> >>>>> Prarit, can you verify this still does not break the issue you >>>>> reported? >>>>> David, does this also fix your issue? >>>> >>>> I didn't try, but from a quick glimpse I assume no. Allocating module >>>> space >>>> happens before handling eventual duplicates right now, before a >>>> module even >>>> is "alive" and in the MODULE_STATE_UNFORMED state. >>> >>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under >>> the impression that cauased your issues as *more* modules states are >>> allowed through. 
>>> >>> The last hunk tries to fix what 6e6de3dee51a wanted to do. >>> >> >> Note that I don't think the issue I raised is due to 6e6de3dee51a. >> >>>> But maybe I am missing something important. >>> >>> Please do test if you can. >> >> I don't have the machine at hand right now. But, again, I doubt this >> will fix it. >> >> >> The flow is in load_module(): >> >> mod = layout_and_allocate(info, flags); >> if (IS_ERR(mod)) { >> ... >> } >> >> audit_log_kern_module(mod->name); >> >> /* Reserve our place in the list. */ >> err = add_unformed_module(mod); >> if (err) >> goto free_module; >> >> >> You can have 400 threads in layout_and_allocate() loading the same >> module at the same time and running out of module space. Any changes to >> add_unformed_module() and finished_loading() won't change that, because >> they are not involved before the module space allocations happened. >> > > I'd like to see a refreshed patch but I tested the latest version and > see that the boot time is LONGER with the change > > Before: > > [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze > Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s > (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s > multi-user.target reached after 15.606s in userspace. > > After: > > Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s > (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s > multi-user.target reached after 23.093s in userspace. > > Subsequent reboots also indicate that userspace boot time is longer > after the change. Thanks for testing this patch, that is an interesting result. I see the following dependency chain on my system (openSUSE Tumbleweed): multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service. My understanding is that the udev trigger service only performs the trigger operation but does not actually wait on all devices to be processed by udevd. 
In other words, handling of the forced udev events can still be in progress after multi-user.target is reached. The current serialization of same-name module loads can result in many udev workers sleeping in add_unformed_module(), which at that point reduces the pressure udevd puts on CPU time. I wonder if this then allows other work needed to reach multi-user.target to proceed faster. Could you please boot the machine with 'udev.log_level=debug' and provide me logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel and with the discussed patch? Thanks, Petr
On Tue 2022-11-29 14:13:37, Petr Pavlu wrote: > On 11/28/22 17:29, Prarit Bhargava wrote: > > On 11/14/22 10:45, David Hildenbrand wrote: > >> On 14.11.22 16:38, Luis Chamberlain wrote: > >>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: > >>>> On 12.11.22 02:47, Luis Chamberlain wrote: > >>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: > >>>>>> On 10/18/22 20:33, Luis Chamberlain wrote: > >>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: > >>>>>>>> The patch does address a regression observed after commit > >>>>>>>> 6e6de3dee51a > >>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have > >>>>>>>> finished > >>>>>>>> loading"). I guess it can have a Fixes tag added to the patch. > >>>>>>>> > >>>>>>>> I think it is hard to split this patch into parts because the > >>>>>>>> implemented > >>>>>>>> "optimization" is the fix. > >>>>>>> > >>>>>>> git describe --contains 6e6de3dee51a > >>>>>>> v5.3-rc1~38^2~6 > >>>>>>> > >>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would > >>>>>>> be the > >>>>>>> right thing to do, but without it, it still leaves the issue reported > >>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and > >>>>>>> then your optimizations can be applied on top. 
> >>>>>> > >>>>>> Simpler could be to do the following: > >>>>>> > >>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c > >>>>>> index d02d39c7174e..0302ac387e93 100644 > >>>>>> --- a/kernel/module/main.c > >>>>>> +++ b/kernel/module/main.c > >>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) > >>>>>> sched_annotate_sleep(); > >>>>>> mutex_lock(&module_mutex); > >>>>>> mod = find_module_all(name, strlen(name), true); > >>>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; > >>>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE > >>>>>> + || mod->state == MODULE_STATE_GOING; > >>>>>> mutex_unlock(&module_mutex); > >>>>>> return ret; > >>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module > >>>>>> *mod) > >>>>>> mutex_lock(&module_mutex); > >>>>>> old = find_module_all(mod->name, strlen(mod->name), true); > >>>>>> if (old != NULL) { > >>>>>> - if (old->state != MODULE_STATE_LIVE) { > >>>>>> + if (old->state == MODULE_STATE_COMING > >>>>>> + || old->state == MODULE_STATE_UNFORMED) { > >>>>>> /* Wait in case it fails to load. */ > >>>>>> mutex_unlock(&module_mutex); > >>>>>> err = wait_event_interruptible(module_wq, > >>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module > >>>>>> *mod) > >>>>>> goto out_unlocked; > >>>>>> goto again; > >>>>>> } > >>>>>> - err = -EEXIST; > >>>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; > >>>>>> goto out; > >>>>>> } > >>>>>> mod_update_bounds(mod); > >>>>> > >>>>> > >>>>> Prarit, can you verify this still does not break the issue you > >>>>> reported? > >>>>> David, does this also fix your issue? > >>>> > >>>> I didn't try, but from a quick glimpse I assume no. Allocating module > >>>> space > >>>> happens before handling eventual duplicates right now, before a > >>>> module even > >>>> is "alive" and in the MODULE_STATE_UNFORMED state. 
> >>> > >>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under > >>> the impression that cauased your issues as *more* modules states are > >>> allowed through. > >>> > >>> The last hunk tries to fix what 6e6de3dee51a wanted to do. > >>> > >> > >> Note that I don't think the issue I raised is due to 6e6de3dee51a. > >> > >>>> But maybe I am missing something important. > >>> > >>> Please do test if you can. > >> > >> I don't have the machine at hand right now. But, again, I doubt this > >> will fix it. > >> > >> > >> The flow is in load_module(): > >> > >> mod = layout_and_allocate(info, flags); > >> if (IS_ERR(mod)) { > >> ... > >> } > >> > >> audit_log_kern_module(mod->name); > >> > >> /* Reserve our place in the list. */ > >> err = add_unformed_module(mod); > >> if (err) > >> goto free_module; > >> > >> > >> You can have 400 threads in layout_and_allocate() loading the same > >> module at the same time and running out of module space. Any changes to > >> add_unformed_module() and finished_loading() won't change that, because > >> they are not involved before the module space allocations happened. > >> > > > > I'd like to see a refreshed patch but I tested the latest version and > > see that the boot time is LONGER with the change > > > > Before: > > > > [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze > > Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s > > (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s > > multi-user.target reached after 15.606s in userspace. > > > > After: > > > > Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s > > (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s > > multi-user.target reached after 23.093s in userspace. > > > > Subsequent reboots also indicate that userspace boot time is longer > > after the change. > > Thanks for testing this patch, that is an interesting result. 
> > I see the following dependency chain on my system (openSUSE Tumbleweed): > multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service. > > My understanding is that the udev trigger service only performs the trigger > operation but does not actually wait on all devices to be processed by udevd. > In other words, handling of the forced udev events can still be in progress > after multi-user.target is reached. > > The current serialization of same-name module loads can result in many udev > workers sleeping in add_unformed_module() and hence creating at that point > less pressure on the CPU time from udevd. I wonder if this then maybe allows > other work needed to reach multi-user.target to proceed faster. Interesting theory. Another idea: the module loader now returns -EBUSY. I could imagine that userspace might handle this return value in a special way and try to load the module once again. It might make sense to test it with the updated version of the patch. This one is racy; it returns -EBUSY only when the waiting module loader sees the failed module loader in the GOING state. > Could you please boot the machine with 'udev.log_level=debug' and provide me > logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel > and with the discussed patch? I am curious what happens here. Best Regards, Petr
On 11/29/22 08:13, Petr Pavlu wrote: > On 11/28/22 17:29, Prarit Bhargava wrote: >> On 11/14/22 10:45, David Hildenbrand wrote: >>> On 14.11.22 16:38, Luis Chamberlain wrote: >>>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote: >>>>> On 12.11.22 02:47, Luis Chamberlain wrote: >>>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote: >>>>>>> On 10/18/22 20:33, Luis Chamberlain wrote: >>>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote: >>>>>>>>> The patch does address a regression observed after commit >>>>>>>>> 6e6de3dee51a >>>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have >>>>>>>>> finished >>>>>>>>> loading"). I guess it can have a Fixes tag added to the patch. >>>>>>>>> >>>>>>>>> I think it is hard to split this patch into parts because the >>>>>>>>> implemented >>>>>>>>> "optimization" is the fix. >>>>>>>> >>>>>>>> git describe --contains 6e6de3dee51a >>>>>>>> v5.3-rc1~38^2~6 >>>>>>>> >>>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would >>>>>>>> be the >>>>>>>> right thing to do, but without it, it still leaves the issue reported >>>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and >>>>>>>> then your optimizations can be applied on top. 
>>>>>>> >>>>>>> Simpler could be to do the following: >>>>>>> >>>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c >>>>>>> index d02d39c7174e..0302ac387e93 100644 >>>>>>> --- a/kernel/module/main.c >>>>>>> +++ b/kernel/module/main.c >>>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name) >>>>>>> sched_annotate_sleep(); >>>>>>> mutex_lock(&module_mutex); >>>>>>> mod = find_module_all(name, strlen(name), true); >>>>>>> - ret = !mod || mod->state == MODULE_STATE_LIVE; >>>>>>> + ret = !mod || mod->state == MODULE_STATE_LIVE >>>>>>> + || mod->state == MODULE_STATE_GOING; >>>>>>> mutex_unlock(&module_mutex); >>>>>>> return ret; >>>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module >>>>>>> *mod) >>>>>>> mutex_lock(&module_mutex); >>>>>>> old = find_module_all(mod->name, strlen(mod->name), true); >>>>>>> if (old != NULL) { >>>>>>> - if (old->state != MODULE_STATE_LIVE) { >>>>>>> + if (old->state == MODULE_STATE_COMING >>>>>>> + || old->state == MODULE_STATE_UNFORMED) { >>>>>>> /* Wait in case it fails to load. */ >>>>>>> mutex_unlock(&module_mutex); >>>>>>> err = wait_event_interruptible(module_wq, >>>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module >>>>>>> *mod) >>>>>>> goto out_unlocked; >>>>>>> goto again; >>>>>>> } >>>>>>> - err = -EEXIST; >>>>>>> + err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST; >>>>>>> goto out; >>>>>>> } >>>>>>> mod_update_bounds(mod); >>>>>> >>>>>> >>>>>> Prarit, can you verify this still does not break the issue you >>>>>> reported? >>>>>> David, does this also fix your issue? >>>>> >>>>> I didn't try, but from a quick glimpse I assume no. Allocating module >>>>> space >>>>> happens before handling eventual duplicates right now, before a >>>>> module even >>>>> is "alive" and in the MODULE_STATE_UNFORMED state. 
>>>> >>>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under >>>> the impression that cauased your issues as *more* modules states are >>>> allowed through. >>>> >>>> The last hunk tries to fix what 6e6de3dee51a wanted to do. >>>> >>> >>> Note that I don't think the issue I raised is due to 6e6de3dee51a. >>> >>>>> But maybe I am missing something important. >>>> >>>> Please do test if you can. >>> >>> I don't have the machine at hand right now. But, again, I doubt this >>> will fix it. >>> >>> >>> The flow is in load_module(): >>> >>> mod = layout_and_allocate(info, flags); >>> if (IS_ERR(mod)) { >>> ... >>> } >>> >>> audit_log_kern_module(mod->name); >>> >>> /* Reserve our place in the list. */ >>> err = add_unformed_module(mod); >>> if (err) >>> goto free_module; >>> >>> >>> You can have 400 threads in layout_and_allocate() loading the same >>> module at the same time and running out of module space. Any changes to >>> add_unformed_module() and finished_loading() won't change that, because >>> they are not involved before the module space allocations happened. >>> >> >> I'd like to see a refreshed patch but I tested the latest version and >> see that the boot time is LONGER with the change >> >> Before: >> >> [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze >> Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s >> (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s >> multi-user.target reached after 15.606s in userspace. >> >> After: >> >> Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s >> (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s >> multi-user.target reached after 23.093s in userspace. >> >> Subsequent reboots also indicate that userspace boot time is longer >> after the change. > > Thanks for testing this patch, that is an interesting result. 
> > I see the following dependency chain on my system (openSUSE Tumbleweed): > multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service. > > My understanding is that the udev trigger service only performs the trigger > operation but does not actually wait on all devices to be processed by udevd. > In other words, handling of the forced udev events can still be in progress > after multi-user.target is reached. > > The current serialization of same-name module loads can result in many udev > workers sleeping in add_unformed_module() and hence creating at that point > less pressure on the CPU time from udevd. I wonder if this then maybe allows > other work needed to reach multi-user.target to proceed faster. > > Could you please boot the machine with 'udev.log_level=debug' and provide me > logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel > and with the discussed patch? Petr, I haven't missed your request. I'm waiting for the system to become free (I'm running a week long test on it). Hopefully I can get this data to you tomorrow AM. My apologies for the wait, P. > > Thanks, > Petr >
> >> Could you please boot the machine with 'udev.log_level=debug' and provide me >> logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel >> and with the discussed patch? > Petr, I tried to in-line the logs; however, the email bounced due to its size. I know this isn't a preferred method of passing information on LKML and other lists, but here are links to the logs: https://people.redhat.com/prarit/4petr/ Both outputs were done with, as requested, 'journalctl -b -o short-monotonic'. vanilla.log is the kernel booted with 'udev.log_level=debug' with-changeset.log is kernel + patch booted with 'udev.log_level=debug' P.
On 12/6/22 13:31, Prarit Bhargava wrote: > >> >>> Could you please boot the machine with 'udev.log_level=debug' and provide me >>> logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel >>> and with the discussed patch? >> > > Petr, I tried to in-line the logs however the email bounced due to its size. > > I know this isn't a preferred method of passing information on LKML and > other lists, but here are links to the logs: > > https://people.redhat.com/prarit/4petr/ > > Both outputs were done with, as requested, 'journalctl -b -o > short-monotonic'. > > vanilla.log is kernel booted with 'udev.log_level=debug' > with-changeset.log is kernel + patch booted with 'udev.log_level=debug' Thanks Prarit for re-testing the patch. Both logs in this case actually show similar startup times. Vanilla: [ 68.108176] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd[1]: Startup finished in 55.874s (firmware) + 25.646s (loader) + 35.793s (kernel) + 7.845s (initrd) + 24.469s (userspace) = 2min 29.629s. With the patch: [ 68.064826] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd[1]: Startup finished in 54.153s (firmware) + 19.947s (loader) + 35.965s (kernel) + 9.449s (initrd) + 22.650s (userspace) = 2min 22.165s. The system has 192 CPUs. The vanilla case shows 144x inserts of acpi_cpufreq and 43x of pcc_cpufreq: acpi_cpufreq (the first and last recorded insert): [ 47.485621] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1871]: Inserted module 'acpi_cpufreq' [...] [ 53.052401] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1914]: Inserted module 'acpi_cpufreq' pcc_cpufreq: [ 47.515221] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1871]: Inserted module 'pcc_cpufreq' [...] [ 53.067917] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[2040]: Inserted module 'pcc_cpufreq' Processing inserts of all CPU frequency modules took at least ~5.5 seconds. 
It was likely more, because not all inserts are recorded in the log; udevd messages appear to be missing from ~53.1s onward. With the patch, an insert of each module is attempted 192x: acpi_cpufreq: [ 50.107403] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1817]: Failed to insert module 'acpi_cpufreq': Device or resource busy [...] [ 50.438755] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[2016]: Inserted module 'acpi_cpufreq' pcc_cpufreq: [ 50.110731] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[1849]: Failed to insert module 'pcc_cpufreq': Device or resource busy [...] [ 50.579249] intel-eaglestream-spr-15.khw3.lab.eng.bos.redhat.com systemd-udevd[2016]: Inserted module 'pcc_cpufreq' This shows that the patch reduced the sequence to ~0.5 seconds and that its logic appears to be working as intended. Thanks, Petr
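For anyone wanting to repeat this comparison, here is a rough helper to measure the insert window for one module from a 'journalctl -b -o short-monotonic' dump. The log format ("[ <seconds>] host systemd-udevd[pid]: Inserted module ...") is assumed from the excerpts above; adjust the patterns if your journalctl output differs.

```shell
# Print the time span (seconds) between the first and last insert
# attempt of a given module in a short-monotonic journalctl log.
insert_window() {
    # $1 = log file, $2 = module name
    awk -v mod="$2" '
        /Inserted module|Failed to insert module/ && index($0, mod) {
            t = $0
            sub(/^\[ */, "", t)   # strip the leading "[   "
            sub(/\].*/, "", t)    # keep only the monotonic timestamp
            if (first == "") first = t
            last = t
        }
        END { if (first != "") printf "%.3f\n", last - first }
    ' "$1"
}
```

Running e.g. `insert_window vanilla.log acpi_cpufreq` and `insert_window with-changeset.log acpi_cpufreq` should reproduce the ~5.5s vs ~0.5s figures quoted above (modulo the missing udevd messages).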
diff --git a/kernel/module/main.c b/kernel/module/main.c index 06cb41c3d2a1..fb97d7a0cae9 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -61,14 +61,29 @@ /* * Mutex protects: - * 1) List of modules (also safely readable with preempt_disable), + * 1) list of modules (also safely readable with preempt_disable, delete and add + * uses RCU list operations). * 2) module_use links, - * 3) mod_tree.addr_min/mod_tree.addr_max. - * (delete and add uses RCU list operations). + * 3) mod_tree.addr_min/mod_tree.addr_max, + * 4) list of unloaded_tainted_modules. + * 5) list of running_loads. */ DEFINE_MUTEX(module_mutex); LIST_HEAD(modules); +/* Shared information to track duplicate module loads. */ +struct shared_load_info { + char name[MODULE_NAME_LEN]; + refcount_t refcnt; + struct list_head list; + struct completion done; + int err; +}; +static LIST_HEAD(running_loads); + +/* Waiting for a module to finish loading? */ +static DECLARE_WAIT_QUEUE_HEAD(module_wq); + /* Work queue for freeing init sections in success case */ static void do_free_init(struct work_struct *w); static DECLARE_WORK(init_free_wq, do_free_init); @@ -122,9 +137,6 @@ static void mod_update_bounds(struct module *mod) int modules_disabled; core_param(nomodule, modules_disabled, bint, 0); -/* Waiting for a module to finish initializing? 
*/ -static DECLARE_WAIT_QUEUE_HEAD(module_wq); - static BLOCKING_NOTIFIER_HEAD(module_notify_list); int register_module_notifier(struct notifier_block *nb) @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user, strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints)); free_module(mod); - /* someone could wait for the module in add_unformed_module() */ - wake_up_interruptible(&module_wq); return 0; out: mutex_unlock(&module_mutex); @@ -2374,26 +2384,6 @@ static int post_relocation(struct module *mod, const struct load_info *info) return module_finalize(info->hdr, info->sechdrs, mod); } -/* Is this module of this name done loading? No locks held. */ -static bool finished_loading(const char *name) -{ - struct module *mod; - bool ret; - - /* - * The module_mutex should not be a heavily contended lock; - * if we get the occasional sleep here, we'll go an extra iteration - * in the wait_event_interruptible(), which is harmless. - */ - sched_annotate_sleep(); - mutex_lock(&module_mutex); - mod = find_module_all(name, strlen(name), true); - ret = !mod || mod->state == MODULE_STATE_LIVE; - mutex_unlock(&module_mutex); - - return ret; -} - /* Call module constructors. 
*/ static void do_mod_ctors(struct module *mod) { @@ -2524,7 +2514,6 @@ static noinline int do_init_module(struct module *mod) schedule_work(&init_free_wq); mutex_unlock(&module_mutex); - wake_up_interruptible(&module_wq); return 0; @@ -2540,7 +2529,6 @@ static noinline int do_init_module(struct module *mod) klp_module_going(mod); ftrace_release_mod(mod); free_module(mod); - wake_up_interruptible(&module_wq); return ret; } @@ -2552,43 +2540,133 @@ static int may_init_module(void) return 0; } +static struct shared_load_info * +shared_load_info_alloc(const struct load_info *info) +{ + struct shared_load_info *shared_info = + kzalloc(sizeof(*shared_info), GFP_KERNEL); + if (shared_info == NULL) + return ERR_PTR(-ENOMEM); + + strscpy(shared_info->name, info->name, sizeof(shared_info->name)); + refcount_set(&shared_info->refcnt, 1); + INIT_LIST_HEAD(&shared_info->list); + init_completion(&shared_info->done); + return shared_info; +} + +static void shared_load_info_get(struct shared_load_info *shared_info) +{ + refcount_inc(&shared_info->refcnt); +} + +static void shared_load_info_put(struct shared_load_info *shared_info) +{ + if (refcount_dec_and_test(&shared_info->refcnt)) + kfree(shared_info); +} + /* - * We try to place it in the list now to make sure it's unique before - * we dedicate too many resources. In particular, temporary percpu + * Check that a module load is unique and make it visible to others. The code + * looks for parallel running inserts and already loaded modules. Two inserts + * are considered equivalent if their module name matches. In case this load + * duplicates another running insert, the code waits for its completion and + * then returns -EEXIST or -EBUSY depending on whether it succeeded. + * + * Detecting early that a load is unique avoids dedicating too many cycles and + * resources to bring up the module. In particular, it prevents temporary percpu * memory exhaustion. 
+ * + * Merging same load requests then primarily helps during the boot process. It + * can happen that the kernel receives a burst of requests to load the same + * module (for example, a same module for each individual CPU) and loading it + * eventually fails during its init call. Merging the requests allows that only + * one full attempt to load the module is made. + * + * On a non-error return, it is guaranteed that this load is unique. */ -static int add_unformed_module(struct module *mod) +static struct shared_load_info *add_running_load(const struct load_info *info) { - int err; struct module *old; + struct shared_load_info *shared_info; - mod->state = MODULE_STATE_UNFORMED; - -again: mutex_lock(&module_mutex); - old = find_module_all(mod->name, strlen(mod->name), true); - if (old != NULL) { - if (old->state != MODULE_STATE_LIVE) { - /* Wait in case it fails to load. */ + + /* Search if there is a running load of a module with the same name. */ + list_for_each_entry(shared_info, &running_loads, list) + if (strcmp(shared_info->name, info->name) == 0) { + int err; + + shared_load_info_get(shared_info); mutex_unlock(&module_mutex); - err = wait_event_interruptible(module_wq, - finished_loading(mod->name)); - if (err) - goto out_unlocked; - goto again; + + err = wait_for_completion_interruptible( + &shared_info->done); + if (!err) + err = shared_info->err ? -EBUSY : -EEXIST; + shared_load_info_put(shared_info); + shared_info = ERR_PTR(err); + goto out_unlocked; } - err = -EEXIST; + + /* Search if there is a live module with the given name already. */ + old = find_module_all(info->name, strlen(info->name), true); + if (old != NULL) { + if (old->state == MODULE_STATE_LIVE) { + shared_info = ERR_PTR(-EEXIST); + goto out; + } + + /* + * Any active load always has its record in running_loads and so + * would be found above. 
This applies independent whether such + * a module is currently in MODULE_STATE_UNFORMED, + * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its + * initialization failed. It therefore means this must be an + * older going module and the caller should try later once it is + * gone. + */ + WARN_ON(old->state != MODULE_STATE_GOING); + shared_info = ERR_PTR(-EBUSY); goto out; } - mod_update_bounds(mod); - list_add_rcu(&mod->list, &modules); - mod_tree_insert(mod); - err = 0; + + /* The load is unique, make it visible to others. */ + shared_info = shared_load_info_alloc(info); + if (IS_ERR(shared_info)) + goto out; + list_add(&shared_info->list, &running_loads); out: mutex_unlock(&module_mutex); out_unlocked: - return err; + return shared_info; +} + +static void finalize_running_load(struct shared_load_info *shared_info, int err) +{ + /* Inform other duplicate inserts that the load finished. */ + mutex_lock(&module_mutex); + list_del(&shared_info->list); + shared_info->err = err; + mutex_unlock(&module_mutex); + + complete_all(&shared_info->done); + shared_load_info_put(shared_info); + + /* Tell other modules waiting on this one that it completed loading. */ + wake_up_interruptible(&module_wq); +} + +static void add_unformed_module(struct module *mod) +{ + mod->state = MODULE_STATE_UNFORMED; + + mutex_lock(&module_mutex); + mod_update_bounds(mod); + list_add_rcu(&mod->list, &modules); + mod_tree_insert(mod); + mutex_unlock(&module_mutex); } static int complete_formation(struct module *mod, struct load_info *info) @@ -2674,6 +2752,7 @@ static void cfi_init(struct module *mod); static int load_module(struct load_info *info, const char __user *uargs, int flags) { + struct shared_load_info *shared_info; struct module *mod; long err = 0; char *after_dashes; @@ -2711,38 +2790,43 @@ static int load_module(struct load_info *info, const char __user *uargs, goto free_copy; /* - * Now that we know we have the correct module name, check - * if it's blacklisted. 
+ * Now that we know we have the correct module name, check if there is + * another load of the same name in progress. */ + shared_info = add_running_load(info); + if (IS_ERR(shared_info)) { + err = PTR_ERR(shared_info); + goto free_copy; + } + + /* Check if the module is blacklisted. */ if (blacklisted(info->name)) { err = -EPERM; pr_err("Module %s is blacklisted\n", info->name); - goto free_copy; + goto free_shared; } err = rewrite_section_headers(info, flags); if (err) - goto free_copy; + goto free_shared; /* Check module struct version now, before we try to use module. */ if (!check_modstruct_version(info, info->mod)) { err = -ENOEXEC; - goto free_copy; + goto free_shared; } /* Figure out module layout, and allocate all the memory. */ mod = layout_and_allocate(info, flags); if (IS_ERR(mod)) { err = PTR_ERR(mod); - goto free_copy; + goto free_shared; } audit_log_kern_module(mod->name); /* Reserve our place in the list. */ - err = add_unformed_module(mod); - if (err) - goto free_module; + add_unformed_module(mod); #ifdef CONFIG_MODULE_SIG mod->sig_ok = info->sig_ok; @@ -2852,7 +2936,9 @@ static int load_module(struct load_info *info, const char __user *uargs, /* Done! */ trace_module_load(mod); - return do_init_module(mod); + err = do_init_module(mod); + finalize_running_load(shared_info, err); + return err; sysfs_cleanup: mod_sysfs_teardown(mod); @@ -2886,15 +2972,15 @@ static int load_module(struct load_info *info, const char __user *uargs, /* Unlink carefully: kallsyms could be walking list. */ list_del_rcu(&mod->list); mod_tree_remove(mod); - wake_up_interruptible(&module_wq); /* Wait for RCU-sched synchronizing before releasing mod->list. 
*/ synchronize_rcu(); mutex_unlock(&module_mutex); - free_module: /* Free lock-classes; relies on the preceding sync_rcu() */ lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size); module_deallocate(mod, info); + free_shared: + finalize_running_load(shared_info, err); free_copy: free_copy(info, flags); return err;
During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded, which causes the init function of the new module to return an error. The module loader currently serializes all such requests, with the barrier in add_unformed_module(). This creates a lot of unnecessary work and delays the boot. This patch improves the behavior as follows: * A check whether a module load matches an already loaded module is moved right after the module name is determined. -EEXIST continues to be returned if the module exists and is live, -EBUSY is returned if a same-name module is going. * A new reference-counted shared_load_info structure is introduced to keep track of duplicate load requests. Two loads are considered equivalent if their module name matches. In case a load duplicates another running insert, the code waits for its completion and then returns -EEXIST or -EBUSY depending on whether it succeeded. Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel already did merge some same-name load requests, but it was more by accident and relied on specific timing. The patch brings this behavior back in a more explicit form. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> --- kernel/module/main.c | 214 ++++++++++++++++++++++++++++++------------- 1 file changed, 150 insertions(+), 64 deletions(-)
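The merging pattern the patch implements can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel code: pthreads stand in for module_mutex, struct completion and refcount_t, and the names (pending_load, start_load, finish_load) are illustrative. The first caller for a name registers a record and proceeds; duplicates block until the winner finishes and then get -EEXIST or -EBUSY, mirroring add_running_load()/finalize_running_load() above.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One record per in-flight load, kept on a global list (the patch calls
 * this shared_load_info and keeps it on running_loads). */
struct pending_load {
    char name[64];
    int refcnt;              /* both counters protected by 'lock' */
    int done;                /* completion flag */
    int err;                 /* result of the winning load */
    pthread_cond_t cond;     /* stands in for struct completion */
    struct pending_load *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct pending_load *pending;

/* Returns a record the caller must later pass to finish_load(), or NULL
 * with *errp set if this request merged with an already-running load. */
static struct pending_load *start_load(const char *name, int *errp)
{
    struct pending_load *p;

    pthread_mutex_lock(&lock);
    for (p = pending; p; p = p->next) {
        if (strcmp(p->name, name) != 0)
            continue;
        /* Duplicate: take a reference and wait for the winner. */
        p->refcnt++;
        while (!p->done)
            pthread_cond_wait(&p->cond, &lock);
        *errp = p->err ? -EBUSY : -EEXIST;
        if (--p->refcnt == 0) {
            pthread_cond_destroy(&p->cond);
            free(p);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }
    /* Unique: register the load so later callers can merge with it. */
    p = calloc(1, sizeof(*p));
    if (!p) {
        *errp = -ENOMEM;
        pthread_mutex_unlock(&lock);
        return NULL;
    }
    snprintf(p->name, sizeof(p->name), "%s", name);
    p->refcnt = 1;
    pthread_cond_init(&p->cond, NULL);
    p->next = pending;
    pending = p;
    pthread_mutex_unlock(&lock);
    return p;
}

/* Unlink the record first so new callers start a fresh load, then wake
 * all merged duplicates with the result. */
static void finish_load(struct pending_load *p, int err)
{
    struct pending_load **pp;

    pthread_mutex_lock(&lock);
    for (pp = &pending; *pp; pp = &(*pp)->next) {
        if (*pp == p) {
            *pp = p->next;
            break;
        }
    }
    p->err = err;
    p->done = 1;
    pthread_cond_broadcast(&p->cond);
    if (--p->refcnt == 0) {
        pthread_cond_destroy(&p->cond);
        free(p);
    }
    pthread_mutex_unlock(&lock);
}
```

Note the ordering in finish_load(): the record is removed from the list before the completion fires, so a caller arriving after the failure retries the load from scratch instead of inheriting a stale result — the same reason the patch takes module_mutex in finalize_running_load() before complete_all().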