common/config: use modprobe -w when supported

Message ID	20241205002624.3420504-1-mcgrof@kernel.org (mailing list archive)
State	New
Headers	show Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9ADF51C36; Thu, 5 Dec 2024 00:26:26 +0000 (UTC) From: Luis Chamberlain <mcgrof@kernel.org> To: patches@lists.linux.dev, fstests@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, gost.dev@samsung.com, sandeen@redhat.com, mcgrof@kernel.org Subject: [PATCH] common/config: use modprobe -w when supported Date: Wed, 4 Dec 2024 16:26:24 -0800 Message-ID: <20241205002624.3420504-1-mcgrof@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: Luis Chamberlain <mcgrof@infradead.org>
Series	common/config: use modprobe -w when supported \| expand common/config: use modprobe -w when supported

Luis Chamberlain Dec. 5, 2024, 12:26 a.m. UTC

We had added support for open coding a patient module remover long
ago on fstests through commit d405c21d40aa1 ("common/module: add patient
module rmmod support") to fix many flaky tests. This assumed we'd end up
with modprobe -p -t <msec-timeout> but in the end kmod upstream just
used modprobe -w <msec-timeout> through the respective kmod commit
2b98ed888614 ("modprobe: Add --wait").

Take advantage of the upstream patient module remover support added
since June 2022, so many distributions should already have support for
this now.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---

Eric, I saw you mentioning on IRC you didn't understand *why*
the patient module remover was added. Even though I thought the
commit log explained it, let me summarize again: fix tons of
flaky tests which assume module removal is being done correctly.
It is not and fixing this is a module specific issue like with
scsi_debug as documented in the commit log bugzilla references.
So any sane test suite thing relying on module removal should use
something like modprobe -w <timeout-in-ms>.

  Luis

 common/config | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Luis Chamberlain Dec. 5, 2024, 3:48 a.m. UTC | #1

On Wed, Dec 04, 2024 at 04:26:24PM -0800, Luis Chamberlain wrote:
> -modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "remove-patiently"
> +modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait TIMEOUT_MSEC"

Silly me, this is wrong, this should be:

+modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait up to MSEC"

  Luis

Eric Sandeen Dec. 5, 2024, 4:35 a.m. UTC | #2

On 12/4/24 6:26 PM, Luis Chamberlain wrote:
> Eric, I saw you mentioning on IRC you didn't understand *why*
> the patient module remover was added. Even though I thought the
> commit log explained it, let me summarize again: fix tons of
> flaky tests which assume module removal is being done correctly.
> It is not and fixing this is a module specific issue like with
> scsi_debug as documented in the commit log bugzilla references.
> So any sane test suite thing relying on module removal should use
> something like modprobe -w <timeout-in-ms>.

Well, I was having a sad because the upshot of all of that was
that when xfs was not removable at all because it was the root
filesystem, the end result was something like this:

    --- tests/xfs/435.out	2024-11-21 05:13:04.000000000 -0500
    +++ /var/lib/xfstests/results//xfs/435.out.bad	2024-11-21 05:14:47.949206141 -0500
    @@ -3,3 +3,4 @@
     Create a many-block file
     Remount to check recovery
     See if we leak
    +custom patient module removal for xfs timed out waiting for refcnt to become 0 using timeout of 50

which is kind of nonsense, but that probably has more to do with
the test not realizing /before it starts/ that the module cannot
be removed and it should not even try.

Darrick fixed that with:

[PATCH 2/2] xfs/43[4-6]: implement impatient module reloading

but it's starting to feel like a bit of a complex house of cards
by now. We might need a more robust framework for determining whether
a module is removable /at all/ before we decide to wait patiently
for a thing that cannot ever happen?

-Eric

Daniel Gomez Dec. 5, 2024, 7:41 a.m. UTC | #3

On 12/5/2024 1:26 AM, Luis Chamberlain wrote:
> We had added support for open coding a patient module remover long
> ago on fstests through commit d405c21d40aa1 ("common/module: add patient
> module rmmod support") to fix many flaky tests. This assumed we'd end up
> with modprobe -p -t <msec-timeout> but in the end kmod upstream just

I can't find modprobe -p and/or -t arguments in the manual. What do they 
mean?

Well, I'm now seeing this "patient" support:
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=d405c21d40aa1f0ca846dd144a1a7731e55679b2

but i can't find the module remover support in kmod.


Nit. I find useful using the long argument instead of the short one 
(e.g. --wait instead of -w). as it's usually self-descriptive. But I 
guess we don't have that long option for -p and -t?


Daniel

> used modprobe -w <msec-timeout> through the respective kmod commit
> 2b98ed888614 ("modprobe: Add --wait").
> 
> Take advantage of the upstream patient module remover support added
> since June 2022, so many distributions should already have support for
> this now.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> 
> Eric, I saw you mentioning on IRC you didn't understand *why*
> the patient module remover was added. Even though I thought the
> commit log explained it, let me summarize again: fix tons of
> flaky tests which assume module removal is being done correctly.
> It is not and fixing this is a module specific issue like with
> scsi_debug as documented in the commit log bugzilla references.
> So any sane test suite thing relying on module removal should use
> something like modprobe -w <timeout-in-ms>.
> 
>    Luis
> 
>   common/config | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/common/config b/common/config
> index fcff0660b05a..d899129fd5f1 100644
> --- a/common/config
> +++ b/common/config
> @@ -264,7 +264,7 @@ export UDEV_SETTLE_PROG
>   # Set MODPROBE_PATIENT_RM_TIMEOUT_SECONDS to "forever" if you want the patient
>   # modprobe removal to run forever trying to remove a module.
>   MODPROBE_REMOVE_PATIENT=""
> -modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "remove-patiently"
> +modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait TIMEOUT_MSEC"
>   if [[ $? -ne 0 ]]; then
>   	if [[ -z "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" ]]; then
>   		# We will open code our own implementation of patient module
> @@ -276,19 +276,19 @@ else
>   	if [[ ! -z "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" ]]; then
>   		if [[ "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" != "forever" ]]; then
>   			MODPROBE_PATIENT_RM_TIMEOUT_MS="$((MODPROBE_PATIENT_RM_TIMEOUT_SECONDS * 1000))"
> -			MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-t $MODPROBE_PATIENT_RM_TIMEOUT_MS"
> +			MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-w $MODPROBE_PATIENT_RM_TIMEOUT_MS"
>   		fi
>   	else
>   		# We export MODPROBE_PATIENT_RM_TIMEOUT_SECONDS here for parity
> -		# with environments without support for modprobe -p, but we
> +		# with environments without support for modprobe -w, but we
>   		# only really need it exported right now for environments which
> -		# don't have support for modprobe -p to implement our own
> +		# don't have support for modprobe -w to implement our own
>   		# patient module removal support within fstests.
>   		export MODPROBE_PATIENT_RM_TIMEOUT_SECONDS="50"
>   		MODPROBE_PATIENT_RM_TIMEOUT_MS="$((MODPROBE_PATIENT_RM_TIMEOUT_SECONDS * 1000))"
> -		MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-t $MODPROBE_PATIENT_RM_TIMEOUT_MS"
> +		MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-w $MODPROBE_PATIENT_RM_TIMEOUT_MS"
>   	fi
> -	MODPROBE_REMOVE_PATIENT="modprobe -p $MODPROBE_RM_PATIENT_TIMEOUT_ARGS"
> +	MODPROBE_REMOVE_PATIENT="modprobe $MODPROBE_RM_PATIENT_TIMEOUT_ARGS"
>   fi
>   export MODPROBE_REMOVE_PATIENT
>

Luis Chamberlain Dec. 5, 2024, 10:21 p.m. UTC | #4

On Thu, Dec 05, 2024 at 08:41:22AM +0100, Daniel Gomez wrote:
> On 12/5/2024 1:26 AM, Luis Chamberlain wrote:
> > We had added support for open coding a patient module remover long
> > ago on fstests through commit d405c21d40aa1 ("common/module: add patient
> > module rmmod support") to fix many flaky tests. This assumed we'd end up
> > with modprobe -p -t <msec-timeout> but in the end kmod upstream just
> 
> I can't find modprobe -p and/or -t arguments in the manual. What do they
> mean?

I had proposed -p to mean patient module remover with a default timeout
set, -t to override. In the end this went upstream instead over a lot of
dialog with just -w <timeout>.

> but i can't find the module remover support in kmod.
> 
> 
> Nit. I find useful using the long argument instead of the short one (e.g.
> --wait instead of -w). as it's usually self-descriptive. But I guess we
> don't have that long option for -p and -t?

It was one or the other that went upstream, we implemented this on
fstests upstream as an open coded solution while we wanted for this
meachanism to be agreed upon and merged. I just forgot to come back to
this after -w was merged and decided upon.

  Luis

Luis Chamberlain Dec. 5, 2024, 10:52 p.m. UTC | #5

On Wed, Dec 04, 2024 at 10:35:45PM -0600, Eric Sandeen wrote:
> but that probably has more to do with the test not realizing
> /before it starts/ that the module cannot be removed and it
> should not even try.

Right.

> Darrick fixed that with:
> 
> [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading

Looks good to me.

> but it's starting to feel like a bit of a complex house of cards
> by now. We might need a more robust framework for determining whether
> a module is removable /at all/ before we decide to wait patiently
> for a thing that cannot ever happen?

I think the above is a good example of knowing userspace and knowing
that userspace may be doing something else and we're ok to fail.
Essentially, module removal is non-deterministic due to how finicky
and easy it is to bump the refcnt for arbitrary reasons which are
subsystem specific. The URLs in the commit log I added provide good
examples of this. It is up to each subsystem to ensure a proper
quiesce makes sense to ensure userspace won't do something stupid
later.

If one can control the test environment to quiesce first, then it
makes sense to patiently remove the module. Otherwise the optional
impatient removal makes sense.

  Luis

Daniel Gomez Dec. 6, 2024, 9 a.m. UTC | #6

On 12/5/2024 11:21 PM, Luis Chamberlain wrote:
> On Thu, Dec 05, 2024 at 08:41:22AM +0100, Daniel Gomez wrote:
>> On 12/5/2024 1:26 AM, Luis Chamberlain wrote:
>>> We had added support for open coding a patient module remover long
>>> ago on fstests through commit d405c21d40aa1 ("common/module: add patient
>>> module rmmod support") to fix many flaky tests. This assumed we'd end up
>>> with modprobe -p -t <msec-timeout> but in the end kmod upstream just
>>
>> I can't find modprobe -p and/or -t arguments in the manual. What do they
>> mean?
> 
> I had proposed -p to mean patient module remover with a default timeout
> set, -t to override. In the end this went upstream instead over a lot of
> dialog with just -w <timeout>.
> 
>> but i can't find the module remover support in kmod.
>>
>>
>> Nit. I find useful using the long argument instead of the short one (e.g.
>> --wait instead of -w). as it's usually self-descriptive. But I guess we
>> don't have that long option for -p and -t?
> 
> It was one or the other that went upstream, we implemented this on
> fstests upstream as an open coded solution while we wanted for this
> meachanism to be agreed upon and merged. I just forgot to come back to
> this after -w was merged and decided upon.

Thanks for clarifying.


> 
>    Luis

Eric Sandeen Dec. 6, 2024, 4:38 p.m. UTC | #7

On 12/5/24 4:52 PM, Luis Chamberlain wrote:
> On Wed, Dec 04, 2024 at 10:35:45PM -0600, Eric Sandeen wrote:
>> but that probably has more to do with the test not realizing
>> /before it starts/ that the module cannot be removed and it
>> should not even try.
> 
> Right.
> 
>> Darrick fixed that with:
>>
>> [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading
> 
> Looks good to me.
> 
>> but it's starting to feel like a bit of a complex house of cards
>> by now. We might need a more robust framework for determining whether
>> a module is removable /at all/ before we decide to wait patiently
>> for a thing that cannot ever happen?
> 
> I think the above is a good example of knowing userspace and knowing
> that userspace may be doing something else and we're ok to fail.
> Essentially, module removal is non-deterministic due to how finicky
> and easy it is to bump the refcnt for arbitrary reasons which are
> subsystem specific. The URLs in the commit log I added provide good
> examples of this. It is up to each subsystem to ensure a proper
> quiesce makes sense to ensure userspace won't do something stupid
> later.
> 
> If one can control the test environment to quiesce first, then it
> makes sense to patiently remove the module. Otherwise the optional
> impatient removal makes sense.

Not to belabor the point too much, but my gut feeling is there are
cases where "quiescing" is not the issue at all - if the module is
in use on the system somewhere outside of xfstests, no amount of
quiescing or waiting will make it removable. Essentially, xfstests
needs to figure out if it is the sole owner/user of a module before
it tries to do any sort of waiting for removal, IMHO.

I haven't had the time to think through it much, just my spidey-
sense tingling when looking at the xfs/43[4-6] problem and feeling
like there's probably a better/cleaner/more explicit solution for
some of this.

Thanks,
-Eric

>   Luis
>

Luis Chamberlain Dec. 6, 2024, 7:34 p.m. UTC | #8

On Fri, Dec 06, 2024 at 10:38:11AM -0600, Eric Sandeen wrote:
> On 12/5/24 4:52 PM, Luis Chamberlain wrote:
> > On Wed, Dec 04, 2024 at 10:35:45PM -0600, Eric Sandeen wrote:
> >> but that probably has more to do with the test not realizing
> >> /before it starts/ that the module cannot be removed and it
> >> should not even try.
> > 
> > Right.
> > 
> >> Darrick fixed that with:
> >>
> >> [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading
> > 
> > Looks good to me.
> > 
> >> but it's starting to feel like a bit of a complex house of cards
> >> by now. We might need a more robust framework for determining whether
> >> a module is removable /at all/ before we decide to wait patiently
> >> for a thing that cannot ever happen?
> > 
> > I think the above is a good example of knowing userspace and knowing
> > that userspace may be doing something else and we're ok to fail.
> > Essentially, module removal is non-deterministic due to how finicky
> > and easy it is to bump the refcnt for arbitrary reasons which are
> > subsystem specific. The URLs in the commit log I added provide good
> > examples of this. It is up to each subsystem to ensure a proper
> > quiesce makes sense to ensure userspace won't do something stupid
> > later.
> > 
> > If one can control the test environment to quiesce first, then it
> > makes sense to patiently remove the module. Otherwise the optional
> > impatient removal makes sense.
> 
> Not to belabor the point too much, but my gut feeling is there are
> cases where "quiescing" is not the issue at all - if the module is
> in use on the system somewhere outside of xfstests, no amount of
> quiescing or waiting will make it removable.

Yes indeed, that is a good point. Only if the test suite has full
control to ensure the lifetime of the module could it rely on removal.
But there are holes in the assumptions which can be made even on this
front too. I'll explain below.

> Essentially, xfstests
> needs to figure out if it is the sole owner/user of a module before
> it tries to do any sort of waiting for removal, IMHO.

Even if it could do that, it can't prevent the user from poking around,
or for some userspace package to not be present which may proactively
poke around block devices of certain type for example. Such things
will break the assumptions that the test suite has full control.

Test runners, which do full bringup / setup / package installations
have more control for a more deterministic setup, fstests itself
however will have most control, but since this is about flaky tests
failing for stupid reasons, this is all about *improving* from a test
perspective consciosness over this problem and for the tester to make
more appropriate calls for what it thinks it can have control over.

  Luis

common/config: use modprobe -w when supported

Commit Message

Comments

Patch