[1/2] r8152: Hold the rtnl_lock for all of reset

Message ID	20231117130836.1.I77097aa9ec01aeca1b3c75fde4ba5007a17fdf76@changeid (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=chromium.org header.i=@chromium.org header.b="kAkHxOwr"
	From: Douglas Anderson <dianders@chromium.org>
	To: Jakub Kicinski <kuba@kernel.org>, Hayes Wang <hayeswang@realtek.com>, "David S . Miller" <davem@davemloft.net>
	Cc: Grant Grundler <grundler@chromium.org>, Simon Horman <horms@kernel.org>, Edward Hill <ecgh@chromium.org>, linux-usb@vger.kernel.org, Laura Nao <laura.nao@collabora.com>, Alan Stern <stern@rowland.harvard.edu>, Douglas Anderson <dianders@chromium.org>, Bjørn Mork <bjorn@mork.no>, Eric Dumazet <edumazet@google.com>, Paolo Abeni <pabeni@redhat.com>, linux-kernel@vger.kernel.org, netdev@vger.kernel.org
	Subject: [PATCH 1/2] r8152: Hold the rtnl_lock for all of reset
	Date: Fri, 17 Nov 2023 13:08:41 -0800
	Message-ID: <20231117130836.1.I77097aa9ec01aeca1b3c75fde4ba5007a17fdf76@changeid>
	Precedence: bulk
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
Series	[1/2] r8152: Hold the rtnl_lock for all of reset
	[2/2] r8152: Add RTL8152_INACCESSIBLE checks to more loops

Context	Check	Description
netdev/series_format	warning	Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1127 this patch: 1127
netdev/cc_maintainers	success	CCed 8 of 8 maintainers
netdev/build_clang	success	Errors and warnings before: 1154 this patch: 1154
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1154 this patch: 1154
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 39 lines checked
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Commit Message

Doug Anderson Nov. 17, 2023, 9:08 p.m. UTC

As of commit d9962b0d4202 ("r8152: Block future register access if
register access fails") there is a race condition that can happen
between the USB device reset thread and napi_enable() (not) getting
called during rtl8152_open(). Specifically:
* While rtl8152_open() is running we get a register access error
  that's _not_ -ENODEV and queue up a USB reset.
* rtl8152_open() exits before calling napi_enable() due to any reason
  (including usb_submit_urb() returning an error).

In that case:
* Since the USB reset is performed in a separate thread asynchronously,
  it can run at any time the USB device lock is not held - even before
  rtl8152_open() has exited with an error and caused __dev_open() to
  clear the __LINK_STATE_START bit.
* The rtl8152_pre_reset() will notice that the netif_running() returns
  true (since __LINK_STATE_START wasn't cleared) so it won't exit
  early.
* rtl8152_pre_reset() will then hang in napi_disable() because
  napi_enable() was never called.
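
The hang in the last step can be illustrated with a small userspace model. This is only a sketch under assumptions: the toy_* names are hypothetical, a single flag stands in for the kernel's NAPI_STATE_SCHED bit, and the retry limit exists only so the model terminates (the kernel's napi_disable() keeps waiting).

```c
#include <stdbool.h>

/* Toy model of the NAPI "scheduled" bit; not the kernel implementation. */
struct toy_napi {
	bool sched;	/* stands in for NAPI_STATE_SCHED */
};

/* A freshly added NAPI context starts out disabled, i.e. with the bit set. */
static void toy_napi_add(struct toy_napi *n)
{
	n->sched = true;
}

/* Enabling clears the bit so a later disable can claim it. */
static void toy_napi_enable(struct toy_napi *n)
{
	n->sched = false;
}

/*
 * Disabling waits until it can take the bit for itself. This model gives
 * up after max_tries so that the would-be hang becomes observable.
 * Returns true if the bit was taken, false if it never became free.
 */
static bool toy_napi_disable(struct toy_napi *n, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (!n->sched) {
			n->sched = true;
			return true;
		}
	}
	return false;	/* the kernel would still be waiting here */
}
```

In this model, calling toy_napi_disable() right after toy_napi_add() never succeeds, which mirrors rtl8152_pre_reset() calling napi_disable() when rtl8152_open() bailed out before napi_enable().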

We can fix the race by making sure that the r8152 reset routines don't
run at the same time as we're opening the device. Specifically we need
the reset routines in their entirety to rely on the return value of
netif_running(). The only way to reliably depend on that is for them
to hold the rtnl_lock() mutex for the duration of reset.

Grabbing the rtnl_lock() mutex for the duration of reset seems like a
long time, but reset is not expected to be common and the rtnl_lock()
mutex is already held for long durations since the core grabs it
around the open/close calls.

Fixes: d9962b0d4202 ("r8152: Block future register access if register access fails")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/net/usb/r8152.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Comments

Grant Grundler Nov. 21, 2023, 3:23 a.m. UTC | #1

On Fri, Nov 17, 2023 at 1:10 PM Douglas Anderson <dianders@chromium.org> wrote:
>
> As of commit d9962b0d4202 ("r8152: Block future register access if
> register access fails") there is a race condition that can happen
> between the USB device reset thread and napi_enable() (not) getting
> called during rtl8152_open(). Specifically:
> * While rtl8152_open() is running we get a register access error
>   that's _not_ -ENODEV and queue up a USB reset.
> * rtl8152_open() exits before calling napi_enable() due to any reason
>   (including usb_submit_urb() returning an error).
>
> In that case:
> * Since the USB reset is performed in a separate thread asynchronously,
>   it can run at any time the USB device lock is not held - even before
>   rtl8152_open() has exited with an error and caused __dev_open() to
>   clear the __LINK_STATE_START bit.
> * The rtl8152_pre_reset() will notice that the netif_running() returns
>   true (since __LINK_STATE_START wasn't cleared) so it won't exit
>   early.
> * rtl8152_pre_reset() will then hang in napi_disable() because
>   napi_enable() was never called.
>
> We can fix the race by making sure that the r8152 reset routines don't
> run at the same time as we're opening the device. Specifically we need
> the reset routines in their entirety to rely on the return value of
> netif_running(). The only way to reliably depend on that is for them
> to hold the rtnl_lock() mutex for the duration of reset.
>
> Grabbing the rtnl_lock() mutex for the duration of reset seems like a
> long time, but reset is not expected to be common and the rtnl_lock()
> mutex is already held for long durations since the core grabs it
> around the open/close calls.
>
> Fixes: d9962b0d4202 ("r8152: Block future register access if register access fails")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Reviewed-by: Grant Grundler <grundler@chromium.org>

BTW, for ChromeOS systems, the outcome of hang in napi_disable() is a
"hung task" panic after 120 seconds. Fortunately, the stack trace made
it relatively easy (compared to other hung tasks I've looked at) to
unravel.

Doug gets all the credit for figuring out this solution (using rtnl_lock()).

cheers,
grant

> ---
>
>  drivers/net/usb/r8152.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
> index 2c5c1e91ded6..d6edf0254599 100644
> --- a/drivers/net/usb/r8152.c
> +++ b/drivers/net/usb/r8152.c
> @@ -8397,6 +8397,8 @@ static int rtl8152_pre_reset(struct usb_interface *intf)
>         struct r8152 *tp = usb_get_intfdata(intf);
>         struct net_device *netdev;
>
> +       rtnl_lock();
> +
>         if (!tp || !test_bit(PROBED_WITH_NO_ERRORS, &tp->flags))
>                 return 0;
>
> @@ -8428,20 +8430,17 @@ static int rtl8152_post_reset(struct usb_interface *intf)
>         struct sockaddr sa;
>
>         if (!tp || !test_bit(PROBED_WITH_NO_ERRORS, &tp->flags))
> -               return 0;
> +               goto exit;
>
>         rtl_set_accessible(tp);
>
>         /* reset the MAC address in case of policy change */
> -       if (determine_ethernet_addr(tp, &sa) >= 0) {
> -               rtnl_lock();
> +       if (determine_ethernet_addr(tp, &sa) >= 0)
>                 dev_set_mac_address (tp->netdev, &sa, NULL);
> -               rtnl_unlock();
> -       }
>
>         netdev = tp->netdev;
>         if (!netif_running(netdev))
> -               return 0;
> +               goto exit;
>
>         set_bit(WORK_ENABLE, &tp->flags);
>         if (netif_carrier_ok(netdev)) {
> @@ -8460,6 +8459,8 @@ static int rtl8152_post_reset(struct usb_interface *intf)
>         if (!list_empty(&tp->rx_done))
>                 napi_schedule(&tp->napi);
>
> +exit:
> +       rtnl_unlock();
>         return 0;
>  }
>
> --
> 2.43.0.rc0.421.g78406f8d94-goog
>

Paolo Abeni Nov. 21, 2023, 10:25 a.m. UTC | #2

On Fri, 2023-11-17 at 13:08 -0800, Douglas Anderson wrote:
> As of commit d9962b0d4202 ("r8152: Block future register access if
> register access fails") there is a race condition that can happen
> between the USB device reset thread and napi_enable() (not) getting
> called during rtl8152_open(). Specifically:
> * While rtl8152_open() is running we get a register access error
>   that's _not_ -ENODEV and queue up a USB reset.
> * rtl8152_open() exits before calling napi_enable() due to any reason
>   (including usb_submit_urb() returning an error).
> 
> In that case:
> * Since the USB reset is performed in a separate thread asynchronously,
>   it can run at any time the USB device lock is not held - even before
>   rtl8152_open() has exited with an error and caused __dev_open() to
>   clear the __LINK_STATE_START bit.
> * The rtl8152_pre_reset() will notice that the netif_running() returns
>   true (since __LINK_STATE_START wasn't cleared) so it won't exit
>   early.
> * rtl8152_pre_reset() will then hang in napi_disable() because
>   napi_enable() was never called.
> 
> We can fix the race by making sure that the r8152 reset routines don't
> run at the same time as we're opening the device. Specifically we need
> the reset routines in their entirety to rely on the return value of
> netif_running(). The only way to reliably depend on that is for them
> to hold the rtnl_lock() mutex for the duration of reset.

Acquiring the rtnl_lock in a callback and releasing it in a different
one, with the latter called depending on the configuration, looks
fragile and possibly prone to deadlock issues.

Have you tested your patch with lockdep enabled?

Can you instead acquire the rtnl lock only for pre_reset/post_reset and
in rtl8152_open() do something like:

	for (i = 0; i < MAX_WAIT; ++i) {
		if (usb_lock_device_for_reset(udev, NULL))
			goto error;

		wait_again = udev->reset_in_progress;
		usb_unlock_device(udev);
		if (!wait_again)
			break;

		usleep(1);
	}
	if (i == MAX_WAIT)
		goto error;

which should be more polite to other locks?


Thanks,

Paolo

Doug Anderson Nov. 21, 2023, 5:41 p.m. UTC | #3

Hi,

On Tue, Nov 21, 2023 at 2:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2023-11-17 at 13:08 -0800, Douglas Anderson wrote:
> > As of commit d9962b0d4202 ("r8152: Block future register access if
> > register access fails") there is a race condition that can happen
> > between the USB device reset thread and napi_enable() (not) getting
> > called during rtl8152_open(). Specifically:
> > * While rtl8152_open() is running we get a register access error
> >   that's _not_ -ENODEV and queue up a USB reset.
> > * rtl8152_open() exits before calling napi_enable() due to any reason
> >   (including usb_submit_urb() returning an error).
> >
> > In that case:
> > * Since the USB reset is performed in a separate thread asynchronously,
> >   it can run at any time the USB device lock is not held - even before
> >   rtl8152_open() has exited with an error and caused __dev_open() to
> >   clear the __LINK_STATE_START bit.
> > * The rtl8152_pre_reset() will notice that the netif_running() returns
> >   true (since __LINK_STATE_START wasn't cleared) so it won't exit
> >   early.
> > * rtl8152_pre_reset() will then hang in napi_disable() because
> >   napi_enable() was never called.
> >
> > We can fix the race by making sure that the r8152 reset routines don't
> > run at the same time as we're opening the device. Specifically we need
> > the reset routines in their entirety to rely on the return value of
> > netif_running(). The only way to reliably depend on that is for them
> > to hold the rtnl_lock() mutex for the duration of reset.
>
> Acquiring the rtnl_lock in a callback and releasing it in a different
> one, with the latter called depending on the configuration, looks
> fragile and possibly prone to deadlock issues.

Yeah, I debated this as well. I looked through the USB code and I
couldn't find any reason that it wouldn't work to hold the lock for
the duration. I agree that it's a little more fragile in one sense,
but I think it avoids potential races too and that makes it less
fragile in a different sense. ;-)

> Have you tested your patch with lockdep enabled?

Yes, lockdep reported no problems with my patch. Indeed lockdep hints
are how I ended up with the current solution. When I originally tried
to lock the device in rtl8152_open() then lockdep yelled at me about
the AB BA issues between the device lock and the rtnl_lock() mutex
which made me realize that grabbing the rtnl_lock() in the reset code
was the right solution here.
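
To make the AB BA complaint concrete, here is a toy lock-order tracker in userspace C. It is a sketch under assumptions, not how lockdep is implemented: the toy_* names are hypothetical, only two locks are modelled, and the "reset path" and "open path" are reduced to the order in which they take the USB device lock and the rtnl_lock.

```c
#include <stdbool.h>

enum toy_lock { TOY_RTNL, TOY_USB_DEV, TOY_NLOCKS };

struct toy_lockdep {
	bool held[TOY_NLOCKS];
	/* before[a][b]: lock a was held at some point while b was taken */
	bool before[TOY_NLOCKS][TOY_NLOCKS];
};

/* Returns false if taking lock l now inverts an order recorded earlier. */
static bool toy_acquire(struct toy_lockdep *ld, enum toy_lock l)
{
	bool ok = true;

	for (int h = 0; h < TOY_NLOCKS; h++) {
		if (!ld->held[h] || h == (int)l)
			continue;
		if (ld->before[l][h])
			ok = false;	/* saw l -> h before, now h -> l */
		ld->before[h][l] = true;
	}
	ld->held[l] = true;
	return ok;
}

static void toy_release(struct toy_lockdep *ld, enum toy_lock l)
{
	ld->held[l] = false;
}

/*
 * Reset path: the USB core holds the device lock, then the driver callback
 * takes the rtnl_lock. Open path: the net core holds the rtnl_lock, and the
 * driver optionally tries to take the device lock as well.
 * Returns true if no lock-order inversion was observed.
 */
static bool toy_lock_order_ok(bool open_takes_device_lock)
{
	struct toy_lockdep ld = {0};
	bool ok = true;

	ok = toy_acquire(&ld, TOY_USB_DEV) && ok;
	ok = toy_acquire(&ld, TOY_RTNL) && ok;
	toy_release(&ld, TOY_RTNL);
	toy_release(&ld, TOY_USB_DEV);

	ok = toy_acquire(&ld, TOY_RTNL) && ok;
	if (open_takes_device_lock) {
		ok = toy_acquire(&ld, TOY_USB_DEV) && ok;
		toy_release(&ld, TOY_USB_DEV);
	}
	toy_release(&ld, TOY_RTNL);

	return ok;
}
```

With open_takes_device_lock set, the tracker reports the inversion; without it, both paths agree on "device lock first, then rtnl_lock" and nothing is flagged.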

> Can you instead acquire the rtnl lock only for pre_reset/post_rest and
> in rtl8152_open() do something alike:
>
>         for (i = 0; i < MAX_WAIT; ++i) {
>                 if (usb_lock_device_for_reset(udev, NULL))
>                         goto error;
>
>                 wait_again = udev->reset_in_progress;
>                 usb_unlock_device(udev);
>                 if (!wait_again)
>                         break;
>
>                 usleep(1);
>         }
>         if (i == MAX_WAIT)
>                 goto error;
>
> which should be more polite to other locks?

Right, I could add a call to usb_lock_device_for_reset() here. That
shouldn't trigger AB BA lockdep splats since it has a timeout. I'm not
100% convinced that it's right, though. ...and I'm fairly certain that
if we call it we don't want to call it in a loop.

I don't think we should have a loop because
usb_lock_device_for_reset() already has a loop in it and I don't think
an extra loop will help. I'd imagine that usb_lock_device_for_reset()
would usually timeout only if USB reset is currently running and
somehow blocked. If pre_reset or post_reset are currently running then
they've already got the USB lock (from their caller) and may be
blocked waiting for the rtnl_lock. We've already got the rtnl_lock
(from our caller) and now we're waiting for the USB lock. In neither
case do I think it's a good idea to drop the locks that our caller
grabbed for us, so about the best we can do in that case is return an
error from rtl8152_open() after the first timeout.

Let's step back and think about why we might want to get the USB lock
in the first place. This would only be necessary if we dropped the
lock between pre_reset and post_reset, right? ...so we're trying to
make sure that we're not trying to open a device while the USB reset
code is half executed. I guess the expected order of operations we're
trying to protect against would be:

1. rtl8152_close() is called and has a transfer error that queues up a reset.
2. USB reset starts and pre-reset runs. It should be a no-op because
netif_running() would return false.
3. rtl8152_open() is called and opens the device successfully
4. USB reset runs post-reset, which is no longer the inverse of
pre-reset because netif_running() would return true. This would end up
with, among other things, an unbalanced napi_enable() count.
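
The four steps above can be replayed in a small userspace model. This is a sketch under assumptions: the toy_* names are hypothetical, a bool stands in for netif_running(), and a counter stands in for the balance of napi_enable() and napi_disable() calls; the real callbacks do considerably more.

```c
#include <stdbool.h>

struct toy_dev {
	bool running;		/* stands in for netif_running() */
	int napi_enables;	/* napi_enable() calls minus napi_disable() calls */
};

/* Open marks the device running and enables NAPI once. */
static void toy_open(struct toy_dev *d)
{
	d->running = true;
	d->napi_enables++;
}

/* Pre-reset quiesces the device, but only if it is running. */
static void toy_pre_reset(struct toy_dev *d)
{
	if (!d->running)
		return;
	d->napi_enables--;
}

/* Post-reset is meant to undo pre-reset, again only if running. */
static void toy_post_reset(struct toy_dev *d)
{
	if (!d->running)
		return;
	d->napi_enables++;
}

/* Open slips in between the two reset callbacks (steps 2 to 4). */
static int toy_interleaved(void)
{
	struct toy_dev d = { .running = false, .napi_enables = 0 };

	toy_pre_reset(&d);	/* no-op: device is closed */
	toy_open(&d);		/* enables NAPI */
	toy_post_reset(&d);	/* sees "running" and enables NAPI again */
	return d.napi_enables;
}

/* The whole reset completes before open is allowed to run. */
static int toy_serialized(void)
{
	struct toy_dev d = { .running = false, .napi_enables = 0 };

	toy_pre_reset(&d);
	toy_post_reset(&d);
	toy_open(&d);
	return d.napi_enables;
}
```

In this model the interleaved order ends with two enables for a single open, while the serialized order ends with one, which is the imbalance described in step 4.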

That feels relatively unlikely to actually hit but it does seem
conceivably possible. Thus if we do drop the rtnl_lock between
pre-reset and post-reset then I agree we should call
usb_lock_device_for_reset(). Probably we need to do that for _both_
rtl8152_open() and rtl8152_close()? We also probably don't need to
hold the lock for the whole duration of rtl8152_open() /
rtl8152_close(). We can just grab it and release it to make sure that
we're not midway through a reset.

I guess one sorta odd thing here is that it means that rtl8152_close()
could now fail if someone called it at just the right time and we were
unable to grab the USB lock. Though it does have an error return,
that's not a failure that I'd expect most users to be able to handle
terribly well. I guess conceivably we could return -EAGAIN or
-EDEADLOCK in this case, but ick...

Hopefully the above makes sense. I'd be interested to hear your
further thoughts on the issue. I'd still lean towards leaving the code
as-is and holding the rtnl_lock across the whole reset, but for all
practical purposes I think it would be fine to split it and add
usb_lock_device_for_reset() to the rtl8152_open() / rtl8152_close(),
since the issues I talk about above seem like they'd need extremely
rare timing conditions to hit.

-Doug

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 2c5c1e91ded6..d6edf0254599 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -8397,6 +8397,8 @@  static int rtl8152_pre_reset(struct usb_interface *intf)
 	struct r8152 *tp = usb_get_intfdata(intf);
 	struct net_device *netdev;
 
+	rtnl_lock();
+
 	if (!tp || !test_bit(PROBED_WITH_NO_ERRORS, &tp->flags))
 		return 0;
 
@@ -8428,20 +8430,17 @@  static int rtl8152_post_reset(struct usb_interface *intf)
 	struct sockaddr sa;
 
 	if (!tp || !test_bit(PROBED_WITH_NO_ERRORS, &tp->flags))
-		return 0;
+		goto exit;
 
 	rtl_set_accessible(tp);
 
 	/* reset the MAC address in case of policy change */
-	if (determine_ethernet_addr(tp, &sa) >= 0) {
-		rtnl_lock();
+	if (determine_ethernet_addr(tp, &sa) >= 0)
 		dev_set_mac_address (tp->netdev, &sa, NULL);
-		rtnl_unlock();
-	}
 
 	netdev = tp->netdev;
 	if (!netif_running(netdev))
-		return 0;
+		goto exit;
 
 	set_bit(WORK_ENABLE, &tp->flags);
 	if (netif_carrier_ok(netdev)) {
@@ -8460,6 +8459,8 @@  static int rtl8152_post_reset(struct usb_interface *intf)
 	if (!list_empty(&tp->rx_done))
 		napi_schedule(&tp->napi);
 
+exit:
+	rtnl_unlock();
 	return 0;
 }
