diff mbox series

[net-next,1/7] net: phy: always call phy_process_state_change() under lock

Message ID E1qgoNP-007a46-Mj@rmk-PC.armlinux.org.uk (mailing list archive)
State Accepted
Commit 8da77df649c43d9cec70d683ee317ba5d49d09b7
Delegated to: Netdev Maintainers
Headers show
Series net: phy: avoid race when erroring stopping PHY | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1340 this patch: 1340
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 1363 this patch: 1363
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1363 this patch: 1363
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 14 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Russell King (Oracle) Sept. 14, 2023, 3:35 p.m. UTC
phy_stop() calls phy_process_state_change() while holding the phydev
lock, so also arrange for phy_state_machine() to do the same, so that
this function is called with consistent locking.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Florian Fainelli Sept. 14, 2023, 6:21 p.m. UTC | #1
On 9/14/23 08:35, Russell King (Oracle) wrote:
> phy_stop() calls phy_process_state_change() while holding the phydev
> lock, so also arrange for phy_state_machine() to do the same, so that
> this function is called with consistent locking.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Marek Szyprowski Sept. 18, 2023, 12:33 p.m. UTC | #2
Hi Russell,

On 14.09.2023 17:35, Russell King (Oracle) wrote:
> phy_stop() calls phy_process_state_change() while holding the phydev
> lock, so also arrange for phy_state_machine() to do the same, so that
> this function is called with consistent locking.
>
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

This change, merged to linux-next as commit 8da77df649c4 ("net: phy: 
always call phy_process_state_change() under lock") introduces the 
following deadlock with ASIX AX8817X USB driver:

--->8---

asix 1-1.4:1.0 (unnamed net_device) (uninitialized): PHY 
[usb-001:003:10] driver [Asix Electronics AX88772A] (irq=POLL)
Asix Electronics AX88772A usb-001:003:10: attached PHY driver 
(mii_bus:phy_addr=usb-001:003:10, irq=POLL)
asix 1-1.4:1.0 eth0: register 'asix' at usb-12110000.usb-1.4, ASIX 
AX88772 USB 2.0 Ethernet, a2:99:b6:cd:11:eb

asix 1-1.4:1.0 eth0: configuring for phy/internal link mode

============================================
WARNING: possible recursive locking detected
6.6.0-rc1-00239-g8da77df649c4-dirty #13949 Not tainted
--------------------------------------------
kworker/3:3/71 is trying to acquire lock:
c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_start_aneg+0x1c/0x38

but task is already holding lock:
c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x100/0x2b8

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&dev->lock);
   lock(&dev->lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

3 locks held by kworker/3:3/71:
  #0: c1c090a8 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: 
process_one_work+0x148/0x608
  #1: f0bddf20 
((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}, at: 
process_one_work+0x148/0x608
  #2: c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x100/0x2b8

stack backtrace:
CPU: 3 PID: 71 Comm: kworker/3:3 Not tainted 
6.6.0-rc1-00239-g8da77df649c4-dirty #13949
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events_power_efficient phy_state_machine
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x58/0x70
  dump_stack_lvl from __lock_acquire+0x1300/0x2984
  __lock_acquire from lock_acquire+0x130/0x37c
  lock_acquire from __mutex_lock+0x94/0x94c
  __mutex_lock from mutex_lock_nested+0x1c/0x24
  mutex_lock_nested from phy_start_aneg+0x1c/0x38
  phy_start_aneg from phy_state_machine+0x10c/0x2b8
  phy_state_machine from process_one_work+0x204/0x608
  process_one_work from worker_thread+0x1e0/0x498
  worker_thread from kthread+0x104/0x138
  kthread from ret_from_fork+0x14/0x28
Exception stack(0xf0bddfb0 to 0xf0bddff8)
...

-------

This probably need to be fixed somewhere in drivers/net/usb/asix* but at 
the first glance I don't see any obvious place that need a fix.

> ---
>   drivers/net/phy/phy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index df54c137c5f5..1e5218935eb3 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -1506,6 +1506,7 @@ void phy_state_machine(struct work_struct *work)
>   	if (err < 0)
>   		phy_error_precise(phydev, func, err);
>   
> +	mutex_lock(&phydev->lock);
>   	phy_process_state_change(phydev, old_state);
>   
>   	/* Only re-schedule a PHY state machine change if we are polling the
> @@ -1516,7 +1517,6 @@ void phy_state_machine(struct work_struct *work)
>   	 * state machine would be pointless and possibly error prone when
>   	 * called from phy_disconnect() synchronously.
>   	 */
> -	mutex_lock(&phydev->lock);
>   	if (phy_polling_mode(phydev) && phy_is_started(phydev))
>   		phy_queue_state_machine(phydev, PHY_STATE_TIME);
>   	mutex_unlock(&phydev->lock);

Best regards
Andrew Lunn Sept. 18, 2023, 12:55 p.m. UTC | #3
> This probably need to be fixed somewhere in drivers/net/usb/asix* but at 
> the first glance I don't see any obvious place that need a fix.

static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc,
                            bool in_pm)
{
        struct usbnet *dev = netdev_priv(netdev);
        __le16 res;
        int ret;

        mutex_lock(&dev->phy_mutex);

Taking this lock here is the problem. Same for write.

There is some funky stuff going on in asix_devices.c. It using both
phylib and the much older mii code.

       Andrew
Marek Szyprowski Sept. 18, 2023, 1:05 p.m. UTC | #4
On 18.09.2023 14:55, Andrew Lunn wrote:
>> This probably need to be fixed somewhere in drivers/net/usb/asix* but at
>> the first glance I don't see any obvious place that need a fix.
> static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc,
>                              bool in_pm)
> {
>          struct usbnet *dev = netdev_priv(netdev);
>          __le16 res;
>          int ret;
>
>          mutex_lock(&dev->phy_mutex);
>
> Taking this lock here is the problem. Same for write.
>
> There is some funky stuff going on in asix_devices.c. It using both
> phylib and the much older mii code.

This must be something different. Removing those calls to phy_mutex from 
__asix_mdio_read/write doesn't fix this deadlock (I intentionally 
ignored the fact that some kind of synchronization is probably required 
there).

Best regards
Russell King (Oracle) Sept. 18, 2023, 1:06 p.m. UTC | #5
On Mon, Sep 18, 2023 at 02:33:04PM +0200, Marek Szyprowski wrote:
> Hi Russell,
> 
> On 14.09.2023 17:35, Russell King (Oracle) wrote:
> > phy_stop() calls phy_process_state_change() while holding the phydev
> > lock, so also arrange for phy_state_machine() to do the same, so that
> > this function is called with consistent locking.
> >
> > Tested-by: Jijie Shao <shaojijie@huawei.com>
> > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> 
> This change, merged to linux-next as commit 8da77df649c4 ("net: phy: 
> always call phy_process_state_change() under lock") introduces the 
> following deadlock with ASIX AX8817X USB driver:

Yay, latent bug found...

I guess this is asix_ax88772a_link_change_notify() which is causing
the problem, and yes, that phy_start_aneg() needs to be the unlocked
version (which we'll have to export.)

This should fix it.

diff --git a/drivers/net/phy/ax88796b.c b/drivers/net/phy/ax88796b.c
index 0f1e617a26c9..eb74a8cf8df1 100644
--- a/drivers/net/phy/ax88796b.c
+++ b/drivers/net/phy/ax88796b.c
@@ -90,7 +90,7 @@ static void asix_ax88772a_link_change_notify(struct phy_device *phydev)
 	 */
 	if (phydev->state == PHY_NOLINK) {
 		phy_init_hw(phydev);
-		phy_start_aneg(phydev);
+		_phy_start_aneg(phydev);
 	}
 }
 
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 93a8676dd8d8..a5fa077650e8 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -981,7 +981,7 @@ static int phy_check_link_status(struct phy_device *phydev)
  *   If the PHYCONTROL Layer is operating, we change the state to
  *   reflect the beginning of Auto-negotiation or forcing.
  */
-static int _phy_start_aneg(struct phy_device *phydev)
+int _phy_start_aneg(struct phy_device *phydev)
 {
 	int err;
 
@@ -1002,6 +1002,7 @@ static int _phy_start_aneg(struct phy_device *phydev)
 
 	return err;
 }
+EXPORT_SYMBOL(_phy_start_aneg);
 
 /**
  * phy_start_aneg - start auto-negotiation for this PHY device
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 1351b802ffcf..3cc52826f18e 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1736,6 +1736,7 @@ void phy_detach(struct phy_device *phydev);
 void phy_start(struct phy_device *phydev);
 void phy_stop(struct phy_device *phydev);
 int phy_config_aneg(struct phy_device *phydev);
+int _phy_start_aneg(struct phy_device *phydev);
 int phy_start_aneg(struct phy_device *phydev);
 int phy_aneg_done(struct phy_device *phydev);
 int phy_speed_down(struct phy_device *phydev, bool sync);
Russell King (Oracle) Sept. 18, 2023, 1:07 p.m. UTC | #6
On Mon, Sep 18, 2023 at 02:55:32PM +0200, Andrew Lunn wrote:
> > This probably need to be fixed somewhere in drivers/net/usb/asix* but at 
> > the first glance I don't see any obvious place that need a fix.
> 
> static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc,
>                             bool in_pm)
> {
>         struct usbnet *dev = netdev_priv(netdev);
>         __le16 res;
>         int ret;
> 
>         mutex_lock(&dev->phy_mutex);
> 
> Taking this lock here is the problem. Same for write.
> 
> There is some funky stuff going on in asix_devices.c. It using both
> phylib and the much older mii code.

I don't think that's the problem...
Marek Szyprowski Sept. 18, 2023, 1:15 p.m. UTC | #7
On 18.09.2023 15:06, Russell King (Oracle) wrote:
> On Mon, Sep 18, 2023 at 02:33:04PM +0200, Marek Szyprowski wrote:
>> Hi Russell,
>>
>> On 14.09.2023 17:35, Russell King (Oracle) wrote:
>>> phy_stop() calls phy_process_state_change() while holding the phydev
>>> lock, so also arrange for phy_state_machine() to do the same, so that
>>> this function is called with consistent locking.
>>>
>>> Tested-by: Jijie Shao <shaojijie@huawei.com>
>>> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
>> This change, merged to linux-next as commit 8da77df649c4 ("net: phy:
>> always call phy_process_state_change() under lock") introduces the
>> following deadlock with ASIX AX8817X USB driver:
> Yay, latent bug found...
>
> I guess this is asix_ax88772a_link_change_notify() which is causing
> the problem, and yes, that phy_start_aneg() needs to be the unlocked
> version (which we'll have to export.)
>
> This should fix it.

Thanks!

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>


> diff --git a/drivers/net/phy/ax88796b.c b/drivers/net/phy/ax88796b.c
> index 0f1e617a26c9..eb74a8cf8df1 100644
> --- a/drivers/net/phy/ax88796b.c
> +++ b/drivers/net/phy/ax88796b.c
> @@ -90,7 +90,7 @@ static void asix_ax88772a_link_change_notify(struct phy_device *phydev)
>   	 */
>   	if (phydev->state == PHY_NOLINK) {
>   		phy_init_hw(phydev);
> -		phy_start_aneg(phydev);
> +		_phy_start_aneg(phydev);
>   	}
>   }
>   
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 93a8676dd8d8..a5fa077650e8 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -981,7 +981,7 @@ static int phy_check_link_status(struct phy_device *phydev)
>    *   If the PHYCONTROL Layer is operating, we change the state to
>    *   reflect the beginning of Auto-negotiation or forcing.
>    */
> -static int _phy_start_aneg(struct phy_device *phydev)
> +int _phy_start_aneg(struct phy_device *phydev)
>   {
>   	int err;
>   
> @@ -1002,6 +1002,7 @@ static int _phy_start_aneg(struct phy_device *phydev)
>   
>   	return err;
>   }
> +EXPORT_SYMBOL(_phy_start_aneg);
>   
>   /**
>    * phy_start_aneg - start auto-negotiation for this PHY device
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 1351b802ffcf..3cc52826f18e 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -1736,6 +1736,7 @@ void phy_detach(struct phy_device *phydev);
>   void phy_start(struct phy_device *phydev);
>   void phy_stop(struct phy_device *phydev);
>   int phy_config_aneg(struct phy_device *phydev);
> +int _phy_start_aneg(struct phy_device *phydev);
>   int phy_start_aneg(struct phy_device *phydev);
>   int phy_aneg_done(struct phy_device *phydev);
>   int phy_speed_down(struct phy_device *phydev, bool sync);

Best regards
diff mbox series

Patch

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index df54c137c5f5..1e5218935eb3 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1506,6 +1506,7 @@  void phy_state_machine(struct work_struct *work)
 	if (err < 0)
 		phy_error_precise(phydev, func, err);
 
+	mutex_lock(&phydev->lock);
 	phy_process_state_change(phydev, old_state);
 
 	/* Only re-schedule a PHY state machine change if we are polling the
@@ -1516,7 +1517,6 @@  void phy_state_machine(struct work_struct *work)
 	 * state machine would be pointless and possibly error prone when
 	 * called from phy_disconnect() synchronously.
 	 */
-	mutex_lock(&phydev->lock);
 	if (phy_polling_mode(phydev) && phy_is_started(phydev))
 		phy_queue_state_machine(phydev, PHY_STATE_TIME);
 	mutex_unlock(&phydev->lock);