Message ID | E1qgoNP-007a46-Mj@rmk-PC.armlinux.org.uk (mailing list archive) |
---|---|
State | Accepted |
Commit | 8da77df649c43d9cec70d683ee317ba5d49d09b7 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: phy: avoid race when erroring stopping PHY | expand |
On 9/14/23 08:35, Russell King (Oracle) wrote: > phy_stop() calls phy_process_state_change() while holding the phydev > lock, so also arrange for phy_state_machine() to do the same, so that > this function is called with consistent locking. > > Tested-by: Jijie Shao <shaojijie@huawei.com> > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Hi Russell, On 14.09.2023 17:35, Russell King (Oracle) wrote: > phy_stop() calls phy_process_state_change() while holding the phydev > lock, so also arrange for phy_state_machine() to do the same, so that > this function is called with consistent locking. > > Tested-by: Jijie Shao <shaojijie@huawei.com> > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> This change, merged to linux-next as commit 8da77df649c4 ("net: phy: always call phy_process_state_change() under lock") introduces the following deadlock with ASIX AX8817X USB driver: --->8--- asix 1-1.4:1.0 (unnamed net_device) (uninitialized): PHY [usb-001:003:10] driver [Asix Electronics AX88772A] (irq=POLL) Asix Electronics AX88772A usb-001:003:10: attached PHY driver (mii_bus:phy_addr=usb-001:003:10, irq=POLL) asix 1-1.4:1.0 eth0: register 'asix' at usb-12110000.usb-1.4, ASIX AX88772 USB 2.0 Ethernet, a2:99:b6:cd:11:eb asix 1-1.4:1.0 eth0: configuring for phy/internal link mode ============================================ WARNING: possible recursive locking detected 6.6.0-rc1-00239-g8da77df649c4-dirty #13949 Not tainted -------------------------------------------- kworker/3:3/71 is trying to acquire lock: c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_start_aneg+0x1c/0x38 but task is already holding lock: c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x100/0x2b8 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&dev->lock); lock(&dev->lock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/3:3/71: #0: c1c090a8 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x148/0x608 #1: f0bddf20 ((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}, at: process_one_work+0x148/0x608 #2: c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x100/0x2b8 stack backtrace: CPU: 3 PID: 71 Comm: kworker/3:3 Not tainted 6.6.0-rc1-00239-g8da77df649c4-dirty #13949 Hardware name: Samsung Exynos (Flattened Device Tree) Workqueue: events_power_efficient phy_state_machine unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x58/0x70 dump_stack_lvl from __lock_acquire+0x1300/0x2984 __lock_acquire from lock_acquire+0x130/0x37c lock_acquire from __mutex_lock+0x94/0x94c __mutex_lock from mutex_lock_nested+0x1c/0x24 mutex_lock_nested from phy_start_aneg+0x1c/0x38 phy_start_aneg from phy_state_machine+0x10c/0x2b8 phy_state_machine from process_one_work+0x204/0x608 process_one_work from worker_thread+0x1e0/0x498 worker_thread from kthread+0x104/0x138 kthread from ret_from_fork+0x14/0x28 Exception stack(0xf0bddfb0 to 0xf0bddff8) ... ------- This probably need to be fixed somewhere in drivers/net/usb/asix* but at the first glance I don't see any obvious place that need a fix. > --- > drivers/net/phy/phy.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c > index df54c137c5f5..1e5218935eb3 100644 > --- a/drivers/net/phy/phy.c > +++ b/drivers/net/phy/phy.c > @@ -1506,6 +1506,7 @@ void phy_state_machine(struct work_struct *work) > if (err < 0) > phy_error_precise(phydev, func, err); > > + mutex_lock(&phydev->lock); > phy_process_state_change(phydev, old_state); > > /* Only re-schedule a PHY state machine change if we are polling the > @@ -1516,7 +1517,6 @@ void phy_state_machine(struct work_struct *work) > * state machine would be pointless and possibly error prone when > * called from phy_disconnect() synchronously. > */ > - mutex_lock(&phydev->lock); > if (phy_polling_mode(phydev) && phy_is_started(phydev)) > phy_queue_state_machine(phydev, PHY_STATE_TIME); > mutex_unlock(&phydev->lock); Best regards
> This probably need to be fixed somewhere in drivers/net/usb/asix* but at > the first glance I don't see any obvious place that need a fix. static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc, bool in_pm) { struct usbnet *dev = netdev_priv(netdev); __le16 res; int ret; mutex_lock(&dev->phy_mutex); Taking this lock here is the problem. Same for write. There is some funky stuff going on in asix_devices.c. It using both phylib and the much older mii code. Andrew
On 18.09.2023 14:55, Andrew Lunn wrote: >> This probably need to be fixed somewhere in drivers/net/usb/asix* but at >> the first glance I don't see any obvious place that need a fix. > static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc, > bool in_pm) > { > struct usbnet *dev = netdev_priv(netdev); > __le16 res; > int ret; > > mutex_lock(&dev->phy_mutex); > > Taking this lock here is the problem. Same for write. > > There is some funky stuff going on in asix_devices.c. It using both > phylib and the much older mii code. This must be something different. Removing those calls to phy_mutex from __asix_mdio_read/write doesn't fix this deadlock (I intentionally ignored the fact that some kind of synchronization is probably required there). Best regards
On Mon, Sep 18, 2023 at 02:33:04PM +0200, Marek Szyprowski wrote: > Hi Russell, > > On 14.09.2023 17:35, Russell King (Oracle) wrote: > > phy_stop() calls phy_process_state_change() while holding the phydev > > lock, so also arrange for phy_state_machine() to do the same, so that > > this function is called with consistent locking. > > > > Tested-by: Jijie Shao <shaojijie@huawei.com> > > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> > > This change, merged to linux-next as commit 8da77df649c4 ("net: phy: > always call phy_process_state_change() under lock") introduces the > following deadlock with ASIX AX8817X USB driver: Yay, latent bug found... I guess this is asix_ax88772a_link_change_notify() which is causing the problem, and yes, that phy_start_aneg() needs to be the unlocked version (which we'll have to export.) This should fix it. diff --git a/drivers/net/phy/ax88796b.c b/drivers/net/phy/ax88796b.c index 0f1e617a26c9..eb74a8cf8df1 100644 --- a/drivers/net/phy/ax88796b.c +++ b/drivers/net/phy/ax88796b.c @@ -90,7 +90,7 @@ static void asix_ax88772a_link_change_notify(struct phy_device *phydev) */ if (phydev->state == PHY_NOLINK) { phy_init_hw(phydev); - phy_start_aneg(phydev); + _phy_start_aneg(phydev); } } diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 93a8676dd8d8..a5fa077650e8 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -981,7 +981,7 @@ static int phy_check_link_status(struct phy_device *phydev) * If the PHYCONTROL Layer is operating, we change the state to * reflect the beginning of Auto-negotiation or forcing. */ -static int _phy_start_aneg(struct phy_device *phydev) +int _phy_start_aneg(struct phy_device *phydev) { int err; @@ -1002,6 +1002,7 @@ static int _phy_start_aneg(struct phy_device *phydev) return err; } +EXPORT_SYMBOL(_phy_start_aneg); /** * phy_start_aneg - start auto-negotiation for this PHY device diff --git a/include/linux/phy.h b/include/linux/phy.h index 1351b802ffcf..3cc52826f18e 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -1736,6 +1736,7 @@ void phy_detach(struct phy_device *phydev); void phy_start(struct phy_device *phydev); void phy_stop(struct phy_device *phydev); int phy_config_aneg(struct phy_device *phydev); +int _phy_start_aneg(struct phy_device *phydev); int phy_start_aneg(struct phy_device *phydev); int phy_aneg_done(struct phy_device *phydev); int phy_speed_down(struct phy_device *phydev, bool sync);
On Mon, Sep 18, 2023 at 02:55:32PM +0200, Andrew Lunn wrote: > > This probably need to be fixed somewhere in drivers/net/usb/asix* but at > > the first glance I don't see any obvious place that need a fix. > > static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc, > bool in_pm) > { > struct usbnet *dev = netdev_priv(netdev); > __le16 res; > int ret; > > mutex_lock(&dev->phy_mutex); > > Taking this lock here is the problem. Same for write. > > There is some funky stuff going on in asix_devices.c. It using both > phylib and the much older mii code. I don't think that's the problem...
On 18.09.2023 15:06, Russell King (Oracle) wrote: > On Mon, Sep 18, 2023 at 02:33:04PM +0200, Marek Szyprowski wrote: >> Hi Russell, >> >> On 14.09.2023 17:35, Russell King (Oracle) wrote: >>> phy_stop() calls phy_process_state_change() while holding the phydev >>> lock, so also arrange for phy_state_machine() to do the same, so that >>> this function is called with consistent locking. >>> >>> Tested-by: Jijie Shao <shaojijie@huawei.com> >>> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> >> This change, merged to linux-next as commit 8da77df649c4 ("net: phy: >> always call phy_process_state_change() under lock") introduces the >> following deadlock with ASIX AX8817X USB driver: > Yay, latent bug found... > > I guess this is asix_ax88772a_link_change_notify() which is causing > the problem, and yes, that phy_start_aneg() needs to be the unlocked > version (which we'll have to export.) > > This should fix it. Thanks! Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > diff --git a/drivers/net/phy/ax88796b.c b/drivers/net/phy/ax88796b.c > index 0f1e617a26c9..eb74a8cf8df1 100644 > --- a/drivers/net/phy/ax88796b.c > +++ b/drivers/net/phy/ax88796b.c > @@ -90,7 +90,7 @@ static void asix_ax88772a_link_change_notify(struct phy_device *phydev) > */ > if (phydev->state == PHY_NOLINK) { > phy_init_hw(phydev); > - phy_start_aneg(phydev); > + _phy_start_aneg(phydev); > } > } > > diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c > index 93a8676dd8d8..a5fa077650e8 100644 > --- a/drivers/net/phy/phy.c > +++ b/drivers/net/phy/phy.c > @@ -981,7 +981,7 @@ static int phy_check_link_status(struct phy_device *phydev) > * If the PHYCONTROL Layer is operating, we change the state to > * reflect the beginning of Auto-negotiation or forcing. > */ > -static int _phy_start_aneg(struct phy_device *phydev) > +int _phy_start_aneg(struct phy_device *phydev) > { > int err; > > @@ -1002,6 +1002,7 @@ static int _phy_start_aneg(struct phy_device *phydev) > > return err; > } > +EXPORT_SYMBOL(_phy_start_aneg); > > /** > * phy_start_aneg - start auto-negotiation for this PHY device > diff --git a/include/linux/phy.h b/include/linux/phy.h > index 1351b802ffcf..3cc52826f18e 100644 > --- a/include/linux/phy.h > +++ b/include/linux/phy.h > @@ -1736,6 +1736,7 @@ void phy_detach(struct phy_device *phydev); > void phy_start(struct phy_device *phydev); > void phy_stop(struct phy_device *phydev); > int phy_config_aneg(struct phy_device *phydev); > +int _phy_start_aneg(struct phy_device *phydev); > int phy_start_aneg(struct phy_device *phydev); > int phy_aneg_done(struct phy_device *phydev); > int phy_speed_down(struct phy_device *phydev, bool sync); Best regards
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index df54c137c5f5..1e5218935eb3 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -1506,6 +1506,7 @@ void phy_state_machine(struct work_struct *work) if (err < 0) phy_error_precise(phydev, func, err); + mutex_lock(&phydev->lock); phy_process_state_change(phydev, old_state); /* Only re-schedule a PHY state machine change if we are polling the @@ -1516,7 +1517,6 @@ void phy_state_machine(struct work_struct *work) * state machine would be pointless and possibly error prone when * called from phy_disconnect() synchronously. */ - mutex_lock(&phydev->lock); if (phy_polling_mode(phydev) && phy_is_started(phydev)) phy_queue_state_machine(phydev, PHY_STATE_TIME); mutex_unlock(&phydev->lock);