Message ID | 20230617155500.4005881-1-andrew@lunn.ch (mailing list archive) |
---|---|
State | Accepted |
Commit | c938ab4da0eb1620ae3243b0b24c572ddfc318fc |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: phy: Manual remove LEDs to ensure correct ordering | expand |
Hello: This patch was applied to netdev/net.git (main) by David S. Miller <davem@davemloft.net>: On Sat, 17 Jun 2023 17:55:00 +0200 you wrote: > If the core is left to remove the LEDs via devm_, it is performed too > late, after the PHY driver is removed from the PHY. This results in > dereferencing a NULL pointer when the LED core tries to turn the LED > off before destroying the LED. > > Manually unregister the LEDs at a safe point in phy_remove. > > [...] Here is the summary with links: - [net] net: phy: Manual remove LEDs to ensure correct ordering https://git.kernel.org/netdev/net/c/c938ab4da0eb You are awesome, thank you!
Hi Andrew, On 6/17/2023 4:55 PM, Andrew Lunn wrote: > If the core is left to remove the LEDs via devm_, it is performed too > late, after the PHY driver is removed from the PHY. This results in > dereferencing a NULL pointer when the LED core tries to turn the LED > off before destroying the LED. > > Manually unregister the LEDs at a safe point in phy_remove. > > Cc: stable@vger.kernel.org > Reported-by: Florian Fainelli <f.fainelli@gmail.com> > Suggested-by: Florian Fainelli <f.fainelli@gmail.com> > Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") > Signed-off-by: Andrew Lunn <andrew@lunn.ch> Thanks for fixing this, this is an improvement, though I can still hit another sort of use after free whereby the GENET driver removes the mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO block thus causing the following: # reboot -f [ 18.162000] bcmgenet 8f00000.ethernet eth0: Link is Down [ 18.305163] SError Interrupt on CPU2, code 0x00000000bf000002 -- SError [ 18.305170] GISB: target abort at 0x8f00e14 [R ], core: cpu_0 [ 18.305180] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 6.4.0-rc5-next-20230607-gc7a93fa22690 #98 [ 18.305187] Hardware name: BCM972180HB_V20 (DT) [ 18.305191] Workqueue: events set_brightness_delayed [ 18.305214] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 18.305220] pc : el1_abort+0x30/0x5c [ 18.305230] lr : el1_abort+0x24/0x5c [ 18.305235] sp : ffffffc082b73a90 [ 18.305236] x29: ffffffc082b73a90 x28: ffffff8002fad780 x27: 0000000000000000 [ 18.305243] x26: 0000000000000000 x25: 0000000000000000 x24: ffffff807dbb340d [ 18.305250] x23: 0000000060000005 x22: ffffffc08066d9ac x21: 0000000096000210 [ 18.305256] x20: ffffffc082b55e14 x19: ffffffc082b73ad0 x18: 0000000000000000 [ 18.305263] x17: 74656e2f74656e72 x16: 656874652e303030 x15: 303066382f626472 [ 18.305269] x14: ffffff8004a84cd8 x13: 6e69622f7273752f x12: 0000000000000000 [ 18.305275] x11: ffffff8002d1c710 x10: 0000000000000870 x9 : ffffffc080667e34 [ 18.305282] x8 : ffffff8003d44a80 x7 : fefefefefefefeff x6 : 000073746e657665 [ 18.305288] x5 : ffffff8003d44a80 x4 : ffffffc082b73ad0 x3 : 0000000000000025 [ 18.305294] x2 : 000000000000001c x1 : 0000000004208060 x0 : 0000000000000000 [ 18.305303] Kernel panic - not syncing: Asynchronous SError Interrupt [ 18.305306] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 6.4.0-rc5-next-20230607-gc7a93fa22690 #98 [ 18.305311] Hardware name: BCM972180HB_V20 (DT) [ 18.305314] Workqueue: events set_brightness_delayed [ 18.305319] Call trace: [ 18.305321] dump_backtrace+0xdc/0x114 [ 18.305329] show_stack+0x1c/0x28 [ 18.305333] dump_stack_lvl+0x44/0x58 [ 18.305339] dump_stack+0x14/0x1c [ 18.305342] panic+0x128/0x2f8 [ 18.305350] nmi_panic+0x50/0x70 [ 18.305356] arm64_serror_panic+0x74/0x80 [ 18.305361] do_serror+0x2c/0x5c [ 18.305366] el1h_64_error_handler+0x30/0x44 [ 18.305372] el1h_64_error+0x64/0x68 [ 18.305378] el1_abort+0x30/0x5c [ 18.305383] el1h_64_sync_handler+0x64/0xc8 [ 18.305389] el1h_64_sync+0x64/0x68 [ 18.305392] readl_relaxed+0x0/0x8 [ 18.305401] __mdiobus_write+0x3c/0x94 [ 18.305409] mdiobus_write+0x4c/0x70 [ 18.305415] phy_write+0x1c/0x24 [ 18.305419] bcm_phy_read_shadow+0x24/0x40 [ 18.305423] bcm_phy_led_brightness_set+0x40/0x94 [ 18.305428] phy_led_set_brightness+0x48/0x68 [ 18.305434] set_brightness_delayed_set_brightness+0x44/0x7c [ 18.305443] set_brightness_delayed+0xc4/0x1a4 [ 18.305447] process_one_work+0x1c0/0x284 [ 18.305455] process_scheduled_works+0x44/0x48 [ 18.305461] worker_thread+0x1e8/0x264 [ 18.305467] kthread+0xcc/0xdc [ 18.305474] ret_from_fork+0x10/0x20 [ 18.311812] Kernel Offset: disabled [ 18.311814] CPU features: 0x00000003,00010000,0000420b [ 18.311818] Memory Limit: none [ 18.566507] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]--- still not clear to me how the workqueue managed to execute and not finish before we unregistered the PHY device.
> Thanks for fixing this, this is an improvement, though I can still hit > another sort of use after free whereby the GENET driver removes the > mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO > block thus causing the following: Hi Florian Yes, i was not expecting this patch to fix that. But i was getting the NULL pointer dereference you pointed out with another setup, and this change does fix that part of the problem. > still not clear to me how the workqueue managed to execute and not finish > before we unregistered the PHY device. Me neither. I took a look at the MDIO bus driver and could not see anything obvious. I think you are going to have to scatter printk() in the code to get a clear understanding of the order things are done. Maybe it is another devm_ timing issue. Andrew
On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote: > Hi Andrew, > > On 6/17/2023 4:55 PM, Andrew Lunn wrote: > > If the core is left to remove the LEDs via devm_, it is performed too > > late, after the PHY driver is removed from the PHY. This results in > > dereferencing a NULL pointer when the LED core tries to turn the LED > > off before destroying the LED. > > > > Manually unregister the LEDs at a safe point in phy_remove. > > > > Cc: stable@vger.kernel.org > > Reported-by: Florian Fainelli <f.fainelli@gmail.com> > > Suggested-by: Florian Fainelli <f.fainelli@gmail.com> > > Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") > > Signed-off-by: Andrew Lunn <andrew@lunn.ch> > > Thanks for fixing this, this is an improvement, though I can still hit > another sort of use after free whereby the GENET driver removes the > mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO > block thus causing the following: Hi Florian, Can you try setting trigger_data->led_cdev to NULL after the cancel_delayed_work_sync() in netdev_trig_deactivate() and see what the effect is? Thanks.
Hi Russell, On 6/21/2023 6:04 PM, Russell King (Oracle) wrote: > On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote: >> Hi Andrew, >> >> On 6/17/2023 4:55 PM, Andrew Lunn wrote: >>> If the core is left to remove the LEDs via devm_, it is performed too >>> late, after the PHY driver is removed from the PHY. This results in >>> dereferencing a NULL pointer when the LED core tries to turn the LED >>> off before destroying the LED. >>> >>> Manually unregister the LEDs at a safe point in phy_remove. >>> >>> Cc: stable@vger.kernel.org >>> Reported-by: Florian Fainelli <f.fainelli@gmail.com> >>> Suggested-by: Florian Fainelli <f.fainelli@gmail.com> >>> Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") >>> Signed-off-by: Andrew Lunn <andrew@lunn.ch> >> >> Thanks for fixing this, this is an improvement, though I can still hit >> another sort of use after free whereby the GENET driver removes the >> mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO >> block thus causing the following: > > Hi Florian, > > Can you try setting trigger_data->led_cdev to NULL after the > cancel_delayed_work_sync() in netdev_trig_deactivate() and see > what the effect is? Thanks for the suggestion, getting an identical trace as before with that change.
On Wed, Jun 21, 2023 at 06:12:40PM +0100, Florian Fainelli wrote: > Hi Russell, > > On 6/21/2023 6:04 PM, Russell King (Oracle) wrote: > > On Wed, Jun 21, 2023 at 03:04:14PM +0100, Florian Fainelli wrote: > > > Hi Andrew, > > > > > > On 6/17/2023 4:55 PM, Andrew Lunn wrote: > > > > If the core is left to remove the LEDs via devm_, it is performed too > > > > late, after the PHY driver is removed from the PHY. This results in > > > > dereferencing a NULL pointer when the LED core tries to turn the LED > > > > off before destroying the LED. > > > > > > > > Manually unregister the LEDs at a safe point in phy_remove. > > > > > > > > Cc: stable@vger.kernel.org > > > > Reported-by: Florian Fainelli <f.fainelli@gmail.com> > > > > Suggested-by: Florian Fainelli <f.fainelli@gmail.com> > > > > Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") > > > > Signed-off-by: Andrew Lunn <andrew@lunn.ch> > > > > > > Thanks for fixing this, this is an improvement, though I can still hit > > > another sort of use after free whereby the GENET driver removes the > > > mdio-bcm-unimac platform device and eventually cuts the clock to the MDIO > > > block thus causing the following: > > > > Hi Florian, > > > > Can you try setting trigger_data->led_cdev to NULL after the > > cancel_delayed_work_sync() in netdev_trig_deactivate() and see > > what the effect is? > > Thanks for the suggestion, getting an identical trace as before with that > change. Thanks for trying. I was wondering whether the work was being re-queued after the flush_work(), but seemingly not.
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 17d0d0555a79..53598210be6c 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -3021,6 +3021,15 @@ static int phy_led_blink_set(struct led_classdev *led_cdev, return err; } +static void phy_leds_unregister(struct phy_device *phydev) +{ + struct phy_led *phyled; + + list_for_each_entry(phyled, &phydev->leds, list) { + led_classdev_unregister(&phyled->led_cdev); + } +} + static int of_phy_led(struct phy_device *phydev, struct device_node *led) { @@ -3054,7 +3063,7 @@ static int of_phy_led(struct phy_device *phydev, init_data.fwnode = of_fwnode_handle(led); init_data.devname_mandatory = true; - err = devm_led_classdev_register_ext(dev, cdev, &init_data); + err = led_classdev_register_ext(dev, cdev, &init_data); if (err) return err; @@ -3083,6 +3092,7 @@ static int of_phy_leds(struct phy_device *phydev) err = of_phy_led(phydev, led); if (err) { of_node_put(led); + phy_leds_unregister(phydev); return err; } } @@ -3305,6 +3315,9 @@ static int phy_remove(struct device *dev) cancel_delayed_work_sync(&phydev->state_queue); + if (IS_ENABLED(CONFIG_PHYLIB_LEDS)) + phy_leds_unregister(phydev); + phydev->state = PHY_DOWN; sfp_bus_del_upstream(phydev->sfp_bus);
If the core is left to remove the LEDs via devm_, it is performed too late, after the PHY driver is removed from the PHY. This results in dereferencing a NULL pointer when the LED core tries to turn the LED off before destroying the LED. Manually unregister the LEDs at a safe point in phy_remove. Cc: stable@vger.kernel.org Reported-by: Florian Fainelli <f.fainelli@gmail.com> Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Andrew Lunn <andrew@lunn.ch> --- drivers/net/phy/phy_device.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)