Message ID | 20230725030026.1664873-1-zyytlz.wz@163.com (mailing list archive) |
---|---|
State | Under Review |
Delegated to: | Geert Uytterhoeven |
Headers | show |
Series | [v4] net: ravb: Fix possible UAF bug in ravb_remove | expand |
On Tue, 25 Jul 2023 11:00:26 +0800 Zheng Wang wrote: > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > index 4d6b3b7d6abb..ce2da5101e51 100644 > --- a/drivers/net/ethernet/renesas/ravb_main.c > +++ b/drivers/net/ethernet/renesas/ravb_main.c > @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) > struct ravb_private *priv = netdev_priv(ndev); > const struct ravb_hw_info *info = priv->info; > > + netif_carrier_off(ndev); > + netif_tx_disable(ndev); > + cancel_work_sync(&priv->work); Still racy, the carrier can come back up after canceling the work. But whatever, this is a non-issue in the first place. The fact that ravb_tx_timeout_work doesn't take any locks seems much more suspicious.
On Tue, 2023-07-25 at 20:19 -0700, Jakub Kicinski wrote: > On Tue, 25 Jul 2023 11:00:26 +0800 Zheng Wang wrote: > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > > index 4d6b3b7d6abb..ce2da5101e51 100644 > > --- a/drivers/net/ethernet/renesas/ravb_main.c > > +++ b/drivers/net/ethernet/renesas/ravb_main.c > > @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) > > struct ravb_private *priv = netdev_priv(ndev); > > const struct ravb_hw_info *info = priv->info; > > > > + netif_carrier_off(ndev); > > + netif_tx_disable(ndev); > > + cancel_work_sync(&priv->work); > > Still racy, the carrier can come back up after canceling the work. I must admit I don't see how/when this driver sets the carrier on ?!? > But whatever, this is a non-issue in the first place. Do you mean the UaF can't happen? I think that is real. > The fact that ravb_tx_timeout_work doesn't take any locks seems much > more suspicious. Indeed! But that should be a different patch, right? Waiting a little more for feedback from renesas. /P
Hello! On 7/27/23 11:21 AM, Paolo Abeni wrote: [...] >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c >>> index 4d6b3b7d6abb..ce2da5101e51 100644 >>> --- a/drivers/net/ethernet/renesas/ravb_main.c >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c >>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) >>> struct ravb_private *priv = netdev_priv(ndev); >>> const struct ravb_hw_info *info = priv->info; >>> >>> + netif_carrier_off(ndev); >>> + netif_tx_disable(ndev); >>> + cancel_work_sync(&priv->work); >> >> Still racy, the carrier can come back up after canceling the work. > > I must admit I don't see how/when this driver sets the carrier on ?!? The phylib code does it for this MAC driver, see the call tree of phy_link_change(), on e.g. https://elixir.bootlin.com/linux/v6.5-rc3/source/... >> But whatever, this is a non-issue in the first place. > > Do you mean the UaF can't happen? I think that is real. Looks possible to me, at least now... and anyway, shouldn't we clean up after ourselves if we call schedule_work()?However my current impression is that cancel_work_sync() should be called from ravb_close(), after calling phy_{stop|disconnect}()... >> The fact that ravb_tx_timeout_work doesn't take any locks seems much >> more suspicious. > > Indeed! But that should be a different patch, right? Yes. > Waiting a little more for feedback from renesas. Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb driver patches, so I took that task upon myself. I also happen to be a nominal author of this driver... :-) > /P MBR, Sergey
On Thu, 27 Jul 2023 21:48:41 +0300 Sergey Shtylyov wrote: > >> Still racy, the carrier can come back up after canceling the work. > > > > I must admit I don't see how/when this driver sets the carrier on ?!? > > The phylib code does it for this MAC driver, see the call tree of > phy_link_change(), on e.g. https://elixir.bootlin.com/linux/v6.5-rc3/source/... > > >> But whatever, this is a non-issue in the first place. > > > > Do you mean the UaF can't happen? I think that is real. > > Looks possible to me, at least now... and anyway, shouldn't we clean up > after ourselves if we call schedule_work()?However my current impression is > that cancel_work_sync() should be called from ravb_close(), after calling > phy_{stop|disconnect}()... > > >> The fact that ravb_tx_timeout_work doesn't take any locks seems much > >> more suspicious. > > > > Indeed! But that should be a different patch, right? > > Yes. > > > Waiting a little more for feedback from renesas. > > Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb > driver patches, so I took that task upon myself. I also happen to be a nominal > author of this driver... :-) Simplest fix I can think of is to take a reference on the netdev before scheduling the work, and then check if it's still registered in the work itself. Wrap the timeout work in rtnl_lock() to avoid any races there.
On Thu, Jul 27, 2023 at 09:48:41PM +0300, Sergey Shtylyov wrote: > Hello! > > On 7/27/23 11:21 AM, Paolo Abeni wrote: > [...] > >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > >>> index 4d6b3b7d6abb..ce2da5101e51 100644 > >>> --- a/drivers/net/ethernet/renesas/ravb_main.c > >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c > >>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) > >>> struct ravb_private *priv = netdev_priv(ndev); > >>> const struct ravb_hw_info *info = priv->info; > >>> > >>> + netif_carrier_off(ndev); > >>> + netif_tx_disable(ndev); > >>> + cancel_work_sync(&priv->work); > >> > >> Still racy, the carrier can come back up after canceling the work. > > > > I must admit I don't see how/when this driver sets the carrier on ?!? > > The phylib code does it for this MAC driver, see the call tree of > phy_link_change(), on e.g. https://elixir.bootlin.com/linux/v6.5-rc3/source/... > > >> But whatever, this is a non-issue in the first place. > > > > Do you mean the UaF can't happen? I think that is real. > > Looks possible to me, at least now... and anyway, shouldn't we clean up > after ourselves if we call schedule_work()?However my current impression is > that cancel_work_sync() should be called from ravb_close(), after calling > phy_{stop|disconnect}()... > > >> The fact that ravb_tx_timeout_work doesn't take any locks seems much > >> more suspicious. > > > > Indeed! But that should be a different patch, right? > > Yes. > > > Waiting a little more for feedback from renesas. > > Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb > driver patches, so I took that task upon myself. I also happen to be a nominal > author of this driver... :-) FWIIW, that matches my recollection. Although it may be out of date by now.
On Tue, 25 Jul 2023, Zheng Wang wrote: > In ravb_probe, priv->work was bound with ravb_tx_timeout_work. > If timeout occurs, it will start the work. And if we call > ravb_remove without finishing the work, there may be a > use-after-free bug on ndev. > > Fix it by finishing the job before cleanup in ravb_remove. > > Note that this bug is found by static analysis, it might be > false positive. > > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> > --- > v4: > - add information about the bug was found suggested by Yunsheng Lin > v3: > - fix typo in commit message > v2: > - stop dev_watchdog so that handle no more timeout work suggested by Yunsheng Lin, > add an empty line to make code clear suggested by Sergey Shtylyov > --- > drivers/net/ethernet/renesas/ravb_main.c | 3 +++ > 1 file changed, 3 insertions(+) Trying my best not to sound like a broken record, but ... What's the latest with this fix? Is a v5 en route?
On Tue, 15 Aug 2023, Lee Jones wrote: > On Tue, 25 Jul 2023, Zheng Wang wrote: > > > In ravb_probe, priv->work was bound with ravb_tx_timeout_work. > > If timeout occurs, it will start the work. And if we call > > ravb_remove without finishing the work, there may be a > > use-after-free bug on ndev. > > > > Fix it by finishing the job before cleanup in ravb_remove. > > > > Note that this bug is found by static analysis, it might be > > false positive. > > > > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> > > --- > > v4: > > - add information about the bug was found suggested by Yunsheng Lin > > v3: > > - fix typo in commit message > > v2: > > - stop dev_watchdog so that handle no more timeout work suggested by Yunsheng Lin, > > add an empty line to make code clear suggested by Sergey Shtylyov > > --- > > drivers/net/ethernet/renesas/ravb_main.c | 3 +++ > > 1 file changed, 3 insertions(+) > > Trying my best not to sound like a broken record, but ... > > What's the latest with this fix? Is a v5 en route? Any update please Zheng Wang?
Sorry for my late reply. I'll update another patch later today. Best regards, Zheng Lee Jones <lee@kernel.org> 于2023年8月29日周二 21:46写道: > > On Tue, 15 Aug 2023, Lee Jones wrote: > > > On Tue, 25 Jul 2023, Zheng Wang wrote: > > > > > In ravb_probe, priv->work was bound with ravb_tx_timeout_work. > > > If timeout occurs, it will start the work. And if we call > > > ravb_remove without finishing the work, there may be a > > > use-after-free bug on ndev. > > > > > > Fix it by finishing the job before cleanup in ravb_remove. > > > > > > Note that this bug is found by static analysis, it might be > > > false positive. > > > > > > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") > > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > > Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> > > > --- > > > v4: > > > - add information about the bug was found suggested by Yunsheng Lin > > > v3: > > > - fix typo in commit message > > > v2: > > > - stop dev_watchdog so that handle no more timeout work suggested by Yunsheng Lin, > > > add an empty line to make code clear suggested by Sergey Shtylyov > > > --- > > > drivers/net/ethernet/renesas/ravb_main.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > Trying my best not to sound like a broken record, but ... > > > > What's the latest with this fix? Is a v5 en route? > > Any update please Zheng Wang? > > -- > Lee Jones [李琼斯]
Hi everyone, After reviewing all comments about the patch. I agree with Jakub. But adding reference on net_device is a big move. All related drivers must modify the code. For now, I couldn't think a better idea about the fix. Thanks for your effort and sorry for my late reply. Best regards, Zheng Wang Zheng Hacker <hackerzheng666@gmail.com> 于2023年8月30日周三 12:30写道: > > Sorry for my late reply. I'll update another patch later today. > > Best regards, > Zheng > > Lee Jones <lee@kernel.org> 于2023年8月29日周二 21:46写道: > > > > On Tue, 15 Aug 2023, Lee Jones wrote: > > > > > On Tue, 25 Jul 2023, Zheng Wang wrote: > > > > > > > In ravb_probe, priv->work was bound with ravb_tx_timeout_work. > > > > If timeout occurs, it will start the work. And if we call > > > > ravb_remove without finishing the work, there may be a > > > > use-after-free bug on ndev. > > > > > > > > Fix it by finishing the job before cleanup in ravb_remove. > > > > > > > > Note that this bug is found by static analysis, it might be > > > > false positive. > > > > > > > > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") > > > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com> > > > > Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> > > > > --- > > > > v4: > > > > - add information about the bug was found suggested by Yunsheng Lin > > > > v3: > > > > - fix typo in commit message > > > > v2: > > > > - stop dev_watchdog so that handle no more timeout work suggested by Yunsheng Lin, > > > > add an empty line to make code clear suggested by Sergey Shtylyov > > > > --- > > > > drivers/net/ethernet/renesas/ravb_main.c | 3 +++ > > > > 1 file changed, 3 insertions(+) > > > > > > Trying my best not to sound like a broken record, but ... > > > > > > What's the latest with this fix? Is a v5 en route? > > > > Any update please Zheng Wang? > > > > -- > > Lee Jones [李琼斯]
Hello Sergey! > From: Sergey Shtylyov, Sent: Friday, July 28, 2023 3:49 AM > > Hello! > > On 7/27/23 11:21 AM, Paolo Abeni wrote: > [...] > >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > >>> index 4d6b3b7d6abb..ce2da5101e51 100644 > >>> --- a/drivers/net/ethernet/renesas/ravb_main.c > >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c > >>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) > >>> struct ravb_private *priv = netdev_priv(ndev); > >>> const struct ravb_hw_info *info = priv->info; > >>> > >>> + netif_carrier_off(ndev); > >>> + netif_tx_disable(ndev); > >>> + cancel_work_sync(&priv->work); > >> > >> Still racy, the carrier can come back up after canceling the work. > > > > I must admit I don't see how/when this driver sets the carrier on ?!? > > The phylib code does it for this MAC driver, see the call tree of > phy_link_change(), on e.g. > https://elixir.bootlin.com/linux/v6.5-rc3/source > > >> But whatever, this is a non-issue in the first place. > > > > Do you mean the UaF can't happen? I think that is real. > > Looks possible to me, at least now... and anyway, shouldn't we clean up > after ourselves if we call schedule_work()?However my current impression is > that cancel_work_sync() should be called from ravb_close(), after calling > phy_{stop|disconnect}()... I also think so. ravb_remove() calls unregister_netdev(). -> unregister_netdev() calls rtnl_lock() and unregister_netdevice(). --> unregiter_netdevice_queue() ---> unregiter_netdevice_many() ----> unregiter_netdevice_many_notify(). -----> dev_close_many() ------> __dev_close_many() -------> ops->ndo_stop() ravb_close() calls phy_stop() -> phy_state_machine() with PHY_HALTED --> phy_link_down() ---> phy_link_change() ----> netif_carrier_off() The patch will be the following: --- drivers/net/ethernet/renesas/ravb_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 7df9f9f8e134..e452d90de7c2 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -2167,6 +2167,8 @@ static int ravb_close(struct net_device *ndev) of_phy_deregister_fixed_link(np); } + cancel_work_sync(&priv->work); + if (info->multi_irqs) { free_irq(priv->tx_irqs[RAVB_NC], ndev); free_irq(priv->rx_irqs[RAVB_NC], ndev); --- If this patch is acceptable, I'll submit it. But, what do you think? Best regards, Yoshihiro Shimoda > >> The fact that ravb_tx_timeout_work doesn't take any locks seems much > >> more suspicious. > > > > Indeed! But that should be a different patch, right? > > Yes. > > > Waiting a little more for feedback from renesas. > > Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb > driver patches, so I took that task upon myself. I also happen to be a nominal > author of this driver... :-) > > > /P > > MBR, Sergey
Hello! On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote: Sorry, I got ill that same day and still have subfebrile temperature, and I forgot about your mail. I'll try replying to it on this weekend... [...] MBR, Sergey
Hello Sergey, > From: Sergey Shtylyov, Sent: Saturday, September 30, 2023 5:23 AM > > Hello! > > On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote: > > Sorry, I got ill that same day and still have subfebrile temperature, > and I forgot about your mail. I'll try replying to it on this weekend... Thank you for your reply! I understood it. Please take care of yourself. I hope you will get well soon. Best regards, Yoshihiro Shimoda > [...] > > MBR, Sergey
Hello! Concerning the subject: I doubt that UAF acronym is known to everybody (e.g. it wasn't known to me), I think we should be able to afford spelling out use-after-free there... On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote: [...] >>>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c >>>>> index 4d6b3b7d6abb..ce2da5101e51 100644 >>>>> --- a/drivers/net/ethernet/renesas/ravb_main.c >>>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c >>>>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) >>>>> struct ravb_private *priv = netdev_priv(ndev); >>>>> const struct ravb_hw_info *info = priv->info; >>>>> >>>>> + netif_carrier_off(ndev); >>>>> + netif_tx_disable(ndev); >>>>> + cancel_work_sync(&priv->work); >>>> >>>> Still racy, the carrier can come back up after canceling the work. >>> >>> I must admit I don't see how/when this driver sets the carrier on ?!? >> >> The phylib code does it for this MAC driver, see the call tree of >> phy_link_change(), on e.g. >> https://elixir.bootlin.com/linux/v6.5-rc3/source >> >>>> But whatever, this is a non-issue in the first place. >>> >>> Do you mean the UaF can't happen? I think that is real. >> >> Looks possible to me, at least now... and anyway, shouldn't we clean up >> after ourselves if we call schedule_work()?However my current impression is >> that cancel_work_sync() should be called from ravb_close(), after calling >> phy_{stop|disconnect}()... > > I also think so. > > ravb_remove() calls unregister_netdev(). > -> unregister_netdev() calls rtnl_lock() and unregister_netdevice(). > --> unregiter_netdevice_queue() > ---> unregiter_netdevice_many() > ----> unregiter_netdevice_many_notify(). > -----> dev_close_many() > ------> __dev_close_many() > -------> ops->ndo_stop() > > ravb_close() calls phy_stop() > -> phy_state_machine() with PHY_HALTED > --> phy_link_down() > ---> phy_link_change() > ----> netif_carrier_off() Thanks for sharing the call chain, I've followed it once again... :-) > The patch will be the following: > --- > drivers/net/ethernet/renesas/ravb_main.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > index 7df9f9f8e134..e452d90de7c2 100644 > --- a/drivers/net/ethernet/renesas/ravb_main.c > +++ b/drivers/net/ethernet/renesas/ravb_main.c > @@ -2167,6 +2167,8 @@ static int ravb_close(struct net_device *ndev) > of_phy_deregister_fixed_link(np); > } > > + cancel_work_sync(&priv->work); > + > if (info->multi_irqs) { > free_irq(priv->tx_irqs[RAVB_NC], ndev); > free_irq(priv->rx_irqs[RAVB_NC], ndev); > --- > > If this patch is acceptable, I'll submit it. But, what do you think? I think it should do the job. And I suspect you can even test it... :-) > Best regards, > Yoshihiro Shimoda [...] MBR, Sergey
Hello Sergey, > From: Sergey Shtylyov, Sent: Wednesday, October 4, 2023 4:39 AM > > Hello! > > Concerning the subject: I doubt that UAF acronym is known to > everybody (e.g. it wasn't known to me), I think we should be able > to afford spelling out use-after-free there... I got it. I'll change the subject. > On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote: > [...] > > >>>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > >>>>> index 4d6b3b7d6abb..ce2da5101e51 100644 > >>>>> --- a/drivers/net/ethernet/renesas/ravb_main.c > >>>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c > >>>>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) > >>>>> struct ravb_private *priv = netdev_priv(ndev); > >>>>> const struct ravb_hw_info *info = priv->info; > >>>>> > >>>>> + netif_carrier_off(ndev); > >>>>> + netif_tx_disable(ndev); > >>>>> + cancel_work_sync(&priv->work); > >>>> > >>>> Still racy, the carrier can come back up after canceling the work. > >>> > >>> I must admit I don't see how/when this driver sets the carrier on ?!? > >> > >> The phylib code does it for this MAC driver, see the call tree of > >> phy_link_change(), on e.g. > >> <snip URL> > >> > >>>> But whatever, this is a non-issue in the first place. > >>> > >>> Do you mean the UaF can't happen? I think that is real. > >> > >> Looks possible to me, at least now... and anyway, shouldn't we clean up > >> after ourselves if we call schedule_work()?However my current impression is > >> that cancel_work_sync() should be called from ravb_close(), after calling > >> phy_{stop|disconnect}()... > > > > I also think so. > > > > ravb_remove() calls unregister_netdev(). > > -> unregister_netdev() calls rtnl_lock() and unregister_netdevice(). > > --> unregiter_netdevice_queue() > > ---> unregiter_netdevice_many() > > ----> unregiter_netdevice_many_notify(). > > -----> dev_close_many() > > ------> __dev_close_many() > > -------> ops->ndo_stop() > > > > ravb_close() calls phy_stop() > > -> phy_state_machine() with PHY_HALTED > > --> phy_link_down() > > ---> phy_link_change() > > ----> netif_carrier_off() > > Thanks for sharing the call chain, I've followed it once again... :-) Thank you :) > > The patch will be the following: > > --- > > drivers/net/ethernet/renesas/ravb_main.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > > index 7df9f9f8e134..e452d90de7c2 100644 > > --- a/drivers/net/ethernet/renesas/ravb_main.c > > +++ b/drivers/net/ethernet/renesas/ravb_main.c > > @@ -2167,6 +2167,8 @@ static int ravb_close(struct net_device *ndev) > > of_phy_deregister_fixed_link(np); > > } > > > > + cancel_work_sync(&priv->work); > > + > > if (info->multi_irqs) { > > free_irq(priv->tx_irqs[RAVB_NC], ndev); > > free_irq(priv->rx_irqs[RAVB_NC], ndev); > > --- > > > > If this patch is acceptable, I'll submit it. But, what do you think? > > I think it should do the job. Thank you for your comment! I'll make such a patch. > And I suspect you can even test it... :-) IIUC, causing tx timeout is difficult. But, I guess we can add a fault injection code somehow. Best regards, Yoshihiro Shimoda > > Best regards, > > Yoshihiro Shimoda > > [...] > > MBR, Sergey
On 26.07.2023 05:19, Jakub Kicinski wrote: ... > The fact that ravb_tx_timeout_work doesn't take any locks seems much > more suspicious. Does anybody plan to look into this, too? Best regards Dirk
Hello Behme, > From: Behme Dirk (CM/ESO2), Sent: Tuesday, October 10, 2023 9:59 PM > > On 26.07.2023 05:19, Jakub Kicinski wrote: > ... > > The fact that ravb_tx_timeout_work doesn't take any locks seems much > > more suspicious. > Does anybody plan to look into this, too? I believe my fixed patch [1] resolved this issue too. Let me explain it in detail below. In the thread, Jakub also mentioned [2] like below: --- Simplest fix I can think of is to take a reference on the netdev before scheduling the work, and then check if it's still registered in the work itself. Wrap the timeout work in rtnl_lock() to avoid any races there. --- Sergey suggested to add cancel_work_sync() into the ravb_close () [3]. And I investigated calltrace, and then the ravb_close() is under rtnl_lock() [4] like below: ----------------------------------------------------------------------- ravb_remove() calls unregister_netdev(). -> unregister_netdev() calls rtnl_lock() and unregister_netdevice(). --> unregiter_netdevice_queue() ---> unregiter_netdevice_many() ----> unregiter_netdevice_many_notify(). -----> dev_close_many() ------> __dev_close_many() -------> ops->ndo_stop() ravb_close() calls phy_stop() -> phy_state_machine() with PHY_HALTED --> phy_link_down() ---> phy_link_change() ----> netif_carrier_off() ----------------------------------------------------------------------- So, during cancel_work_sync() is waiting for canceling the workqueue in ravb_close(), it's under rtnl_lock() so that no additional locks are needed in ravb_tx_timeout_work(). [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3971442870713de527684398416970cf025b4f89 [2] https://lore.kernel.org/netdev/20230727164820.48c9e685@kernel.org/ [3] https://lore.kernel.org/netdev/607f4fe4-5a59-39dd-71c2-0cf769b48187@omp.ru/ [4] https://lore.kernel.org/netdev/OSYPR01MB53341CFDBB49A3BA41A6752CD8F9A@OSYPR01MB5334.jpnprd01.prod.outlook.com/ Best regards, Yoshihiro Shimoda > Best regards > > Dirk
Hi, On 12.10.2023 10:39, Yoshihiro Shimoda wrote: > Hello Behme, > >> From: Behme Dirk (CM/ESO2), Sent: Tuesday, October 10, 2023 9:59 PM >> >> On 26.07.2023 05:19, Jakub Kicinski wrote: >> ... >>> The fact that ravb_tx_timeout_work doesn't take any locks seems much >>> more suspicious. >> Does anybody plan to look into this, too? > > I believe my fixed patch [1] resolved this issue too. I'm not an expert of this driver nor the network stack, so sorry if I'm totally wrong here ;) But somehow this answer confuses me. Let me explain: What you did with [1] is to stop/cancel the workqueue in ravb_close(). That's fine. But that is at driver's close time. What's about driver's normal runtime? What I understood is that ravb_tx_timeout_work() might run totally asynchronously to the rest of the driver. And therefore needs locking/protection/synchronization if it uses resources of the driver which are used elsewhere in the driver, too. I think this is exactly what is described with: > --- > Simplest fix I can think of is to take a reference on the netdev before > scheduling the work, and then check if it's still registered in the work > itself. Wrap the timeout work in rtnl_lock() to avoid any races there. > --- So, where is above done? Not at close time, but at normal run time of the driver? Best regards Dirk > Sergey suggested to add cancel_work_sync() into the ravb_close () [3]. > And I investigated calltrace, and then the ravb_close() is under rtnl_lock() [4] > like below: > ----------------------------------------------------------------------- > ravb_remove() calls unregister_netdev(). > -> unregister_netdev() calls rtnl_lock() and unregister_netdevice(). > --> unregiter_netdevice_queue() > ---> unregiter_netdevice_many() > ----> unregiter_netdevice_many_notify(). > -----> dev_close_many() > ------> __dev_close_many() > -------> ops->ndo_stop() > > ravb_close() calls phy_stop() > -> phy_state_machine() with PHY_HALTED > --> phy_link_down() > ---> phy_link_change() > ----> netif_carrier_off() > ----------------------------------------------------------------------- > > So, during cancel_work_sync() is waiting for canceling the workqueue in ravb_close(), > it's under rtnl_lock() so that no additional locks are needed in ravb_tx_timeout_work(). > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3971442870713de527684398416970cf025b4f89 > [2] https://lore.kernel.org/netdev/20230727164820.48c9e685@kernel.org/ > [3] https://lore.kernel.org/netdev/607f4fe4-5a59-39dd-71c2-0cf769b48187@omp.ru/ > [4] https://lore.kernel.org/netdev/OSYPR01MB53341CFDBB49A3BA41A6752CD8F9A@OSYPR01MB5334.jpnprd01.prod.outlook.com/ > > Best regards, > Yoshihiro Shimoda > >> Best regards >> >> Dirk
Hi, > From: Behme Dirk (CM/ESO2), Sent: Friday, October 13, 2023 3:05 PM > > Hi, > > On 12.10.2023 10:39, Yoshihiro Shimoda wrote: > > Hello Behme, > > > >> From: Behme Dirk (CM/ESO2), Sent: Tuesday, October 10, 2023 9:59 PM > >> > >> On 26.07.2023 05:19, Jakub Kicinski wrote: > >> ... > >>> The fact that ravb_tx_timeout_work doesn't take any locks seems much > >>> more suspicious. > >> Does anybody plan to look into this, too? > > > > I believe my fixed patch [1] resolved this issue too. > > > I'm not an expert of this driver nor the network stack, so sorry if I'm > totally wrong here ;) But somehow this answer confuses me. Let me explain: > > What you did with [1] is to stop/cancel the workqueue in ravb_close(). > That's fine. But that is at driver's close time. > > What's about driver's normal runtime? What I understood is that > ravb_tx_timeout_work() might run totally asynchronously to the rest of > the driver. And therefore needs locking/protection/synchronization if it > uses resources of the driver which are used elsewhere in the driver, too. > > I think this is exactly what is described with: > > > --- > > Simplest fix I can think of is to take a reference on the netdev before > > scheduling the work, and then check if it's still registered in the work > > itself. Wrap the timeout work in rtnl_lock() to avoid any races there. > > --- > > So, where is above done? Not at close time, but at normal run time of > the driver? Thank you very much for your detailed explanation. I understood it. ravb_tx_timeout_work() still has races between ethtool ops for instance. So, I'll make a patch for it by early next week. However, IIUC, using rtnl_lock() in ravb_tx_timeout_work() is possible to cause deadlock from cancel_work_sync() in ravb_close(). So, I'll use rtnl_trylock() instead. Best regards, Yoshihiro Shimoda > Best regards > > Dirk > > > Sergey suggested to add cancel_work_sync() into the ravb_close () [3]. > > And I investigated calltrace, and then the ravb_close() is under rtnl_lock() [4] > > like below: > > ----------------------------------------------------------------------- > > ravb_remove() calls unregister_netdev(). > > -> unregister_netdev() calls rtnl_lock() and unregister_netdevice(). > > --> unregiter_netdevice_queue() > > ---> unregiter_netdevice_many() > > ----> unregiter_netdevice_many_notify(). > > -----> dev_close_many() > > ------> __dev_close_many() > > -------> ops->ndo_stop() > > > > ravb_close() calls phy_stop() > > -> phy_state_machine() with PHY_HALTED > > --> phy_link_down() > > ---> phy_link_change() > > ----> netif_carrier_off() > > ----------------------------------------------------------------------- > > > > So, during cancel_work_sync() is waiting for canceling the workqueue in ravb_close(), > > it's under rtnl_lock() so that no additional locks are needed in ravb_tx_timeout_work(). > > > > [1] > https://git.kernel.org/pub/scm/linux/kernel/git%25 > 2Fnetdev%2Fnet.git%2Fcommit%2F%3Fid%3D3971442870713de527684398416970cf025b4f89&data=05%7C01%7Cyoshihiro.shimoda.uh%4 > 0renesas.com%7C466e046b20b548b264f808dbcbb255f6%7C53d82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUn > known%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=HkA8f5a > gawjXMvAGkaE6tELaSpjpbIn7M3mU5xbDTD0%3D&reserved=0 > > [2] > https://lore.kernel.org/netdev/20230727164820.48c9e685 > %40kernel.org%2F&data=05%7C01%7Cyoshihiro.shimoda.uh%40renesas.com%7C466e046b20b548b264f808dbcbb255f6%7C53d82571da19 > 47e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTi > I6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=cGvnA8WqxM%2FUDa%2FNS2OBztr1IWgjCX4IzBYXe1LGkZU%3D&reserved=0 > > [3] > https://lore.kernel.org/netdev/607f4fe4-5a59-39dd-71c2 > -0cf769b48187%40omp.ru%2F&data=05%7C01%7Cyoshihiro.shimoda.uh%40renesas.com%7C466e046b20b548b264f808dbcbb255f6%7C53d > 82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM > zIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OWwBKy%2Fdckgo3clPPfn2hxE4H6ToyqdcbhPhGoqoo30%3D&reserved=0 > > [4] > https://lore.kernel.org/netdev/OSYPR01MB53341CFDBB49A3 > BA41A6752CD8F9A%40OSYPR01MB5334.jpnprd01.prod.outlook.com%2F&data=05%7C01%7Cyoshihiro.shimoda.uh%40renesas.com%7C466 > e046b20b548b264f808dbcbb255f6%7C53d82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUnknown%7CTWFpbGZsb3 > d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Jfypf10jiUfTqWUAukjnPzIQp > urx7m0ETF5N2Toq8wE%3D&reserved=0 > > > > Best regards, > > Yoshihiro Shimoda > > > >> Best regards > >> > >> Dirk > > -- > ====================================================================== > Dirk Behme Robert Bosch Car Multimedia GmbH > CM/ESO2 > Phone: +49 5121 49-3274 Dirk Behme > Fax: +49 711 811 5053274 PO Box 77 77 77 > mailto:dirk.behme@de.bosch.com D-31132 Hildesheim - Germany > > Bosch Group, Car Multimedia (CM) > Engineering SW Operating Systems 2 (ESO2) > > Robert Bosch Car Multimedia GmbH - Ein Unternehmen der Bosch Gruppe > Sitz: Hildesheim > Registergericht: Amtsgericht Hildesheim HRB 201334 > Aufsichtsratsvorsitzender: Dr. Dirk Hoheisel > Geschäftsführung: Dr. Steffen Berns; > Dr. Sven Ost, Jörg Pollak, Dr. Walter Schirm > ======================================================================
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 4d6b3b7d6abb..ce2da5101e51 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev) struct ravb_private *priv = netdev_priv(ndev); const struct ravb_hw_info *info = priv->info; + netif_carrier_off(ndev); + netif_tx_disable(ndev); + cancel_work_sync(&priv->work); /* Stop PTP Clock driver */ if (info->ccc_gac) ravb_ptp_stop(ndev);