diff mbox series

[net] ravb: Fix races between ravb_tx_timeout_work() and net related ops

Message ID 20231017085341.813335-1-yoshihiro.shimoda.uh@renesas.com (mailing list archive)
State Superseded
Delegated to: Geert Uytterhoeven
Headers show
Series [net] ravb: Fix races between ravb_tx_timeout_work() and net related ops | expand

Commit Message

Yoshihiro Shimoda Oct. 17, 2023, 8:53 a.m. UTC
Fix races between ravb_tx_timeout_work() and functions of net_device_ops
and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
---
 drivers/net/ethernet/renesas/ravb_main.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Sergey Shtylyov Oct. 17, 2023, 6:58 p.m. UTC | #1
Hello!

On 10/17/23 11:53 AM, Yoshihiro Shimoda wrote:

> Fix races between ravb_tx_timeout_work() and functions of net_device_ops
> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
> since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock.

   I don't quite follow... how calling cancel_work_sync() is a problem?
I thought the problem was that unregister_netdev() can be called with
the TX timeout work still pending? And, more generally, shouldn't we
protect against the TX timeout work being executed on a different CPU
than the {net_device|ethtool}_ops methods are being executed (is that
possible?)?
   I also had a suspicion that we still miss the spinlock calls in
ravb_tx_timeout_work() but nothing in particular jumped at me...
mind looking into that? :-)

> Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  drivers/net/ethernet/renesas/ravb_main.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 0ef0b88b7145..b53533ab4599 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
[...]
> @@ -1907,6 +1910,7 @@ static void ravb_tx_timeout_work(struct work_struct *work)
>  		 */
>  		netdev_err(ndev, "%s: ravb_dmac_init() failed, error %d\n",
>  			   __func__, error);
> +		rtnl_unlock();
>  		return;

   Perhaps *goto* instead here?

>  	}
>  	ravb_emac_init(ndev);
> @@ -1917,6 +1921,7 @@ static void ravb_tx_timeout_work(struct work_struct *work)
>  		ravb_ptp_init(ndev, priv->pdev);
>  
>  	netif_tx_start_all_queues(ndev);

   ... and add label here?

> +	rtnl_unlock();
>  }

MBR, Sergey
Yoshihiro Shimoda Oct. 18, 2023, 9:39 a.m. UTC | #2
Hello Sergey,

> From: Sergey Shtylyov, Sent: Wednesday, October 18, 2023 3:59 AM
> 
> Hello!
> 
> On 10/17/23 11:53 AM, Yoshihiro Shimoda wrote:
> 
> > Fix races between ravb_tx_timeout_work() and functions of net_device_ops
> > and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
> > since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
> > ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock.
> 
>    I don't quite follow... how calling cancel_work_sync() is a problem?
> I thought the problem was that unregister_netdev() can be called with
> the TX timeout work still pending? And, more generally, shouldn't we
> protect against the TX timeout work being executed on a different CPU
> than the {net_device|ethtool}_ops methods are being executed (is that
> possible?)?

__dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop().
So, the ravb_close() is under rtnl lock. While locking the rtnl, it's
possible to call ravb_tx_timeout_work() on other CPU. In such a case,
it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work()

CPU0				CPU1
				ravb_tx_timeout()
				schedule_work()
...
__dev_close_many()
// this is under rtnl lock
ravb_close()
cancel_work_sync()
				ravb_tx_timeout_work()
				rtnl_lock()
				// this is possible to cause a deadlock

>    I also had a suspicion that we still miss the spinlock calls in
> ravb_tx_timeout_work() but nothing in particular jumped at me...
> mind looking into that? :-)

Yes, perhaps we have to check it somehow...

> > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
> > Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> > ---
> >  drivers/net/ethernet/renesas/ravb_main.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> > index 0ef0b88b7145..b53533ab4599 100644
> > --- a/drivers/net/ethernet/renesas/ravb_main.c
> > +++ b/drivers/net/ethernet/renesas/ravb_main.c
> [...]
> > @@ -1907,6 +1910,7 @@ static void ravb_tx_timeout_work(struct work_struct *work)
> >  		 */
> >  		netdev_err(ndev, "%s: ravb_dmac_init() failed, error %d\n",
> >  			   __func__, error);
> > +		rtnl_unlock();
> >  		return;
> 
>    Perhaps *goto* instead here?

...

> >  	}
> >  	ravb_emac_init(ndev);
> > @@ -1917,6 +1921,7 @@ static void ravb_tx_timeout_work(struct work_struct *work)
> >  		ravb_ptp_init(ndev, priv->pdev);
> >
> >  	netif_tx_start_all_queues(ndev);
> 
>    ... and add label here?

I got it. Using goto is better, I think.

Best regards,
Yoshihiro Shimoda

> > +	rtnl_unlock();
> >  }
> 
> MBR, Sergey
Sergey Shtylyov Oct. 18, 2023, 5:27 p.m. UTC | #3
On 10/18/23 12:39 PM, Yoshihiro Shimoda wrote:
[...]
>>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops
>>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
>>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
>>> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock.
>>
>>    I don't quite follow... how calling cancel_work_sync() is a problem?
>> I thought the problem was that unregister_netdev() can be called with
>> the TX timeout work still pending? And, more generally, shouldn't we
>> protect against the TX timeout work being executed on a different CPU
>> than the {net_device|ethtool}_ops methods are being executed (is that
>> possible?)?
> 
> __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop().
> So, the ravb_close() is under rtnl lock. While locking the rtnl, it's
> possible to call ravb_tx_timeout_work() on other CPU. In such a case,
> it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work()
> 
> CPU0				CPU1
> 				ravb_tx_timeout()
> 				schedule_work()
> ...
> __dev_close_many()
> // this is under rtnl lock
> ravb_close()
> cancel_work_sync()
> 				ravb_tx_timeout_work()
> 				rtnl_lock()
> 				// this is possible to cause a deadlock

   Ah, cancel_work_sync() means we have to wait for the work to
finish -- indeed a deadlock is possiblet then. 
>>    I also had a suspicion that we still miss the spinlock calls in
>> ravb_tx_timeout_work() but nothing in particular jumped at me...

   We mainly need to protect against the interrupts in this case...

>> mind looking into that? :-)
> 
> Yes, perhaps we have to check it somehow...

   Unfortunately, I don't seem to have no bandwidth to do that myself...

[...]

MBR, Sergey
Yoshihiro Shimoda Oct. 19, 2023, 12:40 a.m. UTC | #4
Hello Sergey,

> From: Sergey Shtylyov, Sent: Thursday, October 19, 2023 2:27 AM
> 
> On 10/18/23 12:39 PM, Yoshihiro Shimoda wrote:
> [...]
> >>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops
> >>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
> >>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
> >>> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock.
> >>
> >>    I don't quite follow... how calling cancel_work_sync() is a problem?
> >> I thought the problem was that unregister_netdev() can be called with
> >> the TX timeout work still pending? And, more generally, shouldn't we
> >> protect against the TX timeout work being executed on a different CPU
> >> than the {net_device|ethtool}_ops methods are being executed (is that
> >> possible?)?
> >
> > __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop().
> > So, the ravb_close() is under rtnl lock. While locking the rtnl, it's
> > possible to call ravb_tx_timeout_work() on other CPU. In such a case,
> > it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work()
> >
> > CPU0				CPU1
> > 				ravb_tx_timeout()
> > 				schedule_work()
> > ...
> > __dev_close_many()
> > // this is under rtnl lock
> > ravb_close()
> > cancel_work_sync()
> > 				ravb_tx_timeout_work()
> > 				rtnl_lock()
> > 				// this is possible to cause a deadlock
> 
>    Ah, cancel_work_sync() means we have to wait for the work to
> finish -- indeed a deadlock is possiblet then.

Thank you for your reply. I'll update commit description in v2.

> >>    I also had a suspicion that we still miss the spinlock calls in
> >> ravb_tx_timeout_work() but nothing in particular jumped at me...
> 
>    We mainly need to protect against the interrupts in this case...

I think so. However, we can not use spin_lock_irqsave() for whole this
ravb_tx_timeout_work() because ravb_ring_init() calls kcalloc() with GFP_KERNEL.

> >> mind looking into that? :-)
> >
> > Yes, perhaps we have to check it somehow...
> 
>    Unfortunately, I don't seem to have no bandwidth to do that myself...

I got it. I'll investigate this later.

Best regards,
Yoshihiro Shimoda

> [...]
> 
> MBR, Sergey
diff mbox series

Patch

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 0ef0b88b7145..b53533ab4599 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1874,6 +1874,9 @@  static void ravb_tx_timeout_work(struct work_struct *work)
 	struct net_device *ndev = priv->ndev;
 	int error;
 
+	if (!rtnl_trylock())
+		return;
+
 	netif_tx_stop_all_queues(ndev);
 
 	/* Stop PTP Clock driver */
@@ -1907,6 +1910,7 @@  static void ravb_tx_timeout_work(struct work_struct *work)
 		 */
 		netdev_err(ndev, "%s: ravb_dmac_init() failed, error %d\n",
 			   __func__, error);
+		rtnl_unlock();
 		return;
 	}
 	ravb_emac_init(ndev);
@@ -1917,6 +1921,7 @@  static void ravb_tx_timeout_work(struct work_struct *work)
 		ravb_ptp_init(ndev, priv->pdev);
 
 	netif_tx_start_all_queues(ndev);
+	rtnl_unlock();
 }
 
 /* Packet transmit function for Ethernet AVB */