Message ID | 20240331142353.93792-2-marex@denx.de (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [1/2] net: ks8851: Inline ks8851_rx_skb() | expand |
On 2024-03-31 at 19:51:46, Marek Vasut (marex@denx.de) wrote: > The ks8851_irq() thread may call ks8851_rx_pkts() in case there are > any packets in the MAC FIFO, which calls netif_rx(). This netif_rx() > implementation is guarded by local_bh_disable() and local_bh_enable(). > The local_bh_enable() may call do_softirq() to run softirqs in case > any are pending. One of the softirqs is net_rx_action, which ultimately > reaches the driver .start_xmit callback. If that happens, the system > hangs. The entire call chain is below: > > ks8851_start_xmit_par from netdev_start_xmit > netdev_start_xmit from dev_hard_start_xmit > dev_hard_start_xmit from sch_direct_xmit > sch_direct_xmit from __dev_queue_xmit > __dev_queue_xmit from __neigh_update > __neigh_update from neigh_update > neigh_update from arp_process.constprop.0 > arp_process.constprop.0 from __netif_receive_skb_one_core > __netif_receive_skb_one_core from process_backlog > process_backlog from __napi_poll.constprop.0 > __napi_poll.constprop.0 from net_rx_action > net_rx_action from __do_softirq > __do_softirq from call_with_stack > call_with_stack from do_softirq > do_softirq from __local_bh_enable_ip > __local_bh_enable_ip from netif_rx > netif_rx from ks8851_irq > ks8851_irq from irq_thread_fn > irq_thread_fn from irq_thread > irq_thread from kthread > kthread from ret_from_fork > > The hang happens because ks8851_irq() first locks a spinlock in > ks8851_par.c ks8851_lock_par() spin_lock_irqsave(&ksp->lock, ...) > and with that spinlock locked, calls netif_rx(). Once the execution > reaches ks8851_start_xmit_par(), it calls ks8851_lock_par() again > which attempts to claim the already locked spinlock again, and the > hang happens. > > Move the do_softirq() call outside of the spinlock protected section > of ks8851_irq() by disabling BHs around the entire spinlock protected > section of ks8851_irq() handler. Place local_bh_enable() outside of > the spinlock protected section, so that it can trigger do_softirq() > without the ks8851_par.c ks8851_lock_par() spinlock being held, and > safely call ks8851_start_xmit_par() without attempting to lock the > already locked spinlock. > > Since ks8851_irq() is protected by local_bh_disable()/local_bh_enable() > now, replace netif_rx() with __netif_rx() which is not duplicating the > local_bh_disable()/local_bh_enable() calls. > > Fixes: 797047f875b5 ("net: ks8851: Implement Parallel bus operations") > Signed-off-by: Marek Vasut <marex@denx.de> > --- > Cc: "David S. Miller" <davem@davemloft.net> > Cc: "Uwe Kleine-König" <u.kleine-koenig@pengutronix.de> > Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> > Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> > Cc: Eric Dumazet <edumazet@google.com> > Cc: Jakub Kicinski <kuba@kernel.org> > Cc: Mark Brown <broonie@kernel.org> > Cc: Paolo Abeni <pabeni@redhat.com> > Cc: Ronald Wahl <ronald.wahl@raritan.com> > Cc: Simon Horman <horms@kernel.org> > Cc: netdev@vger.kernel.org > --- > drivers/net/ethernet/micrel/ks8851_common.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c > index 896d43bb8883d..b6b727e651f3d 100644 > --- a/drivers/net/ethernet/micrel/ks8851_common.c > +++ b/drivers/net/ethernet/micrel/ks8851_common.c > @@ -299,7 +299,7 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > ks8851_dbg_dumpkkt(ks, rxpkt); > > skb->protocol = eth_type_trans(skb, ks->netdev); > - netif_rx(skb); > + __netif_rx(skb); > > ks->netdev->stats.rx_packets++; > ks->netdev->stats.rx_bytes += rxlen; > @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > */ > static irqreturn_t ks8851_irq(int irq, void *_ks) > { > + bool need_bh_off = !(hardirq_count() | softirq_count()); IMO, in_task() macro would be better. > struct ks8851_net *ks = _ks; > unsigned handled = 0; > unsigned long flags; > unsigned int status; > > + if (need_bh_off) > + local_bh_disable(); This threaded irq's thread function (ks8851_irq()) will always run in process context, right ? Do you need "if(need_bh_off)" loop? > + > ks8851_lock(ks, &flags); > > status = ks8851_rdreg16(ks, KS_ISR); > @@ -406,6 +410,9 @@ static irqreturn_t ks8851_irq(int irq, void *_ks) > if (status & IRQ_LCI) > mii_check_link(&ks->mii); > > + if (need_bh_off) > + local_bh_enable(); > + > return IRQ_HANDLED; > } > > -- > 2.43.0 >
On 4/1/24 6:18 AM, Ratheesh Kannoth wrote: > On 2024-03-31 at 19:51:46, Marek Vasut (marex@denx.de) wrote: >> The ks8851_irq() thread may call ks8851_rx_pkts() in case there are >> any packets in the MAC FIFO, which calls netif_rx(). This netif_rx() >> implementation is guarded by local_bh_disable() and local_bh_enable(). >> The local_bh_enable() may call do_softirq() to run softirqs in case >> any are pending. One of the softirqs is net_rx_action, which ultimately >> reaches the driver .start_xmit callback. If that happens, the system >> hangs. The entire call chain is below: >> >> ks8851_start_xmit_par from netdev_start_xmit >> netdev_start_xmit from dev_hard_start_xmit >> dev_hard_start_xmit from sch_direct_xmit >> sch_direct_xmit from __dev_queue_xmit >> __dev_queue_xmit from __neigh_update >> __neigh_update from neigh_update >> neigh_update from arp_process.constprop.0 >> arp_process.constprop.0 from __netif_receive_skb_one_core >> __netif_receive_skb_one_core from process_backlog >> process_backlog from __napi_poll.constprop.0 >> __napi_poll.constprop.0 from net_rx_action >> net_rx_action from __do_softirq >> __do_softirq from call_with_stack >> call_with_stack from do_softirq >> do_softirq from __local_bh_enable_ip >> __local_bh_enable_ip from netif_rx >> netif_rx from ks8851_irq >> ks8851_irq from irq_thread_fn >> irq_thread_fn from irq_thread >> irq_thread from kthread >> kthread from ret_from_fork >> >> The hang happens because ks8851_irq() first locks a spinlock in >> ks8851_par.c ks8851_lock_par() spin_lock_irqsave(&ksp->lock, ...) >> and with that spinlock locked, calls netif_rx(). Once the execution >> reaches ks8851_start_xmit_par(), it calls ks8851_lock_par() again >> which attempts to claim the already locked spinlock again, and the >> hang happens. >> >> Move the do_softirq() call outside of the spinlock protected section >> of ks8851_irq() by disabling BHs around the entire spinlock protected >> section of ks8851_irq() handler. Place local_bh_enable() outside of >> the spinlock protected section, so that it can trigger do_softirq() >> without the ks8851_par.c ks8851_lock_par() spinlock being held, and >> safely call ks8851_start_xmit_par() without attempting to lock the >> already locked spinlock. >> >> Since ks8851_irq() is protected by local_bh_disable()/local_bh_enable() >> now, replace netif_rx() with __netif_rx() which is not duplicating the >> local_bh_disable()/local_bh_enable() calls. [...] >> diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c >> index 896d43bb8883d..b6b727e651f3d 100644 >> --- a/drivers/net/ethernet/micrel/ks8851_common.c >> +++ b/drivers/net/ethernet/micrel/ks8851_common.c >> @@ -299,7 +299,7 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) >> ks8851_dbg_dumpkkt(ks, rxpkt); >> >> skb->protocol = eth_type_trans(skb, ks->netdev); >> - netif_rx(skb); >> + __netif_rx(skb); >> >> ks->netdev->stats.rx_packets++; >> ks->netdev->stats.rx_bytes += rxlen; >> @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) >> */ >> static irqreturn_t ks8851_irq(int irq, void *_ks) >> { >> + bool need_bh_off = !(hardirq_count() | softirq_count()); > IMO, in_task() macro would be better. I don't think in_task() is exactly identical to !(hardirq_count() | softirq_count()) according to include/linux/preempt.h , it also takes NMI into consideration. I am not sure if that could pose a problem or not ? This test here has been taken from net/core/dev.c netif_rx() , it is the same one used there around __netif_rx() invocation. >> struct ks8851_net *ks = _ks; >> unsigned handled = 0; >> unsigned long flags; >> unsigned int status; >> >> + if (need_bh_off) >> + local_bh_disable(); > This threaded irq's thread function (ks8851_irq()) will always run in process context, right ? I think so. > Do you need "if(need_bh_off)" loop? It is not a loop, it is invoked once. It is here to disable BHs so that the net_rx_action BH wouldn't run until after the spinlock protected section of the IRQ handler. Te net_rx_action may end up calling ks8851_start_xmit_par, which must be called with the spinlock released, otherwise the system would lock up.
> From: Marek Vasut <marex@denx.de> > To: Ratheesh Kannoth <rkannoth@marvell.com> > Cc: netdev@vger.kernel.org; David S. Miller <davem@davemloft.net>; Uwe > This test here has been taken from net/core/dev.c netif_rx() , it is the same > one used there around __netif_rx() invocation. > > >> struct ks8851_net *ks = _ks; > >> unsigned handled = 0; > >> unsigned long flags; > >> unsigned int status; > >> > >> + if (need_bh_off) > >> + local_bh_disable(); > > This threaded irq's thread function (ks8851_irq()) will always run in process > context, right ? > > I think so. > > > Do you need "if(need_bh_off)" loop? My bad. Typo. I meant "if (need_bh_off) statement"; not "loop". > It is not a loop, it is invoked once. It is here to disable BHs so that the > net_rx_action BH wouldn't run until after the spinlock protected section of the > IRQ handler. Te net_rx_action may end up calling ks8851_start_xmit_par, > which must be called with the spinlock released, otherwise the system would > lock up. I understand that. My question - will there be a case (currently, without this patch) ks8851_irq() Is called after disabling local BH. If it is always called without disabled, can we avoid "if" statement. altogether?
On Sun, 31 Mar 2024 16:21:46 +0200 Marek Vasut wrote: > diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c > index 896d43bb8883d..b6b727e651f3d 100644 > --- a/drivers/net/ethernet/micrel/ks8851_common.c > +++ b/drivers/net/ethernet/micrel/ks8851_common.c > @@ -299,7 +299,7 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > ks8851_dbg_dumpkkt(ks, rxpkt); > > skb->protocol = eth_type_trans(skb, ks->netdev); > - netif_rx(skb); > + __netif_rx(skb); maybe return the packets from ks8851_rx_pkts() (you can put them on a queue / struct sk_buff_head) and process them once the lock is released? Also why not NAPI? > ks->netdev->stats.rx_packets++; > ks->netdev->stats.rx_bytes += rxlen; > @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > */ > static irqreturn_t ks8851_irq(int irq, void *_ks) > { > + bool need_bh_off = !(hardirq_count() | softirq_count()); I don't think IRQ / RT developers look approvingly at uses of such low level macros in drivers.
On 4/1/24 4:13 PM, Ratheesh Kannoth wrote: >> From: Marek Vasut <marex@denx.de> >> To: Ratheesh Kannoth <rkannoth@marvell.com> >> Cc: netdev@vger.kernel.org; David S. Miller <davem@davemloft.net>; Uwe >> This test here has been taken from net/core/dev.c netif_rx() , it is the same >> one used there around __netif_rx() invocation. >> >>>> struct ks8851_net *ks = _ks; >>>> unsigned handled = 0; >>>> unsigned long flags; >>>> unsigned int status; >>>> >>>> + if (need_bh_off) >>>> + local_bh_disable(); >>> This threaded irq's thread function (ks8851_irq()) will always run in process >> context, right ? >> >> I think so. >> >>> Do you need "if(need_bh_off)" loop? > My bad. Typo. I meant "if (need_bh_off) statement"; not "loop". > >> It is not a loop, it is invoked once. It is here to disable BHs so that the >> net_rx_action BH wouldn't run until after the spinlock protected section of the >> IRQ handler. Te net_rx_action may end up calling ks8851_start_xmit_par, >> which must be called with the spinlock released, otherwise the system would >> lock up. > I understand that. My question - will there be a case (currently, without this patch) ks8851_irq() > Is called after disabling local BH. If it is always called without disabled, can we avoid "if" statement. > altogether? Aha, I think that makes sense and yes, we can drop the if statement altogether. I'll add that into V2.
On 4/2/24 6:06 AM, Jakub Kicinski wrote: > On Sun, 31 Mar 2024 16:21:46 +0200 Marek Vasut wrote: >> diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c >> index 896d43bb8883d..b6b727e651f3d 100644 >> --- a/drivers/net/ethernet/micrel/ks8851_common.c >> +++ b/drivers/net/ethernet/micrel/ks8851_common.c >> @@ -299,7 +299,7 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) >> ks8851_dbg_dumpkkt(ks, rxpkt); >> >> skb->protocol = eth_type_trans(skb, ks->netdev); >> - netif_rx(skb); >> + __netif_rx(skb); > > maybe return the packets from ks8851_rx_pkts() > (you can put them on a queue / struct sk_buff_head) > and process them once the lock is released? I think that's what netif_rx() basically already does internally in netif_rx_internal(): enqueue_to_backlog(skb, smp_processor_id(), &qtail); > Also why not NAPI? I am not quite sure what to answer here. Are you asking me to add NAPI support to the driver ? >> ks->netdev->stats.rx_packets++; >> ks->netdev->stats.rx_bytes += rxlen; >> @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) >> */ >> static irqreturn_t ks8851_irq(int irq, void *_ks) >> { >> + bool need_bh_off = !(hardirq_count() | softirq_count()); > > I don't think IRQ / RT developers look approvingly at uses of such > low level macros in drivers. I _think_ the need_bh_off will be always true as Ratheesh suggested, so this can be dropped. I will test that before doing a V2.
On Tue, 2 Apr 2024 19:38:26 +0200 Marek Vasut wrote: > >> ks->netdev->stats.rx_packets++; > >> ks->netdev->stats.rx_bytes += rxlen; > >> @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > >> */ > >> static irqreturn_t ks8851_irq(int irq, void *_ks) > >> { > >> + bool need_bh_off = !(hardirq_count() | softirq_count()); > > > > I don't think IRQ / RT developers look approvingly at uses of such > > low level macros in drivers. > > I _think_ the need_bh_off will be always true as Ratheesh suggested, so > this can be dropped. I will test that before doing a V2. Quite possibly, seems like a reasonable fix if we don't have to make it conditional.
Hi, for the spi version of the chip this change now leads to [ 23.793000] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:283 [ 23.801915] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 857, name: irq/52-eth-link [ 23.810895] preempt_count: 200, expected: 0 [ 23.815288] CPU: 0 PID: 857 Comm: irq/52-eth-link Not tainted 6.6.28-sama5 #1 [ 23.822790] Hardware name: Atmel SAMA5 [ 23.826717] unwind_backtrace from show_stack+0xb/0xc [ 23.831992] show_stack from dump_stack_lvl+0x19/0x1e [ 23.837433] dump_stack_lvl from __might_resched+0xb7/0xec [ 23.843122] __might_resched from mutex_lock+0xf/0x2c [ 23.848540] mutex_lock from ks8851_irq+0x1f/0x164 [ 23.853525] ks8851_irq from irq_thread_fn+0xf/0x28 [ 23.858776] irq_thread_fn from irq_thread+0x93/0x130 [ 23.864037] irq_thread from kthread+0x7f/0x90 [ 23.868699] kthread from ret_from_fork+0x11/0x1c Actually the spi driver variant does not suffer from the issue as it has different locking so we probably should do the local_bh_disable/local_bh_enable only for the "par" version. What do you think? - ron On 31.03.24 16:21, Marek Vasut wrote: > [You don't often get email from marex@denx.de. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > The ks8851_irq() thread may call ks8851_rx_pkts() in case there are > any packets in the MAC FIFO, which calls netif_rx(). This netif_rx() > implementation is guarded by local_bh_disable() and local_bh_enable(). > The local_bh_enable() may call do_softirq() to run softirqs in case > any are pending. One of the softirqs is net_rx_action, which ultimately > reaches the driver .start_xmit callback. If that happens, the system > hangs. The entire call chain is below: > > ks8851_start_xmit_par from netdev_start_xmit > netdev_start_xmit from dev_hard_start_xmit > dev_hard_start_xmit from sch_direct_xmit > sch_direct_xmit from __dev_queue_xmit > __dev_queue_xmit from __neigh_update > __neigh_update from neigh_update > neigh_update from arp_process.constprop.0 > arp_process.constprop.0 from __netif_receive_skb_one_core > __netif_receive_skb_one_core from process_backlog > process_backlog from __napi_poll.constprop.0 > __napi_poll.constprop.0 from net_rx_action > net_rx_action from __do_softirq > __do_softirq from call_with_stack > call_with_stack from do_softirq > do_softirq from __local_bh_enable_ip > __local_bh_enable_ip from netif_rx > netif_rx from ks8851_irq > ks8851_irq from irq_thread_fn > irq_thread_fn from irq_thread > irq_thread from kthread > kthread from ret_from_fork > > The hang happens because ks8851_irq() first locks a spinlock in > ks8851_par.c ks8851_lock_par() spin_lock_irqsave(&ksp->lock, ...) > and with that spinlock locked, calls netif_rx(). Once the execution > reaches ks8851_start_xmit_par(), it calls ks8851_lock_par() again > which attempts to claim the already locked spinlock again, and the > hang happens. > > Move the do_softirq() call outside of the spinlock protected section > of ks8851_irq() by disabling BHs around the entire spinlock protected > section of ks8851_irq() handler. Place local_bh_enable() outside of > the spinlock protected section, so that it can trigger do_softirq() > without the ks8851_par.c ks8851_lock_par() spinlock being held, and > safely call ks8851_start_xmit_par() without attempting to lock the > already locked spinlock. > > Since ks8851_irq() is protected by local_bh_disable()/local_bh_enable() > now, replace netif_rx() with __netif_rx() which is not duplicating the > local_bh_disable()/local_bh_enable() calls. > > Fixes: 797047f875b5 ("net: ks8851: Implement Parallel bus operations") > Signed-off-by: Marek Vasut <marex@denx.de> > --- > Cc: "David S. Miller" <davem@davemloft.net> > Cc: "Uwe Kleine-König" <u.kleine-koenig@pengutronix.de> > Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> > Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> > Cc: Eric Dumazet <edumazet@google.com> > Cc: Jakub Kicinski <kuba@kernel.org> > Cc: Mark Brown <broonie@kernel.org> > Cc: Paolo Abeni <pabeni@redhat.com> > Cc: Ronald Wahl <ronald.wahl@raritan.com> > Cc: Simon Horman <horms@kernel.org> > Cc: netdev@vger.kernel.org > --- > drivers/net/ethernet/micrel/ks8851_common.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c > index 896d43bb8883d..b6b727e651f3d 100644 > --- a/drivers/net/ethernet/micrel/ks8851_common.c > +++ b/drivers/net/ethernet/micrel/ks8851_common.c > @@ -299,7 +299,7 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > ks8851_dbg_dumpkkt(ks, rxpkt); > > skb->protocol = eth_type_trans(skb, ks->netdev); > - netif_rx(skb); > + __netif_rx(skb); > > ks->netdev->stats.rx_packets++; > ks->netdev->stats.rx_bytes += rxlen; > @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) > */ > static irqreturn_t ks8851_irq(int irq, void *_ks) > { > + bool need_bh_off = !(hardirq_count() | softirq_count()); > struct ks8851_net *ks = _ks; > unsigned handled = 0; > unsigned long flags; > unsigned int status; > > + if (need_bh_off) > + local_bh_disable(); > + > ks8851_lock(ks, &flags); > > status = ks8851_rdreg16(ks, KS_ISR); > @@ -406,6 +410,9 @@ static irqreturn_t ks8851_irq(int irq, void *_ks) > if (status & IRQ_LCI) > mii_check_link(&ks->mii); > > + if (need_bh_off) > + local_bh_enable(); > + > return IRQ_HANDLED; > } > > -- > 2.43.0 >
On 4/29/24 1:46 PM, Ronald Wahl wrote: > Hi, Hi, > for the spi version of the chip this change now leads to > > [ 23.793000] BUG: sleeping function called from invalid context at > kernel/locking/mutex.c:283 > [ 23.801915] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: > 857, name: irq/52-eth-link > [ 23.810895] preempt_count: 200, expected: 0 > [ 23.815288] CPU: 0 PID: 857 Comm: irq/52-eth-link Not tainted > 6.6.28-sama5 #1 > [ 23.822790] Hardware name: Atmel SAMA5 > [ 23.826717] unwind_backtrace from show_stack+0xb/0xc > [ 23.831992] show_stack from dump_stack_lvl+0x19/0x1e > [ 23.837433] dump_stack_lvl from __might_resched+0xb7/0xec > [ 23.843122] __might_resched from mutex_lock+0xf/0x2c > [ 23.848540] mutex_lock from ks8851_irq+0x1f/0x164 > [ 23.853525] ks8851_irq from irq_thread_fn+0xf/0x28 > [ 23.858776] irq_thread_fn from irq_thread+0x93/0x130 > [ 23.864037] irq_thread from kthread+0x7f/0x90 > [ 23.868699] kthread from ret_from_fork+0x11/0x1c > > Actually the spi driver variant does not suffer from the issue as it has > different locking so we probably should do the > local_bh_disable/local_bh_enable only for the "par" version. What do you > think? Ah sigh, sorry for the breakage. Indeed, the locking is not great here. I am not entirely sure about the local_bh_disable/enable being par only. I will try to prepare some sort of a patch, would you be willing to test it on the SPI variant ?
On 29.04.24 15:23, Marek Vasut wrote: > On 4/29/24 1:46 PM, Ronald Wahl wrote: >> Hi, > > Hi, > >> for the spi version of the chip this change now leads to >> >> [ 23.793000] BUG: sleeping function called from invalid context at >> kernel/locking/mutex.c:283 >> [ 23.801915] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: >> 857, name: irq/52-eth-link >> [ 23.810895] preempt_count: 200, expected: 0 >> [ 23.815288] CPU: 0 PID: 857 Comm: irq/52-eth-link Not tainted >> 6.6.28-sama5 #1 >> [ 23.822790] Hardware name: Atmel SAMA5 >> [ 23.826717] unwind_backtrace from show_stack+0xb/0xc >> [ 23.831992] show_stack from dump_stack_lvl+0x19/0x1e >> [ 23.837433] dump_stack_lvl from __might_resched+0xb7/0xec >> [ 23.843122] __might_resched from mutex_lock+0xf/0x2c >> [ 23.848540] mutex_lock from ks8851_irq+0x1f/0x164 >> [ 23.853525] ks8851_irq from irq_thread_fn+0xf/0x28 >> [ 23.858776] irq_thread_fn from irq_thread+0x93/0x130 >> [ 23.864037] irq_thread from kthread+0x7f/0x90 >> [ 23.868699] kthread from ret_from_fork+0x11/0x1c >> >> Actually the spi driver variant does not suffer from the issue as it has >> different locking so we probably should do the >> local_bh_disable/local_bh_enable only for the "par" version. What do >> you think? > > Ah sigh, sorry for the breakage. Indeed, the locking is not great here. > > I am not entirely sure about the local_bh_disable/enable being par only. > > I will try to prepare some sort of a patch, would you be willing to test > it on the SPI variant ? Yes, I can help here, thanks. Meanwhile I also have some good understanding at least on the TX path because we had some issues here in the past. I will come up myself with another fix in the interrupt handler later. We currently reset the ISR status flags too late risking a TX queue stall with the SPI chip variant. They must be reset immediately after reading them. Need to wait a bit for field feedback as I was not able to reproduce this mysqelf. - ron
On 4/29/24 3:50 PM, Ronald Wahl wrote: > On 29.04.24 15:23, Marek Vasut wrote: >> On 4/29/24 1:46 PM, Ronald Wahl wrote: >>> Hi, >> >> Hi, >> >>> for the spi version of the chip this change now leads to >>> >>> [ 23.793000] BUG: sleeping function called from invalid context at >>> kernel/locking/mutex.c:283 >>> [ 23.801915] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: >>> 857, name: irq/52-eth-link >>> [ 23.810895] preempt_count: 200, expected: 0 >>> [ 23.815288] CPU: 0 PID: 857 Comm: irq/52-eth-link Not tainted >>> 6.6.28-sama5 #1 >>> [ 23.822790] Hardware name: Atmel SAMA5 >>> [ 23.826717] unwind_backtrace from show_stack+0xb/0xc >>> [ 23.831992] show_stack from dump_stack_lvl+0x19/0x1e >>> [ 23.837433] dump_stack_lvl from __might_resched+0xb7/0xec >>> [ 23.843122] __might_resched from mutex_lock+0xf/0x2c >>> [ 23.848540] mutex_lock from ks8851_irq+0x1f/0x164 >>> [ 23.853525] ks8851_irq from irq_thread_fn+0xf/0x28 >>> [ 23.858776] irq_thread_fn from irq_thread+0x93/0x130 >>> [ 23.864037] irq_thread from kthread+0x7f/0x90 >>> [ 23.868699] kthread from ret_from_fork+0x11/0x1c >>> >>> Actually the spi driver variant does not suffer from the issue as it has >>> different locking so we probably should do the >>> local_bh_disable/local_bh_enable only for the "par" version. What do >>> you think? >> >> Ah sigh, sorry for the breakage. Indeed, the locking is not great here. >> >> I am not entirely sure about the local_bh_disable/enable being par only. >> >> I will try to prepare some sort of a patch, would you be willing to test >> it on the SPI variant ? > > Yes, I can help here, thanks. Meanwhile I also have some good understanding > at least on the TX path because we had some issues here in the past. > > I will come up myself with another fix in the interrupt handler later. We > currently reset the ISR status flags too late risking a TX queue stall with > the SPI chip variant. They must be reset immediately after reading them. > Need > to wait a bit for field feedback as I was not able to reproduce this > mysqelf. This chip really is a gift that keeps on giving ... sigh. I just sent this patch, you are on CC, please give it a try: https://patchwork.kernel.org/project/netdevbpf/patch/20240430011518.110416-1-marex@denx.de/
diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c index 896d43bb8883d..b6b727e651f3d 100644 --- a/drivers/net/ethernet/micrel/ks8851_common.c +++ b/drivers/net/ethernet/micrel/ks8851_common.c @@ -299,7 +299,7 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) ks8851_dbg_dumpkkt(ks, rxpkt); skb->protocol = eth_type_trans(skb, ks->netdev); - netif_rx(skb); + __netif_rx(skb); ks->netdev->stats.rx_packets++; ks->netdev->stats.rx_bytes += rxlen; @@ -325,11 +325,15 @@ static void ks8851_rx_pkts(struct ks8851_net *ks) */ static irqreturn_t ks8851_irq(int irq, void *_ks) { + bool need_bh_off = !(hardirq_count() | softirq_count()); struct ks8851_net *ks = _ks; unsigned handled = 0; unsigned long flags; unsigned int status; + if (need_bh_off) + local_bh_disable(); + ks8851_lock(ks, &flags); status = ks8851_rdreg16(ks, KS_ISR); @@ -406,6 +410,9 @@ static irqreturn_t ks8851_irq(int irq, void *_ks) if (status & IRQ_LCI) mii_check_link(&ks->mii); + if (need_bh_off) + local_bh_enable(); + return IRQ_HANDLED; }
The ks8851_irq() thread may call ks8851_rx_pkts() in case there are any packets in the MAC FIFO, which calls netif_rx(). This netif_rx() implementation is guarded by local_bh_disable() and local_bh_enable(). The local_bh_enable() may call do_softirq() to run softirqs in case any are pending. One of the softirqs is net_rx_action, which ultimately reaches the driver .start_xmit callback. If that happens, the system hangs. The entire call chain is below: ks8851_start_xmit_par from netdev_start_xmit netdev_start_xmit from dev_hard_start_xmit dev_hard_start_xmit from sch_direct_xmit sch_direct_xmit from __dev_queue_xmit __dev_queue_xmit from __neigh_update __neigh_update from neigh_update neigh_update from arp_process.constprop.0 arp_process.constprop.0 from __netif_receive_skb_one_core __netif_receive_skb_one_core from process_backlog process_backlog from __napi_poll.constprop.0 __napi_poll.constprop.0 from net_rx_action net_rx_action from __do_softirq __do_softirq from call_with_stack call_with_stack from do_softirq do_softirq from __local_bh_enable_ip __local_bh_enable_ip from netif_rx netif_rx from ks8851_irq ks8851_irq from irq_thread_fn irq_thread_fn from irq_thread irq_thread from kthread kthread from ret_from_fork The hang happens because ks8851_irq() first locks a spinlock in ks8851_par.c ks8851_lock_par() spin_lock_irqsave(&ksp->lock, ...) and with that spinlock locked, calls netif_rx(). Once the execution reaches ks8851_start_xmit_par(), it calls ks8851_lock_par() again which attempts to claim the already locked spinlock again, and the hang happens. Move the do_softirq() call outside of the spinlock protected section of ks8851_irq() by disabling BHs around the entire spinlock protected section of ks8851_irq() handler. Place local_bh_enable() outside of the spinlock protected section, so that it can trigger do_softirq() without the ks8851_par.c ks8851_lock_par() spinlock being held, and safely call ks8851_start_xmit_par() without attempting to lock the already locked spinlock. Since ks8851_irq() is protected by local_bh_disable()/local_bh_enable() now, replace netif_rx() with __netif_rx() which is not duplicating the local_bh_disable()/local_bh_enable() calls. Fixes: 797047f875b5 ("net: ks8851: Implement Parallel bus operations") Signed-off-by: Marek Vasut <marex@denx.de> --- Cc: "David S. Miller" <davem@davemloft.net> Cc: "Uwe Kleine-König" <u.kleine-koenig@pengutronix.de> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ronald Wahl <ronald.wahl@raritan.com> Cc: Simon Horman <horms@kernel.org> Cc: netdev@vger.kernel.org --- drivers/net/ethernet/micrel/ks8851_common.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)