Message ID | 1590486419-9289-1-git-send-email-yoshihiro.shimoda.uh@renesas.com (mailing list archive) |
---|---|
State | Under Review |
Delegated to: | Geert Uytterhoeven |
Headers | show |
Series | [PATCH/RFC] net: ethernet: ravb: Try to wake subqueue instead of stop on timeout | expand |
Hi Sergei-san, (add your private email address to CC.) > From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM > > According to the report of [1], this driver is possible to cause > the following error in ravb_tx_timeout_work(). > > ravb e6800000.ethernet ethernet: failed to switch device to config mode > > This error means that the hardware could not change the state > from "Operation" to "Configuration" while some tx queue is operating. > After that, ravb_config() in ravb_dmac_init() will fail, and then > any descriptors will be not allocaled anymore so that NULL porinter > dereference happens after that on ravb_start_xmit(). > > Such a case is possible to be caused because this driver supports > two queues (NC and BE) and the ravb_stop_dma() is possible to return > without any stopping process if TCCR or CSR register indicates > the hardware is operating for TX. > > To fix the issue, just try to wake the subqueue on > ravb_tx_timeout_work() if the descriptors are not full instead > of stop all transfers (all queues of TX and RX). > > [1] > https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ > > Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > --- > I'm guessing that this issue is possible to happen if: > - ravb_start_xmit() calls netif_stop_subqueue(), and > - ravb_poll() will not be called with some reason, and > - netif_wake_subqueue() will be not called, and then > - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). > > However, unfortunately, I didn't reproduce the issue yet. > To be honest, I'm also guessing other queues (SR) of this hardware > which out-of tree driver manages are possible to reproduce this issue, > but I didn't try such environment for now... > > So, I marked RFC on this patch now. I'm afraid, but do you have any comments about this patch? Best regards, Yoshihiro Shimoda > drivers/net/ethernet/renesas/ravb.h | 1 - > drivers/net/ethernet/renesas/ravb_main.c | 48 ++++++++++---------------------- > 2 files changed, 14 insertions(+), 35 deletions(-) > > diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h > index 9f88b5d..42cf41a 100644 > --- a/drivers/net/ethernet/renesas/ravb.h > +++ b/drivers/net/ethernet/renesas/ravb.h > @@ -1021,7 +1021,6 @@ struct ravb_private { > u32 cur_tx[NUM_TX_QUEUE]; > u32 dirty_tx[NUM_TX_QUEUE]; > struct napi_struct napi[NUM_RX_QUEUE]; > - struct work_struct work; > /* MII transceiver section. */ > struct mii_bus *mii_bus; /* MDIO bus control */ > int link; > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > index 067ad25..45e1ecd 100644 > --- a/drivers/net/ethernet/renesas/ravb_main.c > +++ b/drivers/net/ethernet/renesas/ravb_main.c > @@ -1428,44 +1428,25 @@ static int ravb_open(struct net_device *ndev) > static void ravb_tx_timeout(struct net_device *ndev, unsigned int txqueue) > { > struct ravb_private *priv = netdev_priv(ndev); > + unsigned long flags; > + bool wake = false; > + > + spin_lock_irqsave(&priv->lock, flags); > + if (priv->cur_tx[txqueue] - priv->dirty_tx[txqueue] <= > + (priv->num_tx_ring[txqueue] - 1) * priv->num_tx_desc) > + wake = true; > > netif_err(priv, tx_err, ndev, > - "transmit timed out, status %08x, resetting...\n", > - ravb_read(ndev, ISS)); > + "transmit timed out (%d %d), status %08x %08x %08x\n", > + txqueue, wake, ravb_read(ndev, ISS), ravb_read(ndev, TCCR), > + ravb_read(ndev, CSR)); > + > + if (wake) > + netif_wake_subqueue(ndev, txqueue); > > /* tx_errors count up */ > ndev->stats.tx_errors++; > - > - schedule_work(&priv->work); > -} > - > -static void ravb_tx_timeout_work(struct work_struct *work) > -{ > - struct ravb_private *priv = container_of(work, struct ravb_private, > - work); > - struct net_device *ndev = priv->ndev; > - > - netif_tx_stop_all_queues(ndev); > - > - /* Stop PTP Clock driver */ > - if (priv->chip_id == RCAR_GEN2) > - ravb_ptp_stop(ndev); > - > - /* Wait for DMA stopping */ > - ravb_stop_dma(ndev); > - > - ravb_ring_free(ndev, RAVB_BE); > - ravb_ring_free(ndev, RAVB_NC); > - > - /* Device init */ > - ravb_dmac_init(ndev); > - ravb_emac_init(ndev); > - > - /* Initialise PTP Clock driver */ > - if (priv->chip_id == RCAR_GEN2) > - ravb_ptp_init(ndev, priv->pdev); > - > - netif_tx_start_all_queues(ndev); > + spin_unlock_irqrestore(&priv->lock, flags); > } > > /* Packet transmit function for Ethernet AVB */ > @@ -2046,7 +2027,6 @@ static int ravb_probe(struct platform_device *pdev) > } > > spin_lock_init(&priv->lock); > - INIT_WORK(&priv->work, ravb_tx_timeout_work); > > error = of_get_phy_mode(np, &priv->phy_interface); > if (error && error != -ENODEV) > -- > 2.7.4
Hello! On 15.06.2020 8:58, Yoshihiro Shimoda wrote: >> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM >> >> According to the report of [1], this driver is possible to cause >> the following error in ravb_tx_timeout_work(). >> >> ravb e6800000.ethernet ethernet: failed to switch device to config mode >> >> This error means that the hardware could not change the state >> from "Operation" to "Configuration" while some tx queue is operating. >> After that, ravb_config() in ravb_dmac_init() will fail, and then >> any descriptors will be not allocaled anymore so that NULL porinter >> dereference happens after that on ravb_start_xmit(). >> >> Such a case is possible to be caused because this driver supports >> two queues (NC and BE) and the ravb_stop_dma() is possible to return >> without any stopping process if TCCR or CSR register indicates >> the hardware is operating for TX. >> >> To fix the issue, just try to wake the subqueue on >> ravb_tx_timeout_work() if the descriptors are not full instead >> of stop all transfers (all queues of TX and RX). >> >> [1] >> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ >> >> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> >> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> >> --- >> I'm guessing that this issue is possible to happen if: >> - ravb_start_xmit() calls netif_stop_subqueue(), and >> - ravb_poll() will not be called with some reason, and >> - netif_wake_subqueue() will be not called, and then >> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). >> >> However, unfortunately, I didn't reproduce the issue yet. >> To be honest, I'm also guessing other queues (SR) of this hardware >> which out-of tree driver manages are possible to reproduce this issue, >> but I didn't try such environment for now... >> >> So, I marked RFC on this patch now. > > I'm afraid, but do you have any comments about this patch? I agree that we should now reset only the stuck queue, not both but I doubt your solution is good enough. Let me have another look... [...] MBR, Sergei
Hello! > From: Sergei Shtylyov, Sent: Monday, June 15, 2020 5:13 PM > > Hello! > > On 15.06.2020 8:58, Yoshihiro Shimoda wrote: > > >> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM > >> > >> According to the report of [1], this driver is possible to cause > >> the following error in ravb_tx_timeout_work(). > >> > >> ravb e6800000.ethernet ethernet: failed to switch device to config mode > >> > >> This error means that the hardware could not change the state > >> from "Operation" to "Configuration" while some tx queue is operating. > >> After that, ravb_config() in ravb_dmac_init() will fail, and then > >> any descriptors will be not allocaled anymore so that NULL porinter > >> dereference happens after that on ravb_start_xmit(). > >> > >> Such a case is possible to be caused because this driver supports > >> two queues (NC and BE) and the ravb_stop_dma() is possible to return > >> without any stopping process if TCCR or CSR register indicates > >> the hardware is operating for TX. > >> > >> To fix the issue, just try to wake the subqueue on > >> ravb_tx_timeout_work() if the descriptors are not full instead > >> of stop all transfers (all queues of TX and RX). > >> > >> [1] > >> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ > >> > >> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > >> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > >> --- > >> I'm guessing that this issue is possible to happen if: > >> - ravb_start_xmit() calls netif_stop_subqueue(), and > >> - ravb_poll() will not be called with some reason, and > >> - netif_wake_subqueue() will be not called, and then > >> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). > >> > >> However, unfortunately, I didn't reproduce the issue yet. > >> To be honest, I'm also guessing other queues (SR) of this hardware > >> which out-of tree driver manages are possible to reproduce this issue, > >> but I didn't try such environment for now... > >> > >> So, I marked RFC on this patch now. > > > > I'm afraid, but do you have any comments about this patch? > > I agree that we should now reset only the stuck queue, not both but I > doubt your solution is good enough. Let me have another look... Thank you for your comment! I hope this solution is good enough... By the way, there is other topic though, I'm thinking we should not call ravb_{close,open}() in ravb_{suspend,resume}() because this is possible to a race condition between ifconfig up/down and system suspend/resume. In other words, I'm thinking this should not free/reallocate dma memory while netif_running is true. But, what do you think? Best regards, Yoshihiro Shimoda > [...] > > MBR, Sergei
Hello! > From: Yoshihiro Shimoda, Sent: Friday, June 19, 2020 2:46 PM > > Hello! > > > From: Sergei Shtylyov, Sent: Monday, June 15, 2020 5:13 PM > > > > Hello! > > > > On 15.06.2020 8:58, Yoshihiro Shimoda wrote: > > > > >> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM > > >> > > >> According to the report of [1], this driver is possible to cause > > >> the following error in ravb_tx_timeout_work(). > > >> > > >> ravb e6800000.ethernet ethernet: failed to switch device to config mode > > >> > > >> This error means that the hardware could not change the state > > >> from "Operation" to "Configuration" while some tx queue is operating. > > >> After that, ravb_config() in ravb_dmac_init() will fail, and then > > >> any descriptors will be not allocaled anymore so that NULL porinter > > >> dereference happens after that on ravb_start_xmit(). > > >> > > >> Such a case is possible to be caused because this driver supports > > >> two queues (NC and BE) and the ravb_stop_dma() is possible to return > > >> without any stopping process if TCCR or CSR register indicates > > >> the hardware is operating for TX. > > >> > > >> To fix the issue, just try to wake the subqueue on > > >> ravb_tx_timeout_work() if the descriptors are not full instead > > >> of stop all transfers (all queues of TX and RX). > > >> > > >> [1] > > >> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ > > >> > > >> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > > >> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > > >> --- > > >> I'm guessing that this issue is possible to happen if: > > >> - ravb_start_xmit() calls netif_stop_subqueue(), and > > >> - ravb_poll() will not be called with some reason, and > > >> - netif_wake_subqueue() will be not called, and then > > >> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). > > >> > > >> However, unfortunately, I didn't reproduce the issue yet. > > >> To be honest, I'm also guessing other queues (SR) of this hardware > > >> which out-of tree driver manages are possible to reproduce this issue, > > >> but I didn't try such environment for now... > > >> > > >> So, I marked RFC on this patch now. > > > > > > I'm afraid, but do you have any comments about this patch? > > > > I agree that we should now reset only the stuck queue, not both but I > > doubt your solution is good enough. Let me have another look... > > Thank you for your comment! I hope this solution is good enough... I'm sorry again and again. But, do you have any time to look this patch? Best regards, Yoshihiro Shimoda
Hello! On 29.06.2020 8:24, Yoshihiro Shimoda wrote: >>>>> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM >>>>> >>>>> According to the report of [1], this driver is possible to cause >>>>> the following error in ravb_tx_timeout_work(). >>>>> >>>>> ravb e6800000.ethernet ethernet: failed to switch device to config mode >>>>> >>>>> This error means that the hardware could not change the state >>>>> from "Operation" to "Configuration" while some tx queue is operating. >>>>> After that, ravb_config() in ravb_dmac_init() will fail, and then >>>>> any descriptors will be not allocaled anymore so that NULL porinter >>>>> dereference happens after that on ravb_start_xmit(). >>>>> >>>>> Such a case is possible to be caused because this driver supports >>>>> two queues (NC and BE) and the ravb_stop_dma() is possible to return >>>>> without any stopping process if TCCR or CSR register indicates >>>>> the hardware is operating for TX. >>>>> >>>>> To fix the issue, just try to wake the subqueue on >>>>> ravb_tx_timeout_work() if the descriptors are not full instead >>>>> of stop all transfers (all queues of TX and RX). >>>>> >>>>> [1] >>>>> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ >>>>> >>>>> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> >>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> >>>>> --- >>>>> I'm guessing that this issue is possible to happen if: >>>>> - ravb_start_xmit() calls netif_stop_subqueue(), and >>>>> - ravb_poll() will not be called with some reason, and >>>>> - netif_wake_subqueue() will be not called, and then >>>>> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). >>>>> >>>>> However, unfortunately, I didn't reproduce the issue yet. >>>>> To be honest, I'm also guessing other queues (SR) of this hardware >>>>> which out-of tree driver manages are possible to reproduce this issue, >>>>> but I didn't try such environment for now... >>>>> >>>>> So, I marked RFC on this patch now. >>>> >>>> I'm afraid, but do you have any comments about this patch? >>> >>> I agree that we should now reset only the stuck queue, not both but I >>> doubt your solution is good enough. Let me have another look... >> >> Thank you for your comment! I hope this solution is good enough... > > I'm sorry again and again. But, do you have any time to look this patch? Yes, in the sense of reviewing -- I don't consider it complete. And no, in the sense of looking into the issue myself... Can we do a per-queue tear-down and re-init (not necessarily all in 1 patch)? > Best regards, > Yoshihiro Shimoda MBR, Sergei
Hello! > From: Sergei Shtylyov, Sent: Monday, June 29, 2020 5:40 PM > > Hello! > > On 29.06.2020 8:24, Yoshihiro Shimoda wrote: > > >>>>> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM > >>>>> > >>>>> According to the report of [1], this driver is possible to cause > >>>>> the following error in ravb_tx_timeout_work(). > >>>>> > >>>>> ravb e6800000.ethernet ethernet: failed to switch device to config mode > >>>>> > >>>>> This error means that the hardware could not change the state > >>>>> from "Operation" to "Configuration" while some tx queue is operating. > >>>>> After that, ravb_config() in ravb_dmac_init() will fail, and then > >>>>> any descriptors will be not allocaled anymore so that NULL porinter > >>>>> dereference happens after that on ravb_start_xmit(). > >>>>> > >>>>> Such a case is possible to be caused because this driver supports > >>>>> two queues (NC and BE) and the ravb_stop_dma() is possible to return > >>>>> without any stopping process if TCCR or CSR register indicates > >>>>> the hardware is operating for TX. > >>>>> > >>>>> To fix the issue, just try to wake the subqueue on > >>>>> ravb_tx_timeout_work() if the descriptors are not full instead > >>>>> of stop all transfers (all queues of TX and RX). > >>>>> > >>>>> [1] > >>>>> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ > >>>>> > >>>>> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > >>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > >>>>> --- > >>>>> I'm guessing that this issue is possible to happen if: > >>>>> - ravb_start_xmit() calls netif_stop_subqueue(), and > >>>>> - ravb_poll() will not be called with some reason, and > >>>>> - netif_wake_subqueue() will be not called, and then > >>>>> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). > >>>>> > >>>>> However, unfortunately, I didn't reproduce the issue yet. > >>>>> To be honest, I'm also guessing other queues (SR) of this hardware > >>>>> which out-of tree driver manages are possible to reproduce this issue, > >>>>> but I didn't try such environment for now... > >>>>> > >>>>> So, I marked RFC on this patch now. > >>>> > >>>> I'm afraid, but do you have any comments about this patch? > >>> > >>> I agree that we should now reset only the stuck queue, not both but I > >>> doubt your solution is good enough. Let me have another look... > >> > >> Thank you for your comment! I hope this solution is good enough... > > > > I'm sorry again and again. But, do you have any time to look this patch? > > Yes, in the sense of reviewing -- I don't consider it complete. And no, in > the sense of looking into the issue myself... Can we do a per-queue tear-down > and re-init (not necessarily all in 1 patch)? Thank you for your reply! Since I'm asking Renesas internal members about this, please wait for a little. # I don't think we can do a per-queu tear-down and re-init though... Best regards, Yoshihiro Shimoda > > Best regards, > > Yoshihiro Shimoda > > MBR, Sergei
Hello! > From: Sergei Shtylyov, Sent: Monday, June 29, 2020 5:40 PM > > Hello! > > On 29.06.2020 8:24, Yoshihiro Shimoda wrote: > > >>>>> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM > >>>>> > >>>>> According to the report of [1], this driver is possible to cause > >>>>> the following error in ravb_tx_timeout_work(). > >>>>> > >>>>> ravb e6800000.ethernet ethernet: failed to switch device to config mode > >>>>> > >>>>> This error means that the hardware could not change the state > >>>>> from "Operation" to "Configuration" while some tx queue is operating. > >>>>> After that, ravb_config() in ravb_dmac_init() will fail, and then > >>>>> any descriptors will be not allocaled anymore so that NULL porinter > >>>>> dereference happens after that on ravb_start_xmit(). > >>>>> > >>>>> Such a case is possible to be caused because this driver supports > >>>>> two queues (NC and BE) and the ravb_stop_dma() is possible to return > >>>>> without any stopping process if TCCR or CSR register indicates > >>>>> the hardware is operating for TX. > >>>>> > >>>>> To fix the issue, just try to wake the subqueue on > >>>>> ravb_tx_timeout_work() if the descriptors are not full instead > >>>>> of stop all transfers (all queues of TX and RX). > >>>>> > >>>>> [1] > >>>>> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ > >>>>> > >>>>> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > >>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > >>>>> --- > >>>>> I'm guessing that this issue is possible to happen if: > >>>>> - ravb_start_xmit() calls netif_stop_subqueue(), and > >>>>> - ravb_poll() will not be called with some reason, and > >>>>> - netif_wake_subqueue() will be not called, and then > >>>>> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). > >>>>> > >>>>> However, unfortunately, I didn't reproduce the issue yet. > >>>>> To be honest, I'm also guessing other queues (SR) of this hardware > >>>>> which out-of tree driver manages are possible to reproduce this issue, > >>>>> but I didn't try such environment for now... > >>>>> > >>>>> So, I marked RFC on this patch now. > >>>> > >>>> I'm afraid, but do you have any comments about this patch? > >>> > >>> I agree that we should now reset only the stuck queue, not both but I > >>> doubt your solution is good enough. Let me have another look... > >> > >> Thank you for your comment! I hope this solution is good enough... > > > > I'm sorry again and again. But, do you have any time to look this patch? > > Yes, in the sense of reviewing -- I don't consider it complete. And no, in > the sense of looking into the issue myself... Can we do a per-queue tear-down > and re-init (not necessarily all in 1 patch)? Thank you for your comment! I'm not sure this "re-init" mean. But, we can do a per-queue tear-down if DMAC is still working. And, we can prepare new descriptors for the queue after tear-down. < Tear-down > 1. Set DT_EOS to the desc_bat[queue]. 2. Set DLR.LBAx to 1. 3. Check if DLA.LBAx is cleared. < Prepare new descriptors and start again > 4. Prepare new descriptors. 5. Set DT_LINKFIX to the desc_bat[queue]. 6. Set DLR.LBAx to 1. 7. Check if DLA.LBAx is cleared. I'm thinking doing tear-down and "re-init" of the descriptors is better than just try to wake the subqueue. But, what do you think? Best regards, Yoshihiro Shimoda > > Best regards, > > Yoshihiro Shimoda > > MBR, Sergei
Hello! On 30.06.2020 8:22, Yoshihiro Shimoda wrote: >>>>>>> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM >>>>>>> >>>>>>> According to the report of [1], this driver is possible to cause >>>>>>> the following error in ravb_tx_timeout_work(). >>>>>>> >>>>>>> ravb e6800000.ethernet ethernet: failed to switch device to config mode >>>>>>> >>>>>>> This error means that the hardware could not change the state >>>>>>> from "Operation" to "Configuration" while some tx queue is operating. >>>>>>> After that, ravb_config() in ravb_dmac_init() will fail, and then >>>>>>> any descriptors will be not allocaled anymore so that NULL porinter Pointer. :-) >>>>>>> dereference happens after that on ravb_start_xmit(). >>>>>>> >>>>>>> Such a case is possible to be caused because this driver supports >>>>>>> two queues (NC and BE) and the ravb_stop_dma() is possible to return >>>>>>> without any stopping process if TCCR or CSR register indicates >>>>>>> the hardware is operating for TX. Maybe we should just fix those blind assumptions? >>>>>>> To fix the issue, just try to wake the subqueue on >>>>>>> ravb_tx_timeout_work() if the descriptors are not full instead >>>>>>> of stop all transfers (all queues of TX and RX). >>>>>>> >>>>>>> [1] >>>>>>> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ >>>>>>> >>>>>>> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> >>>>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> >>>>>>> --- >>>>>>> I'm guessing that this issue is possible to happen if: >>>>>>> - ravb_start_xmit() calls netif_stop_subqueue(), and >>>>>>> - ravb_poll() will not be called with some reason, and >>>>>>> - netif_wake_subqueue() will be not called, and then >>>>>>> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). >>>>>>> >>>>>>> However, unfortunately, I didn't reproduce the issue yet. >>>>>>> To be honest, I'm also guessing other queues (SR) of this hardware >>>>>>> which out-of tree driver manages are possible to reproduce this issue, >>>>>>> but I didn't try such environment for now... >>>>>>> >>>>>>> So, I marked RFC on this patch now. >>>>>> >>>>>> I'm afraid, but do you have any comments about this patch? >>>>> >>>>> I agree that we should now reset only the stuck queue, not both but I >>>>> doubt your solution is good enough. Let me have another look... >>>> >>>> Thank you for your comment! I hope this solution is good enough... >>> >>> I'm sorry again and again. But, do you have any time to look this patch? >> >> Yes, in the sense of reviewing -- I don't consider it complete. And no, in >> the sense of looking into the issue myself... Can we do a per-queue tear-down >> and re-init (not necessarily all in 1 patch)? In fact, it would ensue many changes... > Thank you for your comment! I'm not sure this "re-init" mean. But, we can do Well, I meant the ring re-allocation and re-formatting... but (looking at sh_eth) the former is not really necessary, it's enough to just stop the TX ring and then re-format it and re-start... Well, unfortunately, the way I structured the code, we can't do *just* that... > a per-queue tear-down if DMAC is still working. And, we can prepare new descriptors > for the queue after tear-down. > > < Tear-down > > 1. Set DT_EOS to the desc_bat[queue]. > 2. Set DLR.LBAx to 1. > 3. Check if DLA.LBAx is cleared. DLR.LBAx, you mean? Well, I was thinking of polling TCCR and CSR like the current ravb_stop_dma() does, but if that works... > < Prepare new descriptors and start again > > 4. Prepare new descriptors. That's where the cause for using the workqueue lies -- the descriptors are allocated with GFP_KERNEL, not GFP_ATOMIC... if you have time/desire to untangle all this, I'd appreciate it; else I'd have to work on this in my copious free time... :-) > 5. Set DT_LINKFIX to the desc_bat[queue]. > 6. Set DLR.LBAx to 1. > 7. Check if DLA.LBAx is cleared. > > > I'm thinking doing tear-down and "re-init" of the descriptors is better > than just try to wake the subqueue. But, what do you think? Definitely better, yes... > Best regards, > Yoshihiro Shimoda MBR, Sergei
Hello! > From: Sergei Shtylyov, Sent: Monday, July 6, 2020 5:29 AM > > Hello! > > On 30.06.2020 8:22, Yoshihiro Shimoda wrote: > > >>>>>>> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM > >>>>>>> > >>>>>>> According to the report of [1], this driver is possible to cause > >>>>>>> the following error in ravb_tx_timeout_work(). > >>>>>>> > >>>>>>> ravb e6800000.ethernet ethernet: failed to switch device to config mode > >>>>>>> > >>>>>>> This error means that the hardware could not change the state > >>>>>>> from "Operation" to "Configuration" while some tx queue is operating. > >>>>>>> After that, ravb_config() in ravb_dmac_init() will fail, and then > >>>>>>> any descriptors will be not allocaled anymore so that NULL porinter > > Pointer. :-) Oops! I should fix it :) > >>>>>>> dereference happens after that on ravb_start_xmit(). > >>>>>>> > >>>>>>> Such a case is possible to be caused because this driver supports > >>>>>>> two queues (NC and BE) and the ravb_stop_dma() is possible to return > >>>>>>> without any stopping process if TCCR or CSR register indicates > >>>>>>> the hardware is operating for TX. > > Maybe we should just fix those blind assumptions? Maybe I should have described some facts instead of assumptions like below? If so, I should modify the code too. After ravb_stop_dma() was called, the driver assumed any transfers were stopped. However, the current ravb_tx_timeout_work() doesn't check whether the ravb_stop_dma() is succeeded without any error or not. So, we should fix it. > >>>>>>> To fix the issue, just try to wake the subqueue on > >>>>>>> ravb_tx_timeout_work() if the descriptors are not full instead > >>>>>>> of stop all transfers (all queues of TX and RX). > >>>>>>> > >>>>>>> [1] > >>>>>>> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ > >>>>>>> > >>>>>>> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > >>>>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > >>>>>>> --- > >>>>>>> I'm guessing that this issue is possible to happen if: > >>>>>>> - ravb_start_xmit() calls netif_stop_subqueue(), and > >>>>>>> - ravb_poll() will not be called with some reason, and > >>>>>>> - netif_wake_subqueue() will be not called, and then > >>>>>>> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). > >>>>>>> > >>>>>>> However, unfortunately, I didn't reproduce the issue yet. > >>>>>>> To be honest, I'm also guessing other queues (SR) of this hardware > >>>>>>> which out-of tree driver manages are possible to reproduce this issue, > >>>>>>> but I didn't try such environment for now... > >>>>>>> > >>>>>>> So, I marked RFC on this patch now. > >>>>>> > >>>>>> I'm afraid, but do you have any comments about this patch? > >>>>> > >>>>> I agree that we should now reset only the stuck queue, not both but I > >>>>> doubt your solution is good enough. Let me have another look... > >>>> > >>>> Thank you for your comment! I hope this solution is good enough... > >>> > >>> I'm sorry again and again. But, do you have any time to look this patch? > >> > >> Yes, in the sense of reviewing -- I don't consider it complete. And no, in > >> the sense of looking into the issue myself... Can we do a per-queue tear-down > >> and re-init (not necessarily all in 1 patch)? > > In fact, it would ensue many changes... I think so. > > Thank you for your comment! I'm not sure this "re-init" mean. But, we can do > > Well, I meant the ring re-allocation and re-formatting... but (looking at > sh_eth) the former is not really necessary, it's enough to just stop the TX > ring and then re-format it and re-start... I got it. I also think the ring re-allocation is not really necessary. > Well, unfortunately, the way I > structured the code, we can't do *just* that... I agree. We need refactoring for it. > > a per-queue tear-down if DMAC is still working. And, we can prepare new descriptors > > for the queue after tear-down. > > > > < Tear-down > > > 1. Set DT_EOS to the desc_bat[queue]. > > 2. Set DLR.LBAx to 1. > > 3. Check if DLA.LBAx is cleared. > > DLR.LBAx, you mean? Yes. I heard this procedure from BSP team. > Well, I was thinking of polling TCCR and CSR like the current > ravb_stop_dma() does, but if that works... I'm not sure whether polling TCCR and CSR is enough or not. Instead of polling those registers, maybe we should poll whether ravb_stop_dma() is succeeded or not? Especially, result of ravb_config() is a key point whether the hardware is really stopped or not. So, I'm thinking that just polling the ravb_stop_dma() in ravb_tx_timeout_work() is better than the per-queue tear-down and re-init now. But, what do you think? > > < Prepare new descriptors and start again > > > 4. Prepare new descriptors. > > That's where the cause for using the workqueue lies -- the descriptors are > allocated with GFP_KERNEL, not GFP_ATOMIC... IIUC, we can avoid to use the workqueue if re-allocation is not really necessary. > if you have time/desire to > untangle all this, I'd appreciate it; else I'd have to work on this in my > copious free time... :-) If we don't need refactoring, I think I can do it :) > > 5. Set DT_LINKFIX to the desc_bat[queue]. > > 6. Set DLR.LBAx to 1. > > 7. Check if DLA.LBAx is cleared. > > > > > > I'm thinking doing tear-down and "re-init" of the descriptors is better > > than just try to wake the subqueue. But, what do you think? > > Definitely better, yes... I got it. Best regards, Yoshihiro Shimoda > > Best regards, > > Yoshihiro Shimoda > > MBR, Sergei
Hello! Sorry about another late reply, was having h/w issues on the new work... On 07/06/2020 12:25 PM, Yoshihiro Shimoda wrote: >>>>>>>>> From: Yoshihiro Shimoda, Sent: Tuesday, May 26, 2020 6:47 PM >>>>>>>>> >>>>>>>>> According to the report of [1], this driver is possible to cause >>>>>>>>> the following error in ravb_tx_timeout_work(). >>>>>>>>> >>>>>>>>> ravb e6800000.ethernet ethernet: failed to switch device to config mode >>>>>>>>> >>>>>>>>> This error means that the hardware could not change the state >>>>>>>>> from "Operation" to "Configuration" while some tx queue is operating. >>>>>>>>> After that, ravb_config() in ravb_dmac_init() will fail, and then >>>>>>>>> any descriptors will be not allocaled anymore so that NULL porinter >> >> Pointer. :-) > > Oops! I should fix it :) > >>>>>>>>> dereference happens after that on ravb_start_xmit(). >>>>>>>>> >>>>>>>>> Such a case is possible to be caused because this driver supports >>>>>>>>> two queues (NC and BE) and the ravb_stop_dma() is possible to return >>>>>>>>> without any stopping process if TCCR or CSR register indicates >>>>>>>>> the hardware is operating for TX. >> >> Maybe we should just fix those blind assumptions? > > Maybe I should have described some facts instead of assumptions like below? > If so, I should modify the code too. > > After ravb_stop_dma() was called, the driver assumed any transfers were > stopped. However, the current ravb_tx_timeout_work() doesn't check whether > the ravb_stop_dma() is succeeded without any error or not. So, we should > fix it. Yes. Better a stuck TX queue (with a chance to recover) than kernel oops... >>>>>>>>> To fix the issue, just try to wake the subqueue on >>>>>>>>> ravb_tx_timeout_work() if the descriptors are not full instead >>>>>>>>> of stop all transfers (all queues of TX and RX). >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ >>>>>>>>> >>>>>>>>> Reported-by: Dirk Behme <dirk.behme@de.bosch.com> >>>>>>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> >>>>>>>>> --- >>>>>>>>> I'm guessing that this issue is possible to happen if: >>>>>>>>> - ravb_start_xmit() calls netif_stop_subqueue(), and >>>>>>>>> - ravb_poll() will not be called with some reason, and >>>>>>>>> - netif_wake_subqueue() will be not called, and then >>>>>>>>> - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). >>>>>>>>> >>>>>>>>> However, unfortunately, I didn't reproduce the issue yet. >>>>>>>>> To be honest, I'm also guessing other queues (SR) of this hardware >>>>>>>>> which out-of tree driver manages are possible to reproduce this issue, >>>>>>>>> but I didn't try such environment for now... >>>>>>>>> >>>>>>>>> So, I marked RFC on this patch now. >>>>>>>> >>>>>>>> I'm afraid, but do you have any comments about this patch? >>>>>>> >>>>>>> I agree that we should now reset only the stuck queue, not both but I >>>>>>> doubt your solution is good enough. Let me have another look... >>>>>> >>>>>> Thank you for your comment! I hope this solution is good enough... >>>>> >>>>> I'm sorry again and again. But, do you have any time to look this patch? >>>> >>>> Yes, in the sense of reviewing -- I don't consider it complete. And no, in >>>> the sense of looking into the issue myself... Can we do a per-queue tear-down >>>> and re-init (not necessarily all in 1 patch)? >> >> In fact, it would ensue many changes... > > I think so. > >>> Thank you for your comment! I'm not sure this "re-init" mean. But, we can do >> >> Well, I meant the ring re-allocation and re-formatting... but (looking at >> sh_eth) the former is not really necessary, it's enough to just stop the TX >> ring and then re-format it and re-start... > > I got it. I also think the ring re-allocation is not really necessary. > >> Well, unfortunately, the way I >> structured the code, we can't do *just* that... > > I agree. We need refactoring for it. > >>> a per-queue tear-down if DMAC is still working. And, we can prepare new descriptors >>> for the queue after tear-down. >>> >>> < Tear-down > >>> 1. Set DT_EOS to the desc_bat[queue]. >>> 2. Set DLR.LBAx to 1. >>> 3. Check if DLA.LBAx is cleared. >> >> DLR.LBAx, you mean? > > Yes. I heard this procedure from BSP team. > >> Well, I was thinking of polling TCCR and CSR like the current >> ravb_stop_dma() does, but if that works... > > I'm not sure whether polling TCCR and CSR is enough or not. > Instead of polling those registers, maybe we should poll whether > ravb_stop_dma() is succeeded or not? Yes, if by polling you mean just checking the result of it. :-) > Especially, result of ravb_config() is > a key point whether the hardware is really stopped or not. > So, I'm thinking that just polling the ravb_stop_dma() in > ravb_tx_timeout_work() is better than the per-queue tear-down and > re-init now. But, what do you think? I don't think it's better since we're now supposed to handle a per-queue TX timeout (still not sure it's possible with this h/w). But of course, it's better as it's simple enough for a bug fix. >>> < Prepare new descriptors and start again > >>> 4. Prepare new descriptors. >> >> That's where the cause for using the workqueue lies -- the descriptors are >> allocated with GFP_KERNEL, not GFP_ATOMIC... > > IIUC, we can avoid to use the workqueue if re-allocation is not really necessary. > >> if you have time/desire to >> untangle all this, I'd appreciate it; else I'd have to work on this in my >> copious free time... :-) > > If we don't need refactoring, I think I can do it :) Let's go forward with the simple fix (assuming it fixes the original oops). [...] MBR, Sergei
Hello! > From: Sergei Shtylyov, Sent: Monday, July 20, 2020 4:20 AM > > Hello! > > Sorry about another late reply, was having h/w issues on the new work... No problem! :) Thank you for your reply! > On 07/06/2020 12:25 PM, Yoshihiro Shimoda wrote: <snip> > >> Maybe we should just fix those blind assumptions? > > > > Maybe I should have described some facts instead of assumptions like below? > > If so, I should modify the code too. > > > > After ravb_stop_dma() was called, the driver assumed any transfers were > > stopped. However, the current ravb_tx_timeout_work() doesn't check whether > > the ravb_stop_dma() is succeeded without any error or not. So, we should > > fix it. > > Yes. Better a stuck TX queue (with a chance to recover) than kernel oops... I got it. <snip> > >> Well, I was thinking of polling TCCR and CSR like the current > >> ravb_stop_dma() does, but if that works... > > > > I'm not sure whether polling TCCR and CSR is enough or not. > > Instead of polling those registers, maybe we should poll whether > > ravb_stop_dma() is succeeded or not? > > Yes, if by polling you mean just checking the result of it. :-) Yes, I intend to check the result of it :) > > Especially, result of ravb_config() is > > a key point whether the hardware is really stopped or not. > > So, I'm thinking that just polling the ravb_stop_dma() in > > ravb_tx_timeout_work() is better than the per-queue tear-down and > > re-init now. But, what do you think? > > I don't think it's better since we're now supposed to handle a per-queue > TX timeout (still not sure it's possible with this h/w). But of course, it's > better as it's simple enough for a bug fix. Thank you for your comment. Yes, I agree it's better to handle a per-queue TX timeout. However, I think we need refactoring for it. So, I'd like to fix a bug by simple code. > >>> < Prepare new descriptors and start again > > >>> 4. Prepare new descriptors. > >> > >> That's where the cause for using the workqueue lies -- the descriptors are > >> allocated with GFP_KERNEL, not GFP_ATOMIC... > > > > IIUC, we can avoid to use the workqueue if re-allocation is not really necessary. > > > >> if you have time/desire to > >> untangle all this, I'd appreciate it; else I'd have to work on this in my > >> copious free time... :-) > > > > If we don't need refactoring, I think I can do it :) > > Let's go forward with the simple fix (assuming it fixes the original oops). Thank you for your comment! I'll do that. Best regards, Yoshihiro Shimoda > [...] > > MBR, Sergei
diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index 9f88b5d..42cf41a 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -1021,7 +1021,6 @@ struct ravb_private { u32 cur_tx[NUM_TX_QUEUE]; u32 dirty_tx[NUM_TX_QUEUE]; struct napi_struct napi[NUM_RX_QUEUE]; - struct work_struct work; /* MII transceiver section. */ struct mii_bus *mii_bus; /* MDIO bus control */ int link; diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 067ad25..45e1ecd 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -1428,44 +1428,25 @@ static int ravb_open(struct net_device *ndev) static void ravb_tx_timeout(struct net_device *ndev, unsigned int txqueue) { struct ravb_private *priv = netdev_priv(ndev); + unsigned long flags; + bool wake = false; + + spin_lock_irqsave(&priv->lock, flags); + if (priv->cur_tx[txqueue] - priv->dirty_tx[txqueue] <= + (priv->num_tx_ring[txqueue] - 1) * priv->num_tx_desc) + wake = true; netif_err(priv, tx_err, ndev, - "transmit timed out, status %08x, resetting...\n", - ravb_read(ndev, ISS)); + "transmit timed out (%d %d), status %08x %08x %08x\n", + txqueue, wake, ravb_read(ndev, ISS), ravb_read(ndev, TCCR), + ravb_read(ndev, CSR)); + + if (wake) + netif_wake_subqueue(ndev, txqueue); /* tx_errors count up */ ndev->stats.tx_errors++; - - schedule_work(&priv->work); -} - -static void ravb_tx_timeout_work(struct work_struct *work) -{ - struct ravb_private *priv = container_of(work, struct ravb_private, - work); - struct net_device *ndev = priv->ndev; - - netif_tx_stop_all_queues(ndev); - - /* Stop PTP Clock driver */ - if (priv->chip_id == RCAR_GEN2) - ravb_ptp_stop(ndev); - - /* Wait for DMA stopping */ - ravb_stop_dma(ndev); - - ravb_ring_free(ndev, RAVB_BE); - ravb_ring_free(ndev, RAVB_NC); - - /* Device init */ - ravb_dmac_init(ndev); - ravb_emac_init(ndev); - - /* Initialise PTP Clock driver */ - if (priv->chip_id == RCAR_GEN2) - ravb_ptp_init(ndev, priv->pdev); - - netif_tx_start_all_queues(ndev); + spin_unlock_irqrestore(&priv->lock, flags); } /* Packet transmit function for Ethernet AVB */ @@ -2046,7 +2027,6 @@ static int ravb_probe(struct platform_device *pdev) } spin_lock_init(&priv->lock); - INIT_WORK(&priv->work, ravb_tx_timeout_work); error = of_get_phy_mode(np, &priv->phy_interface); if (error && error != -ENODEV)
According to the report of [1], this driver is possible to cause the following error in ravb_tx_timeout_work(). ravb e6800000.ethernet ethernet: failed to switch device to config mode This error means that the hardware could not change the state from "Operation" to "Configuration" while some tx queue is operating. After that, ravb_config() in ravb_dmac_init() will fail, and then any descriptors will be not allocaled anymore so that NULL porinter dereference happens after that on ravb_start_xmit(). Such a case is possible to be caused because this driver supports two queues (NC and BE) and the ravb_stop_dma() is possible to return without any stopping process if TCCR or CSR register indicates the hardware is operating for TX. To fix the issue, just try to wake the subqueue on ravb_tx_timeout_work() if the descriptors are not full instead of stop all transfers (all queues of TX and RX). [1] https://lore.kernel.org/linux-renesas-soc/20200518045452.2390-1-dirk.behme@de.bosch.com/ Reported-by: Dirk Behme <dirk.behme@de.bosch.com> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> --- I'm guessing that this issue is possible to happen if: - ravb_start_xmit() calls netif_stop_subqueue(), and - ravb_poll() will not be called with some reason, and - netif_wake_subqueue() will be not called, and then - dev_watchdog() in net/sched/sch_generic.c calls ndo_tx_timeout(). However, unfortunately, I didn't reproduce the issue yet. To be honest, I'm also guessing other queues (SR) of this hardware which out-of tree driver manages are possible to reproduce this issue, but I didn't try such environment for now... So, I marked RFC on this patch now. drivers/net/ethernet/renesas/ravb.h | 1 - drivers/net/ethernet/renesas/ravb_main.c | 48 ++++++++++---------------------- 2 files changed, 14 insertions(+), 35 deletions(-)