Message ID | 20230419115632.738730-1-yajun.deng@linux.dev (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: sched: print jiffies when transmit queue time out | expand |
On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <yajun.deng@linux.dev> wrote: > > Although there is watchdog_timeo to let users know when the transmit queue > begin stall, but dev_watchdog() is called with an interval. The jiffies > will always be greater than watchdog_timeo. > > To let users know the exact time the stall started, print jiffies when > the transmit queue time out. > > Signed-off-by: Yajun Deng <yajun.deng@linux.dev> > --- > atomic_long_inc(&txq->trans_timeout); > break; > } > @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t) > > if (unlikely(some_queue_timedout)) { > trace_net_dev_xmit_timeout(dev, i); > - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n", > - dev->name, netdev_drivername(dev), i); > + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \ > + transmit queue %u timed out %lu jiffies\n", > + dev->name, netdev_drivername(dev), i, some_queue_timedout); If we really want this, I suggest we export a time in ms units, using jiffies_to_msecs()
On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote: > Although there is watchdog_timeo to let users know when the transmit queue > begin stall, but dev_watchdog() is called with an interval. The jiffies > will always be greater than watchdog_timeo. > > To let users know the exact time the stall started, print jiffies when > the transmit queue time out. Please add an explanation of how this information is useful in practice.
April 20, 2023 9:27 AM, "Jakub Kicinski" <kuba@kernel.org> wrote: > On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote: > >> Although there is watchdog_timeo to let users know when the transmit queue >> begin stall, but dev_watchdog() is called with an interval. The jiffies >> will always be greater than watchdog_timeo. >> >> To let users know the exact time the stall started, print jiffies when >> the transmit queue time out. > > Please add an explanation of how this information is useful in practice. We found some cases with several warnings. We want to confirm which happened first. First warning: 16:37:57 kernel: [ 7100.097547] ------------[ cut here ]------------ 16:37:57 kernel: [ 7100.097550] NETDEV WATCHDOG: eno2 (i40e): transmit queue 8 timed out 16:37:57 kernel: [ 7100.097571] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270 ... Second warning: 16:38:44 kernel: [ 7147.756952] rcu: INFO: rcu_preempt self-detected stall on CPU 16:38:44 kernel: [ 7147.756958] rcu: 24-....: (59999 ticks this GP) idle=546/1/0x4000000000000000 softirq=367 3137/3673146 fqs=13844 16:38:44 kernel: [ 7147.756960] (t=60001 jiffies g=4322709 q=133381) 16:38:44 kernel: [ 7147.756962] NMI backtrace for cpu 24 ... As we can see, the transmit queue start stall should be before 16:37:52, the rcu start stall is 16:37:44. These two times are closer, we want to confirm which happened first.
April 19, 2023 8:02 PM, "Eric Dumazet" <edumazet@google.com> wrote: > On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <yajun.deng@linux.dev> wrote: > >> Although there is watchdog_timeo to let users know when the transmit queue >> begin stall, but dev_watchdog() is called with an interval. The jiffies >> will always be greater than watchdog_timeo. >> >> To let users know the exact time the stall started, print jiffies when >> the transmit queue time out. >> >> Signed-off-by: Yajun Deng <yajun.deng@linux.dev> >> --- >> >> atomic_long_inc(&txq->trans_timeout); >> break; >> } >> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t) >> >> if (unlikely(some_queue_timedout)) { >> trace_net_dev_xmit_timeout(dev, i); >> - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n", >> - dev->name, netdev_drivername(dev), i); >> + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \ >> + transmit queue %u timed out %lu jiffies\n", >> + dev->name, netdev_drivername(dev), i, some_queue_timedout); > > If we really want this, I suggest we export a time in ms units, using > jiffies_to_msecs() OK.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index a9aadc4e6858..67b78e260d28 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -502,7 +502,7 @@ static void dev_watchdog(struct timer_list *t) if (netif_device_present(dev) && netif_running(dev) && netif_carrier_ok(dev)) { - int some_queue_timedout = 0; + unsigned long some_queue_timedout = 0; unsigned int i; unsigned long trans_start; @@ -514,7 +514,7 @@ static void dev_watchdog(struct timer_list *t) if (netif_xmit_stopped(txq) && time_after(jiffies, (trans_start + dev->watchdog_timeo))) { - some_queue_timedout = 1; + some_queue_timedout = jiffies - trans_start; atomic_long_inc(&txq->trans_timeout); break; } @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t) if (unlikely(some_queue_timedout)) { trace_net_dev_xmit_timeout(dev, i); - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n", - dev->name, netdev_drivername(dev), i); + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \ + transmit queue %u timed out %lu jiffies\n", + dev->name, netdev_drivername(dev), i, some_queue_timedout); netif_freeze_queues(dev); dev->netdev_ops->ndo_tx_timeout(dev, i); netif_unfreeze_queues(dev);
Although there is watchdog_timeo to let users know when the transmit queue begin stall, but dev_watchdog() is called with an interval. The jiffies will always be greater than watchdog_timeo. To let users know the exact time the stall started, print jiffies when the transmit queue time out. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> --- net/sched/sch_generic.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)