net: sched: print jiffies when transmit queue time out

Message ID	20230419115632.738730-1-yajun.deng@linux.dev
State	Changes Requested
Delegated to:	Netdev Maintainers
Headers	Return-Path: <netdev-owner@vger.kernel.org> From: Yajun Deng <yajun.deng@linux.dev> To: jhs@mojatatu.com, xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Yajun Deng <yajun.deng@linux.dev> Subject: [PATCH] net: sched: print jiffies when transmit queue time out Date: Wed, 19 Apr 2023 19:56:32 +0800 Message-Id: <20230419115632.738730-1-yajun.deng@linux.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	net: sched: print jiffies when transmit queue time out

Context	Check	Description
netdev/series_format	warning	Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 26 this patch: 26
netdev/cc_maintainers	success	CCed 8 of 8 maintainers
netdev/build_clang	success	Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 26 this patch: 26
netdev/checkpatch	warning	WARNING: Avoid line continuations in quoted strings WARNING: Possible unnecessary KERN_INFO WARNING: line length of 101 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Yajun Deng April 19, 2023, 11:56 a.m. UTC

Although watchdog_timeo lets users know roughly when the transmit queue
began to stall, dev_watchdog() is only called at intervals, so the elapsed
jiffies will always be greater than watchdog_timeo.

To let users know the exact time the stall started, print the elapsed
jiffies when the transmit queue times out.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
---
 net/sched/sch_generic.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
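
The check described in the commit message can be modelled in userspace
roughly as follows. This is only a sketch: the hunk that assigns
some_queue_timedout is not shown in this thread, so returning the elapsed
jiffies is an assumption, and the model_* names are hypothetical helpers,
not kernel APIs.

```c
#include <stdbool.h>

/* Userspace model of the kernel's wraparound-safe time_after(a, b):
 * true when a is later than b, even if the counter has wrapped. */
static bool model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Hypothetical helper: returns 0 while the queue is within its timeout,
 * otherwise the number of jiffies elapsed since the last transmit.
 * Assumption: this is the value the patch stores in some_queue_timedout.
 */
static unsigned long model_queue_stalled_for(unsigned long now,
                                             unsigned long trans_start,
                                             unsigned long watchdog_timeo)
{
    if (model_time_after(now, trans_start + watchdog_timeo))
        return now - trans_start;
    return 0;
}
```

Because the watchdog only runs periodically, the returned value is larger
than watchdog_timeo whenever it is non-zero, which is the gap the commit
message refers to.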

Eric Dumazet April 19, 2023, 12:02 p.m. UTC | #1

On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <yajun.deng@linux.dev> wrote:
>
> Although watchdog_timeo lets users know roughly when the transmit queue
> began to stall, dev_watchdog() is only called at intervals, so the elapsed
> jiffies will always be greater than watchdog_timeo.
>
> To let users know the exact time the stall started, print the elapsed
> jiffies when the transmit queue times out.
>
> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
> ---


>                                         atomic_long_inc(&txq->trans_timeout);
>                                         break;
>                                 }
> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>
>                         if (unlikely(some_queue_timedout)) {
>                                 trace_net_dev_xmit_timeout(dev, i);
> -                               WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
> -                                      dev->name, netdev_drivername(dev), i);
> +                               WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
> +                                         transmit queue %u timed out %lu jiffies\n",
> +                                         dev->name, netdev_drivername(dev), i, some_queue_timedout);

If we really want this, I suggest we export a time in ms units, using
jiffies_to_msecs()
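
A rough userspace sketch of that suggestion follows. It is not the
kernel's jiffies_to_msecs(); it only approximates it for a tick rate
passed in as hz (standing in for CONFIG_HZ). The message wording and the
model_* names are assumptions for illustration. The format string is kept
on one line and has no KERN_INFO prefix, in line with the checkpatch
warnings listed above.

```c
#include <stdio.h>

/* Approximation of jiffies_to_msecs() for a given tick rate. */
static unsigned int model_jiffies_to_msecs(unsigned long j, unsigned int hz)
{
    return (unsigned int)((unsigned long long)j * 1000ULL / hz);
}

/* Hypothetical formatter: builds the watchdog message with the stall
 * duration in milliseconds instead of raw jiffies. */
static int model_format_timeout(char *buf, size_t len,
                                const char *dev, const char *drv,
                                unsigned int queue,
                                unsigned long stalled_jiffies,
                                unsigned int hz)
{
    return snprintf(buf, len,
                    "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out %u ms\n",
                    dev, drv, queue,
                    model_jiffies_to_msecs(stalled_jiffies, hz));
}
```

Reporting milliseconds keeps the message independent of the kernel's HZ
setting, so logs from differently configured machines can be compared
directly.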

Jakub Kicinski April 20, 2023, 1:27 a.m. UTC | #2

On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote:
> Although watchdog_timeo lets users know roughly when the transmit queue
> began to stall, dev_watchdog() is only called at intervals, so the elapsed
> jiffies will always be greater than watchdog_timeo.
> 
> To let users know the exact time the stall started, print the elapsed
> jiffies when the transmit queue times out.

Please add an explanation of how this information is useful in practice.

Yajun Deng April 20, 2023, 2:17 a.m. UTC | #3

April 20, 2023 9:27 AM, "Jakub Kicinski" <kuba@kernel.org> wrote:

> On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote:
> 
>> Although watchdog_timeo lets users know roughly when the transmit queue
>> began to stall, dev_watchdog() is only called at intervals, so the elapsed
>> jiffies will always be greater than watchdog_timeo.
>> 
>> To let users know the exact time the stall started, print the elapsed
>> jiffies when the transmit queue times out.
> 
> Please add an explanation of how this information is useful in practice.

We found some cases with several warnings. We want to confirm which happened first. 

First warning:
16:37:57 kernel: [ 7100.097547] ------------[ cut here ]------------
16:37:57 kernel: [ 7100.097550] NETDEV WATCHDOG: eno2 (i40e): transmit queue 8 timed out
16:37:57 kernel: [ 7100.097571] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
...

Second warning:
16:38:44 kernel: [ 7147.756952] rcu: INFO: rcu_preempt self-detected stall on CPU
16:38:44 kernel: [ 7147.756958] rcu:   24-....: (59999 ticks this GP) idle=546/1/0x4000000000000000 softirq=3673137/3673146 fqs=13844
16:38:44 kernel: [ 7147.756960]        (t=60001 jiffies g=4322709 q=133381)
16:38:44 kernel: [ 7147.756962] NMI backtrace for cpu 24
...

As we can see, the transmit queue stall must have started before 16:37:52, and the RCU stall started at 16:37:44.
These two times are close, so we want to confirm which happened first.
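
The comparison above can be written out as a small calculation on the
kernel log timestamps. This is a sketch under assumptions not stated in
the logs: HZ=1000 (so 60001 jiffies is about 60 s) and a 5 s
watchdog_timeo (implied by "before 16:37:52"). All names are hypothetical.

```c
enum model_first_event { MODEL_TX_FIRST, MODEL_RCU_FIRST, MODEL_SAME_TIME };

/*
 * Each stall start is the log timestamp minus the reported duration,
 * all in milliseconds since boot. The RCU duration is given in jiffies
 * and converted with the assumed tick rate hz.
 */
static enum model_first_event
model_which_first(long long tx_logged_ms, long long tx_stalled_ms,
                  long long rcu_logged_ms, long long rcu_stalled_jiffies,
                  long long hz)
{
    long long tx_start = tx_logged_ms - tx_stalled_ms;
    long long rcu_start = rcu_logged_ms - rcu_stalled_jiffies * 1000 / hz;

    if (tx_start < rcu_start)
        return MODEL_TX_FIRST;
    if (rcu_start < tx_start)
        return MODEL_RCU_FIRST;
    return MODEL_SAME_TIME;
}
```

With only the 5 s lower bound for the transmit stall (7100097 ms logged,
5000 ms stalled) the RCU stall appears to come first, but a real stall of
13 s or more would reverse the order. That ambiguity is why the exact
stall duration in the warning would help.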

Yajun Deng April 20, 2023, 2:49 a.m. UTC | #4

April 19, 2023 8:02 PM, "Eric Dumazet" <edumazet@google.com> wrote:

> On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <yajun.deng@linux.dev> wrote:
> 
>> Although watchdog_timeo lets users know roughly when the transmit queue
>> began to stall, dev_watchdog() is only called at intervals, so the elapsed
>> jiffies will always be greater than watchdog_timeo.
>> 
>> To let users know the exact time the stall started, print the elapsed
>> jiffies when the transmit queue times out.
>> 
>> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
>> ---
>> 
>> atomic_long_inc(&txq->trans_timeout);
>> break;
>> }
>> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>> 
>> if (unlikely(some_queue_timedout)) {
>> trace_net_dev_xmit_timeout(dev, i);
>> - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>> - dev->name, netdev_drivername(dev), i);
>> + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
>> + transmit queue %u timed out %lu jiffies\n",
>> + dev->name, netdev_drivername(dev), i, some_queue_timedout);
> 
> If we really want this, I suggest we export a time in ms units, using
> jiffies_to_msecs()

OK.
