[RFC,27/30] Code tagging based latency tracking

Message ID	20220830214919.53220-28-surenb@google.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Date: Tue, 30 Aug 2022 14:49:16 -0700 In-Reply-To: <20220830214919.53220-1-surenb@google.com> Mime-Version: 1.0 References: <20220830214919.53220-1-surenb@google.com> Message-ID: <20220830214919.53220-28-surenb@google.com> Subject: [RFC PATCH 27/30] Code tagging based latency tracking From: Suren Baghdasaryan <surenb@google.com> To: akpm@linux-foundation.org Cc: kent.overstreet@linux.dev, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, roman.gushchin@linux.dev, mgorman@suse.de, dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com, void@manifault.com, peterz@infradead.org, juri.lelli@redhat.com, ldufour@linux.ibm.com, peterx@redhat.com, david@redhat.com, axboe@kernel.dk, mcgrof@kernel.org, masahiroy@kernel.org, nathan@kernel.org, changbin.du@intel.com, ytcoode@gmail.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, bristot@redhat.com, vschneid@redhat.com, cl@linux.com, penberg@kernel.org, iamjoonsoo.kim@lge.com, 42.hyeyoo@gmail.com, glider@google.com, elver@google.com, dvyukov@google.com, shakeelb@google.com, songmuchun@bytedance.com, arnd@arndb.de, jbaron@akamai.com, rientjes@google.com, minchan@google.com, kaleshsingh@google.com, surenb@google.com, kernel-team@android.com, linux-mm@kvack.org, iommu@lists.linux.dev, kasan-dev@googlegroups.com, io-uring@vger.kernel.org, linux-arch@vger.kernel.org, xen-devel@lists.xenproject.org, linux-bcache@vger.kernel.org, linux-modules@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Code tagging framework and applications \| expand [RFC,00/30] Code tagging framework and applications [RFC,01/30] kernel/module: move find_kallsyms_symbol_value declaration [RFC,02/30] lib/string_helpers: Drop space in string_get_size's output [RFC,03/30] Lazy percpu counters [RFC,04/30] scripts/kallysms: Always include __start and __stop symbols [RFC,05/30] lib: code tagging framework [RFC,06/30] lib: code tagging module support [RFC,07/30] lib: add support for allocation tagging [RFC,08/30] lib: introduce page allocation tagging [RFC,09/30] change alloc_pages name in dma_map_ops to avoid name conflicts [RFC,10/30] mm: enable page allocation tagging for __get_free_pages and alloc_pages [RFC,11/30] mm: introduce slabobj_ext to support slab object extensions [RFC,12/30] mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creation [RFC,13/30] mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation [RFC,14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects [RFC,15/30] lib: introduce slab allocation tagging [RFC,16/30] mm: enable slab allocation tagging for kmalloc and friends [RFC,17/30] lib/string.c: strsep_no_empty() [RFC,18/30] codetag: add codetag query helper functions [RFC,19/30] move stack capture functionality into a separate function for reuse [RFC,20/30] lib: introduce support for storing code tag context [RFC,21/30] lib: implement context capture support for page and slab allocators [RFC,22/30] Code tagging based fault injection [RFC,23/30] timekeeping: Add a missing include [RFC,24/30] wait: Clean up waitqueue_entry initialization [RFC,25/30] lib/time_stats: New library for statistics on events [RFC,26/30] bcache: Convert to lib/time_stats [RFC,27/30] Code tagging based latency tracking [RFC,28/30] Improved symbolic error names [RFC,29/30] dyndbg: Convert to code tagging [RFC,30/30] MAINTAINERS: Add entries for code tagging & related

Suren Baghdasaryan Aug. 30, 2022, 9:49 p.m. UTC

From: Kent Overstreet <kent.overstreet@linux.dev>

This adds the ability to easily instrument code for measuring latency.
To use, add the following to calls to your code, at the start and end of
the event you wish to measure:

  code_tag_time_stats_start(start_time);
  code_tag_time_stats_finish(start_time);

Stastistics will then show up in debugfs under
/sys/kernel/debug/time_stats, listed by file and line number.

Stastics measured include weighted averages of frequency, duration, max
duration, as well as quantiles.

This patch also instruments all calls to init_wait and finish_wait,
which includes all calls to wait_event. Example debugfs output:

fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
count:          17
rate:           0/sec
frequency:      2 sec
avg duration:   10 us
max duration:   232 us
quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128

lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
count:          3
rate:           0/sec
frequency:      4 sec
avg duration:   4 sec
max duration:   4 sec
quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048

net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
count:          10
rate:           1/sec
frequency:      859 ms
avg duration:   472 ms
max duration:   30 sec
quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 17217 17217 17217 17217

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/asm-generic/codetag.lds.h  |   3 +-
 include/linux/codetag_time_stats.h |  54 +++++++++++
 include/linux/io_uring_types.h     |   2 +-
 include/linux/wait.h               |  22 ++++-
 kernel/sched/wait.c                |   6 +-
 lib/Kconfig.debug                  |   8 ++
 lib/Makefile                       |   1 +
 lib/codetag_time_stats.c           | 143 +++++++++++++++++++++++++++++
 8 files changed, 233 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/codetag_time_stats.h
 create mode 100644 lib/codetag_time_stats.c

Randy Dunlap Aug. 31, 2022, 1:53 a.m. UTC | #1

On 8/30/22 14:49, Suren Baghdasaryan wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index b7d03afbc808..b0f86643b8f0 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1728,6 +1728,14 @@ config LATENCYTOP
>  	  Enable this option if you want to use the LatencyTOP tool
>  	  to find out which userspace is blocking on what kernel operations.
>  
> +config CODETAG_TIME_STATS
> +	bool "Code tagging based latency measuring"
> +	depends on DEBUG_FS
> +	select TIME_STATS
> +	select CODE_TAGGING
> +	help
> +	  Enabling this option makes latency statistics available in debugfs

Missing period at the end of the sentence.

Suren Baghdasaryan Aug. 31, 2022, 3:55 p.m. UTC | #2

On Tue, Aug 30, 2022 at 6:53 PM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 8/30/22 14:49, Suren Baghdasaryan wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index b7d03afbc808..b0f86643b8f0 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1728,6 +1728,14 @@ config LATENCYTOP
> >         Enable this option if you want to use the LatencyTOP tool
> >         to find out which userspace is blocking on what kernel operations.
> >
> > +config CODETAG_TIME_STATS
> > +     bool "Code tagging based latency measuring"
> > +     depends on DEBUG_FS
> > +     select TIME_STATS
> > +     select CODE_TAGGING
> > +     help
> > +       Enabling this option makes latency statistics available in debugfs
>
> Missing period at the end of the sentence.

Ack.

>
> --
> ~Randy

Peter Zijlstra Sept. 1, 2022, 7:11 a.m. UTC | #3

On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet <kent.overstreet@linux.dev>
> 
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
> 
>   code_tag_time_stats_start(start_time);
>   code_tag_time_stats_finish(start_time);
> 
> Stastistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
> 
> Stastics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
> 
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:

How can't you do this with a simple eBPF script on top of
trace_sched_stat_* and friends?

Kent Overstreet Sept. 1, 2022, 2:43 p.m. UTC | #4

On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > 
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> > 
> >   code_tag_time_stats_start(start_time);
> >   code_tag_time_stats_finish(start_time);
> > 
> > Stastistics will then show up in debugfs under
> > /sys/kernel/debug/time_stats, listed by file and line number.
> > 
> > Stastics measured include weighted averages of frequency, duration, max
> > duration, as well as quantiles.
> > 
> > This patch also instruments all calls to init_wait and finish_wait,
> > which includes all calls to wait_event. Example debugfs output:
> 
> How can't you do this with a simple eBPF script on top of
> trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've
never succesfully used them for debugging latency issues, or known anyone who
has.

And an eBPF script to do everything this does wouldn't be simple at all.
Honesly, the time stats stuff looks _far_ simpler to me than anything involving
tracing - and with tracing you have to correlate the start and end events after
the fact.

Steven Rostedt Sept. 1, 2022, 9:38 p.m. UTC | #5

On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> From: Kent Overstreet <kent.overstreet@linux.dev>
> 
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
> 
>   code_tag_time_stats_start(start_time);
>   code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

> 
> Stastistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
> 
> Stastics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
> 
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
> 
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count:          17
> rate:           0/sec
> frequency:      2 sec
> avg duration:   10 us
> max duration:   232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
> 
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count:          3
> rate:           0/sec
> frequency:      4 sec
> avg duration:   4 sec
> max duration:   4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048
> 
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count:          10
> rate:           1/sec
> frequency:      859 ms
> avg duration:   472 ms
> max duration:   30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 17217 17217 17217 17217

For function length you could just do something like this:

 # cd /sys/kernel/tracing
 # echo __skb_wait_for_more_packets > set_ftrace_filter
 # echo 1 > function_profile_enabled
 # cat trace_stat/function*
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  __skb_wait_for_more_packets              1    0.000 us        0.000 us        0.000 us    
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  __skb_wait_for_more_packets              1    74.813 us       74.813 us       0.000 us    
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---

The above is for a 4 CPU machine. The s^2 is the square of the standard
deviation (makes not having to do divisions while it runs).

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
  if not already installed on your distro.

 # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as comm field), the pid of
woken task and the time delta in microseconds between the two events.

 # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram tracing the name of the woken task, the
priority and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).

 # chrt -f 20 sleep 1

Force something to be woken up that is interesting.

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/5                                       , prio:          0, delta: ~ 10-19 } hitcount:          1
{ comm: migration/2                                       , prio:          0, delta: ~ 10-19 } hitcount:          1
{ comm: sleep                                             , prio:         79, delta: ~ 10-19 } hitcount:          1
{ comm: migration/7                                       , prio:          0, delta: ~ 10-19 } hitcount:          1
{ comm: migration/4                                       , prio:          0, delta: ~ 10-19 } hitcount:          1
{ comm: migration/6                                       , prio:          0, delta: ~ 10-19 } hitcount:          1
{ comm: migration/1                                       , prio:          0, delta: ~ 10-19 } hitcount:          2
{ comm: migration/0                                       , prio:          0, delta: ~ 10-19 } hitcount:          1
{ comm: migration/2                                       , prio:          0, delta: ~ 20-29 } hitcount:          1
{ comm: migration/0                                       , prio:          0, delta: ~ 20-29 } hitcount:          1

Totals:
    Hits: 11
    Entries: 10
    Dropped: 0

That is a histogram of the wakeup latency of all real time tasks that woke
up. Oh, and it does not drop events unless the number of entries is bigger
than the size of the count of buckets, which I haven't actually
encountered, as there's 2048 buckets. But you can make it bigger with the
"size" attribute in the creation of the histogram.

-- Steve

Steven Rostedt Sept. 1, 2022, 9:46 p.m. UTC | #6

On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

>  # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above could almost be done with sqlhist (but I haven't implemented
"buckets=10" yet because that's a new feature. But for now, let's do log2):

 # sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'

("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ comm: migration/4                                       , prio:          0, delta: ~ 2^5  } hitcount:          1
{ comm: migration/0                                       , prio:          0, delta: ~ 2^4  } hitcount:          2
{ comm: rtkit-daemon                                      , prio:          0, delta: ~ 2^7  } hitcount:          2
{ comm: rtkit-daemon                                      , prio:          0, delta: ~ 2^6  } hitcount:          4
{ comm: migration/0                                       , prio:          0, delta: ~ 2^5  } hitcount:          8
{ comm: migration/4                                       , prio:          0, delta: ~ 2^4  } hitcount:          9
{ comm: migration/2                                       , prio:          0, delta: ~ 2^4  } hitcount:         10
{ comm: migration/5                                       , prio:          0, delta: ~ 2^4  } hitcount:         10
{ comm: migration/7                                       , prio:          0, delta: ~ 2^4  } hitcount:         10
{ comm: migration/1                                       , prio:          0, delta: ~ 2^4  } hitcount:         10
{ comm: migration/6                                       , prio:          0, delta: ~ 2^4  } hitcount:         10

Totals:
    Hits: 76
    Entries: 11
    Dropped: 0


-- Steve

Kent Overstreet Sept. 1, 2022, 9:54 p.m. UTC | #7

On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
> 
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > 
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> > 
> >   code_tag_time_stats_start(start_time);
> >   code_tag_time_stats_finish(start_time);
> 
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of the
total effort.

Having done so once, why not annotate that in the source code?

> For function length you could just do something like this:
> 
>  # cd /sys/kernel/tracing
>  # echo __skb_wait_for_more_packets > set_ftrace_filter
>  # echo 1 > function_profile_enabled
>  # cat trace_stat/function*
>   Function                               Hit    Time            Avg             s^2
>   --------                               ---    ----            ---             ---
>   __skb_wait_for_more_packets              1    0.000 us        0.000 us        0.000 us    
>   Function                               Hit    Time            Avg             s^2
>   --------                               ---    ----            ---             ---
>   __skb_wait_for_more_packets              1    74.813 us       74.813 us       0.000 us    
>   Function                               Hit    Time            Avg             s^2
>   --------                               ---    ----            ---             ---
>   Function                               Hit    Time            Avg             s^2
>   --------                               ---    ----            ---             ---
> 
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (makes not having to do divisions while it runs).
> 
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
> 
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
>   if not already installed on your distro.
> 
>  # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'
> 
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups will
cause one wait_event() call to be reported as multiple smaller waits, not one
long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing performance
work, but I expect that to be far too heavy to enable on a live server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)

Steven Rostedt Sept. 1, 2022, 10:34 p.m. UTC | #8

On Thu, 1 Sep 2022 17:54:38 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> So this looks like it's gotten better since I last looked, but it's still not
> there yet.
> 
> Part of the problem is that the tracepoints themselves are in the wrong place:
> your end event is when a task is woken up, but that means spurious wakeups will

The end event is when a task is scheduled onto the CPU. The start event is
the first time it is woken up.

> cause one wait_event() call to be reported as multiple smaller waits, not one
> long wait - oops, now I can't actually find the thing that's causing my
> multi-second delay.
> 
> Also, in your example you don't have it broken out by callsite. That would be
> the first thing I'd need for any real world debugging.

OK, how about this (currently we can only have 3 keys, but you can create
multiple histograms on the same event).

 # echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

(notice the "stacktrace" in the keys)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/2                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: migration/5                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: migration/1                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: migration/7                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: migration/0                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         start_kernel+0x595/0x5be
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: migration/4                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: rtkit-daemon                                      , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         preempt_schedule_common+0x2d/0x70
         preempt_schedule_thunk+0x16/0x18
         _raw_spin_unlock_irq+0x2e/0x40
         eventfd_write+0xc8/0x290
         vfs_write+0xc0/0x2a0
         ksys_write+0x5f/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 10-19} hitcount:          1
{ comm: migration/6                                       , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:          7
{ comm: rtkit-daemon                                      , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 20-29} hitcount:          1
{ comm: rtkit-daemon                                      , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         preempt_schedule_common+0x2d/0x70
         preempt_schedule_thunk+0x16/0x18
         _raw_spin_unlock_irq+0x2e/0x40
         eventfd_write+0xc8/0x290
         vfs_write+0xc0/0x2a0
         ksys_write+0x5f/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 30-39} hitcount:          1
{ comm: rtkit-daemon                                      , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 40-49} hitcount:          1

Totals:
    Hits: 53
    Entries: 11
    Dropped: 0


Not the prettiest thing to read. But hey, we got the full stack of where
these latencies happened!

Yes, it adds some overhead when the events are triggered due to the
stacktrace code, but it's extremely useful information.

> 
> So, it looks like tracing has made some progress over the past 10 years,
> but for debugging latency issues it's still not there yet in general. I

I call BS on that statement. Just because you do not know what has been
added to the kernel in the last 10 years (like you had no idea about
seq_buf and that was added in 2014) means to me that you are totally
clueless on what tracing can and can not do.

It appears to me that you are too focused on inventing your own wheel that
does exactly what you want before looking to see how things are today. Just
because something didn't fit your needs 10 years ago doesn't mean that it
can't fit your needs today.


> will definitely remember function latency tracing the next time I'm doing
> performance work, but I expect that to be far too heavy to enable on a
> live server.

I run it on production machines all the time. With the filtering in place
it has very little overhead. Mostly in the noise. The best part is that it
has practically zero overhead (but can add some cache pressure) when it's
off, and can be turned on at run time.

The tracing infrastructure is very modular, you can use parts of it that
you need, without the overhead of other parts. Like you found out this week
that tracepoints are not the same as trace events. Because tracepoints are
just a hook in the code that anything can attach to (that's what Daniel's
RV work does). Trace events provide the stored data to be recorded.

I will note that the current histogram code overhead has increased due to
retpolines, but I have code to convert them from indirect calls to direct
calls via a switch statement which drops the overhead by 20%!

  https://lore.kernel.org/all/20220823214606.344269352@goodmis.org/


> 
> This thing is only a couple hundred lines of code though, so perhaps
> tracing shouldn't be the only tool in our toolbox :)

I'm already getting complaints from customers/users that are saying there's
too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
idea is to have the tools using mostly the same infrastructure, and not be
100% off on its own, unless there's a clear reason to invent a new wheel
that several people are asking for, not just one or two.

-- Steve

Kent Overstreet Sept. 1, 2022, 10:55 p.m. UTC | #9

On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > So this looks like it's gotten better since I last looked, but it's still not
> > there yet.
> > 
> > Part of the problem is that the tracepoints themselves are in the wrong place:
> > your end event is when a task is woken up, but that means spurious wakeups will
> 
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random, potentially
spurious wakeup.

> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though!

That means I have a ton more output to sort through, and the data is far more
expensive to collect.

I don't know why it's what people go to first - see the page_owner stuff - but
that doesn't get used much either because the output is _really hard to sort
through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.

> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
> 
> > 
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general. I
> 
> I call BS on that statement. Just because you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf and that was added in 2014) means to me that you are totally
> clueless on what tracing can and can not do.
> 
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the both
of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users that are saying there's
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
> idea is to have the tools using mostly the same infrastructure, and not be
> 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability.

That means, in a best case scenario, always-on data collection that I can just
look at, and it'll already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that..?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...

Steven Rostedt Sept. 2, 2022, 12:23 a.m. UTC | #10

On Thu, 1 Sep 2022 18:55:15 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Sep 2022 17:54:38 -0400
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:  
> > > 
> > > So this looks like it's gotten better since I last looked, but it's still not
> > > there yet.
> > > 
> > > Part of the problem is that the tracepoints themselves are in the wrong place:
> > > your end event is when a task is woken up, but that means spurious wakeups will  
> > 
> > The end event is when a task is scheduled onto the CPU. The start event is
> > the first time it is woken up.  
> 
> Yeah, that's not what I want. You're just tracing latency due to having more
> processes runnable than CPUs.
> 
> I don't care about that for debugging, though! I specifically want latency at
> the wait_event() level, and related - every time a process blocked _on some
> condition_, until that condition became true. Not until some random, potentially
> spurious wakeup.

Ideally this would be better if we could pass the stack trace from one
event to the next, but that wouldn't be too hard to implement. It just
needs to be done.

But anyway:

 # echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events

// created an event on prepare_to_wait_event as that's usually called just
// before wait event.

 # sqlhist -e -n wait_sched 'select start.common_pid as pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start join sched_switch as end on start.common_pid = end.prev_pid where end.prev_state & 3'

// Create a "wait_sched" event that traces the time between the
// prepare_to_wait_event call and the scheduler. Only trigger it if the
// schedule happens in the interruptible or uninterruptible states.

 # sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as end on start.pid = end.next_pid where start.delta < 50'

// Now attach the wait_event to the sched_switch where the task gets
// scheduled back in. But we are only going to care if the delta between
// the prepare_to_wait_event and the schedule is less that 50us. This is a
// hack to just care about where a prepare_to_wait_event was done just before
// scheduling out.

 # echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > /sys/kernel/tracing/events/synthetic/wake_sched/trigger

// Now we are going to look at the deltas that the task was sleeping for an
// event. But this just gives pids and deltas.

 # echo 'hist:keys=pid,stacktrace if delta < 50' >> /sys/kernel/tracing/events/synthetic/wait_sched/trigger

// And this is to get the backtraces of where the task was. This is because
// the stack trace is not available at the schedule in, because the
// sched_switch can only give the stack trace of when a task schedules out.
// Again, this is somewhat a hack.

 # cat /sys/kernel/tracing/events/synthetic/wake_sched/hist
# event histogram
#
# trigger info: hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 [active]
#

{ pid:       2114, delta2: ~ 10-19 } hitcount:          1
{ pid:       1389, delta2: ~ 160-169 } hitcount:          1
{ pid:       1389, delta2: ~ 660-669 } hitcount:          1
{ pid:       1389, delta2: ~ 1020-1029 } hitcount:          1
{ pid:       1189, delta2: ~ 500020-500029 } hitcount:          1
{ pid:       1189, delta2: ~ 500030-500039 } hitcount:          1
{ pid:       1195, delta2: ~ 500030-500039 } hitcount:          2
{ pid:       1189, delta2: ~ 500040-500049 } hitcount:         10
{ pid:       1193, delta2: ~ 500040-500049 } hitcount:          3
{ pid:       1197, delta2: ~ 500040-500049 } hitcount:          3
{ pid:       1195, delta2: ~ 500040-500049 } hitcount:          9
{ pid:       1190, delta2: ~ 500050-500059 } hitcount:         55
{ pid:       1197, delta2: ~ 500050-500059 } hitcount:         51
{ pid:       1191, delta2: ~ 500050-500059 } hitcount:         61
{ pid:       1198, delta2: ~ 500050-500059 } hitcount:         56
{ pid:       1195, delta2: ~ 500050-500059 } hitcount:         48
{ pid:       1192, delta2: ~ 500050-500059 } hitcount:         54
{ pid:       1194, delta2: ~ 500050-500059 } hitcount:         50
{ pid:       1196, delta2: ~ 500050-500059 } hitcount:         57
{ pid:       1189, delta2: ~ 500050-500059 } hitcount:         48
{ pid:       1193, delta2: ~ 500050-500059 } hitcount:         52
{ pid:       1194, delta2: ~ 500060-500069 } hitcount:         12
{ pid:       1191, delta2: ~ 500060-500069 } hitcount:          2
{ pid:       1190, delta2: ~ 500060-500069 } hitcount:          7
{ pid:       1198, delta2: ~ 500060-500069 } hitcount:          9
{ pid:       1193, delta2: ~ 500060-500069 } hitcount:          6
{ pid:       1196, delta2: ~ 500060-500069 } hitcount:          5
{ pid:       1192, delta2: ~ 500060-500069 } hitcount:          9
{ pid:       1197, delta2: ~ 500060-500069 } hitcount:          9
{ pid:       1195, delta2: ~ 500060-500069 } hitcount:          6
{ pid:       1189, delta2: ~ 500060-500069 } hitcount:          6
{ pid:       1198, delta2: ~ 500070-500079 } hitcount:          1
{ pid:       1192, delta2: ~ 500070-500079 } hitcount:          2
{ pid:       1193, delta2: ~ 500070-500079 } hitcount:          3
{ pid:       1194, delta2: ~ 500070-500079 } hitcount:          2
{ pid:       1191, delta2: ~ 500070-500079 } hitcount:          3
{ pid:       1190, delta2: ~ 500070-500079 } hitcount:          1
{ pid:       1196, delta2: ~ 500070-500079 } hitcount:          1
{ pid:       1193, delta2: ~ 500080-500089 } hitcount:          1
{ pid:       1192, delta2: ~ 500080-500089 } hitcount:          1
{ pid:       1196, delta2: ~ 500080-500089 } hitcount:          2
{ pid:       1194, delta2: ~ 500090-500099 } hitcount:          1
{ pid:       1197, delta2: ~ 500090-500099 } hitcount:          1
{ pid:       1193, delta2: ~ 500090-500099 } hitcount:          1
{ pid:         61, delta2: ~ 503910-503919 } hitcount:          1
{ pid:         61, delta2: ~ 503920-503929 } hitcount:          1
{ pid:         61, delta2: ~ 503930-503939 } hitcount:          1
{ pid:         61, delta2: ~ 503960-503969 } hitcount:         15
{ pid:         61, delta2: ~ 503970-503979 } hitcount:         18
{ pid:         61, delta2: ~ 503980-503989 } hitcount:         20
{ pid:         61, delta2: ~ 504010-504019 } hitcount:          2
{ pid:         61, delta2: ~ 504020-504029 } hitcount:          1
{ pid:         61, delta2: ~ 504030-504039 } hitcount:          2
{ pid:         58, delta2: ~ 43409960-43409969 } hitcount:          1

Totals:
    Hits: 718
    Entries: 54
    Dropped: 0

The above is useless without the following:

# cat /sys/kernel/tracing/events/synthetic/wait_sched/hist 
# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:__arg_1618_2=pid,__arg_1618_3=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if delta < 10 [active]
#

{ pid:        612 } hitcount:          1
{ pid:        889 } hitcount:          2
{ pid:       1389 } hitcount:          3
{ pid:         58 } hitcount:          3
{ pid:       2096 } hitcount:          5
{ pid:         61 } hitcount:        145
{ pid:       1196 } hitcount:        151
{ pid:       1190 } hitcount:        151
{ pid:       1198 } hitcount:        153
{ pid:       1197 } hitcount:        153
{ pid:       1195 } hitcount:        153
{ pid:       1194 } hitcount:        153
{ pid:       1191 } hitcount:        153
{ pid:       1192 } hitcount:        153
{ pid:       1189 } hitcount:        153
{ pid:       1193 } hitcount:        153

Totals:
    Hits: 1685
    Entries: 16
    Dropped: 0


# event histogram
#
# trigger info: hist:keys=pid,stacktrace:vals=hitcount:sort=hitcount:size=2048 if delta < 10 [active]
#

{ pid:       1389, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         pipe_read+0x318/0x420
         new_sync_read+0x18b/0x1a0
         vfs_read+0xf5/0x190
         ksys_read+0xab/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:          3
{ pid:       1189, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:         61, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         schedule_timeout+0x88/0x160
         kcompactd+0x364/0x3f0
         kthread+0x141/0x170
         ret_from_fork+0x22/0x30
} hitcount:         28
{ pid:       1194, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1197, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1198, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1191, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1196, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1192, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1195, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1190, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28
{ pid:       1193, stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule+0x72/0x110
         read_events+0x119/0x190
         do_io_getevents+0x72/0xe0
         __x64_sys_io_getevents+0x59/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount:         28

Totals:
    Hits: 311
    Entries: 12
    Dropped: 0

Now we just need a tool to map the pids of the delta histogram to the pids
of the stack traces to figure out where the issues may happen.

The above is just to show that there's a lot of infrastructure already
there that does a lot of this work, but needs improvement. My theme to this
email is to modify what's there to make it work for you before just doing
everything from scratch, and then we have a bunch of stuff that only does
what we want, but is not flexible to do what others may want.

> 
> 
> > Not the prettiest thing to read. But hey, we got the full stack of where
> > these latencies happened!  
> 
> Most of the time I _don't_ want full stacktraces, though!

We could easily add a feature to limit how much you want to trace. Perhaps even
a skip level. That is, add skip and depth options to the stacktrace field.

> 
> That means I have a ton more output to sort through, and the data is far more
> expensive to collect.

That's what user space tools are for ;-)

> 
> I don't know why it's what people go to first - see the page_owner stuff - but
> that doesn't get used much either because the output is _really hard to sort
> through_.
> 
> Most of the time, just a single file and line number is all you want - and
> tracing has always made it hard to get at that.

Because we would need to store too much dwarf information in the kernel to
do so. But user space could do this for you with the function/offset
information.

> 
> 
> > Yes, it adds some overhead when the events are triggered due to the
> > stacktrace code, but it's extremely useful information.
> >   
> > > 
> > > So, it looks like tracing has made some progress over the past 10
> > > years, but for debugging latency issues it's still not there yet in
> > > general. I  
> > 
> > I call BS on that statement. Just because you do not know what has been
> > added to the kernel in the last 10 years (like you had no idea about
> > seq_buf and that was added in 2014) means to me that you are totally
> > clueless on what tracing can and can not do.
> > 
> > It appears to me that you are too focused on inventing your own wheel
> > that does exactly what you want before looking to see how things are
> > today. Just because something didn't fit your needs 10 years ago
> > doesn't mean that it can't fit your needs today.  
> 
> ...And the ad hominem attacks start.

Look, you keep making comments about the tracing infrastructure that you
clearly do not understand. And that is pretty insulting. Sorry, I'm not
sure you realize this, but those comments do turn people off and their
responses will start to become stronger.

> 
> Steve, I'm not attacking you, and there's room enough in this world for
> the both of us to be doing our thing creating new and useful tools.

You seem to push back hard when people suggest improving other utilities
to suite your needs.

> 
> > I'm already getting complaints from customers/users that are saying
> > there's too many tools in the toolbox already. (Do we use
> > ftrace/perf/bpf?). The idea is to have the tools using mostly the same
> > infrastructure, and not be 100% off on its own, unless there's a clear
> > reason to invent a new wheel that several people are asking for, not
> > just one or two.  
> 
> I would like to see more focus on usability.

Then lets make the current tools more usable. For example, the synthetic
event kernel interface is horrible. It's an awesome feature that wasn't
getting used due to the interface. This is why I created "sqlhist". It's
now really easy to create synthetic events with that tool. I agree, focus
on usability, but that doesn't always mean to create yet another tool. This
reminds me of:

   https://xkcd.com/927/


> 
> That means, in a best case scenario, always-on data collection that I can
> just look at, and it'll already be in the format most likely to be useful.
> 
> Surely you can appreciate the usefulness of that..?

I find "runtime turn on and off" better than "always on". We have
static_branches today (aka jump labels). I would strongly suggest using
them. You get them automatically from tracepoints . Even sched_stats are
using these.

> 
> Tracing started out as a tool for efficiently getting lots of data out of
> the kernel, and it's great for that. But I think your focus on the cool
> thing you built may be blinding you a bit to alternative approaches...

I actually work hard to have the tracing infrastructure help out other
approaches. perf and bpf use the ftrace infrastructure because it is
designed to be modular. Nothing is "must be the ftrace way". I'm not against
the new features you are adding, I just want you to make a little more
effort in incorporating other infrastructures (and perhaps even improving
that infrastructure) to suite your needs.

If ftrace, perf, bpf can't do what you want, take a harder look to see if
you can modify them to do so.

-- Steve

Kent Overstreet Sept. 2, 2022, 1:35 a.m. UTC | #11

On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> If ftrace, perf, bpf can't do what you want, take a harder look to see if
> you can modify them to do so.

Maybe we can use this exchange to make both of our tools better. I like your
histograms - the quantiles algorithm I've had for years is janky, I've been
meaning to rip that out, I'd love to take a look at your code for that. And
having an on/off switch is a good idea, I'll try to add that at some point.
Maybe you got some ideas from my stuff too.

I'd love to get better tracepoints for measuring latency - what I added to
init_wait() and finish_wait() was really only a starting point. Figuring out
the right places to measure is where I'd like to be investing my time in this
area, and there's no reason we couldn't both be making use of that.

e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
all code uses that, init_wait() got me better coverage. But I've already seen
that that misses things, too, there's more work to be done.

random thought: might try adding a warning in schedule() any time it's called
and codetag_time_stats_start() hasn't been called, that'll be a starting
point...

Steven Rostedt Sept. 2, 2022, 1:59 a.m. UTC | #12

On Thu, 1 Sep 2022 21:35:32 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> > If ftrace, perf, bpf can't do what you want, take a harder look to see if
> > you can modify them to do so.  
> 
> Maybe we can use this exchange to make both of our tools better. I like your
> histograms - the quantiles algorithm I've had for years is janky, I've been
> meaning to rip that out, I'd love to take a look at your code for that. And
> having an on/off switch is a good idea, I'll try to add that at some point.
> Maybe you got some ideas from my stuff too.
> 
> I'd love to get better tracepoints for measuring latency - what I added to
> init_wait() and finish_wait() was really only a starting point. Figuring out
> the right places to measure is where I'd like to be investing my time in this
> area, and there's no reason we couldn't both be making use of that.

Yes, this is exactly what I'm talking about. I'm not against your work, I
just want you to work more with everyone to come up with ideas that can
help everyone as a whole. That's how "open source communities" is suppose
to work ;-)

The histogram and synthetic events can use some more clean ups. There's a
lot of places that can be improved in that code. But I feel the ideas
behind that code is sound. It's just getting the implementation to be a bit
more efficient.

> 
> e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
> all code uses that, init_wait() got me better coverage. But I've already seen
> that that misses things, too, there's more work to be done.

I picked prepare_to_wait() just because I was hacking up something quick
and thought that was "close enough" ;-)

-- Steve

[RFC,27/30] Code tagging based latency tracking

Commit Message

Comments

Patch