[RFC,0/7] metricfs metric file system and examples

Message ID 20200807212916.2883031-1-jwadams@google.com (mailing list archive)

Message

Jonathan Adams Aug. 7, 2020, 9:29 p.m. UTC
[resending to widen the CC lists per rdunlap@infradead.org's suggestion;
original posting to lkml here: https://lkml.org/lkml/2020/8/5/1009]

To try to restart the discussion of kernel statistics started by the
statsfs patchsets (https://lkml.org/lkml/2020/5/26/332), I wanted
to share the following set of patches which are Google's 'metricfs'
implementation and some example uses.  Google has been using metricfs
internally since 2012 as a way to export various statistics to our
telemetry systems (similar to OpenTelemetry), and we have over 200
statistics exported on a typical machine.

These patches have been cleaned up and modernized vs. the versions
in production; I've included notes under the fold in the patches.
They're based on v5.8-rc6.

The statistics live under debugfs, in a tree rooted at:

	/sys/kernel/debug/metricfs

Each metric is a directory with four files in it.  For example, the
'core/metricfs: Create metricfs, standardized files under debugfs.' patch
includes a simple 'metricfs_presence' metric, whose files look like:

/sys/kernel/debug/metricfs:
 metricfs_presence/annotations
  DESCRIPTION A\ basic\ presence\ metric.
 metricfs_presence/fields
  value
  int
 metricfs_presence/values
  1
 metricfs_presence/version
  1

(The "version" file always contains '1' and is largely vestigial.)
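
A minimal userspace sketch of decoding an annotations value such as the
DESCRIPTION above.  It assumes, based only on the sample output, that a
backslash escapes the character that follows it so that spaces can
appear inside a value.  The helper name annotation_unescape() is
hypothetical and is not part of the patches.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch series.
 * Copies src to dst, dropping each backslash and keeping the character
 * that follows it (assumed escaping rule, inferred from the sample).
 * Returns the decoded length, or -1 if dst is too small.
 */
static int annotation_unescape(const char *src, char *dst, size_t dstlen)
{
	size_t n = 0;

	while (*src) {
		if (*src == '\\' && src[1] != '\0')
			src++;		/* skip the backslash, keep next char */
		if (n + 1 >= dstlen)
			return -1;	/* no room for char plus terminator */
		dst[n++] = *src++;
	}
	if (dstlen == 0)
		return -1;
	dst[n] = '\0';
	return (int)n;
}
```

Fed the escaped text "A\ basic\ presence\ metric.", this would yield
"A basic presence metric.".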

An example of a more complicated stat is the networking stats.
For example, the tx_bytes stat looks like:

net/dev/stats/tx_bytes/annotations
  DESCRIPTION net\ device\ transmited\ bytes\ count
  CUMULATIVE
net/dev/stats/tx_bytes/fields
  interface value
  str int
net/dev/stats/tx_bytes/values
  lo 4394430608
  eth0 33353183843
  eth1 16228847091
net/dev/stats/tx_bytes/version
  1
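
As a rough illustration of how a userspace reader might consume one
line of such a "values" file, here is a sketch for the "str int" case.
It assumes the two fields are separated by whitespace, that the key
contains no whitespace and is shorter than 64 bytes, and that the value
fits in an unsigned 64-bit integer.  parse_str_int() is a hypothetical
name, not something the patches provide.

```c
#include <stdio.h>
#include <inttypes.h>

/*
 * Hypothetical helper: parse one "<key> <value>" line of a values file
 * whose fields file declares "str int".
 * Returns 0 on success, -1 if the line does not have both fields.
 */
static int parse_str_int(const char *line, char key[64], uint64_t *value)
{
	if (sscanf(line, " %63s %" SCNu64, key, value) != 2)
		return -1;
	return 0;
}
```

A reader would normally check the "fields" file first and only use a
parser like this when it reports exactly the "str int" layout.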

The per-cpu statistics show up in the scheduler stat info and x86
IRQ counts.  For example:

stat/user/annotations
  DESCRIPTION time\ in\ user\ mode\ (nsec)
  CUMULATIVE
stat/user/fields
  cpu value
  int int
stat/user/values
  0 1183486517734
  1 1038284237228
  ...
stat/user/version
  1
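
A consumer that wants a machine-wide total has to add up the per-cpu
rows itself.  The sketch below does that for the text of a per-cpu
"values" file, assuming well-formed "<cpu> <value>" lines and values
that fit in 64 bits.  sum_percpu() is a hypothetical helper name.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: given the text of a per-cpu values file
 * ("<cpu> <value>" per line), add up the values.
 * Returns the number of lines parsed; stops at the first line that
 * does not start with a number.
 */
static int sum_percpu(const char *text, uint64_t *total)
{
	int ncpu = 0;

	*total = 0;
	while (*text) {
		char *end;
		uint64_t v;

		(void)strtoul(text, &end, 10);	/* cpu index, unused here */
		if (end == text)
			break;
		text = end;
		v = strtoull(text, &end, 10);
		if (end == text)
			break;
		*total += v;
		ncpu++;
		text = end + strspn(end, " \t\n");	/* move to next line */
	}
	return ncpu;
}
```

For the two rows shown above, the total would be 2221770754962 nsec.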

The full set of example metrics I've included is:

core/metricfs: Create metricfs, standardized files under debugfs.
  metricfs_presence
core/metricfs: metric for kernel warnings
  warnings/values
core/metricfs: expose scheduler stat information through metricfs
  stat/*
net-metricfs: Export /proc/net/dev via metricfs.
  net/dev/stats/[tr]x_*
core/metricfs: expose x86-specific irq information through metricfs
  irq_x86/*

The general approach is called out in kernel/metricfs.c:

The kernel provides:
  - A description of the metric
  - The subsystem for the metric (NULL is ok)
  - Type information about the metric, and
  - A callback function which supplies metric values.

Limitations:
  - "values" files are at MOST 64K. We truncate the file at that point.
  - The list of fields and types is at most 1K.
  - Metrics may have at most 2 fields.
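
To make the 64K limit concrete, here is a small userspace model of a
bounded emit buffer.  It is not the code in kernel/metricfs.c; the
names (struct model_emitter, model_emit()) are hypothetical, and the
policy it shows (a record that does not fit is dropped whole, and every
later emit is a no-op) is an assumption about one reasonable way to
truncate.

```c
#include <stdio.h>
#include <stdarg.h>
#include <stddef.h>

#define MODEL_BUF_SIZE (64 * 1024)	/* the 64K limit quoted above */

/* Userspace model only; not the structure used by the patches. */
struct model_emitter {
	char buf[MODEL_BUF_SIZE];
	size_t used;
	int full;	/* set once a record did not fit */
};

/* Append one formatted record; once the buffer is full, do nothing. */
static void model_emit(struct model_emitter *e, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (e->full)
		return;
	va_start(ap, fmt);
	n = vsnprintf(e->buf + e->used, sizeof(e->buf) - e->used, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= sizeof(e->buf) - e->used) {
		e->buf[e->used] = '\0';	/* drop the partial record */
		e->full = 1;
		return;
	}
	e->used += (size_t)n;
}
```

With a policy like this, rows emitted late are the ones a reader loses,
so the order in which a callback emits its rows matters.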

Best Practices:
  - Emit the most important data first! Once the 64K per-metric buffer
    is full, the emit* functions won't do anything.
  - In userspace, open(), read(), and close() the file quickly! The kernel
    allocation for the metric is alive as long as the file is open. This
    permits users to seek around the contents of the file, while
    permitting an atomic view of the data.
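
The open/read/close advice for userspace can be sketched as below.
This uses only standard POSIX calls; read_metric_once() is a
hypothetical helper name, and for brevity it does not retry on EINTR.
A caller would typically pass a buffer of 64K plus one byte, to match
the limit above.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Hypothetical helper: open a metric file, read up to len - 1 bytes,
 * and close it right away so the kernel-side buffer is released
 * promptly.  Returns the number of bytes read (buf is NUL-terminated),
 * or -1 on error.
 */
static ssize_t read_metric_once(const char *path, char *buf, size_t len)
{
	size_t used = 0;
	ssize_t n = 0;
	int fd;

	if (len == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while (used < len - 1 &&
	       (n = read(fd, buf + used, len - 1 - used)) > 0)
		used += (size_t)n;
	close(fd);	/* close before doing any parsing */
	if (n < 0)
		return -1;
	buf[used] = '\0';
	return (ssize_t)used;
}
```

Parsing then happens on the private copy in buf, after the file has
already been closed.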

Note that since the callbacks are called and the data is generated at
file open() time, the relative consistency is only between members of
a given metric; the rx_bytes stat for every network interface will
be read at almost the same time, but if you want to get rx_bytes
and rx_packets, there could be a bunch of skew between the two file
opens.  (So this doesn't entirely address Andrew Lunn's comments in
https://lkml.org/lkml/2020/5/26/490)

This also doesn't address one of the basic parts of the statsfs work:
moving the statistics out of debugfs to avoid lockdown interactions.

Google has found a lot of value in having a generic interface for adding
these kinds of statistics with reasonably low overhead (reading them
is O(number of statistics), not number of objects in each statistic).
There are definitely warts in the interface, but does the basic approach
make sense to folks?

Thanks,
- Jonathan

Jonathan Adams (5):
  core/metricfs: add support for percpu metricfs files
  core/metricfs: metric for kernel warnings
  core/metricfs: expose softirq information through metricfs
  core/metricfs: expose scheduler stat information through metricfs
  core/metricfs: expose x86-specific irq information through metricfs

Justin TerAvest (1):
  core/metricfs: Create metricfs, standardized files under debugfs.

Laurent Chavey (1):
  net-metricfs: Export /proc/net/dev via metricfs.

 arch/x86/kernel/irq.c      |  80 ++++
 fs/proc/stat.c             |  57 +++
 include/linux/metricfs.h   | 131 +++++++
 kernel/Makefile            |   2 +
 kernel/metricfs.c          | 775 +++++++++++++++++++++++++++++++++++++
 kernel/metricfs_examples.c | 151 ++++++++
 kernel/panic.c             | 131 +++++++
 kernel/softirq.c           |  45 +++
 lib/Kconfig.debug          |  18 +
 net/core/Makefile          |   1 +
 net/core/net_metricfs.c    | 194 ++++++++++
 11 files changed, 1585 insertions(+)
 create mode 100644 include/linux/metricfs.h
 create mode 100644 kernel/metricfs.c
 create mode 100644 kernel/metricfs_examples.c
 create mode 100644 net/core/net_metricfs.c

Comments

Andrew Lunn Aug. 8, 2020, 2:06 a.m. UTC | #1
> net/dev/stats/tx_bytes/annotations
>   DESCRIPTION net\ device\ transmited\ bytes\ count
>   CUMULATIVE
> net/dev/stats/tx_bytes/fields
>   interface value
>   str int
> net/dev/stats/tx_bytes/values
>   lo 4394430608
>   eth0 33353183843
>   eth1 16228847091

This is a rather small system. Have you tested it at scale? Consider an
Ethernet switch with 64 physical interfaces and, say, 32 VLAN
interfaces stacked on top of each. So this one file will contain 2048
entries?

And generally, you are not interested in one statistic, but many
statistics. So you will need to cat each file, not just one file. And
the way this is implemented:

+static void dev_stats_emit(struct metric_emitter *e,
+                          struct net_device *dev,
+                          struct metric_def *metricd)
+{
+       struct rtnl_link_stats64 temp;
+       const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
+
+       if (stats) {
+               __u8 *ptr = (((__u8 *)stats) + metricd->off);
+
+               METRIC_EMIT_INT(e, *(__u64 *)ptr, dev->name, NULL);
+       }
+}

means you are going to be calling dev_get_stats() for each file, and
there are 23 files if i counted correctly. So dev_get_stats() will be
called 47104 times, in this made up example. And this is not always
cheap, these counts can be atomic.
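
A standalone userspace sketch of the two points above: picking one
counter out of a stats structure by byte offset, as the quoted
dev_stats_emit() does with metricd->off, and the resulting number of
stats fetches per sweep.  struct model_stats is a two-field stand-in
for struct rtnl_link_stats64, and all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rtnl_link_stats64; only two counters shown. */
struct model_stats {
	uint64_t rx_bytes;
	uint64_t tx_bytes;
};

/* Pick one 64-bit counter out of the struct by its byte offset. */
static uint64_t model_stat_at(const struct model_stats *s, size_t off)
{
	return *(const uint64_t *)((const uint8_t *)s + off);
}

/*
 * Number of stats fetches for one sweep over all files when every
 * file fetches the full stats once per device.
 */
static unsigned long fetches_per_sweep(unsigned long ndev,
				       unsigned long nfiles)
{
	return ndev * nfiles;
}
```

With 64 * 32 = 2048 devices and 23 files, fetches_per_sweep() gives
47104, whereas fetching once per device and emitting every counter
from that one copy would need 2048 fetches.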

So i personally don't think netdev statistics is a good idea, i doubt
it scales.

I also think you are looking at the wrong set of netdev counters. I
would be more interested in ethtool -S counters. But it appears you
make the assumption that each object you are collecting metrics for
has the same set of counters. This is untrue for network interfaces,
where each driver can export whatever counters it wants, and they can
be dynamic.

	Andrew

David Ahern Aug. 8, 2020, 3:59 p.m. UTC | #2
On 8/7/20 8:06 PM, Andrew Lunn wrote:
> So i personally don't think netdev statistics is a good idea, i doubt
> it scales.

+1

Pavel Machek Aug. 10, 2020, 9:23 a.m. UTC | #3
On Fri 2020-08-07 14:29:09, Jonathan Adams wrote:
> [resending to widen the CC lists per rdunlap@infradead.org's suggestion;
> original posting to lkml here: https://lkml.org/lkml/2020/8/5/1009]
> 
> To try to restart the discussion of kernel statistics started by the
> statsfs patchsets (https://lkml.org/lkml/2020/5/26/332), I wanted
> to share the following set of patches which are Google's 'metricfs'
> implementation and some example uses.  Google has been using metricfs
> internally since 2012 as a way to export various statistics to our
> telemetry systems (similar to OpenTelemetry), and we have over 200
> statistics exported on a typical machine.
> 
> These patches have been cleaned up and modernized vs. the versions
> in production; I've included notes under the fold in the patches.
> They're based on v5.8-rc6.
> 
> The statistics live under debugfs, in a tree rooted at:
> 
> 	/sys/kernel/debug/metricfs

Is debugfs the right place for this? It looks like something where people
would expect compatibility guarantees...

								Pavel

Jakub Kicinski Aug. 10, 2020, 6:20 p.m. UTC | #4
On Sat, 8 Aug 2020 09:59:34 -0600 David Ahern wrote:
> On 8/7/20 8:06 PM, Andrew Lunn wrote:
> > So i personally don't think netdev statistics is a good idea, i doubt
> > it scales.  
> 
> +1

+1

Please stop using networking as the example for this.

We don't want file interfaces for stats, and we already made that very
clear last time.