Message ID: 20200504110344.17560-1-eesposit@redhat.com
Series: Statsfs: a new ram-based file system for Linux kernel statistics
On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote: > There is currently no common way, shared throughout the Linux kernel, for > subsystems to expose statistics to userspace; subsystems > have to take care of gathering and displaying statistics by themselves, > for example in the form of files in debugfs. For example KVM has its own > code section that takes care of this in virt/kvm/kvm_main.c, where it sets > up debugfs handlers for displaying values and aggregating them from > various subfolders to obtain information about the system state (e.g. > displaying the total number of exits, calculated by summing all exits of > all CPUs of all running virtual machines). > > Allowing each section of the kernel to do so has two disadvantages. First, > it will introduce redundant code. Second, debugfs is anyway not the right > place for statistics (for example it is affected by lockdown). > > In this patch series I introduce statsfs, a synthetic ram-based virtual > filesystem that takes care of gathering and displaying statistics for the > Linux kernel subsystems. > This is exciting; we have been looking in the same area recently. Adding Jonathan Adams <jwadams@google.com>. In your diffstat, one thing I notice that is omitted: an update to Documentation/* :) Any chance of getting some proposed Documentation/ updates with the structure of the fs, the per-subsystem breakdown, and best practices for managing the stats from the kernel level? > The file system is mounted on /sys/kernel/stats and would already be used > by KVM. Statsfs was initially introduced by Paolo Bonzini [1]. > > Statsfs offers a generic and stable API, allowing any kind of > directory/file organization and supporting multiple kinds of aggregation > (not only sum, but also average, max, min and count_zero) and data types > (all unsigned and signed types plus boolean). The implementation, which is > a generalization of KVM’s debugfs statistics code, takes care of gathering > and displaying information at run time; users only need to specify the > values to be included in each source. > > Statsfs would also be a different mountpoint from debugfs, and would not > suffer from limited access due to the security lockdown patches. Its main > function is to display each statistic as a file in the desired folder > hierarchy defined through the API. Statsfs files can be read, and possibly > cleared if their file mode allows it. > > Statsfs has two main components: the public API defined by > include/linux/statsfs.h, and the virtual file system which should end up > in /sys/kernel/stats. > > The API has two main elements, values and sources. Kernel subsystems like > KVM can use the API to create a source, add child > sources/values/aggregates and register it to the root source (which on the > virtual fs would be /sys/kernel/stats). > > Sources are created via statsfs_source_create(), and each source becomes a > directory in the file system. Sources form a parent-child relationship; > root sources are added to the file system via statsfs_source_register(). > Every other source is added to or removed from a parent through the > statsfs_source_add_subordinate and statsfs_source_remove_subordinate APIs. > Once a source is created and added to the tree (via add_subordinate), it > will be used to compute aggregate values in the parent source. > > Values represent quantities that are gathered by the statsfs user. 
Examples > of values include the number of VM exits of a given kind, the amount of > memory used by some data structure, the length of the longest hash table > chain, or anything like that. Values are defined with the > statsfs_source_add_values function. Each value is defined by a struct > statsfs_value; the same statsfs_value can be added to many different > sources. A value can be considered "simple" if it fetches data from a > user-provided location, or "aggregate" if it groups all values in the > subordinate sources that include the same statsfs_value. > This seems like it could have a lot of overhead if we wanted to periodically track the totality of subsystem stats as a form of telemetry gathering from userspace. To collect telemetry for 1,000 different stats, do we need to issue lseek()+read() syscalls for each of them individually (or, worse, open()+read()+close())? Any thoughts on how that can be optimized? A couple of ideas: - an interface that allows gathering of all stats for a particular interface through a single file that would likely be encoded in binary and the responsibility of userspace to disseminate, or - an interface that extends beyond this proposal and allows the reader to specify which stats they are interested in collecting and then the kernel will only provide these stats in a well-formed structure and also be binary encoded. We've found that the one-file-per-stat method is pretty much a showstopper from the performance point of view, and we must always execute at least two syscalls to obtain a single stat. Since this is becoming a generic API (good!!), maybe we can discuss possible ways to optimize gathering of stats en masse? > For more information, please consult the kerneldoc documentation in patch > 2 and the sample uses in the kunit tests and in KVM. > > This series of patches is based on my previous series "libfs: group and > simplify linux fs code" and the single patch sent to kvm "kvm_host: unify > VM_STAT and VCPU_STAT definitions in a single place". The former > simplifies code duplicated in debugfs and tracefs (on which statsfs is > based), the latter groups all macro definitions for statistics in KVM > in a single common file shared by all architectures. > > Patch 1 adds new refcount and kref destructor wrappers that take a > semaphore, as those are used later by statsfs. Patch 2 introduces the > statsfs API, patch 3 provides extensive tests that can also be used as > examples of how to use the API, and patch 4 adds the file system support. > Finally, patch 5 provides a real-life example of statsfs usage in KVM. 
> > [1] https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M > > Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> > > v1->v2 remove unnecessary list_foreach_safe loops, fix wrong indentation, > change statsfs in stats_fs > > Emanuele Giuseppe Esposito (5): > refcount, kref: add dec-and-test wrappers for rw_semaphores > stats_fs API: create, add and remove stats_fs sources and values > kunit: tests for stats_fs API > stats_fs fs: virtual fs to show stats to the end-user > kvm_main: replace debugfs with stats_fs > > MAINTAINERS | 7 + > arch/arm64/kvm/Kconfig | 1 + > arch/arm64/kvm/guest.c | 2 +- > arch/mips/kvm/Kconfig | 1 + > arch/mips/kvm/mips.c | 2 +- > arch/powerpc/kvm/Kconfig | 1 + > arch/powerpc/kvm/book3s.c | 6 +- > arch/powerpc/kvm/booke.c | 8 +- > arch/s390/kvm/Kconfig | 1 + > arch/s390/kvm/kvm-s390.c | 16 +- > arch/x86/include/asm/kvm_host.h | 2 +- > arch/x86/kvm/Kconfig | 1 + > arch/x86/kvm/Makefile | 2 +- > arch/x86/kvm/debugfs.c | 64 -- > arch/x86/kvm/stats_fs.c | 56 ++ > arch/x86/kvm/x86.c | 6 +- > fs/Kconfig | 12 + > fs/Makefile | 1 + > fs/stats_fs/Makefile | 6 + > fs/stats_fs/inode.c | 337 ++++++++++ > fs/stats_fs/internal.h | 35 + > fs/stats_fs/stats_fs-tests.c | 1088 +++++++++++++++++++++++++++++++ > fs/stats_fs/stats_fs.c | 773 ++++++++++++++++++++++ > include/linux/kref.h | 11 + > include/linux/kvm_host.h | 39 +- > include/linux/refcount.h | 2 + > include/linux/stats_fs.h | 304 +++++++++ > include/uapi/linux/magic.h | 1 + > lib/refcount.c | 32 + > tools/lib/api/fs/fs.c | 21 + > virt/kvm/arm/arm.c | 2 +- > virt/kvm/kvm_main.c | 314 ++------- > 32 files changed, 2772 insertions(+), 382 deletions(-) > delete mode 100644 arch/x86/kvm/debugfs.c > create mode 100644 arch/x86/kvm/stats_fs.c > create mode 100644 fs/stats_fs/Makefile > create mode 100644 fs/stats_fs/inode.c > create mode 100644 fs/stats_fs/internal.h > create mode 100644 fs/stats_fs/stats_fs-tests.c > create mode 100644 fs/stats_fs/stats_fs.c > create mode 100644 include/linux/stats_fs.h > > -- > 2.25.2 > >
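For illustration, here is a minimal sketch of how a subsystem might use the API described in the cover letter above. The function names (statsfs_source_create, statsfs_source_add_values, statsfs_source_register) come from the text, but the exact signatures, the statsfs_value fields and the error handling are assumptions, not the posted API:

/* Sketch only: signatures and struct statsfs_value fields are guesses
 * based on the cover letter, not taken from the actual patches. */
#include <linux/statsfs.h>

struct foo_stats {
	u64 nr_requests;
	u64 nr_errors;
};

static struct statsfs_value foo_values[] = {
	{ .name = "requests", .offset = offsetof(struct foo_stats, nr_requests) },
	{ .name = "errors",   .offset = offsetof(struct foo_stats, nr_errors) },
	{ }	/* sentinel */
};

static int foo_stats_init(struct foo_stats *stats)
{
	struct statsfs_source *src;

	/* Becomes the directory /sys/kernel/stats/foo */
	src = statsfs_source_create("foo");
	if (!src)
		return -ENOMEM;

	/* Expose simple values backed by *stats; the same statsfs_value
	 * array could drive aggregation in a parent source. */
	statsfs_source_add_values(src, foo_values, stats);

	/* Attach the source to the filesystem root. */
	statsfs_source_register(src);
	return 0;
}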
On 5/4/20 11:37 PM, David Rientjes wrote: > On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote: > >> >> In this patch series I introduce statsfs, a synthetic ram-based virtual >> filesystem that takes care of gathering and displaying statistics for the >> Linux kernel subsystems. >> > > This is exciting, we have been looking in the same area recently. Adding > Jonathan Adams <jwadams@google.com>. > > In your diffstat, one thing I notice that is omitted: an update to > Documentation/* :) Any chance of getting some proposed Documentation/ > updates with structure of the fs, the per subsystem breakdown, and best > practices for managing the stats from the kernel level? Yes, I will write some documentation. Thank you for the suggestion. >> >> Values represent quantites that are gathered by the statsfs user. Examples >> of values include the number of vm exits of a given kind, the amount of >> memory used by some data structure, the length of the longest hash table >> chain, or anything like that. Values are defined with the >> statsfs_source_add_values function. Each value is defined by a struct >> statsfs_value; the same statsfs_value can be added to many different >> sources. A value can be considered "simple" if it fetches data from a >> user-provided location, or "aggregate" if it groups all values in the >> subordinates sources that include the same statsfs_value. >> > > This seems like it could have a lot of overhead if we wanted to > periodically track the totality of subsystem stats as a form of telemetry > gathering from userspace. To collect telemetry for 1,000 different stats, > do we need to issue lseek()+read() syscalls for each of them individually > (or, worse, open()+read()+close())? > > Any thoughts on how that can be optimized? A couple of ideas: > > - an interface that allows gathering of all stats for a particular > interface through a single file that would likely be encoded in binary > and the responsibility of userspace to disseminate, or > > - an interface that extends beyond this proposal and allows the reader to > specify which stats they are interested in collecting and then the > kernel will only provide these stats in a well formed structure and > also be binary encoded. Are you thinking of another file, containing all the stats for the directory in binary format? > We've found that the one-file-per-stat method is pretty much a show > stopper from the performance view and we always must execute at least two > syscalls to obtain a single stat. > > Since this is becoming a generic API (good!!), maybe we can discuss > possible ways to optimize gathering of stats in mass? Sure, the idea of a binary format was considered from the beginning in [1], and it can be done either together with the current filesystem, or as a replacement via different mount options. 
Thank you, Emanuele >> [1] https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M >> >> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> >> >> v1->v2 remove unnecessary list_foreach_safe loops, fix wrong indentation, >> change statsfs in stats_fs >> >> Emanuele Giuseppe Esposito (5): >> refcount, kref: add dec-and-test wrappers for rw_semaphores >> stats_fs API: create, add and remove stats_fs sources and values >> kunit: tests for stats_fs API >> stats_fs fs: virtual fs to show stats to the end-user >> kvm_main: replace debugfs with stats_fs >> >> MAINTAINERS | 7 + >> arch/arm64/kvm/Kconfig | 1 + >> arch/arm64/kvm/guest.c | 2 +- >> arch/mips/kvm/Kconfig | 1 + >> arch/mips/kvm/mips.c | 2 +- >> arch/powerpc/kvm/Kconfig | 1 + >> arch/powerpc/kvm/book3s.c | 6 +- >> arch/powerpc/kvm/booke.c | 8 +- >> arch/s390/kvm/Kconfig | 1 + >> arch/s390/kvm/kvm-s390.c | 16 +- >> arch/x86/include/asm/kvm_host.h | 2 +- >> arch/x86/kvm/Kconfig | 1 + >> arch/x86/kvm/Makefile | 2 +- >> arch/x86/kvm/debugfs.c | 64 -- >> arch/x86/kvm/stats_fs.c | 56 ++ >> arch/x86/kvm/x86.c | 6 +- >> fs/Kconfig | 12 + >> fs/Makefile | 1 + >> fs/stats_fs/Makefile | 6 + >> fs/stats_fs/inode.c | 337 ++++++++++ >> fs/stats_fs/internal.h | 35 + >> fs/stats_fs/stats_fs-tests.c | 1088 +++++++++++++++++++++++++++++++ >> fs/stats_fs/stats_fs.c | 773 ++++++++++++++++++++++ >> include/linux/kref.h | 11 + >> include/linux/kvm_host.h | 39 +- >> include/linux/refcount.h | 2 + >> include/linux/stats_fs.h | 304 +++++++++ >> include/uapi/linux/magic.h | 1 + >> lib/refcount.c | 32 + >> tools/lib/api/fs/fs.c | 21 + >> virt/kvm/arm/arm.c | 2 +- >> virt/kvm/kvm_main.c | 314 ++------- >> 32 files changed, 2772 insertions(+), 382 deletions(-) >> delete mode 100644 arch/x86/kvm/debugfs.c >> create mode 100644 arch/x86/kvm/stats_fs.c >> create mode 100644 fs/stats_fs/Makefile >> create mode 100644 fs/stats_fs/inode.c >> create mode 100644 fs/stats_fs/internal.h >> create mode 100644 fs/stats_fs/stats_fs-tests.c >> create mode 100644 fs/stats_fs/stats_fs.c >> create mode 100644 include/linux/stats_fs.h >> >> -- >> 2.25.2 >>
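To make the per-stat cost being discussed concrete, sampling one ASCII stat file per value from userspace looks roughly like this (plain POSIX calls; the path below is made up). Every sampled counter pays at least open()+read()+close(), or one pread() per value if descriptors are kept open:

/* Per-stat sampling cost, illustrative userspace code. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_stat(const char *path)
{
	char buf[32];
	ssize_t n;
	int fd = open(path, O_RDONLY);		/* syscall 1 */

	if (fd < 0)
		return 0;
	n = read(fd, buf, sizeof(buf) - 1);	/* syscall 2 */
	close(fd);				/* syscall 3 */
	if (n <= 0)
		return 0;
	buf[n] = '\0';
	return strtoull(buf, NULL, 10);
}

/* e.g. read_stat("/sys/kernel/stats/kvm/exits"); sampling 1,000 stats
 * this way costs roughly 3,000 syscalls per collection period. */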
On Tue, May 5, 2020 at 2:18 AM Emanuele Giuseppe Esposito <eesposit@redhat.com> wrote: > > > > On 5/4/20 11:37 PM, David Rientjes wrote: > > Since this is becoming a generic API (good!!), maybe we can discuss > > possible ways to optimize gathering of stats in mass? > > Sure, the idea of a binary format was considered from the beginning in > [1], and it can be done either together with the current filesystem, or > as a replacement via different mount options. ASCII stats are not scalable. A binary format is definitely the way to go.
On 05/05/20 18:53, Jim Mattson wrote: >>> Since this is becoming a generic API (good!!), maybe we can discuss >>> possible ways to optimize gathering of stats in mass? >> Sure, the idea of a binary format was considered from the beginning in >> [1], and it can be done either together with the current filesystem, or >> as a replacement via different mount options. > > ASCII stats are not scalable. A binary format is definitely the way to go. I am totally in favor of having a binary format, but it should be introduced as a separate series on top of this one---and preferably by someone who has already put some thought into the problem (which Emanuele and I have not, beyond ensuring that the statsfs concept and API is flexible enough). ASCII stats are necessary for quick userspace consumption and for backwards compatibility with KVM debugfs (which is not an ABI, but it's damn useful and should not be dropped without providing something as handy), so this is what this series starts from. Paolo
On Tue, 5 May 2020, Paolo Bonzini wrote: > >>> Since this is becoming a generic API (good!!), maybe we can discuss > >>> possible ways to optimize gathering of stats in mass? > >> Sure, the idea of a binary format was considered from the beginning in > >> [1], and it can be done either together with the current filesystem, or > >> as a replacement via different mount options. > > > > ASCII stats are not scalable. A binary format is definitely the way to go. > > I am totally in favor of having a binary format, but it should be > introduced as a separate series on top of this one---and preferably by > someone who has already put some thought into the problem (which > Emanuele and I have not, beyond ensuring that the statsfs concept and > API is flexible enough). > The concern is that once this series is merged then /sys/kernel/stats could be considered an ABI and there would be a reasonable expectation that it will remain stable, in so far as the stats that userspace is interested in are stable and not obsoleted. So is this a suggestion that the binary format becomes complementary to statsfs and provide a means for getting all stats from a single subsystem, or that this series gets converted to such a format before it is merged? > ASCII stats are necessary for quick userspace consumption and for > backwards compatibility with KVM debugfs (which is not an ABI, but it's > damn useful and should not be dropped without providing something as > handy), so this is what this series starts from. >
On 05/05/20 19:07, David Rientjes wrote: >> I am totally in favor of having a binary format, but it should be >> introduced as a separate series on top of this one---and preferably by >> someone who has already put some thought into the problem (which >> Emanuele and I have not, beyond ensuring that the statsfs concept and >> API is flexible enough). >> > The concern is that once this series is merged then /sys/kernel/stats > could be considered an ABI and there would be a reasonable expectation > that it will remain stable, in so far as the stats that userspace is > interested in are stable and not obsoleted. > > So is this a suggestion that the binary format becomes complementary to > statsfs and provide a means for getting all stats from a single subsystem, > or that this series gets converted to such a format before it is merged? The binary format should be complementary. The ASCII format should indeed be considered stable even though individual statistics would come and go. It may make sense to allow disabling ASCII files via mount and/or Kconfig options; but either way, the binary format can and should be added on top. I have not put any thought into what the binary format would look like and what its features would be. For example these are but the first questions that come to mind: * would it be possible to read/clear an arbitrary statistic with pread/pwrite, or do you have to read all of them? * if userspace wants to read the schema just once and then read the statistics many times, how is it informed of schema changes? * and of course the details of how the schema (names of stat and subsources) is encoded and what details it should include about the values (e.g. type or just signedness). Another possibility is to query stats via BPF. This could be a third way to access the stats, or it could be alternative to a binary format. Paolo
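None of the binary-format details are settled in the thread, but to make the open questions above concrete, one purely illustrative shape (not something proposed in the patches) would be a schema file that is read once plus a fixed-stride values file, so that an arbitrary statistic can be read or cleared with pread()/pwrite() at a known offset, and a schema-version bump tells readers to re-read the schema:

/* Purely illustrative; no such layout exists in this series. */
struct stats_fs_bin_header {
	__u32 magic;
	__u32 schema_version;	/* bumped when stats come and go */
	__u32 nr_values;
	__u32 value_size;	/* stride of the values array, e.g. 8 */
};

struct stats_fs_bin_desc {	/* one entry per value in the schema file */
	char  name[64];
	__u8  is_signed;
	__u8  is_cumulative;
	__u16 flags;
	__u32 index;		/* value offset = sizeof(header) + index * value_size */
};

/* The values file would then be the header followed by an array of __u64
 * in index order; pwrite() of zero at an index could clear that counter. */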
Adding Stefan Raspl, who has done a lot of kvm_stat work in the past. On 05.05.20 19:21, Paolo Bonzini wrote: > On 05/05/20 19:07, David Rientjes wrote: >>> I am totally in favor of having a binary format, but it should be >>> introduced as a separate series on top of this one---and preferably by >>> someone who has already put some thought into the problem (which >>> Emanuele and I have not, beyond ensuring that the statsfs concept and >>> API is flexible enough). >>> >> The concern is that once this series is merged then /sys/kernel/stats >> could be considered an ABI and there would be a reasonable expectation >> that it will remain stable, in so far as the stats that userspace is >> interested in are stable and not obsoleted. >> >> So is this a suggestion that the binary format becomes complementary to >> statsfs and provide a means for getting all stats from a single subsystem, >> or that this series gets converted to such a format before it is merged? > > The binary format should be complementary. The ASCII format should > indeed be considered stable even though individual statistics would come > and go. It may make sense to allow disabling ASCII files via mount > and/or Kconfig options; but either way, the binary format can and should > be added on top. > > I have not put any thought into what the binary format would look like > and what its features would be. For example these are but the first > questions that come to mind: > > * would it be possible to read/clear an arbitrary statistic with > pread/pwrite, or do you have to read all of them? > > * if userspace wants to read the schema just once and then read the > statistics many times, how is it informed of schema changes? > > * and of course the details of how the schema (names of stat and > subsources) is encoded and what details it should include about the > values (e.g. type or just signedness). > > Another possibility is to query stats via BPF. This could be a third > way to access the stats, or it could be alternative to a binary format. > > Paolo >
On Mon, May 4, 2020 at 4:05 AM Emanuele Giuseppe Esposito <eesposit@redhat.com> wrote: ... > Statsfs offers a generic and stable API, allowing any kind of > directory/file organization and supporting multiple kind of aggregations > (not only sum, but also average, max, min and count_zero) and data types > (all unsigned and signed types plus boolean). The implementation, which is > a generalization of KVM’s debugfs statistics code, takes care of gathering > and displaying information at run time; users only need to specify the > values to be included in each source. > > Statsfs would also be a different mountpoint from debugfs, and would not > suffer from limited access due to the security lock down patches. Its main > function is to display each statistics as a file in the desired folder > hierarchy defined through the API. Statsfs files can be read, and possibly > cleared if their file mode allows it. > > Statsfs has two main components: the public API defined by > include/linux/statsfs.h, and the virtual file system which should end up > in /sys/kernel/stats. This is good work. As David Rientjes mentioned, I'm currently investigating a similar project, based on a google-internal debugfs-based FS we call "metricfs". It's designed in a slightly different fashion than statsfs here is, and the statistics exported are mostly fed into our OpenTelemetry-like system. We're motivated by wanting an upstreamed solution, so that we can upstream the metrics we create that are of general interest, and lower the overall rebasing burden for our tree. Some feedback on your design as proposed: - the 8/16/32/64 signed/unsigned integers seems like a wart, and the built-in support to grab any offset from a structure doesn't seem like much of an advantage. A simpler interface would be to just support an "integer" (possibly signed/unsigned) type, which is always 64-bit, and allow the caller to provide a function pointer to retrieve the value, with one or two void *s cbargs. Then the framework could provide an offset-based callback (or callbacks) similar to the existing functionality, and a similar one for per-CPU based statistics. A second "clear" callback could be optionally provided to allow for statistics to be cleared, as in your current proposal. - A callback-style interface also allows for a lot more flexibility in sourcing values, and doesn't lock your callers into one way of storing them. You would, of course, have to be clear about locking rules etc. for the callbacks. - Beyond the statistic's type, one *very* useful piece of metadata for telemetry tools is knowing whether a given statistic is "cumulative" (an unsigned counter which is only ever increased), as opposed to a floating value (like "amount of memory used"). I agree with the folks asking for a binary interface to read statistics, but I also agree that it can be added on later. I'm more concerned with getting the statistics model and capabilities right from the beginning, because those are harder to adjust later. Would you be open to collaborating on the statsfs design? As background for this discussion, here are some details of how our metricfs implementation approaches statistics: 1. Each metricfs metric can have one or two string or integer "keys". If these exist, they expand the metric from a single value into a multi-dimensional table. 
For example, we use this to report a hash table we keep of functions calling "WARN()", in a 'warnings' statistic: % cat .../warnings/values x86_pmu_stop 1 % Indicates that the x86_pmu_stop() function has had a WARN() fire once since the system was booted. If multiple functions have fired WARN()s, they are listed in this table with their own counts. [1] We also use these to report per-CPU counters on a CPU-by-CPU basis: % cat .../irq_x86/NMI/values 0 42 1 18 ... one line per cpu % 2. We also export some metadata about each statistic. For example, the metadata for the NMI counter above looks like: % cat .../NMI/annotations DESCRIPTION Non-maskable\ interrupts CUMULATIVE % cat .../NMI/fields cpu value int int % (Describing the statistic, marking it as "cumulative", and saying the fields are "cpu" and "value", both ints). The metadata doesn't change much, so having separate files allows the user-space agent to read them once and then the values multiple times. 3. We have a (very few) statistics where the value itself is a string, usually for device statuses. For our use cases, we generally don't both output a statistic and it's aggregation from the kernel; either we sum up things in the kernel (e.g. over a bunch of per-cpu or per-memcg counters) and only have the result statistic, or we expect user-space to sum up the data if it's interested. The tabular form makes it pretty easy to do so (i.e. you can use awk(1) to sum all of the per-cpu NMI counters). We don't generally reset statistics, except as a side effect of removing a device. Thanks again for the patchset, and for pointing out that KVM also needs statistics sent out; it's great that there is interest in this. Cheers, - jonathan P.S. I also have a couple (non-critical) high-level notes: * It's not clear what tree your patches are against, or their dependencies; I was able to get them to apply to linux-next master with a little massaging, but then they failed to compile because they're built on top of your "libfs: group and simplify linux fs code" patch series you sent out in late april. Including a git link or at least a baseline tree and a list of the patch series you rely upon is helpful for anyone wanting to try out your changes. * The main reason I was trying to try out your patches was to get a sense of the set of directories and things the KVM example generates; while it's apparently the same as the existing KVM debugfs tree, it's useful to know how this ends up looking on a real system, and I'm not familiar with the KVM stats. Since this patch is intended slightly more broadly than just KVM, it might have been useful to include sample output for those not familiar with how things are today. [1] We also use this to export various network/storage statistics on a per-device basis. e.g. network bytes received counts: % cat .../rx_bytes/values lo 501360681 eth0 1457631256 ... %
[Answering for Emanuele because he's not available until Monday] On 07/05/20 19:45, Jonathan Adams wrote: > This is good work. As David Rientjes mentioned, I'm currently investigating > a similar project, based on a google-internal debugfs-based FS we call > "metricfs". It's > designed in a slightly different fashion than statsfs here is, and the > statistics exported are > mostly fed into our OpenTelemetry-like system. We're motivated by > wanting an upstreamed solution, so that we can upstream the metrics we > create that are of general interest, and lower the overall rebasing > burden for our tree. Cool. We included a public reading API exactly so that there could be other "frontends". I was mostly thinking of BPF as an in-tree user, but your metricfs could definitely use the reading API. > - the 8/16/32/64 signed/unsigned integers seems like a wart, and the > built-in support to grab any offset from a structure doesn't seem like > much of an advantage. A simpler interface would be to just support an > "integer" (possibly signed/unsigned) type, which is always 64-bit, and > allow the caller to provide a function pointer to retrieve the value, > with one or two void *s cbargs. Then the framework could provide an > offset-based callback (or callbacks) similar to the existing > functionality, and a similar one for per-CPU based statistics. A > second "clear" callback could be optionally provided to allow for > statistics to be cleared, as in your current proposal. Ok, so basically splitting get_simple_value into many separate callbacks. The callbacks would be in a struct like

struct stats_fs_type {
	uint64_t (*get)(struct stats_fs_value *, void *);
	void (*clear)(struct stats_fs_value *, void *);
	bool is_signed;	/* "signed" by itself is a C keyword */
};

static uint64_t stats_fs_get_u8(struct stats_fs_value *val, void *base)
{
	return *((uint8_t *)(base + (uintptr_t)val->arg));
}

static void stats_fs_clear_u8(struct stats_fs_value *val, void *base)
{
	*((uint8_t *)(base + (uintptr_t)val->arg)) = 0;
}

struct stats_fs_type stats_fs_type_u8 = {
	stats_fs_get_u8,
	stats_fs_clear_u8,
	false
};

and custom types can be defined using "&(struct stats_fs_type) {...}". > - Beyond the statistic's type, one *very* useful piece of metadata > for telemetry tools is knowing whether a given statistic is > "cumulative" (an unsigned counter which is only ever increased), as > opposed to a floating value (like "amount of memory used"). Good idea. Also, clearing does not make sense for a floating value, so we can use cumulative/floating to get a default for the mode: KVM statistics for example are mostly cumulative and mode 644, except a few that are floating and those are all mode 444. Therefore it makes sense to add cumulative/floating even before outputting it as metadata. > I'm more > concerned with getting the statistics model and capabilities right > from the beginning, because those are harder to adjust later. Agreed. > 1. Each metricfs metric can have one or two string or integer "keys". > If these exist, they expand the metric from a single value into a > multi-dimensional table. For example, we use this to report a hash > table we keep of functions calling "WARN()", in a 'warnings' > statistic: > > % cat .../warnings/values > x86_pmu_stop 1 > % > > Indicates that the x86_pmu_stop() function has had a WARN() fire once > since the system was booted. If multiple functions have fired > WARN()s, they are listed in this table with their own counts. 
[1] We > also use these to report per-CPU counters on a CPU-by-CPU basis: > > % cat .../irq_x86/NMI/values > 0 42 > 1 18 > ... one line per cpu > % cat .../rx_bytes/values > lo 501360681 > eth0 1457631256 These seem like two different things. The percpu and per-interface values are best represented as subordinate sources, one per CPU and one per interface. For interfaces I would just use a separate directory, but it doesn't really make sense for CPUs. So if we can cater for it in the model, it's better. For example: - add a new argument to statsfs_create_source and statsfs_create_values that makes it not create directories and files respectively. - add a new "aggregate function" STATS_FS_LIST that directs the parent to build a table of all the simple values below it We can also add a helper statsfs_add_values_percpu that creates a new source for each CPU, I think. The warnings one instead is a real hash table. It should be possible to implement it as some kind of customized aggregation, that is implemented in the client instead of coming from subordinate sources. The presentation can then just use STATS_FS_LIST. I don't see anything in the design that is a blocker. > 2. We also export some metadata about each statistic. For example, > the metadata for the NMI counter above looks like: > > % cat .../NMI/annotations > DESCRIPTION Non-maskable\ interrupts > CUMULATIVE > % cat .../NMI/fields > cpu value > int int > % Good idea. I would prefer per-directory dot-named files for this. For example a hypothetical statsfs version of /proc/interrupts could be like this: $ cat /sys/kernel/stats/interrupts/.schema 0 // Name CUMULATIVE // Flags int:int // Type(s) IR-IO-APIC 2-edge timer // Description ... LOC CUMULATIVE int:int Local timer interrupts ... $ cat /sys/kernel/stats/interrupts/LOC 0 4286815 1 4151572 2 4199361 3 4229248 > 3. We have a (very few) statistics where the value itself is a string, > usually for device statuses. Maybe in addition to CUMULATIVE and FLOATING we can have ENUM properties, and a table to convert those enums to strings. Aggregation could also be used to make a histogram out of enums in subordinate sources, e.g. $ cat /sys/kernel/stats/kvm/637-1/vcpu_state running 12 uninitialized 0 halted 4 So in general I'd say the sources/values model holds up. We certainly want to: - switch immediately to callbacks instead of the type constants (so that core statsfs code only does signed/unsigned) - add a field to distinguish cumulative and floating properties (and use it to determine the default file mode) - add a new argument to statsfs_create_source and statsfs_create_values that makes it not create directories and files respectively - add a new API to look for a statsfs_value recursively in all the subordinate sources, and pass the source/value pair to a callback function; and reimplement recursive aggregation and clear in terms of this function. > For our use cases, we generally don't both output a statistic and it's > aggregation from the kernel; either we sum up things in the kernel > (e.g. over a bunch of per-cpu or per-memcg counters) and only have the > result statistic, or we expect user-space to sum up the data if it's > interested. The tabular form makes it pretty easy to do so (i.e. you > can use awk(1) to sum all of the per-cpu NMI counters). Yep, the above "not create a dentry" flag would handle the case where you sum things up in the kernel because the more fine grained counters would be overwhelming. Paolo
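A small sketch of the cumulative/floating property and the default-mode rule described above (cumulative counters can be cleared and so default to 0644, floating values are read-only and default to 0444); the names are illustrative, not taken from the series:

/* Illustrative only. */
enum stats_fs_kind {
	STATS_FS_CUMULATIVE,	/* counter that only ever increases */
	STATS_FS_FLOATING,	/* instantaneous value, e.g. memory in use */
};

static umode_t stats_fs_default_mode(enum stats_fs_kind kind)
{
	/* Clearing only makes sense for cumulative counters, so only
	 * those default to a writable mode. */
	return kind == STATS_FS_CUMULATIVE ? 0644 : 0444;
}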
On 5/8/20 11:44 AM, Paolo Bonzini wrote: > So in general I'd say the sources/values model holds up. We certainly > want to: > > - switch immediately to callbacks instead of the type constants (so that > core statsfs code only does signed/unsigned) > > - add a field to distinguish cumulative and floating properties (and use > it to determine the default file mode) > > - add a new argument to statsfs_create_source and statsfs_create_values > that makes it not create directories and files respectively > > - add a new API to look for a statsfs_value recursively in all the > subordinate sources, and pass the source/value pair to a callback > function; and reimplement recursive aggregation and clear in terms of > this function. Ok I will apply this, thank you for all the suggestions. I will post the v3 patchset in the next few weeks. In the meanwhile, I wrote the documentation you asked (even though it's going to change in v3), you can find it here: https://github.com/esposem/linux/commit/dfa92f270f1aed73d5f3b7f12640b2a1635c711f Thank you, Emanuele
On Fri, May 8, 2020 at 2:44 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > [Answering for Emanuele because he's not available until Monday] > > On 07/05/20 19:45, Jonathan Adams wrote: > > This is good work. As David Rientjes mentioned, I'm currently investigating > > a similar project, based on a google-internal debugfs-based FS we call > > "metricfs". It's > > designed in a slightly different fashion than statsfs here is, and the > > statistics exported are > > mostly fed into our OpenTelemetry-like system. We're motivated by > > wanting an upstreamed solution, so that we can upstream the metrics we > > create that are of general interest, and lower the overall rebasing > > burden for our tree. > > Cool. We included a public reading API exactly so that there could be > other "frontends". I was mostly thinking of BPF as an in-tree user, but > your metricfs could definitely use the reading API. > > > - the 8/16/32/64 signed/unsigned integers seems like a wart, and the > > built-in support to grab any offset from a structure doesn't seem like > > much of an advantage. A simpler interface would be to just support an> "integer" (possibly signed/unsigned) type, which is always 64-bit, and > > allow the caller to provide a function pointer to retrieve the value, > > with one or two void *s cbargs. Then the framework could provide an > > offset-based callback (or callbacks) similar to the existing > > functionality, and a similar one for per-CPU based statistics. A > > second "clear" callback could be optionally provided to allow for > > statistics to be cleared, as in your current proposal. > > Ok, so basically splitting get_simple_value into many separate > callbacks. The callbacks would be in a struct like > > struct stats_fs_type { > uint64_t (*get)(struct stats_fs_value *, void *); > void (*clear)(struct stats_fs_value *, void *); > bool signed; > } ... > struct stats_fs_type stats_fs_type_u8 = { > stats_fs_get_u8, > stats_fs_clear_u8, > false > }; > > and custom types can be defined using "&(struct stats_fs_type) {...}". That makes sense. > > - Beyond the statistic's type, one *very* useful piece of metadata > > for telemetry tools is knowing whether a given statistic is > > "cumulative" (an unsigned counter which is only ever increased), as > > opposed to a floating value (like "amount of memory used"). > > Good idea. Also, clearing does not make sense for a floating value, so > we can use cumulative/floating to get a default for the mode: KVM > statistics for example are mostly cumulative and mode 644, except a few > that are floating and those are all mode 444. Therefore it makes sense > to add cumulative/floating even before outputting it as metadata. > > > I'm more > > concerned with getting the statistics model and capabilities right > > from the beginning, because those are harder to adjust later. > > Agreed. > > > 1. Each metricfs metric can have one or two string or integer "keys". > > If these exist, they expand the metric from a single value into a > > multi-dimensional table. For example, we use this to report a hash > > table we keep of functions calling "WARN()", in a 'warnings' > > statistic: > > > > % cat .../warnings/values > > x86_pmu_stop 1 > > % > > > > Indicates that the x86_pmu_stop() function has had a WARN() fire once > > since the system was booted. If multiple functions have fired > > WARN()s, they are listed in this table with their own counts. 
[1] We > > also use these to report per-CPU counters on a CPU-by-CPU basis: > > > > % cat .../irq_x86/NMI/values > > 0 42 > > 1 18 > > ... one line per cpu > > % cat .../rx_bytes/values > > lo 501360681 > > eth0 1457631256 > > These seem like two different things. I see your point; I agree that there are two different things here. > The percpu and per-interface values are best represented as subordinate > sources, one per CPU and one per interface. For interfaces I would just > use a separate directory, but it doesn't really make sense for CPUs. So > if we can cater for it in the model, it's better. For example: > > - add a new argument to statsfs_create_source and statsfs_create_values > that makes it not create directories and files respectively. > > - add a new "aggregate function" STATS_FS_LIST that directs the parent > to build a table of all the simple values below it > > We can also add a helper statsfs_add_values_percpu that creates a new > source for each CPU, I think. I think I'd characterize this slightly differently; we have a set of statistics which are essentially "in parallel": - a variety of statistics, N CPUs they're available for, or - a variety of statistics, N interfaces they're available for. - a variety of statistics, N kvm object they're available for. Recreating a parallel hierarchy of statistics any time we add/subtract a CPU or interface seems like a lot of overhead. Perhaps a better model would be some sort of "parameter enumn" (naming is hard; parameter set?), so when a CPU/network interface/etc is added you'd add its ID to the "CPUs" we know about, and at removal time you'd take it out; it would have an associated cbarg for the value getting callback. Does that make sense as a design? I'm working on characterizing all of our metricfs usage; I'll see if this looks like it mostly covers our usecases. > The warnings one instead is a real hash table. It should be possible to > implement it as some kind of customized aggregation, that is implemented > in the client instead of coming from subordinate sources. The > presentation can then just use STATS_FS_LIST. I don't see anything in > the design that is a blocker. Yes; though if it's low-enough overhead, you could imagine having a dynamically-updated parameter enum based on the hash table. > > 2. We also export some metadata about each statistic. For example, > > the metadata for the NMI counter above looks like: > > > > % cat .../NMI/annotations > > DESCRIPTION Non-maskable\ interrupts > > CUMULATIVE > > % cat .../NMI/fields > > cpu value > > int int > > % > > Good idea. I would prefer per-directory dot-named files for this. For > example a hypothetical statsfs version of /proc/interrupts could be like > this: > > $ cat /sys/kernel/stats/interrupts/.schema > 0 // Name > CUMULATIVE // Flags > int:int // Type(s) > IR-IO-APIC 2-edge timer // Description > ... > LOC > CUMULATIVE > int:int > Local timer interrupts > ... > $ cat /sys/kernel/stats/interrupts/LOC > 0 4286815 > 1 4151572 > 2 4199361 > 3 4229248 > > > 3. We have a (very few) statistics where the value itself is a string, > > usually for device statuses. > > Maybe in addition to CUMULATIVE and FLOATING we can have ENUM > properties, and a table to convert those enums to strings. Aggregation > could also be used to make a histogram out of enums in subordinate > sources, e.g. > > $ cat /sys/kernel/stats/kvm/637-1/vcpu_state > running 12 > uninitialized 0 > halted 4 That's along similar lines to the parameter enums, yeah. 
> So in general I'd say the sources/values model holds up. We certainly > want to: > > - switch immediately to callbacks instead of the type constants (so that > core statsfs code only does signed/unsigned) > > - add a field to distinguish cumulative and floating properties (and use > it to determine the default file mode) Yup, these make sense. > - add a new argument to statsfs_create_source and statsfs_create_values > that makes it not create directories and files respectively > > - add a new API to look for a statsfs_value recursively in all the > subordinate sources, and pass the source/value pair to a callback > function; and reimplement recursive aggregation and clear in terms of > this function. This is where I think a little iteration on the "parameter enums" should happen before jumping into implementation. > > For our use cases, we generally don't both output a statistic and it's > > aggregation from the kernel; either we sum up things in the kernel > > (e.g. over a bunch of per-cpu or per-memcg counters) and only have the > > result statistic, or we expect user-space to sum up the data if it's > > interested. The tabular form makes it pretty easy to do so (i.e. you > > can use awk(1) to sum all of the per-cpu NMI counters). > > Yep, the above "not create a dentry" flag would handle the case where > you sum things up in the kernel because the more fine grained counters > would be overwhelming. nodnod; or the callback could handle the sum itself. Thanks, - jonathan
Hi Jonathan, I think the remaining sticky point is this one: On 11/05/20 19:02, Jonathan Adams wrote: > I think I'd characterize this slightly differently; we have a set of > statistics which are essentially "in parallel": > > - a variety of statistics, N CPUs they're available for, or > - a variety of statistics, N interfaces they're available for. > - a variety of statistics, N kvm object they're available for. > > Recreating a parallel hierarchy of statistics any time we add/subtract > a CPU or interface seems like a lot of overhead. Perhaps a better > model would be some sort of "parameter enumn" (naming is hard; > parameter set?), so when a CPU/network interface/etc is added you'd > add its ID to the "CPUs" we know about, and at removal time you'd > take it out; it would have an associated cbarg for the value getting > callback. > >> Yep, the above "not create a dentry" flag would handle the case where >> you sum things up in the kernel because the more fine grained counters >> would be overwhelming. > > nodnod; or the callback could handle the sum itself. In general for statsfs we took a more explicit approach where each addend in a sum is a separate stats_fs_source. In this version of the patches it's also a directory, but we'll take your feedback and add both the ability to hide directories (first) and to list values (second). So, in the cases of interfaces and KVM objects I would prefer to keep each addend separate. For CPUs that however would be pretty bad. Many subsystems might accumulate stats percpu for performance reason, which would then be exposed as the sum (usually). So yeah, native handling of percpu values makes sense. I think it should fit naturally into the same custom aggregation framework as hash table keys, we'll see if there's any devil in the details. Core kernel stats such as /proc/interrupts or /proc/stat are the exception here, since individual per-CPU values can be vital for debugging. For those, creating a source per stat, possibly on-the-fly at hotplug/hot-unplug time because NR_CPUS can be huge, would still be my preferred way to do it. Thanks, Paolo
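For the common "accumulate percpu, expose the sum" case mentioned above, a getter in the style of the callback sketch earlier in the thread could simply walk the per-CPU copies. This assumes the value's private pointer is a u64 __percpu * counter, which is an assumption made purely for illustration:

#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Sketch: sum a per-CPU counter into one reported value. */
static uint64_t stats_fs_get_percpu_sum(struct stats_fs_value *val, void *base)
{
	u64 __percpu *counter = (u64 __percpu *)base;
	u64 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(counter, cpu);

	return sum;
}

static void stats_fs_clear_percpu(struct stats_fs_value *val, void *base)
{
	u64 __percpu *counter = (u64 __percpu *)base;
	int cpu;

	for_each_possible_cpu(cpu)
		*per_cpu_ptr(counter, cpu) = 0;
}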
On Mon, May 11, 2020 at 10:34 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > Hi Jonathan, I think the remaining sticky point is this one: Apologies it took a couple days for me to respond; I wanted to finish evaluating our current usage to make sure I had a full picture; I'll summarize our state at the bottom. > On 11/05/20 19:02, Jonathan Adams wrote: > > I think I'd characterize this slightly differently; we have a set of > > statistics which are essentially "in parallel": > > > > - a variety of statistics, N CPUs they're available for, or > > - a variety of statistics, N interfaces they're available for. > > - a variety of statistics, N kvm object they're available for. > > > > Recreating a parallel hierarchy of statistics any time we add/subtract > > a CPU or interface seems like a lot of overhead. Perhaps a better > > model would be some sort of "parameter enumn" (naming is hard; > > parameter set?), so when a CPU/network interface/etc is added you'd > > add its ID to the "CPUs" we know about, and at removal time you'd > > take it out; it would have an associated cbarg for the value getting > > callback. > > > >> Yep, the above "not create a dentry" flag would handle the case where > >> you sum things up in the kernel because the more fine grained counters > >> would be overwhelming. > > > > nodnod; or the callback could handle the sum itself. > > In general for statsfs we took a more explicit approach where each > addend in a sum is a separate stats_fs_source. In this version of the > patches it's also a directory, but we'll take your feedback and add both > the ability to hide directories (first) and to list values (second). > > So, in the cases of interfaces and KVM objects I would prefer to keep > each addend separate. This just feels like a lot of churn just to add a statistic or object; in your model, every time a KVM or VCPU is created, you create the N statistics, leading to N*M total objects. As I was imagining it, you'd have: A 'parameter enum' which maps names to object pointers and A set of statistics which map a statfs path to {callback, cbarg, zero or more parameter enums} So adding a new KVM VCPU would just be "add an object to the KVM's VCPU parameter enum", and removing it would be the opposite, and a couple callbacks could handle basically all of the stats. The only tricky part would be making sure the parameter enum value create/destroy and the callback calls are coordinated correctly. If you wanted stats for a particular VCPU, we could mark the overall directory as "include subdirs for VCPU parameter", and you'd automatically get one directory per VCPU, with the same set of stats in it, constrained to the single VCPU. I could also imagine having an ".agg_sum/{stata,statb,...}" to report using the aggregations you have, or a mode to say "stats in this directory are sums over the following VCPU parameter". > For CPUs that however would be pretty bad. Many subsystems might > accumulate stats percpu for performance reason, which would then be > exposed as the sum (usually). So yeah, native handling of percpu values > makes sense. I think it should fit naturally into the same custom > aggregation framework as hash table keys, we'll see if there's any devil > in the details. > > Core kernel stats such as /proc/interrupts or /proc/stat are the > exception here, since individual per-CPU values can be vital for > debugging. 
For those, creating a source per stat, possibly on-the-fly > at hotplug/hot-unplug time because NR_CPUS can be huge, would still be > my preferred way to do it. Our metricfs has basically two modes: report all per-CPU values (for the IPI counts etc; you pass a callback which takes a 'int cpu' argument) or a callback that sums over CPUs and reports the full value. It also seems hard to have any subsystem with a per-CPU stat having to install a hotplug callback to add/remove statistics. In my model, a "CPU" parameter enum which is automatically kept up-to-date is probably sufficient for the "report all per-CPU values". Does this make sense to you? I realize that this is a significant change to the model y'all are starting with; I'm willing to do the work to flesh it out. Thanks for your time, - Jonathan P.S. Here's a summary of the types of statistics we use in metricfs in google, to give a little context: - integer values (single value per stat, source also a single value); a couple of these are boolean values exported as '0' or '1'. - per-CPU integer values, reported as a <cpuid, value> table - per-CPU integer values, summed and reported as an aggregate - single-value values, keys related to objects: - many per-device (disk, network, etc) integer stats - some per-device string data (version strings, UUIDs, and occasional statuses.) - a few histograms (usually counts by duration ranges) - the "function name" to count for the WARN statistic I mentioned. - A single statistic with two keys (for livepatch statistics; the value is the livepatch status as a string) Most of the stats with keys are "complete" (every key has a value), but there are several examples of statistics where only some of the possible keys have values, or (e.g. for networking statistics) only the keys visible to the reading process (e.g. in its namespaces) are included.
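To make the "parameter enum" model above concrete, here is a rough sketch of the data structures it implies; every name here is hypothetical and belongs to neither metricfs nor the posted statsfs API:

struct metric_param_entry {
	const char *name;	/* e.g. "eth0", or "17" for CPU 17 */
	void *object;		/* opaque pointer handed back to the callback */
};

struct metric_param_enum {
	const char *name;			/* e.g. "cpu", "netdev", "vcpu" */
	struct metric_param_entry *entries;	/* kept up to date at hotplug or
						 * device add/remove time */
	size_t nr_entries;
};

struct metric {
	const char *path;			/* location under the stats mount */
	u64 (*get)(void *cbarg, void *key0, void *key1);
	void *cbarg;
	struct metric_param_enum *params[2];	/* zero, one or two keys */
};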
On 14/05/20 19:35, Jonathan Adams wrote: >> In general for statsfs we took a more explicit approach where each >> addend in a sum is a separate stats_fs_source. In this version of the >> patches it's also a directory, but we'll take your feedback and add both >> the ability to hide directories (first) and to list values (second). >> >> So, in the cases of interfaces and KVM objects I would prefer to keep >> each addend separate. > > This just feels like a lot of churn just to add a statistic or object; > in your model, every time a KVM or VCPU is created, you create the N > statistics, leading to N*M total objects. While it's N*M files, only O(M) statsfs API calls are needed to create them. Whether you have O(N*M) total kmalloc-ed objects or O(M) is an implementation detail. Having O(N*M) API calls would be a non-started, I agree - especially once you start thinking of more efficient publishing mechanisms that unlike files are also O(M). >> For CPUs that however would be pretty bad. Many subsystems might >> accumulate stats percpu for performance reason, which would then be >> exposed as the sum (usually). So yeah, native handling of percpu values >> makes sense. I think it should fit naturally into the same custom >> aggregation framework as hash table keys, we'll see if there's any devil >> in the details. >> >> Core kernel stats such as /proc/interrupts or /proc/stat are the >> exception here, since individual per-CPU values can be vital for >> debugging. For those, creating a source per stat, possibly on-the-fly >> at hotplug/hot-unplug time because NR_CPUS can be huge, would still be >> my preferred way to do it. > > Our metricfs has basically two modes: report all per-CPU values (for > the IPI counts etc; you pass a callback which takes a 'int cpu' > argument) or a callback that sums over CPUs and reports the full > value. It also seems hard to have any subsystem with a per-CPU stat > having to install a hotplug callback to add/remove statistics. Yes, this is also why I think percpu values should have some kind of native handling. Reporting per-CPU values individually is the exception. > In my model, a "CPU" parameter enum which is automatically kept > up-to-date is probably sufficient for the "report all per-CPU values". Yes (or a separate CPU source in my model). Paolo > Does this make sense to you? I realize that this is a significant > change to the model y'all are starting with; I'm willing to do the > work to flesh it out. > Thanks for your time, > - Jonathan > > P.S. Here's a summary of the types of statistics we use in metricfs > in google, to give a little context: > > - integer values (single value per stat, source also a single value); > a couple of these are boolean values exported as '0' or '1'. > - per-CPU integer values, reported as a <cpuid, value> table > - per-CPU integer values, summed and reported as an aggregate > - single-value values, keys related to objects: > - many per-device (disk, network, etc) integer stats > - some per-device string data (version strings, UUIDs, and > occasional statuses.) > - a few histograms (usually counts by duration ranges) > - the "function name" to count for the WARN statistic I mentioned. > - A single statistic with two keys (for livepatch statistics; the > value is the livepatch status as a string) > > Most of the stats with keys are "complete" (every key has a value), > but there are several examples of statistics where only some of the > possible keys have values, or (e.g. 
for networking statistics) only > the keys visible to the reading process (e.g. in its namespaces) are > included. >
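Creating the per-CPU subordinate sources on the fly, as suggested above, could hang off the regular CPU hotplug state machine; cpuhp_setup_state() is the normal kernel interface for that, while the statsfs_* calls below reuse the names from the cover letter with assumed signatures, and the irq_* names are invented for the example:

/* Sketch: one subordinate source per online CPU, created at hotplug time. */
static struct statsfs_source *irq_source;	/* parent, e.g. "interrupts" */
static struct statsfs_source *cpu_sources[NR_CPUS];

static int irq_stats_cpu_online(unsigned int cpu)
{
	char name[16];

	snprintf(name, sizeof(name), "%u", cpu);
	cpu_sources[cpu] = statsfs_source_create(name);
	if (!cpu_sources[cpu])
		return -ENOMEM;
	statsfs_source_add_subordinate(irq_source, cpu_sources[cpu]);
	return 0;
}

static int irq_stats_cpu_offline(unsigned int cpu)
{
	statsfs_source_remove_subordinate(irq_source, cpu_sources[cpu]);
	cpu_sources[cpu] = NULL;
	return 0;
}

/* in init:
 *	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "statsfs/irq:online",
 *			  irq_stats_cpu_online, irq_stats_cpu_offline);
 */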
On Tue, May 5, 2020 at 3:07 AM David Rientjes <rientjes@google.com> wrote: > > On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote: > > > There is currently no common way for Linux kernel subsystems to expose > > statistics to userspace shared throughout the Linux kernel; subsystems > > have to take care of gathering and displaying statistics by themselves, > > for example in the form of files in debugfs. For example KVM has its own > > code section that takes care of this in virt/kvm/kvm_main.c, where it sets > > up debugfs handlers for displaying values and aggregating them from > > various subfolders to obtain information about the system state (i.e. > > displaying the total number of exits, calculated by summing all exits of > > all cpus of all running virtual machines). > > > > Allowing each section of the kernel to do so has two disadvantages. First, > > it will introduce redundant code. Second, debugfs is anyway not the right > > place for statistics (for example it is affected by lockdown) > > > > In this patch series I introduce statsfs, a synthetic ram-based virtual > > filesystem that takes care of gathering and displaying statistics for the > > Linux kernel subsystems. > > > > This is exciting, we have been looking in the same area recently. Adding > Jonathan Adams <jwadams@google.com>. > > In your diffstat, one thing I notice that is omitted: an update to > Documentation/* :) Any chance of getting some proposed Documentation/ > updates with structure of the fs, the per subsystem breakdown, and best > practices for managing the stats from the kernel level? > > > The file system is mounted on /sys/kernel/stats and would be already used > > by kvm. Statsfs was initially introduced by Paolo Bonzini [1]. > > > > Statsfs offers a generic and stable API, allowing any kind of > > directory/file organization and supporting multiple kind of aggregations > > (not only sum, but also average, max, min and count_zero) and data types > > (all unsigned and signed types plus boolean). The implementation, which is > > a generalization of KVM’s debugfs statistics code, takes care of gathering > > and displaying information at run time; users only need to specify the > > values to be included in each source. > > > > Statsfs would also be a different mountpoint from debugfs, and would not > > suffer from limited access due to the security lock down patches. Its main > > function is to display each statistics as a file in the desired folder > > hierarchy defined through the API. Statsfs files can be read, and possibly > > cleared if their file mode allows it. > > > > Statsfs has two main components: the public API defined by > > include/linux/statsfs.h, and the virtual file system which should end up > > in /sys/kernel/stats. > > > > The API has two main elements, values and sources. Kernel subsystems like > > KVM can use the API to create a source, add child > > sources/values/aggregates and register it to the root source (that on the > > virtual fs would be /sys/kernel/statsfs). > > > > Sources are created via statsfs_source_create(), and each source becomes a > > directory in the file system. Sources form a parent-child relationship; > > root sources are added to the file system via statsfs_source_register(). > > Every other source is added to or removed from a parent through the > > statsfs_source_add_subordinate and statsfs_source_remote_subordinate APIs. 
> > Once a source is created and added to the tree (via add_subordinate), it > > will be used to compute aggregate values in the parent source. > > > > Values represent quantites that are gathered by the statsfs user. Examples > > of values include the number of vm exits of a given kind, the amount of > > memory used by some data structure, the length of the longest hash table > > chain, or anything like that. Values are defined with the > > statsfs_source_add_values function. Each value is defined by a struct > > statsfs_value; the same statsfs_value can be added to many different > > sources. A value can be considered "simple" if it fetches data from a > > user-provided location, or "aggregate" if it groups all values in the > > subordinates sources that include the same statsfs_value. > > > > This seems like it could have a lot of overhead if we wanted to > periodically track the totality of subsystem stats as a form of telemetry > gathering from userspace. To collect telemetry for 1,000 different stats, > do we need to issue lseek()+read() syscalls for each of them individually > (or, worse, open()+read()+close())? > > Any thoughts on how that can be optimized? A couple of ideas: > > - an interface that allows gathering of all stats for a particular > interface through a single file that would likely be encoded in binary > and the responsibility of userspace to disseminate, or > > - an interface that extends beyond this proposal and allows the reader to > specify which stats they are interested in collecting and then the > kernel will only provide these stats in a well formed structure and > also be binary encoded. Something akin to how ftrace allows you specify the list of functions in /sys/kernel/debug/tracing/set_ftrace_filter would make this a lot easier to use than the one-file-per-stat interface. That would be useful, e.g. in capturing correlated stats periodically e.g. scheduler, power and thermal stats > We've found that the one-file-per-stat method is pretty much a show > stopper from the performance view and we always must execute at least two > syscalls to obtain a single stat. > > Since this is becoming a generic API (good!!), maybe we can discuss > possible ways to optimize gathering of stats in mass? > > > For more information, please consult the kerneldoc documentation in patch > > 2 and the sample uses in the kunit tests and in KVM. > > > > This series of patches is based on my previous series "libfs: group and > > simplify linux fs code" and the single patch sent to kvm "kvm_host: unify > > VM_STAT and VCPU_STAT definitions in a single place". The former > > simplifies code duplicated in debugfs and tracefs (from which statsfs is > > based on), the latter groups all macros definition for statistics in kvm > > in a single common file shared by all architectures. > > > > Patch 1 adds a new refcount and kref destructor wrappers that take a > > semaphore, as those are used later by statsfs. Patch 2 introduces the > > statsfs API, patch 3 provides extensive tests that can also be used as > > example on how to use the API and patch 4 adds the file system support. > > Finally, patch 5 provides a real-life example of statsfs usage in KVM. 
> > > > [1] https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M > > > > Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> > > > > v1->v2 remove unnecessary list_foreach_safe loops, fix wrong indentation, > > change statsfs in stats_fs > > > > Emanuele Giuseppe Esposito (5): > > refcount, kref: add dec-and-test wrappers for rw_semaphores > > stats_fs API: create, add and remove stats_fs sources and values > > kunit: tests for stats_fs API > > stats_fs fs: virtual fs to show stats to the end-user > > kvm_main: replace debugfs with stats_fs > > > > MAINTAINERS | 7 + > > arch/arm64/kvm/Kconfig | 1 + > > arch/arm64/kvm/guest.c | 2 +- > > arch/mips/kvm/Kconfig | 1 + > > arch/mips/kvm/mips.c | 2 +- > > arch/powerpc/kvm/Kconfig | 1 + > > arch/powerpc/kvm/book3s.c | 6 +- > > arch/powerpc/kvm/booke.c | 8 +- > > arch/s390/kvm/Kconfig | 1 + > > arch/s390/kvm/kvm-s390.c | 16 +- > > arch/x86/include/asm/kvm_host.h | 2 +- > > arch/x86/kvm/Kconfig | 1 + > > arch/x86/kvm/Makefile | 2 +- > > arch/x86/kvm/debugfs.c | 64 -- > > arch/x86/kvm/stats_fs.c | 56 ++ > > arch/x86/kvm/x86.c | 6 +- > > fs/Kconfig | 12 + > > fs/Makefile | 1 + > > fs/stats_fs/Makefile | 6 + > > fs/stats_fs/inode.c | 337 ++++++++++ > > fs/stats_fs/internal.h | 35 + > > fs/stats_fs/stats_fs-tests.c | 1088 +++++++++++++++++++++++++++++++ > > fs/stats_fs/stats_fs.c | 773 ++++++++++++++++++++++ > > include/linux/kref.h | 11 + > > include/linux/kvm_host.h | 39 +- > > include/linux/refcount.h | 2 + > > include/linux/stats_fs.h | 304 +++++++++ > > include/uapi/linux/magic.h | 1 + > > lib/refcount.c | 32 + > > tools/lib/api/fs/fs.c | 21 + > > virt/kvm/arm/arm.c | 2 +- > > virt/kvm/kvm_main.c | 314 ++------- > > 32 files changed, 2772 insertions(+), 382 deletions(-) > > delete mode 100644 arch/x86/kvm/debugfs.c > > create mode 100644 arch/x86/kvm/stats_fs.c > > create mode 100644 fs/stats_fs/Makefile > > create mode 100644 fs/stats_fs/inode.c > > create mode 100644 fs/stats_fs/internal.h > > create mode 100644 fs/stats_fs/stats_fs-tests.c > > create mode 100644 fs/stats_fs/stats_fs.c > > create mode 100644 include/linux/stats_fs.h > > > > -- > > 2.25.2 > > > >
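To make the syscall-overhead concern above concrete: with one file per
statistic, sampling N values costs at least open()+read()+close() per value,
i.e. 3*N syscalls per collection pass. A minimal userspace sketch of that
collection loop follows; the stat paths are made up for illustration, the real
layout depends on what each subsystem registers under /sys/kernel/stats.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one decimal stat file; three syscalls per sampled value. */
static long long read_stat(const char *path)
{
	char buf[64];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	return strtoll(buf, NULL, 10);
}

int main(void)
{
	/* Hypothetical paths, for illustration only. */
	const char *stats[] = {
		"/sys/kernel/stats/kvm/exits",
		"/sys/kernel/stats/kvm/irq_injections",
	};

	for (unsigned int i = 0; i < sizeof(stats) / sizeof(stats[0]); i++)
		printf("%s = %lld\n", stats[i], read_stat(stats[i]));
	return 0;
}

A filterable or batched interface along the lines of set_ftrace_filter (write
the list of wanted stats once, then read them all back in a single call) would
reduce this to a constant number of syscalls per sample.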
There is currently no common way for Linux kernel subsystems to expose statistics to userspace shared throughout the Linux kernel; subsystems have to take care of gathering and displaying statistics by themselves, for example in the form of files in debugfs. For example KVM has its own code section that takes care of this in virt/kvm/kvm_main.c, where it sets up debugfs handlers for displaying values and aggregating them from various subfolders to obtain information about the system state (i.e. displaying the total number of exits, calculated by summing all exits of all cpus of all running virtual machines). Allowing each section of the kernel to do so has two disadvantages. First, it will introduce redundant code. Second, debugfs is anyway not the right place for statistics (for example it is affected by lockdown) In this patch series I introduce statsfs, a synthetic ram-based virtual filesystem that takes care of gathering and displaying statistics for the Linux kernel subsystems. The file system is mounted on /sys/kernel/stats and would be already used by kvm. Statsfs was initially introduced by Paolo Bonzini [1]. Statsfs offers a generic and stable API, allowing any kind of directory/file organization and supporting multiple kind of aggregations (not only sum, but also average, max, min and count_zero) and data types (all unsigned and signed types plus boolean). The implementation, which is a generalization of KVM’s debugfs statistics code, takes care of gathering and displaying information at run time; users only need to specify the values to be included in each source. Statsfs would also be a different mountpoint from debugfs, and would not suffer from limited access due to the security lock down patches. Its main function is to display each statistics as a file in the desired folder hierarchy defined through the API. Statsfs files can be read, and possibly cleared if their file mode allows it. Statsfs has two main components: the public API defined by include/linux/statsfs.h, and the virtual file system which should end up in /sys/kernel/stats. The API has two main elements, values and sources. Kernel subsystems like KVM can use the API to create a source, add child sources/values/aggregates and register it to the root source (that on the virtual fs would be /sys/kernel/statsfs). Sources are created via statsfs_source_create(), and each source becomes a directory in the file system. Sources form a parent-child relationship; root sources are added to the file system via statsfs_source_register(). Every other source is added to or removed from a parent through the statsfs_source_add_subordinate and statsfs_source_remote_subordinate APIs. Once a source is created and added to the tree (via add_subordinate), it will be used to compute aggregate values in the parent source. Values represent quantites that are gathered by the statsfs user. Examples of values include the number of vm exits of a given kind, the amount of memory used by some data structure, the length of the longest hash table chain, or anything like that. Values are defined with the statsfs_source_add_values function. Each value is defined by a struct statsfs_value; the same statsfs_value can be added to many different sources. A value can be considered "simple" if it fetches data from a user-provided location, or "aggregate" if it groups all values in the subordinates sources that include the same statsfs_value. 
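As a rough illustration of the API flow described above, a hedged sketch: the
function names (statsfs_source_create, statsfs_source_add_values,
statsfs_source_add_subordinate, statsfs_source_register) come from this cover
letter, but the exact signatures, the statsfs_value fields and the value-array
terminator are assumptions; the kerneldoc in patch 2 is authoritative.

#include <linux/statsfs.h>
#include <linux/stddef.h>
#include <linux/types.h>

struct foo_stats {
	u64 exits;
	u64 irq_injections;
};

/* Assumed layout: a name plus an offset into the caller's data block. */
static struct statsfs_value foo_values[] = {
	{ .name = "exits",          .offset = offsetof(struct foo_stats, exits) },
	{ .name = "irq_injections", .offset = offsetof(struct foo_stats, irq_injections) },
	{ }	/* assumed sentinel */
};

static int foo_stats_init(struct foo_stats *stats)
{
	struct statsfs_source *foo, *obj;

	foo = statsfs_source_create("foo");       /* -> /sys/kernel/stats/foo          */
	obj = statsfs_source_create("object-0");  /* -> /sys/kernel/stats/foo/object-0 */

	/* "Simple" values fetch from @stats; the parent can expose aggregates. */
	statsfs_source_add_values(obj, foo_values, stats);
	statsfs_source_add_subordinate(foo, obj);

	return statsfs_source_register(foo);      /* attach the tree to the fs root */
}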
For more information, please consult the kerneldoc documentation in patch 2
and the sample uses in the kunit tests and in KVM.

This series of patches is based on my previous series "libfs: group and
simplify linux fs code" and the single patch sent to kvm "kvm_host: unify
VM_STAT and VCPU_STAT definitions in a single place". The former simplifies
code duplicated in debugfs and tracefs (on which statsfs is based), the
latter groups all macro definitions for statistics in kvm in a single common
file shared by all architectures.

Patch 1 adds new refcount and kref destructor wrappers that take a semaphore,
as those are used later by statsfs. Patch 2 introduces the statsfs API,
patch 3 provides extensive tests that can also be used as examples of how to
use the API, and patch 4 adds the file system support. Finally, patch 5
provides a real-life example of statsfs usage in KVM.

[1] https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@redhat.com/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>

v1->v2: remove unnecessary list_foreach_safe loops, fix wrong indentation,
rename statsfs to stats_fs

Emanuele Giuseppe Esposito (5):
  refcount, kref: add dec-and-test wrappers for rw_semaphores
  stats_fs API: create, add and remove stats_fs sources and values
  kunit: tests for stats_fs API
  stats_fs fs: virtual fs to show stats to the end-user
  kvm_main: replace debugfs with stats_fs

 MAINTAINERS                     |    7 +
 arch/arm64/kvm/Kconfig          |    1 +
 arch/arm64/kvm/guest.c          |    2 +-
 arch/mips/kvm/Kconfig           |    1 +
 arch/mips/kvm/mips.c            |    2 +-
 arch/powerpc/kvm/Kconfig        |    1 +
 arch/powerpc/kvm/book3s.c       |    6 +-
 arch/powerpc/kvm/booke.c        |    8 +-
 arch/s390/kvm/Kconfig           |    1 +
 arch/s390/kvm/kvm-s390.c        |   16 +-
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/Makefile           |    2 +-
 arch/x86/kvm/debugfs.c          |   64 --
 arch/x86/kvm/stats_fs.c         |   56 ++
 arch/x86/kvm/x86.c              |    6 +-
 fs/Kconfig                      |   12 +
 fs/Makefile                     |    1 +
 fs/stats_fs/Makefile            |    6 +
 fs/stats_fs/inode.c             |  337 ++++++++++
 fs/stats_fs/internal.h          |   35 +
 fs/stats_fs/stats_fs-tests.c    | 1088 +++++++++++++++++++++++++++++++
 fs/stats_fs/stats_fs.c          |  773 ++++++++++++++++++++++
 include/linux/kref.h            |   11 +
 include/linux/kvm_host.h        |   39 +-
 include/linux/refcount.h        |    2 +
 include/linux/stats_fs.h        |  304 +++++++++
 include/uapi/linux/magic.h      |    1 +
 lib/refcount.c                  |   32 +
 tools/lib/api/fs/fs.c           |   21 +
 virt/kvm/arm/arm.c              |    2 +-
 virt/kvm/kvm_main.c             |  314 ++-------
 32 files changed, 2772 insertions(+), 382 deletions(-)
 delete mode 100644 arch/x86/kvm/debugfs.c
 create mode 100644 arch/x86/kvm/stats_fs.c
 create mode 100644 fs/stats_fs/Makefile
 create mode 100644 fs/stats_fs/inode.c
 create mode 100644 fs/stats_fs/internal.h
 create mode 100644 fs/stats_fs/stats_fs-tests.c
 create mode 100644 fs/stats_fs/stats_fs.c
 create mode 100644 include/linux/stats_fs.h
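The "dec-and-test wrappers for rw_semaphores" from patch 1 presumably follow
the shape of the existing refcount_dec_and_lock() helper in lib/refcount.c; a
hedged sketch of that shape is below. The name refcount_dec_and_down_write()
and the exact semantics are assumed here, not taken from the patch.

#include <linux/refcount.h>
#include <linux/rwsem.h>

/*
 * Sketch: decrement the refcount and, only if it hits zero, return true
 * with the semaphore held for write so the caller can tear down the object.
 */
static bool refcount_dec_and_down_write(refcount_t *r, struct rw_semaphore *sem)
{
	/* Fast path: the count was above 1, decrement without taking the lock. */
	if (refcount_dec_not_one(r))
		return false;

	down_write(sem);
	if (!refcount_dec_and_test(r)) {
		up_write(sem);
		return false;
	}
	return true;
}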