mbox series

[v9,0/5] KVM statistics data fd-based binary interface

Message ID 20210614212155.1670777-1-jingzhangos@google.com (mailing list archive)
Headers show
Series KVM statistics data fd-based binary interface | expand

Message

Jing Zhang June 14, 2021, 9:21 p.m. UTC
This patchset provides a file descriptor for every VM and VCPU to read
KVM statistics data in binary format.
It is meant to provide a lightweight, flexible, scalable and efficient
lock-free solution for user space telemetry applications to pull the
statistics data periodically for large scale systems. The pulling
frequency could be as high as a few times per second.
In this patchset, every statistics data are treated to have some
attributes as below:
  * architecture dependent or generic
  * VM statistics data or VCPU statistics data
  * type: cumulative, instantaneous, peak
  * unit: none for simple counter, nanosecond, microsecond,
    millisecond, second, Byte, KiByte, MiByte, GiByte. Clock Cycles
Since no lock/synchronization is used, the consistency between all
the statistics data is not guaranteed. That means not all statistics
data are read out at the exact same time, since the statistics date
are still being updated by KVM subsystems while they are read out.

---

* v8 -> v9
  - Rebase to commit 8331a2bc0898
    (KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall)
  - Reduce code duplication between binary and debugfs interface
  - Add field "offset" in stats descriptor to let us define stats
    descriptors in any order (not necessary in the order of stats
    defined in vm/vcpu stats structures)
  - Add static check to make sure the number of stats descriptors
    is the same as the number of stats defined in vm/vcpu stats
    structures
  - Fix missing/mismatched stats descriptor definition caused by
    rebase

* v7 -> v8
  - Rebase to kvm/queue, commit c1dc20e254b4 ("KVM: switch per-VM
  stats to u64")
  - Revise code to reflect the per-VM stats type from ulong to u64
  - Addressed some other nits

* v6 -> v7
  - Improve file descriptor allocation function by Krish suggestion
  - Use "generic stats" instead of "common stats" as Krish suggested
  - Addressed some other nits from Krish and David Matlack

* v5 -> v6
  - Use designated initializers for STATS_DESC
  - Change KVM_STATS_SCALE... to KVM_STATS_BASE...
  - Use a common function for kvm_[vm|vcpu]_stats_read
  - Fix some documentation errors/missings
  - Use TEST_ASSERT in selftest
  - Use a common function for [vm|vcpu]_stats_test in selftest

* v4 -> v5
  - Rebase to kvm/queue, commit a4345a7cecfb ("Merge tag
    'kvmarm-fixes-5.13-1'")
  - Change maximum stats name length to 48
  - Replace VM_STATS_COMMON/VCPU_STATS_COMMON macros with stats
    descriptor definition macros.
  - Fixed some errors/warnings reported by checkpatch.pl

* v3 -> v4
  - Rebase to kvm/queue, commit 9f242010c3b4 ("KVM: avoid "deadlock"
    between install_new_memslots and MMU notifier")
  - Use C-stype comments in the whole patch
  - Fix wrong count for x86 VCPU stats descriptors
  - Fix KVM stats data size counting and validity check in selftest

* v2 -> v3
  - Rebase to kvm/queue, commit edf408f5257b ("KVM: avoid "deadlock"
    between install_new_memslots and MMU notifier")
  - Resolve some nitpicks about format

* v1 -> v2
  - Use ARRAY_SIZE to count the number of stats descriptors
  - Fix missing `size` field initialization in macro STATS_DESC

[1] https://lore.kernel.org/kvm/20210402224359.2297157-1-jingzhangos@google.com
[2] https://lore.kernel.org/kvm/20210415151741.1607806-1-jingzhangos@google.com
[3] https://lore.kernel.org/kvm/20210423181727.596466-1-jingzhangos@google.com
[4] https://lore.kernel.org/kvm/20210429203740.1935629-1-jingzhangos@google.com
[5] https://lore.kernel.org/kvm/20210517145314.157626-1-jingzhangos@google.com
[6] https://lore.kernel.org/kvm/20210524151828.4113777-1-jingzhangos@google.com
[7] https://lore.kernel.org/kvm/20210603211426.790093-1-jingzhangos@google.com
[8] https://lore.kernel.org/kvm/20210611124624.1404010-1-jingzhangos@google.com

---

Jing Zhang (5):
  KVM: stats: Separate generic stats from architecture specific ones
  KVM: stats: Add fd-based API to read binary stats data
  KVM: stats: Add documentation for statistics data binary interface
  KVM: selftests: Add selftest for KVM statistics data binary interface
  KVM: stats: Remove code duplication for binary and debugfs stats

 Documentation/virt/kvm/api.rst                | 177 +++++++++++-
 arch/arm64/include/asm/kvm_host.h             |   9 +-
 arch/arm64/kvm/guest.c                        |  50 +++-
 arch/mips/include/asm/kvm_host.h              |   9 +-
 arch/mips/kvm/mips.c                          |  92 +++---
 arch/powerpc/include/asm/kvm_host.h           |   9 +-
 arch/powerpc/kvm/book3s.c                     |  93 ++++---
 arch/powerpc/kvm/book3s_hv.c                  |  12 +-
 arch/powerpc/kvm/book3s_pr.c                  |   2 +-
 arch/powerpc/kvm/book3s_pr_papr.c             |   2 +-
 arch/powerpc/kvm/booke.c                      |  78 ++++--
 arch/s390/include/asm/kvm_host.h              |   9 +-
 arch/s390/kvm/kvm-s390.c                      | 234 +++++++++-------
 arch/x86/include/asm/kvm_host.h               |   9 +-
 arch/x86/kvm/x86.c                            | 111 +++++---
 include/linux/kvm_host.h                      | 178 +++++++++++-
 include/linux/kvm_types.h                     |  12 +
 include/uapi/linux/kvm.h                      |  48 ++++
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 .../selftests/kvm/kvm_binary_stats_test.c     | 225 +++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |  12 +
 virt/kvm/kvm_main.c                           | 261 +++++++++++++++---
 24 files changed, 1293 insertions(+), 346 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/kvm_binary_stats_test.c


base-commit: 8331a2bc089881d7fd2fc9a6658f39780817e4e0

Comments

Leon Romanovsky June 15, 2021, 5:25 a.m. UTC | #1
On Mon, Jun 14, 2021 at 09:21:50PM +0000, Jing Zhang wrote:
> This patchset provides a file descriptor for every VM and VCPU to read
> KVM statistics data in binary format.
> It is meant to provide a lightweight, flexible, scalable and efficient
> lock-free solution for user space telemetry applications to pull the
> statistics data periodically for large scale systems. The pulling
> frequency could be as high as a few times per second.
> In this patchset, every statistics data are treated to have some
> attributes as below:
>   * architecture dependent or generic
>   * VM statistics data or VCPU statistics data
>   * type: cumulative, instantaneous, peak
>   * unit: none for simple counter, nanosecond, microsecond,
>     millisecond, second, Byte, KiByte, MiByte, GiByte. Clock Cycles
> Since no lock/synchronization is used, the consistency between all
> the statistics data is not guaranteed. That means not all statistics
> data are read out at the exact same time, since the statistics date
> are still being updated by KVM subsystems while they are read out.

Sorry for my naive questions, but how does telemetry get statistics
for hypervisors? Why is KVM different from hypervisors or NIC's statistics
or any other high speed devices (RDMA) that generate tons of data?

Thanks
Paolo Bonzini June 15, 2021, 7:06 a.m. UTC | #2
On 15/06/21 07:25, Leon Romanovsky wrote:
> Sorry for my naive questions, but how does telemetry get statistics
> for hypervisors? Why is KVM different from hypervisors or NIC's statistics
> or any other high speed devices (RDMA) that generate tons of data?

Right now, the only way is debugfs but it's slow, and it's disabled when 
using lockdown mode; this series is a way to fix this.

I sense that there is another question in there; are you wondering if 
another mechanism should be used, for example netlink?  The main issue 
there is how to identify a VM, since KVM file descriptors don't have a 
name.  Using a pid works (sort of) for debugfs, but pids are not 
appropriate for a stable API.  Using a file descriptor as in this series 
requires collaboration from the userspace program; howver, once the file 
descriptor has been transmitted via SCM_RIGHTS, telemetry can read it 
forever without further IPC, and there is proper privilege separation.

Paolo
Leon Romanovsky June 15, 2021, 7:53 a.m. UTC | #3
On Tue, Jun 15, 2021 at 09:06:43AM +0200, Paolo Bonzini wrote:
> On 15/06/21 07:25, Leon Romanovsky wrote:
> > Sorry for my naive questions, but how does telemetry get statistics
> > for hypervisors? Why is KVM different from hypervisors or NIC's statistics
> > or any other high speed devices (RDMA) that generate tons of data?
> 
> Right now, the only way is debugfs but it's slow, and it's disabled when
> using lockdown mode; this series is a way to fix this.
> 
> I sense that there is another question in there; are you wondering if
> another mechanism should be used, for example netlink?  The main issue there
> is how to identify a VM, since KVM file descriptors don't have a name.
> Using a pid works (sort of) for debugfs, but pids are not appropriate for a
> stable API.  Using a file descriptor as in this series requires
> collaboration from the userspace program; howver, once the file descriptor
> has been transmitted via SCM_RIGHTS, telemetry can read it forever without
> further IPC, and there is proper privilege separation.

Yeah, sorry for mixing different questions into one.

So the answer to the question "why KVM is different" is that it doesn't
have any stable identification except file descriptor. While hypervisors
have stable names, NICs and RDMA devices have interface indexes e.t.c.
Did I get it right?

And this was second part of my question, the first part was my attempt to
get on answer why current statistics like process info (/proc/xxx/*), NICs
(netlink) and RDMA (sysfs) are not using binary format.

Thanks

> 
> Paolo
>
Enrico Weigelt, metux IT consult June 15, 2021, 8:37 a.m. UTC | #4
On 14.06.21 23:21, Jing Zhang wrote:

Hi,

> This patchset provides a file descriptor for every VM and VCPU to read
> KVM statistics data in binary format.

I've missed the discussions of previous versions, so please forgive my
stupid questions:

* why is it binary instead of text ? is it so very high volume that
   it really matters ?
* how will possible future extensions of the telemetry packets work ?
* aren't there other means to get this fd instead of an ioctl() on the
   VM fd ? something more from the outside (eg. sysfs/procfs)
* how will that relate to other hypervisors ?

Some notes from the operating perspective:

In typical datacenters we've got various monitoring tools that are able
to catch up lots of data from different sources (especially files). If
an operator e.g. is interested in something in happening in some file
(e.g. in /proc of /sys), it's quite trivial - just configure yet another
probe (maybe some regex for parsing) and done. Automatically fed in his
$monitoring_solution (e.g. nagios, ELK, Splunk, whatsnot)

With your approach, it's not that simple: now the operator needs to
create (and deploy and manage) a separate agent that somehow receives
that fd from the VMM, reads and parses that specific binary stream
and finally pushes it into the monitoring infrastructure. Or the VMM
writes it into some file, where some monitoring agent can pick it up.
In any case, not actually trivial from ops perspective.

In general I tend to like the fd approach (even though I don't like
ioctls very much - I'd rather like to have it more Plan9-like ;-)).
But it has the drawback of acquiring those fd's by separate processes
isn't entirely easy and needs a lot of coordinated interaction.

That issue would be much easier if we had the ability to publish
existing fd's into the file system (like Plan9's srvfs does), but we
don't have that yet. (actually, I've hacked up some srvfs for Linux,
but ... well ... it's just a hack, nowhere near to production).

Why not putting this into sysfs ?

I see two options:

a) if it's really kvm-specific (and no chance of using the same
    interface for other hypervisors), we could put it under the
    kvm device (/sys/class/misc/kvm).

b) have a generic VMM stats interface that theroretically could work
    with any hypervisor.



--mtx
Greg KH June 15, 2021, 9:21 a.m. UTC | #5
On Tue, Jun 15, 2021 at 10:37:36AM +0200, Enrico Weigelt, metux IT consult wrote:
> Why not putting this into sysfs ?

Because sysfs is "one value per file".

> I see two options:
> 
> a) if it's really kvm-specific (and no chance of using the same
>    interface for other hypervisors), we could put it under the
>    kvm device (/sys/class/misc/kvm).

Again, that is NOT what sysfs is for.

> b) have a generic VMM stats interface that theroretically could work
>    with any hypervisor.

What other hypervisor matters?

greg k-h
Paolo Bonzini June 15, 2021, 11:03 a.m. UTC | #6
On 15/06/21 09:53, Leon Romanovsky wrote:
>> Sorry for my naive questions, but how does telemetry get statistics
>> for hypervisors? Why is KVM different from hypervisors or NIC's statistics
>> or any other high speed devices (RDMA) that generate tons of data?
>
> So the answer to the question "why KVM is different" is that it doesn't
> have any stable identification except file descriptor. While hypervisors
> have stable names, NICs and RDMA devices have interface indexes etc.
> Did I get it right?

Right.

> And this was second part of my question, the first part was my attempt to
> get on answer why current statistics like process info (/proc/xxx/*), NICs
> (netlink) and RDMA (sysfs) are not using binary format.

NICs are using binary format (partly in struct ethtool_stats, partly in 
an array of u64).  For KVM we decided to put the schema and the stats in 
the same file (though you can use pread to get only the stats) to have a 
single interface and avoid ioctls, unlike having both ETH_GSTRINGS and 
ETH_GSTATS.

I wouldn't say processes are using any specific format.  There's a mix 
of "one value per file" (e.g. cpuset), human-readable tabular format 
(e.g. limits, sched), human- and machine-readable tabular format (e.g. 
status), and files that are ASCII but not human-readable (e.g. stat).

Paolo
Paolo Bonzini June 15, 2021, 11:31 a.m. UTC | #7
On 15/06/21 10:37, Enrico Weigelt, metux IT consult wrote:
> * why is it binary instead of text ? is it so very high volume that
>    it really matters ?

The main reason to have a binary format is not the high volume actually 
(though it also has its part).  Rather, we would really like to include 
the schema to make the statistics self-describing.  This includes stuff 
like whether the unit of measure of a statistic is clock cycles, 
nanoseconds, pages or whatnot; having this kind of information in text 
leads to awkwardness in the parsers.  trace-cmd is another example where 
the data consists of a schema followed by binary data.

Text format could certainly be added if there's a usecase, but for 
developer use debugfs is usually a suitable replacement.

Last year we tried the opposite direction: we built a one-value-per-file 
filesystem with a common API that any subsystem could use (e.g. 
providing ethtool stats, /proc/interrupts, etc. in addition to KVM 
stats).  We started with text, similar to sysfs, with the plan of 
extending it to a binary format later.  However, other subsystems 
expressed very little interest in this, so instead we decided to go with 
something that is designed around KVM needs.

Still, the binary format that KVM uses is designed not to be 
KVM-specific.  If other subsystems want to publish high-volume, 
self-describing statistic information, they are welcome to share the 
binary format and the code.  Perhaps it may make sense in some cases to 
have them in sysfs, even (e.g. /sys/kernel/slab/*/.stats).  As Greg said 
sysfs is currently one value per file, but perhaps that could be changed 
if the binary format is an additional way to access the information and 
not the only one (not that I'm planning to do it).

> * how will possible future extensions of the telemetry packets work ?

The format includes a schema, so it's possible to add more statistics in 
the future.  The exact list of statistics varies per architecture and is 
not part of the userspace API (obvious caveat: https://xkcd.com/1172/).

> * aren't there other means to get this fd instead of an ioctl() on the
>    VM fd ? something more from the outside (eg. sysfs/procfs)

Not yet, but if there's a need it can be added.  It'd be plausible to 
publish system-wide statistics via a ioctl on /dev/kvm, for example. 
We'd have to check how this compares with stuff that is world-readable 
in procfs and sysfs, but I don't think there are security concerns in 
exposing that.

There's also pidfd_getfd(2) which can be used to pull a VM file 
descriptor from another running process.  That can be used to avoid the 
issue of KVM file descriptors being unnamed.

> * how will that relate to other hypervisors ?

Other hypervisors do not run as part of the Linux kernel (at least they 
are not upstream).  These statistics only apply to Linux *hosts*, not 
guests.

As far as I know, there is no standard that Xen or the proprietary 
hypervisors use to communicate their telemetry info to monitoring tools, 
and also no standard binary format used by exporters to talk to 
monitoring tools.  If this format will be adopted by other hypervisors 
or any random software, I will be happy.

> Some notes from the operating perspective:
> 
> In typical datacenters we've got various monitoring tools that are able
> to catch up lots of data from different sources (especially files). If
> an operator e.g. is interested in something in happening in some file
> (e.g. in /proc of /sys), it's quite trivial - just configure yet another
> probe (maybe some regex for parsing) and done. Automatically fed in his
> $monitoring_solution (e.g. nagios, ELK, Splunk, whatsnot)

... but in practice what you do is you have prebuilt exporters that 
talks to $monitoring_solution.  Monitoring individual files is the 
exception, not the rule.  But indeed Libvirt already has I/O and network 
statistics and there is an exporter for Prometheus, so we should add 
support for this new method as well to both QEMU (exporting the file 
descriptor) and Libvirt.

I hope this helps clarifying your doubts!

Paolo

> With your approach, it's not that simple: now the operator needs to
> create (and deploy and manage) a separate agent that somehow receives
> that fd from the VMM, reads and parses that specific binary stream
> and finally pushes it into the monitoring infrastructure. Or the VMM
> writes it into some file, where some monitoring agent can pick it up.
> In any case, not actually trivial from ops perspective.
Leon Romanovsky June 15, 2021, 1:34 p.m. UTC | #8
On Tue, Jun 15, 2021 at 01:03:34PM +0200, Paolo Bonzini wrote:
> On 15/06/21 09:53, Leon Romanovsky wrote:
> > > Sorry for my naive questions, but how does telemetry get statistics
> > > for hypervisors? Why is KVM different from hypervisors or NIC's statistics
> > > or any other high speed devices (RDMA) that generate tons of data?
> > 
> > So the answer to the question "why KVM is different" is that it doesn't
> > have any stable identification except file descriptor. While hypervisors
> > have stable names, NICs and RDMA devices have interface indexes etc.
> > Did I get it right?
> 
> Right.
> 
> > And this was second part of my question, the first part was my attempt to
> > get on answer why current statistics like process info (/proc/xxx/*), NICs
> > (netlink) and RDMA (sysfs) are not using binary format.
> 
> NICs are using binary format (partly in struct ethtool_stats, partly in an
> array of u64).  For KVM we decided to put the schema and the stats in the
> same file (though you can use pread to get only the stats) to have a single
> interface and avoid ioctls, unlike having both ETH_GSTRINGS and ETH_GSTATS.
> 
> I wouldn't say processes are using any specific format.  There's a mix of
> "one value per file" (e.g. cpuset), human-readable tabular format (e.g.
> limits, sched), human- and machine-readable tabular format (e.g. status),
> and files that are ASCII but not human-readable (e.g. stat).

I see, your explanation to Enrico cleared the mud.

Thanks

> 
> Paolo
>