Message ID: 20210614212155.1670777-1-jingzhangos@google.com (mailing list archive)
Series: KVM statistics data fd-based binary interface
On Mon, Jun 14, 2021 at 09:21:50PM +0000, Jing Zhang wrote:
> This patchset provides a file descriptor for every VM and VCPU to read
> KVM statistics data in binary format.
> It is meant to provide a lightweight, flexible, scalable and efficient
> lock-free solution for user space telemetry applications to pull the
> statistics data periodically for large scale systems. The pulling
> frequency could be as high as a few times per second.
> In this patchset, every statistic is treated as having the following
> attributes:
> * architecture dependent or generic
> * VM statistics data or VCPU statistics data
> * type: cumulative, instantaneous, peak
> * unit: none for a simple counter; nanosecond, microsecond,
>   millisecond, second; Byte, KiByte, MiByte, GiByte; clock cycles
> Since no lock/synchronization is used, consistency between all the
> statistics is not guaranteed. That means not all statistics are read
> out at the exact same time, since the statistics data are still being
> updated by KVM subsystems while they are read out.

Sorry for my naive questions, but how does telemetry get statistics
for hypervisors? Why is KVM different from hypervisors or NICs' statistics
or any other high-speed devices (RDMA) that generate tons of data?

Thanks
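[To make the shape of the proposed interface concrete, here is a minimal
sketch of a telemetry reader for such a stats file descriptor. The struct
layouts follow the uapi definitions proposed in the series; the exact field
names and sizes should be treated as an assumption of this sketch rather
than the final ABI, and error handling is trimmed.]

/*
 * Sketch of a user-space reader for the binary stats fd: parse the
 * schema (header + descriptors) once, then sample the data block.
 */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct kvm_stats_header {
	uint32_t flags;
	uint32_t name_size;	/* bytes reserved for each descriptor name */
	uint32_t num_desc;	/* number of statistics descriptors */
	uint32_t id_offset;	/* offset of the identifier string */
	uint32_t desc_offset;	/* offset of the descriptor block */
	uint32_t data_offset;	/* offset of the u64 data block */
};

struct kvm_stats_desc {
	uint32_t flags;		/* encodes type (cumulative/instant/peak) and unit */
	int16_t exponent;	/* value is scaled by base^exponent */
	uint16_t size;		/* number of u64 values for this statistic */
	uint32_t offset;	/* offset of the values within the data block */
	uint32_t unused;
	char name[];		/* name_size bytes, NUL-terminated */
};

static int read_stats(int stats_fd)
{
	struct kvm_stats_header hdr;
	size_t desc_sz, total = 0;
	uint64_t *data;
	char *descs;
	uint32_t i;

	/* The schema only needs to be read once per fd. */
	pread(stats_fd, &hdr, sizeof(hdr), 0);
	desc_sz = sizeof(struct kvm_stats_desc) + hdr.name_size;
	descs = malloc(desc_sz * hdr.num_desc);
	pread(stats_fd, descs, desc_sz * hdr.num_desc, hdr.desc_offset);

	for (i = 0; i < hdr.num_desc; i++) {
		struct kvm_stats_desc *d = (void *)(descs + i * desc_sz);
		total += d->size;
	}

	/* Each sample is a single pread() of the data block: no locks,
	 * no ioctls, no parsing. */
	data = malloc(total * sizeof(*data));
	pread(stats_fd, data, total * sizeof(*data), hdr.data_offset);
	return 0;
}

[The point of the layout is that the schema is parsed once and every
subsequent sample is a single pread() of the data block.]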
On 15/06/21 07:25, Leon Romanovsky wrote:
> Sorry for my naive questions, but how does telemetry get statistics
> for hypervisors? Why is KVM different from hypervisors or NICs' statistics
> or any other high-speed devices (RDMA) that generate tons of data?

Right now, the only way is debugfs, but it's slow, and it's disabled when
using lockdown mode; this series is a way to fix this.

I sense that there is another question in there; are you wondering if
another mechanism should be used, for example netlink? The main issue there
is how to identify a VM, since KVM file descriptors don't have a name.
Using a pid works (sort of) for debugfs, but pids are not appropriate for a
stable API. Using a file descriptor as in this series requires
collaboration from the userspace program; however, once the file descriptor
has been transmitted via SCM_RIGHTS, telemetry can read it forever without
further IPC, and there is proper privilege separation.

Paolo
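[The SCM_RIGHTS handoff Paolo describes is standard Unix-domain-socket
ancillary data. A minimal sketch of the sending side (the VMM) might look
like this; the connected AF_UNIX socket is assumed to exist already and
error handling is trimmed.]

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* VMM side: pass stats_fd to the telemetry agent over a connected
 * AF_UNIX socket.  After this single handoff, the agent can pread()
 * the descriptor forever without further IPC. */
static int send_stats_fd(int unix_sock, int stats_fd)
{
	char dummy = 'S';	/* must carry at least one byte of data */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	char ctrl[CMSG_SPACE(sizeof(int))];
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl;
	msg.msg_controllen = sizeof(ctrl);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* the kernel dups the fd across */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &stats_fd, sizeof(int));

	return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}

[On the receiving side, recvmsg() with a matching CMSG buffer yields a new
local fd that refers to the same open KVM stats file.]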
On Tue, Jun 15, 2021 at 09:06:43AM +0200, Paolo Bonzini wrote:
> On 15/06/21 07:25, Leon Romanovsky wrote:
> > Sorry for my naive questions, but how does telemetry get statistics
> > for hypervisors? Why is KVM different from hypervisors or NICs' statistics
> > or any other high-speed devices (RDMA) that generate tons of data?
>
> Right now, the only way is debugfs, but it's slow, and it's disabled when
> using lockdown mode; this series is a way to fix this.
>
> I sense that there is another question in there; are you wondering if
> another mechanism should be used, for example netlink? The main issue there
> is how to identify a VM, since KVM file descriptors don't have a name.
> Using a pid works (sort of) for debugfs, but pids are not appropriate for a
> stable API. Using a file descriptor as in this series requires
> collaboration from the userspace program; however, once the file descriptor
> has been transmitted via SCM_RIGHTS, telemetry can read it forever without
> further IPC, and there is proper privilege separation.

Yeah, sorry for mixing different questions into one.

So the answer to the question "why KVM is different" is that it doesn't
have any stable identification except the file descriptor, while hypervisors
have stable names, and NICs and RDMA devices have interface indexes, etc.
Did I get it right?

And this was the second part of my question; the first part was my attempt
to get an answer to why current statistics like process info (/proc/xxx/*),
NICs (netlink) and RDMA (sysfs) are not using a binary format.

Thanks

> Paolo
On 14.06.21 23:21, Jing Zhang wrote:

Hi,

> This patchset provides a file descriptor for every VM and VCPU to read
> KVM statistics data in binary format.

I've missed the discussions of previous versions, so please forgive my
stupid questions:

* why is it binary instead of text? Is the volume really so high that it
  matters?
* how will possible future extensions of the telemetry packets work?
* aren't there other means to get this fd instead of an ioctl() on the
  VM fd? Something more from the outside (e.g. sysfs/procfs)?
* how will that relate to other hypervisors?

Some notes from the operating perspective:

In typical datacenters we've got various monitoring tools that are able
to pick up lots of data from different sources (especially files). If an
operator e.g. is interested in something happening in some file (e.g. in
/proc or /sys), it's quite trivial - just configure yet another probe
(maybe some regex for parsing) and done. It is automatically fed into his
$monitoring_solution (e.g. Nagios, ELK, Splunk, whatnot).

With your approach, it's not that simple: now the operator needs to
create (and deploy and manage) a separate agent that somehow receives
that fd from the VMM, reads and parses that specific binary stream, and
finally pushes it into the monitoring infrastructure. Or the VMM writes
it into some file, where some monitoring agent can pick it up. In any
case, not actually trivial from an ops perspective.

In general I tend to like the fd approach (even though I don't like
ioctls very much - I'd rather have it more Plan9-like ;-)). But it has
the drawback that acquiring those fds from separate processes isn't
entirely easy and needs a lot of coordinated interaction. That would be
much easier if we had the ability to publish existing fds into the file
system (like Plan9's srvfs does), but we don't have that yet. (Actually,
I've hacked up some srvfs for Linux, but ... well ... it's just a hack,
nowhere near production.)

Why not put this into sysfs? I see two options:

a) if it's really KVM-specific (and there's no chance of using the same
   interface for other hypervisors), we could put it under the kvm
   device (/sys/class/misc/kvm).

b) have a generic VMM stats interface that theoretically could work
   with any hypervisor.

--mtx
On Tue, Jun 15, 2021 at 10:37:36AM +0200, Enrico Weigelt, metux IT consult wrote:
> Why not put this into sysfs?

Because sysfs is "one value per file".

> I see two options:
>
> a) if it's really KVM-specific (and there's no chance of using the same
>    interface for other hypervisors), we could put it under the kvm
>    device (/sys/class/misc/kvm).

Again, that is NOT what sysfs is for.

> b) have a generic VMM stats interface that theoretically could work
>    with any hypervisor.

What other hypervisor matters?

greg k-h
On 15/06/21 09:53, Leon Romanovsky wrote:
> > Sorry for my naive questions, but how does telemetry get statistics
> > for hypervisors? Why is KVM different from hypervisors or NICs' statistics
> > or any other high-speed devices (RDMA) that generate tons of data?
>
> So the answer to the question "why KVM is different" is that it doesn't
> have any stable identification except the file descriptor, while hypervisors
> have stable names, and NICs and RDMA devices have interface indexes, etc.
> Did I get it right?

Right.

> And this was the second part of my question; the first part was my attempt
> to get an answer to why current statistics like process info (/proc/xxx/*),
> NICs (netlink) and RDMA (sysfs) are not using a binary format.

NICs are using a binary format (partly in struct ethtool_stats, partly in
an array of u64). For KVM we decided to put the schema and the stats in
the same file (though you can use pread to get only the stats) to have a
single interface and avoid ioctls, unlike having both ETH_GSTRINGS and
ETH_GSTATS.

I wouldn't say processes are using any specific format. There's a mix of
"one value per file" (e.g. cpuset), human-readable tabular format (e.g.
limits, sched), human- and machine-readable tabular format (e.g. status),
and files that are ASCII but not human-readable (e.g. stat).

Paolo
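[For reference, the two-step ethtool pattern Paolo contrasts with: one
ioctl retrieves the string table (the schema), a second retrieves the
values. A condensed sketch, assuming the caller already queried the stat
count (e.g. via ETHTOOL_GSSET_INFO), holds any socket fd, and ignoring
error checking.]

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

static void read_nic_stats(int sock, const char *ifname, __u32 n_stats)
{
	struct ethtool_gstrings *names;
	struct ethtool_stats *vals;
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	/* Call 1: the "schema" - names of the statistics. */
	names = calloc(1, sizeof(*names) + n_stats * ETH_GSTRING_LEN);
	names->cmd = ETHTOOL_GSTRINGS;
	names->string_set = ETH_SS_STATS;
	names->len = n_stats;
	ifr.ifr_data = (void *)names;
	ioctl(sock, SIOCETHTOOL, &ifr);

	/* Call 2: the values, a bare array of u64. */
	vals = calloc(1, sizeof(*vals) + n_stats * sizeof(__u64));
	vals->cmd = ETHTOOL_GSTATS;
	vals->n_stats = n_stats;
	ifr.ifr_data = (void *)vals;
	ioctl(sock, SIOCETHTOOL, &ifr);

	/* names->data[i * ETH_GSTRING_LEN] labels vals->data[i]. */
	free(names);
	free(vals);
}

[The series under discussion collapses this into a single file carrying
both schema and values, so the telemetry side needs no ioctls at all.]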
On 15/06/21 10:37, Enrico Weigelt, metux IT consult wrote:
> * why is it binary instead of text? Is the volume really so high that it
>   matters?

The main reason to have a binary format is actually not the high volume
(though that plays a part too). Rather, we would really like to include
the schema to make the statistics self-describing. This includes things
like whether the unit of measure of a statistic is clock cycles,
nanoseconds, pages or whatnot; having this kind of information in text
leads to awkwardness in the parsers. trace-cmd is another example where
the data consists of a schema followed by binary data.

Text format could certainly be added if there's a use case, but for
developer use debugfs is usually a suitable replacement.

Last year we tried the opposite direction: we built a one-value-per-file
filesystem with a common API that any subsystem could use (e.g. providing
ethtool stats, /proc/interrupts, etc. in addition to KVM stats). We
started with text, similar to sysfs, with the plan of extending it to a
binary format later. However, other subsystems expressed very little
interest in this, so instead we decided to go with something that is
designed around KVM's needs.

Still, the binary format that KVM uses is designed not to be KVM-specific.
If other subsystems want to publish high-volume, self-describing statistic
information, they are welcome to share the binary format and the code.
Perhaps it may even make sense in some cases to have them in sysfs
(e.g. /sys/kernel/slab/*/.stats). As Greg said, sysfs is currently one
value per file, but perhaps that could be changed if the binary format is
an additional way to access the information and not the only one (not
that I'm planning to do it).

> * how will possible future extensions of the telemetry packets work?

The format includes a schema, so it's possible to add more statistics in
the future. The exact list of statistics varies per architecture and is
not part of the userspace API (obvious caveat: https://xkcd.com/1172/).

> * aren't there other means to get this fd instead of an ioctl() on the
>   VM fd? Something more from the outside (e.g. sysfs/procfs)?

Not yet, but if there's a need it can be added. It'd be plausible to
publish system-wide statistics via an ioctl on /dev/kvm, for example.
We'd have to check how this compares with stuff that is world-readable
in procfs and sysfs, but I don't think there are security concerns in
exposing that.

There's also pidfd_getfd(2), which can be used to pull a VM file
descriptor from another running process. That can be used to avoid the
issue of KVM file descriptors being unnamed.

> * how will that relate to other hypervisors?

Other hypervisors do not run as part of the Linux kernel (at least they
are not upstream). These statistics only apply to Linux *hosts*, not
guests. As far as I know, there is no standard that Xen or the
proprietary hypervisors use to communicate their telemetry info to
monitoring tools, and also no standard binary format used by exporters
to talk to monitoring tools. If this format is adopted by other
hypervisors or any random software, I will be happy.

> Some notes from the operating perspective:
>
> In typical datacenters we've got various monitoring tools that are able
> to pick up lots of data from different sources (especially files). If an
> operator e.g. is interested in something happening in some file (e.g. in
> /proc or /sys), it's quite trivial - just configure yet another probe
> (maybe some regex for parsing) and done. It is automatically fed into his
> $monitoring_solution (e.g. Nagios, ELK, Splunk, whatnot).

... but in practice what you do is have prebuilt exporters that talk to
$monitoring_solution. Monitoring individual files is the exception, not
the rule. But indeed Libvirt already has I/O and network statistics and
there is an exporter for Prometheus, so we should add support for this
new method as well to both QEMU (exporting the file descriptor) and
Libvirt.

I hope this helps clarify your doubts!

Paolo

> With your approach, it's not that simple: now the operator needs to
> create (and deploy and manage) a separate agent that somehow receives
> that fd from the VMM, reads and parses that specific binary stream, and
> finally pushes it into the monitoring infrastructure. Or the VMM writes
> it into some file, where some monitoring agent can pick it up. In any
> case, not actually trivial from an ops perspective.
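[The pidfd_getfd(2) route Paolo mentions above can be sketched as follows.
It needs Linux 5.6+ and matching kernel headers for the syscall numbers,
plus ptrace-level permission over the target VMM; the fd number inside the
VMM has to be discovered out of band (for instance by scanning
/proc/<pid>/fdinfo), which is left out here.]

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Pull a copy of fd number target_fd out of the VMM's fd table.
 * Returns a new local fd referring to the same KVM file, or -1. */
static int grab_vm_fd(pid_t vmm_pid, int target_fd)
{
	int pidfd = syscall(SYS_pidfd_open, vmm_pid, 0);

	if (pidfd < 0)
		return -1;
	/* Duplicates the target process's fd into our own fd table. */
	return syscall(SYS_pidfd_getfd, pidfd, target_fd, 0);
}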
On Tue, Jun 15, 2021 at 01:03:34PM +0200, Paolo Bonzini wrote:
> On 15/06/21 09:53, Leon Romanovsky wrote:
> > > Sorry for my naive questions, but how does telemetry get statistics
> > > for hypervisors? Why is KVM different from hypervisors or NICs' statistics
> > > or any other high-speed devices (RDMA) that generate tons of data?
> >
> > So the answer to the question "why KVM is different" is that it doesn't
> > have any stable identification except the file descriptor, while hypervisors
> > have stable names, and NICs and RDMA devices have interface indexes, etc.
> > Did I get it right?
>
> Right.
>
> > And this was the second part of my question; the first part was my attempt
> > to get an answer to why current statistics like process info (/proc/xxx/*),
> > NICs (netlink) and RDMA (sysfs) are not using a binary format.
>
> NICs are using a binary format (partly in struct ethtool_stats, partly in
> an array of u64). For KVM we decided to put the schema and the stats in
> the same file (though you can use pread to get only the stats) to have a
> single interface and avoid ioctls, unlike having both ETH_GSTRINGS and
> ETH_GSTATS.
>
> I wouldn't say processes are using any specific format. There's a mix of
> "one value per file" (e.g. cpuset), human-readable tabular format (e.g.
> limits, sched), human- and machine-readable tabular format (e.g. status),
> and files that are ASCII but not human-readable (e.g. stat).

I see, your explanation to Enrico cleared the mud.

Thanks

> Paolo