Message ID: 20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de (mailing list archive)
Series: vDSO: Introduce generic data storage
On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> Currently each architecture defines the setup of the vDSO data page on
> its own, mostly through copy-and-paste from some other architecture.
> Extend the existing generic vDSO implementation to also provide generic
> data storage.
> This removes duplicated code and paves the way for further changes to
> the generic vDSO implementation without having to go through a lot of
> per-architecture changes.
>
> Based on v6.14-rc1 and intended to be merged through the tip tree.

Thanks for working on this. Is there a plan to expose the time data
directly to userspace in a form which is usable *other* than by
function calls which get the value of the clock at a given moment?

For populating the vmclock device¹ we need to know the actual
relationship between the hardware counter (TSC, arch timer, etc.) and
real time in order to propagate that to the guest.

I see two options for doing this:

1. Via userspace, exposing the vdso time data (and a notification when
   it changes?) and letting the userspace VMM populate the vmclock.
   This is complex for x86 because of TSC scaling; in fact userspace
   doesn't currently know the precise scaling from host to guest TSC
   so we'd have to be able to extract that from KVM.

2. In kernel, asking KVM to populate the vmclock structure much like
   it does other pvclocks shared with the guest. KVM/x86 already uses
   pvclock_gtod_register_notifier() to hook changes; should we expand
   on that? The problem with that notifier is that it seems to be
   called far more frequently than I'd expect.

¹ https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5
On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > Currently each architecture defines the setup of the vDSO data page on
> > its own, mostly through copy-and-paste from some other architecture.
> > Extend the existing generic vDSO implementation to also provide generic
> > data storage.
> > This removes duplicated code and paves the way for further changes to
> > the generic vDSO implementation without having to go through a lot of
> > per-architecture changes.
> >
> > Based on v6.14-rc1 and intended to be merged through the tip tree.

Note: The real answer will need to come from the timekeeping
maintainers; my personal two cents below.

> Thanks for working on this. Is there a plan to expose the time data
> directly to userspace in a form which is usable *other* than by
> function calls which get the value of the clock at a given moment?

There are no current plans that I am aware of.

> For populating the vmclock device¹ we need to know the actual
> relationship between the hardware counter (TSC, arch timer, etc.) and
> real time in order to propagate that to the guest.
>
> I see two options for doing this:
>
> 1. Via userspace, exposing the vdso time data (and a notification when
>    it changes?) and letting the userspace VMM populate the vmclock.
>    This is complex for x86 because of TSC scaling; in fact userspace
>    doesn't currently know the precise scaling from host to guest TSC
>    so we'd have to be able to extract that from KVM.

Exposing the raw vDSO time data is problematic, as it precludes any
evolution of its data structures, like the one we are currently doing.

An additional, trimmed-down and stable data structure could be used,
but I don't think it makes sense. The vDSO is all about a stable
high-level function interface on top of an unstable data interface.
However, the vmclock needs the low-level data to populate its own
data structure, so wrapping raw data access in function calls is
unnecessary. If no functions are involved, then the vDSO is not
needed; the data can be maintained separately in any other place in
the kernel and accessed or mapped by userspace from there.
Also, the vDSO does not have an active notification mechanism. One
would probably be implemented through a file descriptor, but then the
data can also be mapped through exactly that fd.

> 2. In kernel, asking KVM to populate the vmclock structure much like
>    it does other pvclocks shared with the guest. KVM/x86 already uses
>    pvclock_gtod_register_notifier() to hook changes; should we expand
>    on that? The problem with that notifier is that it seems to be
>    called far more frequently than I'd expect.

This sounds better, especially as any custom ABI from the host kernel
to the VMM would look a lot like the vmclock structure anyway.

Timekeeper updates are indeed very frequent, but what are the concrete
issues? That frequency is fine for regular vDSO data page updates, and
updating the vmclock data page should be very similar.
The timekeeper core can pass context to the notifier callbacks; maybe
this can be used to skip some expensive steps where possible.

> ¹ https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5
On Thu, 2025-02-06 at 11:59 +0100, Thomas Weißschuh wrote:
> On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> > On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > > Currently each architecture defines the setup of the vDSO data page on
> > > its own, mostly through copy-and-paste from some other architecture.
> > > Extend the existing generic vDSO implementation to also provide generic
> > > data storage.
> > > This removes duplicated code and paves the way for further changes to
> > > the generic vDSO implementation without having to go through a lot of
> > > per-architecture changes.
> > >
> > > Based on v6.14-rc1 and intended to be merged through the tip tree.
>
> Note: The real answer will need to come from the timekeeping
> maintainers; my personal two cents below.
>
> > Thanks for working on this. Is there a plan to expose the time data
> > directly to userspace in a form which is usable *other* than by
> > function calls which get the value of the clock at a given moment?
>
> There are no current plans that I am aware of.
>
> > For populating the vmclock device¹ we need to know the actual
> > relationship between the hardware counter (TSC, arch timer, etc.) and
> > real time in order to propagate that to the guest.
> >
> > I see two options for doing this:
> >
> > 1. Via userspace, exposing the vdso time data (and a notification when
> >    it changes?) and letting the userspace VMM populate the vmclock.
> >    This is complex for x86 because of TSC scaling; in fact userspace
> >    doesn't currently know the precise scaling from host to guest TSC
> >    so we'd have to be able to extract that from KVM.
>
> Exposing the raw vDSO time data is problematic, as it precludes any
> evolution of its data structures, like the one we are currently doing.
>
> An additional, trimmed-down and stable data structure could be used,
> but I don't think it makes sense. The vDSO is all about a stable
> high-level function interface on top of an unstable data interface.
> However, the vmclock needs the low-level data to populate its own
> data structure, so wrapping raw data access in function calls is
> unnecessary. If no functions are involved, then the vDSO is not
> needed; the data can be maintained separately in any other place in
> the kernel and accessed or mapped by userspace from there.
> Also, the vDSO does not have an active notification mechanism. One
> would probably be implemented through a file descriptor, but then the
> data can also be mapped through exactly that fd.
>
> > 2. In kernel, asking KVM to populate the vmclock structure much like
> >    it does other pvclocks shared with the guest. KVM/x86 already uses
> >    pvclock_gtod_register_notifier() to hook changes; should we expand
> >    on that? The problem with that notifier is that it seems to be
> >    called far more frequently than I'd expect.
>
> This sounds better, especially as any custom ABI from the host kernel
> to the VMM would look a lot like the vmclock structure anyway.
>
> Timekeeper updates are indeed very frequent, but what are the concrete
> issues? That frequency is fine for regular vDSO data page updates, and
> updating the vmclock data page should be very similar.
> The timekeeper core can pass context to the notifier callbacks; maybe
> this can be used to skip some expensive steps where possible.

In the context of a hypervisor with lots of guests running, that's a
lot of pointless steal time. But it isn't just that; ISTR the result
was also *inaccurate*. I need to go back and reproduce the testing,
but I think it was constantly adjusting the apparent rate even with no
changed inputs from NTP.

Where the number of clock counts per jiffy wasn't an integer, the
notification would be constantly changing, for example reporting
333333 counts per jiffy most of the time, and occasionally 333334
counts for a single jiffy before flipping back again. Or something
like that.
David!

On Thu, Feb 06 2025 at 09:31, David Woodhouse wrote:
> Thanks for working on this. Is there a plan to expose the time data
> directly to userspace in a form which is usable *other* than by
> function calls which get the value of the clock at a given moment?
>
> For populating the vmclock device¹ we need to know the actual
> relationship between the hardware counter (TSC, arch timer, etc.) and
> real time in order to propagate that to the guest.
>
> I see two options for doing this:
>
> 1. Via userspace, exposing the vdso time data (and a notification when
>    it changes?) and letting the userspace VMM populate the vmclock.
>    This is complex for x86 because of TSC scaling; in fact userspace
>    doesn't currently know the precise scaling from host to guest TSC
>    so we'd have to be able to extract that from KVM.

Exposing the raw data is not going to happen, as we would create an
ABI preventing any modifications to the internals. vDSO data is
considered a fully internal (think kernel) representation, and the
accessor functions create an ABI around it.

So, if at all, you can add an accessor function which exposes data to
user space, so that the internal data representation can still be
modified as necessary.

> 2. In kernel, asking KVM to populate the vmclock structure much like
>    it does other pvclocks shared with the guest. KVM/x86 already uses
>    pvclock_gtod_register_notifier() to hook changes; should we expand
>    on that? The problem with that notifier is that it seems to be
>    called far more frequently than I'd expect.

It's called once per tick to expose the continuous updates to the
conversion factors and related internal data.

Thanks,

        tglx
On Fri, 2025-02-14 at 12:34 +0100, Thomas Gleixner wrote:
> > 2. In kernel, asking KVM to populate the vmclock structure much like
> >    it does other pvclocks shared with the guest. KVM/x86 already uses
> >    pvclock_gtod_register_notifier() to hook changes; should we expand
> >    on that? The problem with that notifier is that it seems to be
> >    called far more frequently than I'd expect.
>
> It's called once per tick to expose the continuous updates to the
> conversion factors and related internal data.

My recollection (a vague one) is that it's called, and reports
"changes", even when there *are* no changes to the underlying
conversion factors. Something along the lines of "N ticks at 333
counts per tick, then one tick at 334 counts per tick to catch up",
because it can't express the division factor exactly without that
discontinuity?

The actual 'error' caused by the apparent fluctuation in rate is
probably entirely negligible, but I am slightly concerned about the
steal time if the hypervisor then spends stolen CPU time relaying all
those "changes" to the guest, and the guest then has to spend time
feeding the "changes" into its own timekeeping.

I'd like to strive for a mode where we only adjust what we tell
guests when adjtimex actually changes the real timing factors. In
fact, if we have a userspace tool like chrony feeding adjtimex based
on external NTP/PPS/whatever, that tool could probably calibrate a
stable host TSC directly against the external real time. And in that
mode maybe we don't even need to feed the guest from the kernel's
CLOCK_REALTIME; that would be just another conversion step to
introduce noise.

We might end up with the direct setup for dedicated hosting
environments, but I do also want to support the general-purpose
QEMU-based setup where we expose the host's CLOCK_REALTIME as
efficiently as possible.

How about this: a KVM feature to provide/populate the VMCLOCK, since
only KVM knows the precise TSC scaling (and can immediately flip the
VMCLOCK to report invalid state if the TSC becomes unreliable).

It can *either* be fed the precise TSC/realtime relationship from
userspace (maybe in a vmclock structure that *userspace* populates,
so all the kernel has to do is scale/offset to account for the guest
TSC being different from the host TSC). Or it can be in 'automatic'
mode, where it derives from the host's timekeeping. Which at the
moment would have "too many" updates for my liking, but we can worry
about that later if necessary.