
[v3,00/18] vDSO: Introduce generic data storage

Message ID: 20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de

Message

Thomas Weißschuh Feb. 4, 2025, 12:05 p.m. UTC
Currently each architecture defines the setup of the vDSO data page on
its own, mostly through copy-and-paste from some other architecture.
Extend the existing generic vDSO implementation to also provide generic
data storage.
This removes duplicated code and paves the way for further changes to
the generic vDSO implementation without having to go through a lot of
per-architecture changes.

Based on v6.14-rc1 and intended to be merged through the tip tree.

This also provides the basis for some generic vDSO reworks.
The commits from this series and the upcoming reworks can be seen at:
https://git.kernel.org/pub/scm/linux/kernel/git/thomas.weissschuh/linux.git/log/?h=vdso/store

---
Changes in v3:
- Rebase on v6.14-rc1
- Fix build on riscv64-nommu
- Link to v2: https://lore.kernel.org/r/20250110-vdso-store-rng-v2-0-350c9179bbf1@linutronix.de

Changes in v2:
- Drop __arch_get_vdso_u_timens_data() (Christophe)
- Move to lib/vdso/ (Christophe)
- Rename __ppc_get_vdso_u_timens_data() to
  __arch_get_vdso_u_timens_data(), same for other hooks
  (Christophe)
- Fix build for architectures with time-less vDSO, like riscv32. (Conor)
- Explicitly fix bug around x86 vclock pages
- Link to v1: https://lore.kernel.org/r/20241216-vdso-store-rng-v1-0-f7aed1bdb3b2@linutronix.de

---
Thomas Weißschuh (18):
      x86/vdso: Fix latent bug in vclock_pages calculation
      parisc: Remove unused symbol vdso_data
      vdso: Introduce vdso/align.h
      vdso: Rename included Makefile
      vdso: Add generic time data storage
      vdso: Add generic random data storage
      vdso: Add generic architecture-specific data storage
      arm64: vdso: Switch to generic storage implementation
      riscv: vdso: Switch to generic storage implementation
      LoongArch: vDSO: Switch to generic storage implementation
      arm: vdso: Switch to generic storage implementation
      s390/vdso: Switch to generic storage implementation
      MIPS: vdso: Switch to generic storage implementation
      powerpc/vdso: Switch to generic storage implementation
      x86/vdso: Switch to generic storage implementation
      x86/vdso/vdso2c: Remove page handling
      vdso: Remove remnants of architecture-specific random state storage
      vdso: Remove remnants of architecture-specific time storage

 arch/Kconfig                                       |   4 +
 arch/arm/include/asm/vdso.h                        |   2 +
 arch/arm/include/asm/vdso/gettimeofday.h           |   7 +-
 arch/arm/include/asm/vdso/vsyscall.h               |  12 +-
 arch/arm/kernel/asm-offsets.c                      |   4 -
 arch/arm/kernel/vdso.c                             |  34 ++----
 arch/arm/mm/Kconfig                                |   1 +
 arch/arm/vdso/Makefile                             |   2 +-
 arch/arm/vdso/vdso.lds.S                           |   4 +-
 arch/arm64/Kconfig                                 |   1 +
 arch/arm64/include/asm/vdso.h                      |   2 +-
 arch/arm64/include/asm/vdso/compat_gettimeofday.h  |  36 ++----
 arch/arm64/include/asm/vdso/getrandom.h            |  12 --
 arch/arm64/include/asm/vdso/gettimeofday.h         |  16 +--
 arch/arm64/include/asm/vdso/vsyscall.h             |  25 +---
 arch/arm64/kernel/vdso.c                           |  90 +-------------
 arch/arm64/kernel/vdso/Makefile                    |   2 +-
 arch/arm64/kernel/vdso/vdso.lds.S                  |   7 +-
 arch/arm64/kernel/vdso32/Makefile                  |   2 +-
 arch/arm64/kernel/vdso32/vdso.lds.S                |   7 +-
 arch/csky/kernel/vdso/Makefile                     |   2 +-
 arch/loongarch/Kconfig                             |   2 +
 arch/loongarch/include/asm/vdso.h                  |   1 -
 arch/loongarch/include/asm/vdso/arch_data.h        |  25 ++++
 arch/loongarch/include/asm/vdso/getrandom.h        |   5 -
 arch/loongarch/include/asm/vdso/gettimeofday.h     |  14 +--
 arch/loongarch/include/asm/vdso/vdso.h             |  38 +-----
 arch/loongarch/include/asm/vdso/vsyscall.h         |  17 ---
 arch/loongarch/kernel/asm-offsets.c                |   2 +-
 arch/loongarch/kernel/vdso.c                       |  92 +--------------
 arch/loongarch/vdso/Makefile                       |   2 +-
 arch/loongarch/vdso/vdso.lds.S                     |   8 +-
 arch/loongarch/vdso/vgetcpu.c                      |  12 +-
 arch/mips/Kconfig                                  |   1 +
 arch/mips/include/asm/vdso/gettimeofday.h          |   9 +-
 arch/mips/include/asm/vdso/vdso.h                  |  19 ++-
 arch/mips/include/asm/vdso/vsyscall.h              |  14 +--
 arch/mips/kernel/vdso.c                            |  47 +++-----
 arch/mips/vdso/Makefile                            |   2 +-
 arch/mips/vdso/vdso.lds.S                          |   5 +-
 arch/parisc/include/asm/vdso.h                     |   2 -
 arch/parisc/kernel/vdso32/Makefile                 |   2 +-
 arch/parisc/kernel/vdso64/Makefile                 |   2 +-
 arch/powerpc/Kconfig                               |   2 +
 arch/powerpc/include/asm/vdso.h                    |   1 +
 arch/powerpc/include/asm/vdso/arch_data.h          |  37 ++++++
 arch/powerpc/include/asm/vdso/getrandom.h          |  11 +-
 arch/powerpc/include/asm/vdso/gettimeofday.h       |  29 ++---
 arch/powerpc/include/asm/vdso/vsyscall.h           |  13 ---
 arch/powerpc/include/asm/vdso_datapage.h           |  44 +------
 arch/powerpc/kernel/asm-offsets.c                  |   1 -
 arch/powerpc/kernel/time.c                         |   2 +-
 arch/powerpc/kernel/vdso.c                         | 115 ++----------------
 arch/powerpc/kernel/vdso/Makefile                  |   2 +-
 arch/powerpc/kernel/vdso/cacheflush.S              |   2 +-
 arch/powerpc/kernel/vdso/datapage.S                |   4 +-
 arch/powerpc/kernel/vdso/gettimeofday.S            |   4 +-
 arch/powerpc/kernel/vdso/vdso32.lds.S              |   4 +-
 arch/powerpc/kernel/vdso/vdso64.lds.S              |   4 +-
 arch/powerpc/kernel/vdso/vgettimeofday.c           |  14 +--
 arch/riscv/Kconfig                                 |   3 +-
 arch/riscv/include/asm/vdso.h                      |   2 +-
 .../include/asm/vdso/{time_data.h => arch_data.h}  |   8 +-
 arch/riscv/include/asm/vdso/gettimeofday.h         |  14 +--
 arch/riscv/include/asm/vdso/vsyscall.h             |   9 --
 arch/riscv/kernel/sys_hwprobe.c                    |   3 +-
 arch/riscv/kernel/vdso.c                           |  90 +-------------
 arch/riscv/kernel/vdso/Makefile                    |   2 +-
 arch/riscv/kernel/vdso/hwprobe.c                   |   6 +-
 arch/riscv/kernel/vdso/vdso.lds.S                  |   7 +-
 arch/s390/Kconfig                                  |   1 +
 arch/s390/include/asm/vdso.h                       |   4 +-
 arch/s390/include/asm/vdso/getrandom.h             |  12 --
 arch/s390/include/asm/vdso/gettimeofday.h          |  15 +--
 arch/s390/include/asm/vdso/vsyscall.h              |  20 ----
 arch/s390/kernel/time.c                            |   6 +-
 arch/s390/kernel/vdso.c                            |  97 +---------------
 arch/s390/kernel/vdso32/Makefile                   |   2 +-
 arch/s390/kernel/vdso32/vdso32.lds.S               |   7 +-
 arch/s390/kernel/vdso64/Makefile                   |   2 +-
 arch/s390/kernel/vdso64/vdso64.lds.S               |   8 +-
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/vdso/Makefile                       |   2 +-
 arch/x86/entry/vdso/vdso-layout.lds.S              |  10 +-
 arch/x86/entry/vdso/vdso2c.c                       |  21 ----
 arch/x86/entry/vdso/vdso2c.h                       |  20 ----
 arch/x86/entry/vdso/vma.c                          | 125 ++------------------
 arch/x86/include/asm/vdso.h                        |   6 -
 arch/x86/include/asm/vdso/getrandom.h              |  10 --
 arch/x86/include/asm/vdso/gettimeofday.h           |  25 +---
 arch/x86/include/asm/vdso/vsyscall.h               |  23 +---
 drivers/char/random.c                              |   6 +-
 include/asm-generic/vdso/vsyscall.h                |  27 +++--
 include/linux/align.h                              |  10 +-
 include/linux/time_namespace.h                     |   2 -
 include/linux/vdso_datastore.h                     |  10 ++
 include/vdso/align.h                               |  15 +++
 include/vdso/datapage.h                            |  74 +++++++++---
 include/vdso/helpers.h                             |   8 +-
 kernel/time/namespace.c                            |  12 +-
 kernel/time/vsyscall.c                             |  19 ++-
 lib/Makefile                                       |   2 +-
 lib/vdso/Kconfig                                   |   5 +
 lib/vdso/Makefile                                  |  19 +--
 lib/vdso/Makefile.include                          |  18 +++
 lib/vdso/datastore.c                               | 129 +++++++++++++++++++++
 lib/vdso/getrandom.c                               |   8 +-
 lib/vdso/gettimeofday.c                            |  74 +++++++-----
 108 files changed, 604 insertions(+), 1277 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20240911-vdso-store-rng-607a3dc9e050

Best regards,

Comments

David Woodhouse Feb. 6, 2025, 9:31 a.m. UTC | #1
On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> Currently each architecture defines the setup of the vDSO data page on
> its own, mostly through copy-and-paste from some other architecture.
> Extend the existing generic vDSO implementation to also provide generic
> data storage.
> This removes duplicated code and paves the way for further changes to
> the generic vDSO implementation without having to go through a lot of
> per-architecture changes.
> 
> Based on v6.14-rc1 and intended to be merged through the tip tree.

Thanks for working on this. Is there a plan to expose the time data
directly to userspace in a form which is usable *other* than by
function calls which get the value of the clock at a given moment?

For populating the vmclock device¹ we need to know the actual
relationship between the hardware counter (TSC, arch timer, etc.) and
real time in order to propagate that to the guest.

I see two options for doing this:

 1. Via userspace, exposing the vdso time data (and a notification when
    it changes?) and letting the userspace VMM populate the vmclock.
    This is complex for x86 because of TSC scaling; in fact userspace
    doesn't currently know the precise scaling from host to guest TSC
    so we'd have to be able to extract that from KVM.

 2. In kernel, asking KVM to populate the vmclock structure much like
    it does other pvclocks shared with the guest. KVM/x86 already uses
    pvclock_gtod_register_notifier() to hook changes; should we expand
    on that? The problem with that notifier is that it seems to be
    called far more frequently than I'd expect.



¹ https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5
Thomas Weißschuh Feb. 6, 2025, 10:59 a.m. UTC | #2
On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > Currently each architecture defines the setup of the vDSO data page on
> > its own, mostly through copy-and-paste from some other architecture.
> > Extend the existing generic vDSO implementation to also provide generic
> > data storage.
> > This removes duplicated code and paves the way for further changes to
> > the generic vDSO implementation without having to go through a lot of
> > per-architecture changes.
> > 
> > Based on v6.14-rc1 and intended to be merged through the tip tree.

Note: The real answer will need to come from the timekeeping
maintainers, my personal two cents below.

> Thanks for working on this. Is there a plan to expose the time data
> directly to userspace in a form which is usable *other* than by
> function calls which get the value of the clock at a given moment?

There are no current plans that I am aware of.

> For populating the vmclock device¹ we need to know the actual
> relationship between the hardware counter (TSC, arch timer, etc.) and
> real time in order to propagate that to the guest.
> 
> I see two options for doing this:
> 
>  1. Via userspace, exposing the vdso time data (and a notification when
>     it changes?) and letting the userspace VMM populate the vmclock.
>     This is complex for x86 because of TSC scaling; in fact userspace
>     doesn't currently know the precise scaling from host to guest TSC
>     so we'd have to be able to extract that from KVM.

Exposing the raw vdso time data is problematic as it precludes any
evolution of its data structures, like the one we are currently doing.

An additional, trimmed-down and stable data structure could be used,
but I don't think it makes sense. The vDSO is all about a stable
high-level function interface on top of an unstable data interface.
The vmclock, however, needs the low-level data to populate its own
data structure; wrapping raw data access in function calls is unnecessary.
If no functions are involved, then the vDSO is not needed. The data can
be maintained separately in any other place in the kernel and accessed
or mapped by userspace from there.
Also, the vDSO does not have an active notification mechanism; this would
probably be implemented through a file descriptor, but then the data
can also be mapped through exactly that fd.

>  2. In kernel, asking KVM to populate the vmclock structure much like
>     it does other pvclocks shared with the guest. KVM/x86 already uses
>     pvclock_gtod_register_notifier() to hook changes; should we expand
>     on that? The problem with that notifier is that it seems to be
>     called far more frequently than I'd expect.

This sounds better, especially as any custom ABI from the host kernel to
the VMM would look a lot like the vmclock structure anyway.

Timekeeper updates are indeed very frequent, but what are the concrete
issues? That frequency is fine for regular vDSO data page updates,
updating the vmclock data page should be very similar.
The timekeeper core can pass context to the notifier callbacks, maybe
this can be used to skip some expensive steps where possible.

> ¹ https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5
David Woodhouse Feb. 7, 2025, 10:15 a.m. UTC | #3
On Thu, 2025-02-06 at 11:59 +0100, Thomas Weißschuh wrote:
> On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> > On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > > Currently each architecture defines the setup of the vDSO data page on
> > > its own, mostly through copy-and-paste from some other architecture.
> > > Extend the existing generic vDSO implementation to also provide generic
> > > data storage.
> > > This removes duplicated code and paves the way for further changes to
> > > the generic vDSO implementation without having to go through a lot of
> > > per-architecture changes.
> > > 
> > > Based on v6.14-rc1 and intended to be merged through the tip tree.
> 
> Note: The real answer will need to come from the timekeeping
> maintainers, my personal two cents below.
> 
> > Thanks for working on this. Is there a plan to expose the time data
> > directly to userspace in a form which is usable *other* than by
> > function calls which get the value of the clock at a given moment?
> 
> There are no current plans that I am aware of.
> 
> > For populating the vmclock device¹ we need to know the actual
> > relationship between the hardware counter (TSC, arch timer, etc.) and
> > real time in order to propagate that to the guest.
> > 
> > I see two options for doing this:
> > 
> >  1. Via userspace, exposing the vdso time data (and a notification when
> >     it changes?) and letting the userspace VMM populate the vmclock.
> >     This is complex for x86 because of TSC scaling; in fact userspace
> >     doesn't currently know the precise scaling from host to guest TSC
> >     so we'd have to be able to extract that from KVM.
> 
> Exposing the raw vdso time data is problematic as it precludes any
> evolution of its data structures, like the one we are currently doing.
> 
> An additional, trimmed-down and stable data structure could be used,
> but I don't think it makes sense. The vDSO is all about a stable
> high-level function interface on top of an unstable data interface.
> The vmclock, however, needs the low-level data to populate its own
> data structure; wrapping raw data access in function calls is unnecessary.
> If no functions are involved, then the vDSO is not needed. The data can
> be maintained separately in any other place in the kernel and accessed
> or mapped by userspace from there.
> Also, the vDSO does not have an active notification mechanism; this would
> probably be implemented through a file descriptor, but then the data
> can also be mapped through exactly that fd.
> 
> >  2. In kernel, asking KVM to populate the vmclock structure much like
> >     it does other pvclocks shared with the guest. KVM/x86 already uses
> >     pvclock_gtod_register_notifier() to hook changes; should we expand
> >     on that? The problem with that notifier is that it seems to be
> >     called far more frequently than I'd expect.
> 
> This sounds better, especially as any custom ABI from the host kernel to
> the VMM would look a lot like the vmclock structure anyway.
> 
> Timekeeper updates are indeed very frequent, but what are the concrete
> issues? That frequency is fine for regular vDSO data page updates,
> updating the vmclock data page should be very similar.
> The timekeeper core can pass context to the notifier callbacks, maybe
> this can be used to skip some expensive steps where possible.

In the context of a hypervisor with lots of guests running, that's a
lot of pointless steal time. But it isn't just that; ISTR the result
was also *inaccurate*.

I need to go back and reproduce the testing, but I think it was
constantly adjusting the apparent rate even with no changed inputs from
NTP. Where the number of clock counts per jiffy wasn't an integer, the
notification would be constantly changing, for example to report 333333
counts per jiffy for most of the time, and occasionally 333334 counts
for a single jiffy before flipping back again. Or something like that.
Thomas Gleixner Feb. 14, 2025, 11:34 a.m. UTC | #4
David!

On Thu, Feb 06 2025 at 09:31, David Woodhouse wrote:
> Thanks for working on this. Is there a plan to expose the time data
> directly to userspace in a form which is usable *other* than by
> function calls which get the value of the clock at a given moment?
>
> For populating the vmclock device¹ we need to know the actual
> relationship between the hardware counter (TSC, arch timer, etc.) and
> real time in order to propagate that to the guest.
>
> I see two options for doing this:
>
>  1. Via userspace, exposing the vdso time data (and a notification when
>     it changes?) and letting the userspace VMM populate the vmclock.
>     This is complex for x86 because of TSC scaling; in fact userspace
>     doesn't currently know the precise scaling from host to guest TSC
>     so we'd have to be able to extract that from KVM.

Exposing the raw data is not going to happen as we would create an ABI
preventing any modifications to the internals. VDSO data is considered a
fully internal (think kernel) representation and the accessor functions
create an ABI around it. So, if at all, you can add an accessor function
which exposes data to user space so that the internal data
representation can still be modified as necessary.

>  2. In kernel, asking KVM to populate the vmclock structure much like
>     it does other pvclocks shared with the guest. KVM/x86 already uses
>     pvclock_gtod_register_notifier() to hook changes; should we expand
>     on that? The problem with that notifier is that it seems to be
>     called far more frequently than I'd expect.

It's called once per tick to expose the continuous updates to the
conversion factors and related internal data.

Thanks,

        tglx
David Woodhouse Feb. 14, 2025, 12:04 p.m. UTC | #5
On Fri, 2025-02-14 at 12:34 +0100, Thomas Gleixner wrote:
> >  2. In kernel, asking KVM to populate the vmclock structure much like
> >     it does other pvclocks shared with the guest. KVM/x86 already uses
> >     pvclock_gtod_register_notifier() to hook changes; should we expand
> >     on that? The problem with that notifier is that it seems to be
> >     called far more frequently than I'd expect.
> 
> It's called once per tick to expose the continuous updates to the
> conversion factors and related internal data.

My recollection (a vague one) is that it's called, and reports
"changes", even when there *are* no changes to underlying conversion
factors. Something along the lines of "N ticks at 333 counts per tick,
then one tick at 334 counts per tick to catch up" because it can't
express the division factor completely without that discontinuity?

The actual 'error' caused by the apparent fluctuation in rate is
probably entirely negligible, but I am slightly concerned about the
steal time, if the hypervisor then spends stolen CPU time relaying all
those "changes" to the guest, and then the guest has to spend time
feeding the "changes" into its own timekeeping.

I'd like to strive for a mode where we only adjust what we tell guests
when adjtimex actually changes the real timing factors.

In fact if we have a userspace tool like chrony feeding adjtimex based
on external NTP/PPS/whatever, that tool could probably calibrate a
stable host TSC directly against the external real time. And in that
mode maybe we don't even need to feed the guest from the kernel's
CLOCK_REALTIME; that would be just another conversion step to introduce
noise.

We might end up with the direct setup for dedicated hosting
environments, but I do also want to support the general-purpose QEMU-
based setup where we expose the host's CLOCK_REALTIME as efficiently as
possible.

How about this: A KVM feature to provide/populate the VMCLOCK, since
only KVM knows the precise TSC scaling (and can immediately flip the
VMCLOCK to report invalid state if the TSC becomes unreliable).

It can *either* be fed the precise TSC/realtime relationship from
userspace (maybe in a vmclock structure that *userspace* populates, so
all the kernel has to do is scale/offset to account for the guest TSC
being different from the host TSC).

Or it can be in 'automatic' mode, where it derives from the host's
timekeeping. Which at the moment would have "too many" updates for my
liking, but we can worry about that later if necessary.