Message ID | 20240708092924.1473461-1-dwmw2@infradead.org (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Series | [RFC,v4] ptp: Add vDSO-style vmclock support |
On 08.07.24 11:27, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The vmclock "device" provides a shared memory region with precision clock > information. By using shared memory, it is safe across Live Migration. > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > actually helpful. > > The memory region of the device is also exposed to userspace so it can be > read or memory mapped by application which need reliable notification of > clock disruptions. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> [...] > + > +struct vmclock_abi { > + /* CONSTANT FIELDS */ > + uint32_t magic; > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ > + uint32_t size; /* Size of region containing this structure */ > + uint16_t version; /* 1 */ > + uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */ > +#define VMCLOCK_COUNTER_ARM_VCNT 0 > +#define VMCLOCK_COUNTER_X86_TSC 1 > +#define VMCLOCK_COUNTER_INVALID 0xff > + uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */ > +#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ > +#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ > +#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ > +#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */ > +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */ > + > + /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */ > + uint32_t seq_count; /* Low bit means an update is in progress */ > + /* > + * This field changes to another non-repeating value when the CPU > + * counter is disrupted, for example on live migration. This lets > + * the guest know that it should discard any calibration it has > + * performed of the counter against external sources (NTP/PTP/etc.). 
> + */ > + uint64_t disruption_marker; > + uint64_t flags; > + /* Indicates that the tai_offset_sec field is valid */ > +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) > + /* > + * Optionally used to notify guests of pending maintenance events. > + * A guest which provides latency-sensitive services may wish to > + * remove itself from service if an event is coming up. Two flags > + * indicate the approximate imminence of the event. > + */ > +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ > +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ > +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3) > +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4) > +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID (1 << 5) > +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID (1 << 6) > + /* > + * Even regardless of leap seconds, the time presented through this > + * mechanism may not be strictly monotonic. If the counter slows down > + * and the host adapts to this discovery, the time calculated from > + * the value of the counter immediately after an update to this > + * structure, may appear to be *earlier* than a calculation just > + * before the update (while the counter was believed to be running > + * faster than it now is). A guest operating system will typically > + * *skew* its own system clock back towards the reference clock > + * exposed here, rather than following this clock directly. If, > + * however, this structure is being populated from such a system > + * clock which is already handled in such a fashion and the results > + * *are* guaranteed to be monotonic, such monotonicity can be > + * advertised by setting this bit. > + */ I wonder if this might be difficult to define in a standard. Is there a need to define device and driver behavior in more detail? What would happen if e.g. the device first decides how to update the clock, but is then slow to update the SHM? 
> +#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7) > + > + uint8_t pad[2]; > + uint8_t clock_status; > +#define VMCLOCK_STATUS_UNKNOWN 0 > +#define VMCLOCK_STATUS_INITIALIZING 1 > +#define VMCLOCK_STATUS_SYNCHRONIZED 2 > +#define VMCLOCK_STATUS_FREERUNNING 3 > +#define VMCLOCK_STATUS_UNRELIABLE 4 > + > + /* > + * The time exposed through this device is never smeared. This field > + * corresponds to the 'subtype' field in virtio-rtc, which indicates > + * the smearing method. However in this case it provides a *hint* to > + * the guest operating system, such that *if* the guest OS wants to > + * provide its users with an alternative clock which does not follow > + * the POSIX CLOCK_REALTIME standard, it may do so in a fashion > + * consistent with the other systems in the nearby environment. AFAIU the POSIX.1-2017 standard does not mandate UTC, esp. not w.r.t. leap seconds [1, A.4.16 Seconds Since the Epoch]: > Those applications which do care about leap seconds can determine how to > handle them in whatever way those applications feel is best. This was > particularly emphasized because there was disagreement about what the best > way of handling leap seconds might be. It is a practical impossibility to > mandate that a conforming implementation must have a fixed relationship to > any particular official clock (consider isolated systems, or systems > performing "reruns" by setting the clock to some arbitrary time). So the above comment should probably refer to UTC instead of POSIX CLOCK_REALTIME. 
> + */ > + uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */ > +#define VMCLOCK_SMEARING_STRICT 0 > +#define VMCLOCK_SMEARING_NOON_LINEAR 1 > +#define VMCLOCK_SMEARING_UTC_SLS 2 > + int16_t tai_offset_sec; > + uint8_t leap_indicator; /* Based on VIRTIO_RTC_LEAP_xxx */ > +#define VMCLOCK_LEAP_NONE 0 /* No known nearby leap second */ > +#define VMCLOCK_LEAP_PRE_POS 1 /* Leap second + at end of month */ A positive leap second usually means stepping the clock backwards, so `Leap second +` is somewhat confusing. > +#define VMCLOCK_LEAP_PRE_NEG 2 /* Leap second - at end of month */ > +#define VMCLOCK_LEAP_POS 3 /* Set during 23:59:60 second */ > +#define VMCLOCK_LEAP_NEG 4 /* Not used in VMCLOCK */ > + /* > + * These values are not (yet) in virtio-rtc. They indicate that a > + * leap second *has* occurred at the start of the month. This allows > + * a guest to generate a smeared clock from the accurate clock which > + * this device provides, as smearing may need to continue for up to a > + * period of time *after* the point of the leap second itself. Must > + * be cleared by the 15th day of the month. > + */ > +#define VMCLOCK_LEAP_POST_POS 5 > +#define VMCLOCK_LEAP_POST_NEG 6 I think it can still be discussed in the context of virtio-rtc whether we should add dedicated identifiers for message-based smeared clock readouts. [1] https://pubs.opengroup.org/onlinepubs/9699919799/
On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote: > On 08.07.24 11:27, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > The vmclock "device" provides a shared memory region with precision clock > > information. By using shared memory, it is safe across Live Migration. > > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > > actually helpful. > > > > The memory region of the device is also exposed to userspace so it can be > > read or memory mapped by application which need reliable notification of > > clock disruptions. > > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > [...] > > > + > > +struct vmclock_abi { > > + /* CONSTANT FIELDS */ > > + uint32_t magic; > > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ > > + uint32_t size; /* Size of region containing this structure */ > > + uint16_t version; /* 1 */ > > + uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */ > > +#define VMCLOCK_COUNTER_ARM_VCNT 0 > > +#define VMCLOCK_COUNTER_X86_TSC 1 > > +#define VMCLOCK_COUNTER_INVALID 0xff > > + uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */ > > +#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ > > +#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ > > +#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ > > +#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */ > > +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */ > > + > > + /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */ > > + uint32_t seq_count; /* Low bit means an update is in progress */ > > + /* > > + * This field changes to another non-repeating value when the CPU > > + * counter is disrupted, for example on live migration. This lets > > + * the guest know that it should discard any calibration it has > > + * performed of the counter against external sources (NTP/PTP/etc.). 
> > + */ > > + uint64_t disruption_marker; > > + uint64_t flags; > > + /* Indicates that the tai_offset_sec field is valid */ > > +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) > > + /* > > + * Optionally used to notify guests of pending maintenance events. > > + * A guest which provides latency-sensitive services may wish to > > + * remove itself from service if an event is coming up. Two flags > > + * indicate the approximate imminence of the event. > > + */ > > +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ > > +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ > > +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3) > > +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4) > > +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID (1 << 5) > > +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID (1 << 6) > > + /* > > + * Even regardless of leap seconds, the time presented through this > > + * mechanism may not be strictly monotonic. If the counter slows down > > + * and the host adapts to this discovery, the time calculated from > > + * the value of the counter immediately after an update to this > > + * structure, may appear to be *earlier* than a calculation just > > + * before the update (while the counter was believed to be running > > + * faster than it now is). A guest operating system will typically > > + * *skew* its own system clock back towards the reference clock > > + * exposed here, rather than following this clock directly. If, > > + * however, this structure is being populated from such a system > > + * clock which is already handled in such a fashion and the results > > + * *are* guaranteed to be monotonic, such monotonicity can be > > + * advertised by setting this bit. > > + */ > > I wonder if this might be difficult to define in a standard. I'm sure we could do better than my attempt above, but surely it isn't *so* hard to define monotonicity? > Is there a need to define device and driver behavior in more detail? 
What
> would happen if e.g. the device first decides how to update the clock, but
> is then slow to update the SHM?

That's OK, isn't it?

The key in the VMCLOCK_FLAG_TIME_MONOTONIC flag is that by setting it,
the host guarantees that the time calculated according to this
structure at any given moment shall not appear to be later than the
time calculated via the structure at any *later* moment.

The kernel typically takes great care to ensure that time as seen by
userspace (e.g. in the real vDSO) *is* monotonic. If the underlying
counter is speeding up and the relationship from counter to real time
is being adjusted, the kernel is careful to ensure that the latest time
which can be calculated from one version of the data cannot appear
later than the earliest time which can be calculated from the next
version.

However, that naturally means a loss of precision: if the counter has
been running faster than expected, the apparent time will be later than
the true time, and the kernel has to *overcompensate* for the rate
increase, running the apparent clock slower than the true counter
period until the apparent time converges with true time again.

The time exposed through this structure can be as *precise* as
possible, or it can guarantee monotonicity. This flag allows the host
to advertise the latter. It can be useful for the guest to know,
because a non-monotonic clock is less suitable for *direct* use by
client applications, and more suitable for feeding the guest kernel's
own timekeeping.

> > +#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7)
> > +
> > + uint8_t pad[2];
> > + uint8_t clock_status;
> > +#define VMCLOCK_STATUS_UNKNOWN 0
> > +#define VMCLOCK_STATUS_INITIALIZING 1
> > +#define VMCLOCK_STATUS_SYNCHRONIZED 2
> > +#define VMCLOCK_STATUS_FREERUNNING 3
> > +#define VMCLOCK_STATUS_UNRELIABLE 4
> > +
> > + /*
> > + * The time exposed through this device is never smeared.
This field > > + * corresponds to the 'subtype' field in virtio-rtc, which indicates > > + * the smearing method. However in this case it provides a *hint* to > > + * the guest operating system, such that *if* the guest OS wants to > > + * provide its users with an alternative clock which does not follow > > + * the POSIX CLOCK_REALTIME standard, it may do so in a fashion > > + * consistent with the other systems in the nearby environment. > > AFAIU the POSIX.1-2017 standard does not mandate UTC, esp. not w.r.t. > leap seconds [1, A.4.16 Seconds Since the Epoch]: > > > Those applications which do care about leap seconds can determine how to > > handle them in whatever way those applications feel is best. This was > > particularly emphasized because there was disagreement about what the best > > way of handling leap seconds might be. It is a practical impossibility to > > mandate that a conforming implementation must have a fixed relationship to > > any particular official clock (consider isolated systems, or systems > > performing "reruns" by setting the clock to some arbitrary time). > > So the above comment should probably refer to UTC instead of POSIX > CLOCK_REALTIME. Ack, I'll fix that; thanks. > > + */ > > + uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */ > > +#define VMCLOCK_SMEARING_STRICT 0 > > +#define VMCLOCK_SMEARING_NOON_LINEAR 1 > > +#define VMCLOCK_SMEARING_UTC_SLS 2 > > + int16_t tai_offset_sec; > > + uint8_t leap_indicator; /* Based on VIRTIO_RTC_LEAP_xxx */ > > +#define VMCLOCK_LEAP_NONE 0 /* No known nearby leap second */ > > +#define VMCLOCK_LEAP_PRE_POS 1 /* Leap second + at end of month */ > > A positive leap second usually means stepping the clock backwards, so > `Leap second +` is somewhat confusing. I was trying to avoid hitting 80 characters. I'll rework it; thanks. 
> > +#define VMCLOCK_LEAP_PRE_NEG 2 /* Leap second - at end of month */ > > +#define VMCLOCK_LEAP_POS 3 /* Set during 23:59:60 second */ > > +#define VMCLOCK_LEAP_NEG 4 /* Not used in VMCLOCK */ > > + /* > > + * These values are not (yet) in virtio-rtc. They indicate that a > > + * leap second *has* occurred at the start of the month. This allows > > + * a guest to generate a smeared clock from the accurate clock which > > + * this device provides, as smearing may need to continue for up to a > > + * period of time *after* the point of the leap second itself. Must > > + * be cleared by the 15th day of the month. > > + */ > > +#define VMCLOCK_LEAP_POST_POS 5 > > +#define VMCLOCK_LEAP_POST_NEG 6 > > I think it can still be discussed in the context of virtio-rtc whether we > should add dedicated identifiers for message-based smeared clock readouts. Yeah, I think the only thing we really care about for SHM is the POST values I added there. As long as you'll let me have those, I think the structure is basically ready to be seriously proposed as an addition to
On 10.07.24 18:01, David Woodhouse wrote: > On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote: >> On 08.07.24 11:27, David Woodhouse wrote: >>> From: David Woodhouse <dwmw@amazon.co.uk> >>> >>> The vmclock "device" provides a shared memory region with precision clock >>> information. By using shared memory, it is safe across Live Migration. >>> >>> Like the KVM PTP clock, this can convert TSC-based cross timestamps into >>> KVM clock values. Unlike the KVM PTP clock, it does so only when such is >>> actually helpful. >>> >>> The memory region of the device is also exposed to userspace so it can be >>> read or memory mapped by application which need reliable notification of >>> clock disruptions. >>> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> >> >> [...] >> >>> + >>> +struct vmclock_abi { >>> + /* CONSTANT FIELDS */ >>> + uint32_t magic; >>> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ >>> + uint32_t size; /* Size of region containing this structure */ >>> + uint16_t version; /* 1 */ >>> + uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */ >>> +#define VMCLOCK_COUNTER_ARM_VCNT 0 >>> +#define VMCLOCK_COUNTER_X86_TSC 1 >>> +#define VMCLOCK_COUNTER_INVALID 0xff >>> + uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */ >>> +#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ >>> +#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ >>> +#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ >>> +#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */ >>> +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */ >>> + >>> + /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */ >>> + uint32_t seq_count; /* Low bit means an update is in progress */ >>> + /* >>> + * This field changes to another non-repeating value when the CPU >>> + * counter is disrupted, for example on live migration. 
This lets >>> + * the guest know that it should discard any calibration it has >>> + * performed of the counter against external sources (NTP/PTP/etc.). >>> + */ >>> + uint64_t disruption_marker; >>> + uint64_t flags; >>> + /* Indicates that the tai_offset_sec field is valid */ >>> +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) >>> + /* >>> + * Optionally used to notify guests of pending maintenance events. >>> + * A guest which provides latency-sensitive services may wish to >>> + * remove itself from service if an event is coming up. Two flags >>> + * indicate the approximate imminence of the event. >>> + */ >>> +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ >>> +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ >>> +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3) >>> +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4) >>> +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID (1 << 5) >>> +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID (1 << 6) >>> + /* >>> + * Even regardless of leap seconds, the time presented through this >>> + * mechanism may not be strictly monotonic. If the counter slows down >>> + * and the host adapts to this discovery, the time calculated from >>> + * the value of the counter immediately after an update to this >>> + * structure, may appear to be *earlier* than a calculation just >>> + * before the update (while the counter was believed to be running >>> + * faster than it now is). A guest operating system will typically >>> + * *skew* its own system clock back towards the reference clock >>> + * exposed here, rather than following this clock directly. If, >>> + * however, this structure is being populated from such a system >>> + * clock which is already handled in such a fashion and the results >>> + * *are* guaranteed to be monotonic, such monotonicity can be >>> + * advertised by setting this bit. >>> + */ >> >> I wonder if this might be difficult to define in a standard. 
>
> I'm sure we could do better than my attempt above, but surely it isn't
> *so* hard to define monotonicity?
>
>> Is there a need to define device and driver behavior in more detail? What
>> would happen if e.g. the device first decides how to update the clock, but
>> is then slow to update the SHM?
>
> That's OK, isn't it?
>
> The key in the VMCLOCK_FLAG_TIME_MONOTONIC flag is that by setting it,
> the host guarantees that the time calculated according to this
> structure at any given moment, shall not appear to be later than the
> time calculated via the structure at any *later* moment.

IMHO this phrasing is better, since it directly refers to the state of the
structure.

AFAIU if there would be abnormal delays in store buffers, causing some
driver to still see the old clock for some time, the monotonicity could be
violated:

1. device writes new, much slower clock to store buffer
2. some time passes
3. driver reads old, much faster clock
4. device writes store buffer to cache
5. driver reads new, much slower clock

But I hope such delays do not occur.
On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
>
> IMHO this phrasing is better, since it directly refers to the state of the
> structure.

Thanks. I'll update it.

> AFAIU if there would be abnormal delays in store buffers, causing some
> driver to still see the old clock for some time, the monotonicity could be
> violated:
>
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. driver reads old, much faster clock
> 4. device writes store buffer to cache
> 5. driver reads new, much slower clock
>
> But I hope such delays do not occur.

For the case of the hypervisor←→guest interface this should be handled
by the use of memory barriers and the seqcount lock.

The guest driver reads the seqcount, performs a read memory barrier,
then reads the contents of the structure. Then performs *another* read
memory barrier, and checks the seqcount hasn't changed:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351

The converse happens with write barriers on the hypervisor side:
https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68

Do we need to think harder about the ordering across a real PCI bus? It
isn't entirely unreasonable for this to be implemented in hardware if
we eventually add a counter_id value for a bus-visible counter like the
Intel Always Running Timer (ART). I'm also OK with saying that device
implementations may only provide the shared memory structure if they
can ensure memory ordering.
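For readers following the thread, here is a minimal sketch of the seqcount protocol described above. It assumes C11 atomics and fences standing in for the kernel's smp_wmb()/smp_rmb(), which they only approximate. The struct and the demo_* names are hypothetical; this is not the code from ptp_vmclock.c or from the QEMU device, and it says nothing about ordering across a real PCI bus.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared region; not the real vmclock_abi. */
struct demo_clock {
	_Atomic uint32_t seq_count;	/* low bit set: update in progress */
	_Atomic uint64_t time_sec;
	_Atomic uint64_t time_frac_sec;
};

/* Writer side (the hypervisor in the thread): odd count, stores, even count. */
static void demo_publish(struct demo_clock *c, uint64_t sec, uint64_t frac)
{
	uint32_t seq = atomic_load_explicit(&c->seq_count, memory_order_relaxed);

	assert((seq & 1) == 0);	/* sketch assumes a single writer */
	atomic_store_explicit(&c->seq_count, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* stands in for smp_wmb() */
	atomic_store_explicit(&c->time_sec, sec, memory_order_relaxed);
	atomic_store_explicit(&c->time_frac_sec, frac, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* stands in for smp_wmb() */
	atomic_store_explicit(&c->seq_count, seq + 2, memory_order_relaxed);
}

/* Reader side (the guest): retry until both count reads match and are even. */
static void demo_snapshot(struct demo_clock *c, uint64_t *sec, uint64_t *frac)
{
	uint32_t start, end;

	do {
		start = atomic_load_explicit(&c->seq_count, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* stands in for smp_rmb() */
		*sec = atomic_load_explicit(&c->time_sec, memory_order_relaxed);
		*frac = atomic_load_explicit(&c->time_frac_sec, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* stands in for smp_rmb() */
		end = atomic_load_explicit(&c->seq_count, memory_order_relaxed);
	} while ((start & 1) || start != end);
}
```

The reader never blocks the writer; it only retries when it observes an update in progress or a changed count, which is why the scheme suits a region the guest can only read.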
On 08.07.24 11:27, David Woodhouse wrote:
> +
> + /*
> + * Time according to time_type field above.
> + */
> + uint64_t time_sec; /* Seconds since time_type epoch */
> + uint64_t time_frac_sec; /* (seconds >> 64) */
> + uint64_t time_esterror_picosec; /* (± picoseconds) */
> + uint64_t time_maxerror_picosec; /* (± picoseconds) */

Is this unsigned or signed?
On 11.07.24 09:50, David Woodhouse wrote:
> On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
>>
>> IMHO this phrasing is better, since it directly refers to the state of the
>> structure.
>
> Thanks. I'll update it.
>
>> AFAIU if there would be abnormal delays in store buffers, causing some
>> driver to still see the old clock for some time, the monotonicity could be
>> violated:
>>
>> 1. device writes new, much slower clock to store buffer
>> 2. some time passes
>> 3. driver reads old, much faster clock
>> 4. device writes store buffer to cache
>> 5. driver reads new, much slower clock
>>
>> But I hope such delays do not occur.
>
> For the case of the hypervisor←→guest interface this should be handled
> by the use of memory barriers and the seqcount lock.
>
> The guest driver reads the seqcount, performs a read memory barrier,
> then reads the contents of the structure. Then performs *another* read
> memory barrier, and checks the seqcount hasn't changed:
> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
>
> The converse happens with write barriers on the hypervisor side:
> https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68

My point is that, looking at the above steps 1. - 5.:

3. read HW counter, smp_rmb, read seqcount
4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become effective
5. read seqcount, smp_rmb, read HW counter

AFAIU this would still be a theoretical problem suggesting the use of
stronger barriers.

>
> Do we need to think harder about the ordering across a real PCI bus? It
> isn't entirely unreasonable for this to be implemented in hardware if
> we eventually add a counter_id value for a bus-visible counter like the
> Intel Always Running Timer (ART). I'm also OK with saying that device
> implementations may only provide the shared memory structure if they
> can ensure memory ordering.

Sounds good to me. This statement would then also address the above.
On 16 July 2024 12:54:52 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: >On 11.07.24 09:50, David Woodhouse wrote: >> On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote: >>> >>> IMHO this phrasing is better, since it directly refers to the state of the >>> structure. >> >> Thanks. I'll update it. >> >>> AFAIU if there would be abnormal delays in store buffers, causing some >>> driver to still see the old clock for some time, the monotonicity could be >>> violated: >>> >>> 1. device writes new, much slower clock to store buffer >>> 2. some time passes >>> 3. driver reads old, much faster clock >>> 4. device writes store buffer to cache >>> 5. driver reads new, much slower clock >>> >>> But I hope such delays do not occur. >> >> For the case of the hypervisor←→guest interface this should be handled >> by the use of memory barriers and the seqcount lock. >> >> The guest driver reads the seqcount, performs a read memory barrier, >> then reads the contents of the structure. Then performs *another* read >> memory barrier, and checks the seqcount hasn't changed: >> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351 >> >> The converse happens with write barriers on the hypervisor side: >> https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68 > >My point is that, looking at the above steps 1. - 5.: > >3. read HW counter, smp_rmb, read seqcount >4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become effective >5. read seqcount, smp_rmb, read HW counter > >AFAIU this would still be a theoretical problem suggesting the use of >stronger barriers. This seems like a bug on the guest side. The HW counter needs to be read *within* the (paired, matching) seqcount reads, not before or after.
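The point about reading the HW counter *within* the paired seqcount reads can be sketched as follows. This is an illustration only, assuming C11 atomics; the struct, its ref_counter field, and the demo_* helpers are hypothetical, and demo_fake_counter() is a placeholder for a real counter read (which may need its own ordering guarantees on real hardware).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared region; field names are illustrative. */
struct demo_region {
	_Atomic uint32_t seq_count;	/* low bit set: update in progress */
	_Atomic uint64_t ref_counter;	/* assumed: counter value at the reference time */
	_Atomic uint64_t time_sec;	/* reference time, whole seconds */
};

/* One self-consistent observation: structure fields plus a counter reading. */
struct demo_sample {
	uint64_t hw_counter;
	uint64_t ref_counter;
	uint64_t time_sec;
};

/* Placeholder for a real counter read such as the TSC; returns a fixed value. */
static uint64_t demo_fake_counter(void)
{
	return 1000;
}

static struct demo_sample demo_sample_clock(struct demo_region *r,
					    uint64_t (*read_counter)(void))
{
	struct demo_sample s;
	uint32_t start, end;

	assert(read_counter != NULL);
	do {
		start = atomic_load_explicit(&r->seq_count, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s.ref_counter = atomic_load_explicit(&r->ref_counter, memory_order_relaxed);
		s.time_sec = atomic_load_explicit(&r->time_sec, memory_order_relaxed);
		/* Sampled between the two seqcount reads, not before or after. */
		s.hw_counter = read_counter();
		atomic_thread_fence(memory_order_acquire);
		end = atomic_load_explicit(&r->seq_count, memory_order_relaxed);
	} while ((start & 1) || start != end);

	return s;
}
```

Keeping the counter read inside the loop means a retry discards the counter sample together with the fields it was paired with, so a stale counter value is never combined with a newer version of the structure.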
On 16.07.24 14:32, David Woodhouse wrote:
> On 16 July 2024 12:54:52 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>> On 11.07.24 09:50, David Woodhouse wrote:
>>> On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
>>>>
>>>> IMHO this phrasing is better, since it directly refers to the state of the
>>>> structure.
>>>
>>> Thanks. I'll update it.
>>>
>>>> AFAIU if there would be abnormal delays in store buffers, causing some
>>>> driver to still see the old clock for some time, the monotonicity could be
>>>> violated:
>>>>
>>>> 1. device writes new, much slower clock to store buffer
>>>> 2. some time passes
>>>> 3. driver reads old, much faster clock
>>>> 4. device writes store buffer to cache
>>>> 5. driver reads new, much slower clock
>>>>
>>>> But I hope such delays do not occur.
>>>
>>> For the case of the hypervisor←→guest interface this should be handled
>>> by the use of memory barriers and the seqcount lock.
>>>
>>> The guest driver reads the seqcount, performs a read memory barrier,
>>> then reads the contents of the structure. Then performs *another* read
>>> memory barrier, and checks the seqcount hasn't changed:
>>> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
>>>
>>> The converse happens with write barriers on the hypervisor side:
>>> https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
>>
>> My point is that, looking at the above steps 1. - 5.:
>>
>> 3. read HW counter, smp_rmb, read seqcount
>> 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become effective
>> 5. read seqcount, smp_rmb, read HW counter
>>
>> AFAIU this would still be a theoretical problem suggesting the use of
>> stronger barriers.
>
> This seems like a bug on the guest side. The HW counter needs to be read *within* the (paired, matching) seqcount reads, not before or after.
>
>

There would be paired reads:

1. device writes new, much slower clock to store buffer
2. some time passes
3. read seqcount, smp_rmb, ..., read HW counter, smp_rmb, read seqcount
4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount all become effective only now
5. read seqcount, smp_rmb, read HW counter, ..., smp_rmb, read seqcount

I just omitted the parts which do not necessarily need to happen close to
4. for the monotonicity to be violated. My point is that 1. could become
visible to other cores long after it happened on the local core (during
4.).
On Tue, 2024-07-16 at 13:54 +0200, Peter Hilber wrote:
> On 08.07.24 11:27, David Woodhouse wrote:
> > +
> > + /*
> > + * Time according to time_type field above.
> > + */
> > + uint64_t time_sec; /* Seconds since time_type epoch */
> > + uint64_t time_frac_sec; /* (seconds >> 64) */
> > + uint64_t time_esterror_picosec; /* (± picoseconds) */
> > + uint64_t time_maxerror_picosec; /* (± picoseconds) */
>
> Is this unsigned or signed?

The field itself is unsigned, as it provides the absolute value of the
error (which can be in either direction). Probably better just to drop
the ± from the comment.

Julien is now back from vacation and I'm expecting to see his opinion
on whether we can change that to nanoseconds for consistency.
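To make the units concrete, here is a small sketch of how a guest might interpret these fields, assuming the RFC v4 layout quoted above: time_frac_sec in units of (seconds >> 64) and an unsigned error magnitude in picoseconds. The demo_* names are hypothetical, and unsigned __int128 is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL
#define DEMO_PSEC_PER_SEC 1000000000000ULL

/* Convert a fraction in units of (seconds >> 64) to nanoseconds, rounding down. */
static uint64_t demo_frac_to_ns(uint64_t frac_sec)
{
	return (uint64_t)(((unsigned __int128)frac_sec * DEMO_NSEC_PER_SEC) >> 64);
}

/* Interval implied by a time and an unsigned error magnitude, in picoseconds. */
struct demo_bounds {
	unsigned __int128 earliest_ps;
	unsigned __int128 latest_ps;
};

static struct demo_bounds demo_time_bounds(uint64_t time_sec,
					   uint64_t time_frac_sec,
					   uint64_t maxerror_picosec)
{
	/* Picoseconds since the epoch; the sum fits comfortably in 128 bits. */
	unsigned __int128 t = (unsigned __int128)time_sec * DEMO_PSEC_PER_SEC +
		(((unsigned __int128)time_frac_sec * DEMO_PSEC_PER_SEC) >> 64);
	struct demo_bounds b;

	/* The error is an absolute value, so it applies in both directions. */
	b.earliest_ps = t > maxerror_picosec ? t - maxerror_picosec : 0;
	b.latest_ps = t + maxerror_picosec;
	assert(b.earliest_ps <= b.latest_ps);
	return b;
}
```

For example, a time_frac_sec of 1ULL << 63 represents half a second, so demo_frac_to_ns() yields 500000000, and a maximum error of 250000000000 picoseconds around 10.5 s gives the interval from 10.25 s to 10.75 s.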
On Tue, 2024-07-16 at 15:20 +0200, Peter Hilber wrote: > On 16.07.24 14:32, David Woodhouse wrote: > > On 16 July 2024 12:54:52 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: > > > On 11.07.24 09:50, David Woodhouse wrote: > > > > On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote: > > > > > > > > > > IMHO this phrasing is better, since it directly refers to the state of the > > > > > structure. > > > > > > > > Thanks. I'll update it. > > > > > > > > > AFAIU if there would be abnormal delays in store buffers, causing some > > > > > driver to still see the old clock for some time, the monotonicity could be > > > > > violated: > > > > > > > > > > 1. device writes new, much slower clock to store buffer > > > > > 2. some time passes > > > > > 3. driver reads old, much faster clock > > > > > 4. device writes store buffer to cache > > > > > 5. driver reads new, much slower clock > > > > > > > > > > But I hope such delays do not occur. > > > > > > > > For the case of the hypervisor←→guest interface this should be handled > > > > by the use of memory barriers and the seqcount lock. > > > > > > > > The guest driver reads the seqcount, performs a read memory barrier, > > > > then reads the contents of the structure. Then performs *another* read > > > > memory barrier, and checks the seqcount hasn't changed: > > > > https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351 > > > > > > > > The converse happens with write barriers on the hypervisor side: > > > > https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68 > > > > > > My point is that, looking at the above steps 1. - 5.: > > > > > > 3. read HW counter, smp_rmb, read seqcount > > > 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become effective > > > 5. read seqcount, smp_rmb, read HW counter > > > > > > AFAIU this would still be a theoretical problem suggesting the use of > > > stronger barriers. 
> >
> > This seems like a bug on the guest side. The HW counter needs to be read *within* the (paired, matching) seqcount reads, not before or after.
> >
> >
>
> There would be paired reads:
>
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. read seqcount, smp_rmb, ..., read HW counter, smp_rmb, read seqcount
> 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount all become
>    effective only now
> 5. read seqcount, smp_rmb, read HW counter, ..., smp_rmb, read seqcount
>
> I just omitted the parts which do not necessarily need to happen close to
> 4. for the monotonicity to be violated. My point is that 1. could become
> visible to other cores long after it happened on the local core (during
> 4.).

Oh, I see. That would be a bug on the device side then. And as you say,
it could be fixed by using the appropriate barriers. Or my alternative
of just documenting "Don't Do That Then".
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig index 604541dcb320..e98c9767e0ef 100644 --- a/drivers/ptp/Kconfig +++ b/drivers/ptp/Kconfig @@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM To compile this driver as a module, choose M here: the module will be called ptp_kvm. +config PTP_1588_CLOCK_VMCLOCK + tristate "Virtual machine PTP clock" + depends on X86_TSC || ARM_ARCH_TIMER + depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128 + default y + help + This driver adds support for using a virtual precision clock + advertised by the hypervisor. This clock is only useful in virtual + machines where such a device is present. + + To compile this driver as a module, choose M here: the module + will be called ptp_vmclock. + config PTP_1588_CLOCK_IDT82P33 tristate "IDT 82P33xxx PTP clock" depends on PTP_1588_CLOCK && I2C diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile index 68bf02078053..01b5cd91eb61 100644 --- a/drivers/ptp/Makefile +++ b/drivers/ptp/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o obj-$(CONFIG_PTP_1588_CLOCK_INES) += ptp_ines.o obj-$(CONFIG_PTP_1588_CLOCK_PCH) += ptp_pch.o obj-$(CONFIG_PTP_1588_CLOCK_KVM) += ptp_kvm.o +obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK) += ptp_vmclock.o obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o ptp-qoriq-y += ptp_qoriq.o ptp-qoriq-$(CONFIG_DEBUG_FS) += ptp_qoriq_debugfs.o diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c new file mode 100644 index 000000000000..30f15d7753bb --- /dev/null +++ b/drivers/ptp/ptp_vmclock.c @@ -0,0 +1,567 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Virtual PTP 1588 clock for use with LM-safe VMclock device. + * + * Copyright © 2024 Amazon.com, Inc. or its affiliates. 
+ */ + +#include <linux/acpi.h> +#include <linux/device.h> +#include <linux/err.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/miscdevice.h> +#include <linux/mm.h> +#include <linux/module.h> +#include <linux/platform_device.h> +#include <linux/slab.h> + +#include <uapi/linux/vmclock-abi.h> + +#include <linux/ptp_clock_kernel.h> + +#ifdef CONFIG_X86 +#include <asm/pvclock.h> +#include <asm/kvmclock.h> +#endif + +#ifdef CONFIG_KVM_GUEST +#define SUPPORT_KVMCLOCK +#endif + +static DEFINE_IDA(vmclock_ida); + +ACPI_MODULE_NAME("vmclock"); + +struct vmclock_state { + struct resource res; + struct vmclock_abi *clk; + struct miscdevice miscdev; + struct ptp_clock_info ptp_clock_info; + struct ptp_clock *ptp_clock; + enum clocksource_ids cs_id, sys_cs_id; + int index; + char *name; +}; + +#define VMCLOCK_MAX_WAIT ms_to_ktime(100) + +/* + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64 + * and add the fractional second part of the reference time. + * + * The result is a 128-bit value, the top 64 bits of which are seconds, and + * the low 64 bits are (seconds >> 64). + * + * If __int128 isn't available, perform the calculation 32 bits at a time to + * avoid overflow. 
+ */ +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta, + uint64_t period, uint8_t shift, + uint64_t frac_sec) +{ + unsigned __int128 res = (unsigned __int128)delta * period; + + res >>= shift; + res += frac_sec; + *res_hi = res >> 64; + return (uint64_t)res; +} + +static inline bool tai_adjust(struct vmclock_abi *clk, uint64_t *sec) +{ + if (likely(clk->time_type == VMCLOCK_TIME_UTC)) + return true; + + if (clk->time_type == VMCLOCK_TIME_TAI && + (clk->flags & VMCLOCK_FLAG_TAI_OFFSET_VALID)) { + if (sec) + *sec += clk->tai_offset_sec; + return true; + } + return false; +} + +static int vmclock_get_crosststamp(struct vmclock_state *st, + struct ptp_system_timestamp *sts, + struct system_counterval_t *system_counter, + struct timespec64 *tspec) +{ + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT); + struct system_time_snapshot systime_snapshot; + uint64_t cycle, delta, seq, frac_sec; + +#ifdef CONFIG_X86 + /* + * We'd expect the hypervisor to know this and to report the clock + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid. + */ + if (check_tsc_unstable()) + return -EINVAL; +#endif + + while (1) { + seq = st->clk->seq_count & ~1ULL; + virt_rmb(); + + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) + return -EINVAL; + + /* + * When invoked for gettimex64(), fill in the pre/post system + * times. The simple case is when system time is based on the + * same counter as st->cs_id, in which case all three times + * will be derived from the *same* counter value. + * + * If the system isn't using the same counter, then the value + * from ktime_get_snapshot() will still be used as pre_ts, and + * ptp_read_system_postts() is called to populate postts after + * calling get_cycles(). + * + * The conversion to timespec64 happens further down, outside + * the seq_count loop. 
+ */ + if (sts) { + ktime_get_snapshot(&systime_snapshot); + if (systime_snapshot.cs_id == st->cs_id) { + cycle = systime_snapshot.cycles; + } else { + cycle = get_cycles(); + ptp_read_system_postts(sts); + } + } else + cycle = get_cycles(); + + delta = cycle - st->clk->counter_value; + + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta, + st->clk->counter_period_frac_sec, + st->clk->counter_period_shift, + st->clk->time_frac_sec); + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64); + tspec->tv_sec += st->clk->time_sec; + + if (!tai_adjust(st->clk, &tspec->tv_sec)) + return -EINVAL; + + virt_rmb(); + if (seq == st->clk->seq_count) + break; + + if (ktime_after(ktime_get(), deadline)) + return -ETIMEDOUT; + } + + if (system_counter) { + system_counter->cycles = cycle; + system_counter->cs_id = st->cs_id; + } + + if (sts) { + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real); + if (systime_snapshot.cs_id == st->cs_id) + sts->post_ts = sts->pre_ts; + } + + return 0; +} + +#ifdef SUPPORT_KVMCLOCK +/* + * In the case where the system is using the KVM clock for timekeeping, convert + * the TSC value into a KVM clock time in order to return a paired reading that + * get_device_system_crosststamp() can cope with. 
+ */ +static int vmclock_get_crosststamp_kvmclock(struct vmclock_state *st, + struct ptp_system_timestamp *sts, + struct system_counterval_t *system_counter, + struct timespec64 *tspec) +{ + struct pvclock_vcpu_time_info *pvti = this_cpu_pvti(); + unsigned pvti_ver; + int ret; + + preempt_disable_notrace(); + + do { + pvti_ver = pvclock_read_begin(pvti); + + ret = vmclock_get_crosststamp(st, sts, system_counter, tspec); + if (ret) + break; + + system_counter->cycles = __pvclock_read_cycles(pvti, + system_counter->cycles); + system_counter->cs_id = CSID_X86_KVM_CLK; + + /* + * This retry should never really happen; if the TSC is + * stable and reliable enough across vCPUs that it is sane + * for the hypervisor to expose a VMCLOCK device which uses + * it as the reference counter, then the KVM clock should be + * in 'master clock mode' and basically never changed. But + * the KVM clock is a fickle and often broken thing, so do + * it "properly" just in case. + */ + } while (pvclock_read_retry(pvti, pvti_ver)); + + preempt_enable_notrace(); + + return ret; +} +#endif + +static int ptp_vmclock_get_time_fn(ktime_t *device_time, + struct system_counterval_t *system_counter, + void *ctx) +{ + struct vmclock_state *st = ctx; + struct timespec64 tspec; + int ret; + +#ifdef SUPPORT_KVMCLOCK + if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK) + ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter, + &tspec); + else +#endif + ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec); + + if (!ret) + *device_time = timespec64_to_ktime(tspec); + + return ret; +} + + +static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp, + struct system_device_crosststamp *xtstamp) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + int ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, st, + NULL, xtstamp); +#ifdef SUPPORT_KVMCLOCK + /* + * On x86, the KVM clock may be used for the system time.
We can + * actually convert a TSC reading to that, and return a paired + * timestamp that get_device_system_crosststamp() *can* handle. + */ + if (ret == -ENODEV) { + struct system_time_snapshot systime_snapshot; + ktime_get_snapshot(&systime_snapshot); + + if (systime_snapshot.cs_id == CSID_X86_TSC || + systime_snapshot.cs_id == CSID_X86_KVM_CLK) { + WRITE_ONCE(st->sys_cs_id, systime_snapshot.cs_id); + ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, + st, NULL, xtstamp); + } + } +#endif + return ret; +} + +/* + * PTP clock operations + */ + +static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_settime(struct ptp_clock_info *ptp, + const struct timespec64 *ts) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts, + struct ptp_system_timestamp *sts) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + + return vmclock_get_crosststamp(st, sts, NULL, ts); +} + +static int ptp_vmclock_enable(struct ptp_clock_info *ptp, + struct ptp_clock_request *rq, int on) +{ + return -EOPNOTSUPP; +} + +static const struct ptp_clock_info ptp_vmclock_info = { + .owner = THIS_MODULE, + .max_adj = 0, + .n_ext_ts = 0, + .n_pins = 0, + .pps = 0, + .adjfine = ptp_vmclock_adjfine, + .adjtime = ptp_vmclock_adjtime, + .gettimex64 = ptp_vmclock_gettimex, + .settime64 = ptp_vmclock_settime, + .enable = ptp_vmclock_enable, + .getcrosststamp = ptp_vmclock_getcrosststamp, +}; + +static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma) +{ + struct vmclock_state *st = container_of(fp->private_data, + struct vmclock_state, miscdev); + + if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ) + return -EROFS; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff) + return -EINVAL; + + if 
(io_remap_pfn_range(vma, vma->vm_start, + st->res.start >> PAGE_SHIFT, PAGE_SIZE, + vma->vm_page_prot)) + return -EAGAIN; + + return 0; +} + +static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf, + size_t count, loff_t *ppos) +{ + struct vmclock_state *st = container_of(fp->private_data, + struct vmclock_state, miscdev); + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT); + size_t max_count; + int32_t seq; + + if (*ppos >= PAGE_SIZE) + return 0; + + max_count = PAGE_SIZE - *ppos; + if (count > max_count) + count = max_count; + + while (1) { + seq = st->clk->seq_count & ~1ULL; + virt_rmb(); + + if (copy_to_user(buf, ((char *)st->clk) + *ppos, count)) + return -EFAULT; + + virt_rmb(); + if (seq == st->clk->seq_count) + break; + + if (ktime_after(ktime_get(), deadline)) + return -ETIMEDOUT; + } + + *ppos += count; + return count; +} + +static const struct file_operations vmclock_miscdev_fops = { + .mmap = vmclock_miscdev_mmap, + .read = vmclock_miscdev_read, +}; + +/* module operations */ + +static void vmclock_remove(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct vmclock_state *st = dev_get_drvdata(dev); + + if (st->ptp_clock) + ptp_clock_unregister(st->ptp_clock); + + if (st->miscdev.minor == MISC_DYNAMIC_MINOR) + misc_deregister(&st->miscdev); +} + +static acpi_status vmclock_acpi_resources(struct acpi_resource *ares, void *data) +{ + struct vmclock_state *st = data; + struct resource_win win; + struct resource *res = &(win.res); + + if (ares->type == ACPI_RESOURCE_TYPE_END_TAG) + return AE_OK; + + /* There can be only one */ + if (resource_type(&st->res) == IORESOURCE_MEM) + return AE_ERROR; + + if (acpi_dev_resource_memory(ares, res) || + acpi_dev_resource_address_space(ares, &win)) { + + if (resource_type(res) != IORESOURCE_MEM || + resource_size(res) < sizeof(*st->clk)) + return AE_ERROR; + + st->res = *res; + return AE_OK; + } + + return AE_ERROR; +} + +static int vmclock_probe_acpi(struct device *dev,
struct vmclock_state *st) +{ + struct acpi_device *adev = ACPI_COMPANION(dev); + acpi_status status; + + status = acpi_walk_resources(adev->handle, METHOD_NAME__CRS, + vmclock_acpi_resources, st); + if (ACPI_FAILURE(status) || resource_type(&st->res) != IORESOURCE_MEM) { + dev_err(dev, "failed to get resources\n"); + return -ENODEV; + } + + return 0; +} + +static void vmclock_put_idx(void *data) +{ + struct vmclock_state *st = data; + + ida_free(&vmclock_ida, st->index); +} + +static int vmclock_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct vmclock_state *st; + int ret; + + st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL); + if (!st) + return -ENOMEM; + + if (has_acpi_companion(dev)) + ret = vmclock_probe_acpi(dev, st); + else + ret = -EINVAL; /* Only ACPI for now */ + + if (ret) { + dev_info(dev, "Failed to obtain physical address: %d\n", ret); + goto out; + } + + st->clk = devm_memremap(dev, st->res.start, resource_size(&st->res), + MEMREMAP_WB | MEMREMAP_DEC); + if (IS_ERR(st->clk)) { + ret = PTR_ERR(st->clk); + dev_info(dev, "failed to map shared memory\n"); + st->clk = NULL; + goto out; + } + + if (st->clk->magic != VMCLOCK_MAGIC || + st->clk->size < sizeof(*st->clk) || + st->clk->version != 1) { + dev_info(dev, "vmclock magic fields invalid\n"); + ret = -EINVAL; + goto out; + } + + ret = ida_alloc(&vmclock_ida, GFP_KERNEL); + if (ret < 0) + goto out; + + st->index = ret; + ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st); + if (ret) + goto out; + + st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index); + if (!st->name) { + ret = -ENOMEM; + goto out; + } + + /* If the structure is big enough, it can be mapped to userspace */ + if (st->clk->size >= PAGE_SIZE) { + st->miscdev.minor = MISC_DYNAMIC_MINOR; + st->miscdev.fops = &vmclock_miscdev_fops; + st->miscdev.name = st->name; + + ret = misc_register(&st->miscdev); + if (ret) + goto out; + } + + /* If there is valid clock information, 
register a PTP clock */ + if (IS_ENABLED(CONFIG_ARM64) && + st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) { + /* Can we check it's the virtual counter? */ + st->cs_id = CSID_ARM_ARCH_COUNTER; + } else if (IS_ENABLED(CONFIG_X86) && + st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) { + st->cs_id = CSID_X86_TSC; + } + + /* Only UTC, or TAI with offset */ + if (!tai_adjust(st->clk, NULL)) { + dev_info(dev, "vmclock does not provide unambiguous UTC\n"); + st->cs_id = CSID_GENERIC; + } + + if (st->cs_id) { + st->sys_cs_id = st->cs_id; + + st->ptp_clock_info = ptp_vmclock_info; + strscpy(st->ptp_clock_info.name, st->name); + st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev); + + if (IS_ERR(st->ptp_clock)) { + ret = PTR_ERR(st->ptp_clock); + st->ptp_clock = NULL; + vmclock_remove(pdev); + goto out; + } + } else if (!st->miscdev.minor) { + /* Neither miscdev nor PTP registered */ + dev_info(dev, "vmclock: Neither miscdev nor PTP available; not registering\n"); + ret = -ENODEV; + goto out; + } + + dev_info(dev, "%s: registered %s%s%s\n", st->name, + st->miscdev.minor ? "miscdev" : "", + (st->miscdev.minor && st->ptp_clock) ? ", " : "", + st->ptp_clock ? 
"PTP" : ""); + + dev_set_drvdata(dev, st); + + out: + return ret; +} + +static const struct acpi_device_id vmclock_acpi_ids[] = { + { "VMCLOCK", 0 }, + {} +}; +MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids); + +static struct platform_driver vmclock_platform_driver = { + .probe = vmclock_probe, + .remove_new = vmclock_remove, + .driver = { + .name = "vmclock", + .acpi_match_table = vmclock_acpi_ids, + }, +}; + +module_platform_driver(vmclock_platform_driver) + +MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>"); +MODULE_DESCRIPTION("PTP clock using VMCLOCK"); +MODULE_LICENSE("GPL v2"); diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h new file mode 100644 index 000000000000..84f0e37a8a06 --- /dev/null +++ b/include/uapi/linux/vmclock-abi.h @@ -0,0 +1,187 @@ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ + +/* + * This structure provides a vDSO-style clock to VM guests, exposing the + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch + * counter, etc.) and real time. It is designed to address the problem of + * live migration, which other clock enlightenments do not. + * + * When a guest is live migrated, this affects the clock in two ways. + * + * First, even between identical hosts the actual frequency of the underlying + * counter will change within the tolerances of its specification (typically + * ±50PPM, or 4 seconds a day). This frequency also varies over time on the + * same host, but can be tracked by NTP as it generally varies slowly. With + * live migration there is a step change in the frequency, with no warning. + * + * Second, there may be a step change in the value of the counter itself, as + * its accuracy is limited by the precision of the NTP synchronization on the + * source and destination hosts. + * + * So any calibration (NTP, PTP, etc.) which the guest has done on the source + * host before migration is invalid, and needs to be redone on the new host. 
+ * + * In its most basic mode, this structure provides only an indication to the + * guest that live migration has occurred. This allows the guest to know that + * its clock is invalid and take remedial action. For applications that need + * reliable accurate timestamps (e.g. distributed databases), the structure + * can be mapped all the way to userspace. This allows the application to see + * directly for itself that the clock is disrupted and take appropriate + * action, even when using a vDSO-style method to get the time instead of a + * system call. + * + * In its more advanced mode, this structure can also be used to expose the + * precise relationship of the CPU counter to real time, as calibrated by the + * host. This means that userspace applications can have accurate time + * immediately after live migration, rather than having to pause operations + * and wait for NTP to recover. This mode does, of course, rely on the + * counter being reliable and consistent across CPUs. + * + * Note that this must be true UTC, never with smeared leap seconds. If a + * guest wishes to construct a smeared clock, it can do so. Presenting a + * smeared clock through this interface would be problematic because it + * actually messes with the apparent counter *period*. A linear smearing + * of 1 ms per second would effectively tweak the counter period by 1000PPM + * at the start/end of the smearing period, while a sinusoidal smear would + * basically be impossible to represent. + * + * This structure is offered with the intent that it be adopted into the + * nascent virtio-rtc standard, as a virtio-rtc that does not address the live + * migration problem seems a little less than fit for purpose. For that + * reason, certain fields use precisely the same numeric definitions as in + * the virtio-rtc proposal.
The structure can also be exposed through an ACPI + * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for + * the fact that it uses a real _CRS to convey the address of the structure + * (which should be a full page, to allow for mapping directly to userspace). + */ + +#ifndef __VMCLOCK_ABI_H__ +#define __VMCLOCK_ABI_H__ + +#ifdef __KERNEL__ +#include <linux/types.h> +#else +#include <stdint.h> +#endif + +struct vmclock_abi { + /* CONSTANT FIELDS */ + uint32_t magic; +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ + uint32_t size; /* Size of region containing this structure */ + uint16_t version; /* 1 */ + uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */ +#define VMCLOCK_COUNTER_ARM_VCNT 0 +#define VMCLOCK_COUNTER_X86_TSC 1 +#define VMCLOCK_COUNTER_INVALID 0xff + uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */ +#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ +#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ +#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ +#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */ +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */ + + /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */ + uint32_t seq_count; /* Low bit means an update is in progress */ + /* + * This field changes to another non-repeating value when the CPU + * counter is disrupted, for example on live migration. This lets + * the guest know that it should discard any calibration it has + * performed of the counter against external sources (NTP/PTP/etc.). + */ + uint64_t disruption_marker; + uint64_t flags; + /* Indicates that the tai_offset_sec field is valid */ +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) + /* + * Optionally used to notify guests of pending maintenance events. + * A guest which provides latency-sensitive services may wish to + * remove itself from service if an event is coming up. Two flags + * indicate the approximate imminence of the event. 
+ */ +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3) +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4) +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID (1 << 5) +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID (1 << 6) + /* + * Even regardless of leap seconds, the time presented through this + * mechanism may not be strictly monotonic. If the counter slows down + * and the host adapts to this discovery, the time calculated from + * the value of the counter immediately after an update to this + * structure, may appear to be *earlier* than a calculation just + * before the update (while the counter was believed to be running + * faster than it now is). A guest operating system will typically + * *skew* its own system clock back towards the reference clock + * exposed here, rather than following this clock directly. If, + * however, this structure is being populated from such a system + * clock which is already handled in such a fashion and the results + * *are* guaranteed to be monotonic, such monotonicity can be + * advertised by setting this bit. + */ +#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7) + + uint8_t pad[2]; + uint8_t clock_status; +#define VMCLOCK_STATUS_UNKNOWN 0 +#define VMCLOCK_STATUS_INITIALIZING 1 +#define VMCLOCK_STATUS_SYNCHRONIZED 2 +#define VMCLOCK_STATUS_FREERUNNING 3 +#define VMCLOCK_STATUS_UNRELIABLE 4 + + /* + * The time exposed through this device is never smeared. This field + * corresponds to the 'subtype' field in virtio-rtc, which indicates + * the smearing method. However in this case it provides a *hint* to + * the guest operating system, such that *if* the guest OS wants to + * provide its users with an alternative clock which does not follow + * the POSIX CLOCK_REALTIME standard, it may do so in a fashion + * consistent with the other systems in the nearby environment. 
+ */ + uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */ +#define VMCLOCK_SMEARING_STRICT 0 +#define VMCLOCK_SMEARING_NOON_LINEAR 1 +#define VMCLOCK_SMEARING_UTC_SLS 2 + int16_t tai_offset_sec; + uint8_t leap_indicator; /* Based on VIRTIO_RTC_LEAP_xxx */ +#define VMCLOCK_LEAP_NONE 0 /* No known nearby leap second */ +#define VMCLOCK_LEAP_PRE_POS 1 /* Leap second + at end of month */ +#define VMCLOCK_LEAP_PRE_NEG 2 /* Leap second - at end of month */ +#define VMCLOCK_LEAP_POS 3 /* Set during 23:59:60 second */ +#define VMCLOCK_LEAP_NEG 4 /* Not used in VMCLOCK */ + /* + * These values are not (yet) in virtio-rtc. They indicate that a + * leap second *has* occurred at the start of the month. This allows + * a guest to generate a smeared clock from the accurate clock which + * this device provides, as smearing may need to continue for up to a + * period of time *after* the point of the leap second itself. Must + * be cleared by the 15th day of the month. + */ +#define VMCLOCK_LEAP_POST_POS 5 +#define VMCLOCK_LEAP_POST_NEG 6 + + /* Bit shift for counter_period_frac_sec and its error rate */ + uint8_t counter_period_shift; + /* + * Paired values of counter and UTC at a given point in time. + */ + uint64_t counter_value; + /* + * Counter frequency, and error margin. The unit of these fields is + * seconds >> (64 + counter_period_shift) + */ + uint64_t counter_period_frac_sec; + uint64_t counter_period_esterror_rate_frac_sec; + uint64_t counter_period_maxerror_rate_frac_sec; + + /* + * Time according to time_type field above. + */ + uint64_t time_sec; /* Seconds since time_type epoch */ + uint64_t time_frac_sec; /* (seconds >> 64) */ + uint64_t time_esterror_picosec; /* (± picoseconds) */ + uint64_t time_maxerror_picosec; /* (± picoseconds) */ +}; + +#endif /* __VMCLOCK_ABI_H__ */
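The fixed-point arithmetic implied by counter_value, counter_period_frac_sec,
counter_period_shift, time_sec and time_frac_sec can be illustrated with a
small userspace sketch. It mirrors the shape of mul_u64_u64_shr_add_u64() in
the driver above, but the names struct clock_params, struct clock_time and
cycles_to_time() are hypothetical, and it assumes a compiler that provides
unsigned __int128 (GCC or Clang on a 64-bit target). It ignores the seqcount,
clock_status and TAI handling entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the fields needed for the conversion. */
struct clock_params {
	uint64_t counter_value;			/* counter reading paired with the time below */
	uint64_t counter_period_frac_sec;	/* seconds >> (64 + shift) per tick */
	uint8_t  counter_period_shift;
	uint64_t time_sec;			/* whole seconds at counter_value */
	uint64_t time_frac_sec;			/* (seconds >> 64) at counter_value */
};

struct clock_time {
	uint64_t sec;
	uint32_t nsec;
};

static struct clock_time cycles_to_time(const struct clock_params *p,
					uint64_t cycles)
{
	uint64_t delta = cycles - p->counter_value;
	unsigned __int128 res;
	struct clock_time t;

	/* Shifting a 128-bit value by 128 or more would be undefined. */
	assert(p->counter_period_shift < 128);

	/* Elapsed time as a 64.64 fixed-point number of seconds. */
	res = (unsigned __int128)delta * p->counter_period_frac_sec;
	res >>= p->counter_period_shift;

	/* Add the fractional part of the reference time; carries reach bit 64. */
	res += p->time_frac_sec;

	t.sec = p->time_sec + (uint64_t)(res >> 64);
	/* Scale the low 64 bits (a fraction of a second) to nanoseconds. */
	t.nsec = (uint32_t)(((unsigned __int128)(uint64_t)res * 1000000000u) >> 64);
	return t;
}
```

As a worked case, with a tick period of 2^-32 s (counter_period_frac_sec =
1 << 32, shift 0), a reference time of 100.25 s and a delta of 1.5 * 2^32
ticks, the sketch yields 101 s and 750000000 ns.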