Message ID | 20240805173234.3542917-5-vdonnefort@google.com (mailing list archive) |
---|---|
State | Superseded |
Series | Tracefs support for pKVM |
On Mon, 05 Aug 2024 18:32:27 +0100,
Vincent Donnefort <vdonnefort@google.com> wrote:
>
> On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> This means when running with nVHE or protected KVM, it is easy to
> generate clock values from the hypervisor, synchronized with the kernel.

When you say "arch_timer" here, are you talking about the data
structure describing the timer? Or about the actual *counter*, a
system register provided by the HW?

I'm not sure the architecture-specific details are massively relevant,
given that this is an arch-agnostic change.

>
> For tracing purpose, the boot clock is interesting as it doesn't stop on
> suspend. Export it as part of the time snapshot. This will later allow
> the hypervisor to add boot clock timestamps to its events.

Isn't that the actual description of the change? By getting the boot
time as well as the parameters to compute an increment, you allow any
subsystem able to perform a snapshot to compute a delta from boot time
as long as they have access to the counter source.

>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..0fc6a61d64bd 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -275,18 +275,24 @@ struct ktime_timestamps {
>   *				 counter value
>   * @cycles:	Clocksource counter value to produce the system times
>   * @real:	Realtime system time
> + * @boot:	Boot time
>   * @raw:	Monotonic raw system time
>   * @cs_id:	Clocksource ID
>   * @clock_was_set_seq:	The sequence number of clock-was-set events
>   * @cs_was_changed_seq:	The sequence number of clocksource change events
> + * @mono_shift:	The monotonic clock slope shift
> + * @mono_mult:	The monotonic clock slope mult
>   */
>  struct system_time_snapshot {
>  	u64			cycles;
>  	ktime_t			real;
> +	ktime_t			boot;
>  	ktime_t			raw;
>  	enum clocksource_ids	cs_id;
>  	unsigned int		clock_was_set_seq;
>  	u8			cs_was_changed_seq;
> +	u32			mono_shift;
> +	u32			mono_mult;
>  };
>
>  /**
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 2fa87dcfeda9..6d0488a555a7 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1057,9 +1057,11 @@ noinstr time64_t __ktime_get_real_seconds(void)
>  void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
>  {
>  	struct timekeeper *tk = &tk_core.timekeeper;
> +	u32 mono_mult, mono_shift;
>  	unsigned int seq;
>  	ktime_t base_raw;
>  	ktime_t base_real;
> +	ktime_t base_boot;
>  	u64 nsec_raw;
>  	u64 nsec_real;
>  	u64 now;
> @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
>  		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
>  		base_real = ktime_add(tk->tkr_mono.base,
>  				      tk_core.timekeeper.offs_real);
> +		base_boot = ktime_add(tk->tkr_mono.base,
> +				      tk_core.timekeeper.offs_boot);
>  		base_raw = tk->tkr_raw.base;
>  		nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
>  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +		mono_mult = tk->tkr_mono.mult;
> +		mono_shift = tk->tkr_mono.shift;
>  	} while (read_seqcount_retry(&tk_core.seq, seq));
>
>  	systime_snapshot->cycles = now;
>  	systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> +	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
>  	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> +	systime_snapshot->mono_shift = mono_shift;
> +	systime_snapshot->mono_mult = mono_mult;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_snapshot);
>

This looks good to me, but you should probably Cc the timekeeping
maintainers (tglx, John Stultz, and Stephen Boyd).

Thanks,

	M.

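Editor's sketch: the "delta from boot time" that Marc describes could be computed by a snapshot consumer roughly as below. This is a userspace model and not code from the series. `struct snapshot_model` and `model_cycles_to_boot_ns()` are hypothetical names that only mirror the fields the patch adds, and the arithmetic assumes the counter has not gone backwards and that `delta * mult` fits in 64 bits.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the snapshot fields used here. */
struct snapshot_model {
	uint64_t cycles;	/* counter value when the snapshot was taken */
	int64_t  boot_ns;	/* boot clock at that instant, in nanoseconds */
	uint32_t mono_mult;	/* slope multiplier copied from the snapshot */
	uint32_t mono_shift;	/* slope shift copied from the snapshot */
};

/*
 * Extrapolate the boot clock for a counter value read after the snapshot:
 *
 *   boot(cyc) = boot_ns + (((cyc - cycles) * mult) >> shift)
 *
 * Assumes cyc >= snap->cycles and that the product does not overflow
 * 64 bits, so the snapshot has to be refreshed often enough.
 */
static int64_t model_cycles_to_boot_ns(const struct snapshot_model *snap,
				       uint64_t cyc)
{
	uint64_t delta = cyc - snap->cycles;

	return snap->boot_ns +
	       (int64_t)((delta * snap->mono_mult) >> snap->mono_shift);
}
```

With a 25 MHz counter modelled as mult = 40960 and shift = 10 (40 ns per cycle), a read 100 cycles after the snapshot lands 4000 ns after the snapshot's boot time. Nothing in this model tracks later adjustments to the kernel's slope, so it drifts from the kernel's own boot clock as the snapshot ages.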
On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team
<kernel-team@android.com> wrote:
>
> On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> This means when running with nVHE or protected KVM, it is easy to
> generate clock values from the hypervisor, synchronized with the kernel.
>
> For tracing purpose, the boot clock is interesting as it doesn't stop on
> suspend. Export it as part of the time snapshot. This will later allow
> the hypervisor to add boot clock timestamps to its events.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..0fc6a61d64bd 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -275,18 +275,24 @@ struct ktime_timestamps {
>   *				 counter value
>   * @cycles:	Clocksource counter value to produce the system times
>   * @real:	Realtime system time
> + * @boot:	Boot time

So, adding the boottime to this kernel-internal snapshot seems reasonable to me.

>   * @raw:	Monotonic raw system time
>   * @cs_id:	Clocksource ID
>   * @clock_was_set_seq:	The sequence number of clock-was-set events
>   * @cs_was_changed_seq:	The sequence number of clocksource change events
> + * @mono_shift:	The monotonic clock slope shift
> + * @mono_mult:	The monotonic clock slope mult

This bit, including the mult/shift pair however, isn't well explained
and is a little more worrying.

> @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
>  		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
>  		base_real = ktime_add(tk->tkr_mono.base,
>  				      tk_core.timekeeper.offs_real);
> +		base_boot = ktime_add(tk->tkr_mono.base,
> +				      tk_core.timekeeper.offs_boot);
>  		base_raw = tk->tkr_raw.base;
>  		nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
>  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +		mono_mult = tk->tkr_mono.mult;
> +		mono_shift = tk->tkr_mono.shift;
>  	} while (read_seqcount_retry(&tk_core.seq, seq));
>
>  	systime_snapshot->cycles = now;
>  	systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> +	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
>  	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> +	systime_snapshot->mono_shift = mono_shift;
> +	systime_snapshot->mono_mult = mono_mult;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_snapshot);

So this looks like you're trying to stuff kernel timekeeping internal
values into the snapshot so you can skirt around the timekeeping
subsystem and generate your own timestamps.

This ends up duplicating logic, but in an incomplete way. For
instance, you don't have things like ntp state, etc, so the timestamps
you generate will not exactly match the kernel, and may have
discontinuities. :(

Now for many cases "close enough" is fine. But the difficulty is the
expectation bar always raises, and eventually "close enough" isn't and
we have a broken interface that has to be fixed.

That said, I do get the need to have something like this is
legitimate. There have been a number of cases where external hardware
(PTP timestamps from NICs) or contexts (virt) are able to record
hardware clocksource timestamps on their own, and want to be able to
map that back to the kernel's (or maybe "a kernel's" if there are
multiple VMs) sense of time. Sometimes even wanting to do this quite
a bit later after the timestamp was recorded. The ktime_get_snapshot()
logic was added in the first place for this reason.

Some more aggressive approaches try to dump a bunch of the internal
kernel timekeeping state out to userland and call it an api.
See https://lore.kernel.org/lkml/410bbef9771ef8aa51704994a70d5965e367e2ce.camel@infradead.org/
for a recent (and thorough) effort there.

I'm very much not a fan of this approach, as it mimics older efforts
for userspace time calculations that were done before we settled on
VDSOs, which were very fragile and required years of keeping backwards
compatibility logic to map the current kernel state back to separate
structures and expensive conversions to different units that userland
expected.

The benefit with VDSO interface is while the data is exposed to
userland, the structure is not, and the logic is still kernel
controlled, so changes to internal state can be done without breaking
userland.

Something I have been thinking about is maybe it would be beneficial
to rework the timekeeping core so that given a clocksource timestamp,
it could calculate the time for that timestamp. While existing apis
would still do a new read of the clocksource, so the timestamps would
always increase, an old timestamp could be used to retro-calculate a
past time. The thing that prevents this now is that the timekeeping
core doesn't keep any history, so we can't correctly back-calculate
times before the last state change. But potentially we could keep a
buffer of timekeeper states associated with clocksource intervals, and
so we could find the right state to use for a given clocksource
timestamp. Now, this would still only work to a point, as we don't
want to keep tons of historical state. But then with this, maybe we
could switch to something more VDSO-like where the PTP drivers or host
systems could request a time given a timestamp (and probably some
clocksource id so we can sanity check everyone is using the same
clock), and we could still provide what they want without having to
expose all of our state.

Unfortunately though, this is all hand waving and pontificating on my
part, as it would be a large rework. But it seems something closer
where we share opaque kernel state along with logic with proper
syscall like APIs to do the calculations, would be a much better
approach over just exporting more kernel state as an API.

For a more short term approach, since you can't be exact outside of
the timekeeping logic, why not interpolate from the data
ktime_get_snapshot already provides to calculate your own sense of the
frequency?

thanks
-john

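Editor's sketch: one way to read John's short-term suggestion is to take two snapshots some time apart and derive a private slope from the (counter, boot time) pairs, instead of exporting the kernel's mult/shift. The code below is a userspace model under that assumption. `struct sample`, `struct slope`, `derive_slope()` and `slope_cycles_to_ns()` are hypothetical names and none of this comes from the series.

```c
#include <stdint.h>

/* One (counter, boot clock) pair, e.g. taken from a time snapshot. */
struct sample {
	uint64_t cycles;
	int64_t  boot_ns;
};

/* Privately derived slope: ns ~= (cycles * mult) >> shift. */
struct slope {
	uint32_t mult;
	uint32_t shift;
};

/*
 * Interpolate a slope from two samples, a taken before b.
 * Returns 0 on success, -1 if the samples are unusable or the
 * requested shift would overflow the intermediate or the result.
 */
static int derive_slope(const struct sample *a, const struct sample *b,
			uint32_t shift, struct slope *out)
{
	uint64_t dcyc, dns, mult;

	if (shift > 31 || b->cycles <= a->cycles || b->boot_ns < a->boot_ns)
		return -1;

	dcyc = b->cycles - a->cycles;
	dns = (uint64_t)(b->boot_ns - a->boot_ns);

	if (dns > (UINT64_MAX >> shift))	/* dns << shift would overflow */
		return -1;

	mult = (dns << shift) / dcyc;		/* truncating division */
	if (mult > UINT32_MAX)
		return -1;

	out->mult = (uint32_t)mult;
	out->shift = shift;
	return 0;
}

/* Convert a later counter read using sample a as the base point. */
static int64_t slope_cycles_to_ns(const struct sample *a,
				  const struct slope *s, uint64_t cyc)
{
	return a->boot_ns + (int64_t)(((cyc - a->cycles) * s->mult) >> s->shift);
}
```

The result is only an average over the interval between the two samples, so it shares the "close enough" caveat raised above: it will not follow NTP adjustments made inside that interval, and it should be recomputed periodically.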
On Thu, Aug 22 2024 at 11:13, John Stultz wrote:
> On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team
> <kernel-team@android.com> wrote:
>> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
>> index fc12a9ba2c88..0fc6a61d64bd 100644
>> --- a/include/linux/timekeeping.h
>> +++ b/include/linux/timekeeping.h
>> @@ -275,18 +275,24 @@ struct ktime_timestamps {
>>   *				 counter value
>>   * @cycles:	Clocksource counter value to produce the system times
>>   * @real:	Realtime system time
>> + * @boot:	Boot time
>
> So, adding the boottime to this kernel-internal snapshot seems reasonable to me.

Maybe for you, but I have zero context to this as this submission
obviously failed to CC the relevant mailing lists and maintainers...

Documentation/process is there for a reason...

Thanks,

	tglx

On Thu, Aug 22, 2024 at 10:13:34AM +0100, Marc Zyngier wrote:
> On Mon, 05 Aug 2024 18:32:27 +0100,
> Vincent Donnefort <vdonnefort@google.com> wrote:
> >
> > On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> > This means when running with nVHE or protected KVM, it is easy to
> > generate clock values from the hypervisor, synchronized with the kernel.
>
> When you say "arch_timer" here, are you talking about the data
> structure describing the timer? Or about the actual *counter*, a
> system register provided by the HW?
>
> I'm not sure the architecture-specific details are massively relevant,
> given that this is an arch-agnostic change.

I meant the counter but happy to drop this entire paragraph and just keep the
following one!

>
> >
> > For tracing purpose, the boot clock is interesting as it doesn't stop on
> > suspend. Export it as part of the time snapshot. This will later allow
> > the hypervisor to add boot clock timestamps to its events.
>
> Isn't that the actual description of the change? By getting the boot
> time as well as the parameters to compute an increment, you allow any
> subsystem able to perform a snapshot to compute a delta from boot time
> as long as they have access to the counter source.
>
> >
> > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> >
> > diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> > index fc12a9ba2c88..0fc6a61d64bd 100644
> > --- a/include/linux/timekeeping.h
> > +++ b/include/linux/timekeeping.h
> > @@ -275,18 +275,24 @@ struct ktime_timestamps {
> >   *				 counter value
> >   * @cycles:	Clocksource counter value to produce the system times
> >   * @real:	Realtime system time
> > + * @boot:	Boot time
> >   * @raw:	Monotonic raw system time
> >   * @cs_id:	Clocksource ID
> >   * @clock_was_set_seq:	The sequence number of clock-was-set events
> >   * @cs_was_changed_seq:	The sequence number of clocksource change events
> > + * @mono_shift:	The monotonic clock slope shift
> > + * @mono_mult:	The monotonic clock slope mult
> >   */
> >  struct system_time_snapshot {
> >  	u64			cycles;
> >  	ktime_t			real;
> > +	ktime_t			boot;
> >  	ktime_t			raw;
> >  	enum clocksource_ids	cs_id;
> >  	unsigned int		clock_was_set_seq;
> >  	u8			cs_was_changed_seq;
> > +	u32			mono_shift;
> > +	u32			mono_mult;
> >  };
> >
> >  /**
> > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> > index 2fa87dcfeda9..6d0488a555a7 100644
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> > @@ -1057,9 +1057,11 @@ noinstr time64_t __ktime_get_real_seconds(void)
> >  void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> >  {
> >  	struct timekeeper *tk = &tk_core.timekeeper;
> > +	u32 mono_mult, mono_shift;
> >  	unsigned int seq;
> >  	ktime_t base_raw;
> >  	ktime_t base_real;
> > +	ktime_t base_boot;
> >  	u64 nsec_raw;
> >  	u64 nsec_real;
> >  	u64 now;
> > @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> >  		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
> >  		base_real = ktime_add(tk->tkr_mono.base,
> >  				      tk_core.timekeeper.offs_real);
> > +		base_boot = ktime_add(tk->tkr_mono.base,
> > +				      tk_core.timekeeper.offs_boot);
> >  		base_raw = tk->tkr_raw.base;
> >  		nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> >  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> > +		mono_mult = tk->tkr_mono.mult;
> > +		mono_shift = tk->tkr_mono.shift;
> >  	} while (read_seqcount_retry(&tk_core.seq, seq));
> >
> >  	systime_snapshot->cycles = now;
> >  	systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> > +	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
> >  	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> > +	systime_snapshot->mono_shift = mono_shift;
> > +	systime_snapshot->mono_mult = mono_mult;
> >  }
> >  EXPORT_SYMBOL_GPL(ktime_get_snapshot);
> >
>
> This looks good to me, but you should probably Cc the timekeeping
> maintainers (tglx, John Stultz, and Stephen Boyd).

Yep, my bad!

>
> Thanks,
>
> 	M.
>
> --
> Without deviation from the norm, progress is not possible.

On Thu, Aug 22, 2024 at 11:13:11AM -0700, John Stultz wrote:
> On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> > This means when running with nVHE or protected KVM, it is easy to
> > generate clock values from the hypervisor, synchronized with the kernel.
> >
> > For tracing purpose, the boot clock is interesting as it doesn't stop on
> > suspend. Export it as part of the time snapshot. This will later allow
> > the hypervisor to add boot clock timestamps to its events.
> >
> > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> >
> > diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> > index fc12a9ba2c88..0fc6a61d64bd 100644
> > --- a/include/linux/timekeeping.h
> > +++ b/include/linux/timekeeping.h
> > @@ -275,18 +275,24 @@ struct ktime_timestamps {
> >   *				 counter value
> >   * @cycles:	Clocksource counter value to produce the system times
> >   * @real:	Realtime system time
> > + * @boot:	Boot time
>
> So, adding the boottime to this kernel-internal snapshot seems reasonable to me.
>
> >   * @raw:	Monotonic raw system time
> >   * @cs_id:	Clocksource ID
> >   * @clock_was_set_seq:	The sequence number of clock-was-set events
> >   * @cs_was_changed_seq:	The sequence number of clocksource change events
> > + * @mono_shift:	The monotonic clock slope shift
> > + * @mono_mult:	The monotonic clock slope mult
> >
> This bit, including the mult/shift pair however, isn't well explained
> and is a little more worrying.
>
> > @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> >  		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
> >  		base_real = ktime_add(tk->tkr_mono.base,
> >  				      tk_core.timekeeper.offs_real);
> > +		base_boot = ktime_add(tk->tkr_mono.base,
> > +				      tk_core.timekeeper.offs_boot);
> >  		base_raw = tk->tkr_raw.base;
> >  		nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> >  		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> > +		mono_mult = tk->tkr_mono.mult;
> > +		mono_shift = tk->tkr_mono.shift;
> >  	} while (read_seqcount_retry(&tk_core.seq, seq));
> >
> >  	systime_snapshot->cycles = now;
> >  	systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> > +	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
> >  	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> > +	systime_snapshot->mono_shift = mono_shift;
> > +	systime_snapshot->mono_mult = mono_mult;
> >  }
> >  EXPORT_SYMBOL_GPL(ktime_get_snapshot);
>
> So this looks like you're trying to stuff kernel timekeeping internal
> values into the snapshot so you can skirt around the timekeeping
> subsystem and generate your own timestamps.
>
> This ends up duplicating logic, but in an incomplete way. For
> instance, you don't have things like ntp state, etc, so the timestamps
> you generate will not exactly match the kernel, and may have
> discontinuities. :(
>
> Now for many cases "close enough" is fine. But the difficulty is the
> expectation bar always raises, and eventually "close enough" isn't and
> we have a broken interface that has to be fixed.
>
> That said, I do get the need to have something like this is
> legitimate. There have been a number of cases where external hardware
> (PTP timestamps from NICs) or contexts (virt) are able to record
> hardware clocksource timestamps on their own, and want to be able to
> map that back to the kernel's (or maybe "a kernel's" if there are
> multiple VMs) sense of time. Sometimes even wanting to do this quite
> a bit later after the timestamp was recorded. The ktime_get_snapshot()
> logic was added in the first place for this reason.
>
> Some more aggressive approaches try to dump a bunch of the internal
> kernel timekeeping state out to userland and call it an api.
> See https://lore.kernel.org/lkml/410bbef9771ef8aa51704994a70d5965e367e2ce.camel@infradead.org/
> for a recent (and thorough) effort there.
>
> I'm very much not a fan of this approach, as it mimics older efforts
> for userspace time calculations that were done before we settled on
> VDSOs, which were very fragile and required years of keeping backwards
> compatibility logic to map the current kernel state back to separate
> structures and expensive conversions to different units that userland
> expected.
>
> The benefit with VDSO interface is while the data is exposed to
> userland, the structure is not, and the logic is still kernel
> controlled, so changes to internal state can be done without breaking
> userland.
>
> Something I have been thinking about is maybe it would be beneficial
> to rework the timekeeping core so that given a clocksource timestamp,
> it could calculate the time for that timestamp. While existing apis
> would still do a new read of the clocksource, so the timestamps would
> always increase, an old timestamp could be used to retro-calculate a
> past time. The thing that prevents this now is that the timekeeping
> core doesn't keep any history, so we can't correctly back-calculate
> times before the last state change. But potentially we could keep a
> buffer of timekeeper states associated with clocksource intervals, and
> so we could find the right state to use for a given clocksource
> timestamp. Now, this would still only work to a point, as we don't
> want to keep tons of historical state. But then with this, maybe we
> could switch to something more VDSO-like where the PTP drivers or host
> systems could request a time given a timestamp (and probably some
> clocksource id so we can sanity check everyone is using the same
> clock), and we could still provide what they want without having to
> expose all of our state.
>
> Unfortunately though, this is all hand waving and pontificating on my
> part, as it would be a large rework. But it seems something closer
> where we share opaque kernel state along with logic with proper
> syscall like APIs to do the calculations, would be a much better
> approach over just exporting more kernel state as an API.
>
> For a more short term approach, since you can't be exact outside of
> the timekeeping logic, why not interpolate from the data
> ktime_get_snapshot already provides to calculate your own sense of the
> frequency?

Understood, I shouldn't sneak out mult and shift.

So for the following version, I'll just use the boot clock value and process my
"own" mult and "shift".

Thanks for having a look at the change!

>
> thanks
> -john

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fc12a9ba2c88..0fc6a61d64bd 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -275,18 +275,24 @@ struct ktime_timestamps {
  *				 counter value
  * @cycles:	Clocksource counter value to produce the system times
  * @real:	Realtime system time
+ * @boot:	Boot time
  * @raw:	Monotonic raw system time
  * @cs_id:	Clocksource ID
  * @clock_was_set_seq:	The sequence number of clock-was-set events
  * @cs_was_changed_seq:	The sequence number of clocksource change events
+ * @mono_shift:	The monotonic clock slope shift
+ * @mono_mult:	The monotonic clock slope mult
  */
 struct system_time_snapshot {
 	u64			cycles;
 	ktime_t			real;
+	ktime_t			boot;
 	ktime_t			raw;
 	enum clocksource_ids	cs_id;
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
+	u32			mono_shift;
+	u32			mono_mult;
 };
 
 /**
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 2fa87dcfeda9..6d0488a555a7 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1057,9 +1057,11 @@ noinstr time64_t __ktime_get_real_seconds(void)
 void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
+	u32 mono_mult, mono_shift;
 	unsigned int seq;
 	ktime_t base_raw;
 	ktime_t base_real;
+	ktime_t base_boot;
 	u64 nsec_raw;
 	u64 nsec_real;
 	u64 now;
@@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
 		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
 		base_real = ktime_add(tk->tkr_mono.base,
 				      tk_core.timekeeper.offs_real);
+		base_boot = ktime_add(tk->tkr_mono.base,
+				      tk_core.timekeeper.offs_boot);
 		base_raw = tk->tkr_raw.base;
 		nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
 		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
+		mono_mult = tk->tkr_mono.mult;
+		mono_shift = tk->tkr_mono.shift;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	systime_snapshot->cycles = now;
 	systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
+	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
 	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
+	systime_snapshot->mono_shift = mono_shift;
+	systime_snapshot->mono_mult = mono_mult;
 }
 EXPORT_SYMBOL_GPL(ktime_get_snapshot);

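Editor's sketch: the new fields are copied inside the existing `do { ... } while (read_seqcount_retry(...))` loop so that base, mult and shift all come from one consistent timekeeper state. The userspace model below illustrates that retry pattern with C11 atomics. It is not the kernel's seqcount implementation; `struct tk_model`, `struct tk_view`, `tk_model_update()` and `tk_model_snapshot()` are hypothetical names, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the timekeeper fields read in the loop. */
struct tk_model {
	atomic_uint seq;		/* even: stable, odd: update in progress */
	_Atomic int64_t base_ns;
	_Atomic uint32_t mult;
	_Atomic uint32_t shift;
};

/* Plain copy handed back to the reader. */
struct tk_view {
	int64_t base_ns;
	uint32_t mult;
	uint32_t shift;
};

/* Writer side (single writer assumed): odd seq, update, even seq. */
static void tk_model_update(struct tk_model *tk, int64_t base_ns,
			    uint32_t mult, uint32_t shift)
{
	unsigned int s = atomic_load_explicit(&tk->seq, memory_order_relaxed);

	atomic_store_explicit(&tk->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&tk->base_ns, base_ns, memory_order_relaxed);
	atomic_store_explicit(&tk->mult, mult, memory_order_relaxed);
	atomic_store_explicit(&tk->shift, shift, memory_order_relaxed);
	atomic_store_explicit(&tk->seq, s + 2, memory_order_release);
}

/* Reader side: retry while a writer was active or seq changed under us. */
static struct tk_view tk_model_snapshot(struct tk_model *tk)
{
	struct tk_view v;
	unsigned int s;

	do {
		s = atomic_load_explicit(&tk->seq, memory_order_acquire);
		v.base_ns = atomic_load_explicit(&tk->base_ns, memory_order_relaxed);
		v.mult = atomic_load_explicit(&tk->mult, memory_order_relaxed);
		v.shift = atomic_load_explicit(&tk->shift, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while ((s & 1) ||
		 s != atomic_load_explicit(&tk->seq, memory_order_relaxed));

	return v;
}
```

Reading mult and shift outside the loop could pair a base from one update with a slope from another, which is why the patch copies them into locals before the retry check and only stores them into the snapshot afterwards.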
On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
This means when running with nVHE or protected KVM, it is easy to
generate clock values from the hypervisor, synchronized with the kernel.

For tracing purpose, the boot clock is interesting as it doesn't stop on
suspend. Export it as part of the time snapshot. This will later allow
the hypervisor to add boot clock timestamps to its events.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>