Message ID | 20250320200306.1712599-1-jstultz@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2,1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids | expand |
On Thu, Mar 20, 2025 at 01:03:00PM -0700, John Stultz wrote: > +static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset, > + enum timekeeping_adv_mode mode, > + unsigned int *clock_set) > + * Also reset tk::ntp_error as it does not make sense to keep the > + * old accumulated error around in this case. > + */ I'm not sure if I still understand the timekeeping code correctly, but that doesn't seem right to me. At least the comment should explain why it does not make sense to keep the NTP error. Resetting the NTP error causes a small time step. An NTP/PTP client can be setting the frequency very frequently, e.g. up to 128 times per second and the interval between updates can be random. If the timing was right, I suspect it could cause a measurable drift. The client should be able to compensate for it, but why make its job harder by making the clock less predictable. My expectation for the clock is that its frequency will not change if the same (or only slightly different) frequency is set repeatedly by adjtimex(). > + if (mode == TK_ADV_FREQ) { > + timekeeping_forward(tk, tk->tkr_mono.cycle_last + offset); > + tk->ntp_error = 0; > + return 0; > + }
On Tue, Mar 25 2025 at 12:32, Miroslav Lichvar wrote: > On Thu, Mar 20, 2025 at 01:03:00PM -0700, John Stultz wrote: >> +static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset, >> + enum timekeeping_adv_mode mode, >> + unsigned int *clock_set) > >> + * Also reset tk::ntp_error as it does not make sense to keep the >> + * old accumulated error around in this case. >> + */ > > I'm not sure if I still understand the timekeeping code correctly, but > that doesn't seem right to me. At least the comment should explain why > it does not make sense to keep the NTP error. > > Resetting the NTP error causes a small time step. An NTP/PTP client > can be setting the frequency very frequently, e.g. up to 128 times per > second and the interval between updates can be random. If the timing I never observed that behaviour, but I'm not a NTP/PTP wizard/power user. > was right, I suspect it could cause a measurable drift. The client > should be able to compensate for it, but why make its job harder by > making the clock less predictable. My expectation for the clock is > that its frequency will not change if the same (or only slightly > different) frequency is set repeatedly by adjtimex(). The point is that timekeeper::ntp_error accumulates the error between NTP and the clock interval. With John's change to forward the clock in case of adjtimex() setting the tick length or frequency, the previously accumulated information is out of sync because the forwarding resets the period asynchronously. The fundamental property of the timekeeper adjustment is that it advances everything in multiples of the clock interval. The clock interval is the number of hardware clock increments per tick, which has been determined from the initial multiplier/shift pair of the clock source at the point where the clock source is installed as the timekeeper source. It never changes throughout the life time of the clocksource. The original implementation respected this base period, but John's approach of forwarding, which cures the coarse time getter issue, violates it. As a consequence the previous error accumulation is not longer based on the base period because the period has been reset to the random point in time when adjtimex() was invoked, which makes the error accumulation a random number. There are two ways to deal with that. Both require to revert this change completely. 1) Handle the coarse time getter problem seperately and leave the existing adjtimex logic alone. That was my initial suggestion: https://lore.kernel.org/all/87cyej5rid.ffs@tglx 2) Handle adjtimex(ADJ_TICK/ADJ_FREQUENCY) at the next tick boundary instead of doing it immediately at the random point in time when adjtimex() is invoked. That cures the coarse time getter problem as well, but obviously delays the multiplier update to the next tick, which means that only the last adjtimex(ADJ_TICK/ADJ_FREQUENCY) invocation between two ticks becomes effective. From a pure mathematical point of view, this is keeping everything consistent. A quick test shows that it works. Though again, I'm not the NTP wizard here and don't know which dragons are lurking in the NTP/PTP clients. Patch on top of the revert below. That requires some thought vs. NOHZ delaying the next tick, but that's a solvable problem. Thanks, tglx --- --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -34,14 +34,6 @@ #define TK_UPDATE_ALL (TK_CLEAR_NTP | TK_CLOCK_WAS_SET) -enum timekeeping_adv_mode { - /* Update timekeeper when a tick has passed */ - TK_ADV_TICK, - - /* Update timekeeper on a direct frequency change */ - TK_ADV_FREQ -}; - /* * The most important data for readout fits into a single 64 byte * cache line. @@ -2155,7 +2147,7 @@ static u64 logarithmic_accumulation(stru * timekeeping_advance - Updates the timekeeper to the current time and * current NTP tick length */ -static bool timekeeping_advance(enum timekeeping_adv_mode mode) +static bool timekeeping_advance(void) { struct timekeeper *tk = &tk_core.shadow_timekeeper; struct timekeeper *real_tk = &tk_core.timekeeper; @@ -2173,8 +2165,8 @@ static bool timekeeping_advance(enum tim tk->tkr_mono.cycle_last, tk->tkr_mono.mask, tk->tkr_mono.clock->max_raw_delta); - /* Check if there's really nothing to do */ - if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK) + /* Check if there's really something to do */ + if (offset < real_tk->cycle_interval) return false; /* @@ -2216,7 +2208,7 @@ static bool timekeeping_advance(enum tim */ void update_wall_time(void) { - if (timekeeping_advance(TK_ADV_TICK)) + if (timekeeping_advance()) clock_was_set_delayed(); } @@ -2548,10 +2540,6 @@ int do_adjtimex(struct __kernel_timex *t audit_ntp_log(&ad); - /* Update the multiplier immediately if frequency was set directly */ - if (txc->modes & (ADJ_FREQUENCY | ADJ_TICK)) - clock_set |= timekeeping_advance(TK_ADV_FREQ); - if (clock_set) clock_was_set(CLOCK_SET_WALL);
On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote: > On Tue, Mar 25 2025 at 12:32, Miroslav Lichvar wrote: > > Resetting the NTP error causes a small time step. An NTP/PTP client > > can be setting the frequency very frequently, e.g. up to 128 times per > > second and the interval between updates can be random. If the timing > > I never observed that behaviour, but I'm not a NTP/PTP wizard/power user. On a machine that has a /dev/ptp device available, a simple test to observe such a high update rate is to run: 1) phc_ctl /dev/ptp0 set 2) phc2sys -m -q -O 0 -s /dev/ptp0 -R 128 or alternatively 2) chronyd -d 'refclock PHC /dev/ptp0 poll -7' > The original implementation respected this base period, but John's > approach of forwarding, which cures the coarse time getter issue, > violates it. As a consequence the previous error accumulation is not > longer based on the base period because the period has been reset to the > random point in time when adjtimex() was invoked, which makes the error > accumulation a random number. I see, so that value of the NTP error is already wrong at that point where it's reset to 0. To clearly see the difference with the new code, I made an attempt to update the old linux-tktest simulation that was used back when the multiplier adjustment was reworked, but there are too many missing things now and I gave up. Maybe I could simply patch the kernel to force a small clock multiplier to increase the rate at which the error accumulates.
On Thu, Mar 27 2025 at 16:42, Miroslav Lichvar wrote: > On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote: >> The original implementation respected this base period, but John's >> approach of forwarding, which cures the coarse time getter issue, >> violates it. As a consequence the previous error accumulation is not >> longer based on the base period because the period has been reset to the >> random point in time when adjtimex() was invoked, which makes the error >> accumulation a random number. > > I see, so that value of the NTP error is already wrong at that point > where it's reset to 0. > > To clearly see the difference with the new code, I made an attempt > to update the old linux-tktest simulation that was used back when the > multiplier adjustment was reworked, but there are too many missing > things now and I gave up. Can you point me to that code? It would be probably useful to create a test mechanism which allows to exercise all of this in a simulated way so we actually don't have to wonder every time we change a bit what the consequences are. Thanks, tglx
On Thu, Mar 27, 2025 at 06:32:27PM +0100, Thomas Gleixner wrote: > On Thu, Mar 27 2025 at 16:42, Miroslav Lichvar wrote: > > On Thu, Mar 27, 2025 at 10:22:31AM +0100, Thomas Gleixner wrote: > > To clearly see the difference with the new code, I made an attempt > > to update the old linux-tktest simulation that was used back when the > > multiplier adjustment was reworked, but there are too many missing > > things now and I gave up. > > Can you point me to that code? It's this thing: https://github.com/mlichvar/linux-tktest > It would be probably useful to create a test mechanism which allows to > exercise all of this in a simulated way so we actually don't have to > wonder every time we change a bit what the consequences are. Yes, that would be very nice if we could run the timekeeping code in a deterministic simulated environment with a configurable clocksource, timing of kernel updates, timing and values of injected adjtimex() calls, etc. The question is how to isolate it.
On Thu, Mar 27, 2025 at 04:42:49PM +0100, Miroslav Lichvar wrote: > Maybe I could simply patch the kernel to force a small clock > multiplier to increase the rate at which the error accumulates. I tried that and it indeed makes the issue clearly visible. The COARSE fix makes the clock less stable. It's barely visible with the normal multiplier, at least for the clocksource I tested, but a reduced multiplier forces a larger NTP error and raises it above the precision and instability of the system and reference clocks. The test was done on a machine with a TSC clocksource (3GHz CPU with disabled frequency scaling - normal multplier is 5592407) and tried a multiplier reduced by 4, 16, 64 with this COARSE-fixing patch not applied and applied. Each test ran for 1 minute and produced an average value of skew - stability of the clock frequency as reported by chronyd in the tracking log when synchronizing to a free-running PTP clock at 64, 16, and 4 updates per second. It's in parts per million (resolution in the chrony log is limited to 0.001 ppm). Mult reduction Updates/sec Skew before Skew after 1 4 0.000 0.000 1 16 0.001 0.002 1 64 0.002 0.006 4 4 0.001 0.001 4 16 0.003 0.005 4 64 0.005 0.015 16 4 0.004 0.009 16 16 0.011 0.069 16 64 0.020 0.117 64 4 0.013 0.012 64 16 0.030 0.107 64 64 0.058 0.879
On Mon, Mar 31 2025 at 16:53, Miroslav Lichvar wrote: > On Thu, Mar 27, 2025 at 04:42:49PM +0100, Miroslav Lichvar wrote: >> Maybe I could simply patch the kernel to force a small clock >> multiplier to increase the rate at which the error accumulates. > > I tried that and it indeed makes the issue clearly visible. The COARSE > fix makes the clock less stable. It's barely visible with the normal > multiplier, at least for the clocksource I tested, but a reduced > multiplier forces a larger NTP error and raises it above the precision > and instability of the system and reference clocks. > > The test was done on a machine with a TSC clocksource (3GHz CPU with > disabled frequency scaling - normal multplier is 5592407) and tried a > multiplier reduced by 4, 16, 64 with this COARSE-fixing patch not > applied and applied. Each test ran for 1 minute and produced an > average value of skew - stability of the clock frequency as reported > by chronyd in the tracking log when synchronizing to a free-running > PTP clock at 64, 16, and 4 updates per second. It's in parts per > million (resolution in the chrony log is limited to 0.001 ppm). > > Mult reduction Updates/sec Skew before Skew after > 1 4 0.000 0.000 > 1 16 0.001 0.002 > 1 64 0.002 0.006 > 4 4 0.001 0.001 > 4 16 0.003 0.005 > 4 64 0.005 0.015 > 16 4 0.004 0.009 > 16 16 0.011 0.069 > 16 64 0.020 0.117 > 64 4 0.013 0.012 > 64 16 0.030 0.107 > 64 64 0.058 0.879 Hrm. Can you try the delta patch below? Thanks, tglx --- kernel/time/timekeeping.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -2234,8 +2234,8 @@ static bool timekeeping_advance(enum tim tk->tkr_mono.cycle_last, tk->tkr_mono.mask, tk->tkr_mono.clock->max_raw_delta); - /* Check if there's really nothing to do */ - if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK) + /* Check if there's really something to do */ + if (offset < real_tk->cycle_interval) return false; offset = timekeeping_accumulate(tk, offset, mode, &clock_set);
On Tue, Apr 01, 2025 at 08:34:23AM +0200, Thomas Gleixner wrote: > On Mon, Mar 31 2025 at 16:53, Miroslav Lichvar wrote: > > Mult reduction Updates/sec Skew before Skew after > > 16 4 0.004 0.009 > > 16 16 0.011 0.069 > > 16 64 0.020 0.117 > > 64 4 0.013 0.012 > > 64 16 0.030 0.107 > > 64 64 0.058 0.879 > > Hrm. > > Can you try the delta patch below? It seems to improve the worst cases, but overall it's still a regression. Mult reduction Updates/sec Skew 16 4 0.012 16 16 0.014 16 64 0.033 64 4 0.012 64 16 0.105 64 64 0.138 Thanks,
On Tue, Apr 01 2025 at 13:19, Miroslav Lichvar wrote: > On Tue, Apr 01, 2025 at 08:34:23AM +0200, Thomas Gleixner wrote: >> On Mon, Mar 31 2025 at 16:53, Miroslav Lichvar wrote: >> > Mult reduction Updates/sec Skew before Skew after >> > 16 4 0.004 0.009 >> > 16 16 0.011 0.069 >> > 16 64 0.020 0.117 >> > 64 4 0.013 0.012 >> > 64 16 0.030 0.107 >> > 64 64 0.058 0.879 >> >> Hrm. >> >> Can you try the delta patch below? > > It seems to improve the worst cases, but overall it's still > a regression. > > Mult reduction Updates/sec Skew > 16 4 0.012 > 16 16 0.014 > 16 64 0.033 > 64 4 0.012 > 64 16 0.105 > 64 64 0.138 That's weird as it only delays the update to the next tick. My original approach of maintaining seperate state for the coarse time keeper is preserving the existing NTP behaviour. Patch applies after reverting 757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids"). Thanks, tglx --- --- a/include/linux/timekeeper_internal.h +++ b/include/linux/timekeeper_internal.h @@ -51,7 +51,7 @@ struct tk_read_base { * @offs_real: Offset clock monotonic -> clock realtime * @offs_boot: Offset clock monotonic -> clock boottime * @offs_tai: Offset clock monotonic -> clock tai - * @tai_offset: The current UTC to TAI offset in seconds + * @coarse_nsec: The nanoseconds part for coarse time getters * @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW * @raw_sec: CLOCK_MONOTONIC_RAW time in seconds * @clock_was_set_seq: The sequence number of clock was set events @@ -76,6 +76,7 @@ struct tk_read_base { * ntp shifted nano seconds. * @ntp_err_mult: Multiplication factor for scaled math conversion * @skip_second_overflow: Flag used to avoid updating NTP twice with same second + * @tai_offset: The current UTC to TAI offset in seconds * * Note: For timespec(64) based interfaces wall_to_monotonic is what * we need to add to xtime (or xtime corrected for sub jiffy times) @@ -100,7 +101,7 @@ struct tk_read_base { * which results in the following cacheline layout: * * 0: seqcount, tkr_mono - * 1: xtime_sec ... tai_offset + * 1: xtime_sec ... coarse_nsec * 2: tkr_raw, raw_sec * 3,4: Internal variables * @@ -121,7 +122,7 @@ struct timekeeper { ktime_t offs_real; ktime_t offs_boot; ktime_t offs_tai; - s32 tai_offset; + u32 coarse_nsec; /* Cacheline 2: */ struct tk_read_base tkr_raw; @@ -144,6 +145,7 @@ struct timekeeper { u32 ntp_error_shift; u32 ntp_err_mult; u32 skip_second_overflow; + s32 tai_offset; }; #ifdef CONFIG_GENERIC_TIME_VSYSCALL --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -31,6 +31,7 @@ #define TK_CLEAR_NTP (1 << 0) #define TK_CLOCK_WAS_SET (1 << 1) +#define TK_RETAIN_COARSE (1 << 2) #define TK_UPDATE_ALL (TK_CLEAR_NTP | TK_CLOCK_WAS_SET) @@ -164,6 +165,15 @@ static inline struct timespec64 tk_xtime return ts; } +static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk) +{ + struct timespec64 ts; + + ts.tv_sec = tk->xtime_sec; + ts.tv_nsec = tk->coarse_nsec; + return ts; +} + static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts) { tk->xtime_sec = ts->tv_sec; @@ -636,7 +646,34 @@ static void timekeeping_restore_shadow(s memcpy(&tkd->shadow_timekeeper, &tkd->timekeeper, sizeof(tkd->timekeeper)); } -static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int action) +/* + * Update the nanoseconds part for the coarse time keepers. They can't rely + * on xtime_nsec because xtime_nsec is adjusted when the multiplication + * factor of the clock is adjusted. See timekeeping_apply_adjustment(). + * + * This is required because tk_read::cycle_last must be advanced by + * timekeeper::cycle_interval so that the accumulation happens with a + * periodic reference. + * + * But that adjustment of xtime_nsec can make it go backward to compensate + * for a larger multiplicator. + * + * @offset contains the leftover cycles which were not accumulated. + * Therefore the nanoseconds portion of the time when the clocksource was + * read in timekeeping_advance() is: + * + * nsec = (xtime_nsec + offset * mult) >> shift; + * + * Calculate that value and store it in timekeeper::coarse_nsec, from where + * the coarse time getters consume it. + */ +static inline void tk_update_coarse_nsecs(struct timekeeper *tk, u64 offset) +{ + offset *= tk->tkr_mono.mult; + tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift; +} + +static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int action, u64 offset) { struct timekeeper *tk = &tk_core.shadow_timekeeper; @@ -659,6 +696,9 @@ static void timekeeping_update_from_shad tk_update_leap_state(tk); tk_update_ktime_data(tk); + if (!(action & TK_RETAIN_COARSE)) + tk_update_coarse_nsecs(tk, offset); + update_vsyscall(tk); update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET); @@ -804,8 +844,8 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset) ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs) { struct timekeeper *tk = &tk_core.timekeeper; - unsigned int seq; ktime_t base, *offset = offsets[offs]; + unsigned int seq; u64 nsecs; WARN_ON(timekeeping_suspended); @@ -813,7 +853,7 @@ ktime_t ktime_get_coarse_with_offset(enu do { seq = read_seqcount_begin(&tk_core.seq); base = ktime_add(tk->tkr_mono.base, *offset); - nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + nsecs = tk->coarse_nsec; } while (read_seqcount_retry(&tk_core.seq, seq)); @@ -1374,7 +1414,7 @@ int do_settimeofday64(const struct times tk_set_wall_to_mono(tks, timespec64_sub(tks->wall_to_monotonic, ts_delta)); tk_set_xtime(tks, ts); - timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL); + timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0); } /* Signal hrtimers about time change */ @@ -1413,7 +1453,7 @@ static int timekeeping_inject_offset(con tk_xtime_add(tks, ts); tk_set_wall_to_mono(tks, timespec64_sub(tks->wall_to_monotonic, *ts)); - timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL); + timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0); } /* Signal hrtimers about time change */ @@ -1493,7 +1533,7 @@ static int change_clocksource(void *data timekeeping_forward_now(tks); old = tks->tkr_mono.clock; tk_setup_internals(tks, new); - timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL); + timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0); } if (old) { @@ -1690,7 +1730,7 @@ void __init timekeeping_init(void) tk_set_wall_to_mono(tks, wall_to_mono); - timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET); + timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET, 0); } /* time in seconds when suspend began for persistent clock */ @@ -1774,7 +1814,7 @@ void timekeeping_inject_sleeptime64(cons suspend_timing_needed = false; timekeeping_forward_now(tks); __timekeeping_inject_sleeptime(tks, delta); - timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL); + timekeeping_update_from_shadow(&tk_core, TK_UPDATE_ALL, 0); } /* Signal hrtimers about time change */ @@ -1834,7 +1874,7 @@ void timekeeping_resume(void) tks->ntp_error = 0; timekeeping_suspended = 0; - timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET); + timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET, 0); raw_spin_unlock_irqrestore(&tk_core.lock, flags); touch_softlockup_watchdog(); @@ -1901,7 +1941,7 @@ int timekeeping_suspend(void) } } - timekeeping_update_from_shadow(&tk_core, 0); + timekeeping_update_from_shadow(&tk_core, 0, 0); halt_fast_timekeeper(tks); raw_spin_unlock_irqrestore(&tk_core.lock, flags); @@ -2205,7 +2245,7 @@ static bool timekeeping_advance(enum tim */ clock_set |= accumulate_nsecs_to_secs(tk); - timekeeping_update_from_shadow(&tk_core, clock_set); + timekeeping_update_from_shadow(&tk_core, clock_set, offset); return !!clock_set; } @@ -2248,7 +2288,7 @@ void ktime_get_coarse_real_ts64(struct t do { seq = read_seqcount_begin(&tk_core.seq); - *ts = tk_xtime(tk); + *ts = tk_xtime_coarse(tk); } while (read_seqcount_retry(&tk_core.seq, seq)); } EXPORT_SYMBOL(ktime_get_coarse_real_ts64); @@ -2271,7 +2311,7 @@ void ktime_get_coarse_real_ts64_mg(struc do { seq = read_seqcount_begin(&tk_core.seq); - *ts = tk_xtime(tk); + *ts = tk_xtime_coarse(tk); offset = tk_core.timekeeper.offs_real; } while (read_seqcount_retry(&tk_core.seq, seq)); @@ -2350,12 +2390,12 @@ void ktime_get_coarse_ts64(struct timesp do { seq = read_seqcount_begin(&tk_core.seq); - now = tk_xtime(tk); + now = tk_xtime_coarse(tk); mono = tk->wall_to_monotonic; } while (read_seqcount_retry(&tk_core.seq, seq)); set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec, - now.tv_nsec + mono.tv_nsec); + now.tv_nsec + mono.tv_nsec); } EXPORT_SYMBOL(ktime_get_coarse_ts64); @@ -2539,7 +2579,8 @@ int do_adjtimex(struct __kernel_timex *t if (tai != orig_tai) { __timekeeping_set_tai_offset(tks, tai); - timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET); + timekeeping_update_from_shadow(&tk_core, TK_CLOCK_WAS_SET | + TK_RETAIN_COARSE, 0); clock_set = true; } else { tk_update_leap_state_all(&tk_core); --- a/kernel/time/vsyscall.c +++ b/kernel/time/vsyscall.c @@ -98,12 +98,12 @@ void update_vsyscall(struct timekeeper * /* CLOCK_REALTIME_COARSE */ vdso_ts = &vc[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE]; vdso_ts->sec = tk->xtime_sec; - vdso_ts->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + vdso_ts->nsec = tk->coarse_nsec; /* CLOCK_MONOTONIC_COARSE */ vdso_ts = &vc[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE]; vdso_ts->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec; - nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + nsec = tk->coarse_nsec; nsec = nsec + tk->wall_to_monotonic.tv_nsec; vdso_ts->sec += __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 1e67d076f1955..929846b8b45ab 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -682,20 +682,19 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act } /** - * timekeeping_forward_now - update clock to the current time + * timekeeping_forward - update clock to given cycle now value * @tk: Pointer to the timekeeper to update + * @cycle_now: Current clocksource read value * * Forward the current clock to update its state since the last call to * update_wall_time(). This is useful before significant clock changes, * as it avoids having to deal with this time offset explicitly. */ -static void timekeeping_forward_now(struct timekeeper *tk) +static void timekeeping_forward(struct timekeeper *tk, u64 cycle_now) { - u64 cycle_now, delta; + u64 delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask, + tk->tkr_mono.clock->max_raw_delta); - cycle_now = tk_clock_read(&tk->tkr_mono); - delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask, - tk->tkr_mono.clock->max_raw_delta); tk->tkr_mono.cycle_last = cycle_now; tk->tkr_raw.cycle_last = cycle_now; @@ -710,6 +709,21 @@ static void timekeeping_forward_now(struct timekeeper *tk) } } +/** + * timekeeping_forward_now - update clock to the current time + * @tk: Pointer to the timekeeper to update + * + * Forward the current clock to update its state since the last call to + * update_wall_time(). This is useful before significant clock changes, + * as it avoids having to deal with this time offset explicitly. + */ +static void timekeeping_forward_now(struct timekeeper *tk) +{ + u64 cycle_now = tk_clock_read(&tk->tkr_mono); + + timekeeping_forward(tk, cycle_now); +} + /** * ktime_get_real_ts64 - Returns the time of day in a timespec64. * @ts: pointer to the timespec to be set @@ -2151,6 +2165,54 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset, return offset; } +static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset, + enum timekeeping_adv_mode mode, + unsigned int *clock_set) +{ + int shift = 0, maxshift; + + /* + * TK_ADV_FREQ indicates that adjtimex(2) directly set the + * frequency or the tick length. + * + * Accumulate the offset, so that the new multiplier starts from + * now. This is required as otherwise for offsets, which are + * smaller than tk::cycle_interval, timekeeping_adjust() could set + * xtime_nsec backwards, which subsequently causes time going + * backwards in the coarse time getters. But even for the case + * where offset is greater than tk::cycle_interval the periodic + * accumulation does not have much value. + * + * Also reset tk::ntp_error as it does not make sense to keep the + * old accumulated error around in this case. + */ + if (mode == TK_ADV_FREQ) { + timekeeping_forward(tk, tk->tkr_mono.cycle_last + offset); + tk->ntp_error = 0; + return 0; + } + + /* + * With NO_HZ we may have to accumulate many cycle_intervals + * (think "ticks") worth of time at once. To do this efficiently, + * we calculate the largest doubling multiple of cycle_intervals + * that is smaller than the offset. We then accumulate that + * chunk in one go, and then try to consume the next smaller + * doubled multiple. + */ + shift = ilog2(offset) - ilog2(tk->cycle_interval); + shift = max(0, shift); + /* Bound shift to one less than what overflows tick_length */ + maxshift = (64 - (ilog2(ntp_tick_length()) + 1)) - 1; + shift = min(shift, maxshift); + while (offset >= tk->cycle_interval) { + offset = logarithmic_accumulation(tk, offset, shift, clock_set); + if (offset < tk->cycle_interval << shift) + shift--; + } + return offset; +} + /* * timekeeping_advance - Updates the timekeeper to the current time and * current NTP tick length @@ -2160,7 +2222,6 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode) struct timekeeper *tk = &tk_core.shadow_timekeeper; struct timekeeper *real_tk = &tk_core.timekeeper; unsigned int clock_set = 0; - int shift = 0, maxshift; u64 offset; guard(raw_spinlock_irqsave)(&tk_core.lock); @@ -2177,24 +2238,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode) if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK) return false; - /* - * With NO_HZ we may have to accumulate many cycle_intervals - * (think "ticks") worth of time at once. To do this efficiently, - * we calculate the largest doubling multiple of cycle_intervals - * that is smaller than the offset. We then accumulate that - * chunk in one go, and then try to consume the next smaller - * doubled multiple. - */ - shift = ilog2(offset) - ilog2(tk->cycle_interval); - shift = max(0, shift); - /* Bound shift to one less than what overflows tick_length */ - maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1; - shift = min(shift, maxshift); - while (offset >= tk->cycle_interval) { - offset = logarithmic_accumulation(tk, offset, shift, &clock_set); - if (offset < tk->cycle_interval<<shift) - shift--; - } + offset = timekeeping_accumulate(tk, offset, mode, &clock_set); /* Adjust the multiplier to correct NTP error */ timekeeping_adjust(tk, offset);
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time inconsistencies. Lei tracked down that this was being caused by the adjustment tk->tkr_mono.xtime_nsec -= offset; which is made to compensate for the unaccumulated cycles in offset when the mult value is adjusted forward, so that the non-_COARSE clockids don't see inconsistencies. However, the _COARSE clockids don't use the mult*offset value in their calculations, so this subtraction can cause the _COARSE clock ids to jump back a bit. Now, by design, this negative adjustment should be fine, because the logic run from timekeeping_adjust() is done after we accumulate approx mult*interval_cycles into xtime_nsec. The accumulated (mult*interval_cycles) will be larger then the (mult_adj*offset) value subtracted from xtime_nsec, and both operations are done together under the tk_core.lock, so the net change to xtime_nsec should always be positive. However, do_adjtimex() calls into timekeeping_advance() as well, since we want to apply the ntp freq adjustment immediately. In this case, we don't return early when the offset is smaller then interval_cycles, so we don't end up accumulating any time into xtime_nsec. But we do go on to call timekeeping_adjust(), which modifies the mult value, and subtracts from xtime_nsec to correct for the new mult value. Here because we did not accumulate anything, we have a window where the _COARSE clockids that don't utilize the mult*offset value, can see an inconsistency. So to fix this, rework the timekeeping_advance() logic a bit so that when we are called from do_adjtimex(), we call timekeeping_forward(), to first accumulate the sub-interval time into xtime_nsec. Then with no unaccumulated cycles in offset, we can do the mult adjustment without worry of the subtraction having an impact. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: linux-kselftest@vger.kernel.org Cc: kernel-team@android.com Cc: Lei Chen <lei.chen@smartx.com> Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE") Reported-by: Lei Chen <lei.chen@smartx.com> Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/ Diagnosed-by: Thomas Gleixner <tglx@linutronix.de> Additional-fixes-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <jstultz@google.com> --- v2: Include fixes from Thomas, dropping the unnecessary clock_set setting, and instead clearing ntp_error, along with some other minor tweaks. --- kernel/time/timekeeping.c | 94 ++++++++++++++++++++++++++++----------- 1 file changed, 69 insertions(+), 25 deletions(-)