On NTP, RTCs and accurately setting their time

Message ID	20170920112152.GL20805@n2100.armlinux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> Date: Wed, 20 Sep 2017 12:21:52 +0100 From: Russell King - ARM Linux <linux@armlinux.org.uk> To: linux-arm-kernel@lists.infradead.org, Thomas Gleixner <tglx@linutronix.de>, John Stultz <john.stultz@linaro.org>, Stephen Boyd <sboyd@codeaurora.org>, Alessandro Zummo <a.zummo@towertech.it>, Alexandre Belloni <alexandre.belloni@free-electrons.com>, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Subject: On NTP, RTCs and accurately setting their time Message-ID: <20170920112152.GL20805@n2100.armlinux.org.uk> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Message ID

20170920112152.GL20805@n2100.armlinux.org.uk (mailing list archive)

State

New, archived

Headers

Date: Wed, 20 Sep 2017 12:21:52 +0100
From: Russell King - ARM Linux <linux@armlinux.org.uk>
To: linux-arm-kernel@lists.infradead.org,
	Thomas Gleixner <tglx@linutronix.de>, 
	John Stultz <john.stultz@linaro.org>,
	Stephen Boyd <sboyd@codeaurora.org>, 
	Alessandro Zummo <a.zummo@towertech.it>,
	Alexandre Belloni <alexandre.belloni@free-electrons.com>,
	Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Subject: On NTP, RTCs and accurately setting their time
Message-ID: <20170920112152.GL20805@n2100.armlinux.org.uk>
MIME-Version: 1.0
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
Precedence: list
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Commit Message

Russell King (Oracle) Sept. 20, 2017, 11:21 a.m. UTC

Hi,

It's common for systems to be time synchronised using programs such as
chrony or ntpd.  In such situations, when we are properly synchronised,
the kernel writes the current time to the RTC every 11 seconds.

However, assumptions are made about the RTC:

1. kernel/time/ntp.c assumes that all RTCs want to be told to set the
   time at around 500ms into the second.

2. drivers/rtc/systohc.c assumes that if the time being set is >= 500ms,
   then we want to set the _next_ second.

This leads to RTCs being set with an offset time, where the offset
depends on the RTC.  For example, the PCF8523 ends up reliably set
around 970ms in the future:

RTC Time: 19-09-2017 14:27:51
System Time was:     14:27:50.031

What this means is that when the RTC ticked from 50 to 51 seconds,
system time was at 50.031s.  Hence, the RTC in this case is 969ms
in the future.

For Armada 388, the situation is different:

RTC Time: 19-09-2017 14:48:17
System Time was:     14:48:16.521

Here, the RTC is being set 479ms in the future.

The SNVS RTC in imx6 is different again, although I have no definitive
figures yet.

Why does this matter - if you introduce a 1s offset in system time
(eg, you reboot a machine running NTP) then it takes at least four hours
after the reboot for ntpd to settle down, with the PPM swinging faster
and slower than normal as it compensates for the offset.  This means if
you are using a Linux system for time measurement purposes, it is
useless for those four hours.

Being able to accurately save and restore system time helps to reduce
the disruptive effect of a reboot.

Currently, there are no controls in the kernel over this mechanism - if
you're running a multi-platform kernel which has support for writing the
RTC if ntp-sync'd, then you're stuck with the kernel writing the RTC
every 11 seconds.  Not only is this detrimental due to the enforced RTC
dependent offset, but it also means that if you want to trim your RTC
to a synchronised source of time (for which the kernel must not write
the RTC, but you do want it to ntp sync), the only way to do it is to
either modify the kernel, or disable CONFIG_RTC_SYSTOHC in the multi-
platform kernel.


The kernel can do better - elimination of the rounding-up in systohc.c
gives a better result for PCF8523:

RTC Time: 19-09-2017 16:30:38
System Time was:     16:30:38.034

and the remaining offset can be reduced by adjusting the 500ms offset
in ntp.c to 470ms.  This is specific to the offset that PCF8523 wants.
We know that MC146818 RTCs found in PCs want a 500ms offset (without
the addition of one second that systohc.c does.)  The Armada 388 RTC
wants to be set on the exact second.

We need some way to cater for these differing requirements (eg, RTC
drivers provide some properties to rtclib and the ntp code to describe
how long it takes for a written time to "take"), or we decide (as I
think was decided in the past) that the kernel should not be setting
the RTC, but userspace should be responsible for performing that
function.  Either way, we need to know about these RTC specific
properties in order to set their time accurately.

One of the issues here, however, is that RTC datasheets do not give this
information - this can only be found out by experimentation and
measurement.

Currently I'm using the hacky patch below to be able to (a) disable the
regular RTC write during runtime, so I can measure the RTC drift and
trim it, and (b) provide a knob to adjust how far past the second the
RTC will receive its write.  For a properly trimmed PCF8523 (measured
over about 12 hours to have 0.1ppm drift - which is better than its
associated crystal is rated for), setting this knob to 470ms results
in the following (from two samples this morning):

RTC Time: 20-09-2017 10:41:30
System Time was:     10:41:30.000
RTC Time: 20-09-2017 11:17:38
System Time was:     11:17:38.000

There's probably some noise in the setting of this due to the workqueue,
but getting it within 10ms is definitely an improvement over being
almost a second out.

So, the question is... how should these differences in rtc requirements
be handled?

 drivers/rtc/systohc.c |  6 +++---
 kernel/sysctl.c       | 21 +++++++++++++++++++++
 kernel/time/ntp.c     |  9 ++++++---
 3 files changed, 30 insertions(+), 6 deletions(-)

Comments

Russell King (Oracle) Sept. 20, 2017, 1:23 p.m. UTC | #1

On Wed, Sep 20, 2017 at 12:21:52PM +0100, Russell King - ARM Linux wrote:
> Hi,
> 
> It's common for systems to be time synchronised using programs such as
> chrony or ntpd.  In such situations, when we are properly synchronised,
> the kernel writes the current time to the RTC every 11 seconds.

That should be 11 minutes.

> 
> However, assumptions are made about the RTC:
> 
> 1. kernel/time/ntp.c assumes that all RTCs want to be told to set the
>    time at around 500ms into the second.
> 
> 2. drivers/rtc/systohc.c assumes that if the time being set is >= 500ms,
>    then we want to set the _next_ second.
> 
> This leads to RTCs being set with an offset time, where the offset
> depends on the RTC.  For example, the PCF8523 ends up reliably set
> around 970ms in the future:
> 
> RTC Time: 19-09-2017 14:27:51
> System Time was:     14:27:50.031
> 
> What this means is that when the RTC ticked from 50 to 51 seconds,
> system time was at 50.031s.  Hence, the RTC in this case is 969ms
> in the future.
> 
> For Armada 388, the situation is different:
> 
> RTC Time: 19-09-2017 14:48:17
> System Time was:     14:48:16.521
> 
> Here, the RTC is being set 479ms in the future.
> 
> The SNVS RTC in imx6 is different again, although I have no definitive
> figures yet.
> 
> Why does this matter - if you introduce a 1s offset in system time
> (eg, you reboot a machine running NTP) then it takes at least four hours
> after the reboot for ntpd to settle down, with the PPM swinging faster
> and slower than normal as it compensates for the offset.  This means if
> you are using a Linux system for time measurement purposes, it is
> useless for those four hours.
> 
> Being able to accurately save and restore system time helps to reduce
> the disruptive effect of a reboot.
> 
> Currently, there are no controls in the kernel over this mechanism - if
> you're running a multi-platform kernel which has support for writing the
> RTC if ntp-sync'd, then you're stuck with the kernel writing the RTC
> every 11 seconds.  Not only is this detrimental due to the enforced RTC

and here too.

> dependent offset, but it also means that if you want to trim your RTC
> to a synchronised source of time (for which the kernel must not write
> the RTC, but you do want it to ntp sync), the only way to do it is to
> either modify the kernel, or disable CONFIG_RTC_SYSTOHC in the multi-
> platform kernel.
> 
> 
> The kernel can do better - elimination of the rounding-up in systohc.c
> gives a better result for PCF8523:
> 
> RTC Time: 19-09-2017 16:30:38
> System Time was:     16:30:38.034
> 
> and the remaining offset can be reduced by adjusting the 500ms offset
> in ntp.c to 470ms.  This is specific to the offset that PCF8523 wants.
> We know that MC146818 RTCs found in PCs want a 500ms offset (without
> the addition of one second that systohc.c does.)  The Armada 388 RTC
> wants to be set on the exact second.
> 
> We need some way to cater for these differing requirements (eg, RTC
> drivers provide some properties to rtclib and the ntp code to describe
> how long it takes for a written time to "take"), or we decide (as I
> think was decided in the past) that the kernel should not be setting
> the RTC, but userspace should be responsible for performing that
> function.  Either way, we need to know about these RTC specific
> properties in order to set their time accurately.
> 
> One of the issues here, however, is that RTC datasheets do not give this
> information - this can only be found out by experimentation and
> measurement.
> 
> Currently I'm using the hacky patch below to be able to (a) disable the
> regular RTC write during runtime, so I can measure the RTC drift and
> trim it, and (b) provide a knob to adjust how far past the second the
> RTC will receive its write.  For a properly trimmed PCF8523 (measured
> over about 12 hours to have 0.1ppm drift - which is better than its
> associated crystal is rated for), setting this knob to 470ms results
> in the following (from two samples this morning):
> 
> RTC Time: 20-09-2017 10:41:30
> System Time was:     10:41:30.000
> RTC Time: 20-09-2017 11:17:38
> System Time was:     11:17:38.000
> 
> There's probably some noise in the setting of this due to the workqueue,
> but getting it within 10ms is definitely an improvement over being
> almost a second out.
> 
> So, the question is... how should these differences in rtc requirements
> be handled?
> 
>  drivers/rtc/systohc.c |  6 +++---
>  kernel/sysctl.c       | 21 +++++++++++++++++++++
>  kernel/time/ntp.c     |  9 ++++++---
>  3 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/rtc/systohc.c b/drivers/rtc/systohc.c
> index b4a68ffcd06b..df804597de71 100644
> --- a/drivers/rtc/systohc.c
> +++ b/drivers/rtc/systohc.c
> @@ -26,10 +26,7 @@ int rtc_set_ntp_time(struct timespec64 now)
>  	struct rtc_time tm;
>  	int err = -ENODEV;
>  
> -	if (now.tv_nsec < (NSEC_PER_SEC >> 1))
> -		rtc_time64_to_tm(now.tv_sec, &tm);
> -	else
> -		rtc_time64_to_tm(now.tv_sec + 1, &tm);
> +	rtc_time64_to_tm(now.tv_sec, &tm);
>  
>  	rtc = rtc_class_open(CONFIG_RTC_SYSTOHC_DEVICE);
>  	if (rtc) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbbb8157..c2ce802f7ab9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -98,6 +98,9 @@
>  
>  #if defined(CONFIG_SYSCTL)
>  
> +extern unsigned int sysctl_ntp_rtc_offset;
> +extern unsigned int sysctl_ntp_rtc_sync;
> +
>  /* External variables not in a header file. */
>  extern int suid_dumpable;
>  #ifdef CONFIG_COREDUMP
> @@ -303,6 +306,24 @@ static int max_extfrag_threshold = 1000;
>  #endif
>  
>  static struct ctl_table kern_table[] = {
> +#if defined(CONFIG_GENERIC_CMOS_UPDATE) || defined(CONFIG_RTC_SYSTOHC)
> +	{
> +		.procname	= "ntp_rtc_offset",
> +		.data		= &sysctl_ntp_rtc_offset,
> +		.maxlen		= sizeof(sysctl_ntp_rtc_offset),
> +		.mode		= 0644,
> +		.proc_handler	= proc_douintvec,
> +	},
> +	{
> +		.procname	= "ntp_rtc_sync",
> +		.data		= &sysctl_ntp_rtc_sync,
> +		.maxlen		= sizeof(sysctl_ntp_rtc_sync),
> +		.mode		= 0644,
> +		.proc_handler	= proc_douintvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
> +#endif
>  	{
>  		.procname	= "sched_child_runs_first",
>  		.data		= &sysctl_sched_child_runs_first,
> diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
> index edf19cc53140..674c45d30561 100644
> --- a/kernel/time/ntp.c
> +++ b/kernel/time/ntp.c
> @@ -508,6 +508,9 @@ int __weak update_persistent_clock64(struct timespec64 now64)
>  #endif
>  
>  #if defined(CONFIG_GENERIC_CMOS_UPDATE) || defined(CONFIG_RTC_SYSTOHC)
> +unsigned long sysctl_ntp_rtc_offset = NSEC_PER_SEC / 2;
> +unsigned int sysctl_ntp_rtc_sync = true;
> +
>  static void sync_cmos_clock(struct work_struct *work);
>  
>  static DECLARE_DELAYED_WORK(sync_cmos_work, sync_cmos_clock);
> @@ -526,7 +529,7 @@ static void sync_cmos_clock(struct work_struct *work)
>  	 * may not expire at the correct time.  Thus, we adjust...
>  	 * We want the clock to be within a couple of ticks from the target.
>  	 */
> -	if (!ntp_synced()) {
> +	if (!ntp_synced() || !sysctl_ntp_rtc_sync) {
>  		/*
>  		 * Not synced, exit, do not restart a timer (if one is
>  		 * running, let it run out).
> @@ -535,7 +538,7 @@ static void sync_cmos_clock(struct work_struct *work)
>  	}
>  
>  	getnstimeofday64(&now);
> -	if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec * 5) {
> +	if (abs(now.tv_nsec - sysctl_ntp_rtc_offset) <= tick_nsec * 5) {
>  		struct timespec64 adjust = now;
>  
>  		fail = -ENODEV;
> @@ -551,7 +554,7 @@ static void sync_cmos_clock(struct work_struct *work)
>  #endif
>  	}
>  
> -	next.tv_nsec = (NSEC_PER_SEC / 2) - now.tv_nsec - (TICK_NSEC / 2);
> +	next.tv_nsec = sysctl_ntp_rtc_offset - now.tv_nsec - (TICK_NSEC / 2);
>  	if (next.tv_nsec <= 0)
>  		next.tv_nsec += NSEC_PER_SEC;
>  
> 
> 
> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
> According to speedtest.net: 8.21Mbps down 510kbps up
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Jason Gunthorpe Sept. 20, 2017, 4:22 p.m. UTC | #2

On Wed, Sep 20, 2017 at 12:21:52PM +0100, Russell King - ARM Linux wrote:

> However, assumptions are made about the RTC:
> 
> 1. kernel/time/ntp.c assumes that all RTCs want to be told to set the
>    time at around 500ms into the second.
> 
> 2. drivers/rtc/systohc.c assumes that if the time being set is >= 500ms,
>    then we want to set the _next_ second.

I looked at these issues when I did the sys to HC patches and I
concluced the first problem was that the RTC read functions generally
did not return sub second resolution, either in sense of directly
returning ts_nsec, or the sense of delaying the read until a clock
tick over event.

Eg if I repeatedly booted my various embedded ARM/PPC systems, and
recorded the time offset from NTP, I got a fairly random distribution
of offsets.

Given that we couldn't seem to read back sub second resolution on my
systems, storing it more accurately did not seem important.

I think we also did some experiments with a few of the RTCs we were
using and some of them did not adjust the seconds clock phase on
write, so they seemed incapable of storing sub second data anyhow.

So.. My feeling was that we'd need driver support in each RTC driver
to enable sub section resolution.

Do you know differently?

Our pragmatic solution in our products was to have the initial time
sync from NTP step the clock even if the offset is small.

> So, the question is... how should these differences in rtc requirements
> be handled?

I think patch wise, this is something I would rather see handled
internally via the drivers and perhaps with input from DT, not via
sysctl knobs.

The HW driver should know how to read and write with sub second
resolution. If it works best with a certain value in the ts_nsec
field, then it should set something inside rtc_chip that causes the
systohc code to try and call it with that tv_nsec.

It would probably also make sense to add new ntp ops for sub second
get/set that includes the full timespec. This way the RTC driver can
provide the right adjustments and we can get rid of the +1 in
rtc_set_ntp_time and the +0.5 in rtc_hctosys for sub sec aware
drivers..

Jason

Russell King (Oracle) Sept. 20, 2017, 4:51 p.m. UTC | #3

On Wed, Sep 20, 2017 at 10:22:08AM -0600, Jason Gunthorpe wrote:
> On Wed, Sep 20, 2017 at 12:21:52PM +0100, Russell King - ARM Linux wrote:
> 
> > However, assumptions are made about the RTC:
> > 
> > 1. kernel/time/ntp.c assumes that all RTCs want to be told to set the
> >    time at around 500ms into the second.
> > 
> > 2. drivers/rtc/systohc.c assumes that if the time being set is >= 500ms,
> >    then we want to set the _next_ second.
> 
> I looked at these issues when I did the sys to HC patches and I
> concluced the first problem was that the RTC read functions generally
> did not return sub second resolution, either in sense of directly
> returning ts_nsec, or the sense of delaying the read until a clock
> tick over event.

The boot time problem can be resolved by using hwclock to set the
system time in userspace - there are distros that do exactly that.
For example, debian has a udev rule:

# Set the System Time from the Hardware Clock and set the kernel's timezone
# value to the local timezone when the kernel clock module is loaded.

KERNEL=="rtc0", RUN+="/lib/udev/hwclock-set $root/$name"

and that script uses hwclock to set the system time from the RTC.
hwclock doesn't just read the RTC and set the system time, it tries
to read the RTC at a whole second, and apply any known RTC drift
correction.

So I wouldn't worry about the kernel's RTC read being inaccurate,
that's been worked around in distros for years.

Embedded distros may not care so much about this (since they care
more about boot speed) except if they're reliant on having correct
time, in which case they may /choose/ to use hwclock in a blocking
manner to ensure that system time is accurately set from the RTC.

However, all that effort is for nowt if the RTC isn't set correctly
in the first place.

> I think we also did some experiments with a few of the RTCs we were
> using and some of them did not adjust the seconds clock phase on
> write, so they seemed incapable of storing sub second data anyhow.

I can imagine that there are RTCs out there which do not reset the
seconds pre-scaler, but just because there are some like that is no
reason to cripple those which are more inteligent about it.

> So.. My feeling was that we'd need driver support in each RTC driver
> to enable sub section resolution.
> 
> Do you know differently?

See above...

> Our pragmatic solution in our products was to have the initial time
> sync from NTP step the clock even if the offset is small.

That's all very well, but consider when your time source is GPS and
you're not on a network connection that would allow ntpdate to do
any better (eg, a 3G mobile data card...)  I have exactly that setup
with a remote monitoring system.  If I reboot the gateway with the
3G and GPS, NTP makes big PPM adjustments while trying to correct for
the 970ms time shift, and that madness is transmitted by ntpd to the
stratum 2 servers.

> > So, the question is... how should these differences in rtc requirements
> > be handled?
> 
> I think patch wise, this is something I would rather see handled
> internally via the drivers and perhaps with input from DT, not via
> sysctl knobs.

I sort-of agree as far as the time offset information goes, but there's
a complication that we only open the RTC to set the time at the point in
time that we want to set it - while the RTC is closed, the RTC driver
module could be removed and replaced by a different RTC driver which
replaces the existing device.

Don't think that's not possible, there are boards out there with multiple
RTCs on them, so its entirely possible that the wrong RTC could be
selected with random module probe ordering and need manual resolution.
For example, the system I mention above has the built-in iMX6 SVNS RTC
which gets powered down when power is removed, and a PCF8523 RTC which
doesn't... and there's variants of the board where the PCF8523 isn't
fitted, so users are reliant on the iMX6 RTC for cross-reboot time.

The patch is not meant to be a final solution, so criticising it for
using sysctls is not appropriate - it is firstly to prove the point that
we are in fact able to correctly set RTCs, and secondly (as I explained
in the bits you cut) to provide a knob to turn off the kernel's automatic
RTC setting so it's possible to accurately trim the RTC.

Many RTCs contain registers that allow fine trimming of their tick rate,
and if you care about the time keeping of your RTC, it's something that
needs to be calibrated against a known correct time source.  Having the
kernel repeatedly set the RTC every 11 minutes while you're trying to
measure the RTCs drift over a period of 12 hours is not on - and in
order to make such a measurement, you want the machine to be NTP
synchronised.  That's the exact situation that triggers kernels to
periodically write the time to the RTC.

So, we _do_ need a knob to turn that kernel timekeeping facility on and
off in addition to the "are we NTP sync'd" status.

> The HW driver should know how to read and write with sub second
> resolution. If it works best with a certain value in the ts_nsec
> field, then it should set something inside rtc_chip that causes the
> systohc code to try and call it with that tv_nsec.

The problem is deeper than the systohc.c code - the timing of the call
made into the systohc.c code is decided by kernel/time/ntp.c, and
currently is within a tick of 500ms past the second.

> It would probably also make sense to add new ntp ops for sub second
> get/set that includes the full timespec. This way the RTC driver can
> provide the right adjustments and we can get rid of the +1 in
> rtc_set_ntp_time and the +0.5 in rtc_hctosys for sub sec aware
> drivers..

Do we have any sub-second aware drivers?  None of my RTCs are, and I
don't think it's fair for RTC drivers to sleep when getting a request
to set the time.

The userspace API doesn't do that, and the workqueue involved in
setting NTP time is probably run using shared system_power_efficient_wq
resources, so blocking there will be detrimental to other works queued
on that.

Maybe the NTP code should call into RTCs to enable/disable the time
setting feature and leave RTC drivers to do that, but that seems to me
like a lot of code in each driver as well as something of a layering
issue.

Jason Gunthorpe Sept. 20, 2017, 5:16 p.m. UTC | #4

On Wed, Sep 20, 2017 at 05:51:41PM +0100, Russell King - ARM Linux wrote:
> On Wed, Sep 20, 2017 at 10:22:08AM -0600, Jason Gunthorpe wrote:
> > On Wed, Sep 20, 2017 at 12:21:52PM +0100, Russell King - ARM Linux wrote:
> > 
> > > However, assumptions are made about the RTC:
> > > 
> > > 1. kernel/time/ntp.c assumes that all RTCs want to be told to set the
> > >    time at around 500ms into the second.
> > > 
> > > 2. drivers/rtc/systohc.c assumes that if the time being set is >= 500ms,
> > >    then we want to set the _next_ second.
> > 
> > I looked at these issues when I did the sys to HC patches and I
> > concluced the first problem was that the RTC read functions generally
> > did not return sub second resolution, either in sense of directly
> > returning ts_nsec, or the sense of delaying the read until a clock
> > tick over event.
> 
> The boot time problem can be resolved by using hwclock to set the
> system time in userspace - there are distros that do exactly that.
> For example, debian has a udev rule:

Okay, that part I didn't know, thanks.

> > I think patch wise, this is something I would rather see handled
> > internally via the drivers and perhaps with input from DT, not via
> > sysctl knobs.
> 
> I sort-of agree as far as the time offset information goes, but there's
> a complication that we only open the RTC to set the time at the point in
> time that we want to set it - while the RTC is closed, the RTC driver
> module could be removed and replaced by a different RTC driver which
> replaces the existing device.

Yes, the ntp code would have to open the rtc initially and record the
offset. Each time it goes to write it would have to check the offset
it got against what the device wants and, if necessary, reschedule
the update if the ts_nsec is not correct. That should handle
infrequent dynamic changes..

> So, we _do_ need a knob to turn that kernel timekeeping facility on and
> off in addition to the "are we NTP sync'd" status.

Sure, that makes sense as a sysctl

> > The HW driver should know how to read and write with sub second
> > resolution. If it works best with a certain value in the ts_nsec
> > field, then it should set something inside rtc_chip that causes the
> > systohc code to try and call it with that tv_nsec.
> 
> The problem is deeper than the systohc.c code - the timing of the call
> made into the systohc.c code is decided by kernel/time/ntp.c, and
> currently is within a tick of 500ms past the second.

Right, I was thinking about the entire systohc process, including the
part in ntp.c

> Do we have any sub-second aware drivers?  None of my RTCs are, and I
> don't think it's fair for RTC drivers to sleep when getting a request
> to set the time.

I would call any RTC that can change the phase of the seconds clock as
sub-second capable. Currently, I think that only the CMOS driver is
really a sub-second aware driver, and the rest of the RTC stack has been
sort of hardcoded around what it does..

My thinking was adding a new sub-second entry point would let us keep
the existing assumptions above while having a cleaner entry point that
allows fixing the RTC drivers gradually. Eg you have tested PCF, so it
could use the subsecond entry and define the proper ts_nsec value/etc.

> The userspace API doesn't do that, and the workqueue involved in
> setting NTP time is probably run using shared system_power_efficient_wq
> resources, so blocking there will be detrimental to other works queued
> on that.

I think the NTP path should be non blocking, and rely on the ntp.c
code calling into the driver with the desired tv_nsec.

However, it also makes sense to provide a blocking sub-second user
space API that would have the required sleep to make setting the time
work properly. This way we can hide the desired tv_nsec implementation
detail from userspace completely.

eg I assume hwclock also has the built in 0.5s assumption when trying
to write to the RTC?

The same would be true for read, I think it would be cleaner to have a
kernel uapi that reads the time with subsecond accuracy than to have
hwclock implement a spinning loop in userspace. That gives drivers
more options in how they measure the second tick over. The common code
would just provide a simple loop like we see in hwclock.

Jason

diff --git a/drivers/rtc/systohc.c b/drivers/rtc/systohc.c
index b4a68ffcd06b..df804597de71 100644
--- a/drivers/rtc/systohc.c
+++ b/drivers/rtc/systohc.c
@@ -26,10 +26,7 @@  int rtc_set_ntp_time(struct timespec64 now)
 	struct rtc_time tm;
 	int err = -ENODEV;
 
-	if (now.tv_nsec < (NSEC_PER_SEC >> 1))
-		rtc_time64_to_tm(now.tv_sec, &tm);
-	else
-		rtc_time64_to_tm(now.tv_sec + 1, &tm);
+	rtc_time64_to_tm(now.tv_sec, &tm);
 
 	rtc = rtc_class_open(CONFIG_RTC_SYSTOHC_DEVICE);
 	if (rtc) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbbb8157..c2ce802f7ab9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -98,6 +98,9 @@ 
 
 #if defined(CONFIG_SYSCTL)
 
+extern unsigned int sysctl_ntp_rtc_offset;
+extern unsigned int sysctl_ntp_rtc_sync;
+
 /* External variables not in a header file. */
 extern int suid_dumpable;
 #ifdef CONFIG_COREDUMP
@@ -303,6 +306,24 @@  static int max_extfrag_threshold = 1000;
 #endif
 
 static struct ctl_table kern_table[] = {
+#if defined(CONFIG_GENERIC_CMOS_UPDATE) || defined(CONFIG_RTC_SYSTOHC)
+	{
+		.procname	= "ntp_rtc_offset",
+		.data		= &sysctl_ntp_rtc_offset,
+		.maxlen		= sizeof(sysctl_ntp_rtc_offset),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec,
+	},
+	{
+		.procname	= "ntp_rtc_sync",
+		.data		= &sysctl_ntp_rtc_sync,
+		.maxlen		= sizeof(sysctl_ntp_rtc_sync),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index edf19cc53140..674c45d30561 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -508,6 +508,9 @@  int __weak update_persistent_clock64(struct timespec64 now64)
 #endif
 
 #if defined(CONFIG_GENERIC_CMOS_UPDATE) || defined(CONFIG_RTC_SYSTOHC)
+unsigned long sysctl_ntp_rtc_offset = NSEC_PER_SEC / 2;
+unsigned int sysctl_ntp_rtc_sync = true;
+
 static void sync_cmos_clock(struct work_struct *work);
 
 static DECLARE_DELAYED_WORK(sync_cmos_work, sync_cmos_clock);
@@ -526,7 +529,7 @@  static void sync_cmos_clock(struct work_struct *work)
 	 * may not expire at the correct time.  Thus, we adjust...
 	 * We want the clock to be within a couple of ticks from the target.
 	 */
-	if (!ntp_synced()) {
+	if (!ntp_synced() || !sysctl_ntp_rtc_sync) {
 		/*
 		 * Not synced, exit, do not restart a timer (if one is
 		 * running, let it run out).
@@ -535,7 +538,7 @@  static void sync_cmos_clock(struct work_struct *work)
 	}
 
 	getnstimeofday64(&now);
-	if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec * 5) {
+	if (abs(now.tv_nsec - sysctl_ntp_rtc_offset) <= tick_nsec * 5) {
 		struct timespec64 adjust = now;
 
 		fail = -ENODEV;
@@ -551,7 +554,7 @@  static void sync_cmos_clock(struct work_struct *work)
 #endif
 	}
 
-	next.tv_nsec = (NSEC_PER_SEC / 2) - now.tv_nsec - (TICK_NSEC / 2);
+	next.tv_nsec = sysctl_ntp_rtc_offset - now.tv_nsec - (TICK_NSEC / 2);
 	if (next.tv_nsec <= 0)
 		next.tv_nsec += NSEC_PER_SEC;