diff mbox

[1/8] time: Add persistent clock support

Message ID d330f02c8ef2a348b179850e1745a52ab9c47e90.1528878545.git.baolin.wang@linaro.org (mailing list archive)
State New, archived
Headers show

Commit Message

(Exiting) Baolin Wang June 13, 2018, 11:32 a.m. UTC
On our Spreadtrum SC9860 platform, we registered the high resolution
ARM generic timer as one clocksource to update the OS time, but the
ARM generic timer will be stopped in suspend state. So we use one 64bit
always-on timer (but low resolution) of Spreadtrum to calculate the
suspend time to compensate the OS time. Though we can register the
always-on timer as one clocksource, we need re-calculate the
mult/shift with one larger conversion range to calculate the suspend
time.

But now we have too many different ways of dealing with persistent
timekeeping across architectures, and there will be many duplicate
code if we register one timer to be one persistent clock. Thus it
will be more helpful if we add one common framework for timer drivers
to be registered as one persistent clock and implement the common
read_persistent_clock64() to compensate the OS time.

Moreover we can register the clocksource with CLOCK_SOURCE_SUSPEND_NONSTOP
to be one persistent clock, then we can simplify the suspend/resume
accounting by removing CLOCK_SOURCE_SUSPEND_NONSTOP timing. After that
we can only compensate the OS time by persistent clock or RTC.

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
---
 include/linux/persistent_clock.h |   23 +++++
 kernel/time/Kconfig              |    4 +
 kernel/time/Makefile             |    1 +
 kernel/time/alarmtimer.c         |    4 +
 kernel/time/persistent_clock.c   |  184 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 216 insertions(+)
 create mode 100644 include/linux/persistent_clock.h
 create mode 100644 kernel/time/persistent_clock.c

Comments

Thomas Gleixner June 24, 2018, 12:14 a.m. UTC | #1
On Wed, 13 Jun 2018, Baolin Wang wrote:
> Moreover we can register the clocksource with CLOCK_SOURCE_SUSPEND_NONSTOP
> to be one persistent clock, then we can simplify the suspend/resume
> accounting by removing CLOCK_SOURCE_SUSPEND_NONSTOP timing. After that
> we can only compensate the OS time by persistent clock or RTC.

That makes sense because it adds a gazillion lines of code and removes 5?
Not really,

> +/**
> + * persistent_clock_read_data - data required to read persistent clock
> + * @read: Returns a cycle value from persistent clock.
> + * @last_cycles: Clock cycle value at last update.
> + * @last_ns: Time value (nanoseconds) at last update.
> + * @mask: Bitmask for two's complement subtraction of non 64bit clocks.
> + * @mult: Cycle to nanosecond multiplier.
> + * @shift: Cycle to nanosecond divisor.
> + */
> +struct persistent_clock_read_data {
> +	u64 (*read)(void);
> +	u64 last_cycles;
> +	u64 last_ns;
> +	u64 mask;
> +	u32 mult;
> +	u32 shift;
> +};
> +/**
> + * persistent_clock - represent the persistent clock
> + * @read_data: Data required to read from persistent clock.
> + * @seq: Sequence counter for protecting updates.
> + * @freq: The frequency of the persistent clock.
> + * @wrap: Duration for persistent clock can run before wrapping.
> + * @alarm: Update timeout for persistent clock wrap.
> + * @alarm_inited: Indicate if the alarm has been initialized.
> + */
> +struct persistent_clock {
> +	struct persistent_clock_read_data read_data;
> +	seqcount_t seq;
> +	u32 freq;
> +	ktime_t wrap;
> +	struct alarm alarm;
> +	bool alarm_inited;
> +};

NAK!

There is no reason to invent yet another set of data structures and yet
more read functions with a sequence counter. which are just a bad and
broken copy of the existing timekeeping/clocksource code. And of course the
stuff is not serialized against multiple registrations, etc. etc.

Plus the utter nonsense that any call site has to do the same thing over
and over:

    register():
    start_alarm_timer();

Why is this required in the first place? It's not at all. The only place
where such an alarm timer will be required is when the system actually goes
to suspend. Starting it at registration time is pointless and even counter
productive. Assume the clocksource wraps every 2 hours. So you start it at
boot time and after 119 minutes uptime the system suspends. So it will
wakeup one minute later to update the clocksource. Heck no. If the timer is
started when the machine actually suspends it will wake up earliest in 120
minutes.

And you even add that to the TSC which does not need it at all. It will
wrap in about 400 years on a 2GHZ machine. So you degrade the functionality
instead of improving it.

So no, this is not going anywhere.

Let's look at the problem itself:

   You want to use one clocksource for timekeeping during runtime which is
   fast and accurate and another one for suspend time injection which is
   slower and/or less accurate because the fast one stops in suspend.

   Plus you need an alarmtimer which makes sure that the clocksource does
   not wrap around during suspend.

Now lets look what we have already:

   Both clocksources already exist and are registered as clocksources with
   all required data in the clocksource core.

Ergo the only sane and logical conclusion is to expand the existing
infrastructure to handle that.

When a clocksource is registered, then the registration function already
makes decisions about using it as timekeeping clocksource. So add a few
lines of code to check whether the newly registered clocksource is suitable
and preferred for suspend.

	if (!stops_in_suspend(newcs)) {
		if (!suspend_cs || is_preferred_suspend_cs(newcs))
			suspend_cs = newcs;
	}		
		
The is_preferred_suspend_cs() can be based on rating, the maximum suspend
length which can be achieved or whatever is sensible. It should start of as
a very simple decision function based on rating and not an prematurely
overengineered monstrosity.

The suspend/resume() code needs a few very simple changes:
   
xxx_suspend():
	clocksource_prepare_suspend();

  Note, this is _NOT_ timekeeping_suspend() because that is invoked _AFTER_
  alarmtimer_suspend(). So if an alarm timer is required it needs to be
  armed before that. A trivial solution might be to just call it from
  alarmtimer_suspend(), but that a minor detail to worry about.

timekeeping_suspend()
{
	clocksource_enter_suspend();
	...

timekeeping_resume()
{
...
	if (clocksource_leave_suspend(&nsec)) {
		ts_delta = ns_to_timespec64(nsec);
		sleeptime_injected = true;
	} else if (......


Now lets look at the new functions:

void clocksource_prepare_suspend(void)
{
	if (!suspend_cs)
		return;

	if (needs_alarmtimer(suspend_cs))
		start_suspend_alarm(suspend_cs);
}

void clocksource_enter_suspend(void)
{
	if (!suspend_cs)
		return;
	suspend_start = suspend_cs->read();
}

bool clocksource_leave_suspend(u64 *nsec)
{
	u64 now, delta;

	if (!suspend_cs)
		return false;

	now = suspend_cs->read();
	delta = clocksource_delta(now, suspend_start, suspend_cs->mask);
	*nsec = mul_u64_u32_shr(delta, suspend_cs->mult, suspend_cs->shift);
	return true;
}

See? It does not need any of this totally nonsensical stuff in your
registration function and not any new read functions and whatever, because
it simply can use the bog standard mult/shift values. Why? Because the
conversion above can cope with a full 64 bit * 32 bit multiply without
falling apart. It's already there in timekeeping_resume() otherwise
resuming with a NONSTOP TSC would result in bogus sleep times after a few
minutes. It's slower than the normal clocksource conversion which is
optimized for performance, but thats completely irrelevant on resume. This
whole blurb about requiring separate mult/shift values is just plain
drivel.

Plus any reasonably broad clocksource will not need an alarmtimer at
all. Because the only reason it is needed is when the clocksource itself
wraps around. And that has absolutely nothing to do with mult/shift
values. That just depends on the frequency and the bitwidth of the counter,

So it does not need an update function either because in case of broad
enough clocksources there is absolutely no need for update and in case of
wrapping ones the alarmtimer brings it out of suspend on time. And because
the only interesting thing is the delta between suspend and resume this is
all a complete non issue.

The clocksource core already has all the registration/unregistration
functionality plus an interface to reconfigure the frequency, so
clocksources can come and go and be reconfigured and all of this just
works.

Once the extra few lines of code are in place, then you can go and cleanup
the existing mess of homebrewn interfaces and claim that this is
consolidation and simplification.

<rant>

What's wrong with you people? Didn't they teach you in school that the
first thing to do is proper problem and code analysis? If they did not, go
back to them and ask your money back,

I'm really tired of these overengineered trainwrecks which are then
advertised with bullshit marketing like the best invention since sliced
bread. This might work in random corporates, but not here.

</rant>

Thanks,

	tglx
Thomas Gleixner June 25, 2018, 4:18 p.m. UTC | #2
On Sun, 24 Jun 2018, Thomas Gleixner wrote:
> The clocksource core already has all the registration/unregistration
> functionality plus an interface to reconfigure the frequency, so
> clocksources can come and go and be reconfigured and all of this just
> works.

Just for completeness sake.

If this extra suspend clocksource is not needed, i.e. because the system is
not suspended, then it does not need to be enabled. clocksources already
have an enable/disable callback exactly for that purpose. So instead of
having the clocks running during normal system operation the thing can be
OFF and save power and only be switched to ON when it is actually needed,
i.e. when going into suspend.

Thanks,

	tglx
John Stultz June 25, 2018, 5:23 p.m. UTC | #3
On Sat, Jun 23, 2018 at 5:14 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 13 Jun 2018, Baolin Wang wrote:
>> Moreover we can register the clocksource with CLOCK_SOURCE_SUSPEND_NONSTOP
>> to be one persistent clock, then we can simplify the suspend/resume
>> accounting by removing CLOCK_SOURCE_SUSPEND_NONSTOP timing. After that
>> we can only compensate the OS time by persistent clock or RTC.
>
> That makes sense because it adds a gazillion lines of code and removes 5?
> Not really,
>
>> +/**
>> + * persistent_clock_read_data - data required to read persistent clock
>> + * @read: Returns a cycle value from persistent clock.
>> + * @last_cycles: Clock cycle value at last update.
>> + * @last_ns: Time value (nanoseconds) at last update.
>> + * @mask: Bitmask for two's complement subtraction of non 64bit clocks.
>> + * @mult: Cycle to nanosecond multiplier.
>> + * @shift: Cycle to nanosecond divisor.
>> + */
>> +struct persistent_clock_read_data {
>> +     u64 (*read)(void);
>> +     u64 last_cycles;
>> +     u64 last_ns;
>> +     u64 mask;
>> +     u32 mult;
>> +     u32 shift;
>> +};
>> +/**
>> + * persistent_clock - represent the persistent clock
>> + * @read_data: Data required to read from persistent clock.
>> + * @seq: Sequence counter for protecting updates.
>> + * @freq: The frequency of the persistent clock.
>> + * @wrap: Duration for persistent clock can run before wrapping.
>> + * @alarm: Update timeout for persistent clock wrap.
>> + * @alarm_inited: Indicate if the alarm has been initialized.
>> + */
>> +struct persistent_clock {
>> +     struct persistent_clock_read_data read_data;
>> +     seqcount_t seq;
>> +     u32 freq;
>> +     ktime_t wrap;
>> +     struct alarm alarm;
>> +     bool alarm_inited;
>> +};
>
> NAK!
>
> There is no reason to invent yet another set of data structures and yet
> more read functions with a sequence counter. which are just a bad and
> broken copy of the existing timekeeping/clocksource code. And of course the
> stuff is not serialized against multiple registrations, etc. etc.
>
> Plus the utter nonsense that any call site has to do the same thing over
> and over:
>
>     register():
>     start_alarm_timer();
>
> Why is this required in the first place? It's not at all. The only place
> where such an alarm timer will be required is when the system actually goes
> to suspend. Starting it at registration time is pointless and even counter
> productive. Assume the clocksource wraps every 2 hours. So you start it at
> boot time and after 119 minutes uptime the system suspends. So it will
> wakeup one minute later to update the clocksource. Heck no. If the timer is
> started when the machine actually suspends it will wake up earliest in 120
> minutes.
>
> And you even add that to the TSC which does not need it at all. It will
> wrap in about 400 years on a 2GHZ machine. So you degrade the functionality
> instead of improving it.
>
> So no, this is not going anywhere.
>
> Let's look at the problem itself:
>
>    You want to use one clocksource for timekeeping during runtime which is
>    fast and accurate and another one for suspend time injection which is
>    slower and/or less accurate because the fast one stops in suspend.
>
>    Plus you need an alarmtimer which makes sure that the clocksource does
>    not wrap around during suspend.
>
> Now lets look what we have already:
>
>    Both clocksources already exist and are registered as clocksources with
>    all required data in the clocksource core.
>
> Ergo the only sane and logical conclusion is to expand the existing
> infrastructure to handle that.
>
> When a clocksource is registered, then the registration function already
> makes decisions about using it as timekeeping clocksource. So add a few
> lines of code to check whether the newly registered clocksource is suitable
> and preferred for suspend.
>
>         if (!stops_in_suspend(newcs)) {
>                 if (!suspend_cs || is_preferred_suspend_cs(newcs))
>                         suspend_cs = newcs;
>         }
>
> The is_preferred_suspend_cs() can be based on rating, the maximum suspend
> length which can be achieved or whatever is sensible. It should start of as
> a very simple decision function based on rating and not an prematurely
> overengineered monstrosity.
>
> The suspend/resume() code needs a few very simple changes:
>
> xxx_suspend():
>         clocksource_prepare_suspend();
>
>   Note, this is _NOT_ timekeeping_suspend() because that is invoked _AFTER_
>   alarmtimer_suspend(). So if an alarm timer is required it needs to be
>   armed before that. A trivial solution might be to just call it from
>   alarmtimer_suspend(), but that a minor detail to worry about.
>
> timekeeping_suspend()
> {
>         clocksource_enter_suspend();
>         ...
>
> timekeeping_resume()
> {
> ...
>         if (clocksource_leave_suspend(&nsec)) {
>                 ts_delta = ns_to_timespec64(nsec);
>                 sleeptime_injected = true;
>         } else if (......
>
>
> Now lets look at the new functions:
>
> void clocksource_prepare_suspend(void)
> {
>         if (!suspend_cs)
>                 return;
>
>         if (needs_alarmtimer(suspend_cs))
>                 start_suspend_alarm(suspend_cs);
> }
>
> void clocksource_enter_suspend(void)
> {
>         if (!suspend_cs)
>                 return;
>         suspend_start = suspend_cs->read();
> }
>
> bool clocksource_leave_suspend(u64 *nsec)
> {
>         u64 now, delta;
>
>         if (!suspend_cs)
>                 return false;
>
>         now = suspend_cs->read();
>         delta = clocksource_delta(now, suspend_start, suspend_cs->mask);
>         *nsec = mul_u64_u32_shr(delta, suspend_cs->mult, suspend_cs->shift);
>         return true;
> }
>
> See? It does not need any of this totally nonsensical stuff in your
> registration function and not any new read functions and whatever, because
> it simply can use the bog standard mult/shift values. Why? Because the
> conversion above can cope with a full 64 bit * 32 bit multiply without
> falling apart. It's already there in timekeeping_resume() otherwise
> resuming with a NONSTOP TSC would result in bogus sleep times after a few
> minutes. It's slower than the normal clocksource conversion which is
> optimized for performance, but thats completely irrelevant on resume. This
> whole blurb about requiring separate mult/shift values is just plain
> drivel.
>
> Plus any reasonably broad clocksource will not need an alarmtimer at
> all. Because the only reason it is needed is when the clocksource itself
> wraps around. And that has absolutely nothing to do with mult/shift
> values. That just depends on the frequency and the bitwidth of the counter,
>
> So it does not need an update function either because in case of broad
> enough clocksources there is absolutely no need for update and in case of
> wrapping ones the alarmtimer brings it out of suspend on time. And because
> the only interesting thing is the delta between suspend and resume this is
> all a complete non issue.
>
> The clocksource core already has all the registration/unregistration
> functionality plus an interface to reconfigure the frequency, so
> clocksources can come and go and be reconfigured and all of this just
> works.
>
> Once the extra few lines of code are in place, then you can go and cleanup
> the existing mess of homebrewn interfaces and claim that this is
> consolidation and simplification.
>
> <rant>
>
> What's wrong with you people? Didn't they teach you in school that the
> first thing to do is proper problem and code analysis? If they did not, go
> back to them and ask your money back,
>
> I'm really tired of these overengineered trainwrecks which are then
> advertised with bullshit marketing like the best invention since sliced
> bread. This might work in random corporates, but not here.
>
> </rant>

Thomas, what is wrong with *you*?  This is completely unnecessary.

First of all, I sent Baolin on this approach, as I was getting sick of
seeing the suspend/resume timing continually adding extra warts and
complexity to handle odd hardware that needs some solution. So I
suggested he clean that up in a way that lets these multiple ways of
solving the problem be done in device specific code w/o adding more
complexity to the core logic - in a fashion how the clocksources
allowed us to support nice fast/well-behaving hardware w/o preventing
ugly-but-neccessary solutions like the pit clocksource.

Now, I've also been quite a poor maintainer of late, and haven't done
much to help with pre-review of Baloin's code.  So that's on me, not
him.

Admittedly, my taste isn't always great, so there is likely to be a
better approach, and having your guidance here is *really* valued. I
just wish we could get it without all this unnecessary vinegar.

I was really lucky to have your guidance and support when I was
starting in the community. Your helping bring me up as a co-maintainer
(only a a relatively small set of responsibility compared to your much
wider maintainership), does let me see that having the responsibility
of billions of devices running your code is immense and comes with no
small amount of stress. Juggling the infinite incoming stream of
review requests on naieve or otherwise lacking code with other
important development priorities is a major burden, so *I really get
how frustrating it is*.

And its super annoying to have lots of short-term development requests
being thrown at you when you're the one who has to maintain it for the
long term.  But long-long-term, no one is going to be a maintainer
forever, and we are not going to develop more olympic swimmers if we
keep peeing in the pool.


Baolin: My apologies for leading you poorly here. Work on something
else for a day to let the flames burn off you, then review Thomas'
technical feedback, ignoring the harsh style, and try to address his
technical comments with the approach. I'll try to do better in finding
time to review your work.

-john
Thomas Gleixner June 25, 2018, 8:06 p.m. UTC | #4
On Mon, 25 Jun 2018, John Stultz wrote:
> On Sat, Jun 23, 2018 at 5:14 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Thomas, what is wrong with *you*?  This is completely unnecessary.

Nothing is wrong with me. I just hit a point where the amount of half baken
crap tripped me over the edge and I let of steam. Sorry for that, it surely
wasn't necessary. It was one of those days where grumpiness took
control. My steam pressure is back to normal and I'm more than happy to put
that back on the pure technical level.

To be clear, that rant was not a shot at Baolin. It was definitely
addressed to the people behind him who set him on this project and then
failed to do any sanity checking.

> Now, I've also been quite a poor maintainer of late, and haven't done
> much to help with pre-review of Baloin's code.  So that's on me, not
> him.

Fair enough.

> I was really lucky to have your guidance and support when I was
> starting in the community. Your helping bring me up as a co-maintainer
> (only a a relatively small set of responsibility compared to your much
> wider maintainership), does let me see that having the responsibility
> of billions of devices running your code is immense and comes with no
> small amount of stress. Juggling the infinite incoming stream of
> review requests on naieve or otherwise lacking code with other
> important development priorities is a major burden, so *I really get
> how frustrating it is*.

Yes, and I have to admit that I'm not a super human and despite trying to
control my emotions, there are points where I fail in doing so. It makes me
explode and that's it. I can assure Baolin, that this is nothing which
sticks. I'll look at the next patchset from a pure technical level. There
are surely precautions, but definitely no prejudice.

> And its super annoying to have lots of short-term development requests
> being thrown at you when you're the one who has to maintain it for the
> long term.  But long-long-term, no one is going to be a maintainer
> forever, and we are not going to develop more olympic swimmers if we
> keep peeing in the pool.

Right, that's why I early looked for co/sub-maintainers, but as you
admitted yourself, that's not always working out and I surely would
appreciate if there would be more help on all ends. I'm the last person who
rejects help, quite the contrary I'm very happy when I can mark mails as
'none of my business, somebody else takes care of that'.

Thanks,

	tglx
(Exiting) Baolin Wang June 26, 2018, 2:08 a.m. UTC | #5
Hi Thomas,

On 24 June 2018 at 08:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 13 Jun 2018, Baolin Wang wrote:
>> Moreover we can register the clocksource with CLOCK_SOURCE_SUSPEND_NONSTOP
>> to be one persistent clock, then we can simplify the suspend/resume
>> accounting by removing CLOCK_SOURCE_SUSPEND_NONSTOP timing. After that
>> we can only compensate the OS time by persistent clock or RTC.
>
> That makes sense because it adds a gazillion lines of code and removes 5?
> Not really,
>
>> +/**
>> + * persistent_clock_read_data - data required to read persistent clock
>> + * @read: Returns a cycle value from persistent clock.
>> + * @last_cycles: Clock cycle value at last update.
>> + * @last_ns: Time value (nanoseconds) at last update.
>> + * @mask: Bitmask for two's complement subtraction of non 64bit clocks.
>> + * @mult: Cycle to nanosecond multiplier.
>> + * @shift: Cycle to nanosecond divisor.
>> + */
>> +struct persistent_clock_read_data {
>> +     u64 (*read)(void);
>> +     u64 last_cycles;
>> +     u64 last_ns;
>> +     u64 mask;
>> +     u32 mult;
>> +     u32 shift;
>> +};
>> +/**
>> + * persistent_clock - represent the persistent clock
>> + * @read_data: Data required to read from persistent clock.
>> + * @seq: Sequence counter for protecting updates.
>> + * @freq: The frequency of the persistent clock.
>> + * @wrap: Duration for persistent clock can run before wrapping.
>> + * @alarm: Update timeout for persistent clock wrap.
>> + * @alarm_inited: Indicate if the alarm has been initialized.
>> + */
>> +struct persistent_clock {
>> +     struct persistent_clock_read_data read_data;
>> +     seqcount_t seq;
>> +     u32 freq;
>> +     ktime_t wrap;
>> +     struct alarm alarm;
>> +     bool alarm_inited;
>> +};
>
> NAK!
>
> There is no reason to invent yet another set of data structures and yet
> more read functions with a sequence counter. which are just a bad and
> broken copy of the existing timekeeping/clocksource code. And of course the
> stuff is not serialized against multiple registrations, etc. etc.
>
> Plus the utter nonsense that any call site has to do the same thing over
> and over:
>
>     register():
>     start_alarm_timer();
>
> Why is this required in the first place? It's not at all. The only place
> where such an alarm timer will be required is when the system actually goes
> to suspend. Starting it at registration time is pointless and even counter
> productive. Assume the clocksource wraps every 2 hours. So you start it at
> boot time and after 119 minutes uptime the system suspends. So it will
> wakeup one minute later to update the clocksource. Heck no. If the timer is
> started when the machine actually suspends it will wake up earliest in 120
> minutes.
>
> And you even add that to the TSC which does not need it at all. It will
> wrap in about 400 years on a 2GHZ machine. So you degrade the functionality
> instead of improving it.
>
> So no, this is not going anywhere.
>
> Let's look at the problem itself:
>
>    You want to use one clocksource for timekeeping during runtime which is
>    fast and accurate and another one for suspend time injection which is
>    slower and/or less accurate because the fast one stops in suspend.
>
>    Plus you need an alarmtimer which makes sure that the clocksource does
>    not wrap around during suspend.
>
> Now lets look what we have already:
>
>    Both clocksources already exist and are registered as clocksources with
>    all required data in the clocksource core.
>
> Ergo the only sane and logical conclusion is to expand the existing
> infrastructure to handle that.
>
> When a clocksource is registered, then the registration function already
> makes decisions about using it as timekeeping clocksource. So add a few
> lines of code to check whether the newly registered clocksource is suitable
> and preferred for suspend.
>
>         if (!stops_in_suspend(newcs)) {
>                 if (!suspend_cs || is_preferred_suspend_cs(newcs))
>                         suspend_cs = newcs;
>         }
>
> The is_preferred_suspend_cs() can be based on rating, the maximum suspend
> length which can be achieved or whatever is sensible. It should start of as
> a very simple decision function based on rating and not an prematurely
> overengineered monstrosity.
>
> The suspend/resume() code needs a few very simple changes:
>
> xxx_suspend():
>         clocksource_prepare_suspend();
>
>   Note, this is _NOT_ timekeeping_suspend() because that is invoked _AFTER_
>   alarmtimer_suspend(). So if an alarm timer is required it needs to be
>   armed before that. A trivial solution might be to just call it from
>   alarmtimer_suspend(), but that a minor detail to worry about.
>
> timekeeping_suspend()
> {
>         clocksource_enter_suspend();
>         ...
>
> timekeeping_resume()
> {
> ...
>         if (clocksource_leave_suspend(&nsec)) {
>                 ts_delta = ns_to_timespec64(nsec);
>                 sleeptime_injected = true;
>         } else if (......
>
>
> Now lets look at the new functions:
>
> void clocksource_prepare_suspend(void)
> {
>         if (!suspend_cs)
>                 return;
>
>         if (needs_alarmtimer(suspend_cs))
>                 start_suspend_alarm(suspend_cs);
> }
>
> void clocksource_enter_suspend(void)
> {
>         if (!suspend_cs)
>                 return;
>         suspend_start = suspend_cs->read();
> }
>
> bool clocksource_leave_suspend(u64 *nsec)
> {
>         u64 now, delta;
>
>         if (!suspend_cs)
>                 return false;
>
>         now = suspend_cs->read();
>         delta = clocksource_delta(now, suspend_start, suspend_cs->mask);
>         *nsec = mul_u64_u32_shr(delta, suspend_cs->mult, suspend_cs->shift);
>         return true;
> }
>
> See? It does not need any of this totally nonsensical stuff in your
> registration function and not any new read functions and whatever, because
> it simply can use the bog standard mult/shift values. Why? Because the
> conversion above can cope with a full 64 bit * 32 bit multiply without
> falling apart. It's already there in timekeeping_resume() otherwise
> resuming with a NONSTOP TSC would result in bogus sleep times after a few
> minutes. It's slower than the normal clocksource conversion which is
> optimized for performance, but thats completely irrelevant on resume. This
> whole blurb about requiring separate mult/shift values is just plain
> drivel.
>
> Plus any reasonably broad clocksource will not need an alarmtimer at
> all. Because the only reason it is needed is when the clocksource itself
> wraps around. And that has absolutely nothing to do with mult/shift
> values. That just depends on the frequency and the bitwidth of the counter,
>
> So it does not need an update function either because in case of broad
> enough clocksources there is absolutely no need for update and in case of
> wrapping ones the alarmtimer brings it out of suspend on time. And because
> the only interesting thing is the delta between suspend and resume this is
> all a complete non issue.
>
> The clocksource core already has all the registration/unregistration
> functionality plus an interface to reconfigure the frequency, so
> clocksources can come and go and be reconfigured and all of this just
> works.
>
> Once the extra few lines of code are in place, then you can go and cleanup
> the existing mess of homebrewn interfaces and claim that this is
> consolidation and simplification.

I am very grateful for your suggestion, and I am sorry I made a stupid
mistake that I missed the mul_u64_u32_shr() function which can avoid
introducing extra mult/shift for suspend clocksource. I will use that
and follow your other suggestions too. Thanks again.

>
> <rant>
>
> What's wrong with you people? Didn't they teach you in school that the
> first thing to do is proper problem and code analysis? If they did not, go
> back to them and ask your money back,
>
> I'm really tired of these overengineered trainwrecks which are then
> advertised with bullshit marketing like the best invention since sliced
> bread. This might work in random corporates, but not here.
>
> </rant>
>
> Thanks,
>
>         tglx



---
Baolin Wang
Best Regards
(Exiting) Baolin Wang June 26, 2018, 2:12 a.m. UTC | #6
Hi John,

On 26 June 2018 at 01:23, John Stultz <john.stultz@linaro.org> wrote:
> On Sat, Jun 23, 2018 at 5:14 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Wed, 13 Jun 2018, Baolin Wang wrote:
>>> Moreover we can register the clocksource with CLOCK_SOURCE_SUSPEND_NONSTOP
>>> to be one persistent clock, then we can simplify the suspend/resume
>>> accounting by removing CLOCK_SOURCE_SUSPEND_NONSTOP timing. After that
>>> we can only compensate the OS time by persistent clock or RTC.
>>
>> That makes sense because it adds a gazillion lines of code and removes 5?
>> Not really,
>>
>>> +/**
>>> + * persistent_clock_read_data - data required to read persistent clock
>>> + * @read: Returns a cycle value from persistent clock.
>>> + * @last_cycles: Clock cycle value at last update.
>>> + * @last_ns: Time value (nanoseconds) at last update.
>>> + * @mask: Bitmask for two's complement subtraction of non 64bit clocks.
>>> + * @mult: Cycle to nanosecond multiplier.
>>> + * @shift: Cycle to nanosecond divisor.
>>> + */
>>> +struct persistent_clock_read_data {
>>> +     u64 (*read)(void);
>>> +     u64 last_cycles;
>>> +     u64 last_ns;
>>> +     u64 mask;
>>> +     u32 mult;
>>> +     u32 shift;
>>> +};
>>> +/**
>>> + * persistent_clock - represent the persistent clock
>>> + * @read_data: Data required to read from persistent clock.
>>> + * @seq: Sequence counter for protecting updates.
>>> + * @freq: The frequency of the persistent clock.
>>> + * @wrap: Duration for persistent clock can run before wrapping.
>>> + * @alarm: Update timeout for persistent clock wrap.
>>> + * @alarm_inited: Indicate if the alarm has been initialized.
>>> + */
>>> +struct persistent_clock {
>>> +     struct persistent_clock_read_data read_data;
>>> +     seqcount_t seq;
>>> +     u32 freq;
>>> +     ktime_t wrap;
>>> +     struct alarm alarm;
>>> +     bool alarm_inited;
>>> +};
>>
>> NAK!
>>
>> There is no reason to invent yet another set of data structures and yet
>> more read functions with a sequence counter. which are just a bad and
>> broken copy of the existing timekeeping/clocksource code. And of course the
>> stuff is not serialized against multiple registrations, etc. etc.
>>
>> Plus the utter nonsense that any call site has to do the same thing over
>> and over:
>>
>>     register():
>>     start_alarm_timer();
>>
>> Why is this required in the first place? It's not at all. The only place
>> where such an alarm timer will be required is when the system actually goes
>> to suspend. Starting it at registration time is pointless and even counter
>> productive. Assume the clocksource wraps every 2 hours. So you start it at
>> boot time and after 119 minutes uptime the system suspends. So it will
>> wakeup one minute later to update the clocksource. Heck no. If the timer is
>> started when the machine actually suspends it will wake up earliest in 120
>> minutes.
>>
>> And you even add that to the TSC which does not need it at all. It will
>> wrap in about 400 years on a 2GHZ machine. So you degrade the functionality
>> instead of improving it.
>>
>> So no, this is not going anywhere.
>>
>> Let's look at the problem itself:
>>
>>    You want to use one clocksource for timekeeping during runtime which is
>>    fast and accurate and another one for suspend time injection which is
>>    slower and/or less accurate because the fast one stops in suspend.
>>
>>    Plus you need an alarmtimer which makes sure that the clocksource does
>>    not wrap around during suspend.
>>
>> Now lets look what we have already:
>>
>>    Both clocksources already exist and are registered as clocksources with
>>    all required data in the clocksource core.
>>
>> Ergo the only sane and logical conclusion is to expand the existing
>> infrastructure to handle that.
>>
>> When a clocksource is registered, then the registration function already
>> makes decisions about using it as timekeeping clocksource. So add a few
>> lines of code to check whether the newly registered clocksource is suitable
>> and preferred for suspend.
>>
>>         if (!stops_in_suspend(newcs)) {
>>                 if (!suspend_cs || is_preferred_suspend_cs(newcs))
>>                         suspend_cs = newcs;
>>         }
>>
>> The is_preferred_suspend_cs() can be based on rating, the maximum suspend
>> length which can be achieved or whatever is sensible. It should start of as
>> a very simple decision function based on rating and not an prematurely
>> overengineered monstrosity.
>>
>> The suspend/resume() code needs a few very simple changes:
>>
>> xxx_suspend():
>>         clocksource_prepare_suspend();
>>
>>   Note, this is _NOT_ timekeeping_suspend() because that is invoked _AFTER_
>>   alarmtimer_suspend(). So if an alarm timer is required it needs to be
>>   armed before that. A trivial solution might be to just call it from
>>   alarmtimer_suspend(), but that a minor detail to worry about.
>>
>> timekeeping_suspend()
>> {
>>         clocksource_enter_suspend();
>>         ...
>>
>> timekeeping_resume()
>> {
>> ...
>>         if (clocksource_leave_suspend(&nsec)) {
>>                 ts_delta = ns_to_timespec64(nsec);
>>                 sleeptime_injected = true;
>>         } else if (......
>>
>>
>> Now lets look at the new functions:
>>
>> void clocksource_prepare_suspend(void)
>> {
>>         if (!suspend_cs)
>>                 return;
>>
>>         if (needs_alarmtimer(suspend_cs))
>>                 start_suspend_alarm(suspend_cs);
>> }
>>
>> void clocksource_enter_suspend(void)
>> {
>>         if (!suspend_cs)
>>                 return;
>>         suspend_start = suspend_cs->read();
>> }
>>
>> bool clocksource_leave_suspend(u64 *nsec)
>> {
>>         u64 now, delta;
>>
>>         if (!suspend_cs)
>>                 return false;
>>
>>         now = suspend_cs->read();
>>         delta = clocksource_delta(now, suspend_start, suspend_cs->mask);
>>         *nsec = mul_u64_u32_shr(delta, suspend_cs->mult, suspend_cs->shift);
>>         return true;
>> }
>>
>> See? It does not need any of this totally nonsensical stuff in your
>> registration function and not any new read functions and whatever, because
>> it simply can use the bog standard mult/shift values. Why? Because the
>> conversion above can cope with a full 64 bit * 32 bit multiply without
>> falling apart. It's already there in timekeeping_resume() otherwise
>> resuming with a NONSTOP TSC would result in bogus sleep times after a few
>> minutes. It's slower than the normal clocksource conversion which is
>> optimized for performance, but thats completely irrelevant on resume. This
>> whole blurb about requiring separate mult/shift values is just plain
>> drivel.
>>
>> Plus any reasonably broad clocksource will not need an alarmtimer at
>> all. Because the only reason it is needed is when the clocksource itself
>> wraps around. And that has absolutely nothing to do with mult/shift
>> values. That just depends on the frequency and the bitwidth of the counter,
>>
>> So it does not need an update function either because in case of broad
>> enough clocksources there is absolutely no need for update and in case of
>> wrapping ones the alarmtimer brings it out of suspend on time. And because
>> the only interesting thing is the delta between suspend and resume this is
>> all a complete non issue.
>>
>> The clocksource core already has all the registration/unregistration
>> functionality plus an interface to reconfigure the frequency, so
>> clocksources can come and go and be reconfigured and all of this just
>> works.
>>
>> Once the extra few lines of code are in place, then you can go and cleanup
>> the existing mess of homebrewn interfaces and claim that this is
>> consolidation and simplification.
>>
>> <rant>
>>
>> What's wrong with you people? Didn't they teach you in school that the
>> first thing to do is proper problem and code analysis? If they did not, go
>> back to them and ask your money back,
>>
>> I'm really tired of these overengineered trainwrecks which are then
>> advertised with bullshit marketing like the best invention since sliced
>> bread. This might work in random corporates, but not here.
>>
>> </rant>
>
> Thomas, what is wrong with *you*?  This is completely unnecessary.
>
> First of all, I sent Baolin on this approach, as I was getting sick of
> seeing the suspend/resume timing continually adding extra warts and
> complexity to handle odd hardware that needs some solution. So I
> suggested he clean that up in a way that lets these multiple ways of
> solving the problem be done in device specific code w/o adding more
> complexity to the core logic - in a fashion how the clocksources
> allowed us to support nice fast/well-behaving hardware w/o preventing
> ugly-but-neccessary solutions like the pit clocksource.
>
> Now, I've also been quite a poor maintainer of late, and haven't done
> much to help with pre-review of Baloin's code.  So that's on me, not
> him.
>
> Admittedly, my taste isn't always great, so there is likely to be a
> better approach, and having your guidance here is *really* valued. I
> just wish we could get it without all this unnecessary vinegar.
>
> I was really lucky to have your guidance and support when I was
> starting in the community. Your helping bring me up as a co-maintainer
> (only a a relatively small set of responsibility compared to your much
> wider maintainership), does let me see that having the responsibility
> of billions of devices running your code is immense and comes with no
> small amount of stress. Juggling the infinite incoming stream of
> review requests on naieve or otherwise lacking code with other
> important development priorities is a major burden, so *I really get
> how frustrating it is*.
>
> And its super annoying to have lots of short-term development requests
> being thrown at you when you're the one who has to maintain it for the
> long term.  But long-long-term, no one is going to be a maintainer
> forever, and we are not going to develop more olympic swimmers if we
> keep peeing in the pool.
>
>
> Baolin: My apologies for leading you poorly here. Work on something
> else for a day to let the flames burn off you, then review Thomas'
> technical feedback, ignoring the harsh style, and try to address his
> technical comments with the approach. I'll try to do better in finding
> time to review your work.

Thanks John, I will follow Thomas' suggestion and re-implement the approach.

---
Baolin Wang
Best Regards
diff mbox

Patch

diff --git a/include/linux/persistent_clock.h b/include/linux/persistent_clock.h
new file mode 100644
index 0000000..7d42c1a
--- /dev/null
+++ b/include/linux/persistent_clock.h
@@ -0,0 +1,23 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __PERSISTENT_CLOCK_H__
+#define __PERSISTENT_CLOCK_H__
+
+#ifdef CONFIG_PERSISTENT_CLOCK
+extern int persistent_clock_init_and_register(u64 (*read)(void),
+					      u64 mask, u32 freq,
+					      u64 maxsec);
+extern void persistent_clock_cleanup(void);
+extern void persistent_clock_start_alarmtimer(void);
+#else
+static inline int persistent_clock_init_and_register(u64 (*read)(void),
+						     u64 mask, u32 freq,
+						     u64 maxsec)
+{
+	return 0;
+}
+
+static inline void persistent_clock_cleanup(void) { }
+static inline void persistent_clock_start_alarmtimer(void) { }
+#endif
+
+#endif
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 78eabc4..7188600 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -47,6 +47,10 @@  config GENERIC_CLOCKEVENTS_MIN_ADJUST
 config GENERIC_CMOS_UPDATE
 	bool
 
+# Persistent clock support
+config PERSISTENT_CLOCK
+	bool
+
 if GENERIC_CLOCKEVENTS
 menu "Timers subsystem"
 
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index f1e46f3..f6d368f 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -18,3 +18,4 @@  obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o tick-sched.o
 obj-$(CONFIG_DEBUG_FS)				+= timekeeping_debug.o
 obj-$(CONFIG_TEST_UDELAY)			+= test_udelay.o
+obj-$(CONFIG_PERSISTENT_CLOCK)			+= persistent_clock.o
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 639321b..1518fdb 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -29,6 +29,7 @@ 
 #include <linux/freezer.h>
 #include <linux/compat.h>
 #include <linux/module.h>
+#include <linux/persistent_clock.h>
 
 #include "posix-timers.h"
 
@@ -892,6 +893,9 @@  static int __init alarmtimer_init(void)
 		error = PTR_ERR(pdev);
 		goto out_drv;
 	}
+
+	/* Start one alarmtimer to update persistent clock */
+	persistent_clock_start_alarmtimer();
 	return 0;
 
 out_drv:
diff --git a/kernel/time/persistent_clock.c b/kernel/time/persistent_clock.c
new file mode 100644
index 0000000..edaa659
--- /dev/null
+++ b/kernel/time/persistent_clock.c
@@ -0,0 +1,184 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Linaro, Inc.
+ *
+ * Author: Baolin Wang <baolin.wang@linaro.org>
+ */
+
+#include <linux/alarmtimer.h>
+#include <linux/clocksource.h>
+#include <linux/persistent_clock.h>
+
+/**
+ * persistent_clock_read_data - data required to read persistent clock
+ * @read: Returns a cycle value from persistent clock.
+ * @last_cycles: Clock cycle value at last update.
+ * @last_ns: Time value (nanoseconds) at last update.
+ * @mask: Bitmask for two's complement subtraction of non 64bit clocks.
+ * @mult: Cycle to nanosecond multiplier.
+ * @shift: Cycle to nanosecond divisor.
+ */
+struct persistent_clock_read_data {
+	u64 (*read)(void);
+	u64 last_cycles;
+	u64 last_ns;
+	u64 mask;
+	u32 mult;
+	u32 shift;
+};
+
+/**
+ * persistent_clock - represent the persistent clock
+ * @read_data: Data required to read from persistent clock.
+ * @seq: Sequence counter for protecting updates.
+ * @freq: The frequency of the persistent clock.
+ * @wrap: Duration for persistent clock can run before wrapping.
+ * @alarm: Update timeout for persistent clock wrap.
+ * @alarm_inited: Indicate if the alarm has been initialized.
+ */
+struct persistent_clock {
+	struct persistent_clock_read_data read_data;
+	seqcount_t seq;
+	u32 freq;
+	ktime_t wrap;
+	struct alarm alarm;
+	bool alarm_inited;
+};
+
+static struct persistent_clock p;
+
+void read_persistent_clock64(struct timespec64 *ts)
+{
+	struct persistent_clock_read_data *read_data = &p.read_data;
+	unsigned long seq;
+	u64 delta, nsecs;
+
+	if (!read_data->read) {
+		ts->tv_sec = 0;
+		ts->tv_nsec = 0;
+		return;
+	}
+
+	do {
+		seq = read_seqcount_begin(&p.seq);
+		delta = (read_data->read() - read_data->last_cycles) &
+			read_data->mask;
+
+		nsecs = read_data->last_ns +
+			clocksource_cyc2ns(delta, read_data->mult,
+					   read_data->shift);
+		*ts = ns_to_timespec64(nsecs);
+	} while (read_seqcount_retry(&p.seq, seq));
+}
+
+static void persistent_clock_update(void)
+{
+	struct persistent_clock_read_data *read_data = &p.read_data;
+	u64 cycles, delta;
+
+	write_seqcount_begin(&p.seq);
+
+	cycles = read_data->read();
+	delta = (cycles - read_data->last_cycles) & read_data->mask;
+	read_data->last_ns += clocksource_cyc2ns(delta, read_data->mult,
+						 read_data->shift);
+	read_data->last_cycles = cycles;
+
+	write_seqcount_end(&p.seq);
+}
+
+static enum alarmtimer_restart persistent_clock_alarm_fired(struct alarm *alarm,
+							    ktime_t now)
+{
+	persistent_clock_update();
+
+	alarm_forward(&p.alarm, now, p.wrap);
+	return ALARMTIMER_RESTART;
+}
+
+int persistent_clock_init_and_register(u64 (*read)(void), u64 mask,
+				       u32 freq, u64 maxsec)
+{
+	struct persistent_clock_read_data *read_data = &p.read_data;
+	u64 wrap, res, secs = maxsec;
+
+	if (!read || !mask || !freq)
+		return -EINVAL;
+
+	if (!secs) {
+		/*
+		 * If the timer driver did not specify the maximum conversion
+		 * seconds of the persistent clock, then we calculate the
+		 * conversion range with the persistent clock's bits and
+		 * frequency.
+		 */
+		secs = mask;
+		do_div(secs, freq);
+
+		/*
+		 * Some persistent counter can be larger than 32bit, so we
+		 * need limit the max suspend time to have a good conversion
+		 * precision. So 24 hours may be enough usually.
+		 */
+		if (secs > 86400)
+			secs = 86400;
+	}
+
+	/* Calculate the mult/shift to convert cycles to ns. */
+	clocks_calc_mult_shift(&read_data->mult, &read_data->shift, freq,
+			       NSEC_PER_SEC, (u32)secs);
+
+	/* Calculate how many nanoseconds until we risk wrapping. */
+	wrap = clocks_calc_max_nsecs(read_data->mult, read_data->shift, 0,
+				     mask, NULL);
+	p.wrap = ns_to_ktime(wrap);
+
+	p.freq = freq;
+	read_data->mask = mask;
+	read_data->read = read;
+
+	persistent_clock_update();
+
+	/* Calculate the ns resolution of this persistent clock. */
+	res = clocksource_cyc2ns(1ULL, read_data->mult, read_data->shift);
+
+	pr_info("persistent clock: mask %llu at %uHz, resolution %lluns, wraps every %lluns\n",
+		mask, freq, res, wrap);
+	return 0;
+}
+
+void persistent_clock_cleanup(void)
+{
+	p.read_data.read = NULL;
+
+	if (p.alarm_inited) {
+		alarm_cancel(&p.alarm);
+		p.alarm_inited = false;
+	}
+}
+
+void persistent_clock_start_alarmtimer(void)
+{
+	struct persistent_clock_read_data *read_data = &p.read_data;
+	ktime_t now;
+
+	/*
+	 * If no persistent clock function has been provided or the alarmtimer
+	 * has been initialized at that point, just return.
+	 */
+	if (!read_data->read || p.alarm_inited)
+		return;
+
+	persistent_clock_update();
+
+	/*
+	 * Since the persistent clock will not be stopped when system enters the
+	 * suspend state, thus we need start one alarmtimer to wakeup the system
+	 * to update the persistent clock before wrapping. We should start the
+	 * update alarmtimer after the alarmtimer subsystem was initialized.
+	 */
+	alarm_init(&p.alarm, ALARM_BOOTTIME, persistent_clock_alarm_fired);
+	now = ktime_get_boottime();
+	alarm_start(&p.alarm, ktime_add(now, p.wrap));
+	p.alarm_inited = true;
+}