[v2,6/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT

Message ID 1459259051-4943-7-git-send-email-joao.m.martins@oracle.com (mailing list archive)

Commit Message

Joao Martins March 29, 2016, 1:44 p.m. UTC
When using TSC as clocksource we will rely solely on TSC for updating
vcpu time infos (pvti). Right now, each vCPU takes its tsc_timestamp
at a different instant, i.e. at EPOCH + delta.  This delta varies with
the time at which the CPU calibrates against CPU 0 (master), and will
likely differ across vCPUs. This means that each vCPU's pvti won't
account for its calibration error, which could lead to time going
backwards and allow a situation where the time read on vCPU B
immediately after vCPU A is smaller. While this doesn't happen often,
I was able to observe (for clocksource=tsc) around 50 warps of
< 100 ns in an hour.

This patch proposes relying on host TSC synchronization and
passthrough to the guest, when running on a TSC-safe platform. On
time_calibration we retrieve the platform time in ns and the counter
read by the clocksource that was used to compute system time. We
introduce a new rendezvous function which doesn't require
synchronization between master and slave CPUs and just reads the
calibration_rendezvous struct and writes the stime and TSC stamp into
the per-CPU cpu_calibration struct to be used later on. We can
guarantee that, on a platform with a constant and reliable TSC, the
time read on vCPU B right after vCPU A is bigger, independently of the
CPU calibration error. Since pvclock time infos are then monotonic as
seen by any vCPU, we set PVCLOCK_TSC_STABLE_BIT, which in turn enables
usage of the vDSO clock on Linux. IIUC, this is similar to how it's
implemented on KVM.

Note that PVCLOCK_TSC_STABLE_BIT is set only when CPU hotplug isn't
meant to be performed on the host, which is the case either when
nr_cpu_ids and num_present_cpus() are the same or when the
"nocpuhotplug" command line parameter is used. This is because a newly
hotplugged CPU may not satisfy the condition of having all TSCs
synchronized.
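
For reference, here is a rough guest-side sketch of why the bit matters
(not part of this patch, and not verbatim Linux code): without
PVCLOCK_TSC_STABLE_BIT a guest has to clamp every reading against a
globally shared last value to stay monotonic across vCPUs, which is also
what keeps the fast vDSO path disabled; with the bit set that clamp can
be skipped. Structure layout follows the pvclock ABI; seqlock barriers
are omitted for brevity.

/*
 * Hedged sketch of a pvclock reader; layout per the pvclock ABI,
 * memory barriers omitted for brevity.
 */
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

struct pvclock_vcpu_time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
};

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile ( "rdtsc" : "=a" (lo), "=d" (hi) );
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t last_value;   /* global clamp, only needed without the bit */

static uint64_t pvclock_read_ns(const volatile struct pvclock_vcpu_time_info *pvti)
{
    uint32_t version;
    uint64_t delta, ns;

    do {
        version = pvti->version;              /* odd while being updated */
        delta = rdtsc() - pvti->tsc_timestamp;
        if ( pvti->tsc_shift < 0 )
            delta >>= -pvti->tsc_shift;
        else
            delta <<= pvti->tsc_shift;
        ns = pvti->system_time +
             (uint64_t)(((__uint128_t)delta * pvti->tsc_to_system_mul) >> 32);
    } while ( (version & 1) || version != pvti->version );

    if ( pvti->flags & PVCLOCK_TSC_STABLE_BIT )
        return ns;                            /* host guarantees monotonicity */

    /* Simplified (non-atomic) fallback clamp across vCPUs. */
    if ( ns < last_value )
        ns = last_value;
    last_value = ns;
    return ns;
}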

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Perhaps "cpuhotplugsafe" would be a better name, since potentially
hardware could guarantee TSCs are synchronized on hotplug?

Changes since v1:
 - Change approach to follow Andrew's guideline to skip std_rendezvous,
 doing so by introducing a nop_rendezvous.
 - Change commit message reflecting the change above.
 - Use TSC_STABLE_BIT only if cpu hotplug isn't possible.
 - Add command line option to override it if no cpu hotplug is
 intended.
---
 xen/arch/x86/time.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

Comments

Jan Beulich April 5, 2016, 12:22 p.m. UTC | #1
>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -43,6 +43,10 @@
>  static char __initdata opt_clocksource[10];
>  string_param("clocksource", opt_clocksource);
>  
> +/* opt_nocpuhotplug: Set if CPU hotplug isn't meant to be used */
> +static bool_t __initdata opt_nocpuhotplug;
> +boolean_param("nocpuhotplug", opt_nocpuhotplug);

If we were to have such a new option, it would need to be
accompanied by an entry in the command line option doc. But
I'm opposed to this: For one, the variable being static here
means there is nothing that actually suppresses CPU hotplug
from happening. And then I think this can, for all practical purposes,
be had by suitably using existing command line options, namely
"max_cpus=", such that set_nr_cpu_ids() won't allow for any
further CPUs to get added. Albeit I admit that if someone was
to bring down some CPU and then hotplug another one, we
might still be in trouble. So maybe the better approach would
be to fail onlining of CPUs that don't meet the criteria when
"clocksource=tsc"?

> @@ -435,6 +439,7 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>   * PLATFORM TIMER 4: TSC
>   */
>  static bool_t clocksource_is_tsc;
> +static bool_t use_tsc_stable_bit;

__read_mostly?

> @@ -468,6 +473,11 @@ static int __init init_tsctimer(struct platform_timesource *pts)
>  
>      pts->frequency = tsc_freq;
>      clocksource_is_tsc = tsc_reliable;
> +    use_tsc_stable_bit = clocksource_is_tsc &&
> +        ((nr_cpu_ids == num_present_cpus()) || opt_nocpuhotplug);

See my remark above regarding the reliability of this.

> @@ -1431,6 +1443,22 @@ static void time_calibration_std_rendezvous(void *_r)
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
>  }
>  
> +/*
> + * Rendezvous function used when clocksource is TSC and
> + * no CPU hotplug will be performed.
> + */
> +static void time_calibration_nop_rendezvous(void *_r)

Even if done so in existing code - no new local variable or function
parameters starting with an underscore please.

> +{
> +    struct cpu_calibration *c = &this_cpu(cpu_calibration);
> +    struct calibration_rendezvous *r = _r;

const

> +    c->local_tsc_stamp = r->master_tsc_stamp;
> +    c->stime_local_stamp = get_s_time();
> +    c->stime_master_stamp = r->master_stime;
> +
> +    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> +}

Perhaps this whole function should move up ahead of the other
two, so that they both can use this one instead of now duplicating
the same code a 3rd time? Or maybe a new helper function would
be better, to also account for the difference in what
c->local_tsc_stamp gets set from (which could then become a
parameter of that new function).

> @@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
>          .semaphore = ATOMIC_INIT(0)
>      };
>  
> +    if ( use_tsc_stable_bit )
> +    {
> +        local_irq_disable();
> +        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
> +        local_irq_enable();
> +    }

So this can't be in time_calibration_nop_rendezvous() because
you want to avoid the actual rendezvousing. But isn't the then
possibly much larger gap between read_platform_stime() (which
parallels the rdtsc()-s in the other two cases) and get_s_time()
invocation going to become a problem?

Jan
Joao Martins April 5, 2016, 9:34 p.m. UTC | #2
On 04/05/2016 01:22 PM, Jan Beulich wrote:
>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -43,6 +43,10 @@
>>  static char __initdata opt_clocksource[10];
>>  string_param("clocksource", opt_clocksource);
>>  
>> +/* opt_nocpuhotplug: Set if CPU hotplug isn't meant to be used */
>> +static bool_t __initdata opt_nocpuhotplug;
>> +boolean_param("nocpuhotplug", opt_nocpuhotplug);
> 
> If we were to have such a new option, it would need to be
> accompanied by an entry in the command line option doc.
Yes, Konrad pointed that out too - and I had it documented already
for the next version. But given your argument below, it might not
even be needed to add this option.

> But
> I'm opposed to this: For one, the variable being static here
> means there is nothing that actually suppresses CPU hotplug
> to happen.
> And then I think this can, for all practical purposes,
> be had by suitably using existing command line options, namely
> "max_cpus=", such that set_nr_cpu_ids() won't allow for any
> further CPUs to get added. Albeit I admit that if someone was
> to bring down some CPU and then hotplug another one, we
> might still be in trouble. So maybe the better approach would
> be to fail onlining of CPUs that don't meet the criteria when
> "clocksource=tsc"?
True - max_cpus would produce the same effect. But I should point out
that even with clocksource=tsc the rendezvous would be std_rendezvous, so the
reference TSC is CPU 0 and the tsc_timestamps are those of the individual
CPUs. So perhaps the criteria would be clocksource=tsc plus use_tsc_stable_bit.


> 
>> @@ -435,6 +439,7 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>   * PLATFORM TIMER 4: TSC
>>   */
>>  static bool_t clocksource_is_tsc;
>> +static bool_t use_tsc_stable_bit;
> 
> __read_mostly?
Yeah - I will add it there.

> 
>> @@ -468,6 +473,11 @@ static int __init init_tsctimer(struct platform_timesource *pts)
>>  
>>      pts->frequency = tsc_freq;
>>      clocksource_is_tsc = tsc_reliable;
>> +    use_tsc_stable_bit = clocksource_is_tsc &&
>> +        ((nr_cpu_ids == num_present_cpus()) || opt_nocpuhotplug);
> 
> See my remark above regarding the reliability of this.
> 
>> @@ -1431,6 +1443,22 @@ static void time_calibration_std_rendezvous(void *_r)
>>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
>>  }
>>  
>> +/*
>> + * Rendezvous function used when clocksource is TSC and
>> + * no CPU hotplug will be performed.
>> + */
>> +static void time_calibration_nop_rendezvous(void *_r)
> 
> Even if done so in existing code - no new local variable or function
> parameters starting with an underscore please.
OK.

> 
>> +{
>> +    struct cpu_calibration *c = &this_cpu(cpu_calibration);
>> +    struct calibration_rendezvous *r = _r;
> 
> const
> 
>> +    c->local_tsc_stamp = r->master_tsc_stamp;
>> +    c->stime_local_stamp = get_s_time();
>> +    c->stime_master_stamp = r->master_stime;
>> +
>> +    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
>> +}
> 
> Perhaps this whole function should move up ahead of the other
> two, so that they both can use this one instead of now duplicating
> the same code a 3rd time? Or maybe a new helper function would
> be better, to also account for the difference in what
> c->local_tsc_stamp gets set from (which could then become a
> parameter of that new function).
The refactoring you suggest sounds like a good idea indeed, as that
code is shared across all the rendezvous functions - I will do so
following the guideline you advised.

> 
>> @@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
>>          .semaphore = ATOMIC_INIT(0)
>>      };
>>  
>> +    if ( use_tsc_stable_bit )
>> +    {
>> +        local_irq_disable();
>> +        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
>> +        local_irq_enable();
>> +    }
> 
> So this can't be in time_calibration_nop_rendezvous() because
> you want to avoid the actual rendezvousing. But isn't the then
> possibly much larger gap between read_platform_stime() (which
> parallels the rdtsc()-s in the other two cases) and get_s_time()
> invocation going to become a problem?
Perhaps I am not seeing the potential problem of this. The main
difference I see between the two would be the base system time: read_platform_stime
uses stime_platform_stamp as base, and computes the delta of the
read_counter (i.e. rdtsc()) from the previously saved platform-wide stamp
(platform_timer_stamp). get_s_time uses stime_local_stamp (updated from
stime_master_stamp in local_time_calibration) as base plus the delta of rdtsc()
from local_tsc_stamp. And since this is now all TSC, and the TSC monotonically
increases and is synchronized across CPUs, both calls would end up returning the
same, always up-to-date value, whether or not cpu_time has a larger gap
from stime_platform_stamp. Unless the concern you are raising comes from the
fact that CPU 0 calibrates much sooner than the last calibrated CPU, as opposed to
roughly at the same time as with std_rendezvous?

Joao
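
To make the comparison above concrete, a rough model of what the two
paths reduce to once the platform counter is the TSC. This is not the
actual Xen code: the frequency scaling and per-CPU indirection are
simplified away, and a fictitious 1-tick-per-ns TSC is assumed.

/*
 * Hedged sketch: with clocksource=tsc both paths are
 * "base stime + TSC ticks elapsed since the matching TSC stamp",
 * so with a synchronized, monotonic TSC either one returns the
 * current time regardless of how old its base pair is.
 */
#include <stdint.h>

typedef int64_t s_time_t;

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile ( "rdtsc" : "=a" (lo), "=d" (hi) );
    return ((uint64_t)hi << 32) | lo;
}

/* Stand-in for the real frequency scaling: assume 1 TSC tick == 1 ns. */
static s_time_t tsc_to_ns(uint64_t tsc_delta) { return (s_time_t)tsc_delta; }

/* Platform-wide stamp pair. */
static s_time_t stime_platform_stamp;
static uint64_t platform_timer_stamp;

/* Per-CPU stamp pair, updated in local_time_calibration(). */
static s_time_t stime_local_stamp;
static uint64_t local_tsc_stamp;

/* Roughly what read_platform_stime() computes. */
static s_time_t platform_view(void)
{
    return stime_platform_stamp + tsc_to_ns(rdtsc() - platform_timer_stamp);
}

/* Roughly what get_s_time() computes on the calling CPU. */
static s_time_t local_view(void)
{
    return stime_local_stamp + tsc_to_ns(rdtsc() - local_tsc_stamp);
}
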
Jan Beulich April 7, 2016, 3:58 p.m. UTC | #3
>>> On 05.04.16 at 23:34, <joao.m.martins@oracle.com> wrote:
> On 04/05/2016 01:22 PM, Jan Beulich wrote:
>>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>> But
>> I'm opposed to this: For one, the variable being static here
>> means there is nothing that actually suppresses CPU hotplug
>> to happen.
>> And then I think this can, for all practical purposes,
>> be had by suitably using existing command line options, namely
>> "max_cpus=", such that set_nr_cpu_ids() won't allow for any
>> further CPUs to get added. Albeit I admit that if someone was
>> to bring down some CPU and then hotplug another one, we
>> might still be in trouble. So maybe the better approach would
>> be to fail onlining of CPUs that don't meet the criteria when
>> "clocksource=tsc"?
> True - max_cpus would produce the same effect. But I should point out
> that even when clocksource=tsc the rendezvous would be std_rendezvous. So 
> the
> reference TSC is CPU 0 and tsc_timestamps are of the individual
> CPUs. So perhaps the criteria would be for clocksource=tsc and 
> use_tsc_stable_bit.

Oh, of course I didn't mean this to be the precise condition, just
an outline. Considering use_tsc_stable_bit certainly makes sense.

>>> @@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
>>>          .semaphore = ATOMIC_INIT(0)
>>>      };
>>>  
>>> +    if ( use_tsc_stable_bit )
>>> +    {
>>> +        local_irq_disable();
>>> +        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
>>> +        local_irq_enable();
>>> +    }
>> 
>> So this can't be in time_calibration_nop_rendezvous() because
>> you want to avoid the actual rendezvousing. But isn't the then
>> possibly much larger gap between read_platform_stime() (which
>> parallels the rdtsc()-s in the other two cases) and get_s_time()
>> invocation going to become a problem?
> Perhaps I am not not seeing the potential problem of this.

I'm not sure there's a problem; I'm just asking because I've noticed
this behavioral difference.

> The main
> difference I see between both would be the base system time: 
> read_platform_stime
> uses stime_platform_stamp as base, and computes a difference from the
> read_counter (i.e. rdtsc() ) with previously saved platform-wide stamp
> (platform_timer_stamp). get_s_time uses the stime_local_stamp (updated from
> stime_master_stamp on local_time_calibration) as base plus delta from 
> rdtsc()
> with local_tsc_stamp. And since this is now all TSC, and TSC monotonically
> increase and is synchronized across CPUs, both calls would end up returning 
> the
> same or a always up-to-date value, whether cpu_time have a larger gap or not
> from stime_platform_stamp. Unless the concern you are raising comes from the
> fact CPU 0 calibrates much sooner than the last calibrated CPU, as opposed 
> to
> roughly at the same time with std_rendezvous?

In a way, yes. I'm concerned by the two time stamps no longer
being obtained at (almost) the same time. If that's not having
any bad consequences, all the better.

Jan
Joao Martins April 7, 2016, 9:17 p.m. UTC | #4
>> The main
>> difference I see between both would be the base system time: 
>> read_platform_stime
>> uses stime_platform_stamp as base, and computes a difference from the
>> read_counter (i.e. rdtsc() ) with previously saved platform-wide stamp
>> (platform_timer_stamp). get_s_time uses the stime_local_stamp (updated from
>> stime_master_stamp on local_time_calibration) as base plus delta from 
>> rdtsc()
>> with local_tsc_stamp. And since this is now all TSC, and TSC monotonically
>> increase and is synchronized across CPUs, both calls would end up returning 
>> the
>> same or a always up-to-date value, whether cpu_time have a larger gap or not
>> from stime_platform_stamp. Unless the concern you are raising comes from the
>> fact CPU 0 calibrates much sooner than the last calibrated CPU, as opposed 
>> to
>> roughly at the same time with std_rendezvous?
> 
> In a way, yes. I'm concerned by the two time stamps no longer
> being obtained at (almost) the same time. If that's not having
> any bad consequences, the better.

I don't think there would be bad consequences, as both timestamps correspond to
the same time reference - thus always returning the latest system time
irrespective of the gap between the two stamps.

If you prefer, I can go back to my initial approach (v1, with std_rendezvous)
to have both timestamps closely updated, and later (post-release?) revisit the
introduction of nop_rendezvous. Perhaps that way is more reasonable?

Joao
Jan Beulich April 7, 2016, 9:32 p.m. UTC | #5
>>> On 07.04.16 at 23:17, <joao.m.martins@oracle.com> wrote:
>> > The main
>>> difference I see between both would be the base system time: 
>>> read_platform_stime
>>> uses stime_platform_stamp as base, and computes a difference from the
>>> read_counter (i.e. rdtsc() ) with previously saved platform-wide stamp
>>> (platform_timer_stamp). get_s_time uses the stime_local_stamp (updated from
>>> stime_master_stamp on local_time_calibration) as base plus delta from 
>>> rdtsc()
>>> with local_tsc_stamp. And since this is now all TSC, and TSC monotonically
>>> increase and is synchronized across CPUs, both calls would end up returning 
>>> the
>>> same or a always up-to-date value, whether cpu_time have a larger gap or not
>>> from stime_platform_stamp. Unless the concern you are raising comes from the
>>> fact CPU 0 calibrates much sooner than the last calibrated CPU, as opposed 
>>> to
>>> roughly at the same time with std_rendezvous?
>> 
>> In a way, yes. I'm concerned by the two time stamps no longer
>> being obtained at (almost) the same time. If that's not having
>> any bad consequences, the better.
> 
> I don't think there would be bad consequences as both timestamps correspond 
> to the same time reference - thus returning always the latest system time
> irrespective of the gap between both stamps.
> 
> If you prefer I can go back with my initial approach (v1, with std_rendezvous)
> to have both timestamps closely updated. And later (post-release?) revisit 
> the introduction of nop_rendezvous. Perhaps this way is more reasonable?

Since the new mode needs to be actively asked for, I don't think
that's necessary.

Jan

Patch

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 123aa42..1dcd4af 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -43,6 +43,10 @@ 
 static char __initdata opt_clocksource[10];
 string_param("clocksource", opt_clocksource);
 
+/* opt_nocpuhotplug: Set if CPU hotplug isn't meant to be used */
+static bool_t __initdata opt_nocpuhotplug;
+boolean_param("nocpuhotplug", opt_nocpuhotplug);
+
 unsigned long __read_mostly cpu_khz;  /* CPU clock frequency in kHz. */
 DEFINE_SPINLOCK(rtc_lock);
 unsigned long pit0_ticks;
@@ -435,6 +439,7 @@  uint64_t ns_to_acpi_pm_tick(uint64_t ns)
  * PLATFORM TIMER 4: TSC
  */
 static bool_t clocksource_is_tsc;
+static bool_t use_tsc_stable_bit;
 static u64 tsc_freq;
 static unsigned long tsc_max_warp;
 static void tsc_check_reliability(void);
@@ -468,6 +473,11 @@  static int __init init_tsctimer(struct platform_timesource *pts)
 
     pts->frequency = tsc_freq;
     clocksource_is_tsc = tsc_reliable;
+    use_tsc_stable_bit = clocksource_is_tsc &&
+        ((nr_cpu_ids == num_present_cpus()) || opt_nocpuhotplug);
+
+    if ( clocksource_is_tsc && !use_tsc_stable_bit )
+        printk(XENLOG_INFO "TSC: CPU Hotplug intended, not setting stable bit\n");
 
     return tsc_reliable;
 }
@@ -950,6 +960,8 @@  static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stime_local_stamp;
+    if ( use_tsc_stable_bit )
+        _u.flags    |= PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -1431,6 +1443,22 @@  static void time_calibration_std_rendezvous(void *_r)
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
 }
 
+/*
+ * Rendezvous function used when clocksource is TSC and
+ * no CPU hotplug will be performed.
+ */
+static void time_calibration_nop_rendezvous(void *_r)
+{
+    struct cpu_calibration *c = &this_cpu(cpu_calibration);
+    struct calibration_rendezvous *r = _r;
+
+    c->local_tsc_stamp = r->master_tsc_stamp;
+    c->stime_local_stamp = get_s_time();
+    c->stime_master_stamp = r->master_stime;
+
+    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
+}
+
 static void (*time_calibration_rendezvous_fn)(void *) =
     time_calibration_std_rendezvous;
 
@@ -1440,6 +1468,13 @@  static void time_calibration(void *unused)
         .semaphore = ATOMIC_INIT(0)
     };
 
+    if ( use_tsc_stable_bit )
+    {
+        local_irq_disable();
+        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
+        local_irq_enable();
+    }
+
     cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
 
     /* @wait=1 because we must wait for all cpus before freeing @r. */
@@ -1555,6 +1590,14 @@  static int __init verify_tsc_reliability(void)
 
             init_percpu_time();
 
+            /*
+             * We won't do CPU Hotplug and TSC clocksource is being used which
+	     * means we have a reliable TSC, plus we don't sync with any other
+	     * clocksource so no need for rendezvous.
+             */
+            if ( use_tsc_stable_bit )
+                time_calibration_rendezvous_fn = time_calibration_nop_rendezvous;
+
             init_timer(&calibration_timer, time_calibration, NULL, 0);
             set_timer(&calibration_timer, NOW() + EPOCH);
         }