Message ID: 1472042632-12883-6-git-send-email-joao.m.martins@oracle.com (mailing list archive)
State: New, archived
>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> This patch proposes relying on host TSC synchronization and
> passthrough to the guest, when running on a TSC-safe platform. On
> time_calibration we retrieve the platform time in ns and the counter
> read by the clocksource that was used to compute system time. We
> introduce a new rendezvous function which doesn't require
> synchronization between master and slave CPUs and just reads the
> calibration_rendezvous struct and writes down the stime and stamp
> to the cpu_calibration struct to be used later on. We can guarantee
> that on a platform with a constant and reliable TSC, the time read on
> vCPU B right after vCPU A is bigger, independently of the vCPU
> calibration error. Since pvclock time infos are monotonic as seen by
> any vCPU, set PVCLOCK_TSC_STABLE_BIT, which then enables usage of the
> VDSO on Linux. IIUC, this is similar to how it's implemented on KVM.

Without any tools side change, how is it guaranteed that a guest
which observed the stable bit won't get migrated to a host not
providing that guarantee?

> Changes since v2:
> - Add XEN_ prefix to pvclock flags.
> - Adapted time_calibration_rendezvous_tail to have the case of setting
>   master tsc/stime and use it for the nop_rendezvous.
> - Removed hotplug CPU option that was added in v1
> - Prevent online of CPUs when clocksource is tsc.

Some of the above listed changes don't seem to belong here, but
rather in one of the earlier patches.

> --- a/xen/arch/x86/platform_hypercall.c
> +++ b/xen/arch/x86/platform_hypercall.c
> @@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
>          if ( ret )
>              break;
>
> -        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
> +        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
> +             host_tsc_is_clocksource() )

I have to admit that I consider the "host" here confusing: What other
TSCs could one think of on x86? Maybe clocksource_is_tsc()?
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>
>  static s64 __init init_tsctimer(struct platform_timesource *pts)
>  {
> +    if ( nr_cpu_ids != num_present_cpus() )
> +    {
> +        printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
> +               "not using TSC as clocksource\n");

Please don't split log messages across lines. Keep the XENLOG_INFO
on one line, and then the whole literal on the next.

Also you/we will need to take some measures to avoid this triggering
on systems where extra MADT entries exist just as dummies (perhaps
to ease firmware implementation, as was the case prior to commit
0875433389 ["hvmloader: limit CPUs exposed to guests"] with our
own ACPI tables we present to HVM guests).

> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
>  };
>
>  static void
> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
> +                                 bool_t master_tsc)

Please use plain "bool" in new code. And then I'm not convinced this
refactoring is better than simply having the new rendezvous function
not call time_calibration_rendezvous_tail().

>  {
>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>
> -    c->local_tsc = rdtsc_ordered();
> -    c->local_stime = get_s_time_fixed(c->local_tsc);
> +    if ( master_tsc )
> +    {
> +        c->local_tsc = r->master_tsc_stamp;

Doesn't this require the TSCs to run in perfect sync (not even off
wrt each other by a single cycle)? Is such even possible on multi
socket systems? I.e. how would multiple chips get into such a
mode in the first place, considering their TSCs can't possibly start
ticking at exactly the same (perhaps even sub-)nanosecond?
> +static void time_calibration_nop_rendezvous(void *rv)
> +{
> +    const struct calibration_rendezvous *r = rv;
> +
> +    time_calibration_rendezvous_tail(r, true);
> +}

I don't see what you need the local variable for, unless you
stopped calling time_calibration_rendezvous_tail() as suggested
above.

Jan
On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> This patch proposes relying on host TSC synchronization and
>> passthrough to the guest, when running on a TSC-safe platform.
>> [...]
>> error. Since pvclock time infos are monotonic as seen by any vCPU, set
>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of the VDSO on Linux.
>> IIUC, this is similar to how it's implemented on KVM.
>
> Without any tools side change, how is it guaranteed that a guest
> which observed the stable bit won't get migrated to a host not
> providing that guarantee?
Do you want to prevent migration in such cases? The worst that can
happen is that the guest might need to fall back to a system call if
this bit is 0, and it would keep doing so while the bit is 0.

>> Changes since v2:
>> [...]
>
> Some of the above listed changes don't seem to belong here, but
> rather in one of the earlier patches.
OK - as mentioned in patch two this was intended. Will merge these
changes into patch 2.
>
>> --- a/xen/arch/x86/platform_hypercall.c
>> +++ b/xen/arch/x86/platform_hypercall.c
>> @@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
>>          if ( ret )
>>              break;
>>
>> -        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
>> +        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
>> +             host_tsc_is_clocksource() )
>
> I have to admit that I consider the "host" here confusing: What other
> TSCs could one think of on x86? Maybe clocksource_is_tsc()?
Hmm, I didn't think of any other TSC; I was merely following the
existing naming style in the header ("host_tsc_is_safe()"). I am also
fine with clocksource_is_tsc().

>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>
>>  static s64 __init init_tsctimer(struct platform_timesource *pts)
>>  {
>> +    if ( nr_cpu_ids != num_present_cpus() )
>> +    {
>> +        printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
>> +               "not using TSC as clocksource\n");
>
> Please don't split log messages across lines. Keep the XENLOG_INFO
> on one line, and then the whole literal on the next.
Fixed.

> Also you/we will need to take some measures to avoid this triggering
> on systems where extra MADT entries exist just as dummies (perhaps
> to ease firmware implementation, as was the case prior to commit
> 0875433389 ["hvmloader: limit CPUs exposed to guests"] with our
> own ACPI tables we present to HVM guests).
OK. I think my laptop might be one of those, but I need to check (at
least I do need to adjust maxcpus= to use clocksource=tsc, but that's
the only such case; on other boxes I don't need to).

>> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
>>  };
>>
>>  static void
>> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
>> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
>> +                                 bool_t master_tsc)
>
> Please use plain "bool" in new code.
OK.
> And then I'm not convinced this
> refactoring is better than simply having the new rendezvous function
> not call time_calibration_rendezvous_tail().
I will move it to the nop_rendezvous.

>>  {
>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>
>> -    c->local_tsc = rdtsc_ordered();
>> -    c->local_stime = get_s_time_fixed(c->local_tsc);
>> +    if ( master_tsc )
>> +    {
>> +        c->local_tsc = r->master_tsc_stamp;
>
> Doesn't this require the TSCs to run in perfect sync (not even off
> wrt each other by a single cycle)? Is such even possible on multi
> socket systems? I.e. how would multiple chips get into such a
> mode in the first place, considering their TSCs can't possibly start
> ticking at exactly the same (perhaps even sub-)nanosecond?
They do need to be in sync across sockets, otherwise this wouldn't
work. Invariant TSC only covers cores in a package; multi-socket is up
to board vendors (no manuals mention this guarantee across sockets).
That's one of the reasons TSC is such a burden :(

Looking at datasheets (for the oldest processor I was testing this
on), [0] mentions this note:

"In order to ensure Timestamp Counter (TSC) synchronization across
sockets in multi-socket systems, the RESET# deassertion edge should
arrive at the same BCLK rising edge at both sockets and should meet
the Tsu and Th requirement of 600ps relative to BCLK, as outlined in
Table 2-26."

[0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf

The BCLK looks to be the global reference clock shared across sockets,
IIUC used by the PLLs in the individual packages (to generate the
signal the TSC is extrapolated from). (Please read this with a grain
of salt, as I may be misreading these datasheets.) But if it was a box
with TSC skewed among sockets, wouldn't we see that at boot time in
the tsc warp test?
Or maybe the TSC sync check isn't fast enough to catch any oddities?
Our docs (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also
seem to mention something along these lines for multi-socket systems.
And the Linux tsc code seems to assume that Intel boxes have
synchronized TSCs across sockets [1], and that the exception cases
should mark tsc=skewed (we also have such a parameter).

[1] arch/x86/kernel/tsc.c#L1094

As reassurance I've been running tests for days (currently almost a
week on a 2-socket system) and I'll keep running them to see if they
catch any issues or time going backwards. I could also run them on the
biggest boxes we have, with 8 sockets. But still, that would represent
only a tiny fraction of what x86 has available these days.

Other than the things above I am not sure how to go about this :(
Should we start adjusting the TSCs if we find disparities, or if skew
is observed in the long run? Or allow only TSCs on vCPUs of the same
package to expose this flag? Hmm, what's your take on this?
Appreciate your feedback.

>> +static void time_calibration_nop_rendezvous(void *rv)
>> +{
>> +    const struct calibration_rendezvous *r = rv;
>> +
>> +    time_calibration_rendezvous_tail(r, true);
>> +}
>
> I don't see what you need the local variable for, unless you
> stopped calling time_calibration_rendezvous_tail() as suggested
> above.
I'll go with the suggestion above.
>>> On 26.08.16 at 17:44, <joao.m.martins@oracle.com> wrote:
> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>> [...]
>> Without any tools side change, how is it guaranteed that a guest
>> which observed the stable bit won't get migrated to a host not
>> providing that guarantee?
> Do you want to prevent migration in such cases? The worst that can
> happen is that the guest might need to fall back to a system call if
> this bit is 0 and would keep doing so if the bit is 0.

Whether migration needs preventing I'm not sure; all I was trying
to indicate is that there seem to be pieces missing wrt migration.
As to the guest falling back to a system call - are guest kernels and
(as far as affected) applications required to cope with the flag
changing from 1 to 0 behind their back?
>>>  {
>>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>>
>>> -    c->local_tsc = rdtsc_ordered();
>>> -    c->local_stime = get_s_time_fixed(c->local_tsc);
>>> +    if ( master_tsc )
>>> +    {
>>> +        c->local_tsc = r->master_tsc_stamp;
>>
>> Doesn't this require the TSCs to run in perfect sync (not even off
>> wrt each other by a single cycle)? Is such even possible on multi
>> socket systems? I.e. how would multiple chips get into such a
>> mode in the first place, considering their TSCs can't possibly start
>> ticking at exactly the same (perhaps even sub-)nanosecond?
> They do need to be in sync across sockets, otherwise this wouldn't work.

"In sync" may mean two things: Ticking at exactly the same rate, or
(more strict) holding the exact same values at all times.

> Invariant TSC only covers cores in a package; multi-socket is up to
> board vendors (no manuals mention this guarantee across sockets).
> That's one of the reasons TSC is such a burden :(
>
> Looking at datasheets (for the oldest processor I was testing this
> on), [0] mentions this note:
>
> "In order to ensure Timestamp Counter (TSC) synchronization across
> sockets in multi-socket systems, the RESET# deassertion edge should
> arrive at the same BCLK rising edge at both sockets and should meet
> the Tsu and Th requirement of 600ps relative to BCLK, as outlined in
> Table 2-26."

Hmm, a dual socket system is certainly still one of the easier ones to
deal with. 600ps means 18cm difference in signaling paths, which on
larger systems (and namely ones composed of mostly independent
nodes) I could easily see getting exceeded. That can certainly be
compensated (e.g. by deasserting RESET# at different times for
different sockets), but I'd then still question the accuracy.
> [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf
>
> The BCLK looks to be the global reference clock shared across sockets,
> IIUC used by the PLLs in the individual packages (to generate the
> signal the TSC is extrapolated from). (Please read this with a grain
> of salt, as I may be misreading these datasheets.) But if it was a box
> with TSC skewed among sockets, wouldn't we see that at boot time in
> the tsc warp test? Or maybe the TSC sync check isn't fast enough to
> catch any oddities?

That's my main fear: The check can't possibly determine whether TSCs
are in perfect sync, it can only check an approximation. Perhaps even
worse than the multi-node consideration here is hyper-threading, as
that makes it fundamentally impossible that all threads within one core
execute the same operation at exactly the same time. Not to speak of
the various odd cache effects which I did observe while doing the
measurements for my series (e.g. the second thread speculating the
TSC reads much farther than the primary ones, presumably because
the primary ones first needed to get the I-cache populated).

> Our docs
> (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to
> mention something along these lines on multi-socket systems. And the
> Linux tsc code seems to assume that Intel boxes have synchronized TSCs
> across sockets [1] and that the exception cases should mark tsc=skewed
> (we also have such a parameter).
>
> [1] arch/x86/kernel/tsc.c#L1094

Referring back to what I've said above: Does this mean exact same
tick rate, or exact same values?

> As reassurance I've been running tests for days (currently almost a
> week on a 2-socket system) and I'll keep running them to see if they
> catch any issues or time going backwards. Could also run on the
> biggest boxes we have with 8 sockets.
> But still it would represent only a tiny fraction of what x86 has
> available these days.

A truly interesting case would be, as mentioned, a box composed of
individual nodes. Not sure whether that 8-socket one you mention
would meet that.

> Other than the things above I am not sure how to go about this :(
> Should we start adjusting the TSCs if we find disparities, or if skew
> is observed in the long run? Or allow only TSCs on vCPUs of the same
> package to expose this flag? Hmm, what's your take on this?
> Appreciate your feedback.

At least as an initial approach requiring affinities to be limited to a
single socket would seem like a good compromise, provided HT
aspects don't have a bad effect (in which case also excluding HT
may be required). I'd also be fine with command line options
allowing to further relax that, but a simple "clocksource=tsc"
should imo result in a setup which from all we can tell will work as
intended.

Jan
On 08/29/2016 11:06 AM, Jan Beulich wrote:
>>>> On 26.08.16 at 17:44, <joao.m.martins@oracle.com> wrote:
>> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>> [...]
>>>> Since pvclock time infos are monotonic as seen by any vCPU, set
>>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of the VDSO on
>>>> Linux. IIUC, this is similar to how it's implemented on KVM.
>>>
>>> Without any tools side change, how is it guaranteed that a guest
>>> which observed the stable bit won't get migrated to a host not
>>> providing that guarantee?
>> Do you want to prevent migration in such cases? The worst that can
>> happen is that the guest might need to fall back to a system call if
>> this bit is 0 and would keep doing so if the bit is 0.
>
> Whether migration needs preventing I'm not sure; all I was trying
> to indicate is that there seem to be pieces missing wrt migration.
> As to the guest falling back to a system call - are guest kernels and
> (as far as affected) applications required to cope with the flag
> changing from 1 to 0 behind their back?
It's expected that they cope with this bit changing, AFAIK. The vdso
code (i.e. applications) always checks this bit on every read to
decide whether to fall back to a system call.
And the same goes for the pvclock code in the guest kernel on every
read, in both Linux/FreeBSD, to see whether to skip the monotonicity
checks.

>>>>  {
>>>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>>>
>>>> -    c->local_tsc = rdtsc_ordered();
>>>> -    c->local_stime = get_s_time_fixed(c->local_tsc);
>>>> +    if ( master_tsc )
>>>> +    {
>>>> +        c->local_tsc = r->master_tsc_stamp;
>>>
>>> Doesn't this require the TSCs to run in perfect sync (not even off
>>> wrt each other by a single cycle)? Is such even possible on multi
>>> socket systems? I.e. how would multiple chips get into such a
>>> mode in the first place, considering their TSCs can't possibly start
>>> ticking at exactly the same (perhaps even sub-)nanosecond?
>> They do need to be in sync across sockets, otherwise this wouldn't work.
>
> "In sync" may mean two things: Ticking at exactly the same rate, or
> (more strict) holding the exact same values at all times.
I meant the more strict one.

>> Invariant TSC only covers cores in a package; multi-socket is up to
>> board vendors (no manuals mention this guarantee across sockets).
>> That's one of the reasons TSC is such a burden :(
>>
>> Looking at datasheets (for the oldest processor I was testing this
>> on), [0] mentions this note:
>>
>> "In order to ensure Timestamp Counter (TSC) synchronization across
>> sockets in multi-socket systems, the RESET# deassertion edge should
>> arrive at the same BCLK rising edge at both sockets and should meet
>> the Tsu and Th requirement of 600ps relative to BCLK, as outlined in
>> Table 2-26."
>
> Hmm, a dual socket system is certainly still one of the easier ones to
> deal with. 600ps means 18cm difference in signaling paths, which on
> larger systems (and namely ones composed of mostly independent
> nodes) I could easily see getting exceeded. That can certainly be
> compensated (e.g. by deasserting RESET# at different times for
> different sockets), but I'd then still question the accuracy.
Interesting, good point. FWIW the Linux code doesn't deem multi-node
systems TSC invariant/reliable.

>> [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
>> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf
>>
>> The BCLK looks to be the global reference clock shared across sockets,
>> IIUC used by the PLLs in the individual packages (to generate the
>> signal the TSC is extrapolated from). (Please read this with a grain
>> of salt, as I may be misreading these datasheets.) But if it was a box
>> with TSC skewed among sockets, wouldn't we see that at boot time in
>> the tsc warp test? Or maybe the TSC sync check isn't fast enough to
>> catch any oddities?
>
> That's my main fear: The check can't possibly determine whether TSCs
> are in perfect sync, it can only check an approximation.
Indeed, and as we add more CPUs, the TSC reliability check will
significantly slow down, thereby tightening this approximation, unless
there's a significant skew.

> Perhaps even
> worse than the multi-node consideration here is hyper-threading, as
> that makes it fundamentally impossible that all threads within one core
> execute the same operation at exactly the same time. Not to speak of
> the various odd cache effects which I did observe while doing the
> measurements for my series (e.g. the second thread speculating the
> TSC reads much farther than the primary ones, presumably because
> the primary ones first needed to get the I-cache populated).
Hmmm, I am not sure how we could cope with TSC HT issues. In this
patch we propagate TSC reads from the platform timer on CPU 0 to the
other CPUs, so this would probably be non-visible, as there aren't TSC
reads being done on multiple threads at approximately the same time?

>> Our docs
>> (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to
>> mention something along these lines on multi-socket systems.
>> And the Linux tsc code seems to assume that Intel boxes have
>> synchronized TSCs across sockets [1] and that the exception cases
>> should mark tsc=skewed (we also have such a parameter).
>>
>> [1] arch/x86/kernel/tsc.c#L1094
>
> Referring back to what I've said above: Does this mean exact same
> tick rate, or exact same values?
Here I also meant the invariant condition, i.e. exact same values.

>> As reassurance I've been running tests for days (currently almost a
>> week on a 2-socket system) and I'll keep running them to see if they
>> catch any issues or time going backwards. Could also run on the
>> biggest boxes we have with 8 sockets. But still it would represent
>> only a tiny fraction of what x86 has available these days.
>
> A truly interesting case would be, as mentioned, a box composed of
> individual nodes. Not sure whether that 8-socket one you mention
> would meet that.
It's not a multi-node machine - but within single-node machines it's
potentially the worst case scenario.

>> Other than the things above I am not sure how to go about this :(
>> Should we start adjusting the TSCs if we find disparities, or if skew
>> is observed in the long run? Or allow only TSCs on vCPUs of the same
>> package to expose this flag? Hmm, what's your take on this?
>> Appreciate your feedback.
>
> At least as an initial approach requiring affinities to be limited to a
> single socket would seem like a good compromise, provided HT
> aspects don't have a bad effect (in which case also excluding HT
> may be required). I'd also be fine with command line options
> allowing to further relax that, but a simple "clocksource=tsc"
> should imo result in a setup which from all we can tell will work as
> intended.
Sounds reasonable: unless command line options are specified, we
disallow TSC as clocksource on multi-socket systems. WRT command line
options, how about extending the "tsc" parameter to accept another
possible value such as "global" or "socketsafe"?
Current values are "unstable" and "skewed".

Thanks for all the comments so far!

Joao
>>> On 30.08.16 at 14:26, <joao.m.martins@oracle.com> wrote:
> On 08/29/2016 11:06 AM, Jan Beulich wrote:
>> [...]
>> Whether migration needs preventing I'm not sure; all I was trying
>> to indicate is that there seem to be pieces missing wrt migration.
>> As to the guest falling back to a system call - are guest kernels and
>> (as far as affected) applications required to cope with the flag
>> changing from 1 to 0 behind their back?
> It's expected they cope with this bit changing AFAIK. The vdso code (i.e.
> applications) always checks this bit on every read to decide whether
> to fall back to a system call. And the same for the pvclock code in
> the guest kernel on every read, in both Linux/FreeBSD, to see whether
> to skip the monotonicity checks.

Okay, but please make sure this is called out at least in the commit
message, if not in a code comment.

>> Perhaps even
>> worse than the multi-node consideration here is hyper-threading, as
>> that makes it fundamentally impossible that all threads within one core
>> execute the same operation at exactly the same time. Not to speak of
>> the various odd cache effects which I did observe while doing the
>> measurements for my series (e.g. the second thread speculating the
>> TSC reads much farther than the primary ones, presumably because
>> the primary ones first needed to get the I-cache populated).
> Hmmm, not sure how we could cope with TSC HT issues. In this patch we
> propagate TSC reads from the platform timer on CPU 0 to the other
> CPUs, so this would probably be non-visible, as there aren't TSC reads
> being done on multiple threads at approximately the same time?

Right - much depends on parameters the values of which we don't even
have an idea of. Like how frequently hyperthreads get switched within
a core.

>>> Other than the things above I am not sure how to go about this :(
>>> [...]
>>
>> At least as an initial approach requiring affinities to be limited to a
>> single socket would seem like a good compromise, provided HT
>> aspects don't have a bad effect (in which case also excluding HT
>> may be required). I'd also be fine with command line options
>> allowing to further relax that, but a simple "clocksource=tsc"
>> should imo result in a setup which from all we can tell will work as
>> intended.
> Sounds reasonable, so unless command line options are specified we
> disallow TSC as clocksource on multi-socket systems. WRT command line
> options, how about extending the "tsc" parameter to accept another
> possible value such as "global" or "socketsafe"? Current values are
> "unstable" and "skewed".

What about "stable", "stable:socket" (and then perhaps also
"stable:node")?

Jan
On 08/30/2016 01:45 PM, Jan Beulich wrote:
>>>> On 30.08.16 at 14:26, <joao.m.martins@oracle.com> wrote:
>> On 08/29/2016 11:06 AM, Jan Beulich wrote:
>>> [...]
>>> Whether migration needs preventing I'm not sure; all I was trying
>>> to indicate is that there seem to be pieces missing wrt migration.
>>> As to the guest falling back to a system call - are guest kernels and
>>> (as far as affected) applications required to cope with the flag
>>> changing from 1 to 0 behind their back?
>> It's expected they cope with this bit changing, AFAIK. The vdso code
>> (i.e. applications) always checks this bit on every read to decide
>> whether to fall back to a system call. And the same goes for the pvclock
>> code in the guest kernel on every read, in both Linux/FreeBSD, to see
>> whether or not to skip the monotonicity checks.
>
> Okay, but please make sure this is called out at least in the commit
> message, if not in a code comment.
Got it.

>>>> Other than the things above I am not sure how to go about this :(
>>>> Should we start adjusting the TSCs if we find disparities or skew is
>>>> observed in the long run? Or allow only TSCs on vCPUs of the same
>>>> package to expose this flag? Hmm, what's your take on this?
>>>> Appreciate your feedback.
>>>
>>> At least as an initial approach requiring affinities to be limited to a
>>> single socket would seem like a good compromise, provided HT
>>> aspects don't have a bad effect (in which case also excluding HT
>>> may be required). I'd also be fine with command line options
>>> allowing to further relax that, but a simple "clocksource=tsc"
>>> should imo result in a setup which from all we can tell will work as
>>> intended.
>> Sounds reasonable, so unless command line options are specified we
>> disallow TSC to be clocksource on multi-socket systems. WRT to command
>> line options, how about extending the "tsc" parameter to accept another
>> possible value such as "global" or "socketsafe"? Current values are
>> "unstable" and "skewed".
>
> What about "stable", "stable:socket" (and then perhaps also
> "stable:node")?
Hmm, much nicer. Let me add these two options, alongside the docs update
wrt the tsc param. I'll probably do so in a separate patch in the series.
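The per-read flag check mentioned above works roughly like the sketch below. This is a simplified stand-in, not the real pvclock structures: the struct omits the version/seqlock handling actual consumers perform, and only the mul/shift scaling and the flag check are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the pvclock per-vCPU time info; the real ABI
 * struct has a fixed layout and a version field used as a seqlock. */
#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

struct pvclock_vcpu_time_info {
    uint32_t version;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
};

/* Scale a TSC delta into nanoseconds, pvclock-style: shift, then
 * multiply by a 32.32 fixed-point factor and keep the high part. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if ( shift < 0 )
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Returns 1 and fills *ns when the stable bit allows a lock-free read;
 * returns 0 to signal the caller must fall back to a system call. The
 * bit is re-checked on every read because it can change behind the
 * reader's back, e.g. after a migration. */
static int pvclock_try_read(const struct pvclock_vcpu_time_info *ti,
                            uint64_t tsc_now, uint64_t *ns)
{
    if ( !(ti->flags & PVCLOCK_TSC_STABLE_BIT) )
        return 0;
    *ns = ti->system_time +
          pvclock_scale_delta(tsc_now - ti->tsc_timestamp,
                              ti->tsc_to_system_mul, ti->tsc_shift);
    return 1;
}
```

Since the fallback path is taken whenever the bit reads as 0, clearing it on a destination host that can't provide the guarantee degrades the guest to system calls rather than breaking it.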
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index 780f22d..edef334 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
         if ( ret )
             break;
 
-        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
+        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
+             host_tsc_is_clocksource() )
         {
             ret = -EINVAL;
             break;
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 57c1b47..81db255 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
 
 static s64 __init init_tsctimer(struct platform_timesource *pts)
 {
+    if ( nr_cpu_ids != num_present_cpus() )
+    {
+        printk(XENLOG_INFO "TSC: CPU Hotplug intended, "
+               "not using TSC as clocksource\n");
+        return 0;
+    }
+
     return pts->frequency;
 }
 
@@ -955,6 +962,8 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stamp.local_stime;
+    if ( host_tsc_is_clocksource() )
+        _u.flags |= XEN_PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
 };
 
 static void
-time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
+time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
+                                 bool_t master_tsc)
 {
     struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc = rdtsc_ordered();
-    c->local_stime = get_s_time_fixed(c->local_tsc);
+    if ( master_tsc )
+    {
+        c->local_tsc = r->master_tsc_stamp;
+        c->local_stime = r->master_stime;
+    }
+    else
+    {
+        c->local_tsc = rdtsc_ordered();
+        c->local_stime = get_s_time_fixed(c->local_tsc);
+    }
+
     c->master_stime = r->master_stime;
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
@@ -1386,7 +1405,7 @@ static void time_calibration_tsc_rendezvous(void *_r)
         }
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, false);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
@@ -1411,7 +1430,18 @@ static void time_calibration_std_rendezvous(void *_r)
         smp_rmb(); /* receive signal /then/ read r->master_stime */
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, false);
+}
+
+/*
+ * Rendezvous function used when clocksource is TSC and
+ * no CPU hotplug will be performed.
+ */
+static void time_calibration_nop_rendezvous(void *rv)
+{
+    const struct calibration_rendezvous *r = rv;
+
+    time_calibration_rendezvous_tail(r, true);
 }
 
 static void (*time_calibration_rendezvous_fn)(void *) =
@@ -1423,6 +1453,13 @@ static void time_calibration(void *unused)
         .semaphore = ATOMIC_INIT(0)
     };
 
+    if ( host_tsc_is_clocksource() )
+    {
+        local_irq_disable();
+        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
+        local_irq_enable();
+    }
+
     cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
 
     /* @wait=1 because we must wait for all cpus before freeing @r. */
@@ -1586,6 +1623,13 @@ static int __init verify_tsc_reliability(void)
         printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
                freq_string(plt_src.frequency));
 
+        /*
+         * We won't do CPU Hotplug and TSC clocksource is being used which
+         * means we have a reliable TSC, plus we don't sync with any other
+         * clocksource so no need for rendezvous.
+         */
+        time_calibration_rendezvous_fn = time_calibration_nop_rendezvous;
+
         init_timer(&calibration_timer, time_calibration, NULL, 0);
         set_timer(&calibration_timer, NOW() + EPOCH);
     }
@@ -1885,6 +1929,11 @@ void pv_soft_rdtsc(struct vcpu *v, struct cpu_user_regs *regs, int rdtscp)
                (d->arch.tsc_mode == TSC_MODE_PVRDTSCP) ? d->arch.incarnation : 0;
 }
 
+bool_t host_tsc_is_clocksource(void)
+{
+    return plt_src.read_counter == read_tsc;
+}
+
 int host_tsc_is_safe(void)
 {
     return boot_cpu_has(X86_FEATURE_TSC_RELIABLE);
diff --git a/xen/include/asm-x86/time.h b/xen/include/asm-x86/time.h
index 971883a..bc3debc 100644
--- a/xen/include/asm-x86/time.h
+++ b/xen/include/asm-x86/time.h
@@ -69,6 +69,7 @@ void tsc_get_info(struct domain *d, uint32_t *tsc_mode, uint64_t *elapsed_nsec,
 
 void force_update_vcpu_system_time(struct vcpu *v);
 
+bool_t host_tsc_is_clocksource(void);
 int host_tsc_is_safe(void);
 void cpuid_time_leaf(uint32_t sub_idx, uint32_t *eax, uint32_t *ebx,
                      uint32_t *ecx, uint32_t *edx);
This patch proposes relying on host TSC synchronization and passthrough
to the guest, when running on a TSC-safe platform. On time_calibration
we retrieve the platform time in ns and the counter read by the
clocksource that was used to compute system time. We introduce a new
rendezvous function which doesn't require synchronization between master
and slave CPUs and just reads the calibration_rendezvous struct and
writes down the stime and stamp to the cpu_calibration struct to be used
later on. We can guarantee that on a platform with a constant and
reliable TSC, the time read on vcpu B right after A is greater,
independently of the vCPU calibration error. Since pvclock time infos
are monotonic as seen by any vCPU, we set PVCLOCK_TSC_STABLE_BIT, which
then enables usage of the VDSO on Linux. IIUC, this is similar to how
it's implemented on KVM.

This also changes clocksource=tsc initialization to be used only when
CPU hotplug isn't meant to be performed on the host, which will be the
case when nr_cpu_ids and num_present_cpus() are the same. This is
because a newly hotplugged CPU may not satisfy the condition of having
all TSCs synchronized - so when the tsc clocksource is being used we
allow offlining CPUs but not onlining any back.

I should note that I've yet to see time going backwards in a long
running test in the past few days (on a dual socket machine), plus a few
other tests I did on older platforms.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v2:
 - Add XEN_ prefix to pvclock flags.
 - Adapt time_calibration_rendezvous_tail to have the case of setting
   master tsc/stime and use it for the nop_rendezvous.
 - Removed hotplug CPU option that was added in v1
 - Prevent online of CPUs when clocksource is tsc.
 - Remove use_tsc_stable_bit, since clocksource is only used to seed
   values. So instead we test if hotplug is possible, and prevent
   clocksource=tsc from being used.
 - Remove 1st paragraph of commit message since the behaviour described
   no longer applies since b64438c.

Changes since v1:
 - Change approach to skip std_rendezvous by introducing a nop_rendezvous
 - Change commit message reflecting the change above.
 - Use TSC_STABLE_BIT only if cpu hotplug isn't possible.
 - Add command line option to override it if no cpu hotplug is intended.
---
 xen/arch/x86/platform_hypercall.c |  3 +-
 xen/arch/x86/time.c               | 59 +++++++++++++++++++++++++++++++++++----
 xen/include/asm-x86/time.h        |  1 +
 3 files changed, 57 insertions(+), 6 deletions(-)
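The monotonicity claim in the commit message can be illustrated with a toy model (not Xen code; TSC-to-ns scaling is simplified to 1 tick = 1 ns): with the nop rendezvous every vCPU records the same (master_tsc_stamp, master_stime) pair, so guest time derived from any later TSC read is ordered exactly as the raw TSC reads themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the calibration stamp shared by all vCPUs. */
struct stamp {
    uint64_t tsc;    /* plays the role of master_tsc_stamp */
    uint64_t stime;  /* plays the role of master_stime */
};

/* With a shared stamp, derived time is a monotone function of the raw
 * TSC value alone - whichever vCPU does the read. (Real pvclock scales
 * the delta by a mul/shift pair; 1:1 here for clarity.) */
static uint64_t guest_time(const struct stamp *s, uint64_t tsc_now)
{
    return s->stime + (tsc_now - s->tsc);
}
```

So if vcpu B's TSC read happens after vcpu A's on a host with synchronized TSCs, guest_time for B cannot be smaller than for A, which is what makes setting the stable bit safe without per-CPU calibration in the rendezvous.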