[5/5] x86/time: implement PVCLOCK_TSC_STABLE_BIT

Message ID	1458231136-13457-6-git-send-email-joao.m.martins@oracle.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> MIME-Version: 1.0 Message-ID: <1458231136-13457-6-git-send-email-joao.m.martins@oracle.com> Date: Thu, 17 Mar 2016 09:12:16 -0700 (PDT) From: Joao Martins <joao.m.martins@oracle.com> To: xen-devel@lists.xen.org References: <1458231136-13457-1-git-send-email-joao.m.martins@oracle.com> In-Reply-To: <1458231136-13457-1-git-send-email-joao.m.martins@oracle.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Joao Martins <joao.m.martins@oracle.com>, Keir Fraser <keir@xen.org>, Jan Beulich <jbeulich@suse.com> Subject: [Xen-devel] [PATCH 5/5] x86/time: implement PVCLOCK_TSC_STABLE_BIT Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Message ID

1458231136-13457-6-git-send-email-joao.m.martins@oracle.com (mailing list archive)

State

New, archived

Headers

MIME-Version: 1.0
Message-ID: <1458231136-13457-6-git-send-email-joao.m.martins@oracle.com>
Date: Thu, 17 Mar 2016 09:12:16 -0700 (PDT)
From: Joao Martins <joao.m.martins@oracle.com>
To: xen-devel@lists.xen.org
References: <1458231136-13457-1-git-send-email-joao.m.martins@oracle.com>
In-Reply-To: <1458231136-13457-1-git-send-email-joao.m.martins@oracle.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
	Joao Martins <joao.m.martins@oracle.com>, Keir Fraser <keir@xen.org>, 
	Jan Beulich <jbeulich@suse.com>
Subject: [Xen-devel] [PATCH 5/5] x86/time: implement PVCLOCK_TSC_STABLE_BIT
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Commit Message

Joao Martins March 17, 2016, 4:12 p.m. UTC

When using TSC as clocksource we will solely rely on TSC for updating
vcpu time infos (pvti). Right now, each vCPU takes the tsc_timestamp at
different instants meaning every EPOCH + delta. This delta is variable
depending on the time the CPU calibrates with CPU 0 (master), and will
likely be different and variable across vCPUS. This means that each VCPU
pvti won't account to its calibration error which could lead to time
going backwards, and allowing a situation where time read on VCPU B
immediately after A being smaller. While this doesn't happen a lot, I
was able to observe (for clocksource=tsc) around 50 times in an hour
having warps of < 100 ns.

This patch proposes relying on host TSC synchronization and passthrough
of the master tsc to the guest, when running on a TSC-safe platform.  On
the rendezvous function we will retrieve the platform time in ns and the
last count read by the clocksource that was used to compute system time.
master will write both master_tsc_stamp and master_stime, and the other
vCPUS (slave) will use it to update their correspondent time infos.
This way we can guarantee that on a platform with a constant and
reliable TSC, that the time read on vcpu B right after A is bigger
independently of the VCPU calibration error. Since pvclock time infos
are monotonic as seen by any vCPU set PVCLOCK_TSC_STABLE_BIT, which then
enables usage of VDSO on Linux.  IIUC, this is similar to how it's
implemented on KVM.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/time.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Andrew Cooper March 18, 2016, 8:58 p.m. UTC | #1

On 17/03/16 16:12, Joao Martins wrote:
> When using TSC as clocksource we will solely rely on TSC for updating
> vcpu time infos (pvti). Right now, each vCPU takes the tsc_timestamp at
> different instants meaning every EPOCH + delta. This delta is variable
> depending on the time the CPU calibrates with CPU 0 (master), and will
> likely be different and variable across vCPUS. This means that each VCPU
> pvti won't account to its calibration error which could lead to time
> going backwards, and allowing a situation where time read on VCPU B
> immediately after A being smaller. While this doesn't happen a lot, I
> was able to observe (for clocksource=tsc) around 50 times in an hour
> having warps of < 100 ns.
>
> This patch proposes relying on host TSC synchronization and passthrough
> of the master tsc to the guest, when running on a TSC-safe platform.  On
> the rendezvous function we will retrieve the platform time in ns and the
> last count read by the clocksource that was used to compute system time.
> master will write both master_tsc_stamp and master_stime, and the other
> vCPUS (slave) will use it to update their correspondent time infos.
> This way we can guarantee that on a platform with a constant and
> reliable TSC, that the time read on vcpu B right after A is bigger
> independently of the VCPU calibration error. Since pvclock time infos
> are monotonic as seen by any vCPU set PVCLOCK_TSC_STABLE_BIT, which then
> enables usage of VDSO on Linux.  IIUC, this is similar to how it's
> implemented on KVM.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Cc: Keir Fraser <keir@xen.org>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/time.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index 89c35d0..a17529c 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -917,6 +917,8 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
>  
>      _u.tsc_timestamp = tsc_stamp;
>      _u.system_time   = t->stime_local_stamp;
> +    if ( clocksource_is_tsc )
> +        _u.flags    |= PVCLOCK_TSC_STABLE_BIT;
>  
>      if ( is_hvm_domain(d) )
>          _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
> @@ -1377,9 +1379,12 @@ static void time_calibration_std_rendezvous(void *_r)
>  
>      if ( smp_processor_id() == 0 )
>      {
> +        u64 last_counter;

Blank line here please.

>          while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
>              cpu_relax();
> -        r->master_stime = read_platform_stime();
> +        r->master_stime = read_platform_stime(&last_counter);
> +        if ( clocksource_is_tsc )
> +            r->master_tsc_stamp = last_counter;
>          mb(); /* write r->master_stime /then/ signal */
>          atomic_inc(&r->semaphore);
>      }
> @@ -1391,7 +1396,10 @@ static void time_calibration_std_rendezvous(void *_r)
>          mb(); /* receive signal /then/ read r->master_stime */
>      }
>  
> -    c->local_tsc_stamp = rdtsc();
> +    if ( clocksource_is_tsc )
> +        c->local_tsc_stamp = r->master_tsc_stamp;
> +    else
> +        c->local_tsc_stamp = rdtsc();
>      c->stime_local_stamp = get_s_time();
>      c->stime_master_stamp = r->master_stime;
>  

The point of the rendezvous is to run rdtsc() at a the time on each cpu
at the same time.  With this logic, it seems that you don't need the
rendezvous at all.

Avoiding the time_calibration_std_rendezvous() entirely in this
situation would be the better, surely?

~Andrew

Joao Martins March 21, 2016, 11:50 a.m. UTC | #2

On 03/18/2016 08:58 PM, Andrew Cooper wrote:
> On 17/03/16 16:12, Joao Martins wrote:
>> When using TSC as clocksource we will solely rely on TSC for updating
>> vcpu time infos (pvti). Right now, each vCPU takes the tsc_timestamp at
>> different instants meaning every EPOCH + delta. This delta is variable
>> depending on the time the CPU calibrates with CPU 0 (master), and will
>> likely be different and variable across vCPUS. This means that each VCPU
>> pvti won't account to its calibration error which could lead to time
>> going backwards, and allowing a situation where time read on VCPU B
>> immediately after A being smaller. While this doesn't happen a lot, I
>> was able to observe (for clocksource=tsc) around 50 times in an hour
>> having warps of < 100 ns.
>>
>> This patch proposes relying on host TSC synchronization and passthrough
>> of the master tsc to the guest, when running on a TSC-safe platform.  On
>> the rendezvous function we will retrieve the platform time in ns and the
>> last count read by the clocksource that was used to compute system time.
>> master will write both master_tsc_stamp and master_stime, and the other
>> vCPUS (slave) will use it to update their correspondent time infos.
>> This way we can guarantee that on a platform with a constant and
>> reliable TSC, that the time read on vcpu B right after A is bigger
>> independently of the VCPU calibration error. Since pvclock time infos
>> are monotonic as seen by any vCPU set PVCLOCK_TSC_STABLE_BIT, which then
>> enables usage of VDSO on Linux.  IIUC, this is similar to how it's
>> implemented on KVM.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Cc: Keir Fraser <keir@xen.org>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>  xen/arch/x86/time.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>> index 89c35d0..a17529c 100644
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -917,6 +917,8 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
>>  
>>      _u.tsc_timestamp = tsc_stamp;
>>      _u.system_time   = t->stime_local_stamp;
>> +    if ( clocksource_is_tsc )
>> +        _u.flags    |= PVCLOCK_TSC_STABLE_BIT;
>>  
>>      if ( is_hvm_domain(d) )
>>          _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
>> @@ -1377,9 +1379,12 @@ static void time_calibration_std_rendezvous(void *_r)
>>  
>>      if ( smp_processor_id() == 0 )
>>      {
>> +        u64 last_counter;
> 
> Blank line here please.
> 
>>          while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
>>              cpu_relax();
>> -        r->master_stime = read_platform_stime();
>> +        r->master_stime = read_platform_stime(&last_counter);
>> +        if ( clocksource_is_tsc )
>> +            r->master_tsc_stamp = last_counter;
>>          mb(); /* write r->master_stime /then/ signal */
>>          atomic_inc(&r->semaphore);
>>      }
>> @@ -1391,7 +1396,10 @@ static void time_calibration_std_rendezvous(void *_r)
>>          mb(); /* receive signal /then/ read r->master_stime */
>>      }
>>  
>> -    c->local_tsc_stamp = rdtsc();
>> +    if ( clocksource_is_tsc )
>> +        c->local_tsc_stamp = r->master_tsc_stamp;
>> +    else
>> +        c->local_tsc_stamp = rdtsc();
>>      c->stime_local_stamp = get_s_time();
>>      c->stime_master_stamp = r->master_stime;
>>  
> 
> The point of the rendezvous is to run rdtsc() at a the time on each cpu
> at the same time.  With this logic, it seems that you don't need the
> rendezvous at all.
> 
> Avoiding the time_calibration_std_rendezvous() entirely in this
> situation would be the better, surely?
Indeed, and would look cleaner too. I've changed the approach in this patch for
v2 following your guideline, along with some retesting.

> 
> ~Andrew
>

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 89c35d0..a17529c 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -917,6 +917,8 @@  static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stime_local_stamp;
+    if ( clocksource_is_tsc )
+        _u.flags    |= PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -1377,9 +1379,12 @@  static void time_calibration_std_rendezvous(void *_r)
 
     if ( smp_processor_id() == 0 )
     {
+        u64 last_counter;
         while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
             cpu_relax();
-        r->master_stime = read_platform_stime();
+        r->master_stime = read_platform_stime(&last_counter);
+        if ( clocksource_is_tsc )
+            r->master_tsc_stamp = last_counter;
         mb(); /* write r->master_stime /then/ signal */
         atomic_inc(&r->semaphore);
     }
@@ -1391,7 +1396,10 @@  static void time_calibration_std_rendezvous(void *_r)
         mb(); /* receive signal /then/ read r->master_stime */
     }
 
-    c->local_tsc_stamp = rdtsc();
+    if ( clocksource_is_tsc )
+        c->local_tsc_stamp = r->master_tsc_stamp;
+    else
+        c->local_tsc_stamp = rdtsc();
     c->stime_local_stamp = get_s_time();
     c->stime_master_stamp = r->master_stime;

[5/5] x86/time: implement PVCLOCK_TSC_STABLE_BIT

Commit Message

Comments

Patch