[v2,3/6] x86/time: implement tsc as clocksource

Message ID	1459259051-4943-4-git-send-email-joao.m.martins@oracle.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> From: Joao Martins <joao.m.martins@oracle.com> To: xen-devel@lists.xen.org Date: Tue, 29 Mar 2016 14:44:08 +0100 Message-Id: <1459259051-4943-4-git-send-email-joao.m.martins@oracle.com> In-Reply-To: <1459259051-4943-1-git-send-email-joao.m.martins@oracle.com> References: <1459259051-4943-1-git-send-email-joao.m.martins@oracle.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Joao Martins <joao.m.martins@oracle.com>, Keir Fraser <keir@xen.org>, Jan Beulich <jbeulich@suse.com> Subject: [Xen-devel] [PATCH v2 3/6] x86/time: implement tsc as clocksource Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Joao Martins March 29, 2016, 1:44 p.m. UTC

Introduce support for using TSC as platform time which is the highest
resolution time and most performant to get (~20 nsecs).  Though there
are also several problems associated with its usage, and there isn't a
complete (and architecturally defined) guarantee that all machines
will provide reliable and monotonic TSC across all CPUs, on different
sockets and different P/C states.  I believe Intel to be the only that
can guarantee that. For this reason it's only set when adminstrator
changes "clocksource" boot option to "tsc". Initializing TSC
clocksource requires all CPUs to have the tsc reliability checks
performed. init_xen_time is called before all CPUs are up, and so we
start with HPET at boot time, and switch later to TSC. The switch then
happens on verify_tsc_reliability initcall that is invoked when all
CPUs are up. When attempting to initializing TSC we also check for
time warps and appropriate CPU features i.e.  TSC_RELIABLE,
CONSTANT_TSC and NONSTOP_TSC. And in case none of these conditions are
met, we keep the clocksource that was previously initialized on
init_xen_time.

It is also worth noting that with clocksource=tsc there isn't any
need to synchronise with another clocksource, and I could verify that
great portion the time skew was eliminated and seeing much less time
warps happening. With HPET I used to observe ~500 warps in the period
of 1h of around 27 us, and with TSC down to 50 warps in the same
period having each warp < 100 ns. The warps still exist though but
are only related to cross CPU calibration (being how much it takes to
rendezvous with master), in which a later patch in this series aims to
solve.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v1:
 - s/printk/printk(XENLOG_INFO
 - Remove extra space on inner brackets
 - Add missing space around brackets
 - Defer clocksource TSC initialization when all CPUs are up.

Changes since RFC:
 - Spelling fixes in the commit message.
 - Remove unused clocksource_is_tsc variable and introduce it instead
 on the patch that uses it.
 - Move plt_tsc from second to last in the available clocksources.
---
 xen/arch/x86/time.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 95 insertions(+), 2 deletions(-)

Konrad Rzeszutek Wilk March 29, 2016, 5:39 p.m. UTC | #1

>  static void tsc_check_slave(void *unused)
> @@ -1437,6 +1515,20 @@ static int __init verify_tsc_reliability(void)
>          }
>      }
>  
> +    if ( !strcmp(opt_clocksource, "tsc") )

Pls also update xen-command-line.markdown 

> +    {
> +        if ( try_platform_timer(&plt_tsc) > 0 )
> +        {
> +            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
> +                   freq_string(plt_src.frequency));
> +
> +            init_percpu_time();
> +
> +            init_timer(&calibration_timer, time_calibration, NULL, 0);
> +            set_timer(&calibration_timer, NOW() + EPOCH);
> +        }
> +    }
> +
>      return 0;
>  }
>  __initcall(verify_tsc_reliability);
> @@ -1476,6 +1568,7 @@ void __init early_time_init(void)
>      struct cpu_time *t = &this_cpu(cpu_time);
>      u64 tmp = init_pit_and_calibrate_tsc();
>  
> +    tsc_freq = tmp;
>      set_time_scale(&t->tsc_scale, tmp);
>      t->local_tsc_stamp = boot_tsc_stamp;
>  
> -- 
> 2.1.4
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Joao Martins March 29, 2016, 5:52 p.m. UTC | #2

On 03/29/2016 06:39 PM, Konrad Rzeszutek Wilk wrote:
>>  static void tsc_check_slave(void *unused)
>> @@ -1437,6 +1515,20 @@ static int __init verify_tsc_reliability(void)
>>          }
>>      }
>>  
>> +    if ( !strcmp(opt_clocksource, "tsc") )
> 
> Pls also update xen-command-line.markdown 
> 
OK. I fixed it for the new clocksource, and also added documentation on the new
boot param that is introduced in the last patch.

>> +    {
>> +        if ( try_platform_timer(&plt_tsc) > 0 )
>> +        {
>> +            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
>> +                   freq_string(plt_src.frequency));
>> +
>> +            init_percpu_time();
>> +
>> +            init_timer(&calibration_timer, time_calibration, NULL, 0);
>> +            set_timer(&calibration_timer, NOW() + EPOCH);
>> +        }
>> +    }
>> +
>>      return 0;
>>  }
>>  __initcall(verify_tsc_reliability);
>> @@ -1476,6 +1568,7 @@ void __init early_time_init(void)
>>      struct cpu_time *t = &this_cpu(cpu_time);
>>      u64 tmp = init_pit_and_calibrate_tsc();
>>  
>> +    tsc_freq = tmp;
>>      set_time_scale(&t->tsc_scale, tmp);
>>      t->local_tsc_stamp = boot_tsc_stamp;
>>  
>> -- 
>> 2.1.4
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel

Konrad Rzeszutek Wilk April 1, 2016, 4:43 p.m. UTC | #3

On Tue, Mar 29, 2016 at 02:44:08PM +0100, Joao Martins wrote:
> Introduce support for using TSC as platform time which is the highest
> resolution time and most performant to get (~20 nsecs).  Though there
> are also several problems associated with its usage, and there isn't a
> complete (and architecturally defined) guarantee that all machines
> will provide reliable and monotonic TSC across all CPUs, on different
> sockets and different P/C states.  I believe Intel to be the only that
> can guarantee that. For this reason it's only set when adminstrator
> changes "clocksource" boot option to "tsc". Initializing TSC
> clocksource requires all CPUs to have the tsc reliability checks
> performed. init_xen_time is called before all CPUs are up, and so we
> start with HPET at boot time, and switch later to TSC. The switch then
> happens on verify_tsc_reliability initcall that is invoked when all
> CPUs are up. When attempting to initializing TSC we also check for
> time warps and appropriate CPU features i.e.  TSC_RELIABLE,
> CONSTANT_TSC and NONSTOP_TSC. And in case none of these conditions are
> met, we keep the clocksource that was previously initialized on
> init_xen_time.
> 
> It is also worth noting that with clocksource=tsc there isn't any
> need to synchronise with another clocksource, and I could verify that
> great portion the time skew was eliminated and seeing much less time
> warps happening. With HPET I used to observe ~500 warps in the period
> of 1h of around 27 us, and with TSC down to 50 warps in the same
> period having each warp < 100 ns. The warps still exist though but
> are only related to cross CPU calibration (being how much it takes to
> rendezvous with master), in which a later patch in this series aims to
> solve.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Cc: Keir Fraser <keir@xen.org>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> Changes since v1:
>  - s/printk/printk(XENLOG_INFO
>  - Remove extra space on inner brackets
>  - Add missing space around brackets
>  - Defer clocksource TSC initialization when all CPUs are up.
> 
> Changes since RFC:
>  - Spelling fixes in the commit message.
>  - Remove unused clocksource_is_tsc variable and introduce it instead
>  on the patch that uses it.
>  - Move plt_tsc from second to last in the available clocksources.
> ---
>  xen/arch/x86/time.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index ed4ed24..2602dda 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -432,6 +432,63 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>  }
>  
>  /************************************************************
> + * PLATFORM TIMER 4: TSC
> + */
> +static u64 tsc_freq;
> +static unsigned long tsc_max_warp;
> +static void tsc_check_reliability(void);
> +
> +static int __init init_tsctimer(struct platform_timesource *pts)
> +{
> +    bool_t tsc_reliable = 0;

No need to set it to zero.
> +
> +    tsc_check_reliability();

This has been already called by verify_tsc_reliability which calls this
function. Should we set tsc_max_warp to zero before calling it?

> +
> +    if ( tsc_max_warp > 0 )
> +    {
> +        tsc_reliable = 0;

Ditto. It is by default zero.

> +        printk(XENLOG_INFO "TSC: didn't passed warp test\n");

So the earlier test by verify_tsc_reliability did already this check and
printed this out - and also cleared the X86_FEATURE_TSC_RELIABLE.

So can this check above be removed then?

Or are you thinking to ditch what verify_tsc_reliability does?

> +    }
> +    else if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
> +              (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +               boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) )
> +    {
> +        tsc_reliable = 1;
> +    }
> +    else if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> +    {
> +        tsc_reliable = (max_cstate <= 2);
> +
> +        if ( tsc_reliable )
> +            printk(XENLOG_INFO "TSC: no deep Cstates, deemed reliable\n");
> +        else
> +            printk(XENLOG_INFO "TSC: deep Cstates possible, so not reliable\n");
> +    }
> +
> +    pts->frequency = tsc_freq;
> +    return tsc_reliable;
> +}
> +
> +static u64 read_tsc(void)
> +{
> +    return rdtsc();
> +}
> +
> +static void resume_tsctimer(struct platform_timesource *pts)
> +{
> +}
> +
> +static struct platform_timesource __initdata plt_tsc =
> +{
> +    .id = "tsc",
> +    .name = "TSC",
> +    .read_counter = read_tsc,
> +    .counter_bits = 64,
> +    .init = init_tsctimer,

Could you add a note in here:

/* Not called by init_platform_timer as it is not on the plt_timers array. */

> +    .resume = resume_tsctimer,
> +};
> +
> +/************************************************************
>   * GENERIC PLATFORM TIMER INFRASTRUCTURE
>   */
>  
> @@ -533,6 +590,21 @@ static void resume_platform_timer(void)
>      plt_stamp = plt_src.read_counter();
>  }
>  
> +static void __init reset_platform_timer(void)
> +{
> +    /* Deactivate any timers running */
> +    kill_timer(&plt_overflow_timer);
> +    kill_timer(&calibration_timer);
> +
> +    /* Reset counters and stamps */
> +    spin_lock_irq(&platform_timer_lock);
> +    plt_stamp = 0;
> +    plt_stamp64 = 0;
> +    platform_timer_stamp = 0;
> +    stime_platform_stamp = 0;
> +    spin_unlock_irq(&platform_timer_lock);
> +}
> +
>  static int __init try_platform_timer(struct platform_timesource *pts)
>  {
>      int rc = -1;
> @@ -541,6 +613,10 @@ static int __init try_platform_timer(struct platform_timesource *pts)
>      if ( rc <= 0 )
>          return rc;
>  
> +    /* We have a platform timesource already so reset it */
> +    if ( plt_src.counter_bits != 0 )
> +        reset_platform_timer();
> +
>      plt_mask = (u64)~0ull >> (64 - pts->counter_bits);
>  
>      set_time_scale(&plt_scale, pts->frequency);
> @@ -566,7 +642,9 @@ static void __init init_platform_timer(void)
>      struct platform_timesource *pts = NULL;
>      int i, rc = -1;
>  
> -    if ( opt_clocksource[0] != '\0' )
> +    /* clocksource=tsc is initialized later when all CPUS are up */

Perhaps:
/* clocksource=tsc is initialized via __initcalls (when CPUs are up). */ ?

> +    if ( (opt_clocksource[0] != '\0') &&
> +         (strcmp(opt_clocksource, "tsc") != 0) )
>      {
>          for ( i = 0; i < ARRAY_SIZE(plt_timers); i++ )
>          {
> @@ -1192,7 +1270,7 @@ static void check_tsc_warp(unsigned long tsc_khz, unsigned long *max_warp)
>      }
>  }
>  
> -static unsigned long tsc_max_warp, tsc_check_count;
> +static unsigned long tsc_check_count;
>  static cpumask_t tsc_check_cpumask;
>  
>  static void tsc_check_slave(void *unused)
> @@ -1437,6 +1515,20 @@ static int __init verify_tsc_reliability(void)
>          }
>      }
>  
> +    if ( !strcmp(opt_clocksource, "tsc") )
> +    {
> +        if ( try_platform_timer(&plt_tsc) > 0 )
> +        {
> +            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
> +                   freq_string(plt_src.frequency));
> +
> +            init_percpu_time();
> +
> +            init_timer(&calibration_timer, time_calibration, NULL, 0);
> +            set_timer(&calibration_timer, NOW() + EPOCH);
> +        }
> +    }
> +
>      return 0;
>  }
>  __initcall(verify_tsc_reliability);
> @@ -1476,6 +1568,7 @@ void __init early_time_init(void)
>      struct cpu_time *t = &this_cpu(cpu_time);
>      u64 tmp = init_pit_and_calibrate_tsc();
>  
> +    tsc_freq = tmp;
>      set_time_scale(&t->tsc_scale, tmp);
>      t->local_tsc_stamp = boot_tsc_stamp;
>  
> -- 
> 2.1.4
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Joao Martins April 1, 2016, 6:38 p.m. UTC | #4

[snip]

>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>> index ed4ed24..2602dda 100644
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -432,6 +432,63 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>  }
>>  
>>  /************************************************************
>> + * PLATFORM TIMER 4: TSC
>> + */
>> +static u64 tsc_freq;
>> +static unsigned long tsc_max_warp;
>> +static void tsc_check_reliability(void);
>> +
>> +static int __init init_tsctimer(struct platform_timesource *pts)
>> +{
>> +    bool_t tsc_reliable = 0;
> 
> No need to set it to zero.
OK.

>> +
>> +    tsc_check_reliability();
> 
> This has been already called by verify_tsc_reliability which calls this
> function. Should we set tsc_max_warp to zero before calling it?
Ah, correct. But may be it's not necessary to run the tsc_check_reliability here
at all as what I am doing is ineficient. See my other comment below.

> 
>> +
>> +    if ( tsc_max_warp > 0 )
>> +    {
>> +        tsc_reliable = 0;
> 
> Ditto. It is by default zero.
OK.

> 
>> +        printk(XENLOG_INFO "TSC: didn't passed warp test\n");
> 
> So the earlier test by verify_tsc_reliability did already this check and
> printed this out - and also cleared the X86_FEATURE_TSC_RELIABLE.
> 
> So can this check above be removed then?
> 
> Or are you thinking to ditch what verify_tsc_reliability does?
> 
I had the tsc_check_reliability here because TSC could still be deemed reliable
for max_cstate <= 2 or with CONSTANT_TSC + NONSTOP_TSC. The way I have it, the
most likely scenario (i.e. having TSC_RELIABLE) would run twice. Perhaps a
better way of doing this would be to run the warp test solely on
verify_tsc_reliability() in all possible conditions to be deemed reliable? And
then I could even remove almost the whole init_tsctimer if it was to be called
when no warps are observed.

>> +    }
>> +    else if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
>> +              (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>> +               boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) )
>> +    {
>> +        tsc_reliable = 1;
>> +    }
>> +    else if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>> +    {
>> +        tsc_reliable = (max_cstate <= 2);
>> +
>> +        if ( tsc_reliable )
>> +            printk(XENLOG_INFO "TSC: no deep Cstates, deemed reliable\n");
>> +        else
>> +            printk(XENLOG_INFO "TSC: deep Cstates possible, so not reliable\n");
>> +    }
>> +
>> +    pts->frequency = tsc_freq;
>> +    return tsc_reliable;
>> +}
>> +
>> +static u64 read_tsc(void)
>> +{
>> +    return rdtsc();
>> +}
>> +
>> +static void resume_tsctimer(struct platform_timesource *pts)
>> +{
>> +}
>> +
>> +static struct platform_timesource __initdata plt_tsc =
>> +{
>> +    .id = "tsc",
>> +    .name = "TSC",
>> +    .read_counter = read_tsc,
>> +    .counter_bits = 64,
>> +    .init = init_tsctimer,
> 
> Could you add a note in here:
> 
> /* Not called by init_platform_timer as it is not on the plt_timers array. */
> 
OK, I fixed that.

>> +    .resume = resume_tsctimer,
>> +};
>> +
>> +/************************************************************
>>   * GENERIC PLATFORM TIMER INFRASTRUCTURE
>>   */
>>  
>> @@ -533,6 +590,21 @@ static void resume_platform_timer(void)
>>      plt_stamp = plt_src.read_counter();
>>  }
>>  
>> +static void __init reset_platform_timer(void)
>> +{
>> +    /* Deactivate any timers running */
>> +    kill_timer(&plt_overflow_timer);
>> +    kill_timer(&calibration_timer);
>> +
>> +    /* Reset counters and stamps */
>> +    spin_lock_irq(&platform_timer_lock);
>> +    plt_stamp = 0;
>> +    plt_stamp64 = 0;
>> +    platform_timer_stamp = 0;
>> +    stime_platform_stamp = 0;
>> +    spin_unlock_irq(&platform_timer_lock);
>> +}
>> +
>>  static int __init try_platform_timer(struct platform_timesource *pts)
>>  {
>>      int rc = -1;
>> @@ -541,6 +613,10 @@ static int __init try_platform_timer(struct platform_timesource *pts)
>>      if ( rc <= 0 )
>>          return rc;
>>  
>> +    /* We have a platform timesource already so reset it */
>> +    if ( plt_src.counter_bits != 0 )
>> +        reset_platform_timer();
>> +
>>      plt_mask = (u64)~0ull >> (64 - pts->counter_bits);
>>  
>>      set_time_scale(&plt_scale, pts->frequency);
>> @@ -566,7 +642,9 @@ static void __init init_platform_timer(void)
>>      struct platform_timesource *pts = NULL;
>>      int i, rc = -1;
>>  
>> -    if ( opt_clocksource[0] != '\0' )
>> +    /* clocksource=tsc is initialized later when all CPUS are up */
> 
> Perhaps:
> /* clocksource=tsc is initialized via __initcalls (when CPUs are up). */ ?
> 
It's clearer indeed. Fixed that too.

>> +    if ( (opt_clocksource[0] != '\0') &&
>> +         (strcmp(opt_clocksource, "tsc") != 0) )
>>      {
>>          for ( i = 0; i < ARRAY_SIZE(plt_timers); i++ )
>>          {
>> @@ -1192,7 +1270,7 @@ static void check_tsc_warp(unsigned long tsc_khz, unsigned long *max_warp)
>>      }
>>  }
>>  
>> -static unsigned long tsc_max_warp, tsc_check_count;
>> +static unsigned long tsc_check_count;
>>  static cpumask_t tsc_check_cpumask;
>>  
>>  static void tsc_check_slave(void *unused)
>> @@ -1437,6 +1515,20 @@ static int __init verify_tsc_reliability(void)
>>          }
>>      }
>>  
>> +    if ( !strcmp(opt_clocksource, "tsc") )
>> +    {
>> +        if ( try_platform_timer(&plt_tsc) > 0 )
>> +        {
>> +            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
>> +                   freq_string(plt_src.frequency));
>> +
>> +            init_percpu_time();
>> +
>> +            init_timer(&calibration_timer, time_calibration, NULL, 0);
>> +            set_timer(&calibration_timer, NOW() + EPOCH);
>> +        }
>> +    }
>> +
>>      return 0;
>>  }
>>  __initcall(verify_tsc_reliability);
>> @@ -1476,6 +1568,7 @@ void __init early_time_init(void)
>>      struct cpu_time *t = &this_cpu(cpu_time);
>>      u64 tmp = init_pit_and_calibrate_tsc();
>>  
>> +    tsc_freq = tmp;
>>      set_time_scale(&t->tsc_scale, tmp);
>>      t->local_tsc_stamp = boot_tsc_stamp;
>>  
>> -- 
>> 2.1.4

Konrad Rzeszutek Wilk April 1, 2016, 6:45 p.m. UTC | #5

On Fri, Apr 01, 2016 at 07:38:18PM +0100, Joao Martins wrote:
> [snip]
> 
> >> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> >> index ed4ed24..2602dda 100644
> >> --- a/xen/arch/x86/time.c
> >> +++ b/xen/arch/x86/time.c
> >> @@ -432,6 +432,63 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
> >>  }
> >>  
> >>  /************************************************************
> >> + * PLATFORM TIMER 4: TSC
> >> + */
> >> +static u64 tsc_freq;
> >> +static unsigned long tsc_max_warp;
> >> +static void tsc_check_reliability(void);
> >> +
> >> +static int __init init_tsctimer(struct platform_timesource *pts)
> >> +{
> >> +    bool_t tsc_reliable = 0;
> > 
> > No need to set it to zero.
> OK.
> 
> >> +
> >> +    tsc_check_reliability();
> > 
> > This has been already called by verify_tsc_reliability which calls this
> > function. Should we set tsc_max_warp to zero before calling it?
> Ah, correct. But may be it's not necessary to run the tsc_check_reliability here
> at all as what I am doing is ineficient. See my other comment below.
> 
> > 
> >> +
> >> +    if ( tsc_max_warp > 0 )
> >> +    {
> >> +        tsc_reliable = 0;
> > 
> > Ditto. It is by default zero.
> OK.
> 
> > 
> >> +        printk(XENLOG_INFO "TSC: didn't passed warp test\n");
> > 
> > So the earlier test by verify_tsc_reliability did already this check and
> > printed this out - and also cleared the X86_FEATURE_TSC_RELIABLE.
> > 
> > So can this check above be removed then?
> > 
> > Or are you thinking to ditch what verify_tsc_reliability does?
> > 
> I had the tsc_check_reliability here because TSC could still be deemed reliable
> for max_cstate <= 2 or with CONSTANT_TSC + NONSTOP_TSC. The way I have it, the
> most likely scenario (i.e. having TSC_RELIABLE) would run twice. Perhaps a
> better way of doing this would be to run the warp test solely on
> verify_tsc_reliability() in all possible conditions to be deemed reliable? And
> then I could even remove almost the whole init_tsctimer if it was to be called
> when no warps are observed.

So..
> 
> >> +    }
> >> +    else if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
> >> +              (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >> +               boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) )
> >> +    {
> >> +        tsc_reliable = 1;
> >> +    }
> >> +    else if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> >> +    {
> >> +        tsc_reliable = (max_cstate <= 2);
> >> +
> >> +        if ( tsc_reliable )
> >> +            printk(XENLOG_INFO "TSC: no deep Cstates, deemed reliable\n");
> >> +        else
> >> +            printk(XENLOG_INFO "TSC: deep Cstates possible, so not reliable\n");

.. is that always true? As in if this is indeed the case should we clear
X86_FEATURE_CONSTANT_TSC bit? And make check be part of tsc_check_reliability?

Then init_tsctimer() would just need to check for the boot_cpu_has bits being
set.

As in:

 if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
      (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
       boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) )
  {
	pts->frequency = tsc_freq;
	return 1;
   }

   return 0;

And tsc_check_reliability would be in charge of clearing the CPU bits if something
is off.

But maybe that is not good? As in, could we mess up and clear those bits
even if they are suppose to be set?

Joao Martins April 3, 2016, 6:47 p.m. UTC | #6

On 04/01/2016 07:45 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 01, 2016 at 07:38:18PM +0100, Joao Martins wrote:
>> [snip]
>>
>>>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>>>> index ed4ed24..2602dda 100644
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -432,6 +432,63 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>>>  }
>>>>  
>>>>  /************************************************************
>>>> + * PLATFORM TIMER 4: TSC
>>>> + */
>>>> +static u64 tsc_freq;
>>>> +static unsigned long tsc_max_warp;
>>>> +static void tsc_check_reliability(void);
>>>> +
>>>> +static int __init init_tsctimer(struct platform_timesource *pts)
>>>> +{
>>>> +    bool_t tsc_reliable = 0;
>>>
>>> No need to set it to zero.
>> OK.
>>
>>>> +
>>>> +    tsc_check_reliability();
>>>
>>> This has been already called by verify_tsc_reliability which calls this
>>> function. Should we set tsc_max_warp to zero before calling it?
>> Ah, correct. But may be it's not necessary to run the tsc_check_reliability here
>> at all as what I am doing is ineficient. See my other comment below.
>>
>>>
>>>> +
>>>> +    if ( tsc_max_warp > 0 )
>>>> +    {
>>>> +        tsc_reliable = 0;
>>>
>>> Ditto. It is by default zero.
>> OK.
>>
>>>
>>>> +        printk(XENLOG_INFO "TSC: didn't passed warp test\n");
>>>
>>> So the earlier test by verify_tsc_reliability did already this check and
>>> printed this out - and also cleared the X86_FEATURE_TSC_RELIABLE.
>>>
>>> So can this check above be removed then?
>>>
>>> Or are you thinking to ditch what verify_tsc_reliability does?
>>>
>> I had the tsc_check_reliability here because TSC could still be deemed reliable
>> for max_cstate <= 2 or with CONSTANT_TSC + NONSTOP_TSC. The way I have it, the
>> most likely scenario (i.e. having TSC_RELIABLE) would run twice. Perhaps a
>> better way of doing this would be to run the warp test solely on
>> verify_tsc_reliability() in all possible conditions to be deemed reliable? And
>> then I could even remove almost the whole init_tsctimer if it was to be called
>> when no warps are observed.
> 
> So..
>>
>>>> +    }
>>>> +    else if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
>>>> +              (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>>>> +               boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) )
>>>> +    {
>>>> +        tsc_reliable = 1;
>>>> +    }
>>>> +    else if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>>> +    {
>>>> +        tsc_reliable = (max_cstate <= 2);
>>>> +
>>>> +        if ( tsc_reliable )
>>>> +            printk(XENLOG_INFO "TSC: no deep Cstates, deemed reliable\n");
>>>> +        else
>>>> +            printk(XENLOG_INFO "TSC: deep Cstates possible, so not reliable\n");
> 
> .. is that always true?
Might not be on older platforms - but I could be stricter here and require
max_cstates 0 to allow TSC clocksource to be used. CONSTANT_TSC is set based on
CPU model (Manual section 17.14, 2nd paragrah), and (on Intel) AFAICT constant
rate can be interpreted to being across P/T states, thus leaving us with the C
states. It is only architecturally guaranteed that TSC doesn't stop on deep
C-states (Section 36.8.3) if invariant TSC is set (CPUID.80000007:EDX[8]). If
this bit is set, on Intel it guarantees to have synchronized and nonstop TSCs
across all states (as also stated in the Section 17.14.1). NONSTOP_TSC means
that the TSC does not stop in deep C states (C3 or higher), so decreasing the
maximum C states to 2 (or disabling entirely) would provide the equivalent to a
nonstop TSC. Thus deeming it reliable _if_ the warp test passes.

> As in if this is indeed the case should we clear
> X86_FEATURE_CONSTANT_TSC bit? And make check be part of tsc_check_reliability?
I am not sure about clearing CONSTANT_TSC as this is based on CPU model. But I
believe the main issue would be to know whether the TSC stops or on deep C states.

> Then init_tsctimer() would just need to check for the boot_cpu_has bits being
> set.
> 
> As in:
> 
>  if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
>       (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>        boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) )
>   {
> 	pts->frequency = tsc_freq;
> 	return 1;
>    }
>
Yeah something along those lines. My idea previously was to not even check these
bits, and assume init_tsctimer is called knowing that we have a reliable TSC
source as we check all of it beforehand? Something like this:

 static int __init verify_tsc_reliability(void)
 {
    if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
         (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
          (boot_cpu_has(X86_FEATURE_NONSTOP_TSC) || max_cstate <= 2)) )
     {
         if (tsc_warp_warp)
         {
             printk("%s: TSC warp detected, disabling TSC_RELIABLE\n",
                    __func__);
             if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
                  setup_clear_cpu_cap(X86_FEATURE_TSC_RELIABLE);
          }
          else if ( !strcmp(opt_clocksource, "tsc") )
          {
          ...


And then init_tsctimer would just be:

static int __init init_tsctimer(struct platform_timesource *pts)
{
    pts->frequency = tsc_freq;
    return 1;
}

?

>    return 0;
>
> And tsc_check_reliability would be in charge of clearing the CPU bits if something
> is off.
Perhaps you mean verify_tsc_reliability()? That's already
clearing TSC_RELIABLE in the event of warps as you previously mentioned.
tsc_check_reliability is the actual warp test - might not be the best place to
clear it.

> But maybe that is not good? As in, could we mess up and clear those bits
> even if they are suppose to be set?
Hm, I am not sure how we could mess up if we clear them, but these bits look
correctly set to me.

Joao

Jan Beulich April 5, 2016, 10:43 a.m. UTC | #7

>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
> Introduce support for using TSC as platform time which is the highest
> resolution time and most performant to get (~20 nsecs).  Though there
> are also several problems associated with its usage, and there isn't a
> complete (and architecturally defined) guarantee that all machines
> will provide reliable and monotonic TSC across all CPUs, on different
> sockets and different P/C states.  I believe Intel to be the only that
> can guarantee that. For this reason it's only set when adminstrator
> changes "clocksource" boot option to "tsc". Initializing TSC
> clocksource requires all CPUs to have the tsc reliability checks
> performed. init_xen_time is called before all CPUs are up, and so we
> start with HPET at boot time, and switch later to TSC.

Please don't make possibly wrong statements: There's no guarantee
we'd start with HPET - might as well be PMTMR or PIT.

> The switch then
> happens on verify_tsc_reliability initcall that is invoked when all
> CPUs are up. When attempting to initializing TSC we also check for
> time warps and appropriate CPU features i.e.  TSC_RELIABLE,
> CONSTANT_TSC and NONSTOP_TSC. And in case none of these conditions are
> met, we keep the clocksource that was previously initialized on
> init_xen_time.

DYM "And in case any of these conditions is not met, ..."? Or,
looking at the code, something more complex?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -432,6 +432,63 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>  }
>  
>  /************************************************************
> + * PLATFORM TIMER 4: TSC
> + */
> +static u64 tsc_freq;
> +static unsigned long tsc_max_warp;
> +static void tsc_check_reliability(void);
> +
> +static int __init init_tsctimer(struct platform_timesource *pts)
> +{
> +    bool_t tsc_reliable = 0;
> +
> +    tsc_check_reliability();
> +
> +    if ( tsc_max_warp > 0 )
> +    {
> +        tsc_reliable = 0;

Redundant assignment.

> +static void resume_tsctimer(struct platform_timesource *pts)
> +{
> +}

No need for an empty function - other clock sources don't have
such empty stubs either (i.e. the caller checks for NULL).

> @@ -541,6 +613,10 @@ static int __init try_platform_timer(struct platform_timesource *pts)
>      if ( rc <= 0 )
>          return rc;
>  
> +    /* We have a platform timesource already so reset it */
> +    if ( plt_src.counter_bits != 0 )
> +        reset_platform_timer();

What if any of the time functions get used between here and the
setting of the new clock source? For example, what will happen to
log messages when "console_timestamps=..." is in effect?

> @@ -566,7 +642,9 @@ static void __init init_platform_timer(void)
>      struct platform_timesource *pts = NULL;
>      int i, rc = -1;
>  
> -    if ( opt_clocksource[0] != '\0' )
> +    /* clocksource=tsc is initialized later when all CPUS are up */
> +    if ( (opt_clocksource[0] != '\0') &&
> +         (strcmp(opt_clocksource, "tsc") != 0) )

In line with the inverse check done below (using ! instead of == 0)
please omit the != 0 here.

> @@ -1437,6 +1515,20 @@ static int __init verify_tsc_reliability(void)
>          }
>      }
>  
> +    if ( !strcmp(opt_clocksource, "tsc") )
> +    {
> +        if ( try_platform_timer(&plt_tsc) > 0 )
> +        {
> +            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
> +                   freq_string(plt_src.frequency));
> +
> +            init_percpu_time();

So you re-do this for the BP, but not for any of the APs?

Jan

Joao Martins April 5, 2016, 2:56 p.m. UTC | #8

On 04/05/2016 11:43 AM, Jan Beulich wrote:
>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>> Introduce support for using TSC as platform time which is the highest
>> resolution time and most performant to get (~20 nsecs).  Though there
>> are also several problems associated with its usage, and there isn't a
>> complete (and architecturally defined) guarantee that all machines
>> will provide reliable and monotonic TSC across all CPUs, on different
>> sockets and different P/C states.  I believe Intel to be the only that
>> can guarantee that. For this reason it's only set when adminstrator
>> changes "clocksource" boot option to "tsc". Initializing TSC
>> clocksource requires all CPUs to have the tsc reliability checks
>> performed. init_xen_time is called before all CPUs are up, and so we
>> start with HPET at boot time, and switch later to TSC.
> 
> Please don't make possibly wrong statements: There's no guarantee
> we'd start with HPET - might as well be PMTMR or PIT.
My apologies - while the commit message doesn't express this, I meant it to be
"for example we would start with HPET (...)". I gave that example as it's the
one preferable in plt_timers.

> 
>> The switch then
>> happens on verify_tsc_reliability initcall that is invoked when all
>> CPUs are up. When attempting to initializing TSC we also check for
>> time warps and appropriate CPU features i.e.  TSC_RELIABLE,
>> CONSTANT_TSC and NONSTOP_TSC. And in case none of these conditions are
>> met, we keep the clocksource that was previously initialized on
>> init_xen_time.
> 
> DYM "And in case any of these conditions is not met, ..."?
Yes. My use of "none" may be wrong here so just to be sure what I mean is: (...)
If all of the conditions in the previous sentence are not met (...)

> Or,
> looking at the code, something more complex?
> 
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -432,6 +432,63 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>  }
>>  
>>  /************************************************************
>> + * PLATFORM TIMER 4: TSC
>> + */
>> +static u64 tsc_freq;
>> +static unsigned long tsc_max_warp;
>> +static void tsc_check_reliability(void);
>> +
>> +static int __init init_tsctimer(struct platform_timesource *pts)
>> +{
>> +    bool_t tsc_reliable = 0;
>> +
>> +    tsc_check_reliability();
>> +
>> +    if ( tsc_max_warp > 0 )
>> +    {
>> +        tsc_reliable = 0;
> 
> Redundant assignment.
> 
Yes, Konrad had point that out too. Fixed that.

>> +static void resume_tsctimer(struct platform_timesource *pts)
>> +{
>> +}
> 
> No need for an empty function - other clock sources don't have
> such empty stubs either (i.e. the caller checks for NULL).
> 
OK.

>> @@ -541,6 +613,10 @@ static int __init try_platform_timer(struct platform_timesource *pts)
>>      if ( rc <= 0 )
>>          return rc;
>>  
>> +    /* We have a platform timesource already so reset it */
>> +    if ( plt_src.counter_bits != 0 )
>> +        reset_platform_timer();
> 
> What if any of the time functions get used between here and the
> setting of the new clock source? For example, what will happen to
> log messages when "console_timestamps=..." is in effect?
time functions will use NOW() (i.e. get_s_time) which uses cpu_time structs
being updated on local_time_calibration() or cpu frequency changes.
reset_platform_timer() will zero out some of the variables used by the
plt_overflow and read_platform_stime(). The overflow and calibration isn't an
issue as the timers are previously killed so any further calls will use what's
on cpu_time while plt_src is being changed.

The case I could see this being different is if cpu_frequency_change is called
between reset_platform_time() and the next update of stime_platform_stamp. In
this situation the calculation would end up a bit different meaning subsequent
calls of read_platform_stime() would return the equivalent to
scale_delta(plt_src->read_counter(), plt_scale), as opposed to computing with a
previous taken stime_platform_stamp and incrementing a difference with a counter
read. But can this situation actually happen?

> 
>> @@ -566,7 +642,9 @@ static void __init init_platform_timer(void)
>>      struct platform_timesource *pts = NULL;
>>      int i, rc = -1;
>>  
>> -    if ( opt_clocksource[0] != '\0' )
>> +    /* clocksource=tsc is initialized later when all CPUS are up */
>> +    if ( (opt_clocksource[0] != '\0') &&
>> +         (strcmp(opt_clocksource, "tsc") != 0) )
> 
> In line with the inverse check done below (using ! instead of == 0)
> please omit the != 0 here.
> 
OK.

>> @@ -1437,6 +1515,20 @@ static int __init verify_tsc_reliability(void)
>>          }
>>      }
>>  
>> +    if ( !strcmp(opt_clocksource, "tsc") )
>> +    {
>> +        if ( try_platform_timer(&plt_tsc) > 0 )
>> +        {
>> +            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
>> +                   freq_string(plt_src.frequency));
>> +
>> +            init_percpu_time();
> 
> So you re-do this for the BP, but not for any of the APs?
I had overlooked the invocation made in start_secondary(). I also need to redo
this on the APs.

> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

Jan Beulich April 5, 2016, 3:12 p.m. UTC | #9

>>> On 05.04.16 at 16:56, <joao.m.martins@oracle.com> wrote:
> On 04/05/2016 11:43 AM, Jan Beulich wrote:
>>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>>> @@ -541,6 +613,10 @@ static int __init try_platform_timer(struct platform_timesource *pts)
>>>      if ( rc <= 0 )
>>>          return rc;
>>>  
>>> +    /* We have a platform timesource already so reset it */
>>> +    if ( plt_src.counter_bits != 0 )
>>> +        reset_platform_timer();
>> 
>> What if any of the time functions get used between here and the
>> setting of the new clock source? For example, what will happen to
>> log messages when "console_timestamps=..." is in effect?
> time functions will use NOW() (i.e. get_s_time) which uses cpu_time structs
> being updated on local_time_calibration() or cpu frequency changes.
> reset_platform_timer() will zero out some of the variables used by the
> plt_overflow and read_platform_stime(). The overflow and calibration isn't an
> issue as the timers are previously killed so any further calls will use what's
> on cpu_time while plt_src is being changed.
> 
> The case I could see this being different is if cpu_frequency_change is called
> between reset_platform_time() and the next update of stime_platform_stamp. In
> this situation the calculation would end up a bit different meaning subsequent
> calls of read_platform_stime() would return the equivalent to
> scale_delta(plt_src->read_counter(), plt_scale), as opposed to computing with a
> previous taken stime_platform_stamp and incrementing a difference with a counter
> read. But can this situation actually happen?

No matter which CPU freq model is being used (right now, may
change when the Intel P-state driver finally arrives), Dom0 is
required to be executing already, so no issue here.

Since you didn't go into any detail on the specific example of log
time stamps - am I to imply that they're completely unaffected by
this (i.e. there's also no risk of them going backwards for a brief
period of time)?

Jan

Joao Martins April 5, 2016, 5:07 p.m. UTC | #10

On 04/05/2016 04:12 PM, Jan Beulich wrote:
>>>> On 05.04.16 at 16:56, <joao.m.martins@oracle.com> wrote:
>> On 04/05/2016 11:43 AM, Jan Beulich wrote:
>>>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>>>> @@ -541,6 +613,10 @@ static int __init try_platform_timer(struct platform_timesource *pts)
>>>>      if ( rc <= 0 )
>>>>          return rc;
>>>>  
>>>> +    /* We have a platform timesource already so reset it */
>>>> +    if ( plt_src.counter_bits != 0 )
>>>> +        reset_platform_timer();
>>>
>>> What if any of the time functions get used between here and the
>>> setting of the new clock source? For example, what will happen to
>>> log messages when "console_timestamps=..." is in effect?
>> time functions will use NOW() (i.e. get_s_time) which uses cpu_time structs
>> being updated on local_time_calibration() or cpu frequency changes.
>> reset_platform_timer() will zero out some of the variables used by the
>> plt_overflow and read_platform_stime(). The overflow and calibration isn't an
>> issue as the timers are previously killed so any further calls will use what's
>> on cpu_time while plt_src is being changed.
>>
>> The case I could see this being different is if cpu_frequency_change is called
>> between reset_platform_time() and the next update of stime_platform_stamp. In
>> this situation the calculation would end up a bit different meaning subsequent
>> calls of read_platform_stime() would return the equivalent to
>> scale_delta(plt_src->read_counter(), plt_scale), as opposed to computing with a
>> previous taken stime_platform_stamp and incrementing a difference with a counter
>> read. But can this situation actually happen?
> 
> No matter which CPU freq model is being used (right now, may
> change when the Intel P-state driver finally arrives), Dom0 is
> required to be executing already, so no issue here.
Cool.

> Since you didn't go into any detail on the specific example of log
> time stamps - am I to imply that they're completely unaffected by
> this (i.e. there's also no risk of them going backwards for a brief
> period of time)?
log timestamps uses wallclock_time() function which uses NOW() and follows the
first paragraph I explained. But I need to correct that statement as get_s_time
uses a previously seeded value from stime_local_stamp (which is set through
read_platform_time()). Time going backwards could happen only if TSC is behind
the currently-in-use clocksource. During that brief period it uses the values
taken from cpu_time but after I set the new clocksource it would go backwards as
I zeroed the previous stime/stamp snapshots in reset_platform_time. In this case
with I propose I don't have any other timestamps (platform_timer_stamp =
(stime_platform_stamp = 0)) to calculate the difference with, so it would go
backwards if the TSC falls behind. I could prevent this situation from happening
by having an offset which would be used until the new clocksource catches up to
the old one? I am also searching in the manual, if that can situation can happen.

Joao

[v2,3/6] x86/time: implement tsc as clocksource

Commit Message

Comments

Patch