diff mbox series

[v2,2/3] x86/time: adjust time recording in time_calibration_tsc_rendezvous()

Message ID 26b71f94-d1c7-d906-5b2a-4e7994d6f7c0@suse.com (mailing list archive)
State Superseded
Headers show
Series x86/time: calibration rendezvous adjustments

Commit Message

Jan Beulich Feb. 1, 2021, 12:43 p.m. UTC
The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
Therefore the two better get taken as close to one another as possible.
This means two things: First, reading platform time is too early when
done on the first iteration. The closest we can get is on the last
iteration, immediately before telling other CPUs to write their TSCs
(and then also writing CPU0's). While at first glance it may seem
not overly relevant when exactly platform time is read (when assuming
that only stime is ever relevant anywhere, and hence the association
with the precise TSC values is of lower interest), both CPU frequency
changes and the effects of SMT make it unpredictable (between individual
rendezvous instances) how long the loop iterations will take. This will
in turn lead to a higher error than necessary in how close to linear
stime movement we can get.

Second, re-reading the TSC for local recording is increasing the overall
error as well, when we already know a more precise value - the one just
written.
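
To make the dependency concrete, here is a simplified, self-contained sketch
of the kind of extrapolation get_s_time() ultimately performs (the types and
the tick-to-ns helper below are illustrative stand-ins, not Xen's actual
definitions): any gap between when the stime half and the TSC half of the
tuple are captured becomes a constant error in every extrapolated timestamp
until the next calibration.

#include <stdint.h>

/* Illustrative stand-in for the per-CPU calibration stamp. */
struct cpu_time_stamp {
    uint64_t local_tsc;   /* TSC value recorded at calibration */
    uint64_t local_stime; /* system time (ns) recorded at calibration */
};

/* Hypothetical fixed scale: TSC ticks -> nanoseconds (overflow ignored). */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t tsc_khz)
{
    return ticks * 1000000ULL / tsc_khz;
}

/*
 * Extrapolate system time for a given TSC reading from the last
 * calibration stamp.  If local_stime and local_tsc were not sampled at
 * (nearly) the same instant, every result carries that offset.
 */
static uint64_t get_s_time_sketch(const struct cpu_time_stamp *stamp,
                                  uint64_t tsc_now, uint64_t tsc_khz)
{
    return stamp->local_stime +
           ticks_to_ns(tsc_now - stamp->local_tsc, tsc_khz);
}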

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

Comments

Roger Pau Monne Feb. 5, 2021, 4:15 p.m. UTC | #1
On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
> Therefore the two better get taken as close to one another as possible.
> This means two things: First, reading platform time is too early when
> done on the first iteration. The closest we can get is on the last
> iteration, immediately before telling other CPUs to write their TSCs
> (and then also writing CPU0's). While at first glance it may seem
> not overly relevant when exactly platform time is read (when assuming
> that only stime is ever relevant anywhere, and hence the association
> with the precise TSC values is of lower interest), both CPU frequency
> changes and the effects of SMT make it unpredictable (between individual
> rendezvous instances) how long the loop iterations will take. This will
> in turn lead to a higher error than necessary in how close to linear
> stime movement we can get.
> 
> Second, re-reading the TSC for local recording is increasing the overall
> error as well, when we already know a more precise value - the one just
> written.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

I've been thinking this all seems doomed when Xen runs in a virtualized
environment, and should likely be disabled. There's no point in trying
to sync the TSC across multiple vCPUs, as the scheduling delay between
them will likely skew any calculations.

Thanks, Roger.
Jan Beulich Feb. 8, 2021, 10:56 a.m. UTC | #2
On 05.02.2021 17:15, Roger Pau Monné wrote:
> On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
>> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
>> Therefore the two better get taken as close to one another as possible.
>> This means two things: First, reading platform time is too early when
>> done on the first iteration. The closest we can get is on the last
>> iteration, immediately before telling other CPUs to write their TSCs
>> (and then also writing CPU0's). While at first glance it may seem
>> not overly relevant when exactly platform time is read (when assuming
>> that only stime is ever relevant anywhere, and hence the association
>> with the precise TSC values is of lower interest), both CPU frequency
>> changes and the effects of SMT make it unpredictable (between individual
>> rendezvous instances) how long the loop iterations will take. This will
>> in turn lead to a higher error than necessary in how close to linear
>> stime movement we can get.
>>
>> Second, re-reading the TSC for local recording is increasing the overall
>> error as well, when we already know a more precise value - the one just
>> written.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> I've been thinking this all seems doomed when Xen runs in a virtualized
> environment, and should likely be disabled. There's no point in trying
> to sync the TSC across multiple vCPUs, as the scheduling delay between
> them will likely skew any calculations.

We may want to consider forcing the equivalent of
"clocksource=tsc" in that case. OTOH a well-behaved hypervisor
underneath shouldn't lead to us finding a need to clear
TSC_RELIABLE, at which point this logic wouldn't get engaged
in the first place.

Jan
Roger Pau Monne Feb. 8, 2021, 11:05 a.m. UTC | #3
On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
> On 05.02.2021 17:15, Roger Pau Monné wrote:
> > On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
> >> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
> >> Therefore the two better get taken as close to one another as possible.
> >> This means two things: First, reading platform time is too early when
> >> done on the first iteration. The closest we can get is on the last
> >> iteration, immediately before telling other CPUs to write their TSCs
> >> (and then also writing CPU0's). While at first glance it may seem
> >> not overly relevant when exactly platform time is read (when assuming
> >> that only stime is ever relevant anywhere, and hence the association
> >> with the precise TSC values is of lower interest), both CPU frequency
> >> changes and the effects of SMT make it unpredictable (between individual
> >> rendezvous instances) how long the loop iterations will take. This will
> >> in turn lead to a higher error than necessary in how close to linear
> >> stime movement we can get.
> >>
> >> Second, re-reading the TSC for local recording is increasing the overall
> >> error as well, when we already know a more precise value - the one just
> >> written.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > I've been thinking this all seems doomed when Xen runs in a virtualized
> environment, and should likely be disabled. There's no point in trying
> to sync the TSC across multiple vCPUs, as the scheduling delay between
> > them will likely skew any calculations.
> 
> We may want to consider forcing the equivalent of
> "clocksource=tsc" in that case. OTOH a well-behaved hypervisor
> underneath shouldn't lead to us finding a need to clear
> TSC_RELIABLE, at which point this logic wouldn't get engaged
> in the first place.

I got the impression that on a loaded system, guests with a non-trivial
number of vCPUs might have trouble getting them all scheduled
close enough together for the rendezvous not to report a big skew, and thus
disable TSC_RELIABLE?

Thanks, Roger.
Jan Beulich Feb. 8, 2021, 11:50 a.m. UTC | #4
On 08.02.2021 12:05, Roger Pau Monné wrote:
> On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
>> On 05.02.2021 17:15, Roger Pau Monné wrote:
>>> I've been thinking this all seems doomed when Xen runs in a virtualized
>>> environment, and should likely be disabled. There's no point in trying
>>> to sync the TSC across multiple vCPUs, as the scheduling delay between
>>> them will likely skew any calculations.
>>
>> We may want to consider forcing the equivalent of
>> "clocksource=tsc" in that case. OTOH a well-behaved hypervisor
>> underneath shouldn't lead to us finding a need to clear
>> TSC_RELIABLE, at which point this logic wouldn't get engaged
>> in the first place.
> 
> I got the impression that on a loaded system, guests with a non-trivial
> number of vCPUs might have trouble getting them all scheduled
> close enough together for the rendezvous not to report a big skew, and thus
> disable TSC_RELIABLE?

No, check_tsc_warp() / tsc_check_reliability() don't have a
problem there. Every CPU reads the shared "most advanced"
stamp before reading its local one. So it doesn't matter how
large the gaps are here. In fact the possible bad effect is
the other way around here - if the scheduling effects are
too heavy, we may mistakenly consider TSCs reliable when
they aren't.
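
A minimal sketch of the core of that check (the locking helpers and the
rdtsc_ordered() declaration below are placeholders, not the actual code
in xen/arch/x86/time.c):

#include <stdint.h>

/* Placeholders - the real implementation lives in xen/arch/x86/time.c. */
extern uint64_t rdtsc_ordered(void);
extern void sync_lock(void);
extern void sync_unlock(void);

static uint64_t last_tsc;  /* most advanced TSC value seen by any CPU */
static int64_t max_warp;   /* worst backwards step observed */

/* Executed concurrently, in a loop, by every CPU taking part. */
static void check_tsc_warp_sketch(void)
{
    uint64_t prev, now;

    sync_lock();
    prev = last_tsc;          /* value some CPU already observed ...  */
    now = rdtsc_ordered();    /* ... which our own reading must reach */
    last_tsc = now;
    sync_unlock();

    /*
     * Any delay between the two reads only makes 'now' larger, so
     * scheduling gaps cannot flag a warp.  Only a TSC genuinely behind
     * another CPU's can make 'now' fall short of 'prev'.
     */
    if ( prev > now && (int64_t)(prev - now) > max_warp )
        max_warp = prev - now;
}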

A problem of the kind you describe exists in the actual
rendezvous function. And actually any problem of this kind
can, on a smaller scale, already be observed with SMT,
because the individual hyperthreads of a core can't
possibly all run at the same time.

As it occurs to me only now, I think we can improve accuracy
somewhat (in particular on big systems) by making sure
struct calibration_rendezvous's master_tsc_stamp is not
sharing its cache line with semaphore and master_stime. The
latter get written by (at least) the BSP, while
master_tsc_stamp is stable after the 2nd loop iteration.
Hence on the 3rd and 4th iterations we could even prefetch
it to reduce the delay on the last one.
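
A minimal sketch of that layout idea (the field list is approximated, and
__aligned(SMP_CACHE_BYTES) is just one possible way to express the
separation, not a concrete proposal):

/* Sketch only - assumes Xen's cpumask_t/atomic_t/s_time_t and __aligned(). */
struct calibration_rendezvous {
    cpumask_t cpu_calibration_map;
    atomic_t semaphore;        /* written by every CPU each iteration */
    s_time_t master_stime;     /* written by the BSP */
    /*
     * Keep the read-mostly stamp off the cache line that the hot
     * fields above keep dirtying, so the final-iteration reads on the
     * other CPUs stay cheap.
     */
    uint64_t master_tsc_stamp __aligned(SMP_CACHE_BYTES);
};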

Jan
Roger Pau Monne Feb. 8, 2021, 4:39 p.m. UTC | #5
On Mon, Feb 08, 2021 at 12:50:09PM +0100, Jan Beulich wrote:
> On 08.02.2021 12:05, Roger Pau Monné wrote:
> > On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
> >> On 05.02.2021 17:15, Roger Pau Monné wrote:
> >>> I've been thinking this all seems doomed when Xen runs in a virtualized
> >>> environment, and should likely be disabled. There's no point in trying
> >>> to sync the TSC across multiple vCPUs, as the scheduling delay between
> >>> them will likely skew any calculations.
> >>
> >> We may want to consider forcing the equivalent of
> >> "clocksource=tsc" in that case. OTOH a well-behaved hypervisor
> >> underneath shouldn't lead to us finding a need to clear
> >> TSC_RELIABLE, at which point this logic wouldn't get engaged
> >> in the first place.
> > 
> > I got the impression that on a loaded system, guests with a non-trivial
> > number of vCPUs might have trouble getting them all scheduled
> > close enough together for the rendezvous not to report a big skew, and thus
> > disable TSC_RELIABLE?
> 
> No, check_tsc_warp() / tsc_check_reliability() don't have a
> problem there. Every CPU reads the shared "most advanced"
> stamp before reading its local one. So it doesn't matter how
> large the gaps are here. In fact the possible bad effect is
> the other way around here - if the scheduling effects are
> too heavy, we may mistakenly consider TSCs reliable when
> they aren't.
> 
> A problem of the kind you describe exists in the actual
> rendezvous function. And actually any problem of this kind
> can, on a smaller scale, already be observed with SMT,
> because the individual hyperthreads of a core can't
> possibly all run at the same time.

Indeed, I got confused between tsc_check_reliability and the actual
rendezvous function, so IMO it's likely the adjustments done by the
rendezvous are pointless when running virtualized, due to the
inability to schedule all the vCPUs at once to execute the
rendezvous.

> As it occurs to me only now, I think we can improve accuracy
> somewhat (in particular on big systems) by making sure
> struct calibration_rendezvous's master_tsc_stamp is not
> sharing its cache line with semaphore and master_stime. The
> latter get written by (at least) the BSP, while
> master_tsc_stamp is stable after the 2nd loop iteration.
> Hence on the 3rd and 4th iterations we could even prefetch
> it to reduce the delay on the last one.

Seems like a possibility indeed.

Thanks, Roger.
diff mbox series

Patch

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1662,11 +1662,12 @@  struct calibration_rendezvous {
 };
 
 static void
-time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
+time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
+                                 uint64_t tsc)
 {
     struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc    = rdtsc_ordered();
+    c->local_tsc    = tsc;
     c->local_stime  = get_s_time_fixed(c->local_tsc);
     c->master_stime = r->master_stime;
 
@@ -1691,11 +1692,11 @@  static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
                 cpu_relax();
 
-            if ( r->master_stime == 0 )
-            {
-                r->master_stime = read_platform_stime(NULL);
+            if ( r->master_tsc_stamp == 0 )
                 r->master_tsc_stamp = rdtsc_ordered();
-            }
+            else if ( i == 0 )
+                r->master_stime = read_platform_stime(NULL);
+
             atomic_inc(&r->semaphore);
 
             if ( i == 0 )
@@ -1720,7 +1721,7 @@  static void time_calibration_tsc_rendezv
         }
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, r->master_tsc_stamp);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
@@ -1745,7 +1746,7 @@  static void time_calibration_std_rendezv
         smp_rmb(); /* receive signal /then/ read r->master_stime */
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, rdtsc_ordered());
 }
 
 /*