Message ID | 1459259051-4943-7-git-send-email-joao.m.martins@oracle.com (mailing list archive)
---|---
State | New, archived
>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -43,6 +43,10 @@
>  static char __initdata opt_clocksource[10];
>  string_param("clocksource", opt_clocksource);
>
> +/* opt_nocpuhotplug: Set if CPU hotplug isn't meant to be used */
> +static bool_t __initdata opt_nocpuhotplug;
> +boolean_param("nocpuhotplug", opt_nocpuhotplug);

If we were to have such a new option, it would need to be
accompanied by an entry in the command line option doc. But
I'm opposed to this: for one, the variable being static here
means there is nothing that actually prevents CPU hotplug from
happening. And then I think this can, for all practical purposes,
be had by suitably using existing command line options, namely
"max_cpus=", such that set_nr_cpu_ids() won't allow for any
further CPUs to get added. Albeit I admit that if someone was
to bring down some CPU and then hotplug another one, we
might still be in trouble. So maybe the better approach would
be to fail onlining of CPUs that don't meet the criteria when
"clocksource=tsc"?

> @@ -435,6 +439,7 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>   * PLATFORM TIMER 4: TSC
>   */
>  static bool_t clocksource_is_tsc;
> +static bool_t use_tsc_stable_bit;

__read_mostly?

> @@ -468,6 +473,11 @@ static int __init init_tsctimer(struct platform_timesource *pts)
>
>      pts->frequency = tsc_freq;
>      clocksource_is_tsc = tsc_reliable;
> +    use_tsc_stable_bit = clocksource_is_tsc &&
> +        ((nr_cpu_ids == num_present_cpus()) || opt_nocpuhotplug);

See my remark above regarding the reliability of this.

> @@ -1431,6 +1443,22 @@ static void time_calibration_std_rendezvous(void *_r)
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
>  }
>
> +/*
> + * Rendezvous function used when clocksource is TSC and
> + * no CPU hotplug will be performed.
> + */
> +static void time_calibration_nop_rendezvous(void *_r)

Even if done so in existing code - no new local variable or function
parameters starting with an underscore please.

> +{
> +    struct cpu_calibration *c = &this_cpu(cpu_calibration);
> +    struct calibration_rendezvous *r = _r;

const

> +    c->local_tsc_stamp = r->master_tsc_stamp;
> +    c->stime_local_stamp = get_s_time();
> +    c->stime_master_stamp = r->master_stime;
> +
> +    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> +}

Perhaps this whole function should move up ahead of the other
two, so that they both can use this one instead of now duplicating
the same code a 3rd time? Or maybe a new helper function would
be better, to also account for the difference in what
c->local_tsc_stamp gets set from (which could then become a
parameter of that new function).

> @@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
>          .semaphore = ATOMIC_INIT(0)
>      };
>
> +    if ( use_tsc_stable_bit )
> +    {
> +        local_irq_disable();
> +        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
> +        local_irq_enable();
> +    }

So this can't be in time_calibration_nop_rendezvous() because
you want to avoid the actual rendezvousing. But isn't the then
possibly much larger gap between read_platform_stime() (which
parallels the rdtsc()-s in the other two cases) and get_s_time()
invocation going to become a problem?

Jan
On 04/05/2016 01:22 PM, Jan Beulich wrote:
>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -43,6 +43,10 @@
>>  static char __initdata opt_clocksource[10];
>>  string_param("clocksource", opt_clocksource);
>>
>> +/* opt_nocpuhotplug: Set if CPU hotplug isn't meant to be used */
>> +static bool_t __initdata opt_nocpuhotplug;
>> +boolean_param("nocpuhotplug", opt_nocpuhotplug);
>
> If we were to have such a new option, it would need to be
> accompanied by an entry in the command line option doc.

Yes, Konrad pointed that out too - I had it documented already for the
next version. But given your argument below, it might not even be
necessary to add this option.

> But I'm opposed to this: for one, the variable being static here
> means there is nothing that actually prevents CPU hotplug from
> happening. And then I think this can, for all practical purposes,
> be had by suitably using existing command line options, namely
> "max_cpus=", such that set_nr_cpu_ids() won't allow for any
> further CPUs to get added. Albeit I admit that if someone was
> to bring down some CPU and then hotplug another one, we
> might still be in trouble. So maybe the better approach would
> be to fail onlining of CPUs that don't meet the criteria when
> "clocksource=tsc"?

True - max_cpus would produce the same effect. But I should point out
that even when clocksource=tsc, the rendezvous would be
std_rendezvous: the reference TSC is CPU 0's, and the tsc_timestamps
are those of the individual CPUs. So perhaps the criterion should be
clocksource=tsc plus use_tsc_stable_bit.

>> @@ -435,6 +439,7 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>   * PLATFORM TIMER 4: TSC
>>   */
>>  static bool_t clocksource_is_tsc;
>> +static bool_t use_tsc_stable_bit;
>
> __read_mostly?

Yeah - I will add it there.

>> @@ -468,6 +473,11 @@ static int __init init_tsctimer(struct platform_timesource *pts)
>>
>>      pts->frequency = tsc_freq;
>>      clocksource_is_tsc = tsc_reliable;
>> +    use_tsc_stable_bit = clocksource_is_tsc &&
>> +        ((nr_cpu_ids == num_present_cpus()) || opt_nocpuhotplug);
>
> See my remark above regarding the reliability of this.
>
>> @@ -1431,6 +1443,22 @@ static void time_calibration_std_rendezvous(void *_r)
>>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
>>  }
>>
>> +/*
>> + * Rendezvous function used when clocksource is TSC and
>> + * no CPU hotplug will be performed.
>> + */
>> +static void time_calibration_nop_rendezvous(void *_r)
>
> Even if done so in existing code - no new local variable or function
> parameters starting with an underscore please.

OK.

>> +{
>> +    struct cpu_calibration *c = &this_cpu(cpu_calibration);
>> +    struct calibration_rendezvous *r = _r;
>
> const
>
>> +    c->local_tsc_stamp = r->master_tsc_stamp;
>> +    c->stime_local_stamp = get_s_time();
>> +    c->stime_master_stamp = r->master_stime;
>> +
>> +    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
>> +}
>
> Perhaps this whole function should move up ahead of the other
> two, so that they both can use this one instead of now duplicating
> the same code a 3rd time? Or maybe a new helper function would
> be better, to also account for the difference in what
> c->local_tsc_stamp gets set from (which could then become a
> parameter of that new function).

The refactoring you suggest sounds like a good idea indeed, as that
code is shared across all rendezvous functions - I will do so
following this guideline.

>> @@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
>>          .semaphore = ATOMIC_INIT(0)
>>      };
>>
>> +    if ( use_tsc_stable_bit )
>> +    {
>> +        local_irq_disable();
>> +        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
>> +        local_irq_enable();
>> +    }
>
> So this can't be in time_calibration_nop_rendezvous() because
> you want to avoid the actual rendezvousing. But isn't the then
> possibly much larger gap between read_platform_stime() (which
> parallels the rdtsc()-s in the other two cases) and get_s_time()
> invocation going to become a problem?

Perhaps I am not seeing the potential problem here. The main
difference I see between the two would be the base system time:
read_platform_stime() uses stime_platform_stamp as base, and computes
a delta between read_counter() (i.e. rdtsc()) and the previously
saved platform-wide stamp (platform_timer_stamp). get_s_time() uses
stime_local_stamp (updated from stime_master_stamp in
local_time_calibration()) as base, plus the delta between rdtsc() and
local_tsc_stamp. Since this is now all TSC, and the TSC increases
monotonically and is synchronized across CPUs, both calls end up
returning the same, always up-to-date value, whether or not cpu_time
has a larger gap from stime_platform_stamp. Unless the concern you
are raising comes from the fact that CPU 0 calibrates much sooner
than the last calibrated CPU, as opposed to roughly at the same time
as with std_rendezvous?

Joao
>>> On 05.04.16 at 23:34, <joao.m.martins@oracle.com> wrote:
> On 04/05/2016 01:22 PM, Jan Beulich wrote:
>>>>> On 29.03.16 at 15:44, <joao.m.martins@oracle.com> wrote:
>> But I'm opposed to this: for one, the variable being static here
>> means there is nothing that actually prevents CPU hotplug from
>> happening. And then I think this can, for all practical purposes,
>> be had by suitably using existing command line options, namely
>> "max_cpus=", such that set_nr_cpu_ids() won't allow for any
>> further CPUs to get added. Albeit I admit that if someone was
>> to bring down some CPU and then hotplug another one, we
>> might still be in trouble. So maybe the better approach would
>> be to fail onlining of CPUs that don't meet the criteria when
>> "clocksource=tsc"?
> True - max_cpus would produce the same effect. But I should point out
> that even when clocksource=tsc, the rendezvous would be
> std_rendezvous: the reference TSC is CPU 0's, and the tsc_timestamps
> are those of the individual CPUs. So perhaps the criterion should be
> clocksource=tsc plus use_tsc_stable_bit.

Oh, of course I didn't mean this to be the precise condition, just an
outline. Considering use_tsc_stable_bit certainly makes sense.

>>> @@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
>>>          .semaphore = ATOMIC_INIT(0)
>>>      };
>>>
>>> +    if ( use_tsc_stable_bit )
>>> +    {
>>> +        local_irq_disable();
>>> +        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
>>> +        local_irq_enable();
>>> +    }
>>
>> So this can't be in time_calibration_nop_rendezvous() because
>> you want to avoid the actual rendezvousing. But isn't the then
>> possibly much larger gap between read_platform_stime() (which
>> parallels the rdtsc()-s in the other two cases) and get_s_time()
>> invocation going to become a problem?
> Perhaps I am not seeing the potential problem here.

I'm not sure there's a problem, I'm just asking because I've noticed
this behavioral difference.

> The main difference I see between the two would be the base system
> time: read_platform_stime() uses stime_platform_stamp as base, and
> computes a delta between read_counter() (i.e. rdtsc()) and the
> previously saved platform-wide stamp (platform_timer_stamp).
> get_s_time() uses stime_local_stamp (updated from stime_master_stamp
> in local_time_calibration()) as base, plus the delta between rdtsc()
> and local_tsc_stamp. Since this is now all TSC, and the TSC increases
> monotonically and is synchronized across CPUs, both calls end up
> returning the same, always up-to-date value, whether or not cpu_time
> has a larger gap from stime_platform_stamp. Unless the concern you
> are raising comes from the fact that CPU 0 calibrates much sooner
> than the last calibrated CPU, as opposed to roughly at the same time
> as with std_rendezvous?

In a way, yes. I'm concerned by the two time stamps no longer being
obtained at (almost) the same time. If that's not having any bad
consequences, all the better.

Jan
>> The main difference I see between the two would be the base system
>> time: read_platform_stime() uses stime_platform_stamp as base, and
>> computes a delta between read_counter() (i.e. rdtsc()) and the
>> previously saved platform-wide stamp (platform_timer_stamp).
>> get_s_time() uses stime_local_stamp (updated from stime_master_stamp
>> in local_time_calibration()) as base, plus the delta between rdtsc()
>> and local_tsc_stamp. Since this is now all TSC, and the TSC increases
>> monotonically and is synchronized across CPUs, both calls end up
>> returning the same, always up-to-date value, whether or not cpu_time
>> has a larger gap from stime_platform_stamp. Unless the concern you
>> are raising comes from the fact that CPU 0 calibrates much sooner
>> than the last calibrated CPU, as opposed to roughly at the same time
>> as with std_rendezvous?
>
> In a way, yes. I'm concerned by the two time stamps no longer
> being obtained at (almost) the same time. If that's not having
> any bad consequences, all the better.

I don't think there would be bad consequences, as both timestamps
correspond to the same time reference - thus always returning the
latest system time irrespective of the gap between the two stamps.

If you prefer, I can go back to my initial approach (v1, with
std_rendezvous) to have both timestamps closely updated, and later
(post-release?) revisit the introduction of nop_rendezvous. Perhaps
that way is more reasonable?

Joao
>>> On 07.04.16 at 23:17, <joao.m.martins@oracle.com> wrote:
>>> The main difference I see between the two would be the base system
>>> time: read_platform_stime() uses stime_platform_stamp as base, and
>>> computes a delta between read_counter() (i.e. rdtsc()) and the
>>> previously saved platform-wide stamp (platform_timer_stamp).
>>> get_s_time() uses stime_local_stamp (updated from stime_master_stamp
>>> in local_time_calibration()) as base, plus the delta between rdtsc()
>>> and local_tsc_stamp. Since this is now all TSC, and the TSC increases
>>> monotonically and is synchronized across CPUs, both calls end up
>>> returning the same, always up-to-date value, whether or not cpu_time
>>> has a larger gap from stime_platform_stamp. Unless the concern you
>>> are raising comes from the fact that CPU 0 calibrates much sooner
>>> than the last calibrated CPU, as opposed to roughly at the same time
>>> as with std_rendezvous?
>>
>> In a way, yes. I'm concerned by the two time stamps no longer
>> being obtained at (almost) the same time. If that's not having
>> any bad consequences, all the better.
>
> I don't think there would be bad consequences, as both timestamps
> correspond to the same time reference - thus always returning the
> latest system time irrespective of the gap between the two stamps.
>
> If you prefer, I can go back to my initial approach (v1, with
> std_rendezvous) to have both timestamps closely updated, and later
> (post-release?) revisit the introduction of nop_rendezvous. Perhaps
> that way is more reasonable?

Since the new mode needs to be actively asked for, I don't think
that's necessary.

Jan
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 123aa42..1dcd4af 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -43,6 +43,10 @@
 static char __initdata opt_clocksource[10];
 string_param("clocksource", opt_clocksource);
 
+/* opt_nocpuhotplug: Set if CPU hotplug isn't meant to be used */
+static bool_t __initdata opt_nocpuhotplug;
+boolean_param("nocpuhotplug", opt_nocpuhotplug);
+
 unsigned long __read_mostly cpu_khz;  /* CPU clock frequency in kHz. */
 DEFINE_SPINLOCK(rtc_lock);
 unsigned long pit0_ticks;
@@ -435,6 +439,7 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
  * PLATFORM TIMER 4: TSC
  */
 static bool_t clocksource_is_tsc;
+static bool_t use_tsc_stable_bit;
 static u64 tsc_freq;
 static unsigned long tsc_max_warp;
 static void tsc_check_reliability(void);
@@ -468,6 +473,11 @@ static int __init init_tsctimer(struct platform_timesource *pts)
 
     pts->frequency = tsc_freq;
     clocksource_is_tsc = tsc_reliable;
+    use_tsc_stable_bit = clocksource_is_tsc &&
+        ((nr_cpu_ids == num_present_cpus()) || opt_nocpuhotplug);
+
+    if ( clocksource_is_tsc && !use_tsc_stable_bit )
+        printk(XENLOG_INFO "TSC: CPU Hotplug intended, not setting stable bit\n");
 
     return tsc_reliable;
 }
@@ -950,6 +960,8 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stime_local_stamp;
+    if ( use_tsc_stable_bit )
+        _u.flags |= PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -1431,6 +1443,22 @@ static void time_calibration_std_rendezvous(void *_r)
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
 }
 
+/*
+ * Rendezvous function used when clocksource is TSC and
+ * no CPU hotplug will be performed.
+ */
+static void time_calibration_nop_rendezvous(void *_r)
+{
+    struct cpu_calibration *c = &this_cpu(cpu_calibration);
+    struct calibration_rendezvous *r = _r;
+
+    c->local_tsc_stamp = r->master_tsc_stamp;
+    c->stime_local_stamp = get_s_time();
+    c->stime_master_stamp = r->master_stime;
+
+    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
+}
+
 static void (*time_calibration_rendezvous_fn)(void *) =
     time_calibration_std_rendezvous;
 
@@ -1440,6 +1468,13 @@ static void time_calibration(void *unused)
         .semaphore = ATOMIC_INIT(0)
     };
 
+    if ( use_tsc_stable_bit )
+    {
+        local_irq_disable();
+        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
+        local_irq_enable();
+    }
+
     cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
 
     /* @wait=1 because we must wait for all cpus before freeing @r. */
@@ -1555,6 +1590,14 @@ static int __init verify_tsc_reliability(void)
 
         init_percpu_time();
 
+        /*
+         * We won't do CPU Hotplug and TSC clocksource is being used which
+         * means we have a reliable TSC, plus we don't sync with any other
+         * clocksource so no need for rendezvous.
+         */
+        if ( use_tsc_stable_bit )
+            time_calibration_rendezvous_fn = time_calibration_nop_rendezvous;
+
         init_timer(&calibration_timer, time_calibration, NULL, 0);
         set_timer(&calibration_timer, NOW() + EPOCH);
     }
When using TSC as clocksource we will rely solely on TSC for updating
vcpu time infos (pvti). Right now, each vCPU takes its tsc_timestamp
at a different instant, i.e. every EPOCH + delta. This delta varies
depending on when the CPU calibrates with CPU 0 (the master), and will
likely differ across vCPUs. This means that each vCPU's pvti won't
account for its calibration error, which can lead to time going
backwards: a time read on vCPU B immediately after one on vCPU A may
be smaller. While this doesn't happen often, I was able to observe it
(for clocksource=tsc) around 50 times in an hour, with warps of
< 100 ns.

This patch proposes relying on host TSC synchronization and
passthrough to the guest when running on a TSC-safe platform. On
time_calibration we retrieve the platform time in ns and the counter
read by the clocksource that was used to compute system time. We
introduce a new rendezvous function which doesn't require
synchronization between master and slave CPUs: it just reads the
calibration_rendezvous struct and writes the stime and tsc stamps
into the cpu_calibration struct, to be used later on. We can
guarantee that on a platform with a constant and reliable TSC, the
time read on vCPU B right after A is bigger, independently of the CPU
calibration error. Since the pvclock time infos are then monotonic as
seen by any vCPU, we set PVCLOCK_TSC_STABLE_BIT, which in turn
enables usage of the VDSO on Linux. IIUC, this is similar to how it's
implemented on KVM.

Note that PVCLOCK_TSC_STABLE_BIT is set only when CPU hotplug isn't
meant to be performed on the host, which will be the case either when
nr_cpu_ids and num_present_cpus() are the same or when the
"nocpuhotplug" command line parameter is used. This is because a
newly hotplugged CPU may not satisfy the condition of having all TSCs
synchronized.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Perhaps "cpuhotplugsafe" would be a better name, since potentially
hardware could guarantee TSCs are synchronized on hotplug?

Changes since v1:
 - Change approach to follow Andrew's guideline to skip std_rendezvous,
   doing so by introducing a nop_rendezvous.
 - Change commit message to reflect the change above.
 - Use TSC_STABLE_BIT only if CPU hotplug isn't possible.
 - Add command line option to override it if no CPU hotplug is intended.
---
 xen/arch/x86/time.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)