Message ID: 575976AE02000078000F3695@prv-mh.provo.novell.com (mailing list archive)
State: New, archived
On 09/06/16 13:01, Jan Beulich wrote:
> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
> Simpler time handling when TSC is constant across all power saving
> states"), responsible for occasional many-microsecond cross-CPU skew of
> what NOW() returns.
>
> Also improve the correlation between local TSC and stime stamps
> obtained at the end of the two calibration handlers: Compute the stime
> one from the TSC one, instead of doing another rdtsc() for that
> computation.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

That does indeed look like a copy/paste mistake. I would be tempted to
leave this in -unstable for a while.

~Andrew
On 06/09/2016 01:01 PM, Jan Beulich wrote:
> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
> Simpler time handling when TSC is constant across all power saving
> states"), responsible for occasional many-microsecond cross-CPU skew of
> what NOW() returns.
>
> Also improve the correlation between local TSC and stime stamps
> obtained at the end of the two calibration handlers: Compute the stime
> one from the TSC one, instead of doing another rdtsc() for that
> computation.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
> that has been there for many years (and hence many releases). There's
> certainly non-zero risk that I'm overlooking something here (despite
> Joao apparently having come to the same conclusion), so I can't really
> make up my mind on whether to request this patch to be put there right
> away, or rather have it linger in -unstable for a while.
>
Initially I thought of this as a fix too, but then wouldn't having
t->stime_local_stamp be c->stime_local_stamp render the platform timer
reads done on calibration useless? Unless we were to change
update_vcpu_system_time() to use stime_master_stamp instead? The
stime_master_stamp field isn't used anywhere other than the dom0-injected
cpu_frequency_change, the boot-time seeding of the cpu_time struct in
init_percpu_time(), and the already mentioned use in
local_time_calibration().

init_percpu_time() also takes a separate read of the platform timer per
CPU; that could probably be replaced by a read done on the boot processor
and written to the remaining CPUs, so that all would start from the same
stamp. IOW - this sounds like we are making stime entirely TSC-based
except when initially seeding each cpu_time?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -998,7 +998,7 @@ static void local_time_calibration(void)
>      /* Atomically read cpu_calibration struct and write cpu_time struct. */
>      local_irq_disable();
>      t->local_tsc_stamp    = c->local_tsc_stamp;
> -    t->stime_local_stamp  = c->stime_master_stamp;
> +    t->stime_local_stamp  = c->stime_local_stamp;
>      t->stime_master_stamp = c->stime_master_stamp;
>      local_irq_enable();
>      update_vcpu_system_time(current);
> @@ -1275,7 +1275,7 @@ static void time_calibration_tsc_rendezv
>      }
>
>      c->local_tsc_stamp = rdtsc();
> -    c->stime_local_stamp = get_s_time();
> +    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
>      c->stime_master_stamp = r->master_stime;
>
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> @@ -1305,7 +1305,7 @@ static void time_calibration_std_rendezv
>      }
>
>      c->local_tsc_stamp = rdtsc();
> -    c->stime_local_stamp = get_s_time();
> +    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
>      c->stime_master_stamp = r->master_stime;
>
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
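To make the get_s_time_fixed() half of the patch concrete, here is a
minimal compilable sketch of what that helper is assumed to do:
extrapolate system time from a caller-supplied TSC value using the last
calibration point. Only get_s_time_fixed()/get_s_time() and the stamp
fields are Xen's; calib_point, mul_frac, and get_s_time_fixed_sketch are
illustrative names, and the 32.32 fixed-point multiply (via the
non-standard but widely supported __uint128_t) stands in for Xen's
scale_delta()/struct time_scale.

#include <stdint.h>
#include <stdio.h>

struct calib_point {
    uint64_t tsc;      /* like local_tsc_stamp: TSC at last calibration */
    int64_t  stime;    /* like stime_local_stamp: stime (ns) at calibration */
    uint64_t mul_frac; /* ns per TSC tick, as a 32.32 fixed-point factor */
};

int64_t get_s_time_fixed_sketch(const struct calib_point *p, uint64_t at_tsc)
{
    uint64_t delta = at_tsc - p->tsc;
    return p->stime + (int64_t)(((__uint128_t)delta * p->mul_frac) >> 32);
}

int main(void)
{
    /* Invented numbers: a 2.5GHz TSC gives 0.4ns per tick. */
    struct calib_point p = { 1000000, 400000, (uint64_t)(0.4 * 4294967296.0) };

    /* Pre-patch, the rendezvous recorded rdtsc() and then called
     * get_s_time(), which did a second rdtsc() a few cycles later, so the
     * recorded (TSC, stime) pair was slightly mismatched. Post-patch both
     * stamps derive from the single TSC read: */
    uint64_t tsc = 1250000; /* stands in for c->local_tsc_stamp = rdtsc() */
    printf("stime for tsc=%llu: %lld ns\n", (unsigned long long)tsc,
           (long long)get_s_time_fixed_sketch(&p, tsc));
    return 0;
}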
On Thu, Jun 09, 2016 at 06:01:18AM -0600, Jan Beulich wrote:
> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
> Simpler time handling when TSC is constant across all power saving
> states"), responsible for occasional many-microsecond cross-CPU skew of
> what NOW() returns.
>
> Also improve the correlation between local TSC and stime stamps
> obtained at the end of the two calibration handlers: Compute the stime
> one from the TSC one, instead of doing another rdtsc() for that
> computation.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
> that has been there for many years (and hence many releases). There's
> certainly non-zero risk that I'm overlooking something here (despite
> Joao apparently having come to the same conclusion), so I can't really
> make up my mind on whether to request this patch to be put there right
> away, or rather have it linger in -unstable for a while.
>
Leave it in unstable for a while please. I don't want to delay the
release any further.

Wei.
>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
> On 06/09/2016 01:01 PM, Jan Beulich wrote:
>> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
>> Simpler time handling when TSC is constant across all power saving
>> states"), responsible for occasional many-microsecond cross-CPU skew of
>> what NOW() returns.
>>
>> Also improve the correlation between local TSC and stime stamps
>> obtained at the end of the two calibration handlers: Compute the stime
>> one from the TSC one, instead of doing another rdtsc() for that
>> computation.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
>> that has been there for many years (and hence many releases). There's
>> certainly non-zero risk that I'm overlooking something here (despite
>> Joao apparently having come to the same conclusion), so I can't really
>> make up my mind on whether to request this patch to be put there right
>> away, or rather have it linger in -unstable for a while.
>>
> Initially I thought of this as a fix too, but then wouldn't having
> t->stime_local_stamp be c->stime_local_stamp render the platform timer
> reads done on calibration useless? Unless we were to change
> update_vcpu_system_time() to use stime_master_stamp instead? The
> stime_master_stamp field isn't used anywhere other than the dom0-injected
> cpu_frequency_change, the boot-time seeding of the cpu_time struct in
> init_percpu_time(), and the already mentioned use in
> local_time_calibration().
> init_percpu_time() also takes a separate read of the platform timer per
> CPU; that could probably be replaced by a read done on the boot processor
> and written to the remaining CPUs, so that all would start from the same
> stamp. IOW - this sounds like we are making stime entirely TSC-based
> except when initially seeding each cpu_time?

Did you also look at the "slow" path in local_time_calibration()? That
does use the master stamp. So in effect for the fast path the patch
changes the situation from c->stime_local_stamp being effectively
unused to c->stime_master_stamp being so. In the former case, if
that really hadn't been a typo, deleting the write of that field from
time_calibration_std_rendezvous() would have made sense, as
get_s_time() certainly is more overhead than the simple memory
read and write needed for keeping c->stime_master_stamp up to
date (despite not being used).

Jan
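For readers without the source at hand, here is a compilable paraphrase
of the constant-TSC "fast path" copy being discussed. The field names
mirror xen/arch/x86/time.c, while the wrapper function and the elision of
the IRQ disable/enable and update_vcpu_system_time() call are liberties
taken for the sketch.

#include <stdint.h>

struct stamps {
    uint64_t local_tsc_stamp;
    int64_t  stime_local_stamp;
    int64_t  stime_master_stamp;
};

void fast_path_copy(struct stamps *t /* cpu_time */,
                    const struct stamps *c /* cpu_calibration */)
{
    t->local_tsc_stamp    = c->local_tsc_stamp;
    /* Pre-patch this line read c->stime_master_stamp: this CPU's TSC got
     * paired with an stime sampled on the rendezvous master at a slightly
     * different instant, and that sampling skew surfaced in NOW(). */
    t->stime_local_stamp  = c->stime_local_stamp;
    /* Which makes this field the effectively-unused one on the fast path,
     * as Jan notes above. */
    t->stime_master_stamp = c->stime_master_stamp;
}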
On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>> On 06/09/2016 01:01 PM, Jan Beulich wrote:
>>> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
>>> Simpler time handling when TSC is constant across all power saving
>>> states"), responsible for occasional many-microsecond cross-CPU skew of
>>> what NOW() returns.
>>>
>>> Also improve the correlation between local TSC and stime stamps
>>> obtained at the end of the two calibration handlers: Compute the stime
>>> one from the TSC one, instead of doing another rdtsc() for that
>>> computation.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
>>> that has been there for many years (and hence many releases). There's
>>> certainly non-zero risk that I'm overlooking something here (despite
>>> Joao apparently having come to the same conclusion), so I can't really
>>> make up my mind on whether to request this patch to be put there right
>>> away, or rather have it linger in -unstable for a while.
>>>
>> Initially I thought of this as a fix too, but then wouldn't having
>> t->stime_local_stamp be c->stime_local_stamp render the platform timer
>> reads done on calibration useless? Unless we were to change
>> update_vcpu_system_time() to use stime_master_stamp instead? The
>> stime_master_stamp field isn't used anywhere other than the dom0-injected
>> cpu_frequency_change, the boot-time seeding of the cpu_time struct in
>> init_percpu_time(), and the already mentioned use in
>> local_time_calibration().
>> init_percpu_time() also takes a separate read of the platform timer per
>> CPU; that could probably be replaced by a read done on the boot processor
>> and written to the remaining CPUs, so that all would start from the same
>> stamp. IOW - this sounds like we are making stime entirely TSC-based
>> except when initially seeding each cpu_time?
>
> Did you also look at the "slow" path in local_time_calibration()? That
> does use the master stamp.

Yeah, I am aware of its usage there - my comment was only on the fast
path. I think the latter is the most common case too.

> So in effect for the fast path the patch
> changes the situation from c->stime_local_stamp being effectively
> unused to c->stime_master_stamp being so. In the former case, if
> that really hadn't been a typo, deleting the write of that field from
> time_calibration_std_rendezvous() would have made sense, as
> get_s_time() certainly is more overhead than the simple memory
> read and write needed for keeping c->stime_master_stamp up to
> date (despite not being used).

I agree, but what I meant previously was more of a concern: CPU 0 is
doing an expensive read_platform_time (e.g. HPET, supposedly on the
order of microseconds, plus a non-contended lock) to set
stime_master_stamp, which doesn't get used at all - effectively not
using the clocksource set initially at boot.

What if verify_tsc_reliability clears out X86_FEATURE_TSC_RELIABLE when
failing the warp test? The calibration function is set early on, right
after interrupts are enabled, and the time warp check later on, when all
CPUs are up. So on problematic platforms it's possible that
std_rendezvous is used with a constant TSC that is still deemed
unreliable. We still keep incrementing deltas at roughly the same time,
but in effect with this change the stime_local_stamp would be TSC-only
based, thus leading to warps with an unreliable TSC? And there's also
the CPU hotplug/onlining case that we once discussed.

Joao
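The ordering hazard Joao describes can be modeled in a few lines. In the
sketch below the Xen names (verify_tsc_reliability,
X86_FEATURE_TSC_RELIABLE, the two rendezvous handlers) are real, but the
stub bodies, the selection rule, and the failing warp test are
assumptions made for illustration.

#include <stdbool.h>
#include <stdio.h>

static bool tsc_reliable = true;  /* stands in for X86_FEATURE_TSC_RELIABLE */
static const char *rendezvous_fn;

static void pick_calibration_fn(void)  /* early, right after IRQs enabled */
{
    rendezvous_fn = tsc_reliable ? "time_calibration_std_rendezvous"
                                 : "time_calibration_tsc_rendezvous";
}

static void verify_tsc_reliability_stub(void)  /* later, all CPUs online */
{
    bool warp_detected = true;  /* assumed outcome on a problematic platform */
    if (warp_detected)
        tsc_reliable = false;   /* the feature flag gets cleared ... */
    /* ... but rendezvous_fn is never re-evaluated, so std_rendezvous keeps
     * running against a TSC now deemed unreliable. */
}

int main(void)
{
    pick_calibration_fn();
    verify_tsc_reliability_stub();
    /* Prints "reliable=0, rendezvous=time_calibration_std_rendezvous":
     * the early choice has gone stale. */
    printf("reliable=%d, rendezvous=%s\n", tsc_reliable, rendezvous_fn);
    return 0;
}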
>>> On 09.06.16 at 17:00, <joao.m.martins@oracle.com> wrote:
> On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>> So in effect for the fast path the patch
>> changes the situation from c->stime_local_stamp being effectively
>> unused to c->stime_master_stamp being so. In the former case, if
>> that really hadn't been a typo, deleting the write of that field from
>> time_calibration_std_rendezvous() would have made sense, as
>> get_s_time() certainly is more overhead than the simple memory
>> read and write needed for keeping c->stime_master_stamp up to
>> date (despite not being used).
> I agree, but what I meant previously was more of a concern: CPU 0 is
> doing an expensive read_platform_time (e.g. HPET, supposedly on the
> order of microseconds, plus a non-contended lock) to set
> stime_master_stamp, which doesn't get used at all - effectively not
> using the clocksource set initially at boot.

Yeah, there's likely some cleanup potential here, but of course we
need to be pretty careful about not doing something that may be
needed by some code paths but not others. But if you think you
can help the situation without harming maintainability, feel free to
go ahead.

> What if verify_tsc_reliability clears out X86_FEATURE_TSC_RELIABLE when
> failing the warp test? The calibration function is set early on, right
> after interrupts are enabled, and the time warp check later on, when all
> CPUs are up. So on problematic platforms it's possible that
> std_rendezvous is used with a constant TSC that is still deemed
> unreliable. We still keep incrementing deltas at roughly the same time,
> but in effect with this change the stime_local_stamp would be TSC-only
> based, thus leading to warps with an unreliable TSC? And there's also
> the CPU hotplug/onlining case that we once discussed.

I agree that we're likely in trouble in such a case. But for the
moment I'd be glad if we could get the "normal" case to work right.

Jan
[changing Dario's address to citrix.com as it was bouncing for me]

On 06/09/2016 04:52 PM, Jan Beulich wrote:
>>>> On 09.06.16 at 17:00, <joao.m.martins@oracle.com> wrote:
>> On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>>> So in effect for the fast path the patch
>>> changes the situation from c->stime_local_stamp being effectively
>>> unused to c->stime_master_stamp being so. In the former case, if
>>> that really hadn't been a typo, deleting the write of that field from
>>> time_calibration_std_rendezvous() would have made sense, as
>>> get_s_time() certainly is more overhead than the simple memory
>>> read and write needed for keeping c->stime_master_stamp up to
>>> date (despite not being used).
>> I agree, but what I meant previously was more of a concern: CPU 0 is
>> doing an expensive read_platform_time (e.g. HPET, supposedly on the
>> order of microseconds, plus a non-contended lock) to set
>> stime_master_stamp, which doesn't get used at all - effectively not
>> using the clocksource set initially at boot.
>
> Yeah, there's likely some cleanup potential here, but of course we
> need to be pretty careful about not doing something that may be
> needed by some code paths but not others. But if you think you
> can help the situation without harming maintainability, feel free to
> go ahead.
>
OK, makes sense. I'll likely already do some of it in my related series.

>> What if verify_tsc_reliability clears out X86_FEATURE_TSC_RELIABLE when
>> failing the warp test? The calibration function is set early on, right
>> after interrupts are enabled, and the time warp check later on, when all
>> CPUs are up. So on problematic platforms it's possible that
>> std_rendezvous is used with a constant TSC that is still deemed
>> unreliable. We still keep incrementing deltas at roughly the same time,
>> but in effect with this change the stime_local_stamp would be TSC-only
>> based, thus leading to warps with an unreliable TSC? And there's also
>> the CPU hotplug/onlining case that we once discussed.
>
> I agree that we're likely in trouble in such a case. But for the
> moment I'd be glad if we could get the "normal" case to work right.
>
OK. Apologies for the noise - I was just pointing out things that I
tried, some of which were also discussed in the PVCLOCK_TSC_STABLE_BIT
series, although it hadn't crossed my mind that Xen's own idea of time
could be a little broken.

IMO adding another clocksource for TSC would be more correct if we are
only using the TSC (with its associated limitations made explicit to the
user), rather than riding on the back of whatever other clocksource is
in use. But it wouldn't cover the normal case :( unless set manually.

NB: Guests, on the other hand, aren't affected, since they take care of
keeping the latest stamp when different vCPUs slightly diverge.

Joao
>>> On 09.06.16 at 20:19, <joao.m.martins@oracle.com> wrote:
> [changing Dario's address to citrix.com as it was bouncing for me]
>
> On 06/09/2016 04:52 PM, Jan Beulich wrote:
>>>>> On 09.06.16 at 17:00, <joao.m.martins@oracle.com> wrote:
>>> On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>>>> So in effect for the fast path the patch
>>>> changes the situation from c->stime_local_stamp being effectively
>>>> unused to c->stime_master_stamp being so. In the former case, if
>>>> that really hadn't been a typo, deleting the write of that field from
>>>> time_calibration_std_rendezvous() would have made sense, as
>>>> get_s_time() certainly is more overhead than the simple memory
>>>> read and write needed for keeping c->stime_master_stamp up to
>>>> date (despite not being used).
>>> I agree, but what I meant previously was more of a concern: CPU 0 is
>>> doing an expensive read_platform_time (e.g. HPET, supposedly on the
>>> order of microseconds, plus a non-contended lock) to set
>>> stime_master_stamp, which doesn't get used at all - effectively not
>>> using the clocksource set initially at boot.
>>
>> Yeah, there's likely some cleanup potential here, but of course we
>> need to be pretty careful about not doing something that may be
>> needed by some code paths but not others. But if you think you
>> can help the situation without harming maintainability, feel free to
>> go ahead.
>>
> OK, makes sense. I'll likely already do some of it in my related series.

Actually, since local time gets seeded from platform time in
init_percpu_time(), I don't think we can do away with
maintaining platform time.

Jan
>>> On 10.06.16 at 08:59, <JBeulich@suse.com> wrote:
> Actually, since local time gets seeded from platform time in
> init_percpu_time(), I don't think we can do away with
> maintaining platform time.

And it looks like this seeding is where much of the remaining backwards
deltas are coming from: While on local_time_calibration()'s slow path
we use the platform time as reference (and hence the seeding is fine),
on the fast path we don't, and hence using the platform time in
init_percpu_time() to initialize local stime introduces a discontinuity.

Jan
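A toy model of that discontinuity, with invented numbers; the only claim
carried over from the thread is that init_percpu_time() seeds a new CPU's
stime from the platform timer while fast-path CPUs extrapolate theirs
from the TSC.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t tsc_extrapolated_ns = 1000000000; /* stime on a running CPU */
    int64_t platform_now_ns     = 1000000700; /* platform timer, same instant */

    /* init_percpu_time() seeds the new CPU from the platform timer... */
    int64_t seeded_stime = platform_now_ns;

    /* ...so until calibration converges, a NOW() comparison across these
     * CPUs can appear to run backwards by up to this delta. */
    printf("new CPU starts %lld ns ahead of its siblings\n",
           (long long)(seeded_stime - tsc_extrapolated_ns));
    return 0;
}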
On 06/10/2016 10:29 AM, Jan Beulich wrote:
>>>> On 10.06.16 at 08:59, <JBeulich@suse.com> wrote:
>> Actually, since local time gets seeded from platform time in
>> init_percpu_time(), I don't think we can do away with
>> maintaining platform time.
>
Yeah, I agree. But the point of my previous message was about the
improvement potential of the rendezvous after this patch (for the fast
path). Platform timer overflow could be another example, though probably
a bit borderline, as certain clocksources have very short intervals.

> And it looks like this seeding is where much of the remaining backwards
> deltas are coming from: While on local_time_calibration()'s slow path
> we use the platform time as reference (and hence the seeding is fine),
> on the fast path we don't, and hence using the platform time in
> init_percpu_time() to initialize local stime introduces a discontinuity.

I'll do some testing on your patch addressing this on Monday (today is
a holiday in my country).

Joao
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -998,7 +998,7 @@ static void local_time_calibration(void)
     /* Atomically read cpu_calibration struct and write cpu_time struct. */
     local_irq_disable();
     t->local_tsc_stamp    = c->local_tsc_stamp;
-    t->stime_local_stamp  = c->stime_master_stamp;
+    t->stime_local_stamp  = c->stime_local_stamp;
     t->stime_master_stamp = c->stime_master_stamp;
     local_irq_enable();
     update_vcpu_system_time(current);
@@ -1275,7 +1275,7 @@ static void time_calibration_tsc_rendezv
     }

     c->local_tsc_stamp = rdtsc();
-    c->stime_local_stamp = get_s_time();
+    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
     c->stime_master_stamp = r->master_stime;

     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
@@ -1305,7 +1305,7 @@ static void time_calibration_std_rendezv
     }

     c->local_tsc_stamp = rdtsc();
-    c->stime_local_stamp = get_s_time();
+    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
     c->stime_master_stamp = r->master_stime;

     raise_softirq(TIME_CALIBRATE_SOFTIRQ);