Message ID | 1579030581-7929-1-git-send-email-igor.druzhinin@citrix.com (mailing list archive)
---|---
State | New, archived
Series | x86/time: update TSC stamp on restore from deep C-state
On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> If ITSC is not available on the CPU (e.g. if running nested as PV shim)
> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
> on all AMD and some old Intel processors. In that case the TSC needs to
> be restored on the CPU from platform time by Xen upon exiting deep
> C-states.
>
> As platform time might be behind the last TSC stamp recorded for the
> current CPU, the invariant that the TSC stamp is always behind the local
> TSC counter is violated. This has the effect of get_s_time() going
> negative, resulting in an eventual system hang or crash.
>
> Fix this issue by updating the local TSC stamp along with the TSC
> counter write.

Thanks! I haven't seen such an issue because I've been running the shim
with nomigrate in order to prevent the vTSC overhead.

> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
> This caused reliable hangs of shim domains with multiple vCPUs on all AMD
> systems. The problem was also reproduced on bare metal by artificially
> masking the ITSC feature bit. The proposed fix has been verified for both
> cases.
> ---
>  xen/arch/x86/time.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index e79cb4d..f6b26f8 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>
>  void cstate_restore_tsc(void)
>  {
> +    struct cpu_time *t = &this_cpu(cpu_time);
> +
>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>          return;
>
> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> +    t->stamp.master_stime = read_platform_stime(NULL);
> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> +    t->stamp.local_stime = t->stamp.master_stime;
> +
> +    write_tsc(t->stamp.local_tsc);

In order to avoid the TSC write (and the likely associated vmexit),
could you instead do:

    t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
    t->stamp.local_tsc = rdtsc_ordered();

I think it should achieve the same, as it syncs the local TSC stamp and
times, would avoid the TSC write, and slightly simplifies the logic.

Thanks, Roger.
On 14.01.2020 20:36, Igor Druzhinin wrote: > If ITSC is not available on CPU (e.g if running nested as PV shim) > then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. > all AMD and some old Intel processors. In which case TSC would need to > be restored on CPU from platform time by Xen upon exiting deep C-states. How does waking from deep C states correspond to the PV shim? I notice that cstate_restore_tsc() gets called irrespective of the C state being exited, so I wonder whether there's room for improvement there independent of the issue at hand. As far as this change is concerned, I think you want to drop the notion of "deep" from the description. Jan
On 15.01.2020 10:47, Roger Pau Monné wrote: > On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote: >> --- a/xen/arch/x86/time.c >> +++ b/xen/arch/x86/time.c >> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime) >> >> void cstate_restore_tsc(void) >> { >> + struct cpu_time *t = &this_cpu(cpu_time); >> + >> if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) >> return; >> >> - write_tsc(stime2tsc(read_platform_stime(NULL))); >> + t->stamp.master_stime = read_platform_stime(NULL); >> + t->stamp.local_tsc = stime2tsc(t->stamp.master_stime); >> + t->stamp.local_stime = t->stamp.master_stime; >> + >> + write_tsc(t->stamp.local_tsc); > > In order to avoid the TSC write (and the likely associated vmexit), > could you instead do: > > t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL); > t->stamp.local_tsc = rdtsc_ordered(); > > I think it should achieve the same as it syncs the local TSC stamp and > times, would avoid the TSC write and slightly simplifies the logic. Wouldn't this result in guests possibly observing the TSC moving backwards? Jan
On Wed, Jan 15, 2020 at 12:40:27PM +0100, Jan Beulich wrote: > On 15.01.2020 10:47, Roger Pau Monné wrote: > > On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote: > >> --- a/xen/arch/x86/time.c > >> +++ b/xen/arch/x86/time.c > >> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime) > >> > >> void cstate_restore_tsc(void) > >> { > >> + struct cpu_time *t = &this_cpu(cpu_time); > >> + > >> if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) > >> return; > >> > >> - write_tsc(stime2tsc(read_platform_stime(NULL))); > >> + t->stamp.master_stime = read_platform_stime(NULL); > >> + t->stamp.local_tsc = stime2tsc(t->stamp.master_stime); > >> + t->stamp.local_stime = t->stamp.master_stime; > >> + > >> + write_tsc(t->stamp.local_tsc); > > > > In order to avoid the TSC write (and the likely associated vmexit), > > could you instead do: > > > > t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL); > > t->stamp.local_tsc = rdtsc_ordered(); > > > > I think it should achieve the same as it syncs the local TSC stamp and > > times, would avoid the TSC write and slightly simplifies the logic. > > Wouldn't this result in guests possibly observing the TSC moving > backwards? Isn't local_tsc storing a TSC value read from the same CPU always, and hence could only go backwards if rdtsc actually goes backwards? Ie: cpu_frequency_change seems to do something similar, together with a re-adjusting of the time scale, but doesn't perform any TSC write. Thanks, Roger.
On 15/01/2020 11:40, Jan Beulich wrote:
> On 15.01.2020 10:47, Roger Pau Monné wrote:
>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>> --- a/xen/arch/x86/time.c
>>> +++ b/xen/arch/x86/time.c
>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>
>>>  void cstate_restore_tsc(void)
>>>  {
>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>> +
>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>          return;
>>>
>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>> +
>>> +    write_tsc(t->stamp.local_tsc);
>>
>> In order to avoid the TSC write (and the likely associated vmexit),
>> could you instead do:
>>
>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>> t->stamp.local_tsc = rdtsc_ordered();
>>
>> I think it should achieve the same as it syncs the local TSC stamp and
>> times, would avoid the TSC write and slightly simplifies the logic.
>
> Wouldn't this result in guests possibly observing the TSC moving
> backwards?

Yes, I think so. Would restoring from the TSC stamp if it's higher than
platform time be better, do you think?

Igor
On 15/01/2020 11:32, Jan Beulich wrote: > On 14.01.2020 20:36, Igor Druzhinin wrote: >> If ITSC is not available on CPU (e.g if running nested as PV shim) >> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >> all AMD and some old Intel processors. In which case TSC would need to >> be restored on CPU from platform time by Xen upon exiting deep C-states. > > How does waking from deep C states correspond to the PV shim? I notice > that cstate_restore_tsc() gets called irrespective of the C state being > exited, so I wonder whether there's room for improvement there > independent of the issue at hand. As far as this change is concerned, > I think you want to drop the notion of "deep" from the description. I'm not familiar with what to call "deep C-state" so for me it was anything higher than C1. If you prefer "deep" to be dropped - so be it. Igor
On 15/01/2020 12:25, Igor Druzhinin wrote: > On 15/01/2020 11:40, Jan Beulich wrote: >> On 15.01.2020 10:47, Roger Pau Monné wrote: >>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote: >>>> --- a/xen/arch/x86/time.c >>>> +++ b/xen/arch/x86/time.c >>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime) >>>> >>>> void cstate_restore_tsc(void) >>>> { >>>> + struct cpu_time *t = &this_cpu(cpu_time); >>>> + >>>> if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) >>>> return; >>>> >>>> - write_tsc(stime2tsc(read_platform_stime(NULL))); >>>> + t->stamp.master_stime = read_platform_stime(NULL); >>>> + t->stamp.local_tsc = stime2tsc(t->stamp.master_stime); >>>> + t->stamp.local_stime = t->stamp.master_stime; >>>> + >>>> + write_tsc(t->stamp.local_tsc); >>> >>> In order to avoid the TSC write (and the likely associated vmexit), >>> could you instead do: >>> >>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL); >>> t->stamp.local_tsc = rdtsc_ordered(); >>> >>> I think it should achieve the same as it syncs the local TSC stamp and >>> times, would avoid the TSC write and slightly simplifies the logic. >> >> Wouldn't this result in guests possibly observing the TSC moving >> backwards? > > Yes, I think so. Would restoring from TSC stamp if it's higher than > platform time better you think? > Ignore my reply. I was thinking you're asking whether the original code would do such a thing. Although I'm concerned if what you say actually applies to the original code as well. Would you think the existing logic handles it already? Igor
On 15/01/2020 09:47, Roger Pau Monné wrote: > On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote: >> If ITSC is not available on CPU (e.g if running nested as PV shim) >> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >> all AMD and some old Intel processors. In which case TSC would need to >> be restored on CPU from platform time by Xen upon exiting deep C-states. >> >> As platform time might be behind the last TSC stamp recorded for the >> current CPU, invariant of TSC stamp being always behind local TSC counter >> is violated. This has an effect of get_s_time() going negative resulting >> in eventual system hang or crash. >> >> Fix this issue by updating local TSC stamp along with TSC counter write. > > Thanks! I haven't seen such issue because I've been running the shim > with nomigrate in order to prevent the vTSC overhead. > >> >> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> >> --- >> This caused reliable hangs of shim domains with multiple vCPUs on all AMD >> systems. The problem got also reproduced on bare-metal by artifically >> masking ITSC feature bit. The proposed fix has been verified for both >> cases. 
>> --- >> xen/arch/x86/time.c | 8 +++++++- >> 1 file changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c >> index e79cb4d..f6b26f8 100644 >> --- a/xen/arch/x86/time.c >> +++ b/xen/arch/x86/time.c >> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime) >> >> void cstate_restore_tsc(void) >> { >> + struct cpu_time *t = &this_cpu(cpu_time); >> + >> if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) >> return; >> >> - write_tsc(stime2tsc(read_platform_stime(NULL))); >> + t->stamp.master_stime = read_platform_stime(NULL); >> + t->stamp.local_tsc = stime2tsc(t->stamp.master_stime); >> + t->stamp.local_stime = t->stamp.master_stime; >> + >> + write_tsc(t->stamp.local_tsc); > > In order to avoid the TSC write (and the likely associated vmexit), > could you instead do: > > t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL); > t->stamp.local_tsc = rdtsc_ordered(); I think in that case RDTSC might return something behind platform time which is not right I guess. Igor
On 15.01.2020 13:28, Igor Druzhinin wrote: > On 15/01/2020 11:32, Jan Beulich wrote: >> On 14.01.2020 20:36, Igor Druzhinin wrote: >>> If ITSC is not available on CPU (e.g if running nested as PV shim) >>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >>> all AMD and some old Intel processors. In which case TSC would need to >>> be restored on CPU from platform time by Xen upon exiting deep C-states. >> >> How does waking from deep C states correspond to the PV shim? I notice >> that cstate_restore_tsc() gets called irrespective of the C state being >> exited, so I wonder whether there's room for improvement there >> independent of the issue at hand. As far as this change is concerned, >> I think you want to drop the notion of "deep" from the description. > > I'm not familiar with what to call "deep C-state" so for me it was anything > higher than C1. If you prefer "deep" to be dropped - so be it. "Higher than C1" may be fine (albeit I vaguely recall TSC issues starting with C3 only), but at least mwait_idle() calls the function even for C1. As to the PV shim - does it know about any C-states at all (beyond HLT-invoked C1)? Jan
On 15.01.2020 13:31, Igor Druzhinin wrote:
> On 15/01/2020 12:25, Igor Druzhinin wrote:
>> On 15/01/2020 11:40, Jan Beulich wrote:
>>> On 15.01.2020 10:47, Roger Pau Monné wrote:
>>>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>>>> --- a/xen/arch/x86/time.c
>>>>> +++ b/xen/arch/x86/time.c
>>>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>>>
>>>>>  void cstate_restore_tsc(void)
>>>>>  {
>>>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>>>> +
>>>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>>>          return;
>>>>>
>>>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>>>> +
>>>>> +    write_tsc(t->stamp.local_tsc);
>>>>
>>>> In order to avoid the TSC write (and the likely associated vmexit),
>>>> could you instead do:
>>>>
>>>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>>>> t->stamp.local_tsc = rdtsc_ordered();
>>>>
>>>> I think it should achieve the same as it syncs the local TSC stamp and
>>>> times, would avoid the TSC write and slightly simplifies the logic.
>>>
>>> Wouldn't this result in guests possibly observing the TSC moving
>>> backwards?
>>
>> Yes, I think so. Would restoring from TSC stamp if it's higher than
>> platform time better you think?
>
> Ignore my reply. I was thinking you're asking whether the original code
> would do such a thing. Although I'm concerned if what you say actually
> applies to the original code as well. Would you think the existing logic
> handles it already?

I think the original code doesn't guarantee the absence of backwards
moves at all, but it also wouldn't produce large backwards jumps. It's
the large ones I was mainly thinking of when replying to Roger. But let
me also reply back to him...

Jan
On 15/01/2020 12:39, Jan Beulich wrote: > On 15.01.2020 13:28, Igor Druzhinin wrote: >> On 15/01/2020 11:32, Jan Beulich wrote: >>> On 14.01.2020 20:36, Igor Druzhinin wrote: >>>> If ITSC is not available on CPU (e.g if running nested as PV shim) >>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >>>> all AMD and some old Intel processors. In which case TSC would need to >>>> be restored on CPU from platform time by Xen upon exiting deep C-states. >>> >>> How does waking from deep C states correspond to the PV shim? I notice >>> that cstate_restore_tsc() gets called irrespective of the C state being >>> exited, so I wonder whether there's room for improvement there >>> independent of the issue at hand. As far as this change is concerned, >>> I think you want to drop the notion of "deep" from the description. >> >> I'm not familiar with what to call "deep C-state" so for me it was anything >> higher than C1. If you prefer "deep" to be dropped - so be it. > > "Higher than C1" may be fine (albeit I vaguely recall TSC issues > starting with C3 only), but at least mwait_idle() calls the > function even for C1. As to the PV shim - does it know about any > C-states at all (beyond HLT-invoked C1)? Yes, PV-shim knows about C states as it looks they are tied to processor ID in some cases. For AMD specifically C2 is HLT. Igor
On 15.01.2020 12:53, Roger Pau Monné wrote: > On Wed, Jan 15, 2020 at 12:40:27PM +0100, Jan Beulich wrote: >> On 15.01.2020 10:47, Roger Pau Monné wrote: >>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote: >>>> --- a/xen/arch/x86/time.c >>>> +++ b/xen/arch/x86/time.c >>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime) >>>> >>>> void cstate_restore_tsc(void) >>>> { >>>> + struct cpu_time *t = &this_cpu(cpu_time); >>>> + >>>> if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) >>>> return; >>>> >>>> - write_tsc(stime2tsc(read_platform_stime(NULL))); >>>> + t->stamp.master_stime = read_platform_stime(NULL); >>>> + t->stamp.local_tsc = stime2tsc(t->stamp.master_stime); >>>> + t->stamp.local_stime = t->stamp.master_stime; >>>> + >>>> + write_tsc(t->stamp.local_tsc); >>> >>> In order to avoid the TSC write (and the likely associated vmexit), >>> could you instead do: >>> >>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL); >>> t->stamp.local_tsc = rdtsc_ordered(); >>> >>> I think it should achieve the same as it syncs the local TSC stamp and >>> times, would avoid the TSC write and slightly simplifies the logic. >> >> Wouldn't this result in guests possibly observing the TSC moving >> backwards? > > Isn't local_tsc storing a TSC value read from the same CPU always, and > hence could only go backwards if rdtsc actually goes backwards? For one I have to admit I was (mistakenly) thinking of wakeup from S states more than that from C states. So assuming the TSC indeed only stops (but won't get e.g. restarted), backwards moves ought to be excluded. What I'm then worried about is too little progress observable by guests. The PV time protocol ought to be fine in this regard (and consumers of raw TSC values are on their own anyway), but wouldn't you need to update TSC offsets of HVM guests in order to compensate for the elapsed time? 
> Ie: cpu_frequency_change seems to do something similar, together with
> a re-adjusting of the time scale, but doesn't perform any TSC write.

A P-state change at most alters the tick rate, but wouldn't stop or
even reset the TSC (afaict).

Jan
On 15.01.2020 13:47, Igor Druzhinin wrote: > On 15/01/2020 12:39, Jan Beulich wrote: >> On 15.01.2020 13:28, Igor Druzhinin wrote: >>> On 15/01/2020 11:32, Jan Beulich wrote: >>>> On 14.01.2020 20:36, Igor Druzhinin wrote: >>>>> If ITSC is not available on CPU (e.g if running nested as PV shim) >>>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >>>>> all AMD and some old Intel processors. In which case TSC would need to >>>>> be restored on CPU from platform time by Xen upon exiting deep C-states. >>>> >>>> How does waking from deep C states correspond to the PV shim? I notice >>>> that cstate_restore_tsc() gets called irrespective of the C state being >>>> exited, so I wonder whether there's room for improvement there >>>> independent of the issue at hand. As far as this change is concerned, >>>> I think you want to drop the notion of "deep" from the description. >>> >>> I'm not familiar with what to call "deep C-state" so for me it was anything >>> higher than C1. If you prefer "deep" to be dropped - so be it. >> >> "Higher than C1" may be fine (albeit I vaguely recall TSC issues >> starting with C3 only), but at least mwait_idle() calls the >> function even for C1. As to the PV shim - does it know about any >> C-states at all (beyond HLT-invoked C1)? > > Yes, PV-shim knows about C states as it looks they are tied to > processor ID in some cases. For AMD specifically C2 is HLT. The AMD part is pretty new, and is - afaict - an exception compared to everything else. Under PVH there's no respective ACPI data (iirc), and we also don't surface MONITOR/MWAIT to HVM/PVH guests, making mwait_idle_probe() bail early. I wonder whether this special behavior on AMD Fam17 should be suppressed in this case. Jan
On Wed, Jan 15, 2020 at 12:36:08PM +0000, Igor Druzhinin wrote:
> On 15/01/2020 09:47, Roger Pau Monné wrote:
> > On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> >> If ITSC is not available on CPU (e.g if running nested as PV shim)
> >> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
> >> all AMD and some old Intel processors. In which case TSC would need to
> >> be restored on CPU from platform time by Xen upon exiting deep C-states.
> >>
> >> As platform time might be behind the last TSC stamp recorded for the
> >> current CPU, invariant of TSC stamp being always behind local TSC counter
> >> is violated. This has an effect of get_s_time() going negative resulting
> >> in eventual system hang or crash.
> >>
> >> Fix this issue by updating local TSC stamp along with TSC counter write.
> >
> > Thanks! I haven't seen such issue because I've been running the shim
> > with nomigrate in order to prevent the vTSC overhead.
> >
> >> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> >> ---
> >> This caused reliable hangs of shim domains with multiple vCPUs on all AMD
> >> systems. The problem got also reproduced on bare-metal by artifically
> >> masking ITSC feature bit. The proposed fix has been verified for both
> >> cases.
> >> ---
> >>  xen/arch/x86/time.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> >> index e79cb4d..f6b26f8 100644
> >> --- a/xen/arch/x86/time.c
> >> +++ b/xen/arch/x86/time.c
> >> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
> >>
> >>  void cstate_restore_tsc(void)
> >>  {
> >> +    struct cpu_time *t = &this_cpu(cpu_time);
> >> +
> >>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
> >>          return;
> >>
> >> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> >> +    t->stamp.master_stime = read_platform_stime(NULL);
> >> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> >> +    t->stamp.local_stime = t->stamp.master_stime;
> >> +
> >> +    write_tsc(t->stamp.local_tsc);
> >
> > In order to avoid the TSC write (and the likely associated vmexit),
> > could you instead do:
> >
> > t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> > t->stamp.local_tsc = rdtsc_ordered();
>
> I think in that case RDTSC might return something behind platform time
> which is not right I guess.

The TSC and the platform time are completely independent from Xen's
PoV; you can have a platform time not based on the TSC (ie: PIT, HPET
or PM), and hence there's no direct relation between the two.

The TSC is used as a way to get an approximate platform time based on
the last platform time value and the TSC delta between that value and
the current TSC value; I assume that's done because reading the TSC is
much cheaper than reading the platform time.

As long as the platform time and the TSC stamps are both updated at the
same time it should be fine.

Roger.
On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
> On 15.01.2020 12:53, Roger Pau Monné wrote:
> > On Wed, Jan 15, 2020 at 12:40:27PM +0100, Jan Beulich wrote:
> >> On 15.01.2020 10:47, Roger Pau Monné wrote:
> >>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> >>>> --- a/xen/arch/x86/time.c
> >>>> +++ b/xen/arch/x86/time.c
> >>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
> >>>>
> >>>>  void cstate_restore_tsc(void)
> >>>>  {
> >>>> +    struct cpu_time *t = &this_cpu(cpu_time);
> >>>> +
> >>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
> >>>>          return;
> >>>>
> >>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> >>>> +    t->stamp.master_stime = read_platform_stime(NULL);
> >>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> >>>> +    t->stamp.local_stime = t->stamp.master_stime;
> >>>> +
> >>>> +    write_tsc(t->stamp.local_tsc);
> >>>
> >>> In order to avoid the TSC write (and the likely associated vmexit),
> >>> could you instead do:
> >>>
> >>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> >>> t->stamp.local_tsc = rdtsc_ordered();
> >>>
> >>> I think it should achieve the same as it syncs the local TSC stamp and
> >>> times, would avoid the TSC write and slightly simplifies the logic.
> >>
> >> Wouldn't this result in guests possibly observing the TSC moving
> >> backwards?
> >
> > Isn't local_tsc storing a TSC value read from the same CPU always, and
> > hence could only go backwards if rdtsc actually goes backwards?
>
> For one I have to admit I was (mistakenly) thinking of wakeup
> from S states more than that from C states. So assuming the
> TSC indeed only stops (but won't get e.g. restarted), backwards
> moves ought to be excluded.

Even if the TSC was restarted, I think my proposed approach should be
fine. The only requirement is that the stored TSC stamp must always be
behind the value returned by rdtsc.
See get_s_time_fixed: as long as the delta is positive the returned
time should be correct.

> What I'm then worried about is too
> little progress observable by guests. The PV time protocol
> ought to be fine in this regard (and consumers of raw TSC values
> are on their own anyway), but wouldn't you need to update TSC
> offsets of HVM guests in order to compensate for the elapsed
> time?

That will be done when the HVM vCPU gets scheduled in as part of the
update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
called with the idle vCPU context, and hence there's always going to
be a vCPU switch before scheduling anything else.

> > Ie: cpu_frequency_change seems to do something similar, together with
> > a re-adjusting of the time scale, but doesn't perform any TSC write.
>
> A P-state change at most alters the the tick rate, but wouldn't
> stop or even reset the TSC (afaict).

Right, just wanted to point out that the cpu_time stamp can be updated
without having to write to the TSC. Anyway, not sure it's very relevant
or useful, so forget this reference.

Thanks, Roger.
On 15/01/2020 13:23, Roger Pau Monné wrote: > On Wed, Jan 15, 2020 at 12:36:08PM +0000, Igor Druzhinin wrote: >> On 15/01/2020 09:47, Roger Pau Monné wrote: >>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote: >>>> If ITSC is not available on CPU (e.g if running nested as PV shim) >>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >>>> all AMD and some old Intel processors. In which case TSC would need to >>>> be restored on CPU from platform time by Xen upon exiting deep C-states. >>>> >>>> As platform time might be behind the last TSC stamp recorded for the >>>> current CPU, invariant of TSC stamp being always behind local TSC counter >>>> is violated. This has an effect of get_s_time() going negative resulting >>>> in eventual system hang or crash. >>>> >>>> Fix this issue by updating local TSC stamp along with TSC counter write. >>> >>> Thanks! I haven't seen such issue because I've been running the shim >>> with nomigrate in order to prevent the vTSC overhead. >>> >>>> >>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> >>>> --- >>>> This caused reliable hangs of shim domains with multiple vCPUs on all AMD >>>> systems. The problem got also reproduced on bare-metal by artifically >>>> masking ITSC feature bit. The proposed fix has been verified for both >>>> cases. 
>>>> --- >>>> xen/arch/x86/time.c | 8 +++++++- >>>> 1 file changed, 7 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c >>>> index e79cb4d..f6b26f8 100644 >>>> --- a/xen/arch/x86/time.c >>>> +++ b/xen/arch/x86/time.c >>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime) >>>> >>>> void cstate_restore_tsc(void) >>>> { >>>> + struct cpu_time *t = &this_cpu(cpu_time); >>>> + >>>> if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) >>>> return; >>>> >>>> - write_tsc(stime2tsc(read_platform_stime(NULL))); >>>> + t->stamp.master_stime = read_platform_stime(NULL); >>>> + t->stamp.local_tsc = stime2tsc(t->stamp.master_stime); >>>> + t->stamp.local_stime = t->stamp.master_stime; >>>> + >>>> + write_tsc(t->stamp.local_tsc); >>> >>> In order to avoid the TSC write (and the likely associated vmexit), >>> could you instead do: >>> >>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL); >>> t->stamp.local_tsc = rdtsc_ordered(); >> >> I think in that case RDTSC might return something behind platform time >> which is not right I guess. > > The TSC and the platform time are completely independent from Xen's > PoV, you can have a platform time not based on the TSC (ie: PIT, HPET > or PM), and hence there's no direct relation between both. > > The TSC is used as a way to get an approximate platform time based on > the last platform time value and the TSC delta between than value and > the current TSC value, I assume that's done because reading the TSC is > much cheaper than reading the platform time. > > As long as the platform time and the TSC stamps are both updated at the > same time it should be fine. I see your point. I'll test your approach and get back here with the results. Igor
On 15/01/2020 12:54, Jan Beulich wrote: > On 15.01.2020 13:47, Igor Druzhinin wrote: >> On 15/01/2020 12:39, Jan Beulich wrote: >>> On 15.01.2020 13:28, Igor Druzhinin wrote: >>>> On 15/01/2020 11:32, Jan Beulich wrote: >>>>> On 14.01.2020 20:36, Igor Druzhinin wrote: >>>>>> If ITSC is not available on CPU (e.g if running nested as PV shim) >>>>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. >>>>>> all AMD and some old Intel processors. In which case TSC would need to >>>>>> be restored on CPU from platform time by Xen upon exiting deep C-states. >>>>> How does waking from deep C states correspond to the PV shim? I notice >>>>> that cstate_restore_tsc() gets called irrespective of the C state being >>>>> exited, so I wonder whether there's room for improvement there >>>>> independent of the issue at hand. As far as this change is concerned, >>>>> I think you want to drop the notion of "deep" from the description. >>>> I'm not familiar with what to call "deep C-state" so for me it was anything >>>> higher than C1. If you prefer "deep" to be dropped - so be it. >>> "Higher than C1" may be fine (albeit I vaguely recall TSC issues >>> starting with C3 only), but at least mwait_idle() calls the >>> function even for C1. As to the PV shim - does it know about any >>> C-states at all (beyond HLT-invoked C1)? >> Yes, PV-shim knows about C states as it looks they are tied to >> processor ID in some cases. For AMD specifically C2 is HLT. > The AMD part is pretty new, and is - afaict - an exception compared > to everything else. Under PVH there's no respective ACPI data (iirc), > and we also don't surface MONITOR/MWAIT to HVM/PVH guests, making > mwait_idle_probe() bail early. I wonder whether this special > behavior on AMD Fam17 should be suppressed in this case. Anything which knows it is virtualised should use hypercalls to yield, and failing that, hlt. 
A VM isn't going to get a plausible idea of C/P states (give or take our
dom0 special case seeing the real ACPI tables, which is still an open
problem.)

~Andrew
On 15.01.2020 14:44, Roger Pau Monné wrote: > On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote: >> What I'm then worried about is too >> little progress observable by guests. The PV time protocol >> ought to be fine in this regard (and consumers of raw TSC values >> are on their own anyway), but wouldn't you need to update TSC >> offsets of HVM guests in order to compensate for the elapsed >> time? > > That will be done when the HVM vCPU gets scheduled in as part of the > update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be > called with the idle vCPU context, and hence there's always going to > be a vCPU switch before scheduling anything else. Which step would this be? All I see is a call to hvm_scale_tsc(). In time.c only tsc_set_info() calls hvm_set_tsc_offset(). Jan
On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
> On 15.01.2020 14:44, Roger Pau Monné wrote:
> > On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
> >> What I'm then worried about is too
> >> little progress observable by guests. The PV time protocol
> >> ought to be fine in this regard (and consumers of raw TSC values
> >> are on their own anyway), but wouldn't you need to update TSC
> >> offsets of HVM guests in order to compensate for the elapsed
> >> time?
> >
> > That will be done when the HVM vCPU gets scheduled in as part of the
> > update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
> > called with the idle vCPU context, and hence there's always going to
> > be a vCPU switch before scheduling anything else.
>
> Which step would this be? All I see is a call to hvm_scale_tsc().
> In time.c only tsc_set_info() calls hvm_set_tsc_offset().

My bad, I mistook the scaling for the offset.

Accounting for the offset in update_vcpu_system_time seems rather more
complicated than just updating the TSC here, so:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I also wonder whether it would be worth adding at least an assert to
get_s_time_fixed in order to ensure that the TSC stamp is always behind
the current TSC value; that would likely have helped catch this issue
earlier.

Thanks, Roger.
On 16.01.2020 10:33, Roger Pau Monné wrote:
> On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
>> On 15.01.2020 14:44, Roger Pau Monné wrote:
>>> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
>>>> What I'm then worried about is too little progress observable by
>>>> guests. The PV time protocol ought to be fine in this regard (and
>>>> consumers of raw TSC values are on their own anyway), but wouldn't
>>>> you need to update TSC offsets of HVM guests in order to compensate
>>>> for the elapsed time?
>>>
>>> That will be done when the HVM vCPU gets scheduled in as part of the
>>> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
>>> called with the idle vCPU context, and hence there's always going to
>>> be a vCPU switch before scheduling anything else.
>>
>> Which step would this be? All I see is a call to hvm_scale_tsc().
>> In time.c only tsc_set_info() calls hvm_set_tsc_offset().
>
> My bad, I've mistaken the scaling with the offset.
>
> Accounting for the offset in update_vcpu_system_time seems quite
> more complicated that just updating the TSC here, so:
>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

And then (preferably with "deep" dropped from the description, if you,
Igor, agree, and which can be done while committing)
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan
On 16/01/2020 09:38, Jan Beulich wrote:
> On 16.01.2020 10:33, Roger Pau Monné wrote:
>> On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
>>> On 15.01.2020 14:44, Roger Pau Monné wrote:
>>>> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
>>>>> What I'm then worried about is too little progress observable by
>>>>> guests. The PV time protocol ought to be fine in this regard (and
>>>>> consumers of raw TSC values are on their own anyway), but wouldn't
>>>>> you need to update TSC offsets of HVM guests in order to compensate
>>>>> for the elapsed time?
>>>>
>>>> That will be done when the HVM vCPU gets scheduled in as part of the
>>>> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
>>>> called with the idle vCPU context, and hence there's always going to
>>>> be a vCPU switch before scheduling anything else.
>>>
>>> Which step would this be? All I see is a call to hvm_scale_tsc().
>>> In time.c only tsc_set_info() calls hvm_set_tsc_offset().
>>
>> My bad, I've mistaken the scaling with the offset.
>>
>> Accounting for the offset in update_vcpu_system_time seems quite
>> more complicated that just updating the TSC here, so:
>>
>> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
>
> And then (preferably with "deep" dropped from the description,
> if you, Igor, agree, and which can be done while committing)
> Acked-by: Jan Beulich <jbeulich@suse.com>

No objection. FYI I tested Roger's approach last night and it seems
to work, at least in the sense of fixing the original issue.

Igor
On Thu, Jan 16, 2020 at 12:09:12PM +0000, Igor Druzhinin wrote:
> On 16/01/2020 09:38, Jan Beulich wrote:
> > On 16.01.2020 10:33, Roger Pau Monné wrote:
> >> On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
> >>> On 15.01.2020 14:44, Roger Pau Monné wrote:
> >>>> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
> >>>>> What I'm then worried about is too little progress observable by
> >>>>> guests. The PV time protocol ought to be fine in this regard (and
> >>>>> consumers of raw TSC values are on their own anyway), but wouldn't
> >>>>> you need to update TSC offsets of HVM guests in order to compensate
> >>>>> for the elapsed time?
> >>>>
> >>>> That will be done when the HVM vCPU gets scheduled in as part of the
> >>>> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
> >>>> called with the idle vCPU context, and hence there's always going to
> >>>> be a vCPU switch before scheduling anything else.
> >>>
> >>> Which step would this be? All I see is a call to hvm_scale_tsc().
> >>> In time.c only tsc_set_info() calls hvm_set_tsc_offset().
> >>
> >> My bad, I've mistaken the scaling with the offset.
> >>
> >> Accounting for the offset in update_vcpu_system_time seems quite
> >> more complicated that just updating the TSC here, so:
> >>
> >> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> >
> > And then (preferably with "deep" dropped from the description,
> > if you, Igor, agree, and which can be done while committing)
> > Acked-by: Jan Beulich <jbeulich@suse.com>
>
> No objection. FYI I tested Roger's approach this night and it seems
> to work at least in sense of fixing the original issue.

Right, but HVM guests using a timekeeping approach similar to what Xen
does (read the platform timer periodically and calculate time based on
the last read plus the TSC delta) would get wrong (behind the real
time) results AFAICT.

Roger.
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index e79cb4d..f6b26f8 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
 
 void cstate_restore_tsc(void)
 {
+    struct cpu_time *t = &this_cpu(cpu_time);
+
     if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
         return;
 
-    write_tsc(stime2tsc(read_platform_stime(NULL)));
+    t->stamp.master_stime = read_platform_stime(NULL);
+    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
+    t->stamp.local_stime = t->stamp.master_stime;
+
+    write_tsc(t->stamp.local_tsc);
 }
 
 /***************************************************************************
If ITSC is not available on the CPU (e.g. if running nested as PV shim)
then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. on
all AMD and some old Intel processors. In that case the TSC needs to be
restored on the CPU from platform time by Xen upon exiting deep C-states.

As platform time might be behind the last TSC stamp recorded for the
current CPU, the invariant of the TSC stamp being always behind the local
TSC counter is violated. This has the effect of get_s_time() going
negative, resulting in an eventual system hang or crash.

Fix this issue by updating the local TSC stamp along with the TSC counter
write.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
This caused reliable hangs of shim domains with multiple vCPUs on all AMD
systems. The problem was also reproduced on bare-metal by artificially
masking the ITSC feature bit. The proposed fix has been verified for both
cases.
---
 xen/arch/x86/time.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)