Message ID | 1576277282-6590-3-git-send-email-igor.druzhinin@citrix.com
---|---
State | New, archived
Series | vTSC performance improvements
On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
> Now that vtsc_last is the only entity protected by vtsc_lock we can
> simply update it using a single atomic operation and drop the spinlock
> entirely. This is extremely important for the case of running nested
> (e.g. shim instance with lots of vCPUs assigned) since if preemption
> happens somewhere inside the critical section that would immediately
> mean that other vCPU stop progressing (and probably being preempted
> as well) waiting for the spinlock to be freed.
>
> This fixes constant shim guest boot lockups with ~32 vCPUs if there is
> vCPU overcommit present (which increases the likelihood of preemption).
>
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
>  xen/arch/x86/domain.c        | 1 -
>  xen/arch/x86/time.c          | 16 ++++++----------
>  xen/include/asm-x86/domain.h | 1 -
>  3 files changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index bed19fc..94531be 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -539,7 +539,6 @@ int arch_domain_create(struct domain *d,
>      INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
>
>      spin_lock_init(&d->arch.e820_lock);
> -    spin_lock_init(&d->arch.vtsc_lock);
>
>      /* Minimal initialisation for the idle domain. */
>      if ( unlikely(is_idle_domain(d)) )
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index 216169a..202446f 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -2130,19 +2130,15 @@ u64 gtsc_to_gtime(struct domain *d, u64 tsc)
>
>  uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
>  {
> -    s_time_t now = get_s_time();
> +    s_time_t old, new, now = get_s_time();
>      struct domain *d = v->domain;
>
> -    spin_lock(&d->arch.vtsc_lock);
> -
> -    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
> -        d->arch.vtsc_last = now;
> -    else
> -        now = ++d->arch.vtsc_last;
> -
> -    spin_unlock(&d->arch.vtsc_lock);
> +    do {
> +        old = d->arch.vtsc_last;
> +        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;

Why do you need to do this subtraction? Isn't it easier to just do:

new = now > d->arch.vtsc_last ? now : old + 1;

That avoids the cast and the subtraction.

> +    } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );

I'm not sure if the following would be slightly better performance
wise:

do {
    old = d->arch.vtsc_last;
    if ( d->arch.vtsc_last >= now )
    {
        new = atomic_inc_return(&d->arch.vtsc_last);
        break;
    }
    else
        new = now;
} while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );

In any case I'm fine with your version using cmpxchg exclusively.

Thanks, Roger.
On 16.12.2019 11:00, Roger Pau Monné wrote:
> On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
>> Now that vtsc_last is the only entity protected by vtsc_lock we can
>> simply update it using a single atomic operation and drop the spinlock
>> entirely. This is extremely important for the case of running nested
>> (e.g. shim instance with lots of vCPUs assigned) since if preemption
>> happens somewhere inside the critical section that would immediately
>> mean that other vCPU stop progressing (and probably being preempted
>> as well) waiting for the spinlock to be freed.
>>
>> This fixes constant shim guest boot lockups with ~32 vCPUs if there is
>> vCPU overcommit present (which increases the likelihood of preemption).
>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> ---
>>  xen/arch/x86/domain.c        | 1 -
>>  xen/arch/x86/time.c          | 16 ++++++----------
>>  xen/include/asm-x86/domain.h | 1 -
>>  3 files changed, 6 insertions(+), 12 deletions(-)
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index bed19fc..94531be 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -539,7 +539,6 @@ int arch_domain_create(struct domain *d,
>>      INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
>>
>>      spin_lock_init(&d->arch.e820_lock);
>> -    spin_lock_init(&d->arch.vtsc_lock);
>>
>>      /* Minimal initialisation for the idle domain. */
>>      if ( unlikely(is_idle_domain(d)) )
>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>> index 216169a..202446f 100644
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -2130,19 +2130,15 @@ u64 gtsc_to_gtime(struct domain *d, u64 tsc)
>>
>>  uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
>>  {
>> -    s_time_t now = get_s_time();
>> +    s_time_t old, new, now = get_s_time();
>>      struct domain *d = v->domain;
>>
>> -    spin_lock(&d->arch.vtsc_lock);
>> -
>> -    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
>> -        d->arch.vtsc_last = now;
>> -    else
>> -        now = ++d->arch.vtsc_last;
>> -
>> -    spin_unlock(&d->arch.vtsc_lock);
>> +    do {
>> +        old = d->arch.vtsc_last;
>> +        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;
>
> Why do you need to do this subtraction? Isn't it easier to just do:
>
> new = now > d->arch.vtsc_last ? now : old + 1;

This wouldn't be reliable when the TSC wraps. Remember that firmware
may set the TSC, and it has been seen to be set to very large
(effectively negative, if they were signed quantities) values, which
will then eventually wrap (whereas we're not typically concerned of
64-bit counters wrapping when they start from zero).

Jan
On Mon, Dec 16, 2019 at 12:21:09PM +0100, Jan Beulich wrote:
> On 16.12.2019 11:00, Roger Pau Monné wrote:
> > On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
> >> Now that vtsc_last is the only entity protected by vtsc_lock we can
> >> simply update it using a single atomic operation and drop the spinlock
> >> entirely. This is extremely important for the case of running nested
> >> (e.g. shim instance with lots of vCPUs assigned) since if preemption
> >> happens somewhere inside the critical section that would immediately
> >> mean that other vCPU stop progressing (and probably being preempted
> >> as well) waiting for the spinlock to be freed.
> >>
> >> This fixes constant shim guest boot lockups with ~32 vCPUs if there is
> >> vCPU overcommit present (which increases the likelihood of preemption).
> >>
> >> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> >> ---
> >>  xen/arch/x86/domain.c        | 1 -
> >>  xen/arch/x86/time.c          | 16 ++++++----------
> >>  xen/include/asm-x86/domain.h | 1 -
> >>  3 files changed, 6 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> >> index bed19fc..94531be 100644
> >> --- a/xen/arch/x86/domain.c
> >> +++ b/xen/arch/x86/domain.c
> >> @@ -539,7 +539,6 @@ int arch_domain_create(struct domain *d,
> >>      INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
> >>
> >>      spin_lock_init(&d->arch.e820_lock);
> >> -    spin_lock_init(&d->arch.vtsc_lock);
> >>
> >>      /* Minimal initialisation for the idle domain. */
> >>      if ( unlikely(is_idle_domain(d)) )
> >> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> >> index 216169a..202446f 100644
> >> --- a/xen/arch/x86/time.c
> >> +++ b/xen/arch/x86/time.c
> >> @@ -2130,19 +2130,15 @@ u64 gtsc_to_gtime(struct domain *d, u64 tsc)
> >>
> >>  uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
> >>  {
> >> -    s_time_t now = get_s_time();
> >> +    s_time_t old, new, now = get_s_time();
> >>      struct domain *d = v->domain;
> >>
> >> -    spin_lock(&d->arch.vtsc_lock);
> >> -
> >> -    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
> >> -        d->arch.vtsc_last = now;
> >> -    else
> >> -        now = ++d->arch.vtsc_last;
> >> -
> >> -    spin_unlock(&d->arch.vtsc_lock);
> >> +    do {
> >> +        old = d->arch.vtsc_last;
> >> +        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;
> >
> > Why do you need to do this subtraction? Isn't it easier to just do:
> >
> > new = now > d->arch.vtsc_last ? now : old + 1;
>
> This wouldn't be reliable when the TSC wraps. Remember that firmware
> may set the TSC, and it has been seen to be set to very large
> (effectively negative, if they were signed quantities) values,

s_time_t is a signed value AFAICT (s64).

> which
> will then eventually wrap (whereas we're not typically concerned of
> 64-bit counters wrapping when they start from zero).

But get_s_time returns the system time in ns since boot, not the TSC
value, hence it will start from 0 and we shouldn't be concerned about
wraps?

Thanks, Roger.
On 16.12.2019 13:30, Roger Pau Monné wrote:
> On Mon, Dec 16, 2019 at 12:21:09PM +0100, Jan Beulich wrote:
>> On 16.12.2019 11:00, Roger Pau Monné wrote:
>>> On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
>>>> Now that vtsc_last is the only entity protected by vtsc_lock we can
>>>> simply update it using a single atomic operation and drop the spinlock
>>>> entirely. This is extremely important for the case of running nested
>>>> (e.g. shim instance with lots of vCPUs assigned) since if preemption
>>>> happens somewhere inside the critical section that would immediately
>>>> mean that other vCPU stop progressing (and probably being preempted
>>>> as well) waiting for the spinlock to be freed.
>>>>
>>>> This fixes constant shim guest boot lockups with ~32 vCPUs if there is
>>>> vCPU overcommit present (which increases the likelihood of preemption).
>>>>
>>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> ---
>>>>  xen/arch/x86/domain.c        | 1 -
>>>>  xen/arch/x86/time.c          | 16 ++++++----------
>>>>  xen/include/asm-x86/domain.h | 1 -
>>>>  3 files changed, 6 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>>>> index bed19fc..94531be 100644
>>>> --- a/xen/arch/x86/domain.c
>>>> +++ b/xen/arch/x86/domain.c
>>>> @@ -539,7 +539,6 @@ int arch_domain_create(struct domain *d,
>>>>      INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
>>>>
>>>>      spin_lock_init(&d->arch.e820_lock);
>>>> -    spin_lock_init(&d->arch.vtsc_lock);
>>>>
>>>>      /* Minimal initialisation for the idle domain. */
>>>>      if ( unlikely(is_idle_domain(d)) )
>>>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>>>> index 216169a..202446f 100644
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -2130,19 +2130,15 @@ u64 gtsc_to_gtime(struct domain *d, u64 tsc)
>>>>
>>>>  uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
>>>>  {
>>>> -    s_time_t now = get_s_time();
>>>> +    s_time_t old, new, now = get_s_time();
>>>>      struct domain *d = v->domain;
>>>>
>>>> -    spin_lock(&d->arch.vtsc_lock);
>>>> -
>>>> -    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
>>>> -        d->arch.vtsc_last = now;
>>>> -    else
>>>> -        now = ++d->arch.vtsc_last;
>>>> -
>>>> -    spin_unlock(&d->arch.vtsc_lock);
>>>> +    do {
>>>> +        old = d->arch.vtsc_last;
>>>> +        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;
>>>
>>> Why do you need to do this subtraction? Isn't it easier to just do:
>>>
>>> new = now > d->arch.vtsc_last ? now : old + 1;
>>
>> This wouldn't be reliable when the TSC wraps. Remember that firmware
>> may set the TSC, and it has been seen to be set to very large
>> (effectively negative, if they were signed quantities) values,
>
> s_time_t is a signed value AFAICT (s64).

Oh, I should have looked at types, rather than inferring uint64_t
in particular for something like vtsc_last.

>> which
>> will then eventually wrap (whereas we're not typically concerned of
>> 64-bit counters wrapping when they start from zero).
>
> But get_s_time returns the system time in ns since boot, not the TSC
> value, hence it will start from 0 and we shouldn't be concerned about
> wraps?

Good point, seeing that all parts here are s_time_t. Of course
with all parts being so, there's indeed no need for the cast,
but comparing both values is then equivalent to comparing the
difference against zero.

Jan
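[Editor's note] For readers following the wrap-around argument above, the standalone snippet below (plain C, not Xen code; the variable names and sample values are made up for illustration) shows why the `(int64_t)(now - last) > 0` idiom tolerates an unsigned counter wrapping past zero while a direct `now > last` comparison does not, and why the two forms agree once both operands are signed nanoseconds-since-boot values as in the thread's conclusion.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical unsigned counter just before and just after a wrap
     * (e.g. firmware started the TSC at an "effectively negative" value). */
    uint64_t last = UINT64_MAX - 5;
    uint64_t now  = 10;

    /* Direct comparison treats the wrapped value as being in the past... */
    printf("now > last            : %d\n", now > last);                     /* 0 */
    /* ...while the signed difference still sees forward progress. */
    printf("(int64_t)(now - last) : %" PRId64 "\n", (int64_t)(now - last)); /* 16 */

    /* With signed ns-since-boot values (Xen's s_time_t), wrap is not a
     * practical concern, so both forms give the same answer. */
    int64_t s_last = 1000, s_now = 1016;
    printf("signed compare        : %d\n", s_now > s_last);        /* 1 */
    printf("signed difference > 0 : %d\n", (s_now - s_last) > 0);  /* 1 */

    return 0;
}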
On 16/12/2019 10:00, Roger Pau Monné wrote:
> On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
>> Now that vtsc_last is the only entity protected by vtsc_lock we can
>> simply update it using a single atomic operation and drop the spinlock
>> entirely. This is extremely important for the case of running nested
>> (e.g. shim instance with lots of vCPUs assigned) since if preemption
>> happens somewhere inside the critical section that would immediately
>> mean that other vCPU stop progressing (and probably being preempted
>> as well) waiting for the spinlock to be freed.
>>
>> This fixes constant shim guest boot lockups with ~32 vCPUs if there is
>> vCPU overcommit present (which increases the likelihood of preemption).
>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> ---
>>  xen/arch/x86/domain.c        | 1 -
>>  xen/arch/x86/time.c          | 16 ++++++----------
>>  xen/include/asm-x86/domain.h | 1 -
>>  3 files changed, 6 insertions(+), 12 deletions(-)
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index bed19fc..94531be 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -539,7 +539,6 @@ int arch_domain_create(struct domain *d,
>>      INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
>>
>>      spin_lock_init(&d->arch.e820_lock);
>> -    spin_lock_init(&d->arch.vtsc_lock);
>>
>>      /* Minimal initialisation for the idle domain. */
>>      if ( unlikely(is_idle_domain(d)) )
>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>> index 216169a..202446f 100644
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -2130,19 +2130,15 @@ u64 gtsc_to_gtime(struct domain *d, u64 tsc)
>>
>>  uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
>>  {
>> -    s_time_t now = get_s_time();
>> +    s_time_t old, new, now = get_s_time();
>>      struct domain *d = v->domain;
>>
>> -    spin_lock(&d->arch.vtsc_lock);
>> -
>> -    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
>> -        d->arch.vtsc_last = now;
>> -    else
>> -        now = ++d->arch.vtsc_last;
>> -
>> -    spin_unlock(&d->arch.vtsc_lock);
>> +    do {
>> +        old = d->arch.vtsc_last;
>> +        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;
>
> Why do you need to do this subtraction? Isn't it easier to just do:
>
> new = now > d->arch.vtsc_last ? now : old + 1;
>
> That avoids the cast and the subtraction.

I'm afraid I fell into the same trap as Jan. Given they are both signed
will change in v2.

>> +    } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );
>
> I'm not sure if the following would be slightly better performance
> wise:
>
> do {
>     old = d->arch.vtsc_last;
>     if ( d->arch.vtsc_last >= now )
>     {
>         new = atomic_inc_return(&d->arch.vtsc_last);
>         break;
>     }
>     else
>         new = now;
> } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );
>
> In any case I'm fine with your version using cmpxchg exclusively.

That could be marginally better (knowing that atomic increment usually
performs better than cmpxchg) but it took me some time to work out there
is no hidden race here. I'd request a third opinion on the matter if
it's worth changing.

Igor
On Mon, Dec 16, 2019 at 01:45:10PM +0100, Jan Beulich wrote:
> On 16.12.2019 13:30, Roger Pau Monné wrote:
> > On Mon, Dec 16, 2019 at 12:21:09PM +0100, Jan Beulich wrote:
> >> On 16.12.2019 11:00, Roger Pau Monné wrote:
> >>> On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
> >>>>  uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
> >>>>  {
> >>>> -    s_time_t now = get_s_time();
> >>>> +    s_time_t old, new, now = get_s_time();
> >>>>      struct domain *d = v->domain;
> >>>>
> >>>> -    spin_lock(&d->arch.vtsc_lock);
> >>>> -
> >>>> -    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
> >>>> -        d->arch.vtsc_last = now;
> >>>> -    else
> >>>> -        now = ++d->arch.vtsc_last;
> >>>> -
> >>>> -    spin_unlock(&d->arch.vtsc_lock);
> >>>> +    do {
> >>>> +        old = d->arch.vtsc_last;
> >>>> +        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;
> >>>
> >>> Why do you need to do this subtraction? Isn't it easier to just do:
> >>>
> >>> new = now > d->arch.vtsc_last ? now : old + 1;
> >>
> >> This wouldn't be reliable when the TSC wraps. Remember that firmware
> >> may set the TSC, and it has been seen to be set to very large
> >> (effectively negative, if they were signed quantities) values,
> >
> > s_time_t is a signed value AFAICT (s64).
>
> Oh, I should have looked at types, rather than inferring uint64_t
> in particular for something like vtsc_last.
>
> >> which
> >> will then eventually wrap (whereas we're not typically concerned of
> >> 64-bit counters wrapping when they start from zero).
> >
> > But get_s_time returns the system time in ns since boot, not the TSC
> > value, hence it will start from 0 and we shouldn't be concerned about
> > wraps?
>
> Good point, seeing that all parts here are s_time_t. Of course
> with all parts being so, there's indeed no need for the cast,
> but comparing both values is then equivalent to comparing the
> difference against zero.

Right, I just think it's easier to compare both values instead of
comparing the difference against zero (and likely less expensive in
terms of performance).

Anyway, I prefer comparing both values instead of the difference, but
that's also correct and I would be fine with it as long as the cast is
dropped.

Thanks, Roger.
On Mon, Dec 16, 2019 at 12:53:40PM +0000, Igor Druzhinin wrote:
> On 16/12/2019 10:00, Roger Pau Monné wrote:
> > On Fri, Dec 13, 2019 at 10:48:02PM +0000, Igor Druzhinin wrote:
> > I'm not sure if the following would be slightly better performance
> > wise:
> >
> > do {
> >     old = d->arch.vtsc_last;
> >     if ( d->arch.vtsc_last >= now )
> >     {
> >         new = atomic_inc_return(&d->arch.vtsc_last);
> >         break;
> >     }
> >     else
> >         new = now;
> > } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );
> >
> > In any case I'm fine with your version using cmpxchg exclusively.
>
> That could be marginally better (knowing that atomic increment usually performs
> better than cmpxchg) but it took me some time to work out there is no hidden
> race here. I'd request a third opinion on the matter if it's worth changing.

Anyway, your proposed approach using cmpxchg is fine IMO, we can leave
the atomic increment for a further improvement if there's a need for
it.

Thanks, Roger.
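[Editor's note] Purely to make the deferred idea concrete, here is a minimal standalone sketch of the atomic-increment-plus-cmpxchg hybrid discussed above, written with C11 atomics rather than Xen's cmpxchg/atomic_inc_return primitives; the function name `vtsc_update_hybrid`, the `_Atomic int64_t` counter and the use of `atomic_fetch_add`/`atomic_compare_exchange_weak` are assumptions for the example, not the patch's code.

#include <stdatomic.h>
#include <stdint.h>

/* Illustrative stand-in for d->arch.vtsc_last. */
static _Atomic int64_t vtsc_last;

/*
 * Hybrid update: when the stored value has already caught up with 'now',
 * a plain atomic increment hands out the next unique tick; otherwise fall
 * back to compare-and-exchange to publish 'now'.
 */
static int64_t vtsc_update_hybrid(int64_t now)
{
    int64_t old, new;

    do {
        old = atomic_load(&vtsc_last);
        if ( old >= now )
            /* fetch_add returns the previous value, so +1 is the new one. */
            return atomic_fetch_add(&vtsc_last, 1) + 1;
        new = now;
    } while ( !atomic_compare_exchange_weak(&vtsc_last, &old, new) );

    return new;
}

As agreed above, the cmpxchg-only loop is what the patch keeps; a variant like this would only matter if the increment path turned out to be measurably cheaper under heavy contention.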
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index bed19fc..94531be 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -539,7 +539,6 @@ int arch_domain_create(struct domain *d,
     INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
 
     spin_lock_init(&d->arch.e820_lock);
-    spin_lock_init(&d->arch.vtsc_lock);
 
     /* Minimal initialisation for the idle domain. */
     if ( unlikely(is_idle_domain(d)) )
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 216169a..202446f 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2130,19 +2130,15 @@ u64 gtsc_to_gtime(struct domain *d, u64 tsc)
 
 uint64_t pv_soft_rdtsc(const struct vcpu *v, const struct cpu_user_regs *regs)
 {
-    s_time_t now = get_s_time();
+    s_time_t old, new, now = get_s_time();
     struct domain *d = v->domain;
 
-    spin_lock(&d->arch.vtsc_lock);
-
-    if ( (int64_t)(now - d->arch.vtsc_last) > 0 )
-        d->arch.vtsc_last = now;
-    else
-        now = ++d->arch.vtsc_last;
-
-    spin_unlock(&d->arch.vtsc_lock);
+    do {
+        old = d->arch.vtsc_last;
+        new = (int64_t)(now - d->arch.vtsc_last) > 0 ? now : old + 1;
+    } while ( cmpxchg(&d->arch.vtsc_last, old, new) != old );
 
-    return gtime_to_gtsc(d, now);
+    return gtime_to_gtsc(d, new);
 }
 
 bool clocksource_is_tsc(void)
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 3780287..e4da373 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -364,7 +364,6 @@ struct arch_domain
     int tsc_mode;            /* see include/asm-x86/time.h */
     bool_t vtsc;             /* tsc is emulated (may change after migrate) */
     s_time_t vtsc_last;      /* previous TSC value (guarantee monotonicity) */
-    spinlock_t vtsc_lock;
     uint64_t vtsc_offset;    /* adjustment for save/restore/migrate */
     uint32_t tsc_khz;        /* cached guest khz for certain emulated or
                                 hardware TSC scaling cases */
Now that vtsc_last is the only entity protected by vtsc_lock we can
simply update it using a single atomic operation and drop the spinlock
entirely. This is extremely important for the case of running nested
(e.g. shim instance with lots of vCPUs assigned) since if preemption
happens somewhere inside the critical section that would immediately
mean that other vCPU stop progressing (and probably being preempted
as well) waiting for the spinlock to be freed.

This fixes constant shim guest boot lockups with ~32 vCPUs if there is
vCPU overcommit present (which increases the likelihood of preemption).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
 xen/arch/x86/domain.c        | 1 -
 xen/arch/x86/time.c          | 16 ++++++----------
 xen/include/asm-x86/domain.h | 1 -
 3 files changed, 6 insertions(+), 12 deletions(-)
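[Editor's note] As a rough way to convince oneself that the lock-free update above preserves strict monotonicity of the values handed to each caller, the following self-contained user-space analogue can be built and run on its own. It is an illustration, not a test of the actual Xen code: it substitutes C11 atomics and C11 threads (assumed available, e.g. glibc 2.28+ for <threads.h>) for Xen's cmpxchg and s_time_t, uses the direct comparison agreed in the thread instead of the cast, and the names soft_rdtsc/now_ns are invented for the example.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <threads.h>
#include <time.h>

static _Atomic int64_t vtsc_last;        /* stands in for d->arch.vtsc_last */

static int64_t now_ns(void)              /* stands in for get_s_time() */
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Analogue of the patched pv_soft_rdtsc() loop. */
static int64_t soft_rdtsc(void)
{
    int64_t old, new, now = now_ns();

    do {
        old = atomic_load(&vtsc_last);
        new = now > old ? now : old + 1;
    } while ( !atomic_compare_exchange_weak(&vtsc_last, &old, new) );

    return new;
}

static int worker(void *arg)
{
    int64_t prev = 0;

    (void)arg;
    for ( unsigned int i = 0; i < 1000000; i++ )
    {
        int64_t cur = soft_rdtsc();
        if ( cur <= prev )               /* every caller must observe strict growth */
        {
            printf("monotonicity violated: %lld <= %lld\n",
                   (long long)cur, (long long)prev);
            return 1;
        }
        prev = cur;
    }
    return 0;
}

int main(void)
{
    thrd_t t[4];
    int res, rc = 0;

    for ( int i = 0; i < 4; i++ )
        thrd_create(&t[i], worker, NULL);
    for ( int i = 0; i < 4; i++ )
    {
        thrd_join(t[i], &res);
        rc |= res;
    }
    printf(rc ? "FAIL\n" : "OK\n");
    return rc;
}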