
[RFC] Make need_resched() return true when rcu_urgent_qs requested

Message ID e21c6d7666bfbad1e1c8295559a6ddc7e4110105.camel@infradead.org (mailing list archive)
State New, archived

Commit Message

David Woodhouse July 11, 2018, 5:03 p.m. UTC
On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> And here is an updated v4.15 patch with Marius's Reported-by and David's
> fix to my lost exclamation point.

Thanks. Are you sending the original version of that to Linus? It'd be
useful to have the commit ID so that we can watch for it landing, and
chase this one up to Greg.

As discussed on IRC, this patch reduces synchronize_sched() latency for
us from ~4600s to ~160ms, which is nice.

However, it isn't going to be sufficient in the NO_HZ_FULL case. For
that you want a patch like the one below, which happily reduces the
latency in our (!NO_HZ_FULL) case still further to ~40ms.

Adding kvm list for better review...

From: David Woodhouse <dwmw@amazon.co.uk>
Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode

RCU can spend long periods of time waiting for a CPU which is actually in
KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
modes, and don't wait for it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c      |  2 ++
 include/linux/rcutree.h |  2 ++
 kernel/rcu/tree.c       | 16 ++++++++++++++++
 3 files changed, 20 insertions(+)

Comments

Paul E. McKenney July 11, 2018, 5:48 p.m. UTC | #1
On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > fix to my lost exclamation point.
> 
> Thanks. Are you sending the original version of that to Linus? It'd be
> useful to have the commit ID so that we can watch for it landing, and
> chase this one up to Greg.

That would be great!  The commit ID is currently 6d1b6b684e1f ("rcu: Make
need_resched() respond to urgent RCU-QS needs"), but is subject to change
given the likely need to rebase in order to fix bugs in commits preceding
that one.  Given your Reported-by, you will be CCed when it reaches -tip,
won't you?  At that point, the commit ID would be set in stone.

Either way, I would very much welcome any help with -stable.  I would
be happy to send you an email when its commit ID becomes set in stone,
for example, if that would help.

> As discussed on IRC, this patch reduces synchronize_sched() latency for
> us from ~4600s to ~160ms, which is nice.

Woo-hoo!!!  ;-)

> However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> that you want a patch like the one below, which happily reduces the
> latency in our (!NO_HZ_FULL) case still further to ~40ms.

Even better, good stuff, thank you!

> Adding kvm list for better review...

And a comment below.

							Thanx, Paul

> From: David Woodhouse <dwmw@amazon.co.uk>
> Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode
> 
> RCU can spend long periods of time waiting for a CPU which is actually in
> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> modes, and don't wait for it.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c      |  2 ++
>  include/linux/rcutree.h |  2 ++
>  kernel/rcu/tree.c       | 16 ++++++++++++++++
>  3 files changed, 20 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0046aa70205a..b0c82f70afa7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
>  	}
> 
> +	rcu_kvm_enter();
>  	kvm_x86_ops->run(vcpu);
> +	rcu_kvm_exit();
> 
>  	/*
>  	 * Do this here before restoring debug registers on the host.  And
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 914655848ef6..6d07af5a50fc 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate);
> 
>  void rcu_idle_enter(void);
>  void rcu_idle_exit(void);
> +void rcu_kvm_enter(void);
> +void rcu_kvm_exit(void);
>  void rcu_irq_enter(void);
>  void rcu_irq_exit(void);
>  void rcu_irq_enter_irqson(void);
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index aa7cade1b9f3..df7893273939 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void)
>  	local_irq_restore(flags);
>  }
> 
> +/*
> + * These are currently identical to the _idle_ versions but let's
> + * explicitly have separate copies to keep Paul honest in future.
> + */
> +void rcu_kvm_enter(void)
> +{
> +	rcu_idle_enter();
> +}
> +EXPORT_SYMBOL_GPL(rcu_kvm_enter);
> +
> +void rcu_kvm_exit(void)
> +{
> +	rcu_idle_exit();
> +}
> +EXPORT_SYMBOL_GPL(rcu_kvm_exit);

These look good, but we also need these in include/linux/rcutiny.h:

	static inline void rcu_kvm_enter(void) { }
	static inline void rcu_kvm_exit(void) { }

Unless KVM is excluded on !SMP systems or some such.

Alternatively, you could just have a single pair of static inlines in
include/linux/rcupdate.h (after the #include of rcutree.h and rcutiny.h)
that mapped the _kvm_ functions to the _idle_ functions.  Your choice,
I am fine either way.
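
For concreteness, a sketch of that second option -- one pair of static
inlines in include/linux/rcupdate.h, placed after the rcutree.h/rcutiny.h
includes -- might look like the following.  This is illustrative only,
and assumes rcu_idle_enter()/rcu_idle_exit() are visible (and, for
modular KVM, exported) to the callers:

/* Sketch: map the _kvm_ entry points onto the existing _idle_ ones. */
static inline void rcu_kvm_enter(void)
{
	rcu_idle_enter();	/* begin the extended quiescent state */
}

static inline void rcu_kvm_exit(void)
{
	rcu_idle_exit();	/* end the extended quiescent state */
}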

> +
>  /**
>   * rcu_is_watching - see if RCU thinks that the current CPU is idle
>   *
> -- 
> 2.17.1
> 
> 
> -- 
> dwmw2
>
Christian Borntraeger July 11, 2018, 6:31 p.m. UTC | #2
So why is the rcu_virt_note_context_switch(smp_processor_id());
in guest_enter_irqoff not good enough? 

This was actually supposed to tell rcu that being in the guest
is an extended quiescent period (like userspace).

What has changed?



On 07/11/2018 07:03 PM, David Woodhouse wrote:
> On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
>> And here is an updated v4.15 patch with Marius's Reported-by and David's
>> fix to my lost exclamation point.
> 
> Thanks. Are you sending the original version of that to Linus? It'd be
> useful to have the commit ID so that we can watch for it landing, and
> chase this one up to Greg.
> 
> As discussed on IRC, this patch reduces synchronize_sched() latency for
> us from ~4600s to ~160ms, which is nice.
> 
> However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> that you want a patch like the one below, which happily reduces the
> latency in our (!NO_HZ_FULL) case still further to ~40ms.
> 
> Adding kvm list for better review...
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode
> 
> RCU can spend long periods of time waiting for a CPU which is actually in
> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> modes, and don't wait for it.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c      |  2 ++
>  include/linux/rcutree.h |  2 ++
>  kernel/rcu/tree.c       | 16 ++++++++++++++++
>  3 files changed, 20 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0046aa70205a..b0c82f70afa7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
>  	}
> 
> +	rcu_kvm_enter();
>  	kvm_x86_ops->run(vcpu);
> +	rcu_kvm_exit();
> 
>  	/*
>  	 * Do this here before restoring debug registers on the host.  And
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 914655848ef6..6d07af5a50fc 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate);
> 
>  void rcu_idle_enter(void);
>  void rcu_idle_exit(void);
> +void rcu_kvm_enter(void);
> +void rcu_kvm_exit(void);
>  void rcu_irq_enter(void);
>  void rcu_irq_exit(void);
>  void rcu_irq_enter_irqson(void);
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index aa7cade1b9f3..df7893273939 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void)
>  	local_irq_restore(flags);
>  }
> 
> +/*
> + * These are currently identical to the _idle_ versions but let's
> + * explicitly have separate copies to keep Paul honest in future.
> + */
> +void rcu_kvm_enter(void)
> +{
> +	rcu_idle_enter();
> +}
> +EXPORT_SYMBOL_GPL(rcu_kvm_enter);
> +
> +void rcu_kvm_exit(void)
> +{
> +	rcu_idle_exit();
> +}
> +EXPORT_SYMBOL_GPL(rcu_kvm_exit);
> +
>  /**
>   * rcu_is_watching - see if RCU thinks that the current CPU is idle
>   *
>
Paul E. McKenney July 11, 2018, 8:17 p.m. UTC | #3
On Wed, Jul 11, 2018 at 08:31:55PM +0200, Christian Borntraeger wrote:
> So why is the rcu_virt_note_context_switch(smp_processor_id());
> in guest_enter_irqoff not good enough? 
> 
> This was actually supposed to tell rcu that being in the guest
> is an extended quiescent period (like userspace).
> 
> What has changed?

As I understand it, they would like to have their guest run uninterrupted
for extended times.  Because rcu_virt_note_context_switch() is a
point-in-time quiescent state, it cannot tell RCU about the extended
quiescent state.

Should we replace the current calls to rcu_virt_note_context_switch()
with rcu_kvm_enter() and rcu_kvm_exit()?  Would that be better
than the below architecture-by-architecture approach?
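
Roughly, that replacement would live in the common
guest_enter_irqoff()/guest_exit_irqoff() helpers rather than in each
architecture's vcpu run loop.  A sketch, with the bodies abbreviated to
placeholders rather than the real v4.15 code:

static inline void guest_enter_irqoff(void)
{
	/* ... vtime and context-tracking bookkeeping as today ... */
	rcu_kvm_enter();	/* replaces rcu_virt_note_context_switch() */
}

static inline void guest_exit_irqoff(void)
{
	rcu_kvm_exit();		/* end the extended quiescent state first */
	/* ... context-tracking and vtime bookkeeping as today ... */
}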

							Thanx, Paul

> On 07/11/2018 07:03 PM, David Woodhouse wrote:
> > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> >> And here is an updated v4.15 patch with Marius's Reported-by and David's
> >> fix to my lost exclamation point.
> > 
> > Thanks. Are you sending the original version of that to Linus? It'd be
> > useful to have the commit ID so that we can watch for it landing, and
> > chase this one up to Greg.
> > 
> > As discussed on IRC, this patch reduces synchronize_sched() latency for
> > us from ~4600s to ~160ms, which is nice.
> > 
> > However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> > that you want a patch like the one below, which happily reduces the
> > latency in our (!NO_HZ_FULL) case still further to ~40ms.
> > 
> > Adding kvm list for better review...
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode
> > 
> > RCU can spend long periods of time waiting for a CPU which is actually in
> > KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> > modes, and don't wait for it.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  arch/x86/kvm/x86.c      |  2 ++
> >  include/linux/rcutree.h |  2 ++
> >  kernel/rcu/tree.c       | 16 ++++++++++++++++
> >  3 files changed, 20 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 0046aa70205a..b0c82f70afa7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
> >  	}
> > 
> > +	rcu_kvm_enter();
> >  	kvm_x86_ops->run(vcpu);
> > +	rcu_kvm_exit();
> > 
> >  	/*
> >  	 * Do this here before restoring debug registers on the host.  And
> > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > index 914655848ef6..6d07af5a50fc 100644
> > --- a/include/linux/rcutree.h
> > +++ b/include/linux/rcutree.h
> > @@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate);
> > 
> >  void rcu_idle_enter(void);
> >  void rcu_idle_exit(void);
> > +void rcu_kvm_enter(void);
> > +void rcu_kvm_exit(void);
> >  void rcu_irq_enter(void);
> >  void rcu_irq_exit(void);
> >  void rcu_irq_enter_irqson(void);
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index aa7cade1b9f3..df7893273939 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void)
> >  	local_irq_restore(flags);
> >  }
> > 
> > +/*
> > + * These are currently identical to the _idle_ versions but let's
> > + * explicitly have separate copies to keep Paul honest in future.
> > + */
> > +void rcu_kvm_enter(void)
> > +{
> > +	rcu_idle_enter();
> > +}
> > +EXPORT_SYMBOL_GPL(rcu_kvm_enter);
> > +
> > +void rcu_kvm_exit(void)
> > +{
> > +	rcu_idle_exit();
> > +}
> > +EXPORT_SYMBOL_GPL(rcu_kvm_exit);
> > +
> >  /**
> >   * rcu_is_watching - see if RCU thinks that the current CPU is idle
> >   *
> >
David Woodhouse July 11, 2018, 8:19 p.m. UTC | #4
On Wed, 2018-07-11 at 13:17 -0700, Paul E. McKenney wrote:
> As I understand it, they would like to have their guest run uninterrupted
> for extended times.  Because rcu_virt_note_context_switch() is a
> point-in-time quiescent state, it cannot tell RCU about the extended
> quiescent state.
> 
> Should we replace the current calls to rcu_virt_note_context_switch()
> with rcu_kvm_enter() and rcu_kvm_exit()?  Would that be better
> than the below architecture-by-architecture approach?

Yes it would. I was already starting to mutter about needing the same
for ARM and POWER. I'll do a v3 (incorporating your fixes too) in the
morning.

Thanks.

Also... why in $DEITY's name was the existing
rcu_virt_note_context_switch() not actually sufficient? If we had that
there, why did we need an additional explicit call to rcu_all_qs() in
the KVM loop, or the more complex fixes to need_resched() which
ultimately had the same effect, to avoid ten-second latencies?
Paul E. McKenney July 11, 2018, 9:08 p.m. UTC | #5
On Wed, Jul 11, 2018 at 09:19:44PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 13:17 -0700, Paul E. McKenney wrote:
> > As I understand it, they would like to have their guest run uninterrupted
> > for extended times.  Because rcu_virt_note_context_switch() is a
> > point-in-time quiescent state, it cannot tell RCU about the extended
> > quiescent state.
> > 
> > Should we replace the current calls to rcu_virt_note_context_switch()
> > with rcu_kvm_enter() and rcu_kvm_exit()?  Would that be better
> > than the below architecture-by-architecture approach?
> 
> Yes it would. I was already starting to mutter about needing the same
> for ARM and POWER. I'll do a v3 (incorporating your fixes too) in the
> morning.
> 
> Thanks.
> 
> Also... why in $DEITY's name was the existing
> rcu_virt_note_context_switch() not actually sufficient? If we had that
> there, why did we need an additional explicit call to rcu_all_qs() in
> the KVM loop, or the more complex fixes to need_resched() which
> ultimately had the same effect, to avoid ten-second latencies?

My guess is that this was because control passed through the
rcu_virt_note_context_switch() only once, and then subsequent
scheduling-clock interrupts bypassed this code.  But that is just a guess.
I need to defer to someone who understands the KVM code better than I do.

							Thanx, Paul
David Woodhouse July 12, 2018, noon UTC | #6
On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> 
> > Also... why in $DEITY's name was the existing
> > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > there, why did we need an additional explicit call to rcu_all_qs() in
> > the KVM loop, or the more complex fixes to need_resched() which
> > ultimately had the same effect, to avoid ten-second latencies?
> 
> My guess is that this was because control passed through the
> rcu_virt_note_context_switch() only once, and then subsequent
> scheduling-clock interrupts bypassed this code.  But that is just a guess.
> I need to defer to someone who understands the KVM code better than I do.

I think it's more likely that it just never happened at all. It's
conditional. From the latest patch iteration (see it being removed):

@@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
         * one time slice). Lets treat guest mode as quiescent state, just like
         * we do with user-mode execution.
         */
-       if (!context_tracking_cpu_is_enabled())
-               rcu_virt_note_context_switch(smp_processor_id());
+       rcu_kvm_enter();
 }


Given the vmexit overhead, I don't think we can do the currently-
proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
really necessary. I'll make that conditional, but probably on the RCU
side.

Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
rcu_kvm_enter() can do rcu_virt_note_context_switch().

OK?
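
Concretely, such a split might look like the following on the RCU side.
This is a sketch only: it assumes rcu_virt_note_context_switch() is
usable from kernel/rcu/tree.c and that interrupts are disabled at the
call sites, as they are around guest entry.

void rcu_kvm_enter(void)
{
#ifdef CONFIG_NO_HZ_FULL
	rcu_idle_enter();	/* real extended-QS entry */
#else
	/* Cheap point-in-time quiescent state, as before. */
	rcu_virt_note_context_switch(smp_processor_id());
#endif
}
EXPORT_SYMBOL_GPL(rcu_kvm_enter);

void rcu_kvm_exit(void)
{
#ifdef CONFIG_NO_HZ_FULL
	rcu_idle_exit();	/* leave the extended QS on vmexit */
#endif
}
EXPORT_SYMBOL_GPL(rcu_kvm_exit);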
Paul E. McKenney July 12, 2018, 12:53 p.m. UTC | #7
On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote:
> 
> 
> On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> > 
> > > Also... why in $DEITY's name was the existing
> > > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > > there, why did we need an additional explicit call to rcu_all_qs() in
> > > the KVM loop, or the more complex fixes to need_resched() which
> > > ultimately had the same effect, to avoid ten-second latencies?
> > 
> > My guess is that this was because control passed through the
> > rcu_virt_note_context_switch() only once, and then subsequent
> > scheduling-clock interrupts bypassed this code.

Gah!  My guess was instead that the code did a rcu_kvm_enter() going in,
but somehow managed to miss the rcu_kvm_exit() going out.  But that makes
absolutely no sense -- had that happened, rcutorture would likely have
screamed bloody murder, loudly and often.  No mere near misses!

And besides, thus far, -ENOREPRODUCE.  :-/

Which indicates that I have an opportunity to improve rcutorture and
that this patch was with high probability an innocent bystander.

> >                                                  But that is just a guess.
> > I need to defer to someone who understands the KVM code better than I do.
> 
> I think it's more likely that it just never happened at all. It's
> conditional. From the latest patch iteration (see it being removed):
> 
> @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
>          * one time slice). Lets treat guest mode as quiescent state, just like
>          * we do with user-mode execution.
>          */
> -       if (!context_tracking_cpu_is_enabled())
> -               rcu_virt_note_context_switch(smp_processor_id());
> +       rcu_kvm_enter();
>  }
> 
> 
> Given the vmexit overhead, I don't think we can do the currently-
> proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
> really necessary. I'll make that conditional, but probably on the RCU
> side.
> 
> Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
> rcu_kvm_enter() can do rcu_virt_note_context_switch().
> 
> OK?

Makes sense to me!  And a big "thank you!" to Christian for testing
and analyzing this in a timely fashion!!!

							Thanx, Paul
Paul E. McKenney July 12, 2018, 4:17 p.m. UTC | #8
On Thu, Jul 12, 2018 at 05:53:51AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote:
> > 
> > 
> > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> > > 
> > > > Also... why in $DEITY's name was the existing
> > > > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > > > there, why did we need an additional explicit call to rcu_all_qs() in
> > > > the KVM loop, or the more complex fixes to need_resched() which
> > > > ultimately had the same effect, to avoid ten-second latencies?
> > > 
> > > My guess is that this was because control passed through the
> > > rcu_virt_note_context_switch() only once, and then subsequent
> > > scheduling-clock interrupts bypassed this code.
> 
> Gah!  My guess was instead that the code did a rcu_kvm_enter() going in,
> but somehow managed to miss the rcu_kvm_exit() going out.  But that makes
> absolutely no sense -- had that happened, rcutorture would likely have
> screamed bloody murder, loudly and often.  No mere near misses!
> 
> And besides, thus far, -ENOREPRODUCE.  :-/

OK, one close call in 63 hours of rcutorture, this one on scenario TREE03
(yesterday hit TREE01 and TREE03).  Time for probabilistic long-runtime
bisection.  Plus thought about how to get more information out of the near
misses.  Fun!  ;-)

							Thanx, Paul

> Which indicates that I have an opportunity to improve rcutorture and
> that this patch was with high probability an innocent bystander.
> 
> > >                                                  But that is just a guess.
> > > I need to defer to someone who understands the KVM code better than I do.
> > 
> > I think it's more likely that it just never happened at all. It's
> > conditional. From the latest patch iteration (see it being removed):
> > 
> > @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
> >          * one time slice). Lets treat guest mode as quiescent state, just like
> >          * we do with user-mode execution.
> >          */
> > -       if (!context_tracking_cpu_is_enabled())
> > -               rcu_virt_note_context_switch(smp_processor_id());
> > +       rcu_kvm_enter();
> >  }
> > 
> > 
> > Given the vmexit overhead, I don't think we can do the currently-
> > proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
> > really necessary. I'll make that conditional, but probably on the RCU
> > side.
> > 
> > Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
> > rcu_kvm_enter() can do rcu_virt_note_context_switch().
> > 
> > OK?
> 
> Makes sense to me!  And a big "thank you!" to Christian for testing
> and analyzing this in a timely fashion!!!
> 
> 							Thanx, Paul
Paul E. McKenney July 16, 2018, 3:40 p.m. UTC | #9
On Thu, Jul 12, 2018 at 09:17:04AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 12, 2018 at 05:53:51AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote:
> > > 
> > > 
> > > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> > > > 
> > > > > Also... why in $DEITY's name was the existing
> > > > > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > > > > there, why did we need an additional explicit call to rcu_all_qs() in
> > > > > the KVM loop, or the more complex fixes to need_resched() which
> > > > > ultimately had the same effect, to avoid ten-second latencies?
> > > > 
> > > > My guess is that this was because control passed through the
> > > > rcu_virt_note_context_switch() only once, and then subsequent
> > > > scheduling-clock interrupts bypassed this code.
> > 
> > Gah!  My guess was instead that the code did a rcu_kvm_enter() going in,
> > but somehow managed to miss the rcu_kvm_exit() going out.  But that makes
> > absolutely no sense -- had that happened, rcutorture would likely have
> > screamed bloody murder, loudly and often.  No mere near misses!
> > 
> > And besides, thus far, -ENOREPRODUCE.  :-/
> 
> OK, one close call in 63 hours of rcutorture, this one on scenario TREE03
> (yesterday hit TREE01 and TREE03).  Time for probabilistic long-runtime
> bisection.  Plus thought about how to get more information out of the near
> misses.  Fun!  ;-)

Most of the weekend was devoted to testing today's upcoming pull request,
but I did get a bit more testing done on this.

I was able to make this happen more often by tweaking rcutorture a
bit, but I still do not yet have statistically significant results.
Nevertheless, I have thus far only seen failures with David's patch or
with both David's and my patch.  And I actually got a full-up rcutorture
failure (a too-short grace period) in addition to the aforementioned
close calls.

Over this coming week I expect to devote significant testing time to
the commit just prior to David's in my stack.  If I don't see failures
on that commit, we will need to spend some quality time with the KVM
folks on whether or not kvm_x86_ops->run() and friends have the option of
failing to return, but instead causing control to pop up somewhere else.
Or someone could tell me how I am being blind to some obvious bug in
the two commits that allow RCU to treat KVM guest-OS execution as an
extended quiescent state.  ;-)

							Thanx, Paul
David Woodhouse July 17, 2018, 8:19 a.m. UTC | #10
On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> Most of the weekend was devoted to testing today's upcoming pull request,
> but I did get a bit more testing done on this.
> 
> I was able to make this happen more often by tweaking rcutorture a
> bit, but I still do not yet have statistically significant results.
> Nevertheless, I have thus far only seen failures with David's patch or
> with both David's and my patch.  And I actually got a full-up rcutorture
> failure (a too-short grace period) in addition to the aforementioned
> close calls.
> 
> Over this coming week I expect to devote significant testing time to
> the commit just prior to David's in my stack.  If I don't see failures
> on that commit, we will need to spend some quality time with the KVM
> folks on whether or not kvm_x86_ops->run() and friends have the option of
> failing to return, but instead causing control to pop up somewhere else.
> Or someone could tell me how I am being blind to some obvious bug in
> the two commits that allow RCU to treat KVM guest-OS execution as an
> extended quiescent state.  ;-)

One thing we can try, if my patch is implicated, is moving the calls to
rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
for example. If that fixes it, then we know we've missed something else
interesting that's happening in the middle.
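
Sketching that narrower placement (the body is reduced to placeholders;
only the position of the two calls matters here, and the real
vmx_vcpu_run() in arch/x86/kvm/vmx.c is of course far larger):

static void vmx_vcpu_run_sketch(struct kvm_vcpu *vcpu)
{
	/* ... host state save, as in the real vmx_vcpu_run() ... */

	rcu_kvm_enter();
	/* the big asm volatile() block doing VMLAUNCH/VMRESUME sits here */
	rcu_kvm_exit();

	/* ... guest register save, exit handling setup, etc. ... */
}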

Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
with this patch, so in the next iteration it definitely needs to be
ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
(AFAICT) and it's too expensive otherwise as Christian pointed out.
Paul E. McKenney July 17, 2018, 12:56 p.m. UTC | #11
On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote:
> On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> > Most of the weekend was devoted to testing today's upcoming pull request,
> > but I did get a bit more testing done on this.
> > 
> > I was able to make this happen more often by tweaking rcutorture a
> > bit, but I still do not yet have statistically significant results.
> > Nevertheless, I have thus far only seen failures with David's patch or
> > with both David's and my patch.  And I actually got a full-up rcutorture
> > failure (a too-short grace period) in addition to the aforementioned
> > close calls.
> > 
> > Over this coming week I expect to devote significant testing time to
> > the commit just prior to David's in my stack.  If I don't see failures
> > on that commit, we will need to spend some quality time with the KVM
> > folks on whether or not kvm_x86_ops->run() and friends have the option of
> > failing to return, but instead causing control to pop up somewhere else.
> > Or someone could tell me how I am being blind to some obvious bug in
> > the two commits that allow RCU to treat KVM guest-OS execution as an
> > extended quiescent state.  ;-)
> 
> One thing we can try, if my patch is implicated, is moving the calls to
> rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
> them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
> for example. If that fixes it, then we know we've missed something else
> interesting that's happening in the middle.

I don't have enough data to say anything with too much certainty, but
my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours
does, and I am not seeing massive increases in error rate in my patch
compared to yours.  Which again might or might not mean anything.

Plus I haven't proven that your patch isn't an innocent bystander yet.
If it isn't just an innocent bystander, that will take most of this
week to demonstrate given current failure rates.

I am also working on improving rcutorture diagnostics which should help
me work out how to change rcutorture so as to find this more quickly.

> Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
> with this patch, so in the next iteration it definitely needs to be
> ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
> (AFAICT) and it's too expensive otherwise as Christian pointed out.

Makes sense!

							Thanx, Paul
Paul E. McKenney July 18, 2018, 3:36 p.m. UTC | #12
On Tue, Jul 17, 2018 at 05:56:53AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote:
> > On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> > > Most of the weekend was devoted to testing today's upcoming pull request,
> > > but I did get a bit more testing done on this.
> > > 
> > > I was able to make this happen more often by tweaking rcutorture a
> > > bit, but I still do not yet have statistically significant results.
> > > Nevertheless, I have thus far only seen failures with David's patch or
> > > with both David's and my patch.  And I actually got a full-up rcutorture
> > > failure (a too-short grace period) in addition to the aforementioned
> > > close calls.
> > > 
> > > Over this coming week I expect to devote significant testing time to
> > > the commit just prior to David's in my stack.  If I don't see failures
> > > on that commit, we will need to spend some quality time with the KVM
> > > folks on whether or not kvm_x86_ops->run() and friends have the option of
> > > failing to return, but instead causing control to pop up somewhere else.
> > > Or someone could tell me how I am being blind to some obvious bug in
> > > the two commits that allow RCU to treat KVM guest-OS execution as an
> > > extended quiescent state.  ;-)
> > 
> > One thing we can try, if my patch is implicated, is moving the calls to
> > rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
> > them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
> > for example. If that fixes it, then we know we've missed something else
> > interesting that's happening in the middle.
> 
> I don't have enough data to say anything with too much certainty, but
> my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours
> does, and I am not seeing massive increases in error rate in my patch
> compared to yours.  Which again might or might not mean anything.
> 
> Plus I haven't proven that your patch isn't an innocent bystander yet.
> If it isn't just an innocent bystander, that will take most of this
> week to demonstrate given current failure rates.

And I finally did get some near misses from an earlier commit, so we
should consider your patch to be officially off the hook.

							Thanx, Paul

> I am also working on improving rcutorture diagnostics which should help
> me work out how to change rcutorture so as to find this more quickly.
> 
> > Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
> > with this patch, so in the next iteration it definitely needs to be
> > ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
> > (AFAICT) and it's too expensive otherwise as Christian pointed out.
> 
> Makes sense!
> 
> 							Thanx, Paul
David Woodhouse July 18, 2018, 4:01 p.m. UTC | #13
On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
> And I finally did get some near misses from an earlier commit, so we
> should consider your patch to be officially off the hook.

Yay, I like it when it's not my fault. I'll redo it with the ifdef
CONFIG_NO_HZ_FULL.

What should it do for the !CONFIG_NO_HZ_FULL case? The existing call in
guest_enter_irqoff() clearly wasn't actually doing the right thing
anyway, hence the need for the need_resched() patch in $SUBJECT... so
should I just leave it doing nothing in guest_enter_irqoff()?
Frederic Weisbecker July 19, 2018, 12:32 a.m. UTC | #14
On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > fix to my lost exclamation point.
> 
> Thanks. Are you sending the original version of that to Linus? It'd be
> useful to have the commit ID so that we can watch for it landing, and
> chase this one up to Greg.
> 
> As discussed on IRC, this patch reduces synchronize_sched() latency for
> us from ~4600s to ~160ms, which is nice.
> 
> However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> that you want a patch like the one below, which happily reduces the
> latency in our (!NO_HZ_FULL) case still further to ~40ms.

That is interesting. As I replied to Paul, we are already calling
rcu_user_enter/exit() on guest_enter/exit_irqoff(). So I'm wondering why
you're seeing such an optimization by repeating those calls.

Perhaps the rcu_user_* somehow aren't actually called from
__context_tracking_enter()...? Some bug in context tracking?
Otherwise it's a curious side effect.

Thanks.
Paul E. McKenney July 19, 2018, 3:11 a.m. UTC | #15
On Thu, Jul 19, 2018 at 02:32:06AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > > fix to my lost exclamation point.
> > 
> > Thanks. Are you sending the original version of that to Linus? It'd be
> > useful to have the commit ID so that we can watch for it landing, and
> > chase this one up to Greg.
> > 
> > As discussed on IRC, this patch reduces synchronize_sched() latency for
> > us from ~4600s to ~160ms, which is nice.
> > 
> > However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> > that you want a patch like the one below, which happily reduces the
> > latency in our (!NO_HZ_FULL) case still further to ~40ms.
> 
> That is interesting. As I replied to Paul, we are already calling
> rcu_user_enter/exit() on guest_enter/exit_irqoff(). So I'm wondering why
> you're seeing such an optimization by repeating those calls.
> 
> Perhaps the rcu_user_* somehow aren't actually called from
> __context_tracking_enter()...? Some bug in context tracking?
> Otherwise it's a curious side effect.

David is working with v4.15.  Is this maybe something that has changed
since then?

							Thanx, Paul
David Woodhouse July 19, 2018, 6:16 a.m. UTC | #16
On Wed, 2018-07-18 at 20:11 -0700, Paul E. McKenney wrote:
> 
> > That is interesting. As I replied to Paul, we are already calling
> > rcu_user_enter/exit() on guest_enter/exit_irqoff(). So I'm wondering why
> > you're seeing such an optimization by repeating those calls.
> > 
> > Perhaps the rcu_user_* somehow aren't actually called from
> > __context_tracking_enter()...? Some bug in context tracking?
> > Otherwise it's a curious side effect.
> 
> David is working with v4.15.  Is this maybe something that has changed
> since then?

To clarify: in 4.15 without CONFIG_PREEMPT and without NO_HZ_FULL I was
seeing RCU stalls because a thread in vcpu_run() was *never* seen to go
through a quiescent state. Hence the change to need_resched() in the
first patch in this thread, which fixed the problem at hand and seemed
to address the general case.

It then seemed by *inspection* that the NO_HZ_FULL case was probably
broken, because we'd failed to spot the rcu_user_* calls. But
rcu_user_enter() does nothing in the !NO_HZ_FULL case, so wouldn't have
helped in the testing that we were doing anyway.
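
For reference, the reason rcu_user_enter()/rcu_user_exit() are no-ops
there is the way they are declared; roughly (paraphrased from the v4.15
headers, shown as an illustration rather than an exact copy):

#ifdef CONFIG_NO_HZ_FULL
void rcu_user_enter(void);		/* real extended-QS entry */
void rcu_user_exit(void);
#else
static inline void rcu_user_enter(void) { }	/* compiles away */
static inline void rcu_user_exit(void) { }	/* compiles away */
#endif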
Frederic Weisbecker July 19, 2018, 1:15 p.m. UTC | #17
On Wed, Jul 18, 2018 at 08:11:52PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 19, 2018 at 02:32:06AM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> > > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > > > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > > > fix to my lost exclamation point.
> > > 
> > > Thanks. Are you sending the original version of that to Linus? It'd be
> > > useful to have the commit ID so that we can watch for it landing, and
> > > chase this one up to Greg.
> > > 
> > > As discussed on IRC, this patch reduces synchronize_sched() latency for
> > > us from ~4600s to ~160ms, which is nice.
> > > 
> > > However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> > > that you want a patch like the one below, which happily reduces the
> > > latency in our (!NO_HZ_FULL) case still further to ~40ms.
> > 
> > That is interesting. As I replied to Paul, we are already calling
> > rcu_user_enter/exit() on guest_enter/exit_irqoff(). So I'm wondering why
> > you're seeing such an optimization by repeating those calls.
> > 
> > Perhaps the rcu_user_* somehow aren't actually called from
> > __context_tracking_enter()...? Some bug in context tracking?
> > Otherwise it's a curious side effect.
> 
> David is working with v4.15.  Is this maybe something that has changed
> since then?

Hmm, nope I think the context tracking code hasn't changed for a while.
Frederic Weisbecker July 19, 2018, 1:17 p.m. UTC | #18
On Thu, Jul 19, 2018 at 08:16:47AM +0200, David Woodhouse wrote:
> 
> 
> On Wed, 2018-07-18 at 20:11 -0700, Paul E. McKenney wrote:
> > 
> > > That is interesting. As I replied to Paul, we are already calling
> > > rcu_user_enter/exit() on guest_enter/exit_irqoff(). So I'm wondering why
> > > you're seeing such an optimization by repeating those calls.
> > > 
> > > Perhaps the rcu_user_* somehow aren't actually called from
> > > __context_tracking_enter()...? Some bug in context tracking?
> > > Otherwise it's a curious side effect.
> > 
> > David is working with v4.15.  Is this maybe something that has changed
> > since then?
> 
> To clarify: in 4.15 without CONFIG_PREEMPT and without NO_HZ_FULL I was
> seeing RCU stalls because a thread in vcpu_run() was *never* seen to go
> through a quiescent state. Hence the change to need_resched() in the
> first patch in this thread, which fixed the problem at hand and seemed
> to address the general case.
> 
> It then seemed by *inspection* that the NO_HZ_FULL case was probably
> broken, because we'd failed to spot the rcu_user_* calls. But
> rcu_user_enter() does nothing in the !NO_HZ_FULL case, so wouldn't have
> helped in the testing that we were doing anyway.

Oh ok, so the optimization you saw is likely unrelated to the rcu_user* things.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0046aa70205a..b0c82f70afa7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7458,7 +7458,9 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
 	}
 
+	rcu_kvm_enter();
 	kvm_x86_ops->run(vcpu);
+	rcu_kvm_exit();
 
 	/*
 	 * Do this here before restoring debug registers on the host.  And
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 914655848ef6..6d07af5a50fc 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -82,6 +82,8 @@  void cond_synchronize_sched(unsigned long oldstate);
 
 void rcu_idle_enter(void);
 void rcu_idle_exit(void);
+void rcu_kvm_enter(void);
+void rcu_kvm_exit(void);
 void rcu_irq_enter(void);
 void rcu_irq_exit(void);
 void rcu_irq_enter_irqson(void);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index aa7cade1b9f3..df7893273939 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1019,6 +1019,22 @@  void rcu_irq_enter_irqson(void)
 	local_irq_restore(flags);
 }
 
+/*
+ * These are currently identical to the _idle_ versions but let's
+ * explicitly have separate copies to keep Paul honest in future.
+ */
+void rcu_kvm_enter(void)
+{
+	rcu_idle_enter();
+}
+EXPORT_SYMBOL_GPL(rcu_kvm_enter);
+
+void rcu_kvm_exit(void)
+{
+	rcu_idle_exit();
+}
+EXPORT_SYMBOL_GPL(rcu_kvm_exit);
+
 /**
  * rcu_is_watching - see if RCU thinks that the current CPU is idle
  *