Message ID | e21c6d7666bfbad1e1c8295559a6ddc7e4110105.camel@infradead.org (mailing list archive)
---|---
State | New, archived
On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > fix to my lost exclamation point.
>
> Thanks. Are you sending the original version of that to Linus? It'd be
> useful to have the commit ID so that we can watch for it landing, and
> chase this one up to Greg.

That would be great! The commit ID is currently 6d1b6b684e1f ("rcu:
Make need_resched() respond to urgent RCU-QS needs"), but is subject to
change given the likely need to rebase in order to fix bugs in commits
preceding that one. Given your Reported-by, you will be CCed when it
reaches -tip, won't you? At that point, the commit ID would be set in
stone.

Either way, I would very much welcome any help with -stable. I would be
happy to send you an email when its commit ID become set in stone, for
example, if that would help.

> As discussed on IRC, this patch reduces synchronize_sched() latency for
> us from ~4600s to ~160ms, which is nice.

Woo-hoo!!! ;-)

> However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> that you want a patch like the one below, which happily reduces the
> latency in our (!NO_HZ_FULL) case still further to ~40ms.

Even better, good stuff, thank you!

> Adding kvm list for better review...

And a comment below.

							Thanx, Paul

> From: David Woodhouse <dwmw@amazon.co.uk>
> Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode
>
> RCU can spend long periods of time waiting for a CPU which is actually in
> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> modes, and don't wait for it.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c      |  2 ++
>  include/linux/rcutree.h |  2 ++
>  kernel/rcu/tree.c       | 16 ++++++++++++++++
>  3 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0046aa70205a..b0c82f70afa7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
>  	}
>
> +	rcu_kvm_enter();
>  	kvm_x86_ops->run(vcpu);
> +	rcu_kvm_exit();
>
>  	/*
>  	 * Do this here before restoring debug registers on the host. And
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 914655848ef6..6d07af5a50fc 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate);
>
>  void rcu_idle_enter(void);
>  void rcu_idle_exit(void);
> +void rcu_kvm_enter(void);
> +void rcu_kvm_exit(void);
>  void rcu_irq_enter(void);
>  void rcu_irq_exit(void);
>  void rcu_irq_enter_irqson(void);
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index aa7cade1b9f3..df7893273939 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void)
>  	local_irq_restore(flags);
>  }
>
> +/*
> + * These are currently identical to the _idle_ versions but let's
> + * explicitly have separate copies to keep Paul honest in future.
> + */
> +void rcu_kvm_enter(void)
> +{
> +	rcu_idle_enter();
> +}
> +EXPORT_SYMBOL_GPL(rcu_kvm_enter);
> +
> +void rcu_kvm_exit(void)
> +{
> +	rcu_idle_exit();
> +}
> +EXPORT_SYMBOL_GPL(rcu_kvm_exit);

These look good, but we also need these in include/linux/rcutiny.h:

static inline void rcu_kvm_enter(void) { }
static inline void rcu_kvm_exit(void) { }

Unless KVM is excluded on !SMP systems or some such.

Alternatively, you could just have a single pair of static inlines in
include/linux/rcupdate.h (after the #include of rcutree.h and rcutiny.h)
that mapped the _kvm_ functions to the _idle_ functions. Your choice,
I am fine either way.

> +
>  /**
>   * rcu_is_watching - see if RCU thinks that the current CPU is idle
>   *
> --
> 2.17.1
>
>
> --
> dwmw2
>
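For reference, a minimal sketch of the Tiny RCU stubs Paul asks for above; the exact placement inside include/linux/rcutiny.h is an assumption on my part, but the bodies follow directly from his message:

```c
/*
 * Sketch of the CONFIG_TINY_RCU counterparts: Tiny RCU is uniprocessor-only
 * and never waits on another CPU, so there is no extended quiescent state
 * to report and empty stubs suffice, mirroring the existing
 * rcu_idle_enter()/rcu_idle_exit() stubs.
 */
static inline void rcu_kvm_enter(void) { }
static inline void rcu_kvm_exit(void) { }
```

The alternative Paul mentions would instead put a single pair of static inlines in include/linux/rcupdate.h (after rcutree.h/rcutiny.h have been included) that simply forward rcu_kvm_enter()/rcu_kvm_exit() to rcu_idle_enter()/rcu_idle_exit() for both RCU flavors.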
So why is the rcu_virt_note_context_switch(smp_processor_id()); in guest_enter_irqoff not good enough? This was actually supposed to tell rcu that being in the guest is an extended quiescing period (like userspace). What has changed? On 07/11/2018 07:03 PM, David Woodhouse wrote: > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote: >> And here is an updated v4.15 patch with Marius's Reported-by and David's >> fix to my lost exclamation point. > > Thanks. Are you sending the original version of that to Linus? It'd be > useful to have the commit ID so that we can watch for it landing, and > chase this one up to Greg. > > As discussed on IRC, this patch reduces synchronize_sched() latency for > us from ~4600s to ~160ms, which is nice. > > However, it isn't going to be sufficient in the NO_HZ_FULL case. For > that you want a patch like the one below, which happily reduces the > latency in our (!NO_HZ_FULL) case still further to ~40ms. > > Adding kvm list for better review... > > From: David Woodhouse <dwmw@amazon.co.uk> > Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode > > RCU can spend long periods of time waiting for a CPU which is actually in > KVM guest mode, entirely pointlessly. Treat it like the idle and userspace > modes, and don't wait for it. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > arch/x86/kvm/x86.c | 2 ++ > include/linux/rcutree.h | 2 ++ > kernel/rcu/tree.c | 16 ++++++++++++++++ > 3 files changed, 20 insertions(+) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 0046aa70205a..b0c82f70afa7 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD; > } > > + rcu_kvm_enter(); > kvm_x86_ops->run(vcpu); > + rcu_kvm_exit(); > > /* > * Do this here before restoring debug registers on the host. And > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h > index 914655848ef6..6d07af5a50fc 100644 > --- a/include/linux/rcutree.h > +++ b/include/linux/rcutree.h > @@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate); > > void rcu_idle_enter(void); > void rcu_idle_exit(void); > +void rcu_kvm_enter(void); > +void rcu_kvm_exit(void); > void rcu_irq_enter(void); > void rcu_irq_exit(void); > void rcu_irq_enter_irqson(void); > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index aa7cade1b9f3..df7893273939 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void) > local_irq_restore(flags); > } > > +/* > + * These are currently identical to the _idle_ versions but let's > + * explicitly have separate copies to keep Paul honest in future. > + */ > +void rcu_kvm_enter(void) > +{ > + rcu_idle_enter(); > +} > +EXPORT_SYMBOL_GPL(rcu_kvm_enter); > + > +void rcu_kvm_exit(void) > +{ > + rcu_idle_exit(); > +} > +EXPORT_SYMBOL_GPL(rcu_kvm_exit); > + > /** > * rcu_is_watching - see if RCU thinks that the current CPU is idle > * >
On Wed, Jul 11, 2018 at 08:31:55PM +0200, Christian Borntraeger wrote: > So why is the rcu_virt_note_context_switch(smp_processor_id()); > in guest_enter_irqoff not good enough? > > This was actually supposed to tell rcu that being in the guest > is an extended quiescing period (like userspace). > > What has changed? As I understand it, they would like to have their guest run uninterrupted for extended times. Because rcu_virt_note_context_switch() is a point-in-time quiescent state, it cannot tell RCU about the extended quiescent state. Should we replace the current calls to rcu_virt_note_context_switch() with rcu_kvm_enter() and rcu_kvm_exit()? Would that be better than the below architecture-by-architecture approach? Thanx, Paul > On 07/11/2018 07:03 PM, David Woodhouse wrote: > > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote: > >> And here is an updated v4.15 patch with Marius's Reported-by and David's > >> fix to my lost exclamation point. > > > > Thanks. Are you sending the original version of that to Linus? It'd be > > useful to have the commit ID so that we can watch for it landing, and > > chase this one up to Greg. > > > > As discussed on IRC, this patch reduces synchronize_sched() latency for > > us from ~4600s to ~160ms, which is nice. > > > > However, it isn't going to be sufficient in the NO_HZ_FULL case. For > > that you want a patch like the one below, which happily reduces the > > latency in our (!NO_HZ_FULL) case still further to ~40ms. > > > > Adding kvm list for better review... > > > > From: David Woodhouse <dwmw@amazon.co.uk> > > Subject: [PATCH] kvm/x86: Inform RCU of quiescent state when entering guest mode > > > > RCU can spend long periods of time waiting for a CPU which is actually in > > KVM guest mode, entirely pointlessly. Treat it like the idle and userspace > > modes, and don't wait for it. > > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > --- > > arch/x86/kvm/x86.c | 2 ++ > > include/linux/rcutree.h | 2 ++ > > kernel/rcu/tree.c | 16 ++++++++++++++++ > > 3 files changed, 20 insertions(+) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 0046aa70205a..b0c82f70afa7 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD; > > } > > > > + rcu_kvm_enter(); > > kvm_x86_ops->run(vcpu); > > + rcu_kvm_exit(); > > > > /* > > * Do this here before restoring debug registers on the host. And > > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h > > index 914655848ef6..6d07af5a50fc 100644 > > --- a/include/linux/rcutree.h > > +++ b/include/linux/rcutree.h > > @@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate); > > > > void rcu_idle_enter(void); > > void rcu_idle_exit(void); > > +void rcu_kvm_enter(void); > > +void rcu_kvm_exit(void); > > void rcu_irq_enter(void); > > void rcu_irq_exit(void); > > void rcu_irq_enter_irqson(void); > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index aa7cade1b9f3..df7893273939 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void) > > local_irq_restore(flags); > > } > > > > +/* > > + * These are currently identical to the _idle_ versions but let's > > + * explicitly have separate copies to keep Paul honest in future. 
> > + */ > > +void rcu_kvm_enter(void) > > +{ > > + rcu_idle_enter(); > > +} > > +EXPORT_SYMBOL_GPL(rcu_kvm_enter); > > + > > +void rcu_kvm_exit(void) > > +{ > > + rcu_idle_exit(); > > +} > > +EXPORT_SYMBOL_GPL(rcu_kvm_exit); > > + > > /** > > * rcu_is_watching - see if RCU thinks that the current CPU is idle > > * > >
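A rough sketch of the architecture-independent variant Paul floats here, i.e. letting the common guest entry/exit helpers tell RCU about the extended quiescent state instead of patching each arch's vcpu run loop. The placement and the elided surrounding calls are assumptions, not a posted patch:

```c
/*
 * Hypothetical shape of the common-code approach: mark the whole guest run
 * as an RCU extended quiescent state from guest_enter_irqoff() /
 * guest_exit_irqoff().  The elided "..." pieces stand for whatever these
 * helpers already do (context tracking, vtime accounting).
 */
static inline void guest_enter_irqoff(void)
{
	/* ... existing context-tracking and vtime accounting ... */
	rcu_kvm_enter();	/* guest execution: RCU need not watch this CPU */
}

static inline void guest_exit_irqoff(void)
{
	rcu_kvm_exit();		/* back in host code: RCU must watch again */
	/* ... existing context-tracking and vtime accounting ... */
}
```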
On Wed, 2018-07-11 at 13:17 -0700, Paul E. McKenney wrote:
> As I understand it, they would like to have their guest run uninterrupted
> for extended times. Because rcu_virt_note_context_switch() is a
> point-in-time quiescent state, it cannot tell RCU about the extended
> quiescent state.
>
> Should we replace the current calls to rcu_virt_note_context_switch()
> with rcu_kvm_enter() and rcu_kvm_exit()? Would that be better
> than the below architecture-by-architecture approach?

Yes it would. I was already starting to mutter about needing the same
for ARM and POWER. I'll do a v3 (incorporating your fixes too) in the
morning.

Thanks.

Also... why in $DEITY's name was the existing
rcu_virt_note_context_switch() not actually sufficient? If we had that
there, why did we need an additional explicit calls to rcu_all_qs() in
the KVM loop, or the more complex fixes to need_resched() which
ultimately had the same effect, to avoid ten-second latencies?
On Wed, Jul 11, 2018 at 09:19:44PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 13:17 -0700, Paul E. McKenney wrote:
> > As I understand it, they would like to have their guest run uninterrupted
> > for extended times. Because rcu_virt_note_context_switch() is a
> > point-in-time quiescent state, it cannot tell RCU about the extended
> > quiescent state.
> >
> > Should we replace the current calls to rcu_virt_note_context_switch()
> > with rcu_kvm_enter() and rcu_kvm_exit()? Would that be better
> > than the below architecture-by-architecture approach?
>
> Yes it would. I was already starting to mutter about needing the same
> for ARM and POWER. I'll do a v3 (incorporating your fixes too) in the
> morning.
>
> Thanks.
>
> Also... why in $DEITY's name was the existing
> rcu_virt_note_context_switch() not actually sufficient? If we had that
> there, why did we need an additional explicit calls to rcu_all_qs() in
> the KVM loop, or the more complex fixes to need_resched() which
> ultimately had the same effect, to avoid ten-second latencies?

My guess is that this was because control passed through the
rcu_virt_note_context_switch() only once, and then subsequent
scheduling-clock interrupts bypassed this code. But that is just a guess.
I need to defer to someone who understands the KVM code better than I do.

							Thanx, Paul
On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
>
> > Also... why in $DEITY's name was the existing
> > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > there, why did we need an additional explicit calls to rcu_all_qs() in
> > the KVM loop, or the more complex fixes to need_resched() which
> > ultimately had the same effect, to avoid ten-second latencies?
>
> My guess is that this was because control passed through the
> rcu_virt_note_context_switch() only once, and then subsequent
> scheduling-clock interrupts bypassed this code. But that is just a guess.
> I need to defer to someone who understands the KVM code better than I do.

I think it's more likely that we just never happened at all. It's
conditional. From the latest patch iteration (see it being removed):

@@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
 	 * one time slice). Lets treat guest mode as quiescent state, just like
 	 * we do with user-mode execution.
 	 */
-	if (!context_tracking_cpu_is_enabled())
-		rcu_virt_note_context_switch(smp_processor_id());
+	rcu_kvm_enter();
 }

Given the vmexit overhead, I don't think we can do the currently-
proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
really necessary. I'll make that conditional, but probably on the RCU
side.

Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
rcu_kvm_enter() can do rcu_virt_note_context_switch().

OK?
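A sketch of how that split could look on the RCU side; this is my reading of David's proposal rather than a posted patch, and the fallback simply mirrors the rcu_virt_note_context_switch(smp_processor_id()) call that guest_enter_irqoff() used to make:

```c
/*
 * Sketch (an assumption, not a posted patch) of the CONFIG_NO_HZ_FULL
 * split: only pay for a full extended quiescent state when NO_HZ_FULL
 * actually needs it; otherwise fall back to the cheap point-in-time
 * quiescent-state report.
 */
#ifdef CONFIG_NO_HZ_FULL
void rcu_kvm_enter(void)
{
	rcu_idle_enter();	/* extended QS: RCU stops watching this CPU */
}

void rcu_kvm_exit(void)
{
	rcu_idle_exit();	/* RCU watches this CPU again */
}
#else
void rcu_kvm_enter(void)
{
	/* Cheap point-in-time quiescent state, as before. */
	rcu_virt_note_context_switch(smp_processor_id());
}

void rcu_kvm_exit(void)
{
}
#endif
EXPORT_SYMBOL_GPL(rcu_kvm_enter);
EXPORT_SYMBOL_GPL(rcu_kvm_exit);
```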
On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote: > > > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote: > > > > > Also... why in $DEITY's name was the existing > > > rcu_virt_note_context_switch() not actually sufficient? If we had that > > > there, why did we need an additional explicit calls to rcu_all_qs() in > > > the KVM loop, or the more complex fixes to need_resched() which > > > ultimately had the same effect, to avoid ten-second latencies? > > > > My guess is that this was because control passed through the > > rcu_virt_note_context_switch() only once, and then subsequent > > scheduling-clock interrupts bypassed this code. Gah! My guess was instead that the code did a rcu_kvm_enter() going in, but somehow managed to miss the rcu_kvm_exit() going out. But that makes absolutely no sense -- had that happened, rcutorture would likely have screamed bloody murder, loudly and often. No mere near misses! And besides, thus far, -ENOREPRODUCE. :-/ Which indicates that I have an opportunity to improve rcutorture and that this patch was with high probability an innocent bystander. > > But that is just a guess. > > I need to defer to someone who understands the KVM code better than I do. > > I think it's more likely that we just never happened at all. It's > conditional. From the latest patch iteration (see it being removed): > > @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void) > * one time slice). Lets treat guest mode as quiescent state, just like > * we do with user-mode execution. > */ > - if (!context_tracking_cpu_is_enabled()) > - rcu_virt_note_context_switch(smp_processor_id()); > + rcu_kvm_enter(); > } > > > Given the vmexit overhead, I don't think we can do the currently- > proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's > really necessary. I'll make that conditional, but probably on the RCU > side. > > Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and > rcu_kvm_enter() can do rcu_virt_note_context_switch(). > > OK? Makes sense to me! And a big "thank you!" to Christian for testing and analyzing this in a timely fashion!!! Thanx, Paul
On Thu, Jul 12, 2018 at 05:53:51AM -0700, Paul E. McKenney wrote: > On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote: > > > > > > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote: > > > > > > > Also... why in $DEITY's name was the existing > > > > rcu_virt_note_context_switch() not actually sufficient? If we had that > > > > there, why did we need an additional explicit calls to rcu_all_qs() in > > > > the KVM loop, or the more complex fixes to need_resched() which > > > > ultimately had the same effect, to avoid ten-second latencies? > > > > > > My guess is that this was because control passed through the > > > rcu_virt_note_context_switch() only once, and then subsequent > > > scheduling-clock interrupts bypassed this code. > > Gah! My guess was instead that the code did a rcu_kvm_enter() going in, > but somehow managed to miss the rcu_kvm_exit() going out. But that makes > absolutely no sense -- had that happened, rcutorture would likely have > screamed bloody murder, loudly and often. No mere near misses! > > And besides, thus far, -ENOREPRODUCE. :-/ OK, one close call in 63 hours of rcutorture, this one on scenario TREE03 (yesterday hit TREE01 and TREE03). Time for probabilitistic long-runtime bisection. Plus thought about how to get more information out of the near misses. Fun! ;-) Thanx, Paul > Which indicates that I have an opportunity to improve rcutorture and > that this patch was with high probability an innocent bystander. > > > > But that is just a guess. > > > I need to defer to someone who understands the KVM code better than I do. > > > > I think it's more likely that we just never happened at all. It's > > conditional. From the latest patch iteration (see it being removed): > > > > @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void) > > * one time slice). Lets treat guest mode as quiescent state, just like > > * we do with user-mode execution. > > */ > > - if (!context_tracking_cpu_is_enabled()) > > - rcu_virt_note_context_switch(smp_processor_id()); > > + rcu_kvm_enter(); > > } > > > > > > Given the vmexit overhead, I don't think we can do the currently- > > proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's > > really necessary. I'll make that conditional, but probably on the RCU > > side. > > > > Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and > > rcu_kvm_enter() can do rcu_virt_note_context_switch(). > > > > OK? > > Makes sense to me! And a big "thank you!" to Christian for testing > and analyzing this in a timely fashion!!! > > Thanx, Paul
On Thu, Jul 12, 2018 at 09:17:04AM -0700, Paul E. McKenney wrote: > On Thu, Jul 12, 2018 at 05:53:51AM -0700, Paul E. McKenney wrote: > > On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote: > > > > > > > > > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote: > > > > > > > > > Also... why in $DEITY's name was the existing > > > > > rcu_virt_note_context_switch() not actually sufficient? If we had that > > > > > there, why did we need an additional explicit calls to rcu_all_qs() in > > > > > the KVM loop, or the more complex fixes to need_resched() which > > > > > ultimately had the same effect, to avoid ten-second latencies? > > > > > > > > My guess is that this was because control passed through the > > > > rcu_virt_note_context_switch() only once, and then subsequent > > > > scheduling-clock interrupts bypassed this code. > > > > Gah! My guess was instead that the code did a rcu_kvm_enter() going in, > > but somehow managed to miss the rcu_kvm_exit() going out. But that makes > > absolutely no sense -- had that happened, rcutorture would likely have > > screamed bloody murder, loudly and often. No mere near misses! > > > > And besides, thus far, -ENOREPRODUCE. :-/ > > OK, one close call in 63 hours of rcutorture, this one on scenario TREE03 > (yesterday hit TREE01 and TREE03). Time for probabilitistic long-runtime > bisection. Plus thought about how to get more information out of the near > misses. Fun! ;-) Most of the weekend was devoted to testing today's upcoming pull request, but I did get a bit more testing done on this. I was able to make this happen more often by tweaking rcutorture a bit, but I still do not yet have statistically significant results. Nevertheless, I have thus far only seen failures with David's patch or with both David's and my patch. And I actually got a full-up rcutorture failure (a too-short grace period) in addition to the aforementioned close calls. Over this coming week I expect to devote significant testing time to the commit just prior to David's in my stack. If I don't see failures on that commit, we will need to spent some quality time with the KVM folks on whether or not kvm_x86_ops->run() and friends have the option of failing to return, but instead causing control to pop up somewhere else. Or someone could tell me how I am being blind to some obvious bug in the two commits that allow RCU to treat KVM guest-OS execution as an extended quiescent state. ;-) Thanx, Paul
On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> Most of the weekend was devoted to testing today's upcoming pull request,
> but I did get a bit more testing done on this.
>
> I was able to make this happen more often by tweaking rcutorture a
> bit, but I still do not yet have statistically significant results.
> Nevertheless, I have thus far only seen failures with David's patch or
> with both David's and my patch. And I actually got a full-up rcutorture
> failure (a too-short grace period) in addition to the aforementioned
> close calls.
>
> Over this coming week I expect to devote significant testing time to
> the commit just prior to David's in my stack. If I don't see failures
> on that commit, we will need to spent some quality time with the KVM
> folks on whether or not kvm_x86_ops->run() and friends have the option of
> failing to return, but instead causing control to pop up somewhere else.
> Or someone could tell me how I am being blind to some obvious bug in
> the two commits that allow RCU to treat KVM guest-OS execution as an
> extended quiescent state. ;-)

One thing we can try, if my patch is implicated, is moving the calls to
rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
for example. If that fixes it, then we know we've missed something else
interesting that's happening in the middle.

Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
with this patch, so in the next iteration it definitely needs to be
ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
(AFAICT) and it's too expensive otherwise as Christian pointed out.
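For context, a user-space sketch of the kind of guest-side microbenchmark behind the ~3000 vs ~3500 cycle numbers; the exact methodology is an assumption (David does not post his test), but CPUID always traps to the hypervisor, so timing it with RDTSC approximates the full exit/entry round trip including the new rcu_kvm_enter()/rcu_kvm_exit() work:

```c
/*
 * Hypothetical guest-side microbenchmark: CPUID unconditionally causes a
 * VM exit, so cycles-per-CPUID approximates the exit/entry round trip.
 * Compile with gcc on x86 and run inside the guest.
 */
#include <stdio.h>
#include <stdint.h>
#include <cpuid.h>
#include <x86intrin.h>

int main(void)
{
	enum { ITERS = 1000000 };
	unsigned int eax, ebx, ecx, edx;
	uint64_t start, end;

	start = __rdtsc();
	for (int i = 0; i < ITERS; i++)
		__get_cpuid(0, &eax, &ebx, &ecx, &edx);	/* forces a VM exit */
	end = __rdtsc();

	printf("~%llu cycles per CPUID round trip\n",
	       (unsigned long long)((end - start) / ITERS));
	return 0;
}
```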
On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote: > On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote: > > Most of the weekend was devoted to testing today's upcoming pull request, > > but I did get a bit more testing done on this. > > > > I was able to make this happen more often by tweaking rcutorture a > > bit, but I still do not yet have statistically significant results. > > Nevertheless, I have thus far only seen failures with David's patch or > > with both David's and my patch. And I actually got a full-up rcutorture > > failure (a too-short grace period) in addition to the aforementioned > > close calls. > > > > Over this coming week I expect to devote significant testing time to > > the commit just prior to David's in my stack. If I don't see failures > > on that commit, we will need to spent some quality time with the KVM > > folks on whether or not kvm_x86_ops->run() and friends have the option of > > failing to return, but instead causing control to pop up somewhere else. > > Or someone could tell me how I am being blind to some obvious bug in > > the two commits that allow RCU to treat KVM guest-OS execution as an > > extended quiescent state. ;-) > > One thing we can try, if my patch is implicated, is moving the calls to > rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting > them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run() > for example. If that fixes it, then we know we've missed something else > interesting that's happening in the middle. I don't have enough data to say anything with too much certainty, but my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours does, and I am not seeing massive increases in error rate in my patch compared to yours. Which again might or might not mean anything. Plus I haven't proven that your patch isn't an innocent bystander yet. If it isn't just an innocent bystander, that will take most of this week do demonstrate given current failure rates. I am also working on improving rcutorture diagnostics which should help me work out how to change rcutorture so as to find this more quickly. > Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500 > with this patch, so in the next iteration it definitely needs to be > ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there > (AFAICT) and it's too expensive otherwise as Christian pointed out. Makes sense! Thanx, Paul
On Tue, Jul 17, 2018 at 05:56:53AM -0700, Paul E. McKenney wrote: > On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote: > > On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote: > > > Most of the weekend was devoted to testing today's upcoming pull request, > > > but I did get a bit more testing done on this. > > > > > > I was able to make this happen more often by tweaking rcutorture a > > > bit, but I still do not yet have statistically significant results. > > > Nevertheless, I have thus far only seen failures with David's patch or > > > with both David's and my patch. And I actually got a full-up rcutorture > > > failure (a too-short grace period) in addition to the aforementioned > > > close calls. > > > > > > Over this coming week I expect to devote significant testing time to > > > the commit just prior to David's in my stack. If I don't see failures > > > on that commit, we will need to spent some quality time with the KVM > > > folks on whether or not kvm_x86_ops->run() and friends have the option of > > > failing to return, but instead causing control to pop up somewhere else. > > > Or someone could tell me how I am being blind to some obvious bug in > > > the two commits that allow RCU to treat KVM guest-OS execution as an > > > extended quiescent state. ;-) > > > > One thing we can try, if my patch is implicated, is moving the calls to > > rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting > > them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run() > > for example. If that fixes it, then we know we've missed something else > > interesting that's happening in the middle. > > I don't have enough data to say anything with too much certainty, but > my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours > does, and I am not seeing massive increases in error rate in my patch > compared to yours. Which again might or might not mean anything. > > Plus I haven't proven that your patch isn't an innocent bystander yet. > If it isn't just an innocent bystander, that will take most of this > week do demonstrate given current failure rates. And I finally did get some near misses from an earlier commit, so we should consider your patch to be officially off the hook. Thanx, Paul > I am also working on improving rcutorture diagnostics which should help > me work out how to change rcutorture so as to find this more quickly. > > > Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500 > > with this patch, so in the next iteration it definitely needs to be > > ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there > > (AFAICT) and it's too expensive otherwise as Christian pointed out. > > Makes sense! > > Thanx, Paul
On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
> And I finally did get some near misses from an earlier commit, so we
> should consider your patch to be officially off the hook.

Yay, I like it when it's not my fault.

I'll redo it with the ifdef CONFIG_NO_HZ_FULL. What should it do for
the !CONFIG_NO_HZ_FULL case? The existing call in guest_enter_irqoff()
clearly wasn't actually doing the right thing anyway, hence the need
for the need_resched() patch in $SUBJECT... so should I just leave it
doing nothing in guest_enter_irqoff()?
On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > fix to my lost exclamation point.
>
> Thanks. Are you sending the original version of that to Linus? It'd be
> useful to have the commit ID so that we can watch for it landing, and
> chase this one up to Greg.
>
> As discussed on IRC, this patch reduces synchronize_sched() latency for
> us from ~4600s to ~160ms, which is nice.
>
> However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> that you want a patch like the one below, which happily reduces the
> latency in our (!NO_HZ_FULL) case still further to ~40ms.

That is interesting. As I replied to Paul, we are already calling
rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
you're seeing such an optimization by repeating those calls.

Perhaps the rcu_user_* somehow aren't actually called from
__context_tracking_enter()...? Some bug in context tracking?
Otherwise it's a curious side effect.

Thanks.
On Thu, Jul 19, 2018 at 02:32:06AM +0200, Frederic Weisbecker wrote: > On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote: > > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote: > > > And here is an updated v4.15 patch with Marius's Reported-by and David's > > > fix to my lost exclamation point. > > > > Thanks. Are you sending the original version of that to Linus? It'd be > > useful to have the commit ID so that we can watch for it landing, and > > chase this one up to Greg. > > > > As discussed on IRC, this patch reduces synchronize_sched() latency for > > us from ~4600s to ~160ms, which is nice. > > > > However, it isn't going to be sufficient in the NO_HZ_FULL case. For > > that you want a patch like the one below, which happily reduces the > > latency in our (!NO_HZ_FULL) case still further to ~40ms. > > That is interesting. As I replied to Paul, we are already calling > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why > you're seeing such an optimization by repeating those calls. > > Perhaps the rcu_user_* somehow aren't actually called from > __context_tracking_enter()...? Some bug in context tracking? > Otherwise it's a curious side effect. David is working with v4.15. Is this maybe something that has changed since then? Thanx, Paul
On Wed, 2018-07-18 at 20:11 -0700, Paul E. McKenney wrote:
>
> > That is interesting. As I replied to Paul, we are already calling
> > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
> > you're seeing such an optimization by repeating those calls.
> >
> > Perhaps the rcu_user_* somehow aren't actually called from
> > __context_tracking_enter()...? Some bug in context tracking?
> > Otherwise it's a curious side effect.
>
> David is working with v4.15. Is this maybe something that has changed
> since then?

To clarify: in 4.15 without CONFIG_PREEMPT and without NO_HZ_FULL I was
seeing RCU stalls because a thread in vcpu_run() was *never* seen to go
through a quiescent state. Hence the change to need_resched() in the
first patch in this thread, which fixed the problem at hand and seemed
to address the general case.

It then seemed by *inspection* that the NO_HZ_FULL case was probably
broken, because we'd failed to spot the rcu_user_* calls. But
rcu_user_enter() does nothing in the !NO_HZ_FULL case, so wouldn't have
helped in the testing that we were doing anyway.
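David's last point can be seen in the headers: the rcu_user_*() entry points only exist as real functions under CONFIG_NO_HZ_FULL, roughly as sketched below (paraphrased from my reading of the v4.15-era include/linux/rcupdate.h, so treat the exact location as an assumption):

```c
/*
 * Roughly how the declarations are gated: without CONFIG_NO_HZ_FULL the
 * rcu_user_*() calls compile away entirely, so they could not have helped
 * in the !NO_HZ_FULL testing described above.
 */
#ifdef CONFIG_NO_HZ_FULL
void rcu_user_enter(void);
void rcu_user_exit(void);
#else
static inline void rcu_user_enter(void) { }
static inline void rcu_user_exit(void) { }
#endif /* CONFIG_NO_HZ_FULL */
```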
On Wed, Jul 18, 2018 at 08:11:52PM -0700, Paul E. McKenney wrote: > On Thu, Jul 19, 2018 at 02:32:06AM +0200, Frederic Weisbecker wrote: > > On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote: > > > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote: > > > > And here is an updated v4.15 patch with Marius's Reported-by and David's > > > > fix to my lost exclamation point. > > > > > > Thanks. Are you sending the original version of that to Linus? It'd be > > > useful to have the commit ID so that we can watch for it landing, and > > > chase this one up to Greg. > > > > > > As discussed on IRC, this patch reduces synchronize_sched() latency for > > > us from ~4600s to ~160ms, which is nice. > > > > > > However, it isn't going to be sufficient in the NO_HZ_FULL case. For > > > that you want a patch like the one below, which happily reduces the > > > latency in our (!NO_HZ_FULL) case still further to ~40ms. > > > > That is interesting. As I replied to Paul, we are already calling > > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why > > you're seeing such an optimization by repeating those calls. > > > > Perhaps the rcu_user_* somehow aren't actually called from > > __context_tracking_enter()...? Some bug in context tracking? > > Otherwise it's a curious side effect. > > David is working with v4.15. Is this maybe something that has changed > since then? Hmm, nope I think the context tracking code hasn't changed for a while.
On Thu, Jul 19, 2018 at 08:16:47AM +0200, David Woodhouse wrote:
>
>
> On Wed, 2018-07-18 at 20:11 -0700, Paul E. McKenney wrote:
> >
> > > That is interesting. As I replied to Paul, we are already calling
> > > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
> > > you're seeing such an optimization by repeating those calls.
> > >
> > > Perhaps the rcu_user_* somehow aren't actually called from
> > > __context_tracking_enter()...? Some bug in context tracking?
> > > Otherwise it's a curious side effect.
> >
> > David is working with v4.15. Is this maybe something that has changed
> > since then?
>
> To clarify: in 4.15 without CONFIG_PREEMPT and without NO_HZ_FULL I was
> seeing RCU stalls because a thread in vcpu_run() was *never* seen to go
> through a quiescent state. Hence the change to need_resched() in the
> first patch in this thread, which fixed the problem at hand and seemed
> to address the general case.
>
> It then seemed by *inspection* that the NO_HZ_FULL case was probably
> broken, because we'd failed to spot the rcu_user_* calls. But
> rcu_user_enter() does nothing in the !NO_HZ_FULL case, so wouldn't have
> helped in the testing that we were doing anyway.

Oh ok, so the optimization you saw is likely unrelated to the rcu_user*
things.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0046aa70205a..b0c82f70afa7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
 	}
 
+	rcu_kvm_enter();
 	kvm_x86_ops->run(vcpu);
+	rcu_kvm_exit();
 
 	/*
 	 * Do this here before restoring debug registers on the host. And
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 914655848ef6..6d07af5a50fc 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -82,6 +82,8 @@ void cond_synchronize_sched(unsigned long oldstate);
 
 void rcu_idle_enter(void);
 void rcu_idle_exit(void);
+void rcu_kvm_enter(void);
+void rcu_kvm_exit(void);
 void rcu_irq_enter(void);
 void rcu_irq_exit(void);
 void rcu_irq_enter_irqson(void);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index aa7cade1b9f3..df7893273939 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1019,6 +1019,22 @@ void rcu_irq_enter_irqson(void)
 	local_irq_restore(flags);
 }
 
+/*
+ * These are currently identical to the _idle_ versions but let's
+ * explicitly have separate copies to keep Paul honest in future.
+ */
+void rcu_kvm_enter(void)
+{
+	rcu_idle_enter();
+}
+EXPORT_SYMBOL_GPL(rcu_kvm_enter);
+
+void rcu_kvm_exit(void)
+{
+	rcu_idle_exit();
+}
+EXPORT_SYMBOL_GPL(rcu_kvm_exit);
+
 /**
  * rcu_is_watching - see if RCU thinks that the current CPU is idle
  *