[RFC,1/1] kvm: Note an RCU quiescent state on guest exit

Message ID	20240511020557.1198200-1-leobras@redhat.com (mailing list archive)
State	New, archived
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CE18D10958 for <kvm@vger.kernel.org>; Sat, 11 May 2024 02:06:31 +0000 (UTC) From: Leonardo Bras <leobras@redhat.com> To: Frederic Weisbecker <frederic@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com>, Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com>, linux-kernel@vger.kernel.org, kvm@vger.kernel.org Subject: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit Date: Fri, 10 May 2024 23:05:56 -0300 Message-ID: <20240511020557.1198200-1-leobras@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[RFC,1/1] kvm: Note an RCU quiescent state on guest exit \| expand [RFC,1/1] kvm: Note an RCU quiescent state on guest exit

Leonardo Bras May 11, 2024, 2:05 a.m. UTC

As of today, KVM notes a quiescent state only in guest entry, which is good
as it avoids the guest being interrupted for current RCU operations.

While the guest vcpu runs, it can be interrupted by a timer IRQ that will
check for any RCU operations waiting for this CPU. In case there are any of
such, it invokes rcu_core() in order to sched-out the current thread and
note a quiescent state.

This occasional schedule work will introduce tens of microsseconds of
latency, which is really bad for vcpus running latency-sensitive
applications, such as real-time workloads.

So, note a quiescent state in guest exit, so the interrupted guests is able
to deal with any pending RCU operations before being required to invoke
rcu_core(), and thus avoid the overhead of related scheduler work.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---

ps: A patch fixing this same issue was discussed in this thread:
https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/

Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
to avoid having invoke_rcu() being called on grace-periods starting between
guest exit and the timer IRQ. This RCU option is being discussed in a
sub-thread of this message:
https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/


 include/linux/context_tracking.h |  6 ++++--
 include/linux/kvm_host.h         | 10 +++++++++-
 2 files changed, 13 insertions(+), 3 deletions(-)

Leonardo Bras May 11, 2024, 2:11 a.m. UTC | #1

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able

s/guests/guest

> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
> 
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> 
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> 
>  include/linux/context_tracking.h |  6 ++++--
>  include/linux/kvm_host.h         | 10 +++++++++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  }
>  
>  static __always_inline bool context_tracking_guest_enter(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_enter(CONTEXT_GUEST);
>  
>  	return context_tracking_enabled_this_cpu();
>  }
>  
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_exit(CONTEXT_GUEST);
> +
> +	return context_tracking_enabled_this_cpu();
>  }
>  
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline void user_enter_irqoff(void) { }
>  static inline void user_exit_irqoff(void) { }
>  static inline int exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
>  static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
>  extern void context_tracking_init(void);
>  #else
>  static inline void context_tracking_init(void) { }
>  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
>   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
>   * unsafe to use any code which may directly or indirectly use RCU, tracing
>   * (including IRQ flag tracing), or lockdep. All code in this period must be
>   * non-instrumentable.
>   */
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
> -	context_tracking_guest_exit();
> +	/*
> +	 * Guest mode is treated as a quiescent state, see
> +	 * guest_context_enter_irqoff() for more details.
> +	 */
> +	if (!context_tracking_guest_exit()) {
> +		instrumentation_begin();
> +		rcu_virt_note_context_switch();
> +		instrumentation_end();
> +	}
>  }
>  
>  /*
>   * Stop accounting time towards a guest.
>   * Must be called after exiting guest context.
>   */
>  static __always_inline void guest_timing_exit_irqoff(void)
>  {
>  	instrumentation_begin();
>  	/* Flush the guest cputime we spent on the guest */
> -- 
> 2.45.0
>

Paul E. McKenney May 11, 2024, 2:55 p.m. UTC | #2

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> ---
> 
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> 
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> 
>  include/linux/context_tracking.h |  6 ++++--
>  include/linux/kvm_host.h         | 10 +++++++++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  }
>  
>  static __always_inline bool context_tracking_guest_enter(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_enter(CONTEXT_GUEST);
>  
>  	return context_tracking_enabled_this_cpu();
>  }
>  
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_exit(CONTEXT_GUEST);
> +
> +	return context_tracking_enabled_this_cpu();
>  }
>  
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline void user_enter_irqoff(void) { }
>  static inline void user_exit_irqoff(void) { }
>  static inline int exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
>  static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
>  extern void context_tracking_init(void);
>  #else
>  static inline void context_tracking_init(void) { }
>  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
>   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
>   * unsafe to use any code which may directly or indirectly use RCU, tracing
>   * (including IRQ flag tracing), or lockdep. All code in this period must be
>   * non-instrumentable.
>   */
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
> -	context_tracking_guest_exit();
> +	/*
> +	 * Guest mode is treated as a quiescent state, see
> +	 * guest_context_enter_irqoff() for more details.
> +	 */
> +	if (!context_tracking_guest_exit()) {
> +		instrumentation_begin();
> +		rcu_virt_note_context_switch();
> +		instrumentation_end();
> +	}
>  }
>  
>  /*
>   * Stop accounting time towards a guest.
>   * Must be called after exiting guest context.
>   */
>  static __always_inline void guest_timing_exit_irqoff(void)
>  {
>  	instrumentation_begin();
>  	/* Flush the guest cputime we spent on the guest */
> -- 
> 2.45.0
>

Leonardo Bras May 11, 2024, 8:31 p.m. UTC | #3

On Sat, May 11, 2024 at 07:55:55AM -0700, Paul E. McKenney wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> > 
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> > 
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> > 
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 
> Acked-by: Paul E. McKenney <paulmck@kernel.org>

Thanks!
Leo

> 
> > ---
> > 
> > ps: A patch fixing this same issue was discussed in this thread:
> > https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > 
> > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > to avoid having invoke_rcu() being called on grace-periods starting between
> > guest exit and the timer IRQ. This RCU option is being discussed in a
> > sub-thread of this message:
> > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> > 
> > 
> >  include/linux/context_tracking.h |  6 ++++--
> >  include/linux/kvm_host.h         | 10 +++++++++-
> >  2 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 6e76b9dba00e..8a78fabeafc3 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> >  }
> >  
> >  static __always_inline bool context_tracking_guest_enter(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_enter(CONTEXT_GUEST);
> >  
> >  	return context_tracking_enabled_this_cpu();
> >  }
> >  
> > -static __always_inline void context_tracking_guest_exit(void)
> > +static __always_inline bool context_tracking_guest_exit(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_exit(CONTEXT_GUEST);
> > +
> > +	return context_tracking_enabled_this_cpu();
> >  }
> >  
> >  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
> >  
> >  #else
> >  static inline void user_enter(void) { }
> >  static inline void user_exit(void) { }
> >  static inline void user_enter_irqoff(void) { }
> >  static inline void user_exit_irqoff(void) { }
> >  static inline int exception_enter(void) { return 0; }
> >  static inline void exception_exit(enum ctx_state prev_ctx) { }
> >  static inline int ct_state(void) { return -1; }
> >  static inline int __ct_state(void) { return -1; }
> >  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> > -static __always_inline void context_tracking_guest_exit(void) { }
> > +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> >  #define CT_WARN_ON(cond) do { } while (0)
> >  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> >  extern void context_tracking_init(void);
> >  #else
> >  static inline void context_tracking_init(void) { }
> >  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..e37724c44843 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> >  /*
> >   * Exit guest context and exit an RCU extended quiescent state.
> >   *
> >   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> >   * unsafe to use any code which may directly or indirectly use RCU, tracing
> >   * (including IRQ flag tracing), or lockdep. All code in this period must be
> >   * non-instrumentable.
> >   */
> >  static __always_inline void guest_context_exit_irqoff(void)
> >  {
> > -	context_tracking_guest_exit();
> > +	/*
> > +	 * Guest mode is treated as a quiescent state, see
> > +	 * guest_context_enter_irqoff() for more details.
> > +	 */
> > +	if (!context_tracking_guest_exit()) {
> > +		instrumentation_begin();
> > +		rcu_virt_note_context_switch();
> > +		instrumentation_end();
> > +	}
> >  }
> >  
> >  /*
> >   * Stop accounting time towards a guest.
> >   * Must be called after exiting guest context.
> >   */
> >  static __always_inline void guest_timing_exit_irqoff(void)
> >  {
> >  	instrumentation_begin();
> >  	/* Flush the guest cputime we spent on the guest */
> > -- 
> > 2.45.0
> > 
>

Marcelo Tosatti May 12, 2024, 9:44 p.m. UTC | #4

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.

This does not properly fix the current problem, as RCU work might be
scheduled after the VM exit, followed by a timer interrupt.

Correct?

> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
> 
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> 
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> 
>  include/linux/context_tracking.h |  6 ++++--
>  include/linux/kvm_host.h         | 10 +++++++++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  }
>  
>  static __always_inline bool context_tracking_guest_enter(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_enter(CONTEXT_GUEST);
>  
>  	return context_tracking_enabled_this_cpu();
>  }
>  
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_exit(CONTEXT_GUEST);
> +
> +	return context_tracking_enabled_this_cpu();
>  }
>  
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline void user_enter_irqoff(void) { }
>  static inline void user_exit_irqoff(void) { }
>  static inline int exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
>  static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
>  extern void context_tracking_init(void);
>  #else
>  static inline void context_tracking_init(void) { }
>  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
>   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
>   * unsafe to use any code which may directly or indirectly use RCU, tracing
>   * (including IRQ flag tracing), or lockdep. All code in this period must be
>   * non-instrumentable.
>   */
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
> -	context_tracking_guest_exit();
> +	/*
> +	 * Guest mode is treated as a quiescent state, see
> +	 * guest_context_enter_irqoff() for more details.
> +	 */
> +	if (!context_tracking_guest_exit()) {
> +		instrumentation_begin();
> +		rcu_virt_note_context_switch();
> +		instrumentation_end();
> +	}
>  }
>  
>  /*
>   * Stop accounting time towards a guest.
>   * Must be called after exiting guest context.
>   */
>  static __always_inline void guest_timing_exit_irqoff(void)
>  {
>  	instrumentation_begin();
>  	/* Flush the guest cputime we spent on the guest */
> -- 
> 2.45.0
> 
> 
>

Marcelo Tosatti May 13, 2024, 1:06 a.m. UTC | #5

On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> > 
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> > 
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> > 
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> This does not properly fix the current problem, as RCU work might be
> scheduled after the VM exit, followed by a timer interrupt.
> 
> Correct?

Not that i am against the patch... 

But, regarding the problem at hand, it does not fix it reliably.

Leonardo Bras May 13, 2024, 3:14 a.m. UTC | #6

On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> > 
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> > 
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> > 
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> This does not properly fix the current problem, as RCU work might be
> scheduled after the VM exit, followed by a timer interrupt.
> 
> Correct?

Correct, for this case, check the note below:

> 
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> > 
> > ps: A patch fixing this same issue was discussed in this thread:
> > https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > 
> > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > to avoid having invoke_rcu() being called on grace-periods starting between
> > guest exit and the timer IRQ. This RCU option is being discussed in a
> > sub-thread of this message:
> > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/

^ This one above.
The idea is to use this rcutree.nocb_patience_delay=N :
a new option we added on RCU that allow us to avoid invoking rcu_core() if 
the grace_period < N miliseconds. This only works on nohz_full cpus.

So with both the current patch and the one in above link, we have the same 
effect as we previously had with last_guest_exit, with a cherry on top: we 
can avoid rcu_core() getting called in situations where a grace period just 
started after going into kernel code, and a timer interrupt happened before 
it can report quiescent state again. 

For our nohz_full vcpu thread scenario, we have:

- guest_exit note a quiescent state
- let's say we start a grace period in the next cycle
- If timer interrupts, it requires the grace period to be older than N 
  miliseconds
  - If we configure a proper value for patience, it will never reach the 
    end of patience before going guest_entry, and thus noting a quiescent 
    state

What do you think?

Thanks!
Leo

> > 
> > 
> >  include/linux/context_tracking.h |  6 ++++--
> >  include/linux/kvm_host.h         | 10 +++++++++-
> >  2 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 6e76b9dba00e..8a78fabeafc3 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> >  }
> >  
> >  static __always_inline bool context_tracking_guest_enter(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_enter(CONTEXT_GUEST);
> >  
> >  	return context_tracking_enabled_this_cpu();
> >  }
> >  
> > -static __always_inline void context_tracking_guest_exit(void)
> > +static __always_inline bool context_tracking_guest_exit(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_exit(CONTEXT_GUEST);
> > +
> > +	return context_tracking_enabled_this_cpu();
> >  }
> >  
> >  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
> >  
> >  #else
> >  static inline void user_enter(void) { }
> >  static inline void user_exit(void) { }
> >  static inline void user_enter_irqoff(void) { }
> >  static inline void user_exit_irqoff(void) { }
> >  static inline int exception_enter(void) { return 0; }
> >  static inline void exception_exit(enum ctx_state prev_ctx) { }
> >  static inline int ct_state(void) { return -1; }
> >  static inline int __ct_state(void) { return -1; }
> >  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> > -static __always_inline void context_tracking_guest_exit(void) { }
> > +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> >  #define CT_WARN_ON(cond) do { } while (0)
> >  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> >  extern void context_tracking_init(void);
> >  #else
> >  static inline void context_tracking_init(void) { }
> >  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..e37724c44843 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> >  /*
> >   * Exit guest context and exit an RCU extended quiescent state.
> >   *
> >   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> >   * unsafe to use any code which may directly or indirectly use RCU, tracing
> >   * (including IRQ flag tracing), or lockdep. All code in this period must be
> >   * non-instrumentable.
> >   */
> >  static __always_inline void guest_context_exit_irqoff(void)
> >  {
> > -	context_tracking_guest_exit();
> > +	/*
> > +	 * Guest mode is treated as a quiescent state, see
> > +	 * guest_context_enter_irqoff() for more details.
> > +	 */
> > +	if (!context_tracking_guest_exit()) {
> > +		instrumentation_begin();
> > +		rcu_virt_note_context_switch();
> > +		instrumentation_end();
> > +	}
> >  }
> >  
> >  /*
> >   * Stop accounting time towards a guest.
> >   * Must be called after exiting guest context.
> >   */
> >  static __always_inline void guest_timing_exit_irqoff(void)
> >  {
> >  	instrumentation_begin();
> >  	/* Flush the guest cputime we spent on the guest */
> > -- 
> > 2.45.0
> > 
> > 
> > 
>

Marcelo Tosatti May 13, 2024, 7:14 p.m. UTC | #7

On Mon, May 13, 2024 at 12:14:32AM -0300, Leonardo Bras wrote:
> On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> > On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > as it avoids the guest being interrupted for current RCU operations.
> > > 
> > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > check for any RCU operations waiting for this CPU. In case there are any of
> > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > note a quiescent state.
> > > 
> > > This occasional schedule work will introduce tens of microsseconds of
> > > latency, which is really bad for vcpus running latency-sensitive
> > > applications, such as real-time workloads.
> > > 
> > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > to deal with any pending RCU operations before being required to invoke
> > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > 
> > This does not properly fix the current problem, as RCU work might be
> > scheduled after the VM exit, followed by a timer interrupt.
> > 
> > Correct?
> 
> Correct, for this case, check the note below:
> 
> > 
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > > 
> > > ps: A patch fixing this same issue was discussed in this thread:
> > > https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > > 
> > > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > > to avoid having invoke_rcu() being called on grace-periods starting between
> > > guest exit and the timer IRQ. This RCU option is being discussed in a
> > > sub-thread of this message:
> > > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> ^ This one above.
> The idea is to use this rcutree.nocb_patience_delay=N :
> a new option we added on RCU that allow us to avoid invoking rcu_core() if 
> the grace_period < N miliseconds. This only works on nohz_full cpus.
> 
> So with both the current patch and the one in above link, we have the same 
> effect as we previously had with last_guest_exit, with a cherry on top: we 
> can avoid rcu_core() getting called in situations where a grace period just 
> started after going into kernel code, and a timer interrupt happened before 
> it can report quiescent state again. 
> 
> For our nohz_full vcpu thread scenario, we have:
> 
> - guest_exit note a quiescent state
> - let's say we start a grace period in the next cycle
> - If timer interrupts, it requires the grace period to be older than N 
>   miliseconds
>   - If we configure a proper value for patience, it will never reach the 
>     end of patience before going guest_entry, and thus noting a quiescent 
>     state
> 
> What do you think?

I don't fully understand all of the RCU details, but since RCU quiescent
state marking happens in IRQ disabled section, there is no chance for a
timer interrupt to conflict with the marking of quiescent state.

So seem to make sense to me.

Sean Christopherson May 13, 2024, 7:40 p.m. UTC | #8

On Fri, May 10, 2024, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.

Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
a context switch on the next VM-Enter, so even if there is extra latency or
something, KVM will eventually take the hit in the common case no matter what.
But I know some setups are sensitive to handling select VM-Exits as soon as possible.

I ask mainly because it seems like a no brainer to me to have both VM-Entry and
VM-Exit note the context switch, which begs the question of why KVM isn't already
doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
use RCU extended quiescent state when running KVM guest") handled the VM-Entry
case?

Leonardo Bras May 13, 2024, 9:47 p.m. UTC | #9

On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, May 10, 2024, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> >
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> >
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> >
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
>
> Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> a context switch on the next VM-Enter, so even if there is extra latency or
> something, KVM will eventually take the hit in the common case no matter what.
> But I know some setups are sensitive to handling select VM-Exits as soon as possible.
>
> I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> VM-Exit note the context switch, which begs the question of why KVM isn't already
> doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> case?

I don't know, by the lore I see it happening in guest entry since the
first time it was introduced at
https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/

Noting a quiescent state is cheap, but it may cost a few accesses to
possibly non-local cachelines. (Not an expert in this, Paul please let
me know if I got it wrong).

I don't have a historic context on why it was just implemented on
guest_entry, but it would make sense when we don't worry about latency
to take the entry-only approach:
- It saves the overhead of calling rcu_virt_note_context_switch()
twice per guest entry in the loop
- KVM will probably run guest entry soon after guest exit (in loop),
so there is no need to run it twice
- Eventually running rcu_core() may be cheaper than noting quiescent
state every guest entry/exit cycle

Upsides of the new strategy:
- Noting a quiescent state in guest exit avoids calling rcu_core() if
there was a grace period request while guest was running, and timer
interrupt hits the cpu.
- If the loop re-enter quickly there is a high chance that guest
entry's rcu_virt_note_context_switch() will be fast (local cacheline)
as there is low probability of a grace period request happening
between exit & re-entry.
- It allows us to use the rcu patience strategy to avoid rcu_core()
running if any grace period request happens between guest exit and
guest re-entry, which is very important for low latency workloads
running on guests as it reduces maximum latency in long runs.

What do you think?

Thanks!
Leo

Paul E. McKenney May 14, 2024, 10:54 p.m. UTC | #10

On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > as it avoids the guest being interrupted for current RCU operations.
> > >
> > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > check for any RCU operations waiting for this CPU. In case there are any of
> > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > note a quiescent state.
> > >
> > > This occasional schedule work will introduce tens of microsseconds of
> > > latency, which is really bad for vcpus running latency-sensitive
> > > applications, such as real-time workloads.
> > >
> > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > to deal with any pending RCU operations before being required to invoke
> > > rcu_core(), and thus avoid the overhead of related scheduler work.
> >
> > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > a context switch on the next VM-Enter, so even if there is extra latency or
> > something, KVM will eventually take the hit in the common case no matter what.
> > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> >
> > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > case?
> 
> I don't know, by the lore I see it happening in guest entry since the
> first time it was introduced at
> https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> 
> Noting a quiescent state is cheap, but it may cost a few accesses to
> possibly non-local cachelines. (Not an expert in this, Paul please let
> me know if I got it wrong).

Yes, it is cheap, especially if interrupts are already disabled.
(As in the scheduler asks RCU to do the same amount of work on its
context-switch fastpath.)

> I don't have a historic context on why it was just implemented on
> guest_entry, but it would make sense when we don't worry about latency
> to take the entry-only approach:
> - It saves the overhead of calling rcu_virt_note_context_switch()
> twice per guest entry in the loop
> - KVM will probably run guest entry soon after guest exit (in loop),
> so there is no need to run it twice
> - Eventually running rcu_core() may be cheaper than noting quiescent
> state every guest entry/exit cycle
> 
> Upsides of the new strategy:
> - Noting a quiescent state in guest exit avoids calling rcu_core() if
> there was a grace period request while guest was running, and timer
> interrupt hits the cpu.
> - If the loop re-enter quickly there is a high chance that guest
> entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> as there is low probability of a grace period request happening
> between exit & re-entry.
> - It allows us to use the rcu patience strategy to avoid rcu_core()
> running if any grace period request happens between guest exit and
> guest re-entry, which is very important for low latency workloads
> running on guests as it reduces maximum latency in long runs.
> 
> What do you think?

Try both on the workload of interest with appropriate tracing and
see what happens?  The hardware's opinion overrides mine.  ;-)

							Thanx, Paul

Leonardo Bras May 15, 2024, 4:45 a.m. UTC | #11

On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > as it avoids the guest being interrupted for current RCU operations.
> > > >
> > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > note a quiescent state.
> > > >
> > > > This occasional schedule work will introduce tens of microsseconds of
> > > > latency, which is really bad for vcpus running latency-sensitive
> > > > applications, such as real-time workloads.
> > > >
> > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > to deal with any pending RCU operations before being required to invoke
> > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > >
> > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > something, KVM will eventually take the hit in the common case no matter what.
> > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > >
> > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > case?
> > 
> > I don't know, by the lore I see it happening in guest entry since the
> > first time it was introduced at
> > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > 
> > Noting a quiescent state is cheap, but it may cost a few accesses to
> > possibly non-local cachelines. (Not an expert in this, Paul please let
> > me know if I got it wrong).
> 
> Yes, it is cheap, especially if interrupts are already disabled.
> (As in the scheduler asks RCU to do the same amount of work on its
> context-switch fastpath.)

Thanks!

> 
> > I don't have a historic context on why it was just implemented on
> > guest_entry, but it would make sense when we don't worry about latency
> > to take the entry-only approach:
> > - It saves the overhead of calling rcu_virt_note_context_switch()
> > twice per guest entry in the loop
> > - KVM will probably run guest entry soon after guest exit (in loop),
> > so there is no need to run it twice
> > - Eventually running rcu_core() may be cheaper than noting quiescent
> > state every guest entry/exit cycle
> > 
> > Upsides of the new strategy:
> > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > there was a grace period request while guest was running, and timer
> > interrupt hits the cpu.
> > - If the loop re-enter quickly there is a high chance that guest
> > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > as there is low probability of a grace period request happening
> > between exit & re-entry.
> > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > running if any grace period request happens between guest exit and
> > guest re-entry, which is very important for low latency workloads
> > running on guests as it reduces maximum latency in long runs.
> > 
> > What do you think?
> 
> Try both on the workload of interest with appropriate tracing and
> see what happens?  The hardware's opinion overrides mine.  ;-)

That's a great approach!

But in this case I think noting a quiescent state in guest exit is 
necessary to avoid a scenario in which a VM takes longer than RCU 
patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
exit was quite brief. 

IIUC Sean's question is more on the tone of "Why KVM does not note a 
quiescent state in guest exit already, if it does in guest entry", and I 
just came with a few arguments to try finding a possible rationale, since 
I could find no discussion on that topic in the lore for the original 
commit.

Since noting a quiescent state in guest exit is cheap enough, avoids rcuc 
schedules when grace period starts during guest execution, and enables a 
much more rational usage of RCU patience, it's a safe to assume it's a 
better way of dealing with RCU compared to current implementation.

Sean, what do you think?

Thanks!
Leo

> 
> 							Thanx, Paul
>

Paul E. McKenney May 15, 2024, 2:57 p.m. UTC | #12

On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > >
> > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > note a quiescent state.
> > > > >
> > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > applications, such as real-time workloads.
> > > > >
> > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > to deal with any pending RCU operations before being required to invoke
> > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > >
> > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > >
> > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > case?
> > > 
> > > I don't know, by the lore I see it happening in guest entry since the
> > > first time it was introduced at
> > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > 
> > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > me know if I got it wrong).
> > 
> > Yes, it is cheap, especially if interrupts are already disabled.
> > (As in the scheduler asks RCU to do the same amount of work on its
> > context-switch fastpath.)
> 
> Thanks!
> 
> > 
> > > I don't have a historic context on why it was just implemented on
> > > guest_entry, but it would make sense when we don't worry about latency
> > > to take the entry-only approach:
> > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > twice per guest entry in the loop
> > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > so there is no need to run it twice
> > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > state every guest entry/exit cycle
> > > 
> > > Upsides of the new strategy:
> > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > there was a grace period request while guest was running, and timer
> > > interrupt hits the cpu.
> > > - If the loop re-enter quickly there is a high chance that guest
> > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > as there is low probability of a grace period request happening
> > > between exit & re-entry.
> > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > running if any grace period request happens between guest exit and
> > > guest re-entry, which is very important for low latency workloads
> > > running on guests as it reduces maximum latency in long runs.
> > > 
> > > What do you think?
> > 
> > Try both on the workload of interest with appropriate tracing and
> > see what happens?  The hardware's opinion overrides mine.  ;-)
> 
> That's a great approach!
> 
> But in this case I think noting a quiescent state in guest exit is 
> necessary to avoid a scenario in which a VM takes longer than RCU 
> patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> exit was quite brief. 
> 
> IIUC Sean's question is more on the tone of "Why KVM does not note a 
> quiescent state in guest exit already, if it does in guest entry", and I 
> just came with a few arguments to try finding a possible rationale, since 
> I could find no discussion on that topic in the lore for the original 
> commit.

Understood, and maybe trying it would answer that question quickly.
Don't get me wrong, just because it appears to work in a few tests doesn't
mean that it really works, but if it visibly blows up, that answers the
question quite quickly and easily.  ;-)

But yes, if it appears to work, there must be a full investigation into
whether or not the change really is safe.

							Thanx, Paul

> Since noting a quiescent state in guest exit is cheap enough, avoids rcuc 
> schedules when grace period starts during guest execution, and enables a 
> much more rational usage of RCU patience, it's a safe to assume it's a 
> better way of dealing with RCU compared to current implementation.
> 
> Sean, what do you think?
> 
> Thanks!
> Leo
> 
> > 
> > 							Thanx, Paul
> > 
>

Leonardo Bras June 20, 2024, 7:03 a.m. UTC | #13

On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > >
> > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > note a quiescent state.
> > > > > >
> > > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > applications, such as real-time workloads.
> > > > > >
> > > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > >
> > > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > >
> > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > case?
> > > > 
> > > > I don't know, by the lore I see it happening in guest entry since the
> > > > first time it was introduced at
> > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > 
> > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > me know if I got it wrong).
> > > 
> > > Yes, it is cheap, especially if interrupts are already disabled.
> > > (As in the scheduler asks RCU to do the same amount of work on its
> > > context-switch fastpath.)
> > 
> > Thanks!
> > 
> > > 
> > > > I don't have a historic context on why it was just implemented on
> > > > guest_entry, but it would make sense when we don't worry about latency
> > > > to take the entry-only approach:
> > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > twice per guest entry in the loop
> > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > so there is no need to run it twice
> > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > state every guest entry/exit cycle
> > > > 
> > > > Upsides of the new strategy:
> > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > there was a grace period request while guest was running, and timer
> > > > interrupt hits the cpu.
> > > > - If the loop re-enter quickly there is a high chance that guest
> > > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > as there is low probability of a grace period request happening
> > > > between exit & re-entry.
> > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > running if any grace period request happens between guest exit and
> > > > guest re-entry, which is very important for low latency workloads
> > > > running on guests as it reduces maximum latency in long runs.
> > > > 
> > > > What do you think?
> > > 
> > > Try both on the workload of interest with appropriate tracing and
> > > see what happens?  The hardware's opinion overrides mine.  ;-)
> > 
> > That's a great approach!
> > 
> > But in this case I think noting a quiescent state in guest exit is 
> > necessary to avoid a scenario in which a VM takes longer than RCU 
> > patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> > exit was quite brief. 
> > 
> > IIUC Sean's question is more on the tone of "Why KVM does not note a 
> > quiescent state in guest exit already, if it does in guest entry", and I 
> > just came with a few arguments to try finding a possible rationale, since 
> > I could find no discussion on that topic in the lore for the original 
> > commit.
> 
> Understood, and maybe trying it would answer that question quickly.
> Don't get me wrong, just because it appears to work in a few tests doesn't
> mean that it really works, but if it visibly blows up, that answers the
> question quite quickly and easily.  ;-)
> 
> But yes, if it appears to work, there must be a full investigation into
> whether or not the change really is safe.
> 
> 							Thanx, Paul

Hello Paul, Sean, sorry for the delay on this.

I tested x86 by counting cycles (using rdtsc_ordered()).

Cycles were counted upon function entry/exit on 
{svm,vmx}_vcpu_enter_exit(), and right before / after 
__{svm,vmx}_vcpu_run() in the same function.

The main idea was to get cycles spend in the procedures before entering 
guest (such as reporting RCU quiescent state in entry / exit) and the 
cycles actually used by the VM. 

Those cycles were summed-up and stored in per-cpu structures, with a 
counter to get the average value. I then created a debug file to read the 
results and reset the counters.

As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.

The workload inside the VM consisted in cyclictest in 16 vcpus 
(SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
(SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 

 $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
  --mainaffinity 0-3

All tests were run for exaclty 1 hour, and the clock counter was reset at 
the same moment cyclictest stared. After that VM was poweroff from guest.
Results show the average for all CPUs in the same category, in cycles.

With above setup, I tested 2 use cases:
1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
    RCU patience = 1000ms (best case for RT)

Results are:
# Test case 1:
Vanilla: (average on all vcpus)
VM Cycles / RT vcpu:		123287.75 
VM Cycles / non-RT vcpu:	709847.25
Setup Cycles:			186.00
VM entries / RT vcpu:		58737094.81
VM entries / non-RT vcpu:	10527869.25
Total cycles in RT VM:		7241564260969.80
Total cycles in non-RT VM:	7473179035472.06

Patched: (average on all vcpus)
VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
Setup Cycles:			218.65           (+17.55%)
VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)

Discussion:
Setup cycles raised in ~33 cycles, increasing overhead.
It proves that noting a quiescent state in guest entry introduces setup 
routine costs, which is expected.

On the other hand, both the average time spend inside the VM and the number 
of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
available to run, which is positive. Extra cycles probably came from not 
having invoke_rcu_core() getting ran after VM exit.


# Test case 2:
Vanilla: (average on all vcpus)
VM Cycles / RT vcpu:		123785.63
VM Cycles / non-RT vcpu:	698758.25
Setup Cycles:			187.20
VM entries / RT vcpu:		61096820.75
VM entries / non-RT vcpu:	11191873.00
Total cycles in RT VM:		7562908142051.72
Total cycles in non-RT VM:	7820413591702.25

Patched: (average on all vcpus)
VM Cycles / RT vcpu:		123137.13        (- 0.52%)
VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
Setup Cycles:			229.35           (+22.52%)
VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)

Discussion:
Setup cycles raised in ~42 cycles, increasing overhead.
It proves that noting a quiescent state in guest entry introduces setup 
routine costs, which is expected.

The average time spend inside the VM was reduced, but the number of VM  
entries raised, causing the VM to have around the same number of cpu cycles 
available to run, meaning that the overhead caused by reporting RCU 
quiescent state in VM exit got absorbed, and it may have to do with those 
rare invoke_rcu_core()s that were bothering latency.

The difference is much smaller compared to case 1, and this is probably 
because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
which may be already inhibiting a lot of invoke_rcu_core()s.

Sean, Paul, what do you think?

Thanks!
Leo

> 
> > Since noting a quiescent state in guest exit is cheap enough, avoids rcuc 
> > schedules when grace period starts during guest execution, and enables a 
> > much more rational usage of RCU patience, it's a safe to assume it's a 
> > better way of dealing with RCU compared to current implementation.
> > 
> > Sean, what do you think?
> > 
> > Thanks!
> > Leo
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > 
>

Paul E. McKenney June 20, 2024, 5:26 p.m. UTC | #14

On Thu, Jun 20, 2024 at 04:03:41AM -0300, Leonardo Bras wrote:
> On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> > On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > >
> > > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > > >
> > > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > > note a quiescent state.
> > > > > > >
> > > > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > > applications, such as real-time workloads.
> > > > > > >
> > > > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > > >
> > > > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > > >
> > > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > > case?
> > > > > 
> > > > > I don't know, by the lore I see it happening in guest entry since the
> > > > > first time it was introduced at
> > > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > > 
> > > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > > me know if I got it wrong).
> > > > 
> > > > Yes, it is cheap, especially if interrupts are already disabled.
> > > > (As in the scheduler asks RCU to do the same amount of work on its
> > > > context-switch fastpath.)
> > > 
> > > Thanks!
> > > 
> > > > 
> > > > > I don't have a historic context on why it was just implemented on
> > > > > guest_entry, but it would make sense when we don't worry about latency
> > > > > to take the entry-only approach:
> > > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > > twice per guest entry in the loop
> > > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > > so there is no need to run it twice
> > > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > > state every guest entry/exit cycle
> > > > > 
> > > > > Upsides of the new strategy:
> > > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > > there was a grace period request while guest was running, and timer
> > > > > interrupt hits the cpu.
> > > > > - If the loop re-enter quickly there is a high chance that guest
> > > > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > > as there is low probability of a grace period request happening
> > > > > between exit & re-entry.
> > > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > > running if any grace period request happens between guest exit and
> > > > > guest re-entry, which is very important for low latency workloads
> > > > > running on guests as it reduces maximum latency in long runs.
> > > > > 
> > > > > What do you think?
> > > > 
> > > > Try both on the workload of interest with appropriate tracing and
> > > > see what happens?  The hardware's opinion overrides mine.  ;-)
> > > 
> > > That's a great approach!
> > > 
> > > But in this case I think noting a quiescent state in guest exit is 
> > > necessary to avoid a scenario in which a VM takes longer than RCU 
> > > patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> > > exit was quite brief. 
> > > 
> > > IIUC Sean's question is more on the tone of "Why KVM does not note a 
> > > quiescent state in guest exit already, if it does in guest entry", and I 
> > > just came with a few arguments to try finding a possible rationale, since 
> > > I could find no discussion on that topic in the lore for the original 
> > > commit.
> > 
> > Understood, and maybe trying it would answer that question quickly.
> > Don't get me wrong, just because it appears to work in a few tests doesn't
> > mean that it really works, but if it visibly blows up, that answers the
> > question quite quickly and easily.  ;-)
> > 
> > But yes, if it appears to work, there must be a full investigation into
> > whether or not the change really is safe.
> > 
> > 							Thanx, Paul
> 
> Hello Paul, Sean, sorry for the delay on this.
> 
> I tested x86 by counting cycles (using rdtsc_ordered()).
> 
> Cycles were counted upon function entry/exit on 
> {svm,vmx}_vcpu_enter_exit(), and right before / after 
> __{svm,vmx}_vcpu_run() in the same function.
> 
> The main idea was to get cycles spend in the procedures before entering 
> guest (such as reporting RCU quiescent state in entry / exit) and the 
> cycles actually used by the VM. 
> 
> Those cycles were summed-up and stored in per-cpu structures, with a 
> counter to get the average value. I then created a debug file to read the 
> results and reset the counters.
> 
> As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> 
> The workload inside the VM consisted in cyclictest in 16 vcpus 
> (SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
> (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 
> 
>  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
>   --mainaffinity 0-3
> 
> All tests were run for exaclty 1 hour, and the clock counter was reset at 
> the same moment cyclictest stared. After that VM was poweroff from guest.
> Results show the average for all CPUs in the same category, in cycles.
> 
> With above setup, I tested 2 use cases:
> 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
>     RCU patience = 1000ms (best case for RT)
> 
> Results are:
> # Test case 1:
> Vanilla: (average on all vcpus)
> VM Cycles / RT vcpu:		123287.75 
> VM Cycles / non-RT vcpu:	709847.25
> Setup Cycles:			186.00
> VM entries / RT vcpu:		58737094.81
> VM entries / non-RT vcpu:	10527869.25
> Total cycles in RT VM:		7241564260969.80
> Total cycles in non-RT VM:	7473179035472.06
> 
> Patched: (average on all vcpus)
> VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
> VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
> Setup Cycles:			218.65           (+17.55%)
> VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
> VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
> Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
> Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)
> 
> Discussion:
> Setup cycles raised in ~33 cycles, increasing overhead.
> It proves that noting a quiescent state in guest entry introduces setup 
> routine costs, which is expected.
> 
> On the other hand, both the average time spend inside the VM and the number 
> of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
> available to run, which is positive. Extra cycles probably came from not 
> having invoke_rcu_core() getting ran after VM exit.
> 
> 
> # Test case 2:
> Vanilla: (average on all vcpus)
> VM Cycles / RT vcpu:		123785.63
> VM Cycles / non-RT vcpu:	698758.25
> Setup Cycles:			187.20
> VM entries / RT vcpu:		61096820.75
> VM entries / non-RT vcpu:	11191873.00
> Total cycles in RT VM:		7562908142051.72
> Total cycles in non-RT VM:	7820413591702.25
> 
> Patched: (average on all vcpus)
> VM Cycles / RT vcpu:		123137.13        (- 0.52%)
> VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
> Setup Cycles:			229.35           (+22.52%)
> VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
> VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
> Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
> Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)
> 
> Discussion:
> Setup cycles raised in ~42 cycles, increasing overhead.
> It proves that noting a quiescent state in guest entry introduces setup 
> routine costs, which is expected.
> 
> The average time spend inside the VM was reduced, but the number of VM  
> entries raised, causing the VM to have around the same number of cpu cycles 
> available to run, meaning that the overhead caused by reporting RCU 
> quiescent state in VM exit got absorbed, and it may have to do with those 
> rare invoke_rcu_core()s that were bothering latency.
> 
> The difference is much smaller compared to case 1, and this is probably 
> because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
> which may be already inhibiting a lot of invoke_rcu_core()s.
> 
> Sean, Paul, what do you think?

First, thank you for doing this work!

I must defer to Sean on how to handle this tradeoff.  My kneejerk reaction
would be to add yet another knob, but knobs are not free.

							Thanx, Paul

Leonardo Bras June 25, 2024, 2:31 a.m. UTC | #15

On Thu, Jun 20, 2024 at 10:26:38AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 20, 2024 at 04:03:41AM -0300, Leonardo Bras wrote:
> > On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> > > On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > > > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > > > >
> > > > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > > > note a quiescent state.
> > > > > > > >
> > > > > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > > > applications, such as real-time workloads.
> > > > > > > >
> > > > > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > > > >
> > > > > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > > > >
> > > > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > > > case?
> > > > > > 
> > > > > > I don't know, by the lore I see it happening in guest entry since the
> > > > > > first time it was introduced at
> > > > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > > > 
> > > > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > > > me know if I got it wrong).
> > > > > 
> > > > > Yes, it is cheap, especially if interrupts are already disabled.
> > > > > (As in the scheduler asks RCU to do the same amount of work on its
> > > > > context-switch fastpath.)
> > > > 
> > > > Thanks!
> > > > 
> > > > > 
> > > > > > I don't have a historic context on why it was just implemented on
> > > > > > guest_entry, but it would make sense when we don't worry about latency
> > > > > > to take the entry-only approach:
> > > > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > > > twice per guest entry in the loop
> > > > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > > > so there is no need to run it twice
> > > > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > > > state every guest entry/exit cycle
> > > > > > 
> > > > > > Upsides of the new strategy:
> > > > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > > > there was a grace period request while guest was running, and timer
> > > > > > interrupt hits the cpu.
> > > > > > - If the loop re-enter quickly there is a high chance that guest
> > > > > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > > > as there is low probability of a grace period request happening
> > > > > > between exit & re-entry.
> > > > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > > > running if any grace period request happens between guest exit and
> > > > > > guest re-entry, which is very important for low latency workloads
> > > > > > running on guests as it reduces maximum latency in long runs.
> > > > > > 
> > > > > > What do you think?
> > > > > 
> > > > > Try both on the workload of interest with appropriate tracing and
> > > > > see what happens?  The hardware's opinion overrides mine.  ;-)
> > > > 
> > > > That's a great approach!
> > > > 
> > > > But in this case I think noting a quiescent state in guest exit is 
> > > > necessary to avoid a scenario in which a VM takes longer than RCU 
> > > > patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> > > > exit was quite brief. 
> > > > 
> > > > IIUC Sean's question is more on the tone of "Why KVM does not note a 
> > > > quiescent state in guest exit already, if it does in guest entry", and I 
> > > > just came with a few arguments to try finding a possible rationale, since 
> > > > I could find no discussion on that topic in the lore for the original 
> > > > commit.
> > > 
> > > Understood, and maybe trying it would answer that question quickly.
> > > Don't get me wrong, just because it appears to work in a few tests doesn't
> > > mean that it really works, but if it visibly blows up, that answers the
> > > question quite quickly and easily.  ;-)
> > > 
> > > But yes, if it appears to work, there must be a full investigation into
> > > whether or not the change really is safe.
> > > 
> > > 							Thanx, Paul
> > 
> > Hello Paul, Sean, sorry for the delay on this.
> > 
> > I tested x86 by counting cycles (using rdtsc_ordered()).
> > 
> > Cycles were counted upon function entry/exit on 
> > {svm,vmx}_vcpu_enter_exit(), and right before / after 
> > __{svm,vmx}_vcpu_run() in the same function.
> > 
> > The main idea was to get cycles spend in the procedures before entering 
> > guest (such as reporting RCU quiescent state in entry / exit) and the 
> > cycles actually used by the VM. 
> > 
> > Those cycles were summed-up and stored in per-cpu structures, with a 
> > counter to get the average value. I then created a debug file to read the 
> > results and reset the counters.
> > 
> > As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> > 
> > The workload inside the VM consisted in cyclictest in 16 vcpus 
> > (SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
> > (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 
> > 
> >  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
> >   --mainaffinity 0-3
> > 
> > All tests were run for exaclty 1 hour, and the clock counter was reset at 
> > the same moment cyclictest stared. After that VM was poweroff from guest.
> > Results show the average for all CPUs in the same category, in cycles.
> > 
> > With above setup, I tested 2 use cases:
> > 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> > 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
> >     RCU patience = 1000ms (best case for RT)
> > 
> > Results are:
> > # Test case 1:
> > Vanilla: (average on all vcpus)
> > VM Cycles / RT vcpu:		123287.75 
> > VM Cycles / non-RT vcpu:	709847.25
> > Setup Cycles:			186.00
> > VM entries / RT vcpu:		58737094.81
> > VM entries / non-RT vcpu:	10527869.25
> > Total cycles in RT VM:		7241564260969.80
> > Total cycles in non-RT VM:	7473179035472.06
> > 
> > Patched: (average on all vcpus)
> > VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
> > VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
> > Setup Cycles:			218.65           (+17.55%)
> > VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
> > VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
> > Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
> > Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)
> > 
> > Discussion:
> > Setup cycles raised in ~33 cycles, increasing overhead.
> > It proves that noting a quiescent state in guest entry introduces setup 
> > routine costs, which is expected.
> > 
> > On the other hand, both the average time spend inside the VM and the number 
> > of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
> > available to run, which is positive. Extra cycles probably came from not 
> > having invoke_rcu_core() getting ran after VM exit.
> > 
> > 
> > # Test case 2:
> > Vanilla: (average on all vcpus)
> > VM Cycles / RT vcpu:		123785.63
> > VM Cycles / non-RT vcpu:	698758.25
> > Setup Cycles:			187.20
> > VM entries / RT vcpu:		61096820.75
> > VM entries / non-RT vcpu:	11191873.00
> > Total cycles in RT VM:		7562908142051.72
> > Total cycles in non-RT VM:	7820413591702.25
> > 
> > Patched: (average on all vcpus)
> > VM Cycles / RT vcpu:		123137.13        (- 0.52%)
> > VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
> > Setup Cycles:			229.35           (+22.52%)
> > VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
> > VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
> > Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
> > Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)
> > 
> > Discussion:
> > Setup cycles raised in ~42 cycles, increasing overhead.
> > It proves that noting a quiescent state in guest entry introduces setup 
> > routine costs, which is expected.
> > 
> > The average time spend inside the VM was reduced, but the number of VM  
> > entries raised, causing the VM to have around the same number of cpu cycles 
> > available to run, meaning that the overhead caused by reporting RCU 
> > quiescent state in VM exit got absorbed, and it may have to do with those 
> > rare invoke_rcu_core()s that were bothering latency.
> > 
> > The difference is much smaller compared to case 1, and this is probably 
> > because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
> > which may be already inhibiting a lot of invoke_rcu_core()s.
> > 
> > Sean, Paul, what do you think?
> 
> First, thank you for doing this work!

Thanks you!

> 
> I must defer to Sean on how to handle this tradeoff.  My kneejerk reaction
> would be to add yet another knob, but knobs are not free.

Sure, let's wait for Sean's feedback. But I am very positive with the 
results, as we have lantecy improvements for RT, and IIUC performance 
improvements for non-RT.

While we wait, I ran stress-ng in a VM with non-RT kernel (rcu.patience=0), 
and could see instruction count raising in around 1.5% after applying the patch.

Maybe I am missing something, but I thought we could get the ~4.5% 
improvement as seen above.

Thanks!
Leo


> 
> 							Thanx, Paul
>

Leonardo Bras June 25, 2024, 2:34 a.m. UTC | #16

On Mon, Jun 24, 2024 at 11:31:51PM -0300, Leonardo Bras wrote:
> On Thu, Jun 20, 2024 at 10:26:38AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 20, 2024 at 04:03:41AM -0300, Leonardo Bras wrote:
> > > On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> > > > On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > > > > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > > > > >
> > > > > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > > > > note a quiescent state.
> > > > > > > > >
> > > > > > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > > > > applications, such as real-time workloads.
> > > > > > > > >
> > > > > > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > > > > >
> > > > > > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > > > > >
> > > > > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > > > > case?
> > > > > > > 
> > > > > > > I don't know, by the lore I see it happening in guest entry since the
> > > > > > > first time it was introduced at
> > > > > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > > > > 
> > > > > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > > > > me know if I got it wrong).
> > > > > > 
> > > > > > Yes, it is cheap, especially if interrupts are already disabled.
> > > > > > (As in the scheduler asks RCU to do the same amount of work on its
> > > > > > context-switch fastpath.)
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > > 
> > > > > > > I don't have a historic context on why it was just implemented on
> > > > > > > guest_entry, but it would make sense when we don't worry about latency
> > > > > > > to take the entry-only approach:
> > > > > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > > > > twice per guest entry in the loop
> > > > > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > > > > so there is no need to run it twice
> > > > > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > > > > state every guest entry/exit cycle
> > > > > > > 
> > > > > > > Upsides of the new strategy:
> > > > > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > > > > there was a grace period request while guest was running, and timer
> > > > > > > interrupt hits the cpu.
> > > > > > > - If the loop re-enter quickly there is a high chance that guest
> > > > > > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > > > > as there is low probability of a grace period request happening
> > > > > > > between exit & re-entry.
> > > > > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > > > > running if any grace period request happens between guest exit and
> > > > > > > guest re-entry, which is very important for low latency workloads
> > > > > > > running on guests as it reduces maximum latency in long runs.
> > > > > > > 
> > > > > > > What do you think?
> > > > > > 
> > > > > > Try both on the workload of interest with appropriate tracing and
> > > > > > see what happens?  The hardware's opinion overrides mine.  ;-)
> > > > > 
> > > > > That's a great approach!
> > > > > 
> > > > > But in this case I think noting a quiescent state in guest exit is 
> > > > > necessary to avoid a scenario in which a VM takes longer than RCU 
> > > > > patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> > > > > exit was quite brief. 
> > > > > 
> > > > > IIUC Sean's question is more on the tone of "Why KVM does not note a 
> > > > > quiescent state in guest exit already, if it does in guest entry", and I 
> > > > > just came with a few arguments to try finding a possible rationale, since 
> > > > > I could find no discussion on that topic in the lore for the original 
> > > > > commit.
> > > > 
> > > > Understood, and maybe trying it would answer that question quickly.
> > > > Don't get me wrong, just because it appears to work in a few tests doesn't
> > > > mean that it really works, but if it visibly blows up, that answers the
> > > > question quite quickly and easily.  ;-)
> > > > 
> > > > But yes, if it appears to work, there must be a full investigation into
> > > > whether or not the change really is safe.
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > Hello Paul, Sean, sorry for the delay on this.
> > > 
> > > I tested x86 by counting cycles (using rdtsc_ordered()).
> > > 
> > > Cycles were counted upon function entry/exit on 
> > > {svm,vmx}_vcpu_enter_exit(), and right before / after 
> > > __{svm,vmx}_vcpu_run() in the same function.
> > > 
> > > The main idea was to get cycles spend in the procedures before entering 
> > > guest (such as reporting RCU quiescent state in entry / exit) and the 
> > > cycles actually used by the VM. 
> > > 
> > > Those cycles were summed-up and stored in per-cpu structures, with a 
> > > counter to get the average value. I then created a debug file to read the 
> > > results and reset the counters.
> > > 
> > > As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> > > 
> > > The workload inside the VM consisted in cyclictest in 16 vcpus 
> > > (SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
> > > (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 
> > > 
> > >  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
> > >   --mainaffinity 0-3
> > > 
> > > All tests were run for exaclty 1 hour, and the clock counter was reset at 
> > > the same moment cyclictest stared. After that VM was poweroff from guest.
> > > Results show the average for all CPUs in the same category, in cycles.
> > > 
> > > With above setup, I tested 2 use cases:
> > > 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> > > 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
> > >     RCU patience = 1000ms (best case for RT)
> > > 
> > > Results are:
> > > # Test case 1:
> > > Vanilla: (average on all vcpus)
> > > VM Cycles / RT vcpu:		123287.75 
> > > VM Cycles / non-RT vcpu:	709847.25
> > > Setup Cycles:			186.00
> > > VM entries / RT vcpu:		58737094.81
> > > VM entries / non-RT vcpu:	10527869.25
> > > Total cycles in RT VM:		7241564260969.80
> > > Total cycles in non-RT VM:	7473179035472.06
> > > 
> > > Patched: (average on all vcpus)
> > > VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
> > > VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
> > > Setup Cycles:			218.65           (+17.55%)
> > > VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
> > > VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
> > > Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
> > > Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)
> > > 
> > > Discussion:
> > > Setup cycles raised in ~33 cycles, increasing overhead.
> > > It proves that noting a quiescent state in guest entry introduces setup 
> > > routine costs, which is expected.
> > > 
> > > On the other hand, both the average time spend inside the VM and the number 
> > > of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
> > > available to run, which is positive. Extra cycles probably came from not 
> > > having invoke_rcu_core() getting ran after VM exit.
> > > 
> > > 
> > > # Test case 2:
> > > Vanilla: (average on all vcpus)
> > > VM Cycles / RT vcpu:		123785.63
> > > VM Cycles / non-RT vcpu:	698758.25
> > > Setup Cycles:			187.20
> > > VM entries / RT vcpu:		61096820.75
> > > VM entries / non-RT vcpu:	11191873.00
> > > Total cycles in RT VM:		7562908142051.72
> > > Total cycles in non-RT VM:	7820413591702.25
> > > 
> > > Patched: (average on all vcpus)
> > > VM Cycles / RT vcpu:		123137.13        (- 0.52%)
> > > VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
> > > Setup Cycles:			229.35           (+22.52%)
> > > VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
> > > VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
> > > Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
> > > Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)
> > > 
> > > Discussion:
> > > Setup cycles raised in ~42 cycles, increasing overhead.
> > > It proves that noting a quiescent state in guest entry introduces setup 
> > > routine costs, which is expected.
> > > 
> > > The average time spend inside the VM was reduced, but the number of VM  
> > > entries raised, causing the VM to have around the same number of cpu cycles 
> > > available to run, meaning that the overhead caused by reporting RCU 
> > > quiescent state in VM exit got absorbed, and it may have to do with those 
> > > rare invoke_rcu_core()s that were bothering latency.
> > > 
> > > The difference is much smaller compared to case 1, and this is probably 
> > > because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
> > > which may be already inhibiting a lot of invoke_rcu_core()s.
> > > 
> > > Sean, Paul, what do you think?
> > 
> > First, thank you for doing this work!
> 
> Thanks you!
> 
> > 
> > I must defer to Sean on how to handle this tradeoff.  My kneejerk reaction
> > would be to add yet another knob, but knobs are not free.
> 
> Sure, let's wait for Sean's feedback. But I am very positive with the 
> results, as we have lantecy improvements for RT, and IIUC performance 
> improvements for non-RT.
> 
> While we wait, I ran stress-ng in a VM with non-RT kernel (rcu.patience=0), 

More info here:
The VM was on a host with non-RT kernel, without cpu isolation and with 
unset rcu.patience.

> and could see instruction count raising in around 1.5% after applying the patch.
> 
> Maybe I am missing something, but I thought we could get the ~4.5% 
> improvement as seen above.
> 
> Thanks!
> Leo
> 
> 
> > 
> > 							Thanx, Paul
> >

Leonardo Bras July 10, 2024, 11:18 p.m. UTC | #17

On Mon, Jun 24, 2024 at 11:34:04PM -0300, Leonardo Bras wrote:
> On Mon, Jun 24, 2024 at 11:31:51PM -0300, Leonardo Bras wrote:
> > On Thu, Jun 20, 2024 at 10:26:38AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 20, 2024 at 04:03:41AM -0300, Leonardo Bras wrote:
> > > > On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > > > > > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > > > > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > > > > > >
> > > > > > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > > > > > note a quiescent state.
> > > > > > > > > >
> > > > > > > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > > > > > applications, such as real-time workloads.
> > > > > > > > > >
> > > > > > > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > > > > > >
> > > > > > > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > > > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > > > > > >
> > > > > > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > > > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > > > > > case?
> > > > > > > > 
> > > > > > > > I don't know, by the lore I see it happening in guest entry since the
> > > > > > > > first time it was introduced at
> > > > > > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > > > > > 
> > > > > > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > > > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > > > > > me know if I got it wrong).
> > > > > > > 
> > > > > > > Yes, it is cheap, especially if interrupts are already disabled.
> > > > > > > (As in the scheduler asks RCU to do the same amount of work on its
> > > > > > > context-switch fastpath.)
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > > > 
> > > > > > > > I don't have a historic context on why it was just implemented on
> > > > > > > > guest_entry, but it would make sense when we don't worry about latency
> > > > > > > > to take the entry-only approach:
> > > > > > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > > > > > twice per guest entry in the loop
> > > > > > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > > > > > so there is no need to run it twice
> > > > > > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > > > > > state every guest entry/exit cycle
> > > > > > > > 
> > > > > > > > Upsides of the new strategy:
> > > > > > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > > > > > there was a grace period request while guest was running, and timer
> > > > > > > > interrupt hits the cpu.
> > > > > > > > - If the loop re-enter quickly there is a high chance that guest
> > > > > > > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > > > > > as there is low probability of a grace period request happening
> > > > > > > > between exit & re-entry.
> > > > > > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > > > > > running if any grace period request happens between guest exit and
> > > > > > > > guest re-entry, which is very important for low latency workloads
> > > > > > > > running on guests as it reduces maximum latency in long runs.
> > > > > > > > 
> > > > > > > > What do you think?
> > > > > > > 
> > > > > > > Try both on the workload of interest with appropriate tracing and
> > > > > > > see what happens?  The hardware's opinion overrides mine.  ;-)
> > > > > > 
> > > > > > That's a great approach!
> > > > > > 
> > > > > > But in this case I think noting a quiescent state in guest exit is 
> > > > > > necessary to avoid a scenario in which a VM takes longer than RCU 
> > > > > > patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> > > > > > exit was quite brief. 
> > > > > > 
> > > > > > IIUC Sean's question is more on the tone of "Why KVM does not note a 
> > > > > > quiescent state in guest exit already, if it does in guest entry", and I 
> > > > > > just came with a few arguments to try finding a possible rationale, since 
> > > > > > I could find no discussion on that topic in the lore for the original 
> > > > > > commit.
> > > > > 
> > > > > Understood, and maybe trying it would answer that question quickly.
> > > > > Don't get me wrong, just because it appears to work in a few tests doesn't
> > > > > mean that it really works, but if it visibly blows up, that answers the
> > > > > question quite quickly and easily.  ;-)
> > > > > 
> > > > > But yes, if it appears to work, there must be a full investigation into
> > > > > whether or not the change really is safe.
> > > > > 
> > > > > 							Thanx, Paul
> > > > 
> > > > Hello Paul, Sean, sorry for the delay on this.
> > > > 
> > > > I tested x86 by counting cycles (using rdtsc_ordered()).
> > > > 
> > > > Cycles were counted upon function entry/exit on 
> > > > {svm,vmx}_vcpu_enter_exit(), and right before / after 
> > > > __{svm,vmx}_vcpu_run() in the same function.
> > > > 
> > > > The main idea was to get cycles spend in the procedures before entering 
> > > > guest (such as reporting RCU quiescent state in entry / exit) and the 
> > > > cycles actually used by the VM. 
> > > > 
> > > > Those cycles were summed-up and stored in per-cpu structures, with a 
> > > > counter to get the average value. I then created a debug file to read the 
> > > > results and reset the counters.
> > > > 
> > > > As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> > > > 
> > > > The workload inside the VM consisted in cyclictest in 16 vcpus 
> > > > (SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
> > > > (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 
> > > > 
> > > >  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
> > > >   --mainaffinity 0-3
> > > > 
> > > > All tests were run for exaclty 1 hour, and the clock counter was reset at 
> > > > the same moment cyclictest stared. After that VM was poweroff from guest.
> > > > Results show the average for all CPUs in the same category, in cycles.
> > > > 
> > > > With above setup, I tested 2 use cases:
> > > > 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> > > > 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
> > > >     RCU patience = 1000ms (best case for RT)
> > > > 
> > > > Results are:
> > > > # Test case 1:
> > > > Vanilla: (average on all vcpus)
> > > > VM Cycles / RT vcpu:		123287.75 
> > > > VM Cycles / non-RT vcpu:	709847.25
> > > > Setup Cycles:			186.00
> > > > VM entries / RT vcpu:		58737094.81
> > > > VM entries / non-RT vcpu:	10527869.25
> > > > Total cycles in RT VM:		7241564260969.80
> > > > Total cycles in non-RT VM:	7473179035472.06
> > > > 
> > > > Patched: (average on all vcpus)
> > > > VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
> > > > VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
> > > > Setup Cycles:			218.65           (+17.55%)
> > > > VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
> > > > VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
> > > > Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
> > > > Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)
> > > > 
> > > > Discussion:
> > > > Setup cycles raised in ~33 cycles, increasing overhead.
> > > > It proves that noting a quiescent state in guest entry introduces setup 
> > > > routine costs, which is expected.
> > > > 
> > > > On the other hand, both the average time spend inside the VM and the number 
> > > > of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
> > > > available to run, which is positive. Extra cycles probably came from not 
> > > > having invoke_rcu_core() getting ran after VM exit.
> > > > 
> > > > 
> > > > # Test case 2:
> > > > Vanilla: (average on all vcpus)
> > > > VM Cycles / RT vcpu:		123785.63
> > > > VM Cycles / non-RT vcpu:	698758.25
> > > > Setup Cycles:			187.20
> > > > VM entries / RT vcpu:		61096820.75
> > > > VM entries / non-RT vcpu:	11191873.00
> > > > Total cycles in RT VM:		7562908142051.72
> > > > Total cycles in non-RT VM:	7820413591702.25
> > > > 
> > > > Patched: (average on all vcpus)
> > > > VM Cycles / RT vcpu:		123137.13        (- 0.52%)
> > > > VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
> > > > Setup Cycles:			229.35           (+22.52%)
> > > > VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
> > > > VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
> > > > Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
> > > > Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)
> > > > 
> > > > Discussion:
> > > > Setup cycles raised in ~42 cycles, increasing overhead.
> > > > It proves that noting a quiescent state in guest entry introduces setup 
> > > > routine costs, which is expected.
> > > > 
> > > > The average time spend inside the VM was reduced, but the number of VM  
> > > > entries raised, causing the VM to have around the same number of cpu cycles 
> > > > available to run, meaning that the overhead caused by reporting RCU 
> > > > quiescent state in VM exit got absorbed, and it may have to do with those 
> > > > rare invoke_rcu_core()s that were bothering latency.
> > > > 
> > > > The difference is much smaller compared to case 1, and this is probably 
> > > > because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
> > > > which may be already inhibiting a lot of invoke_rcu_core()s.
> > > > 
> > > > Sean, Paul, what do you think?
> > > 
> > > First, thank you for doing this work!
> > 
> > Thanks you!
> > 
> > > 
> > > I must defer to Sean on how to handle this tradeoff.  My kneejerk reaction
> > > would be to add yet another knob, but knobs are not free.
> > 
> > Sure, let's wait for Sean's feedback. But I am very positive with the 
> > results, as we have lantecy improvements for RT, and IIUC performance 
> > improvements for non-RT.
> > 
> > While we wait, I ran stress-ng in a VM with non-RT kernel (rcu.patience=0), 
> 
> More info here:
> The VM was on a host with non-RT kernel, without cpu isolation and with 
> unset rcu.patience.
> 
> > and could see instruction count raising in around 1.5% after applying the patch.
> > 
> > Maybe I am missing something, but I thought we could get the ~4.5% 
> > improvement as seen above.
> > 


Hello Sean,

What are your thoughts on above results?
Anything you would suggest changing?

Thanks!
Leo

Paolo Bonzini July 12, 2024, 3:57 p.m. UTC | #18

On 7/11/24 01:18, Leonardo Bras wrote:
> What are your thoughts on above results?
> Anything you would suggest changing?

Can you run the test with a conditional on "!tick_nohz_full_cpu(vcpu->cpu)"?

If your hunch is correct that nohz-full CPUs already avoid 
invoke_rcu_core() you might get the best of both worlds.

tick_nohz_full_cpu() is very fast when there is no nohz-full CPU, 
because then it shortcuts on context_tracking_enabled() (which is just a 
static key).

Paolo

Leonardo Bras July 12, 2024, 8:02 p.m. UTC | #19

On Fri, Jul 12, 2024 at 05:57:10PM +0200, Paolo Bonzini wrote:
> On 7/11/24 01:18, Leonardo Bras wrote:
> > What are your thoughts on above results?
> > Anything you would suggest changing?
> 

Hello Paolo, thanks for the feedback!

> Can you run the test with a conditional on "!tick_nohz_full_cpu(vcpu->cpu)"?
> 
> If your hunch is correct that nohz-full CPUs already avoid invoke_rcu_core()
> you might get the best of both worlds.
> 
> tick_nohz_full_cpu() is very fast when there is no nohz-full CPU, because
> then it shortcuts on context_tracking_enabled() (which is just a static
> key).

But that would mean not noting an RCU quiescent state in guest_exit of 
nohz_full cpus, right?

The original issue we were dealing was having invoke_rcu_core() running on 
nohz_full cpus, and messing up the latency of RT workloads inside the VM.

While most of the invoke_rcu_core() get ignored by the nohz_full rule, 
there are some scenarios in which it the vcpu thread may take more than 1s 
between a guest_entry and the next one (VM busy), and those which did 
not get ignored have caused latency peaks in our tests.

The main idea of this patch is to note RCU quiescent states on guest_exit 
at nohz_full cpus (and use rcu.patience) to avoid running invoke_rcu_core()
between a guest_exit and the next guest_entry if it takes less than 
rcu.patience miliseconds between exit and entry, and thus avoiding the 
latency increase.

What I tried to prove above is that it also improves non-Isolated cores as 
well, since rcu_core will not be running as often, saving cpu cycles that 
can be used by the VM.

What are your thoughts on that?

Thanks!
Leo

Leonardo Bras July 29, 2024, 11:28 a.m. UTC | #20

On Fri, Jul 12, 2024 at 5:02 PM Leonardo Bras <leobras@redhat.com> wrote:
>
> On Fri, Jul 12, 2024 at 05:57:10PM +0200, Paolo Bonzini wrote:
> > On 7/11/24 01:18, Leonardo Bras wrote:
> > > What are your thoughts on above results?
> > > Anything you would suggest changing?
> >
>
> Hello Paolo, thanks for the feedback!
>
> > Can you run the test with a conditional on "!tick_nohz_full_cpu(vcpu->cpu)"?
> >
> > If your hunch is correct that nohz-full CPUs already avoid invoke_rcu_core()
> > you might get the best of both worlds.
> >
> > tick_nohz_full_cpu() is very fast when there is no nohz-full CPU, because
> > then it shortcuts on context_tracking_enabled() (which is just a static
> > key).
>
> But that would mean not noting an RCU quiescent state in guest_exit of
> nohz_full cpus, right?
>
> The original issue we were dealing was having invoke_rcu_core() running on
> nohz_full cpus, and messing up the latency of RT workloads inside the VM.
>
> While most of the invoke_rcu_core() get ignored by the nohz_full rule,
> there are some scenarios in which it the vcpu thread may take more than 1s
> between a guest_entry and the next one (VM busy), and those which did
> not get ignored have caused latency peaks in our tests.
>
> The main idea of this patch is to note RCU quiescent states on guest_exit
> at nohz_full cpus (and use rcu.patience) to avoid running invoke_rcu_core()
> between a guest_exit and the next guest_entry if it takes less than
> rcu.patience miliseconds between exit and entry, and thus avoiding the
> latency increase.
>
> What I tried to prove above is that it also improves non-Isolated cores as
> well, since rcu_core will not be running as often, saving cpu cycles that
> can be used by the VM.
>
>
> What are your thoughts on that?

Hello Paolo, Sean,
Thanks for the feedback so far!

Do you have any thoughts or suggestions for this patch?

Thanks!
Leo

>
> Thanks!
> Leo

Leonardo Bras Aug. 27, 2024, 7:50 p.m. UTC | #21

Hi Sean,

Have you had the time to review this?

QE team is hitting this bug a lot, and I am afraid that it will start
to hit customers soon.

Please let me know if you need any further data / assistance.

Thanks!
Leo


On Mon, Jul 29, 2024 at 8:28 AM Leonardo Bras Soares Passos
<leobras@redhat.com> wrote:
>
> On Fri, Jul 12, 2024 at 5:02 PM Leonardo Bras <leobras@redhat.com> wrote:
> >
> > On Fri, Jul 12, 2024 at 05:57:10PM +0200, Paolo Bonzini wrote:
> > > On 7/11/24 01:18, Leonardo Bras wrote:
> > > > What are your thoughts on above results?
> > > > Anything you would suggest changing?
> > >
> >
> > Hello Paolo, thanks for the feedback!
> >
> > > Can you run the test with a conditional on "!tick_nohz_full_cpu(vcpu->cpu)"?
> > >
> > > If your hunch is correct that nohz-full CPUs already avoid invoke_rcu_core()
> > > you might get the best of both worlds.
> > >
> > > tick_nohz_full_cpu() is very fast when there is no nohz-full CPU, because
> > > then it shortcuts on context_tracking_enabled() (which is just a static
> > > key).
> >
> > But that would mean not noting an RCU quiescent state in guest_exit of
> > nohz_full cpus, right?
> >
> > The original issue we were dealing was having invoke_rcu_core() running on
> > nohz_full cpus, and messing up the latency of RT workloads inside the VM.
> >
> > While most of the invoke_rcu_core() get ignored by the nohz_full rule,
> > there are some scenarios in which it the vcpu thread may take more than 1s
> > between a guest_entry and the next one (VM busy), and those which did
> > not get ignored have caused latency peaks in our tests.
> >
> > The main idea of this patch is to note RCU quiescent states on guest_exit
> > at nohz_full cpus (and use rcu.patience) to avoid running invoke_rcu_core()
> > between a guest_exit and the next guest_entry if it takes less than
> > rcu.patience miliseconds between exit and entry, and thus avoiding the
> > latency increase.
> >
> > What I tried to prove above is that it also improves non-Isolated cores as
> > well, since rcu_core will not be running as often, saving cpu cycles that
> > can be used by the VM.
> >
> >
> > What are your thoughts on that?
>
> Hello Paolo, Sean,
> Thanks for the feedback so far!
>
> Do you have any thoughts or suggestions for this patch?
>
> Thanks!
> Leo
>
> >
> > Thanks!
> > Leo

Sean Christopherson Sept. 3, 2024, 6:07 p.m. UTC | #22

On Thu, Jun 20, 2024, Leonardo Bras wrote:
> On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> I tested x86 by counting cycles (using rdtsc_ordered()).
> 
> Cycles were counted upon function entry/exit on 
> {svm,vmx}_vcpu_enter_exit(), and right before / after 
> __{svm,vmx}_vcpu_run() in the same function.

Note, for posterity, this is the super-duper inner portion of KVM's run loop.
I.e. the 18-22% increase in latency sounds scary, but only because Leo used a
a relatively small portion of the entry flow for the baseline.  E.g. the total
time would be significantly higher, and thus the percentage increase much lower,
if the measurement started at the beginning of vmx_vcpu_run().

> The main idea was to get cycles spend in the procedures before entering 
> guest (such as reporting RCU quiescent state in entry / exit) and the 
> cycles actually used by the VM. 
> 
> Those cycles were summed-up and stored in per-cpu structures, with a 
> counter to get the average value. I then created a debug file to read the 
> results and reset the counters.
> 
> As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> 
> The workload inside the VM consisted in cyclictest in 16 vcpus 
> (SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
> (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 
> 
>  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
>   --mainaffinity 0-3
> 
> All tests were run for exaclty 1 hour, and the clock counter was reset at 
> the same moment cyclictest stared. After that VM was poweroff from guest.
> Results show the average for all CPUs in the same category, in cycles.
> 
> With above setup, I tested 2 use cases:
> 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
>     RCU patience = 1000ms (best case for RT)
> 
> Results are:
> # Test case 1:
> Vanilla: (average on all vcpus)
> VM Cycles / RT vcpu:		123287.75 
> VM Cycles / non-RT vcpu:	709847.25
> Setup Cycles:			186.00
> VM entries / RT vcpu:		58737094.81
> VM entries / non-RT vcpu:	10527869.25
> Total cycles in RT VM:		7241564260969.80
> Total cycles in non-RT VM:	7473179035472.06
> 
> Patched: (average on all vcpus)
> VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
> VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
> Setup Cycles:			218.65           (+17.55%)
> VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
> VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
> Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
> Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)
> 
> Discussion:
> Setup cycles raised in ~33 cycles, increasing overhead.
> It proves that noting a quiescent state in guest entry introduces setup 

Isn't the effect of the patch note a quiescent state in guest exit?  

> routine costs, which is expected.
> 
> On the other hand, both the average time spend inside the VM and the number 
> of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
> available to run, which is positive. Extra cycles probably came from not 
> having invoke_rcu_core() getting ran after VM exit.
> 
> 
> # Test case 2:
> Vanilla: (average on all vcpus)
> VM Cycles / RT vcpu:		123785.63
> VM Cycles / non-RT vcpu:	698758.25
> Setup Cycles:			187.20
> VM entries / RT vcpu:		61096820.75
> VM entries / non-RT vcpu:	11191873.00
> Total cycles in RT VM:		7562908142051.72
> Total cycles in non-RT VM:	7820413591702.25
> 
> Patched: (average on all vcpus)
> VM Cycles / RT vcpu:		123137.13        (- 0.52%)
> VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
> Setup Cycles:			229.35           (+22.52%)
> VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
> VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
> Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
> Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)
> 
> Discussion:
> Setup cycles raised in ~42 cycles, increasing overhead.
> It proves that noting a quiescent state in guest entry introduces setup 

Same here, s/entry/exit, correct?  I just want to make sure I'm not completely
misunderstanding something.

> routine costs, which is expected.
> 
> The average time spend inside the VM was reduced, but the number of VM  
> entries raised, causing the VM to have around the same number of cpu cycles 
> available to run, meaning that the overhead caused by reporting RCU 
> quiescent state in VM exit got absorbed, and it may have to do with those 
> rare invoke_rcu_core()s that were bothering latency.
> 
> The difference is much smaller compared to case 1, and this is probably 
> because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
> which may be already inhibiting a lot of invoke_rcu_core()s.
> 
> Sean, Paul, what do you think?

I'm in favor of noting the context switch on exit.  Singling out this patch for
introducing ~30 cycles of latency in entry/exit is rather ridiculous.  I guarantee
ww have introduced far more latency at various times without any scrutiny, even
if we were to limit the search to just the fastpath entry/exit loop.  Outside of
the fastpath, there's basically zero chance the extra 30-40 cycles are meaningful.

Even without the non-RT improvement, I'd give this a thumbs up just so that entry
and exit are symmetrical.  The fact that noting the context switches seems to be
a win-win is icing on the cake.

So, assuming the above entry/exit confusion was just a typo,

Acked-by: Sean Christopherson <seanjc@google.com>

[RFC,1/1] kvm: Note an RCU quiescent state on guest exit

Commit Message

Comments

Patch