Message ID | 1634631160-67276-3-git-send-email-wanpengli@tencent.com (mailing list archive)
---|---
State | New, archived
Series | [v3,1/3] KVM: emulate: Don't inject #GP when emulating RDMPC if CR0.PE=0
On 19/10/21 10:12, Wanpeng Li wrote:
> -        if (kvm_vcpu_wake_up(vcpu))
> -                return;
> +        me = get_cpu();
> +
> +        if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) && kvm_vcpu_wake_up(vcpu))
> +                goto out;

This is racy.  You are basically doing the same check that
rcuwait_wake_up does, but without the memory barrier before.

Also here:

> +        if (vcpu == __this_cpu_read(kvm_running_vcpu)) {
> +                WARN_ON_ONCE(vcpu->mode == IN_GUEST_MODE);

it's better to do

        if (vcpu == ... &&
            !WARN_ON_ONCE(vcpu->mode == IN_GUEST_MODE))
                goto out;

so that if the bug happens you do get a smp_send_reschedule() and fail
safely.

Paolo

> +                goto out;
> +        }
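To make the ordering concern concrete: rcuwait_wake_up() issues a full barrier before it reads w->task, so an open-coded rcuwait_active() peek in the caller would need the same ordering to be safe. A minimal sketch of what that would look like, assuming the barrier pairs with the waiter side of rcuwait_wait_event() (illustration only, not code from the series):

        /*
         * Sketch: order this CPU's earlier "wake condition" writes (pending
         * request/interrupt) before peeking at the rcuwait, mirroring
         * barrier (B) inside rcuwait_wake_up() itself.
         */
        smp_mb();
        if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) &&
            kvm_vcpu_wake_up(vcpu))
                goto out;

The thread below instead converges on keeping the check inside rcuwait_wake_up(), after its existing barrier, so callers do not have to reason about this pairing themselves.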
On Tue, Oct 19, 2021, Paolo Bonzini wrote:
> On 19/10/21 10:12, Wanpeng Li wrote:
> > -        if (kvm_vcpu_wake_up(vcpu))
> > -                return;
> > +        me = get_cpu();
> > +
> > +        if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) && kvm_vcpu_wake_up(vcpu))
> > +                goto out;
>
> This is racy.  You are basically doing the same check that rcuwait_wake_up
> does, but without the memory barrier before.

I was worried that was the case[*], but I didn't have the two hours it would have
taken me to verify there was indeed a problem :-)

The intent of the extra check was to avoid the locked instruction that comes with
disabling preemption via rcu_read_lock().  But thinking more, the extra op should
be little more than a basic arithmetic operation in the grand scheme on modern x86
since the cache line is going to be locked and written no matter what, either
immediately before or immediately after.

So with Paolo's other comment, maybe just this?  And if this doesn't provide the
desired performance boost, changes to the rcuwait behavior should go in separate
patch.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9ec99f5b972c..ebc6d4f2fbfa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3333,11 +3333,22 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
          * vCPU also requires it to leave IN_GUEST_MODE.
          */
         me = get_cpu();
+
+        /*
+         * Avoid the moderately expensive "should kick" operation if this pCPU
+         * is currently running the target vCPU, in which case it's a KVM bug
+         * if the vCPU is in the inner run loop.
+         */
+        if (vcpu == __this_cpu_read(kvm_running_vcpu) &&
+            !WARN_ON_ONCE(vcpu->mode == IN_GUEST_MODE))
+                goto out;
+
         if (kvm_arch_vcpu_should_kick(vcpu)) {
                 cpu = READ_ONCE(vcpu->cpu);
                 if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
                         smp_send_reschedule(cpu);
         }
+out:
         put_cpu();
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_kick);

[*] https://lkml.kernel.org/r/YWoOG40Ap0Islpu2@google.com
On Wed, 20 Oct 2021 at 01:34, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 19, 2021, Paolo Bonzini wrote:
> > On 19/10/21 10:12, Wanpeng Li wrote:
> > > -        if (kvm_vcpu_wake_up(vcpu))
> > > -                return;
> > > +        me = get_cpu();
> > > +
> > > +        if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) && kvm_vcpu_wake_up(vcpu))
> > > +                goto out;
> >
> > This is racy.  You are basically doing the same check that rcuwait_wake_up
> > does, but without the memory barrier before.
>
> I was worried that was the case[*], but I didn't have the two hours it would have
> taken me to verify there was indeed a problem :-)
>
> The intent of the extra check was to avoid the locked instruction that comes with
> disabling preemption via rcu_read_lock().  But thinking more, the extra op should
> be little more than a basic arithmetic operation in the grand scheme on modern x86
> since the cache line is going to be locked and written no matter what, either
> immediately before or immediately after.

I observe the main overhead of rcuwait_wake_up() is from rcu
operations, especially rcu_read_lock/unlock().

>
> So with Paolo's other comment, maybe just this?  And if this doesn't provide the
> desired performance boost, changes to the rcuwait behavior should go in separate
> patch.

Ok.

    Wanpeng
On 20/10/21 04:49, Wanpeng Li wrote:
>> The intent of the extra check was to avoid the locked instruction that comes with
>> disabling preemption via rcu_read_lock().  But thinking more, the extra op should
>> be little more than a basic arithmetic operation in the grand scheme on modern x86
>> since the cache line is going to be locked and written no matter what, either
>> immediately before or immediately after.
>
> I observe the main overhead of rcuwait_wake_up() is from rcu
> operations, especially rcu_read_lock/unlock().

Do you have CONFIG_PREEMPT_RCU set?  If so, maybe something like this
would help:

diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..ca1e60a1234d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -235,8 +235,6 @@ int rcuwait_wake_up(struct rcuwait *w)
         int ret = 0;
         struct task_struct *task;

-        rcu_read_lock();
-
         /*
          * Order condition vs @task, such that everything prior to the load
          * of @task is visible. This is the condition as to why the user called
@@ -250,6 +248,14 @@ int rcuwait_wake_up(struct rcuwait *w)
          */
         smp_mb(); /* (B) */

+#ifdef CONFIG_PREEMPT_RCU
+        /* The cost of rcu_read_lock() is nontrivial for preemptable RCU. */
+        if (!rcuwait_active(w))
+                return ret;
+#endif
+
+        rcu_read_lock();
+
         task = rcu_dereference(w->task);
         if (task)
                 ret = wake_up_process(task);

(If you don't, rcu_read_lock is essentially preempt_disable() and it
should not have a large overhead).  You still need the memory barrier
though, in order to avoid missed wakeups; shameless plug for my
article at https://lwn.net/Articles/847481/.

Paolo

>> So with Paolo's other comment, maybe just this?  And if this doesn't provide the
>> desired performance boost, changes to the rcuwait behavior should go in separate
>> patch.
> Ok.
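For context on the pairing Paolo refers to, here is a simplified sketch of the waiter side (modeled on rcuwait_wait_event(); condensed, not the exact kernel macro). The waiter publishes w->task before it tests the wake condition, which is why barrier (B) has to stay ahead of the new rcuwait_active() check on the waker side:

        prepare_to_rcuwait(w);          /* rcu_assign_pointer(w->task, current) */
        for (;;) {
                /* Implicit barrier (A), pairs with (B) in rcuwait_wake_up(). */
                set_current_state(TASK_INTERRUPTIBLE);
                if (condition)          /* e.g. a pending kick or request */
                        break;
                schedule();
        }
        finish_rcuwait(w);              /* clears w->task, back to TASK_RUNNING */

        /*
         * Without (B), the waker's load of w->task could be ordered before its
         * write of the condition: the waiter then sees the stale condition and
         * sleeps, while the waker sees w->task == NULL and skips the wakeup.
         */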
On Wed, 20 Oct 2021 at 14:47, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 20/10/21 04:49, Wanpeng Li wrote:
> >> The intent of the extra check was to avoid the locked instruction that comes with
> >> disabling preemption via rcu_read_lock().  But thinking more, the extra op should
> >> be little more than a basic arithmetic operation in the grand scheme on modern x86
> >> since the cache line is going to be locked and written no matter what, either
> >> immediately before or immediately after.
> >
> > I observe the main overhead of rcuwait_wake_up() is from rcu
> > operations, especially rcu_read_lock/unlock().
>
> Do you have CONFIG_PREEMPT_RCU set?  If so, maybe something like this would help:

Yes.

>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd1c04193e18..ca1e60a1234d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -235,8 +235,6 @@ int rcuwait_wake_up(struct rcuwait *w)
>          int ret = 0;
>          struct task_struct *task;
>
> -        rcu_read_lock();
> -
>          /*
>           * Order condition vs @task, such that everything prior to the load
>           * of @task is visible. This is the condition as to why the user called
> @@ -250,6 +248,14 @@ int rcuwait_wake_up(struct rcuwait *w)
>           */
>          smp_mb(); /* (B) */
>
> +#ifdef CONFIG_PREEMPT_RCU
> +        /* The cost of rcu_read_lock() is nontrivial for preemptable RCU. */
> +        if (!rcuwait_active(w))
> +                return ret;
> +#endif
> +
> +        rcu_read_lock();
> +
>          task = rcu_dereference(w->task);
>          if (task)
>                  ret = wake_up_process(task);
>
> (If you don't, rcu_read_lock is essentially preempt_disable() and it
> should not have a large overhead).  You still need the memory barrier
> though, in order to avoid missed wakeups; shameless plug for my
> article at https://lwn.net/Articles/847481/.

You are right, the cost of rcu_read_lock() for preemptable RCU
introduces too much overhead, do you want to send a separate patch?

    Wanpeng
On 19/10/21 19:34, Sean Christopherson wrote:
> The intent of the extra check was to avoid the locked instruction that comes with
> disabling preemption via rcu_read_lock().  But thinking more, the extra op should
> be little more than a basic arithmetic operation in the grand scheme on modern x86
> since the cache line is going to be locked and written no matter what, either
> immediately before or immediately after.

There should be no locked instructions unless you're using
PREEMPT_RT/PREEMPT_RCU, no?  The preempt_disable count is in a percpu
variable.

> +        /*
> +         * Avoid the moderately expensive "should kick" operation if this pCPU
> +         * is currently running the target vCPU, in which case it's a KVM bug
> +         * if the vCPU is in the inner run loop.
> +         */
> +        if (vcpu == __this_cpu_read(kvm_running_vcpu) &&
> +            !WARN_ON_ONCE(vcpu->mode == IN_GUEST_MODE))
> +                goto out;
> +

It should not even be a problem if vcpu->mode == IN_GUEST_MODE, you
just set it to EXITING_GUEST_MODE without even the need for
atomic_cmpxchg.  I'll send a few patches out, since I think I found
some related issues.

Paolo
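A minimal sketch of the alternative Paolo describes (illustration only; the identifiers come from the existing kvm_vcpu_kick() context and the eventual patches may differ): because the kick runs on the very pCPU that owns the vCPU, the vCPU cannot currently be executing guest code, so a plain store is enough to make the next guest-entry attempt bail out, with no atomic_cmpxchg() required.

        if (vcpu == __this_cpu_read(kvm_running_vcpu)) {
                /*
                 * No IPI needed, we're already on the target pCPU.  Just make
                 * sure the vCPU notices the kick before re-entering the guest.
                 */
                if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
                        WRITE_ONCE(vcpu->mode, EXITING_GUEST_MODE);
                goto out;
        }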
On 20/10/21 12:02, Wanpeng Li wrote:
>> +#ifdef CONFIG_PREEMPT_RCU
>> +        /* The cost of rcu_read_lock() is nontrivial for preemptable RCU. */
>> +        if (!rcuwait_active(w))
>> +                return ret;
>> +#endif
>> +
>> +        rcu_read_lock();
>> +
>>          task = rcu_dereference(w->task);
>>          if (task)
>>                  ret = wake_up_process(task);
>>
>> (If you don't, rcu_read_lock is essentially preempt_disable() and it
>> should not have a large overhead).  You still need the memory barrier
>> though, in order to avoid missed wakeups; shameless plug for my
>> article at https://lwn.net/Articles/847481/.
> You are right, the cost of rcu_read_lock() for preemptable RCU
> introduces too much overhead, do you want to send a separate patch?

Yes, I'll take care of this.  Thanks!

Paolo
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7851f3a1b5f7..1bc52eab0a7d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3314,8 +3314,15 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 {
         int me, cpu;

-        if (kvm_vcpu_wake_up(vcpu))
-                return;
+        me = get_cpu();
+
+        if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)) && kvm_vcpu_wake_up(vcpu))
+                goto out;
+
+        if (vcpu == __this_cpu_read(kvm_running_vcpu)) {
+                WARN_ON_ONCE(vcpu->mode == IN_GUEST_MODE);
+                goto out;
+        }

         /*
          * Note, the vCPU could get migrated to a different pCPU at any point
@@ -3324,12 +3331,12 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
          * IPI is to force the vCPU to leave IN_GUEST_MODE, and migrating the
          * vCPU also requires it to leave IN_GUEST_MODE.
          */
-        me = get_cpu();
         if (kvm_arch_vcpu_should_kick(vcpu)) {
                 cpu = READ_ONCE(vcpu->cpu);
                 if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
                         smp_send_reschedule(cpu);
         }
+out:
         put_cpu();
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_kick);