
[RFC] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively

Message ID 20120924152447.36c71b8f.yoshikawa_takuya_b1@lab.ntt.co.jp (mailing list archive)
State New, archived

Commit Message

Takuya Yoshikawa Sept. 24, 2012, 6:24 a.m. UTC
This is an RFC since I have not done any comparison with the approach
using for_each_set_bit() which can be seen in Avi's work.

	Takuya
---

We did a simple test to see which requests we would get at the same time
in vcpu_enter_guest() and got the following numbers:

|...........|...............|........|...............|.|
     (N)           (E)         (S)          (ES)      others
     22.3%         30.7%       16.0%        29.5%     1.4%

(N) : Nothing
(E) : Only KVM_REQ_EVENT
(S) : Only KVM_REQ_STEAL_UPDATE
(ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE

* Note that the exact numbers can change for other guests.

This motivated us to optimize the following code in vcpu_enter_guest():

    if (vcpu->requests) { /** (1) **/
        ...
        if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/
            record_steal_time(vcpu);
        ...
    }

    if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
        ...
    }

 - For case (E), we do kvm_check_request() for every request other than
   KVM_REQ_EVENT in the block (1) and always get false.
 - For case (S) and (ES), the only difference from case (E) is that we
   get true for KVM_REQ_STEAL_UPDATE.

This means that we were wasting a lot of time on the many if branches
in block (1), even for these three trivial cases, which together account
for more than 75% of the entries.

This patch mitigates the issue as follows:
 - For case (E), we change the first if condition to
       if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/
   so that we can skip the block completely.
 - For case (S) and (ES), we move the check (2) upwards, out of the
   block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the
   check (1').

Although this adds one if branch for case (N), steal updates occur
frequently enough, except when each vcpu is given a dedicated core, to
justify the tiny cost.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
---
 [My email address change is not a mistake.]

 arch/x86/kvm/x86.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

Comments

Xiao Guangrong Sept. 24, 2012, 6:59 a.m. UTC | #1
On 09/24/2012 02:24 PM, Takuya Yoshikawa wrote:
> This is an RFC since I have not done any comparison with the approach
> using for_each_set_bit() which can be seen in Avi's work.
> 

Why not compare them? I think for_each_set_bit() is better and it can
improve all cases (your patch does not consider every case; for
example, a guest under mm overcommit will generate lots of TLB
flush/RELOAD requests).

Actually, I think Avi's way can be improved further: we can use just
one atomic operation to avoid cache misses. Maybe like this:

while (vcpu->requests) {
	xchg(vcpu->requests, request);

	for_each_set_bit(request) {
		clear_bit(X);

		......
	}
}
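
Spelled out a bit more concretely, the loop could look roughly like this
in vcpu_enter_guest() (an untested sketch just to illustrate the idea,
not Avi's earlier code):

	/* Untested sketch: grab and clear all pending requests in one shot. */
	while (vcpu->requests) {
		unsigned long requests = xchg(&vcpu->requests, 0);
		unsigned int req;

		for_each_set_bit(req, &requests, BITS_PER_LONG) {
			switch (req) {
			case KVM_REQ_MMU_RELOAD:
				kvm_mmu_unload(vcpu);
				break;
			case KVM_REQ_STEAL_UPDATE:
				record_steal_time(vcpu);
				break;
			/* ... and so on for the remaining requests ... */
			}
		}
	}

The xchg() already clears every pending bit, so no per-request
clear_bit() is needed; requests raised again while the local copy is
being processed are picked up by the outer while loop.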




Avi Kivity Sept. 24, 2012, 10:09 a.m. UTC | #2
On 09/24/2012 08:59 AM, Xiao Guangrong wrote:
> On 09/24/2012 02:24 PM, Takuya Yoshikawa wrote:
>> This is an RFC since I have not done any comparison with the approach
>> using for_each_set_bit() which can be seen in Avi's work.
>> 
> 
> Why not compare it? I think for_each_set_bit is better and it can
> improve for all cases (in your patch, you did not consider all case,
> for example, if the guest under mm overcommit, it should generate
> lots of TLB flush/RELOAD request).
> 
> Actually, i think Avi's way can be improved in the further, we can
> just use one atomic operation to avoid cache-miss. May be like this:
> 
> while (vcpu->request) {
> 	xchg(vcpu->request, request);
> 
> 	for_each_set_bit(request) {
> 		clear_bit(X);
> 
> 		......
> 	}
> 	
> }

In fact I had something like that in one of the earlier versions, but it
was problematic.  I'll try to see what the issue was.
Avi Kivity Sept. 24, 2012, 10:18 a.m. UTC | #3
On 09/24/2012 08:24 AM, Takuya Yoshikawa wrote:
> This is an RFC since I have not done any comparison with the approach
> using for_each_set_bit() which can be seen in Avi's work.
> 
> 	Takuya
> ---
> 
> We did a simple test to see which requests we would get at the same time
> in vcpu_enter_guest() and got the following numbers:
> 
> |...........|...............|........|...............|.|
>      (N)           (E)         (S)          (ES)      others
>      22.3%         30.7%       16.0%        29.5%     1.4%
> 
> (N) : Nothing
> (E) : Only KVM_REQ_EVENT
> (S) : Only KVM_REQ_STEAL_UPDATE
> (ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE
> 
> * Note that the exact numbers can change for other guests.

What was the workload?  STEAL_UPDATE is done on schedules and
heavyweight exits (erroneously), so it should be rare.

Or maybe we're recording HLT time as steal time?

> 
> This motivated us to optimize the following code in vcpu_enter_guest():
> 
>     if (vcpu->requests) { /** (1) **/
>         ...
>         if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/
>             record_steal_time(vcpu);
>         ...
>     }
> 
>     if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>         ...
>     }
> 
>  - For case (E), we do kvm_check_request() for every request other than
>    KVM_REQ_EVENT in the block (1) and always get false.
>  - For case (S) and (ES), the only difference from case (E) is that we
>    get true for KVM_REQ_STEAL_UPDATE.
> 
> This means that we were wasting a lot of time for the many if branches
> in the block (1) even for these trivial three cases which dominated more
> than 75% in total.
> 
> This patch mitigates the issue as follows:
>  - For case (E), we change the first if condition to
>        if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/
>    so that we can skip the block completely.
>  - For case (S) and (ES), we move the check (2) upwards, out of the
>    block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the
>    check (1').
> 
> Although this adds one if branch for case (N), the fact that steal
> update occurs frequently enough except when we give each vcpu a
> dedicated core justifies its tiny cost.

Modern processors will eliminate KVM_REQ_EVENT in many cases, so the
optimization is wasted on them.

Do you have numbers?  Just for your patch, not my alternative.

> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
> ---
>  [My email address change is not a mistake.]
> 
>  arch/x86/kvm/x86.c |   11 ++++++++---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fc2a0a1..e351902 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5233,7 +5233,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		vcpu->run->request_interrupt_window;
>  	bool req_immediate_exit = 0;
>  
> -	if (vcpu->requests) {
> +	/*
> +	 * Getting KVM_REQ_STEAL_UPDATE alone is so common that clearing it now
> +	 * will hopefully result in skipping the following checks.
> +	 */
> +	if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> +		record_steal_time(vcpu);
> +
> +	if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) {
>  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
>  			kvm_mmu_unload(vcpu);
>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> @@ -5267,8 +5274,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			r = 1;
>  			goto out;
>  		}
> -		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> -			record_steal_time(vcpu);
>  		if (kvm_check_request(KVM_REQ_NMI, vcpu))
>  			process_nmi(vcpu);
>  		req_immediate_exit =
>
Takuya Yoshikawa Sept. 24, 2012, 1:55 p.m. UTC | #4
On Mon, 24 Sep 2012 14:59:44 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> On 09/24/2012 02:24 PM, Takuya Yoshikawa wrote:
> > This is an RFC since I have not done any comparison with the approach
> > using for_each_set_bit() which can be seen in Avi's work.
> > 
> 
> Why not compare it? I think for_each_set_bit is better and it can

I did not have enough time.  Sure, I should do some comparison.

But I also wanted to hear whether the request pattern I got was normal.
So this is not a request for merging; I rather wanted to share
the numbers I got.

Thanks,
	Takuya
Takuya Yoshikawa Sept. 24, 2012, 2:14 p.m. UTC | #5
On Mon, 24 Sep 2012 12:18:15 +0200
Avi Kivity <avi@redhat.com> wrote:

> On 09/24/2012 08:24 AM, Takuya Yoshikawa wrote:
> > This is an RFC since I have not done any comparison with the approach
> > using for_each_set_bit() which can be seen in Avi's work.
> > 
> > 	Takuya
> > ---
> > 
> > We did a simple test to see which requests we would get at the same time
> > in vcpu_enter_guest() and got the following numbers:
> > 
> > |...........|...............|........|...............|.|
> >      (N)           (E)         (S)          (ES)      others
> >      22.3%         30.7%       16.0%        29.5%     1.4%
> > 
> > (N) : Nothing
> > (E) : Only KVM_REQ_EVENT
> > (S) : Only KVM_REQ_STEAL_UPDATE
> > (ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE
> > 
> > * Note that the exact numbers can change for other guests.
> 
> What was the workload?  STEAL_UPDATE is done on schedules and
> heavyweight exit (erronously), so it should be rare.

Just pinging the host.  But even without that, I got many
STEAL_UPDATEs and KVM_REQ_EVENTs.

> Or maybe we're recording HLT time as steal time?

Not sure.

BTW, is schedule() really rare?  We do either cond_resched() or a
heavyweight exit, no?

I always see vcpu threads actively move around the cores.
(When I do not pin them.)

> > This motivated us to optimize the following code in vcpu_enter_guest():
> > 
> >     if (vcpu->requests) { /** (1) **/
> >         ...
> >         if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/
> >             record_steal_time(vcpu);
> >         ...
> >     }
> > 
> >     if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> >         ...
> >     }
> > 
> >  - For case (E), we do kvm_check_request() for every request other than
> >    KVM_REQ_EVENT in the block (1) and always get false.
> >  - For case (S) and (ES), the only difference from case (E) is that we
> >    get true for KVM_REQ_STEAL_UPDATE.
> > 
> > This means that we were wasting a lot of time for the many if branches
> > in the block (1) even for these trivial three cases which dominated more
> > than 75% in total.
> > 
> > This patch mitigates the issue as follows:
> >  - For case (E), we change the first if condition to
> >        if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/
> >    so that we can skip the block completely.
> >  - For case (S) and (ES), we move the check (2) upwards, out of the
> >    block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the
> >    check (1').
> > 
> > Although this adds one if branch for case (N), the fact that steal
> > update occurs frequently enough except when we give each vcpu a
> > dedicated core justifies its tiny cost.
> 
> Modern processors will eliminate KVM_REQ_EVENT in many cases, so the
> optmimization is wasted on them.

Then, my Nehalem server was not so modern.
I did something like this:

	if requests == KVM_REQ_EVENT
		++counter1;
	if requests == KVM_REQ_STEAL_UPDATE
		++counter2;
	...

in vcpu_enter_guest() and saw KVM_REQ_EVENT many times.
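
In slightly more concrete form, the counting looked something like this
(the counter names here are made up; it was throwaway debug code):

	unsigned long reqs = vcpu->requests;

	if (reqs == 0)
		++count_none;					/* (N)  */
	else if (reqs == (1UL << KVM_REQ_EVENT))
		++count_event;					/* (E)  */
	else if (reqs == (1UL << KVM_REQ_STEAL_UPDATE))
		++count_steal;					/* (S)  */
	else if (reqs == ((1UL << KVM_REQ_EVENT) |
			  (1UL << KVM_REQ_STEAL_UPDATE)))
		++count_event_steal;				/* (ES) */
	else
		++count_other;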

> Do you have numbers?  Just for your patch, not my alternative.

Not now.

Thanks,
	Takuya
Takuya Yoshikawa Sept. 24, 2012, 2:32 p.m. UTC | #6
On Mon, 24 Sep 2012 12:09:00 +0200
Avi Kivity <avi@redhat.com> wrote:

> > while (vcpu->request) {
> > 	xchg(vcpu->request, request);
> > 
> > 	for_each_set_bit(request) {
> > 		clear_bit(X);
> > 
> > 		......
> > 	}
> > 	
> > }
> 
> In fact I had something like that in one of the earlier versions, but it
> was problematic.  I'll try to see what the issue was.

Unless there are many requests, the cost of for_each_set_bit() and the
added code may exceed that of the original code.
(Looping using __ffs() is an alternative because requests is "long".)
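
For illustration, the __ffs() variant I have in mind would be roughly
this (untested):

	/* Untested sketch: consume pending requests lowest-bit-first. */
	while (vcpu->requests) {
		unsigned int req = __ffs(vcpu->requests);

		clear_bit(req, &vcpu->requests);
		/* dispatch on req, as the current if-chain does */
	}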

So I wanted to know the most common requests pattern.

Thanks,
	Takuya
Avi Kivity Sept. 24, 2012, 2:50 p.m. UTC | #7
On 09/24/2012 04:14 PM, Takuya Yoshikawa wrote:
> On Mon, 24 Sep 2012 12:18:15 +0200
> Avi Kivity <avi@redhat.com> wrote:
> 
>> On 09/24/2012 08:24 AM, Takuya Yoshikawa wrote:
>> > This is an RFC since I have not done any comparison with the approach
>> > using for_each_set_bit() which can be seen in Avi's work.
>> > 
>> > 	Takuya
>> > ---
>> > 
>> > We did a simple test to see which requests we would get at the same time
>> > in vcpu_enter_guest() and got the following numbers:
>> > 
>> > |...........|...............|........|...............|.|
>> >      (N)           (E)         (S)          (ES)      others
>> >      22.3%         30.7%       16.0%        29.5%     1.4%
>> > 
>> > (N) : Nothing
>> > (E) : Only KVM_REQ_EVENT
>> > (S) : Only KVM_REQ_STEAL_UPDATE
>> > (ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE
>> > 
>> > * Note that the exact numbers can change for other guests.
>> 
>> What was the workload?  STEAL_UPDATE is done on schedules and
>> heavyweight exit (erronously), so it should be rare.
> 
> Just ping'ing to the host.  But even without that, I got many
> STEAL_UPDATE's and KVM_REQ_EVENT's.

Those are only relative measurements.  Sure, on an idle guest practically
all of your exits are special, but there are so few of them that
optimizing them doesn't matter much.

However, the trend is that as hardware improves, exits get more and more
special.

Before NPT/EPT, most of the exits were page faults, and we optimized for
them.

Afterwards, most exits are APIC and interrupt related, HLT, and MMIO.
Of these, some are special (HLT, interrupt injection) and some are not
(read/write most APIC registers).  I don't think one group dominates the
other.  So vcpu->requests processing is already not such a slow path; it
is relatively common.  We still see a lot of page faults during boot and
during live migration though.

With AVIC/APIC-V (still in the future) the mix will change again, with
both special and non-special exits eliminated.  We'll be left mostly
with APIC timer and HLT (and ICR writes for APIC-V).

So maybe the direction of your patch makes sense.  Things like
KVM_REQ_EVENT (or anything above 2-3% of exits) shouldn't be in
vcpu->requests or maybe they deserve special treatment.

> 
>> Or maybe we're recording HLT time as steal time?
> 
> Not sure.

I think we do, but on the other hand qemu doesn't even expose it so we
can't measure it.

> 
> BTW, schedule() is really rare?  We do either cond_resched() or
> heavy weight exit, no?

If 25% of exits are HLT (like a ping workload), then 25% of your exits
end up in schedule().

On modern hardware, a relatively larger percentage of exits are
heavyweight (same analysis as above).  On AVIC hardware most exits will
be mmio, HLT, and host interrupts.  Of these only host interrupts that
don't lead to a context switch will be lightweight.

> 
> I always see vcpu threads actively move around the cores.
> (When I do not pin them.)

Sure, but the frequency is quite low.  If not, that's a bug.

> 
>> > This motivated us to optimize the following code in vcpu_enter_guest():
>> > 
>> >     if (vcpu->requests) { /** (1) **/
>> >         ...
>> >         if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/
>> >             record_steal_time(vcpu);
>> >         ...
>> >     }
>> > 
>> >     if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>> >         ...
>> >     }
>> > 
>> >  - For case (E), we do kvm_check_request() for every request other than
>> >    KVM_REQ_EVENT in the block (1) and always get false.
>> >  - For case (S) and (ES), the only difference from case (E) is that we
>> >    get true for KVM_REQ_STEAL_UPDATE.
>> > 
>> > This means that we were wasting a lot of time for the many if branches
>> > in the block (1) even for these trivial three cases which dominated more
>> > than 75% in total.
>> > 
>> > This patch mitigates the issue as follows:
>> >  - For case (E), we change the first if condition to
>> >        if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/
>> >    so that we can skip the block completely.
>> >  - For case (S) and (ES), we move the check (2) upwards, out of the
>> >    block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the
>> >    check (1').
>> > 
>> > Although this adds one if branch for case (N), the fact that steal
>> > update occurs frequently enough except when we give each vcpu a
>> > dedicated core justifies its tiny cost.
>> 
>> Modern processors will eliminate KVM_REQ_EVENT in many cases, so the
>> optmimization is wasted on them.
> 
> Then, my Nehalem server was not so modern.

Well, I was referring to APIC-V/AVIC hardware, which nobody has yet.  On
current hardware they're very common.  So stuffing it into the
vcpu->requests slow path is not warranted.

My patch is cleaner than yours as it handles the problem generically,
but yours matches reality better.

> I did something like this:
> 
> 	if requests == KVM_REQ_EVENT
> 		++counter1;
> 	if requests == KVM_REQ_STEAL_UPDATE
> 		++counter2;
> 	...
> 
> in vcpu_enter_guest() and saw KVM_REQ_EVENT many times.


(In theory perf probe can do this.  But figuring out how is often more
time-consuming than patching the kernel.)
Avi Kivity Sept. 24, 2012, 2:53 p.m. UTC | #8
On 09/24/2012 04:32 PM, Takuya Yoshikawa wrote:
> On Mon, 24 Sep 2012 12:09:00 +0200
> Avi Kivity <avi@redhat.com> wrote:
> 
>> > while (vcpu->request) {
>> > 	xchg(vcpu->request, request);
>> > 
>> > 	for_each_set_bit(request) {
>> > 		clear_bit(X);
>> > 
>> > 		......
>> > 	}
>> > 	
>> > }
>> 
>> In fact I had something like that in one of the earlier versions, but it
>> was problematic.  I'll try to see what the issue was.
> 
> Unless there are many requests, the cost of for_each_set_bit() and a few
> added code may exceed that of the original code.
> (Looping using __ffs() is an alternative because requests is "long".)
> 
> So I wanted to know the most common requests pattern.

See the other branch of this thread.  But in short, I now think you are
right and the special case is warranted.

(Not for STEAL_UPDATE, though - that's likely a bug; it should only happen
on overcommit.)
Takuya Yoshikawa Sept. 26, 2012, 2:06 a.m. UTC | #9
On Mon, 24 Sep 2012 16:50:13 +0200
Avi Kivity <avi@redhat.com> wrote:

> Afterwards, most exits are APIC and interrupt related, HLT, and MMIO.
> Of these, some are special (HLT, interrupt injection) and some are not
> (read/write most APIC registers).  I don't think one group dominates the
> other.  So already vcpu->requests processing is not such a slow path, it
> is relatively common.  We still see a lot of page faults during boot and
> during live migration though.
> 
> With AVIC/APIC-V (still in the future) the mix will change again, with
> both special and non-special exits eliminated.  We'll be left mostly
> with APIC timer and HLT (and ICR writes for APIC-V).
> 
> So maybe the direction of your patch makes sense.  Things like
> KVM_REQ_EVENT (or anything above 2-3% of exits) shouldn't be in
> vcpu->requests or maybe they deserve special treatment.

I see the point.

Since KVM_REQ_EVENT must be checked after handling some other requests,
it needs special treatment anyway -- if we consider defining it as the
last flag processed by for_each_set_bit() to be a kind of special treatment.

As Gleb and you pointed out, KVM_REQ_STEAL_UPDATE first needs to be
fixed so that it is not set unnecessarily.

Then, by special-casing KVM_REQ_EVENT -- either the one-line change or
moving it out of vcpu->requests -- we can see if further improvement is needed.

If a few requests exceed that threshold (2-3%, as you wrote?), we can also
define a mask to indicate which requests should be treated as "not unlikely".
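
Something like this, say (the mask name is made up, purely illustrative):

/* Hypothetical: requests common enough to deserve their own fast checks. */
#define KVM_REQ_COMMON_MASK	((1UL << KVM_REQ_EVENT) | \
				 (1UL << KVM_REQ_STEAL_UPDATE))

	/* Handle the common requests with dedicated checks first, then: */
	if (vcpu->requests & ~KVM_REQ_COMMON_MASK) {
		/* existing chain of kvm_check_request() checks */
	}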

> > BTW, schedule() is really rare?  We do either cond_resched() or
> > heavy weight exit, no?
> 
> If 25% of exits are HLT (like a ping workload), then 25% of your exits
> end up in schedule().
> 
> On modern hardware, a relatively larger percentage of exits are
> heavyweight (same analysis as above).  On AVIC hardware most exits will
> be mmio, HLT, and host interrupts.  Of these only host interrupts that
> don't lead to a context switch will be lightweight.
> 
> > 
> > I always see vcpu threads actively move around the cores.
> > (When I do not pin them.)
> 
> Sure, but the frequency is quite low.  If not that's a bug.

That's what I was originally testing for: whether vcpu threads were being
scheduled as expected.

I forgot how I ended up here.

> >> Modern processors will eliminate KVM_REQ_EVENT in many cases, so the
> >> optmimization is wasted on them.
> > 
> > Then, my Nehalem server was not so modern.
> 
> Well I was referring to APIC-v/AVIC hardware which nobody has.  On
> current hardware they're very common.  So stuffing it in the
> vcpu->requests slow path is not warranted.
> 
> My patch is cleaner than yours as it handles the problem generically,
> but yours matches reality better.

I guess so.

I remember someone once tried to inline the functions used inside
for_each_set_bit(), complaining that it was slow.  A generic approach
needs some scale to win.

> > I did something like this:
> > 
> > 	if requests == KVM_REQ_EVENT
> > 		++counter1;
> > 	if requests == KVM_REQ_STEAL_UPDATE
> > 		++counter2;
> > 	...
> > 
> > in vcpu_enter_guest() and saw KVM_REQ_EVENT many times.
> 
> 
> (in theory perf probe can do this.  But figuring out how is often more
> time consuming than patching the kernel).

Yes, actually I was playing with perf before directly counting each pattern.
But since I could not see the details easily, because of inlining or ...,
I ended up going my own way.

Thanks,
	Takuya
Avi Kivity Sept. 27, 2012, 9:28 a.m. UTC | #10
On 09/26/2012 04:06 AM, Takuya Yoshikawa wrote:
> On Mon, 24 Sep 2012 16:50:13 +0200
> Avi Kivity <avi@redhat.com> wrote:
> 
>> Afterwards, most exits are APIC and interrupt related, HLT, and MMIO.
>> Of these, some are special (HLT, interrupt injection) and some are not
>> (read/write most APIC registers).  I don't think one group dominates the
>> other.  So already vcpu->requests processing is not such a slow path, it
>> is relatively common.  We still see a lot of page faults during boot and
>> during live migration though.
>> 
>> With AVIC/APIC-V (still in the future) the mix will change again, with
>> both special and non-special exits eliminated.  We'll be left mostly
>> with APIC timer and HLT (and ICR writes for APIC-V).
>> 
>> So maybe the direction of your patch makes sense.  Things like
>> KVM_REQ_EVENT (or anything above 2-3% of exits) shouldn't be in
>> vcpu->requests or maybe they deserve special treatment.
> 
> I see the point.
> 
> Since KVM_REQ_EVENT must be checked after handling some other requests,
> it needs special treatment anyway -- if we think defining it as the
> last flag for for_each_set_bit() is kind of special treatment.

Note that's an optimization - the code will work even if we don't do that.

But we can go further, like making it a bool, so it can be set without
an atomic operation.

It would depend on how frequent it is in a real workload.
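
Roughly like this, say (the field is hypothetical, and cross-CPU ordering
and kicking the vcpu are glossed over):

	/* Hypothetical field in struct kvm_vcpu_arch: */
	bool pending_event;

	/* Raising side: a plain store instead of an atomic set_bit(): */
	vcpu->arch.pending_event = true;
	kvm_vcpu_kick(vcpu);

	/* In vcpu_enter_guest(): */
	if (vcpu->arch.pending_event || req_int_win) {
		vcpu->arch.pending_event = false;
		/* inject pending events as the KVM_REQ_EVENT block does now */
	}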

> 
> As Gleb and you pointed out, KVM_REQ_STEAL_UPDATE needs to be fixed
> first not to be set unnecessarily.
> 
> Then by special casing KVM_REQ_EVENT, one line change or moving it out
> from vcpu->requests, we can see if further improvement is needed.
> 
> If a few requests exceed the threshold, 2-3% as you wrote?, we can also
> define a mask to indicate which requests should be treated as "not unlikely".

Right, it will depend on each individual case.  For example, synchronous
requests (those that can only be raised from the vcpu thread itself) would
be better off outside vcpu->requests.

> 
>> > BTW, schedule() is really rare?  We do either cond_resched() or
>> > heavy weight exit, no?
>> 
>> If 25% of exits are HLT (like a ping workload), then 25% of your exits
>> end up in schedule().
>> 
>> On modern hardware, a relatively larger percentage of exits are
>> heavyweight (same analysis as above).  On AVIC hardware most exits will
>> be mmio, HLT, and host interrupts.  Of these only host interrupts that
>> don't lead to a context switch will be lightweight.
>> 
>> > 
>> > I always see vcpu threads actively move around the cores.
>> > (When I do not pin them.)
>> 
>> Sure, but the frequency is quite low.  If not that's a bug.
> 
> That's what I was originally testing for: if vcpu threads were being
> scheduled as expected.
> 
> I forgot why I reached here.

Human minds also tend to move around a lot.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc2a0a1..e351902 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5233,7 +5233,14 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		vcpu->run->request_interrupt_window;
 	bool req_immediate_exit = 0;
 
-	if (vcpu->requests) {
+	/*
+	 * Getting KVM_REQ_STEAL_UPDATE alone is so common that clearing it now
+	 * will hopefully result in skipping the following checks.
+	 */
+	if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
+		record_steal_time(vcpu);
+
+	if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) {
 		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
@@ -5267,8 +5274,6 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = 1;
 			goto out;
 		}
-		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
-			record_steal_time(vcpu);
 		if (kvm_check_request(KVM_REQ_NMI, vcpu))
 			process_nmi(vcpu);
 		req_immediate_exit =