Message ID | 20110419033814.3cc7ab5e.takuya.yoshikawa@gmail.com (mailing list archive)
---|---
State | New, archived
On Tue, Apr 19, 2011 at 03:38:14AM +0900, Takuya Yoshikawa wrote:
> From: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
>
> We optimize multi level guest page table walk as follows:
>
> 1. We cache the memslot which, probably, includes the next guest page
>    tables to avoid searching for it many times.
> 2. We use get_user() instead of copy_from_user().
>
> Note that this is kind of a restricted way of Xiao's more generic
> work: "KVM: optimize memslots searching and cache GPN to GFN."
>
> With this patch applied, paging64_walk_addr_generic() has improved
> as the following tracing results show.
>
> Before:
> 3.169 us | paging64_walk_addr_generic();
> 1.880 us | paging64_walk_addr_generic();
> 1.243 us | paging64_walk_addr_generic();
> 1.517 us | paging64_walk_addr_generic();
> 3.009 us | paging64_walk_addr_generic();
> 1.814 us | paging64_walk_addr_generic();
> 1.340 us | paging64_walk_addr_generic();
> 1.659 us | paging64_walk_addr_generic();
> 1.748 us | paging64_walk_addr_generic();
> 1.488 us | paging64_walk_addr_generic();
>
> After:
> 1.714 us | paging64_walk_addr_generic();
> 0.806 us | paging64_walk_addr_generic();
> 0.664 us | paging64_walk_addr_generic();
> 0.619 us | paging64_walk_addr_generic();
> 0.645 us | paging64_walk_addr_generic();
> 0.605 us | paging64_walk_addr_generic();
> 1.388 us | paging64_walk_addr_generic();
> 0.753 us | paging64_walk_addr_generic();
> 0.594 us | paging64_walk_addr_generic();
> 0.833 us | paging64_walk_addr_generic();

Nice optimization! What scenarios have you used to test it?

	Joerg
Joerg Roedel <joro@8bytes.org> wrote:
> Nice optimization! What scenarios have you used to test it?
I used my desktop Phenom II box, running the latest qemu-kvm,
so NPT was probably on by default.
The guest was playing a .ogg movie during the test.
I am not an MMU expert, so I would be glad to know which
scenarios should be tested for this patch!
Can I test nested SVM easily, for example?
Thanks,
Takuya
On 04/19/2011 02:38 AM, Takuya Yoshikawa wrote:
> From: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
>
> We optimize multi level guest page table walk as follows:
>
> 1. We cache the memslot which, probably, includes the next guest page
>    tables to avoid searching for it many times.

Yeah, the hit rate is very high. But after optimizing the memslots searching
algorithm (http://lwn.net/Articles/429308/), maybe the advantage is not so
significant; could you apply that patchset and test again, please?

> 2. We use get_user() instead of copy_from_user().
>
> Note that this is kind of a restricted way of Xiao's more generic
> work: "KVM: optimize memslots searching and cache GPN to GFN."
>
> With this patch applied, paging64_walk_addr_generic() has improved
> as the following tracing results show.
>
> Before:
> 3.169 us | paging64_walk_addr_generic();
> 1.880 us | paging64_walk_addr_generic();
> 1.243 us | paging64_walk_addr_generic();
> 1.517 us | paging64_walk_addr_generic();
> 3.009 us | paging64_walk_addr_generic();
> 1.814 us | paging64_walk_addr_generic();
> 1.340 us | paging64_walk_addr_generic();
> 1.659 us | paging64_walk_addr_generic();
> 1.748 us | paging64_walk_addr_generic();
> 1.488 us | paging64_walk_addr_generic();
>
> After:
> 1.714 us | paging64_walk_addr_generic();
> 0.806 us | paging64_walk_addr_generic();
> 0.664 us | paging64_walk_addr_generic();
> 0.619 us | paging64_walk_addr_generic();
> 0.645 us | paging64_walk_addr_generic();
> 0.605 us | paging64_walk_addr_generic();
> 1.388 us | paging64_walk_addr_generic();
> 0.753 us | paging64_walk_addr_generic();
> 0.594 us | paging64_walk_addr_generic();
> 0.833 us | paging64_walk_addr_generic();
>
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  arch/x86/kvm/paging_tmpl.h |   37 ++++++++++++++++++++++++++++++++-----
>  1 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 109939a..614aa3f 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -109,12 +109,37 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
>  	return access;
>  }
>
> +/*
> + * Read the guest PTE referred to by table_gfn and offset and put it into ptep.
> + *
> + * *slot_hint, if not NULL, should point to a memslot which probably includes
> + * the guest PTE.  The actual memslot will be put back into this so that
> + * callers can cache it.
> + */
>  static int FNAME(read_guest_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> -				 gfn_t table_gfn, int offset, pt_element_t *ptep)
> +				 gfn_t table_gfn, int offset, pt_element_t *ptep,
> +				 struct kvm_memory_slot **slot_hint)
>  {
> -	return kvm_read_guest_page_mmu(vcpu, mmu, table_gfn, ptep,
> -				       offset, sizeof(*ptep),
> -				       PFERR_USER_MASK | PFERR_WRITE_MASK);
> +	unsigned long addr;
> +	pt_element_t __user *ptep_user;
> +	gfn_t real_gfn;
> +
> +	real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> +				      PFERR_USER_MASK | PFERR_WRITE_MASK);
> +	if (real_gfn == UNMAPPED_GVA)
> +		return -EFAULT;
> +
> +	real_gfn = gpa_to_gfn(real_gfn);
> +
> +	if (!(*slot_hint) || !gfn_in_memslot(*slot_hint, real_gfn))
> +		*slot_hint = gfn_to_memslot(vcpu->kvm, real_gfn);
> +

You forgot to check the result (what if *slot_hint == NULL?) ... ;-)
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> wrote:

> > We optimize multi level guest page table walk as follows:
> >
> > 1. We cache the memslot which, probably, includes the next guest page
> >    tables to avoid searching for it many times.
>
> Yeah, the hit rate is very high. But after optimizing the memslots searching
> algorithm (http://lwn.net/Articles/429308/), maybe the advantage is not so
> significant; could you apply that patchset and test again, please?

Any sorting strategy, including tree based ones, has tradeoffs. Compared to
that, what I wanted to do here was to improve the table walk locally without
sacrificing other things.

Of course, my strategy depends on the assumption that the page tables will be
in the same slot with very high probability.

So if a certain algorithm is adopted, yes, I will test based on that.
IIRC, no practically good algorithm has been found yet, right?

> > 2. We use get_user() instead of copy_from_user().
> >
> > Note that this is kind of a restricted way of Xiao's more generic
> > work: "KVM: optimize memslots searching and cache GPN to GFN."
> >
> > With this patch applied, paging64_walk_addr_generic() has improved
> > as the following tracing results show.
> >
> > +
> > +	if (!(*slot_hint) || !gfn_in_memslot(*slot_hint, real_gfn))
> > +		*slot_hint = gfn_to_memslot(vcpu->kvm, real_gfn);
> > +
>
> You forgot to check the result (what if *slot_hint == NULL?) ... ;-)

Thank you! I will check later.

	Takuya
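The missing check Xiao points out would be a small addition. Below is a
sketch of what that hunk might look like, assuming gfn_to_memslot() returns
NULL when no memslot contains the gfn; the surrounding lines are taken from
the patch above and the exact placement is only illustrative:

	real_gfn = gpa_to_gfn(real_gfn);

	if (!(*slot_hint) || !gfn_in_memslot(*slot_hint, real_gfn)) {
		*slot_hint = gfn_to_memslot(vcpu->kvm, real_gfn);
		/* No memslot contains this gfn: fail like the other error paths. */
		if (!(*slot_hint))
			return -EFAULT;
	}

	addr = gfn_to_hva_memslot(*slot_hint, real_gfn);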
On Tue, Apr 19, 2011 at 10:24:10AM +0900, Takuya Yoshikawa wrote:
> Can I test nested SVM easily, for example?
Yes, nested SVM would be good to test with those changes too. We should
make sure not to break anything there.
Regards,
Joerg
On 04/18/2011 09:38 PM, Takuya Yoshikawa wrote:
> From: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
>
> We optimize multi level guest page table walk as follows:
>
> 1. We cache the memslot which, probably, includes the next guest page
>    tables to avoid searching for it many times.
> 2. We use get_user() instead of copy_from_user().
>
> Note that this is kind of a restricted way of Xiao's more generic
> work: "KVM: optimize memslots searching and cache GPN to GFN."

Good optimization.  copy_from_user() really isn't optimized for short
buffers; I expect much of the improvement comes from that.

> +/*
> + * Read the guest PTE referred to by table_gfn and offset and put it into ptep.
> + *
> + * *slot_hint, if not NULL, should point to a memslot which probably includes
> + * the guest PTE.  The actual memslot will be put back into this so that
> + * callers can cache it.
> + */

Please drop the slot_hint optimization.  First, it belongs in a separate
patch.  Second, I prefer to see a generic slot sort instead of an ad-hoc
cache.

>  static int FNAME(read_guest_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> -				 gfn_t table_gfn, int offset, pt_element_t *ptep)
> +				 gfn_t table_gfn, int offset, pt_element_t *ptep,
> +				 struct kvm_memory_slot **slot_hint)
>  {
> -	return kvm_read_guest_page_mmu(vcpu, mmu, table_gfn, ptep,
> -				       offset, sizeof(*ptep),
> -				       PFERR_USER_MASK | PFERR_WRITE_MASK);
> +	unsigned long addr;
> +	pt_element_t __user *ptep_user;
> +	gfn_t real_gfn;
> +
> +	real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> +				      PFERR_USER_MASK | PFERR_WRITE_MASK);
> +	if (real_gfn == UNMAPPED_GVA)
> +		return -EFAULT;
> +
> +	real_gfn = gpa_to_gfn(real_gfn);
> +
> +	if (!(*slot_hint) || !gfn_in_memslot(*slot_hint, real_gfn))
> +		*slot_hint = gfn_to_memslot(vcpu->kvm, real_gfn);
> +
> +	addr = gfn_to_hva_memslot(*slot_hint, real_gfn);
> +	if (kvm_is_error_hva(addr))
> +		return -EFAULT;
> +
> +	ptep_user = (pt_element_t __user *)((void *)addr + offset);
> +	return get_user(*ptep, ptep_user);
>  }
>
>  /*
> @@ -130,6 +155,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	gpa_t pte_gpa;
>  	bool eperm, present, rsvd_fault;
>  	int offset, write_fault, user_fault, fetch_fault;
> +	struct kvm_memory_slot *slot_cache = NULL;
>
>  	write_fault = access & PFERR_WRITE_MASK;
>  	user_fault = access & PFERR_USER_MASK;
> @@ -168,7 +194,8 @@ walk:
>  		walker->table_gfn[walker->level - 1] = table_gfn;
>  		walker->pte_gpa[walker->level - 1] = pte_gpa;
>
> -		if (FNAME(read_guest_pte)(vcpu, mmu, table_gfn, offset, &pte)) {
> +		if (FNAME(read_guest_pte)(vcpu, mmu, table_gfn,
> +					  offset, &pte, &slot_cache)) {
>  			present = false;
>  			break;
>  		}
On 04/19/2011 06:47 AM, Takuya Yoshikawa wrote:
> So if a certain algorithm is adopted, yes, I will test based on that.
> IIRC, no practically good algorithm has been found yet, right?

I think a simple sort based on size will provide the same optimization (just
the cache, not get_user()) without any downsides.  Most memory in a guest is
usually in just one or two slots; that's the reason for the high hit rate.
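For concreteness, the sort Avi has in mind could be as simple as keeping the
memslots array ordered by npages, largest first, so the existing linear
search in gfn_to_memslot() hits the big RAM slots on the first iteration. A
rough, untested sketch follows; the nmemslots/memslots field names are
assumed from the kvm_memslots layout of that era, and the resort would only
run when the slot layout changes:

	static void sort_memslots_by_size(struct kvm_memslots *slots)
	{
		int i, j;

		/* Insertion sort: the array is tiny (a few dozen slots at most). */
		for (i = 1; i < slots->nmemslots; i++) {
			struct kvm_memory_slot tmp = slots->memslots[i];

			for (j = i - 1;
			     j >= 0 && slots->memslots[j].npages < tmp.npages;
			     j--)
				slots->memslots[j + 1] = slots->memslots[j];
			slots->memslots[j + 1] = tmp;
		}
	}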
Avi Kivity <avi@redhat.com> writes:
>
> Good optimization.  copy_from_user() really isn't optimized for short
> buffers; I expect much of the improvement comes from that.

Actually it is equivalent to get_user() for the lengths supported by
get_user(), assuming you pass in a constant length.  You probably do not.

-Andi
On Thu, 28 Apr 2011 19:46:00 -0700
Andi Kleen <andi@firstfloor.org> wrote:

> Avi Kivity <avi@redhat.com> writes:
> >
> > Good optimization.  copy_from_user() really isn't optimized for short
> > buffers; I expect much of the improvement comes from that.
>
> Actually it is equivalent to get_user() for the lengths supported by
> get_user(), assuming you pass in a constant length.  You probably do not.
>
> -Andi

Weird, I actually measured some changes even after dropping everything
other than the get_user() usage.

The only reason I can guess is the reduction of some function calls
by inlining some functions.

	Takuya
On Fri, 29 Apr 2011 14:38:08 +0900
Takuya Yoshikawa <takuya.yoshikawa@gmail.com> wrote:

> On Thu, 28 Apr 2011 19:46:00 -0700
> Andi Kleen <andi@firstfloor.org> wrote:
>
> > Actually it is equivalent to get_user() for the lengths supported by
> > get_user(), assuming you pass in a constant length.  You probably do not.
>
> Weird, I actually measured some changes even after dropping everything
> other than the get_user() usage.
>
> The only reason I can guess is the reduction of some function calls
> by inlining some functions.

A bit to clarify: the original path was

  kvm_read_guest_page_mmu()
    kvm_read_guest_page()
      copy_from_user()

	Takuya
> The only reason I can guess is the reduction of some function calls
> by inlining some functions.

Yes, at one time cfu was inlined too: it just checked for the right sizes
and then used g*u().  But that got lost in the "icache over everything else"
mania which has unfortunately been en vogue in kernel land for quite some
time (i.e. measuring patches only by their impact on .text size, not by
actual performance).

But you might get better gains by fixing this general c*u() regression.

-Andi
Andi Kleen <andi@firstfloor.org> wrote:

> > The only reason I can guess is the reduction of some function calls
> > by inlining some functions.
>
> Yes, at one time cfu was inlined too: it just checked for the right sizes
> and then used g*u().  But that got lost in the "icache over everything else"
> mania which has unfortunately been en vogue in kernel land for quite some
> time (i.e. measuring patches only by their impact on .text size, not by
> actual performance).
>
> But you might get better gains by fixing this general c*u() regression.

What I mentioned was about not only cfu but three functions.

Originally, we were doing the following function calls:

  walk_addr_generic()            --- 1
    kvm_read_guest_page_mmu()    --- 2
      kvm_read_guest_page()      --- 3
        copy_from_user()         --- 4

And now, with my patch already applied, we are not using the generic
kvm_read_guest_page_mmu(), and some address calculations are done in
walk_addr_generic():

  walk_addr_generic()            --- 1'
    get_user()                   --- 2'

The length is passed in from walk_addr_generic().

Do you think the following case would not differ so much from (1' 2')?

  walk_addr_generic()            --- 1''
    copy_from_user()             --- 2''

This would satisfy your "assuming you pass in a constant length" and be
almost the same as get_user()?

	Takuya
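A sketch of that (1'' 2'') variant, reusing the body of read_guest_pte() from
the patch and only swapping the final access; untested, and shown only to
make the comparison concrete -- sizeof(*ptep) is a compile-time constant
here, which is what Andi's condition requires:

	static int FNAME(read_guest_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
					 gfn_t table_gfn, int offset, pt_element_t *ptep,
					 struct kvm_memory_slot **slot_hint)
	{
		unsigned long addr;
		pt_element_t __user *ptep_user;
		gfn_t real_gfn;

		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
					      PFERR_USER_MASK | PFERR_WRITE_MASK);
		if (real_gfn == UNMAPPED_GVA)
			return -EFAULT;

		real_gfn = gpa_to_gfn(real_gfn);

		if (!(*slot_hint) || !gfn_in_memslot(*slot_hint, real_gfn))
			*slot_hint = gfn_to_memslot(vcpu->kvm, real_gfn);

		addr = gfn_to_hva_memslot(*slot_hint, real_gfn);
		if (kvm_is_error_hva(addr))
			return -EFAULT;

		ptep_user = (pt_element_t __user *)((void *)addr + offset);

		/* Constant length: copy_from_user() can expand to a fixed-size access. */
		return copy_from_user(ptep, ptep_user, sizeof(*ptep)) ? -EFAULT : 0;
	}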
> Do you think the following case would not differ so much from (1' 2')?
>
>   walk_addr_generic()            --- 1''
>     copy_from_user()             --- 2''

Yes, it should be the same, and is cleaner.

If you do a "make .../foo.i" and look at the code coming out of the
preprocessor, you'll see it expands to:

	if (!__builtin_constant_p(size))
		return copy_user_generic(dst, (__force void *)src, size);
	switch (size) {
	case 1:
		__get_user_asm(*(u8 *)dst, (u8 __user *)src,
			       ret, "b", "b", "=q", 1);
		return ret;
	case 2: ..
	case 4: ..
	case 8: ..
	case 10: ..
	case 16: ..
	}

Ok, it looks like the 32-bit kernel only handles 1/2/4.  Maybe that
was the problem if you ran on 32-bit.

-Andi
On 04/29/2011 07:05 PM, Andi Kleen wrote:
> > Do you think the following case would not differ so much from (1' 2')?
> >
> >   walk_addr_generic()            --- 1''
> >     copy_from_user()             --- 2''
>
> Yes, it should be the same, and is cleaner.
>
> If you do a "make .../foo.i" and look at the code coming out of the
> preprocessor, you'll see it expands to:
>
> 	if (!__builtin_constant_p(size))
> 		return copy_user_generic(dst, (__force void *)src, size);
> 	switch (size) {
> 	case 1:
> 		__get_user_asm(*(u8 *)dst, (u8 __user *)src,
> 			       ret, "b", "b", "=q", 1);
> 		return ret;
> 	case 2: ..
> 	case 4: ..
> 	case 8: ..
> 	case 10: ..
> 	case 16: ..
> 	}
>
> Ok, it looks like the 32-bit kernel only handles 1/2/4.  Maybe that
> was the problem if you ran on 32-bit.

I'm happy with a slower copy_from_user() for that particular case.
>> Ok, it looks like the 32-bit kernel only handles 1/2/4.  Maybe that
>> was the problem if you ran on 32-bit.
>
> I'm happy with a slower copy_from_user() for that particular case.

It wouldn't be hard to fix.

-Andi
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 109939a..614aa3f 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -109,12 +109,37 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
 	return access;
 }
 
+/*
+ * Read the guest PTE referred to by table_gfn and offset and put it into ptep.
+ *
+ * *slot_hint, if not NULL, should point to a memslot which probably includes
+ * the guest PTE.  The actual memslot will be put back into this so that
+ * callers can cache it.
+ */
 static int FNAME(read_guest_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-				 gfn_t table_gfn, int offset, pt_element_t *ptep)
+				 gfn_t table_gfn, int offset, pt_element_t *ptep,
+				 struct kvm_memory_slot **slot_hint)
 {
-	return kvm_read_guest_page_mmu(vcpu, mmu, table_gfn, ptep,
-				       offset, sizeof(*ptep),
-				       PFERR_USER_MASK | PFERR_WRITE_MASK);
+	unsigned long addr;
+	pt_element_t __user *ptep_user;
+	gfn_t real_gfn;
+
+	real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
+				      PFERR_USER_MASK | PFERR_WRITE_MASK);
+	if (real_gfn == UNMAPPED_GVA)
+		return -EFAULT;
+
+	real_gfn = gpa_to_gfn(real_gfn);
+
+	if (!(*slot_hint) || !gfn_in_memslot(*slot_hint, real_gfn))
+		*slot_hint = gfn_to_memslot(vcpu->kvm, real_gfn);
+
+	addr = gfn_to_hva_memslot(*slot_hint, real_gfn);
+	if (kvm_is_error_hva(addr))
+		return -EFAULT;
+
+	ptep_user = (pt_element_t __user *)((void *)addr + offset);
+	return get_user(*ptep, ptep_user);
 }
 
 /*
@@ -130,6 +155,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 	bool eperm, present, rsvd_fault;
 	int offset, write_fault, user_fault, fetch_fault;
+	struct kvm_memory_slot *slot_cache = NULL;
 
 	write_fault = access & PFERR_WRITE_MASK;
 	user_fault = access & PFERR_USER_MASK;
@@ -168,7 +194,8 @@ walk:
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
 
-		if (FNAME(read_guest_pte)(vcpu, mmu, table_gfn, offset, &pte)) {
+		if (FNAME(read_guest_pte)(vcpu, mmu, table_gfn,
+					  offset, &pte, &slot_cache)) {
 			present = false;
 			break;
 		}
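The patch relies on a gfn_in_memslot() helper that is not part of this hunk
(presumably introduced elsewhere in the series). A minimal definition
consistent with how it is used here, assuming the base_gfn/npages fields of
struct kvm_memory_slot, would be something like:

	static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
	{
		return gfn >= slot->base_gfn &&
		       gfn < slot->base_gfn + slot->npages;
	}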