
RFC: shadow page table reclaim

Message ID 200908280431.04960.max@laiers.net (mailing list archive)
State New, archived

Commit Message

Max Laier Aug. 28, 2009, 2:31 a.m. UTC
Hello,

it seems to me that the reclaim mechanism for shadow page table pages is sub-
optimal.  The arch.active_mmu_pages list that is used for reclaiming does not 
move up parent shadow page tables when a child is added, so when we need a new
shadow page we zap the oldest - which may well be a directory-level page
holding a just-added table-level page.
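
For reference, the eviction path I mean boils down to zapping the tail of
that list, roughly like this (simplified sketch of __kvm_mmu_free_some_pages(),
from memory, so take the details with a grain of salt):

/*
 * New shadow pages are added at the head of active_mmu_pages, and
 * reclaim zaps whatever sits at the tail - i.e. the least recently
 * *allocated* page, which may be a directory that just gained a child.
 */
while (vcpu->kvm->arch.n_free_mmu_pages < KVM_REFILL_PAGES) {
        struct kvm_mmu_page *sp;

        sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,
                          struct kvm_mmu_page, link);
        kvm_mmu_zap_page(vcpu->kvm, sp);
}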

Attached is a proof-of-concept diff and two plots before and after.  The plots 
show referenced guest pages over time.  As you can see there is less saw-
toothing in the after plot and also fewer changes overall (because we don't 
zap mappings that are still in use as often).  This is with the shadow page
table limit set to 64 (to amplify the effect) and with vmx/ept.

I realize that the list_move and parent walk are quite expensive and that 
kvm_mmu_alloc_page is only half the story.  It should really be done every 
time a new guest page table is mapped - maybe via rmap_add.  This would 
obviously kill performance, though.

Another idea would be to improve the reclaim logic in a way that it prefers 
"old" PT_PAGE_TABLE_LEVEL over directories.  Though I'm not sure how to code 
that up sensibly, either.
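
Just to illustrate the idea, victim selection along those lines could look
something like the sketch below (purely hypothetical, untested;
pick_reclaim_victim() doesn't exist anywhere):

/*
 * Hypothetical: scan active_mmu_pages from the tail and prefer an old
 * leaf (PT_PAGE_TABLE_LEVEL) shadow page; only fall back to the plain
 * oldest page if no leaf page is left.
 */
static struct kvm_mmu_page *pick_reclaim_victim(struct kvm *kvm)
{
        struct kvm_mmu_page *sp;

        list_for_each_entry_reverse(sp, &kvm->arch.active_mmu_pages, link)
                if (sp->role.level == PT_PAGE_TABLE_LEVEL)
                        return sp;

        /* only directories left - zap the oldest one */
        return container_of(kvm->arch.active_mmu_pages.prev,
                            struct kvm_mmu_page, link);
}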

As I said, this is proof-of-concept and RFC.  So any comments welcome.  For my 
use case the proof-of-concept diff seems to do well enough, though.

Thanks,

Comments

Avi Kivity Aug. 31, 2009, 9:55 a.m. UTC | #1
On 08/28/2009 05:31 AM, Max Laier wrote:
> Hello,
>
> it seems to me that the reclaim mechanism for shadow page table pages is sub-
> optimal.  The arch.active_mmu_pages list that is used for reclaiming does not
> move up parent shadow page tables when a child is added, so when we need a new
> shadow page we zap the oldest - which may well be a directory-level page
> holding a just-added table-level page.
>
> Attached is a proof-of-concept diff and two plots before and after.  The plots
> show referenced guest pages over time.

What do you mean by referenced guest pages?  Total number of populated 
sptes?

> As you can see there is less saw-
> toothing in the after plot and also fewer changes overall (because we don't
> zap mappings that are still in use as often).  This is with the shadow page
> table limit set to 64 (to amplify the effect) and with vmx/ept.
>
> I realize that the list_move and parent walk are quite expensive and that
> kvm_mmu_alloc_page is only half the story.  It should really be done every
> time a new guest page table is mapped - maybe via rmap_add.  This would
> obviously kill performance, though.
>
> Another idea would be to improve the reclaim logic in a way that it prefers
> "old" PT_PAGE_TABLE_LEVEL over directories.  Though I'm not sure how to code
> that up sensibly, either.
>
> As I said, this is proof-of-concept and RFC.  So any comments welcome.  For my
> use case the proof-of-concept diff seems to do well enough, though.
>    

Given that reclaim is fairly rare, we should try to move the cost 
there.  So how about this:

- add an 'accessed' flag to struct kvm_mmu_page
- when reclaiming, try to evict pages that were not recently accessed 
(but don't overscan - if you scan many recently accessed pages, evict 
some of them anyway)
- when scanning, update the accessed flag with the accessed bit of all 
parent_ptes
- when dropping an spte, update the accessed flag of the kvm_mmu_page it 
points to
- when reloading cr3, mark the page as accessed (since it has no 
parent_ptes)

This should introduce some LRU-ness that depends not only on fault 
behaviour but also on long-term guest access behaviour (which is 
important for long-running processes and kernel pages).
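
In (untested) pseudo-code the reclaim side would be something like the sketch
below - the 'accessed' field and the helpers are hypothetical, of course:

/*
 * Hypothetical reclaim scan: prefer shadow pages that were not recently
 * accessed, but limit overscan - after looking at too many recently
 * accessed pages, evict one of them anyway.
 */
static struct kvm_mmu_page *scan_for_victim(struct kvm *kvm)
{
        struct kvm_mmu_page *sp;
        int scanned = 0;

        list_for_each_entry_reverse(sp, &kvm->arch.active_mmu_pages, link) {
                /* fold the accessed bits of all parent_ptes into the flag */
                sp->accessed |= parent_ptes_accessed(sp);       /* hypothetical */
                if (!sp->accessed)
                        return sp;
                sp->accessed = false;                   /* second chance */
                if (++scanned > RECLAIM_SCAN_MAX)       /* don't overscan */
                        break;
        }
        /* everything looked recently used - fall back to the oldest page */
        return container_of(kvm->arch.active_mmu_pages.prev,
                            struct kvm_mmu_page, link);
}

Basically a clock-style second-chance scan over the existing list.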
Max Laier Aug. 31, 2009, 12:09 p.m. UTC | #2
On Monday 31 August 2009 11:55:24 Avi Kivity wrote:
> On 08/28/2009 05:31 AM, Max Laier wrote:
> > Hello,
> >
> > it seems to me that the reclaim mechanism for shadow page table pages is
> > sub-optimal.  The arch.active_mmu_pages list that is used for reclaiming
> > does not move up parent shadow page tables when a child is added, so when
> > we need a new shadow page we zap the oldest - which may well be a
> > directory-level page holding a just-added table-level page.
> >
> > Attached is a proof-of-concept diff and two plots before and after.  The
> > plots show referenced guest pages over time.
>
> What do you mean by referenced guest pages?  Total number of populated
> sptes?

Yes.

> > As you can see there is less saw-
> > toothing in the after plot and also fewer changes overall (because we
> > don't zap mappings that are still in use as often).  This is with the
> > shadow page table limit set to 64 (to amplify the effect) and with vmx/ept.
> >
> > I realize that the list_move and parent walk are quite expensive and that
> > kvm_mmu_alloc_page is only half the story.  It should really be done
> > every time a new guest page table is mapped - maybe via rmap_add.  This
> > would obviously kill performance, though.
> >
> > Another idea would be to improve the reclaim logic in a way that it
> > prefers "old" PT_PAGE_TABLE_LEVEL over directories.  Though I'm not sure
> > how to code that up sensibly, either.
> >
> > As I said, this is proof-of-concept and RFC.  So any comments welcome. 
> > For my use case the proof-of-concept diff seems to do well enough,
> > though.
>
> Given that reclaim is fairly rare, we should try to move the cost
> there.  So how about this:
>
> - add an 'accessed' flag to struct kvm_mmu_page
> - when reclaiming, try to evict pages that were not recently accessed
> (but don't overscan - if you scan many recently accessed pages, evict
> some of them anyway)

- prefer page table level pages over directory level pages in the face of 
overscan.

> - when scanning, update the accessed flag with the accessed bit of all
> parent_ptes

I might be misunderstanding, but I think it should be the other way 'round,
i.e. a page is accessed if any of its children have been accessed.

> - when dropping an spte, update the accessed flag of the kvm_mmu_page it
> points to
> - when reloading cr3, mark the page as accessed (since it has no
> parent_ptes)
>
> This should introduce some LRU-ness that depends not only on fault
> behaviour but also on long-term guest access behaviour (which is
> important for long-running processes and kernel pages).

I'll try to come up with a patch for this later tonight, unless you already
have something in the making.  Thanks.
Avi Kivity Aug. 31, 2009, 12:40 p.m. UTC | #3
On 08/31/2009 03:09 PM, Max Laier wrote:
>
>>> As you can see there is less saw-
>>> toothing in the after plot and also fewer changes overall (because we
>>> don't zap mappings that are still in use as often).  This is with the
>>> shadow page table limit set to 64 (to amplify the effect) and with vmx/ept.
>>>
>>> I realize that the list_move and parent walk are quite expensive and that
>>> kvm_mmu_alloc_page is only half the story.  It should really be done
>>> every time a new guest page table is mapped - maybe via rmap_add.  This
>>> would obviously kill performance, though.
>>>
>>> Another idea would be to improve the reclaim logic in a way that it
>>> prefers "old" PT_PAGE_TABLE_LEVEL over directories.  Though I'm not sure
>>> how to code that up sensibly, either.
>>>
>>> As I said, this is proof-of-concept and RFC.  So any comments welcome.
>>> For my use case the proof-of-concept diff seems to do well enough,
>>> though.
>>>        
>> Given that reclaim is fairly rare, we should try to move the cost
>> there.  So how about this:
>>
>> - add an 'accessed' flag to struct kvm_mmu_page
>> - when reclaiming, try to evict pages that were not recently accessed
>> (but don't overscan - if you scan many recently accessed pages, evict
>> some of them anyway)
>>      
> - prefer page table level pages over directory level pages in the face of
> overscan.
>    

I'm hoping that overscan will only occur when we start to feel memory 
pressure, and that once we do a full scan we'll get accurate recency 
information.

>> - when scanning, update the accessed flag with the accessed bit of all
>> parent_ptes
>>      
> I might be misunderstanding, but I think it should be the other way 'round,
> i.e. a page is accessed if any of its children have been accessed.
>    

They're both true, but looking at the parents is much more efficient.  
Note we need to look at the accessed bit of the parent_ptes, not parent 
kvm_mmu_pages.
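
For the common single-parent case that's just a matter of testing the
hardware accessed bit in the parent spte, e.g. (hypothetical helper; the
multimapped pte_chain case would go through the existing parent walk, and
this obviously only helps when shadow_accessed_mask is non-zero):

/*
 * Hypothetical: a shadow page counts as accessed if the spte mapping it
 * in its parent has the hardware accessed bit set.
 */
static bool parent_pte_accessed(struct kvm_mmu_page *sp)
{
        u64 *spte;

        if (sp->multimapped)    /* handled via the parent walk instead */
                return true;
        spte = sp->parent_pte;
        if (!spte)              /* root page: no parent spte */
                return true;
        return *spte & shadow_accessed_mask;
}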


>> - when dropping an spte, update the accessed flag of the kvm_mmu_page it
>> points to
>> - when reloading cr3, mark the page as accessed (since it has no
>> parent_ptes)
>>
>> This should introduce some LRU-ness that depends not only on fault
>> behaviour but also on long-term guest access behaviour (which is
>> important for long-running processes and kernel pages).
>>      
> I'll try to come up with a patch for this later tonight, unless you already
> have something in the making.  Thanks.
>    

Please do, it's an area that needs attention.

Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95d5329..0a63570 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -190,6 +190,8 @@  struct kvm_unsync_walk {
 };
 
 typedef int (*mmu_parent_walk_fn) (struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+static void mmu_parent_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			    mmu_parent_walk_fn fn);
 
 static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *rmap_desc_cache;
@@ -900,6 +902,12 @@  static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
 }
 
+static int move_up_walk_fn(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	list_move(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+	return 1;
+}
+
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 					       u64 *parent_pte)
 {
@@ -918,6 +926,10 @@  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
 	sp->multimapped = 0;
 	sp->parent_pte = parent_pte;
+#if 1
+	if (parent_pte)
+		mmu_parent_walk(vcpu, sp, move_up_walk_fn);
+#endif
 	--vcpu->kvm->arch.n_free_mmu_pages;
 	return sp;
 }